[jira] [Commented] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.

2014-09-30 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152842#comment-14152842
 ] 

zhihai xu commented on YARN-2566:
-

Picking the directory with the most available space is a good suggestion. I will 
implement it in my new patch.
Thanks.
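
A minimal sketch of that idea (an illustration only, not the patch itself), 
assuming the candidate localDirs are plain local paths:

{code}
import java.io.File;
import java.util.List;

public class LocalDirPicker {
  // Return the candidate dir with the most usable space, or null if the
  // list is empty. Illustrative helper only; the real patch would work
  // against the NM's configured localDirs.
  static File pickDirWithMostSpace(List<File> localDirs) {
    File best = null;
    long bestSpace = -1;
    for (File dir : localDirs) {
      long usable = dir.getUsableSpace(); // bytes usable on this partition
      if (usable > bestSpace) {
        bestSpace = usable;
        best = dir;
      }
    }
    return best;
  }
}
{code}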

 IOException happen in startLocalizer of DefaultContainerExecutor due to not 
 enough disk space for the first localDir.
 -

 Key: YARN-2566
 URL: https://issues.apache.org/jira/browse/YARN-2566
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2566.000.patch, YARN-2566.001.patch


 startLocalizer in DefaultContainerExecutor will only use the first localDir 
 to copy the token file. If the copy fails for the first localDir due to not 
 enough disk space, localization fails even though there is plenty of disk 
 space in the other localDirs. We see the following error for this case:
 {code}
 2014-09-13 23:33:25,171 WARN 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to 
 create app directory 
 /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004
 java.io.IOException: mkdir of 
 /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed
   at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062)
   at 
 org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157)
   at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721)
   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717)
   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
   at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717)
   at 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426)
   at 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522)
   at 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94)
   at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
 2014-09-13 23:33:25,185 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
  Localizer failed
 java.io.FileNotFoundException: File 
 file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 
 does not exist
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
   at 
 org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111)
   at 
 org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76)
   at 
 org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344)
   at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
   at 
 org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577)
   at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677)
   at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673)
   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
   at org.apache.hadoop.fs.FileContext.create(FileContext.java:673)
   at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021)
   at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963)
   at 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102)
   at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
 2014-09-13 23:33:25,186 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
  Container container_1410663092546_0004_01_01 transitioned from 
 LOCALIZING to LOCALIZATION_FAILED
 2014-09-13 23:33:25,187 WARN 
 org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera   
 OPERATION=Container Finished - Failed   TARGET=ContainerImpl
 RESULT=FAILURE  DESCRIPTION=Container failed with state: LOCALIZATION_FAILED  
   APPID=application_1410663092546_0004
 

[jira] [Commented] (YARN-2623) Linux container executor only use the first local directory to copy token file in container-executor.c.

2014-09-30 Thread Remus Rusanu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153023#comment-14153023
 ] 

Remus Rusanu commented on YARN-2623:


Note that the DCE also picks the first local dir, DefaultContainerExecutor.java@99:

{code}
// TODO: Why pick first app dir. The same in LCE why not random?
Path appStorageDir = getFirstApplicationDir(localDirs, user, appId);
{code}

 Linux container executor only use the first local directory to copy token 
 file in container-executor.c.
 ---

 Key: YARN-2623
 URL: https://issues.apache.org/jira/browse/YARN-2623
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
 Environment: Linux container executor only use the first local 
 directory to copy token file in container-executor.c.
Reporter: zhihai xu
Assignee: zhihai xu

 The Linux container executor only uses the first local directory to copy the 
 token file in container-executor.c. If it fails to copy the token file to the 
 first local directory, a localization failure event is raised, even though it 
 could copy the token file to another local directory successfully. The correct 
 behavior would be to try the next local directory if the first one fails.
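
 A rough sketch of that fallback loop (illustrative Java; the actual fix would 
 live in container-executor.c, and the names here are not from any patch):
 {code}
 import java.io.IOException;
 import java.nio.file.Files;
 import java.nio.file.Path;
 import java.nio.file.StandardCopyOption;
 import java.util.List;

 public class TokenCopy {
   // Try each local dir in order and return the destination that worked.
   static Path copyTokenWithFallback(Path tokenFile, List<Path> localDirs)
       throws IOException {
     IOException last = null;
     for (Path dir : localDirs) {
       try {
         Path dest = dir.resolve(tokenFile.getFileName());
         Files.copy(tokenFile, dest, StandardCopyOption.REPLACE_EXISTING);
         return dest; // first dir that accepts the copy wins
       } catch (IOException e) {
         last = e;    // e.g. ENOSPC on this dir; fall through to the next
       }
     }
     throw last != null ? last : new IOException("no local dirs available");
   }
 }
 {code}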



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2545) RMApp should transit to FAILED when AM calls finishApplicationMaster with FAILED

2014-09-30 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153065#comment-14153065
 ] 

Tsuyoshi OZAWA commented on YARN-2545:
--

Thanks for reporting this, [~zhiguohong]. I think we should fix this so the 
app's state is reported correctly. How about checking the final status of the 
app and dispatching RMAppEventType#ATTEMPT_FAILED in 
RMAppAttemptImpl#AMUnregisteredTransition?
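
Very roughly, that would look something like the sketch below (not the actual 
RMAppAttemptImpl code; the event accessor and the surrounding shape are 
assumed for illustration):

{code}
// Inside a transition like RMAppAttemptImpl#AMUnregisteredTransition (sketch):
FinalApplicationStatus finalStatus =
    unregisterEvent.getFinalApplicationStatus();  // assumed accessor
ApplicationId appId = appAttempt.getAppAttemptId().getApplicationId();
if (finalStatus == FinalApplicationStatus.FAILED) {
  // Tell the RMApp the attempt failed so the app ends up FAILED, not FINISHED.
  eventHandler.handle(new RMAppEvent(appId, RMAppEventType.ATTEMPT_FAILED));
} else {
  eventHandler.handle(new RMAppEvent(appId, RMAppEventType.ATTEMPT_UNREGISTERED));
}
{code}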


 RMApp should transit to FAILED when AM calls finishApplicationMaster with 
 FAILED
 

 Key: YARN-2545
 URL: https://issues.apache.org/jira/browse/YARN-2545
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Hong Zhiguo
Assignee: Hong Zhiguo
Priority: Minor

 If the AM calls finishApplicationMaster with getFinalApplicationStatus()==FAILED 
 and then exits, the corresponding RMApp and RMAppAttempt transition to the 
 FINISHED state.
 I think this is wrong and confusing. On RM WebUI, this application is 
 displayed as State=FINISHED, FinalStatus=FAILED, and is counted as Apps 
 Completed, not as Apps Failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations

2014-09-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153091#comment-14153091
 ] 

Hudson commented on YARN-1769:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #696 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/696/])
YARN-1769. CapacityScheduler: Improve reservations. Contributed by Thomas 
Graves (jlowe: rev 9c22065109a77681bc2534063eabe8692fbcb3cd)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestParentQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestApplicationLimits.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerContext.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestReservations.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.java
* hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestChildQueueOrder.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java
* hadoop-yarn-project/CHANGES.txt


 CapacityScheduler:  Improve reservations
 

 Key: YARN-1769
 URL: https://issues.apache.org/jira/browse/YARN-1769
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Fix For: 2.6.0

 Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch


 Currently the CapacityScheduler uses reservations in order to handle requests 
 for large containers when there might not currently be enough space available 
 on a single host.
 The current algorithm for reservations is to reserve as many containers as 
 currently required, and then it will start to reserve more above that after a 
 certain number of re-reservations (currently biased against larger 
 containers). Any time it hits the limit on the number reserved, it stops 
 looking at any other nodes. This can result in missing nodes that have enough 
 space to fulfill the request.
 The other place for improvement is that currently reservations count against 
 your queue capacity. If you have reservations you could hit the various 
 limits, which would then stop you from looking further at that node.
 The above 2 cases can cause an application requesting a larger container to 
 take a long time to get its resources.
 We could improve upon both of these by simply continuing to look at incoming 
 nodes to see if we could potentially swap out a reservation for an actual 
 allocation.
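
 Conceptually, that swap would look something like this sketch (the helper 
 names are hypothetical, not taken from the actual patch):
 {code}
 // Sketch: on a node update, before skipping a request because a reservation
 // is already held elsewhere, check whether this node can satisfy it now.
 if (Resources.fitsIn(reservedContainer.getReservedResource(),
                      node.getAvailableResource())) {
   application.unreserve(priority, otherNode);   // hypothetical: drop the old
   assignContainer(node, application,            // reservation and allocate
       reservedContainer.getReservedResource()); // on this node instead
 }
 {code}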



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login

2014-09-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153086#comment-14153086
 ] 

Hudson commented on YARN-2606:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #696 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/696/])
YARN-2606. Application History Server tries to access hdfs before doing secure 
login (Mit Desai via jeagles) (jeagles: rev 
e10eeaabce2a21840cfd5899493c9d2d4fe2e322)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestFileSystemApplicationHistoryStore.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/FileSystemApplicationHistoryStore.java
* hadoop-yarn-project/CHANGES.txt


 Application History Server tries to access hdfs before doing secure login
 -

 Key: YARN-2606
 URL: https://issues.apache.org/jira/browse/YARN-2606
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Mit Desai
Assignee: Mit Desai
 Fix For: 2.6.0

 Attachments: YARN-2606.patch, YARN-2606.patch, YARN-2606.patch, 
 YARN-2606.patch


 While testing the Application Timeline Server, the server would not come up 
 in a secure cluster, as it would keep trying to access HDFS without having 
 done the secure login. It would repeatedly try authenticating and finally hit 
 a stack overflow.
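
 The general shape of the fix is to perform the Kerberos login before any HDFS 
 access, roughly as below (a sketch; the config keys are assumed for 
 illustration and this is not the actual patch):
 {code}
 import java.io.IOException;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.security.SecurityUtil;
 import org.apache.hadoop.yarn.conf.YarnConfiguration;

 public class SecureStoreInit {
   static FileSystem initStore() throws IOException {
     Configuration conf = new YarnConfiguration();
     // Log in from the keytab *before* touching the FileSystem; otherwise
     // every HDFS call fails authentication and gets retried.
     SecurityUtil.login(conf,
         "yarn.timeline-service.keytab",      // assumed keytab config key
         "yarn.timeline-service.principal");  // assumed principal config key
     return FileSystem.get(conf);             // safe only after the login
   }
 }
 {code}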



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2625) Problems with CLASSPATH in Job Submission REST API

2014-09-30 Thread Doug Haigh (JIRA)
Doug Haigh created YARN-2625:


 Summary: Problems with CLASSPATH in Job Submission REST API
 Key: YARN-2625
 URL: https://issues.apache.org/jira/browse/YARN-2625
 Project: Hadoop YARN
  Issue Type: Bug
  Components: api
Affects Versions: 2.5.1
Reporter: Doug Haigh


There are a couple of issues I have found when specifying the CLASSPATH 
environment variable using the REST API.

1) In the Java client, the CLASSPATH environment variable is usually made up of 
either the value of yarn.application.classpath in yarn-site.xml or the default 
YARN classpath as defined by 
YarnConfiguration.DEFAULT_YARN_CROSS_PLATFORM_APPLICATION_CLASSPATH. REST API 
consumers have no way of telling the resource manager to use the default 
unless they hardcode the default value themselves. If the default ever changes, 
their code would need to change. 

2) If any environment variables are used in the CLASSPATH environment 'value' 
field, they are evaluated while their values are NULL, resulting in bad values 
in the CLASSPATH. For example, if I had hardcoded the CLASSPATH value to the 
default of $HADOOP_CONF_DIR, $HADOOP_COMMON_HOME/share/hadoop/common/*, 
default of $HADOOP_CONF_DIR, $HADOOP_COMMON_HOME/share/hadoop/common/*, 
$HADOOP_COMMON_HOME/share/hadoop/common/lib/*, 
$HADOOP_HDFS_HOME/share/hadoop/hdfs/*, 
$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*, 
$HADOOP_YARN_HOME/share/hadoop/yarn/*, 
$HADOOP_YARN_HOME/share/hadoop/yarn/lib/* the classpath passed to the 
application master is 
:/share/hadoop/common/*:/share/hadoop/common/lib/*:/share/hadoop/hdfs/*:/share/hadoop/hdfs/lib/*:/share/hadoop/yarn/*:/share/hadoop/yarn/lib/*

These two problems require REST API consumers to always have the fully resolved 
path defined in the yarn.application.classpath value. If the property is 
missing or contains environment variables, the application created by the REST 
API will fail due to the CLASSPATH being incorrect.
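
For reference, a sketch of how a Java client resolves the classpath with the 
default fallback (the behavior REST consumers currently have to replicate by 
hand):

{code}
import java.io.File;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClasspathResolver {
  static String resolveYarnClasspath() {
    YarnConfiguration conf = new YarnConfiguration();
    // Falls back to the built-in default when yarn.application.classpath
    // is not set in yarn-site.xml.
    String[] entries = conf.getStrings(
        YarnConfiguration.YARN_APPLICATION_CLASSPATH,
        YarnConfiguration.DEFAULT_YARN_CROSS_PLATFORM_APPLICATION_CLASSPATH);
    return String.join(File.pathSeparator, entries);
  }
}
{code}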



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2617) NM does not need to send finished container whose APP is not running to RM

2014-09-30 Thread Jun Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Gong updated YARN-2617:
---
Attachment: YARN-2617.3.patch

Updated the patch; deleted an unrelated line.

 NM does not need to send finished container whose APP is not running to RM
 --

 Key: YARN-2617
 URL: https://issues.apache.org/jira/browse/YARN-2617
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Jun Gong
Assignee: Jun Gong
 Fix For: 2.6.0

 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.patch


 We ([~chenchun]) were testing RM work-preserving restart and found the 
 following logs when we ran a simple MapReduce PI job. The NM continuously 
 reported completed containers whose application had already finished, even 
 after the AM had finished. 
 {code}
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 {code}
 In the patch for YARN-1372, ApplicationImpl on the NM should guarantee to 
 clean up already-completed applications. But it only removes the appId from 
 'app.context.getApplications()' when ApplicationImpl receives the event 
 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED'; however, the NM 
 might not receive this event for a long time, or might never receive it. 
 * For the NonAggregatingLogHandler, it waits for 
 YarnConfiguration.NM_LOG_RETAIN_SECONDS, which is 3 * 60 * 60 sec by default, 
 before it is scheduled to delete the application logs and send the event.
 * For the LogAggregationService, it might fail (e.g. if the user does not 
 have HDFS write permission), in which case it will not send the event.
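
 A rough sketch of the idea behind the patch, as described above (the locals 
 here are illustrative, not the actual code):
 {code}
 // When building the heartbeat, only include finished containers whose
 // application the NM still tracks as running.
 List<ContainerStatus> statusToSend = new ArrayList<ContainerStatus>();
 for (ContainerStatus status : finishedContainers) {
   ApplicationId appId =
       status.getContainerId().getApplicationAttemptId().getApplicationId();
   if (context.getApplications().containsKey(appId)) {
     statusToSend.add(status);  // app still running: safe to report to the RM
   }
   // Otherwise the app is done; reporting the container again just produces
   // the "Null container completed" noise seen in the RM log above.
 }
 {code}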



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login

2014-09-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153200#comment-14153200
 ] 

Hudson commented on YARN-2606:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1887 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1887/])
YARN-2606. Application History Server tries to access hdfs before doing secure 
login (Mit Desai via jeagles) (jeagles: rev 
e10eeaabce2a21840cfd5899493c9d2d4fe2e322)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestFileSystemApplicationHistoryStore.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/FileSystemApplicationHistoryStore.java


 Application History Server tries to access hdfs before doing secure login
 -

 Key: YARN-2606
 URL: https://issues.apache.org/jira/browse/YARN-2606
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Mit Desai
Assignee: Mit Desai
 Fix For: 2.6.0

 Attachments: YARN-2606.patch, YARN-2606.patch, YARN-2606.patch, 
 YARN-2606.patch


 While testing the Application Timeline Server, the server would not come up 
 in a secure cluster, as it would keep trying to access HDFS without having 
 done the secure login. It would repeatedly try authenticating and finally hit 
 a stack overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations

2014-09-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153205#comment-14153205
 ] 

Hudson commented on YARN-1769:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1887 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1887/])
YARN-1769. CapacityScheduler: Improve reservations. Contributed by Thomas 
Graves (jlowe: rev 9c22065109a77681bc2534063eabe8692fbcb3cd)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestParentQueue.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestApplicationLimits.java
* hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerContext.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestReservations.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestChildQueueOrder.java


 CapacityScheduler:  Improve reservations
 

 Key: YARN-1769
 URL: https://issues.apache.org/jira/browse/YARN-1769
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Fix For: 2.6.0

 Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch


 Currently the CapacityScheduler uses reservations in order to handle requests 
 for large containers when there might not currently be enough space available 
 on a single host.
 The current algorithm for reservations is to reserve as many containers as 
 currently required, and then it will start to reserve more above that after a 
 certain number of re-reservations (currently biased against larger 
 containers). Any time it hits the limit on the number reserved, it stops 
 looking at any other nodes. This can result in missing nodes that have enough 
 space to fulfill the request.
 The other place for improvement is that currently reservations count against 
 your queue capacity. If you have reservations you could hit the various 
 limits, which would then stop you from looking further at that node.
 The above 2 cases can cause an application requesting a larger container to 
 take a long time to get its resources.
 We could improve upon both of these by simply continuing to look at incoming 
 nodes to see if we could potentially swap out a reservation for an actual 
 allocation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations

2014-09-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153262#comment-14153262
 ] 

Hudson commented on YARN-1769:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1912 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1912/])
YARN-1769. CapacityScheduler: Improve reservations. Contributed by Thomas 
Graves (jlowe: rev 9c22065109a77681bc2534063eabe8692fbcb3cd)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestChildQueueOrder.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerContext.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestParentQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.java
* hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestApplicationLimits.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestReservations.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp.java
* hadoop-yarn-project/CHANGES.txt


 CapacityScheduler:  Improve reservations
 

 Key: YARN-1769
 URL: https://issues.apache.org/jira/browse/YARN-1769
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Fix For: 2.6.0

 Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch


 Currently the CapacityScheduler uses reservations in order to handle requests 
 for large containers when there might not currently be enough space available 
 on a single host.
 The current algorithm for reservations is to reserve as many containers as 
 currently required, and then it will start to reserve more above that after a 
 certain number of re-reservations (currently biased against larger 
 containers). Any time it hits the limit on the number reserved, it stops 
 looking at any other nodes. This can result in missing nodes that have enough 
 space to fulfill the request.
 The other place for improvement is that currently reservations count against 
 your queue capacity. If you have reservations you could hit the various 
 limits, which would then stop you from looking further at that node.
 The above 2 cases can cause an application requesting a larger container to 
 take a long time to get its resources.
 We could improve upon both of these by simply continuing to look at incoming 
 nodes to see if we could potentially swap out a reservation for an actual 
 allocation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login

2014-09-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153257#comment-14153257
 ] 

Hudson commented on YARN-2606:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1912 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1912/])
YARN-2606. Application History Server tries to access hdfs before doing secure 
login (Mit Desai via jeagles) (jeagles: rev 
e10eeaabce2a21840cfd5899493c9d2d4fe2e322)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestFileSystemApplicationHistoryStore.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/FileSystemApplicationHistoryStore.java
* hadoop-yarn-project/CHANGES.txt


 Application History Server tries to access hdfs before doing secure login
 -

 Key: YARN-2606
 URL: https://issues.apache.org/jira/browse/YARN-2606
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Mit Desai
Assignee: Mit Desai
 Fix For: 2.6.0

 Attachments: YARN-2606.patch, YARN-2606.patch, YARN-2606.patch, 
 YARN-2606.patch


 While testing the Application Timeline Server, the server would not come up 
 in a secure cluster, as it would keep trying to access HDFS without having 
 done the secure login. It would repeatedly try authenticating and finally hit 
 a stack overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2320) Removing old application history store after we store the history data to timeline store

2014-09-30 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153406#comment-14153406
 ] 

Zhijie Shen commented on YARN-2320:
---

[~mayank_bansal], thanks for the review.

bq. shouldn't we use N/A in convertToApplicationAttemptReport instead of null ?

generateApplicationReport is fixed in YARN-2598.

Attempt and container reports are different: while the app report hides the 
details if the user doesn't have access, the attempt and container reports 
will not be shown at all.

 Removing old application history store after we store the history data to 
 timeline store
 

 Key: YARN-2320
 URL: https://issues.apache.org/jira/browse/YARN-2320
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: YARN-2320.1.patch, YARN-2320.2.patch


 After YARN-2033, we should deprecate the application history store set. There's 
 no need to maintain two sets of store interfaces. In addition, we should 
 close out the outstanding JIRAs under YARN-321 about the application history 
 store.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2626) Document of timeline server needs to be updated

2014-09-30 Thread Zhijie Shen (JIRA)
Zhijie Shen created YARN-2626:
-

 Summary: Document of timeline server needs to be updated
 Key: YARN-2626
 URL: https://issues.apache.org/jira/browse/YARN-2626
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen


After YARN-2033, the document is no longer accurate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2626) Document of timeline server needs to be updated

2014-09-30 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2626:
--
Target Version/s: 2.6.0

 Document of timeline server needs to be updated
 ---

 Key: YARN-2626
 URL: https://issues.apache.org/jira/browse/YARN-2626
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Zhijie Shen

 After YARN-2033, the document is no longer accurate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2626) Document of timeline server needs to be updated

2014-09-30 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2626:
--
  Component/s: timelineserver
Affects Version/s: 2.6.0

 Document of timeline server needs to be updated
 ---

 Key: YARN-2626
 URL: https://issues.apache.org/jira/browse/YARN-2626
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Zhijie Shen

 After YARN-2033, the document is no longer accurate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport

2014-09-30 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-2594:
-
Attachment: YARN-2594.patch

Attached an updated patch that removes the read lock from several methods in 
{{RMAppImpl}} that use {{currentAttempt}} only.
[~kasha], [~jianhe], would you please take a look?

Thanks,
Wangda
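
The shape of the change, roughly (a sketch assuming {{currentAttempt}} is a 
volatile field; not the actual patch):

{code}
// Inside RMAppImpl (sketch): a getter that only reads currentAttempt can
// skip the read lock when the field is volatile, removing one of the
// lock-ordering paths that can deadlock.
private volatile RMAppAttempt currentAttempt;

public RMAppAttempt getCurrentAppAttempt() {
  return currentAttempt;  // volatile read; no readLock().lock() needed
}
{code}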

 Potential deadlock in RM when querying ApplicationResourceUsageReport
 -

 Key: YARN-2594
 URL: https://issues.apache.org/jira/browse/YARN-2594
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Karam Singh
Assignee: Wangda Tan
Priority: Blocker
 Attachments: YARN-2594.patch, YARN-2594.patch


 ResourceManager sometimes becomes unresponsive:
 There was no exception in the ResourceManager log; it contains only the 
 following type of messages:
 {code}
 2014-09-19 19:13:45,241 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000
 2014-09-19 19:30:26,312 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000
 2014-09-19 19:47:07,351 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000
 2014-09-19 20:03:48,460 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000
 2014-09-19 20:20:29,542 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000
 2014-09-19 20:37:10,635 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000
 2014-09-19 20:53:51,722 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2627) Add logs when attemptFailuresValidityInterval is enabled

2014-09-30 Thread Xuan Gong (JIRA)
Xuan Gong created YARN-2627:
---

 Summary: Add logs when attemptFailuresValidityInterval is enabled
 Key: YARN-2627
 URL: https://issues.apache.org/jira/browse/YARN-2627
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Xuan Gong
Assignee: Xuan Gong


After YARN-611, users can specify attemptFailuresValidityInterval for their 
applications. This is for testing/debugging purposes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2627) Add logs when attemptFailuresValidityInterval is enabled

2014-09-30 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-2627:

Attachment: YARN-2627.1.patch

 Add logs when attemptFailuresValidityInterval is enabled
 

 Key: YARN-2627
 URL: https://issues.apache.org/jira/browse/YARN-2627
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Xuan Gong
Assignee: Xuan Gong
 Attachments: YARN-2627.1.patch


 After YARN-611, users can specify attemptFailuresValidityInterval for their 
 applications. This is for testing/debugging purposes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected

2014-09-30 Thread Jian Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153485#comment-14153485
 ] 

Jian Fang commented on YARN-1198:
-

I tried to merge in YARN-1857.3.patch and then merge in YARN-1198.7.patch, since 
people favor this patch over the .8 patch. It seems the change in the following 
method cancels the update in YARN-1857:

{code}
  private Resource getHeadroom(User user, Resource queueMaxCap,
      Resource clusterResource, Resource userLimit) {
    Resource headroom =
        Resources.subtract(
            Resources.min(resourceCalculator, clusterResource,
                userLimit, queueMaxCap),
            user.getConsumedResources());
    return headroom;
  }
{code}

Shouldn't it be the following one if I merge both YARN-1857 and YARN-1198?

{code}
  private Resource getHeadroom(User user, Resource queueMaxCap,
      Resource clusterResource, Resource userLimit) {
    Resource headroom =
        Resources.min(resourceCalculator, clusterResource,
            Resources.subtract(
                Resources.min(resourceCalculator, clusterResource,
                    userLimit, queueMaxCap),
                user.getConsumedResources()),
            Resources.subtract(queueMaxCap, usedResources));
    return headroom;
  }
{code}



 Capacity Scheduler headroom calculation does not work as expected
 -

 Key: YARN-1198
 URL: https://issues.apache.org/jira/browse/YARN-1198
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Craig Welch
 Attachments: YARN-1198.1.patch, YARN-1198.2.patch, YARN-1198.3.patch, 
 YARN-1198.4.patch, YARN-1198.5.patch, YARN-1198.6.patch, YARN-1198.7.patch, 
 YARN-1198.8.patch


 Today headroom calculation (for the app) takes place only when
 * A new node is added to/removed from the cluster
 * A new container is assigned to the application.
 However there are potentially a lot of situations which are not considered in 
 this calculation:
 * If a container finishes, then the headroom for that application will change 
 and should be communicated to the AM accordingly.
 * If a single user has submitted multiple applications (app1 and app2) to the 
 same queue, then
 ** If app1's container finishes, then not only app1's but also app2's AM 
 should be notified about the change in headroom.
 ** Similarly, if a container is assigned to either application app1/app2, then 
 both AMs should be notified about their headroom.
 ** To simplify the whole communication process it is ideal to keep headroom 
 per user per LeafQueue so that everyone gets the same picture (apps belonging 
 to the same user and submitted to the same queue).
 * If a new user submits an application to the queue, then all applications 
 submitted by all users in that queue should be notified of the headroom 
 change.
 * Also, today headroom is an absolute number (I think it should be normalized, 
 but that would not be backward compatible...).
 * Also, when the admin refreshes queues, the headroom has to be updated.
 These are all potential bugs in the headroom calculation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1972) Implement secure Windows Container Executor

2014-09-30 Thread Craig Welch (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153491#comment-14153491
 ] 

Craig Welch commented on YARN-1972:
---

[~rusanu] [~vinodkv], as on [YARN-1063], we can go ahead and address these 
comments as part of the [YARN-2198] effort; it's not necessary to resolve them 
before these patches are committed.

 Implement secure Windows Container Executor
 ---

 Key: YARN-1972
 URL: https://issues.apache.org/jira/browse/YARN-1972
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: Remus Rusanu
Assignee: Remus Rusanu
  Labels: security, windows
 Attachments: YARN-1972.1.patch, YARN-1972.2.patch, YARN-1972.3.patch, 
 YARN-1972.delta.4.patch, YARN-1972.delta.5.patch, YARN-1972.trunk.4.patch, 
 YARN-1972.trunk.5.patch


 h1. Windows Secure Container Executor (WCE)
 YARN-1063 adds the necessary infrastructure to launch a process as a domain 
 user, as a solution for the problem of having a security boundary between 
 processes executed in YARN containers and the Hadoop services. The WCE is a 
 container executor that leverages the winutils capabilities introduced in 
 YARN-1063 and launches containers as an OS process running as the job 
 submitter user. A description of the S4U infrastructure used by YARN-1063 and 
 of the alternatives considered can be read on that JIRA.
 The WCE is based on the DefaultContainerExecutor. It relies on the DCE to 
 drive the flow of execution, but it overrides some methods to the effect of:
 * changing the DCE-created user cache directories to be owned by the job user 
 and by the nodemanager group.
 * changing the actual container run command to use the 'createAsUser' command 
 of winutils task instead of 'create'.
 * running the localization as a standalone process instead of an in-process 
 Java method call. This in turn relies on the winutils createAsUser feature to 
 run the localization as the job user.
  
 When compared to the LinuxContainerExecutor (LCE), the WCE has some minor 
 differences:
 * it does not delegate the creation of the user cache directories to the 
 native implementation.
 * it does not require special handling to be able to delete user files.
 The approach on the WCE came from practical trial and error. I had to iron 
 out some issues around the Windows script shell limitations (command line 
 length) to get it to work, the biggest issue being the huge CLASSPATH that is 
 commonplace in Hadoop container executions. The job container itself already 
 deals with this via a so-called 'classpath jar', see HADOOP-8899 and YARN-316 
 for details. For the WCE localizer, launched as a separate process, the same 
 issue had to be resolved, and I used the same 'classpath jar' approach.
 h2. Deployment Requirements
 To use the WCE one needs to set 
 `yarn.nodemanager.container-executor.class` to 
 `org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor` 
 and set `yarn.nodemanager.windows-secure-container-executor.group` to a 
 Windows security group name that the nodemanager service principal is a 
 member of (the equivalent of the LCE 
 `yarn.nodemanager.linux-container-executor.group`). Unlike the LCE, the WCE 
 does not require any configuration outside of Hadoop's own yarn-site.xml.
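
 As a sketch, the same two settings set programmatically (equivalently, place 
 them in yarn-site.xml; the group name below is only an example):
 {code}
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.yarn.conf.YarnConfiguration;

 public class WceConfig {
   // Illustration only: in practice these two properties go in yarn-site.xml.
   static Configuration wceConf() {
     Configuration conf = new YarnConfiguration();
     conf.set("yarn.nodemanager.container-executor.class",
         "org.apache.hadoop.yarn.server.nodemanager."
             + "WindowsSecureContainerExecutor");
     conf.set("yarn.nodemanager.windows-secure-container-executor.group",
         "HadoopNMGroup");  // example group the NM principal is a member of
     return conf;
   }
 }
 {code}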
 For the WCE to work, the nodemanager must run as a service principal that is 
 a member of the local Administrators group, or as LocalSystem. This is 
 derived from the need to invoke the LoadUserProfile API, which mentions these 
 requirements in its specification. This is in addition to the SE_TCB 
 privilege mentioned in YARN-1063, but this requirement automatically implies 
 that the SE_TCB privilege is held by the nodemanager. For the Linux speakers 
 in the audience, the requirement is basically to run the NM as root.
 h2. Dedicated high privilege Service
 Due to the high privilege required by the WCE, we had discussed the need to 
 isolate the high privilege operations into a separate process, an 'executor' 
 service that is solely responsible for starting the containers (including the 
 localizer). The NM would have to authenticate, authorize and communicate with 
 this service via an IPC mechanism, and use this service to launch the 
 containers. I still believe we'll end up deploying such a service, but the 
 effort to onboard such a new platform-specific service onto the project is 
 not trivial.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1063) Winutils needs ability to create task as domain user

2014-09-30 Thread Craig Welch (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153490#comment-14153490
 ] 

Craig Welch commented on YARN-1063:
---

[~rusanu] [~vinodkv], we can go ahead and address these comments as part of the 
[YARN-2198] effort; it's not necessary to resolve them before these patches 
are committed.

 Winutils needs ability to create task as domain user
 

 Key: YARN-1063
 URL: https://issues.apache.org/jira/browse/YARN-1063
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
 Environment: Windows
Reporter: Kyle Leckie
Assignee: Remus Rusanu
  Labels: security, windows
 Attachments: YARN-1063.2.patch, YARN-1063.3.patch, YARN-1063.4.patch, 
 YARN-1063.5.patch, YARN-1063.6.patch, YARN-1063.patch


 h1. Summary:
 Securing a Hadoop cluster requires constructing some form of security 
 boundary around the processes executed in YARN containers. Isolation based on 
 Windows user isolation seems most feasible. This approach is similar to the 
 approach taken by the existing LinuxContainerExecutor. The current patch to 
 winutils.exe adds the ability to create a process as a domain user. 
 h1. Alternative Methods considered:
 h2. Process rights limited by security token restriction:
 On Windows access decisions are made by examining the security token of a 
 process. It is possible to spawn a process with a restricted security token. 
 Any of the rights granted by SIDs of the default token may be restricted. It 
 is possible to see this in action by examining the security tone of a 
 sandboxed process launch be a web browser. Typically the launched process 
 will have a fully restricted token and need to access machine resources 
 through a dedicated broker process that enforces a custom security policy. 
 This broker process mechanism would break compatibility with the typical 
 Hadoop container process. The Container process must be able to utilize 
 standard function calls for disk and network IO. I performed some work 
 looking at ways to ACL the local files to the specific launched without 
 granting rights to other processes launched on the same machine but found 
 this to be an overly complex solution. 
 h2. Relying on APP containers:
 Recent versions of Windows have the ability to launch processes within an 
 isolated container. Application containers are supported for execution of 
 WinRT-based executables. This method was ruled out due to the lack of 
 official support for standard Windows APIs. At some point in the future 
 Windows may support functionality similar to BSD jails or Linux containers; 
 at that point support for containers should be added.
 h1. Create As User Feature Description:
 h2. Usage:
 A new sub command was added to the set of task commands. Here is the syntax:
 winutils task createAsUser [TASKNAME] [USERNAME] [COMMAND_LINE]
 Some notes:
 * The username specified is in the format of user@domain
 * The machine executing this command must be joined to the domain of the user 
 specified
 * The domain controller must allow the account executing the command access 
 to the user information. For this, join the account to the predefined group 
 labeled Pre-Windows 2000 Compatible Access
 * The account running the command must have several rights on the local 
 machine. These can be managed manually using secpol.msc: 
 ** Act as part of the operating system - SE_TCB_NAME
 ** Replace a process-level token - SE_ASSIGNPRIMARYTOKEN_NAME
 ** Adjust memory quotas for a process - SE_INCREASE_QUOTA_NAME
 * The launched process will not have rights to the desktop so will not be 
 able to display any information or create UI.
 * The launched process will have no network credentials. Any access of 
 network resources that requires domain authentication will fail.
 h2. Implementation:
 Winutils performs the following steps:
 # Enable the required privileges for the current process.
 # Register as a trusted process with the Local Security Authority (LSA).
 # Create a new logon for the user passed on the command line.
 # Load/Create a profile on the local machine for the new logon.
 # Create a new environment for the new logon.
 # Launch the new process in a job with the task name specified and using the 
 created logon.
 # Wait for the JOB to exit.
 h2. Future work:
 The following work was scoped out of this check in:
 * Support for non-domain users or machines that are not domain-joined.
 * Support for privilege isolation by running the task launcher in a high 
 privilege service with access over an ACLed named pipe.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-09-30 Thread Craig Welch (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153494#comment-14153494
 ] 

Craig Welch commented on YARN-2198:
---

Bringing over some comments from [YARN-1063]

When looking this over to pick up context for 2198, I noticed a couple of things:

libwinutils.c CreateLogonForUser - confusing name, makes me think a new
account is being created - CreateLogonTokenForUser? LogonUser?

TestWinUtils - can we add testing specific to security?

and from [YARN-1972]

ContainerLaunch
launchContainer - nit, why userName here, it's user everywhere else
getLocalWrapperScriptBuilder - why not an override instead of conditional (see 
below wrt WindowsContainerExecutor)

WindowsSecureContainerExecutor - I really think there should be a 
WindowsContainerExecutor, and that we should move differences generally to 
inheritance rather than conditionals (as far as is reasonable/related to the 
change, and incrementally as we go forward; no need to boil the ocean, but it 
would be good to set a good foundation here). Windows-specific logic, secure 
or not, should be based in this class. If the differences required for 
security-specific logic are significant enough, by all means also have a 
WindowsSecureContainerExecutor which inherits from WindowsContainerExecutor. 
I think, as much as possible, the logic should be the same for both, with only 
the security-specific functionality as a delta (right now, it looks like 
non-secure Windows uses the default implementation, and may differ more from 
secure Windows than it should).


 Remove the need to run NodeManager as privileged account for Windows Secure 
 Container Executor
 --

 Key: YARN-2198
 URL: https://issues.apache.org/jira/browse/YARN-2198
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Remus Rusanu
Assignee: Remus Rusanu
  Labels: security, windows
 Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, 
 YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, 
 YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, 
 YARN-2198.separation.patch, YARN-2198.trunk.10.patch, 
 YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, 
 YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch


 YARN-1972 introduces a Secure Windows Container Executor. However, this 
 executor requires the process launching the container to be LocalSystem or a 
 member of the local Administrators group. Since the process in question is 
 the NodeManager, the requirement translates to the entire NM running as a 
 privileged account, a very large surface area to review and protect.
 This proposal is to move the privileged operations into a dedicated NT 
 service. The NM can run as a low privilege account and communicate with the 
 privileged NT service when it needs to launch a container. This would reduce 
 the surface exposed to the high privileges. 
 There has to exist a secure, authenticated and authorized channel of 
 communication between the NM and the privileged NT service. Possible 
 alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
 be to use Windows LPC (Local Procedure Calls), which is a Windows platform 
 specific inter-process communication channel that satisfies all requirements 
 and is easy to deploy. The privileged NT service would register and listen on 
 an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop 
 with libwinutils which would host the LPC client code. The client would 
 connect to the LPC port (NtConnectPort) and send a message requesting a 
 container launch (NtRequestWaitReplyPort). LPC provides authentication and 
 the privileged NT service can use authorization API (AuthZ) to validate the 
 caller.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected

2014-09-30 Thread Craig Welch (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153499#comment-14153499
 ] 

Craig Welch commented on YARN-1198:
---

That's not intentional - I think it's just a side effect of where the changes 
are taking place, and it will require some manual fixup to keep both changes 
together.  I expected that [YARN-1857] would be committed first, and then I 
would fix up this patch to reflect the change.

 Capacity Scheduler headroom calculation does not work as expected
 -

 Key: YARN-1198
 URL: https://issues.apache.org/jira/browse/YARN-1198
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Craig Welch
 Attachments: YARN-1198.1.patch, YARN-1198.2.patch, YARN-1198.3.patch, 
 YARN-1198.4.patch, YARN-1198.5.patch, YARN-1198.6.patch, YARN-1198.7.patch, 
 YARN-1198.8.patch


 Today headroom calculation (for the app) takes place only when
 * A new node is added to/removed from the cluster
 * A new container is assigned to the application.
 However there are potentially a lot of situations which are not considered in 
 this calculation:
 * If a container finishes, then the headroom for that application will change 
 and should be communicated to the AM accordingly.
 * If a single user has submitted multiple applications (app1 and app2) to the 
 same queue, then
 ** If app1's container finishes, then not only app1's but also app2's AM 
 should be notified about the change in headroom.
 ** Similarly, if a container is assigned to either application app1/app2, then 
 both AMs should be notified about their headroom.
 ** To simplify the whole communication process it is ideal to keep headroom 
 per user per LeafQueue so that everyone gets the same picture (apps belonging 
 to the same user and submitted to the same queue).
 * If a new user submits an application to the queue, then all applications 
 submitted by all users in that queue should be notified of the headroom 
 change.
 * Also, today headroom is an absolute number (I think it should be normalized, 
 but that would not be backward compatible...).
 * Also, when the admin refreshes queues, the headroom has to be updated.
 These are all potential bugs in the headroom calculation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2627) Add logs when attemptFailuresValidityInterval is enabled

2014-09-30 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-2627:

Attachment: YARN-2627.2.patch

 Add logs when attemptFailuresValidityInterval is enabled
 

 Key: YARN-2627
 URL: https://issues.apache.org/jira/browse/YARN-2627
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Xuan Gong
Assignee: Xuan Gong
 Attachments: YARN-2627.1.patch, YARN-2627.2.patch


 After YARN-611, users can specify attemptFailuresValidityInterval for their 
 applications. This is for testing/debugging purposes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected

2014-09-30 Thread Craig Welch (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153512#comment-14153512
 ] 

Craig Welch commented on YARN-1198:
---

[~leftnoteasy] [~john.jian.fang] it sounds like the .7 approach is the way to 
go.  Jian had a tweak to this approach which he suggested here: 
[https://issues.apache.org/jira/browse/YARN-1198?focusedCommentId=14122078page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14122078]
 - on the whole the same thing happens, but it might be a cleaner way to do it. 
 I was hoping to give it a go so that we could compare with .7 before 
closing this up.  Thoughts?

 Capacity Scheduler headroom calculation does not work as expected
 -

 Key: YARN-1198
 URL: https://issues.apache.org/jira/browse/YARN-1198
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Craig Welch
 Attachments: YARN-1198.1.patch, YARN-1198.2.patch, YARN-1198.3.patch, 
 YARN-1198.4.patch, YARN-1198.5.patch, YARN-1198.6.patch, YARN-1198.7.patch, 
 YARN-1198.8.patch


 Today headroom calculation (for the app) takes place only when
 * New node is added/removed from the cluster
 * New container is getting assigned to the application.
 However, there are potentially a lot of situations which are not considered in 
 this calculation:
 * If a container finishes, then the headroom for that application will change 
 and should be communicated to the AM accordingly.
 * If a single user has submitted multiple applications (app1 and app2) to the 
 same queue, then:
 ** If app1's container finishes, then not only app1's but also app2's AM 
 should be notified about the change in headroom.
 ** Similarly, if a container is assigned to either application app1/app2, then 
 both AMs should be notified about their headroom.
 ** To simplify the whole communication process, it is ideal to keep headroom 
 per User per LeafQueue, so that everyone gets the same picture (apps belonging 
 to the same user and submitted to the same queue).
 * If a new user submits an application to the queue, then all applications 
 submitted by all users in that queue should be notified of the headroom 
 change.
 * Also, today headroom is an absolute number (I think it should be normalized, 
 but then this is not going to be backward compatible..)
 * Also, when an admin user refreshes a queue, headroom has to be updated.
 These are all potential bugs in the headroom calculation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2627) Add logs when attemptFailuresValidityInterval is enabled

2014-09-30 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153519#comment-14153519
 ] 

Zhijie Shen commented on YARN-2627:
---

+1, will commit after Jenkins' feedback.

 Add logs when attemptFailuresValidityInterval is enabled
 

 Key: YARN-2627
 URL: https://issues.apache.org/jira/browse/YARN-2627
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Xuan Gong
Assignee: Xuan Gong
 Attachments: YARN-2627.1.patch, YARN-2627.2.patch


 After YARN-611, users can specify attemptFailuresValidityInterval for their 
 applications. This is for testing/debugging purposes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1963) Support priorities across applications within the same queue

2014-09-30 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-1963:
--
Attachment: (was: YARN Application Priorities Design.pdf)

 Support priorities across applications within the same queue 
 -

 Key: YARN-1963
 URL: https://issues.apache.org/jira/browse/YARN-1963
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api, resourcemanager
Reporter: Arun C Murthy
Assignee: Sunil G

 It will be very useful to support priorities among applications within the 
 same queue, particularly in production scenarios. It allows for finer-grained 
 controls without having to force admins to create a multitude of queues, plus 
 allows existing applications to continue using existing queues which are 
 usually part of institutional memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1963) Support priorities across applications within the same queue

2014-09-30 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-1963:
--
Attachment: YARN Application Priorities Design.pdf

Attached updated design doc capturing comments.

Thank you.

 Support priorities across applications within the same queue 
 -

 Key: YARN-1963
 URL: https://issues.apache.org/jira/browse/YARN-1963
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api, resourcemanager
Reporter: Arun C Murthy
Assignee: Sunil G
 Attachments: YARN Application Priorities Design.pdf


 It will be very useful to support priorities among applications within the 
 same queue, particularly in production scenarios. It allows for finer-grained 
 controls without having to force admins to create a multitude of queues, plus 
 allows existing applications to continue using existing queues which are 
 usually part of institutional memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2494) [YARN-796] Node label manager API and storage implementations

2014-09-30 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153526#comment-14153526
 ] 

Wangda Tan commented on YARN-2494:
--

I feel Cluster is almost a superset of Collection. I prefer the set of method 
names [~cwelch] suggested.
Maybe {{ClusterNodeLabelsCollection}} is slightly clearer than 
{{ClusterNodeLabels}}, but I think that name is too long :)

 [YARN-796] Node label manager API and storage implementations
 -

 Key: YARN-2494
 URL: https://issues.apache.org/jira/browse/YARN-2494
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, 
 YARN-2494.patch, YARN-2494.patch, YARN-2494.patch


 This JIRA includes APIs and storage implementations of the node label manager.
 NodeLabelManager is an abstract class used to manage labels of nodes in the 
 cluster; it has APIs to query/modify
 - Nodes according to given label
 - Labels according to given hostname
 - Add/remove labels
 - Set labels of nodes in the cluster
 - Persist/recover changes of labels/labels-on-nodes to/from storage
 And it has two implementations to store modifications
 - Memory based storage: it will not persist changes, so all labels will be 
 lost when the RM restarts
 - FileSystem based storage: it will persist/recover to/from a FileSystem (like 
 HDFS), and all labels and labels-on-nodes will be recovered upon RM restart



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport

2014-09-30 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153536#comment-14153536
 ] 

Karthik Kambatla commented on YARN-2594:


We need to handle getFinalApplicationStatus, and maybe 
{{createAndGetApplicationReport}} as well. In the latter, we can replace direct 
access to {{diagnostics}} with {{getDiagnostics}} to avoid races on diagnostics.
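
A rough sketch of the kind of locked accessor being suggested, assuming a 
StringBuilder {{diagnostics}} field guarded by the app's {{readLock}} (not the 
actual patch):
{code}
public String getDiagnostics() {
  this.readLock.lock();
  try {
    // Copy out under the lock instead of letting callers touch the field.
    return this.diagnostics.toString();
  } finally {
    this.readLock.unlock();
  }
}
{code}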

 Potential deadlock in RM when querying ApplicationResourceUsageReport
 -

 Key: YARN-2594
 URL: https://issues.apache.org/jira/browse/YARN-2594
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Karam Singh
Assignee: Wangda Tan
Priority: Blocker
 Attachments: YARN-2594.patch, YARN-2594.patch


 ResourceManager sometimes becomes unresponsive:
 There was an exception in the ResourceManager log, which contains only the 
 following type of messages:
 {code}
 2014-09-19 19:13:45,241 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000
 2014-09-19 19:30:26,312 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000
 2014-09-19 19:47:07,351 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000
 2014-09-19 20:03:48,460 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000
 2014-09-19 20:20:29,542 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000
 2014-09-19 20:37:10,635 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000
 2014-09-19 20:53:51,722 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2610) Hamlet should close table tags

2014-09-30 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153539#comment-14153539
 ] 

Karthik Kambatla commented on YARN-2610:


Checking this in.. 

 Hamlet should close table tags
 --

 Key: YARN-2610
 URL: https://issues.apache.org/jira/browse/YARN-2610
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ray Chiang
Assignee: Ray Chiang
  Labels: supportability
 Attachments: YARN-2610-01.patch, YARN-2610-02.patch


 Revisiting a subset of MAPREDUCE-2993.
 The <th>, <td>, <thead>, <tfoot>, <tr> tags are not configured to close 
 properly in Hamlet.  While this is allowed in HTML 4.01, missing closing 
 table tags tend to wreak havoc with a lot of HTML processors (although not 
 usually browsers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport

2014-09-30 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153541#comment-14153541
 ] 

Karthik Kambatla commented on YARN-2594:


Also, it would be nice to add a comment next to the declaration of 
currentAttempt to say it is not protected by the readLock. 

 Potential deadlock in RM when querying ApplicationResourceUsageReport
 -

 Key: YARN-2594
 URL: https://issues.apache.org/jira/browse/YARN-2594
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Karam Singh
Assignee: Wangda Tan
Priority: Blocker
 Attachments: YARN-2594.patch, YARN-2594.patch


 ResourceManager sometimes becomes unresponsive:
 There was an exception in the ResourceManager log, which contains only the 
 following type of messages:
 {code}
 2014-09-19 19:13:45,241 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000
 2014-09-19 19:30:26,312 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000
 2014-09-19 19:47:07,351 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000
 2014-09-19 20:03:48,460 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000
 2014-09-19 20:20:29,542 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000
 2014-09-19 20:37:10,635 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000
 2014-09-19 20:53:51,722 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2610) Hamlet should close table tags

2014-09-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153555#comment-14153555
 ] 

Hudson commented on YARN-2610:
--

FAILURE: Integrated in Hadoop-trunk-Commit #6154 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6154/])
YARN-2610. Hamlet should close table tags. (Ray Chiang via kasha) (kasha: rev 
f7743dd07dfbe0dde9be71acfaba16ded52adba7)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/webapp/hamlet/TestHamlet.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/webapp/view/TestInfoBlock.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/hamlet/Hamlet.java
* hadoop-yarn-project/CHANGES.txt


 Hamlet should close table tags
 --

 Key: YARN-2610
 URL: https://issues.apache.org/jira/browse/YARN-2610
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ray Chiang
Assignee: Ray Chiang
  Labels: supportability
 Fix For: 2.6.0

 Attachments: YARN-2610-01.patch, YARN-2610-02.patch


 Revisiting a subset of MAPREDUCE-2993.
 The <th>, <td>, <thead>, <tfoot>, <tr> tags are not configured to close 
 properly in Hamlet.  While this is allowed in HTML 4.01, missing closing 
 table tags tend to wreak havoc with a lot of HTML processors (although not 
 usually browsers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport

2014-09-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153557#comment-14153557
 ] 

Hadoop QA commented on YARN-2594:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12672075/YARN-2594.patch
  against trunk revision ea32a66.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5182//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5182//console

This message is automatically generated.

 Potential deadlock in RM when querying ApplicationResourceUsageReport
 -

 Key: YARN-2594
 URL: https://issues.apache.org/jira/browse/YARN-2594
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Karam Singh
Assignee: Wangda Tan
Priority: Blocker
 Attachments: YARN-2594.patch, YARN-2594.patch


 ResoruceManager sometimes become un-responsive:
 There was in exception in ResourceManager log and contains only  following 
 type of messages:
 {code}
 2014-09-19 19:13:45,241 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000
 2014-09-19 19:30:26,312 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000
 2014-09-19 19:47:07,351 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000
 2014-09-19 20:03:48,460 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000
 2014-09-19 20:20:29,542 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000
 2014-09-19 20:37:10,635 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000
 2014-09-19 20:53:51,722 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport

2014-09-30 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153608#comment-14153608
 ] 

zhihai xu commented on YARN-2594:
-

Hi [~leftnoteasy],
It would be good to use a local variable to save currentAttempt, to avoid any 
potential null pointer exception in the future:
{code}
RMAppAttempt attempt = this.currentAttempt;
if (attempt != null) {
  return attempt.getTrackingUrl();
}
{code}
Without the lock, it is possible that this.currentAttempt changes between the 
null check and the call to getTrackingUrl.
Using a local variable to save currentAttempt avoids this race condition.

 Potential deadlock in RM when querying ApplicationResourceUsageReport
 -

 Key: YARN-2594
 URL: https://issues.apache.org/jira/browse/YARN-2594
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Karam Singh
Assignee: Wangda Tan
Priority: Blocker
 Attachments: YARN-2594.patch, YARN-2594.patch


 ResourceManager sometimes becomes unresponsive:
 There was an exception in the ResourceManager log, which contains only the 
 following type of messages:
 {code}
 2014-09-19 19:13:45,241 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000
 2014-09-19 19:30:26,312 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000
 2014-09-19 19:47:07,351 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000
 2014-09-19 20:03:48,460 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000
 2014-09-19 20:20:29,542 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000
 2014-09-19 20:37:10,635 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000
 2014-09-19 20:53:51,722 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-09-30 Thread Remus Rusanu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153611#comment-14153611
 ] 

Remus Rusanu commented on YARN-2198:


[~cwelch]: thanks for the review! I will address many of the comments with a new 
patch; in the meantime, some replies on issues I won't address:

 pom.xml - don’t see a /etc/hadoop or a wsce-site.xml, missed?
RR: Not sure what you mean. Do you expect a default wsce-site.xml in 
hadoop-common/src/conf?

 return (parent == null || parent2f.exists() || mkdirs(parent)) && 
 (mkOneDir(p2f) || p2f.isDirectory());
 so, I don't get this logic,  believe it will fail if the path exists and is 
 not a directory. Why not just do if p2f doesn't exist mkdirs(p2f)? seems much 
 simpler, and drops the need for mkOneDir
RR: This is actually the result of a problem Kevin hit during test deployments 
when the NM has access to child dirs but is denied access to parent dirs. The 
old NM code would attempt to mkdir every dir in the parent path, all the way to 
/. With existing access-denied dirs, this would fail, hence the need for my 
change. There is already a check in the unmodified code, a couple of lines 
above my change, for the parent existing and not being a dir.

 TestWinUtils:  can we add testing specific to security?
RR: I would like to add some, but it is not at all easy. The core tenet of the 
WSCE is the elevated privilege required for S4U impersonation, and having tests 
depend on that would pose many problems (false failures). Basically, starting 
the hadoopwinutilsvc service on the test box is infeasible. 

 WindowsSecureContainerExecutor - I really think there should be a 
 WindowsContainerExecutor
RR: While I agree that the class architecture separation of secure vs. 
non-secure and Windows vs. Linux leaves room for improvement, it is not my goal 
with these JIRAs to address that problem. In fact, I have an explicit mandate 
to the opposite: disturb the non-secure code paths as little as possible, to 
minimize regression risk.

 Remove the need to run NodeManager as privileged account for Windows Secure 
 Container Executor
 --

 Key: YARN-2198
 URL: https://issues.apache.org/jira/browse/YARN-2198
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Remus Rusanu
Assignee: Remus Rusanu
  Labels: security, windows
 Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, 
 YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, 
 YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, 
 YARN-2198.separation.patch, YARN-2198.trunk.10.patch, 
 YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, 
 YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch


 YARN-1972 introduces a Secure Windows Container Executor. However this 
 executor requires the process launching the container to be LocalSystem or a 
 member of the local Administrators group. Since the process in question is 
 the NodeManager, the requirement translates into running the entire NM as a 
 privileged account, a very large surface area to review and protect.
 This proposal is to move the privileged operations into a dedicated NT 
 service. The NM can run as a low-privilege account and communicate with the 
 privileged NT service when it needs to launch a container. This would reduce 
 the surface exposed to high privileges. 
 There has to exist a secure, authenticated and authorized channel of 
 communication between the NM and the privileged NT service. Possible 
 alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
 be to use Windows LPC (Local Procedure Calls), which is a Windows platform 
 specific inter-process communication channel that satisfies all requirements 
 and is easy to deploy. The privileged NT service would register and listen on 
 an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop 
 with libwinutils which would host the LPC client code. The client would 
 connect to the LPC port (NtConnectPort) and send a message requesting a 
 container launch (NtRequestWaitReplyPort). LPC provides authentication and 
 the privileged NT service can use authorization API (AuthZ) to validate the 
 caller.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2627) Add logs when attemptFailuresValidityInterval is enabled

2014-09-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153618#comment-14153618
 ] 

Hadoop QA commented on YARN-2627:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12672087/YARN-2627.2.patch
  against trunk revision cdf1af0.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5184//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5184//console

This message is automatically generated.

 Add logs when attemptFailuresValidityInterval is enabled
 

 Key: YARN-2627
 URL: https://issues.apache.org/jira/browse/YARN-2627
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Xuan Gong
Assignee: Xuan Gong
 Attachments: YARN-2627.1.patch, YARN-2627.2.patch


 After YARN-611, users can specify attemptFailuresValidityInterval for their 
 applications. This is for testing/debugging purposes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2627) Add logs when attemptFailuresValidityInterval is enabled

2014-09-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153628#comment-14153628
 ] 

Hadoop QA commented on YARN-2627:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12672087/YARN-2627.2.patch
  against trunk revision cdf1af0.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5185//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5185//console

This message is automatically generated.

 Add logs when attemptFailuresValidityInterval is enabled
 

 Key: YARN-2627
 URL: https://issues.apache.org/jira/browse/YARN-2627
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Xuan Gong
Assignee: Xuan Gong
 Attachments: YARN-2627.1.patch, YARN-2627.2.patch


 After YARN-611, users can specify attemptFailuresValidityInterval for their 
 applications. This is for testing/debugging purposes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2613) NMClient doesn't have retries for supporting rolling-upgrades

2014-09-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153656#comment-14153656
 ] 

Hadoop QA commented on YARN-2613:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12672090/YARN-2613.3.patch
  against trunk revision f7743dd.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5186//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5186//console

This message is automatically generated.

 NMClient doesn't have retries for supporting rolling-upgrades
 -

 Key: YARN-2613
 URL: https://issues.apache.org/jira/browse/YARN-2613
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2613.1.patch, YARN-2613.2.patch, YARN-2613.3.patch


 While the NM is undergoing a rolling upgrade, the client should retry until 
 it comes up. This jira is to add an NMProxy (similar to RMProxy) with a retry 
 implementation to support rolling upgrades.
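
 For context, a hedged sketch of the kind of retry policy such a proxy could 
 install, using the standard hadoop-common retry utilities (the actual NMProxy 
 wiring is in the patch):
 {code}
 import java.util.concurrent.TimeUnit;
 import org.apache.hadoop.io.retry.RetryPolicies;
 import org.apache.hadoop.io.retry.RetryPolicy;

 // Retry for up to 15 minutes, sleeping 10 seconds between attempts,
 // so a client call can outlive an NM restart during a rolling upgrade.
 RetryPolicy retryPolicy = RetryPolicies.retryUpToMaximumTimeWithFixedSleep(
     15 * 60 * 1000, 10 * 1000, TimeUnit.MILLISECONDS);
 {code}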



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2627) Add logs when attemptFailuresValidityInterval is enabled

2014-09-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153672#comment-14153672
 ] 

Hudson commented on YARN-2627:
--

FAILURE: Integrated in Hadoop-trunk-Commit #6155 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6155/])
YARN-2627. Added the info logs of attemptFailuresValidityInterval and number of 
previous failed attempts. Contributed by Xuan Gong. (zjshen: rev 
9582a50176800433ad3fa8829a50c28b859812a3)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
* hadoop-yarn-project/CHANGES.txt
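
Illustrative of the kind of info log the commit message describes (a sketch 
with assumed variable names, not the exact patch text):
{code}
if (attemptFailuresValidityInterval > 0) {
  LOG.info("The attemptFailuresValidityInterval for the application: "
      + applicationId + " is " + attemptFailuresValidityInterval + " ms.");
}
{code}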


 Add logs when attemptFailuresValidityInterval is enabled
 

 Key: YARN-2627
 URL: https://issues.apache.org/jira/browse/YARN-2627
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Xuan Gong
Assignee: Xuan Gong
 Attachments: YARN-2627.1.patch, YARN-2627.2.patch


 After YARN-611, users can specify attemptFailuresValidityInterval for their 
 applications. This is for testing/debugging purposes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport

2014-09-30 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-2594:
-
Attachment: YARN-2594.patch

[~zxu],
bq. It would be good to use a local variable to save currentAttempt, to avoid 
any potential null pointer exception in the future.
Good catch! Addressed.

[~kasha],
bq. We need to handle getFinalApplicationStatus, and maybe 
createAndGetApplicationReport as well. In the latter, we can replace direct 
access to diagnostics with getDiagnostics to avoid races on diagnostics.
{{getFinalApplicationStatus}} accesses statemachine.getCurrentState(), and 
{{createAndGetApplicationReport}} accesses 
statemachine.getCurrentState() and other fields.
To minimize the scope to the problem we can see now, I would suggest keeping 
the other fields as-is. 

bq. Also, it would be nice to add a comment next to the declaration of 
currentAttempt to say it is not protected by the readLock.
Addressed.

New patch attached.

 Potential deadlock in RM when querying ApplicationResourceUsageReport
 -

 Key: YARN-2594
 URL: https://issues.apache.org/jira/browse/YARN-2594
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Karam Singh
Assignee: Wangda Tan
Priority: Blocker
 Attachments: YARN-2594.patch, YARN-2594.patch, YARN-2594.patch


 ResourceManager sometimes becomes unresponsive:
 There was an exception in the ResourceManager log, which contains only the 
 following type of messages:
 {code}
 2014-09-19 19:13:45,241 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000
 2014-09-19 19:30:26,312 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000
 2014-09-19 19:47:07,351 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000
 2014-09-19 20:03:48,460 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000
 2014-09-19 20:20:29,542 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000
 2014-09-19 20:37:10,635 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000
 2014-09-19 20:53:51,722 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster

2014-09-30 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153697#comment-14153697
 ] 

Steve Loughran commented on YARN-913:
-

A couple of offline comments from Sanjay Radia:

# don't publish the full path in {{RegistryPathStatus}} fields; it only makes 
moving to indirection and cross-references harder in the future.
# don't differentiate hadoop-classic IPC from hadoop protobuf in the protocol 
list.


 Add a way to register long-lived services in a YARN cluster
 ---

 Key: YARN-913
 URL: https://issues.apache.org/jira/browse/YARN-913
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api, resourcemanager
Affects Versions: 2.5.0, 2.4.1
Reporter: Steve Loughran
Assignee: Steve Loughran
 Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 
 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, 
 YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, 
 YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, 
 YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, 
 YARN-913-010.patch, YARN-913-011.patch, YARN-913-012.patch, 
 YARN-913-013.patch, yarnregistry.pdf, yarnregistry.tla


 In a YARN cluster you can't predict where services will come up -or on what 
 ports. The services need to work those things out as they come up and then 
 publish them somewhere.
 Applications need to be able to find the service instance they are to bond to 
 -and not any others in the cluster.
 Some kind of service registry -in the RM, in ZK, could do this. If the RM 
 held the write access to the ZK nodes, it would be more secure than having 
 apps register with ZK themselves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-09-30 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-90:
--
Attachment: apache-yarn-90.7.patch

Uploaded a new patch to address the comments by Jason.

{quote}

bq. I've changed it to "Disk(s) health report: ". My only concern with this 
is that there might be scripts looking for the "Disk(s) failed" log line for 
monitoring. What do you think?

If that's true then the code should bother to do a diff between the old disk 
list and the new one, logging which disks turned bad using the "Disk(s) failed" 
line and which disks became healthy with some other log message.
{quote}

Fixed. We now have two log messages - one indicating when disks go bad and one 
when disks get marked as good.
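
A minimal sketch of the diff-and-log approach, assuming java.util sets of good 
dirs from the previous and current health checks (names assumed, not the 
actual patch):
{code}
private void logDiskHealthChanges(Set<String> previousGoodDirs,
    Set<String> currentGoodDirs) {
  Set<String> turnedBad = new HashSet<String>(previousGoodDirs);
  turnedBad.removeAll(currentGoodDirs);
  Set<String> turnedGood = new HashSet<String>(currentGoodDirs);
  turnedGood.removeAll(previousGoodDirs);
  if (!turnedBad.isEmpty()) {
    LOG.info("Disk(s) failed: " + turnedBad);       // the line monitoring scripts expect
  }
  if (!turnedGood.isEmpty()) {
    LOG.info("Disk(s) turned good: " + turnedGood); // the new, separate message
  }
}
{code}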

{quote}
bq. Directories are only cleaned up during startup. The code tests for 
existence of the directories and the correct permissions. This does mean that 
container directories left behind for any reason won't get cleaned up until the 
NodeManager is restarted. Is that ok?

This could still be problematic for the NM work-preserving restart case, as we 
could try to delete an entire disk tree with active containers on it due to a 
hiccup when the NM restarts. I think a better approach is a periodic cleanup 
scan that looks for directories under yarn-local and yarn-logs that shouldn't 
be there. This could be part of the health check scan or done separately. That 
way we don't have to wait for a disk to turn good or bad to catch leaked 
entities on the disk due to some hiccup. Sorta like an fsck for the NM state on 
disk. That is best done as a separate JIRA, as I think this functionality is 
still an incremental improvement without it.
{quote}

The current code will only clean up if NM recovery can't be carried out.
{noformat}
  if (!stateStore.canRecover()) {
cleanUpLocalDirs(lfs, delService);
initializeLocalDirs(lfs);
initializeLogDirs(lfs);
  }
{noformat}
Will that handle the case you mentioned?

bq. checkDirs unnecessarily calls union(errorDirs, fullDirs) twice.

Fixed.

bq. isDiskFreeSpaceOverLimit is now named backwards, as the code returns true 
if the free space is under the limit.

Fixed.
bq. getLocalDirsForCleanup and getLogDirsForCleanup should have javadoc 
comments like the other methods.

Fixed.

{quote}
Nit: The union utility function doesn't technically perform a union but rather 
a concatenation, and it'd be a little clearer if the name reflected that. Also 
the function should leverage the fact that it knows how big the ArrayList will 
be after the operations and give it the appropriate hint to its constructor to 
avoid reallocations.
{quote}

Fixed.
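
A sketch of the helper under that naming, assuming java.util lists (a 
concatenation, sized up front to avoid ArrayList reallocations; not the actual 
patch):
{code}
/** Concatenates two lists into a new list sized to hold both. */
private static <T> List<T> concat(List<T> a, List<T> b) {
  List<T> result = new ArrayList<T>(a.size() + b.size()); // size hint, no regrowth
  result.addAll(a);
  result.addAll(b);
  return result;
}
{code}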

 NodeManager should identify failed disks becoming good back again
 -

 Key: YARN-90
 URL: https://issues.apache.org/jira/browse/YARN-90
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Ravi Gummadi
Assignee: Varun Vasudev
 Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
 YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, 
 apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, 
 apache-yarn-90.5.patch, apache-yarn-90.6.patch, apache-yarn-90.7.patch


 MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
 down, it is marked as failed forever. To reuse that disk (after it becomes 
 good), NodeManager needs restart. This JIRA is to improve NodeManager to 
 reuse good disks(which could be bad some time back).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2623) Linux container executor only uses the first local directory to copy the token file in container-executor.c.

2014-09-30 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153709#comment-14153709
 ] 

zhihai xu commented on YARN-2623:
-

Thanks for the information. I think a better way to solve the issue is to 
choose the local directory with the most free disk space.
I will implement the patch by copying the token file to the local directory 
with the most free disk space.
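
A hedged sketch of that selection, written in Java for illustration (the 
actual change belongs in container-executor.c, and the helper name is assumed):
{code}
import java.io.File;
import java.util.List;

/** Picks the local dir with the most usable space for the token file copy. */
static String pickLocalDirWithMostFreeSpace(List<String> localDirs) {
  String best = null;
  long bestFree = -1;
  for (String dir : localDirs) {
    long free = new File(dir).getUsableSpace(); // 0 for bad/unreadable paths
    if (free > bestFree) {
      bestFree = free;
      best = dir;
    }
  }
  return best;
}
{code}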

 Linux container executor only uses the first local directory to copy the 
 token file in container-executor.c.
 ---

 Key: YARN-2623
 URL: https://issues.apache.org/jira/browse/YARN-2623
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
 Environment: Linux container executor only uses the first local 
 directory to copy the token file in container-executor.c.
Reporter: zhihai xu
Assignee: zhihai xu

 Linux container executor only uses the first local directory to copy the 
 token file in container-executor.c. If it fails to copy the token file to the 
 first local directory, a localization failure event will happen, even though 
 it could copy the token file to another local directory successfully. The 
 correct behavior would be to copy the token file to the next local directory 
 if the first one fails.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster

2014-09-30 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153735#comment-14153735
 ] 

Steve Loughran commented on YARN-913:
-

w.r.t. the first comment, putting the path in the registry status field:

* we don't actually need to do this, not if we return the stat'd entries as a 
map of name:status.
* and we can pull that operation, currently called {{listFully}}, out of the 
operations interface and put it in {{RegistryOperationsUtils}}. This will make 
clear that it's a separate operation, and we can emphasise that it's non-atomic 
(see the sketch below).
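
Something like the following shape, as a deliberately non-atomic utility 
(signatures assumed for the sketch):
{code}
public static Map<String, RegistryPathStatus> listFully(
    RegistryOperations operations, String path) throws IOException {
  Map<String, RegistryPathStatus> results =
      new HashMap<String, RegistryPathStatus>();
  for (String child : operations.list(path)) {
    // Non-atomic by design: entries can change between list() and stat().
    results.put(child, operations.stat(path + "/" + child));
  }
  return results;
}
{code}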

 Add a way to register long-lived services in a YARN cluster
 ---

 Key: YARN-913
 URL: https://issues.apache.org/jira/browse/YARN-913
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api, resourcemanager
Affects Versions: 2.5.0, 2.4.1
Reporter: Steve Loughran
Assignee: Steve Loughran
 Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 
 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, 
 YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, 
 YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, 
 YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, 
 YARN-913-010.patch, YARN-913-011.patch, YARN-913-012.patch, 
 YARN-913-013.patch, yarnregistry.pdf, yarnregistry.tla


 In a YARN cluster you can't predict where services will come up -or on what 
 ports. The services need to work those things out as they come up and then 
 publish them somewhere.
 Applications need to be able to find the service instance they are to bond to 
 -and not any others in the cluster.
 Some kind of service registry -in the RM, in ZK, could do this. If the RM 
 held the write access to the ZK nodes, it would be more secure than having 
 apps register with ZK themselves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2624) Resource Localization fails on a cluster due to existing cache directories

2014-09-30 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-2624:

Description: 
We have found resource localization fails on a cluster with the following 
error in certain cases.

{noformat}
INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Failed to download rsrc { { 
hdfs://blahhostname:8020/tmp/hive-hive/hive_2014-09-29_14-55-45_184_6531377394813896912-12/-mr-10004/95a07b90-2448-48fc-bcda-cdb7400b4975/map.xml,
 1412027745352, FILE, null 
},pending,[(container_1411670948067_0009_02_01)],443533288192637,DOWNLOADING}
java.io.IOException: Rename cannot overwrite non empty destination directory 
/data/yarn/nm/filecache/27
at 
org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716)
at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:228)
at 
org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659)
at org.apache.hadoop.fs.FileContext.rename(FileContext.java:906)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59)
{noformat}

  was:
We have found resource localization fails on a secure cluster with the 
following error in certain cases. This happens at some indeterminate point, 
after which it will keep failing until the NM is restarted.

{noformat}
INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Failed to download rsrc { { 
hdfs://blahhostname:8020/tmp/hive-hive/hive_2014-09-29_14-55-45_184_6531377394813896912-12/-mr-10004/95a07b90-2448-48fc-bcda-cdb7400b4975/map.xml,
 1412027745352, FILE, null 
},pending,[(container_1411670948067_0009_02_01)],443533288192637,DOWNLOADING}
java.io.IOException: Rename cannot overwrite non empty destination directory 
/data/yarn/nm/filecache/27
at 
org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716)
at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:228)
at 
org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659)
at org.apache.hadoop.fs.FileContext.rename(FileContext.java:906)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59)
{noformat}

Summary: Resource Localization fails on a cluster due to existing cache 
directories  (was: Resource Localization fails on a secure cluster until nm are 
restarted)

 Resource Localization fails on a cluster due to existing cache directories
 --

 Key: YARN-2624
 URL: https://issues.apache.org/jira/browse/YARN-2624
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot

 We have found resource localization fails on a cluster with the following 
 error in certain cases.
 {noformat}
 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
  Failed to download rsrc { { 
 hdfs://blahhostname:8020/tmp/hive-hive/hive_2014-09-29_14-55-45_184_6531377394813896912-12/-mr-10004/95a07b90-2448-48fc-bcda-cdb7400b4975/map.xml,
  1412027745352, FILE, null 
 },pending,[(container_1411670948067_0009_02_01)],443533288192637,DOWNLOADING}
 java.io.IOException: Rename cannot overwrite non empty destination directory 
 /data/yarn/nm/filecache/27
   at 
 org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716)
   at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:228)
   at 
 org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659)
   at org.apache.hadoop.fs.FileContext.rename(FileContext.java:906)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster

2014-09-30 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153740#comment-14153740
 ] 

Steve Loughran commented on YARN-913:
-

w.r.t [~aw]'s comments:

bq. If a client needs to talk to more than one ZK, it sounds like they are 
basically screwed.

If you are grabbing bindings/configs via the CLI, it's not a worry, nor if you 
are talking to one ZK quorum with the same auth policy. It's when you start 
tuning SASL auth and various timeouts that this arises. This is not an 
issue with the registry; it's the ZK client here.

bq. I was mainly looking at the hostname pattern:
{code}
+  String HOSTNAME_PATTERN =
+      "([a-z0-9]|[a-z0-9][a-z0-9\\-]*[a-z0-9])";
{code}
bq. It doesn't appear to support periods/dots.

That's just the pattern for entries in the registry path itself; you can't give 
a service a name like -#foo as DNS won't like it. Stick whatever you want in 
the fields themselves.

I'll javadoc that field.
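
For a quick check of what that pattern accepts and rejects (illustrative only):
{code}
Pattern p = Pattern.compile("([a-z0-9]|[a-z0-9][a-z0-9\\-]*[a-z0-9])");
p.matcher("namenode-1").matches(); // true: hyphen allowed in the middle
p.matcher("-foo").matches();       // false: cannot start with a hyphen
p.matcher("web.ui").matches();     // false: no dots in a single path entry
{code}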

 Add a way to register long-lived services in a YARN cluster
 ---

 Key: YARN-913
 URL: https://issues.apache.org/jira/browse/YARN-913
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api, resourcemanager
Affects Versions: 2.5.0, 2.4.1
Reporter: Steve Loughran
Assignee: Steve Loughran
 Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 
 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, 
 YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, 
 YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, 
 YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, 
 YARN-913-010.patch, YARN-913-011.patch, YARN-913-012.patch, 
 YARN-913-013.patch, yarnregistry.pdf, yarnregistry.tla


 In a YARN cluster you can't predict where services will come up -or on what 
 ports. The services need to work those things out as they come up and then 
 publish them somewhere.
 Applications need to be able to find the service instance they are to bond to 
 -and not any others in the cluster.
 Some kind of service registry -in the RM, in ZK, could do this. If the RM 
 held the write access to the ZK nodes, it would be more secure than having 
 apps register with ZK themselves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster

2014-09-30 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153745#comment-14153745
 ] 

Steve Loughran commented on YARN-913:
-

[~aw]: new constant for you, plus javadocs:
{code}
  /**
   * Pattern of a single entry in the registry path: {@value}.
   * <p>
   * This is what constitutes a valid hostname according to current RFCs:
   * alphanumeric first and last characters, with alphanumerics
   * and hyphens allowed in between.
   * <p>
   * No upper limit is placed on the size of an entry.
   */
{code}
Better?

 Add a way to register long-lived services in a YARN cluster
 ---

 Key: YARN-913
 URL: https://issues.apache.org/jira/browse/YARN-913
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api, resourcemanager
Affects Versions: 2.5.0, 2.4.1
Reporter: Steve Loughran
Assignee: Steve Loughran
 Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 
 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, 
 YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, 
 YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, 
 YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, 
 YARN-913-010.patch, YARN-913-011.patch, YARN-913-012.patch, 
 YARN-913-013.patch, yarnregistry.pdf, yarnregistry.tla


 In a YARN cluster you can't predict where services will come up -or on what 
 ports. The services need to work those things out as they come up and then 
 publish them somewhere.
 Applications need to be able to find the service instance they are to bond to 
 -and not any others in the cluster.
 Some kind of service registry -in the RM, in ZK, could do this. If the RM 
 held the write access to the ZK nodes, it would be more secure than having 
 apps register with ZK themselves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2624) Resource Localization fails on a cluster due to existing cache directories

2014-09-30 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153749#comment-14153749
 ] 

Anubhav Dhoot commented on YARN-2624:
-

What we see is that a bunch of preexisting local resource cache directories 
conflict with the new resource download. The destination directory chosen via 
uniqueNumberGenerator is one of these, and without 
[HADOOP-9438|https://issues.apache.org/jira/browse/HADOOP-9438] we don't know 
until the rename fails.
Resetting uniqueNumberGenerator based on recoverResource does not seem to be 
enough. We may need to check the state of the NM's cache directory and reset to 
the highest number in the directory. 
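
A hedged sketch of that reset (helper name assumed, not the actual fix):
{code}
/** Returns the highest numeric directory name under the NM cache root. */
static long highestExistingCacheDir(File cacheRoot) {
  long max = 0;
  File[] entries = cacheRoot.listFiles();
  if (entries != null) {
    for (File e : entries) {
      try {
        max = Math.max(max, Long.parseLong(e.getName()));
      } catch (NumberFormatException ignored) {
        // non-numeric entries are not cache directories
      }
    }
  }
  return max; // seed uniqueNumberGenerator above this value
}
{code}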


 Resource Localization fails on a cluster due to existing cache directories
 --

 Key: YARN-2624
 URL: https://issues.apache.org/jira/browse/YARN-2624
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot

 We have found resource localization fails on a cluster with the following 
 error in certain cases.
 {noformat}
 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
  Failed to download rsrc { { 
 hdfs://blahhostname:8020/tmp/hive-hive/hive_2014-09-29_14-55-45_184_6531377394813896912-12/-mr-10004/95a07b90-2448-48fc-bcda-cdb7400b4975/map.xml,
  1412027745352, FILE, null 
 },pending,[(container_1411670948067_0009_02_01)],443533288192637,DOWNLOADING}
 java.io.IOException: Rename cannot overwrite non empty destination directory 
 /data/yarn/nm/filecache/27
   at 
 org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716)
   at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:228)
   at 
 org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659)
   at org.apache.hadoop.fs.FileContext.rename(FileContext.java:906)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster

2014-09-30 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153748#comment-14153748
 ] 

Steve Loughran commented on YARN-913:
-

Oh, and I renamed the field:
{code}
  /**
   * Pattern of a single entry in the registry path: {@value}.
   * <p>
   * This is what constitutes a valid hostname according to current RFCs:
   * alphanumeric first and last characters, with alphanumerics
   * and hyphens allowed in between.
   * <p>
   * No upper limit is placed on the size of an entry.
   */
  String VALID_PATH_ENTRY_PATTERN =
      "([a-z0-9]|[a-z0-9][a-z0-9\\-]*[a-z0-9])";
  {code}

 Add a way to register long-lived services in a YARN cluster
 ---

 Key: YARN-913
 URL: https://issues.apache.org/jira/browse/YARN-913
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api, resourcemanager
Affects Versions: 2.5.0, 2.4.1
Reporter: Steve Loughran
Assignee: Steve Loughran
 Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 
 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, 
 YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, 
 YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, 
 YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, 
 YARN-913-010.patch, YARN-913-011.patch, YARN-913-012.patch, 
 YARN-913-013.patch, yarnregistry.pdf, yarnregistry.tla


 In a YARN cluster you can't predict where services will come up -or on what 
 ports. The services need to work those things out as they come up and then 
 publish them somewhere.
 Applications need to be able to find the service instance they are to bond to 
 -and not any others in the cluster.
 Some kind of service registry -in the RM, in ZK, could do this. If the RM 
 held the write access to the ZK nodes, it would be more secure than having 
 apps register with ZK themselves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-09-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153750#comment-14153750
 ] 

Hadoop QA commented on YARN-90:
---

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12672125/apache-yarn-90.7.patch
  against trunk revision 9582a50.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5187//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5187//console

This message is automatically generated.

 NodeManager should identify failed disks becoming good back again
 -

 Key: YARN-90
 URL: https://issues.apache.org/jira/browse/YARN-90
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Ravi Gummadi
Assignee: Varun Vasudev
 Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
 YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, 
 apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, 
 apache-yarn-90.5.patch, apache-yarn-90.6.patch, apache-yarn-90.7.patch


 MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
 down, it is marked as failed forever. To reuse that disk (after it becomes 
 good), the NodeManager needs a restart. This JIRA is to improve the NodeManager 
 to reuse good disks (which could have been bad some time back).
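
A minimal sketch of the kind of re-check involved (a hypothetical helper, not 
the attached patch): periodically re-probe each failed directory with a small 
write test and move it back to the good list once the probe succeeds.
{code}
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class DiskRecheck {
  private final List<File> goodDirs = new ArrayList<File>();
  private final List<File> failedDirs = new ArrayList<File>();

  /** Run periodically: a successful write probe moves a dir back to good. */
  void recheckFailedDirs() {
    for (Iterator<File> it = failedDirs.iterator(); it.hasNext();) {
      File dir = it.next();
      if (probe(dir)) {
        it.remove();
        goodDirs.add(dir);
      }
    }
  }

  private boolean probe(File dir) {
    try {
      if (!dir.exists() && !dir.mkdirs()) {
        return false;
      }
      File f = File.createTempFile("disk-probe", null, dir);
      return f.delete();   // dir is writable (and cleanable) again
    } catch (IOException e) {
      return false;        // still bad; keep it on the failed list
    }
  }
}
{code}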



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2628) Capacity scheduler with DominantResourceCalculator carries out reservation even though slots are free

2014-09-30 Thread Varun Vasudev (JIRA)
Varun Vasudev created YARN-2628:
---

 Summary: Capacity scheduler with DominantResourceCalculator 
carries out reservation even though slots are free
 Key: YARN-2628
 URL: https://issues.apache.org/jira/browse/YARN-2628
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.5.1
Reporter: Varun Vasudev
Assignee: Varun Vasudev


We've noticed that if you run the CapacityScheduler with the 
DominantResourceCalculator, sometimes apps will end up with containers in a 
reserved state even though free slots are available.

The root cause seems to be this piece of code from CapacityScheduler.java -
{noformat}
// Try to schedule more if there are no reservations to fulfill
if (node.getReservedContainer() == null) {
  if (Resources.greaterThanOrEqual(calculator, getClusterResource(),
      node.getAvailableResource(), minimumAllocation)) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Trying to schedule on node: " + node.getNodeName() +
          ", available: " + node.getAvailableResource());
    }
    root.assignContainers(clusterResource, node, false);
  }
} else {
  LOG.info("Skipping scheduling since node " + node.getNodeID() +
      " is reserved by application " +
      node.getReservedContainer().getContainerId().getApplicationAttemptId()
      );
}
{noformat}

The code is meant to check whether a node has any slots available for 
containers. Since it uses the greaterThanOrEqual function, we end up in a 
situation where greaterThanOrEqual returns true even though we may not have 
enough CPU or memory to actually run the container.
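
To see why, here is a toy version of the dominant-share comparison (illustrative 
numbers only, not the scheduler code): a node with zero vcores left can still 
compare as greater-or-equal to the minimum allocation, because only the dominant 
fraction is compared.
{code}
public class DominantShareDemo {
  // Dominant share: the larger per-resource fraction of the cluster total.
  static double dominantShare(long memMB, long vcores,
                              long clusterMemMB, long clusterVcores) {
    return Math.max((double) memMB / clusterMemMB,
                    (double) vcores / clusterVcores);
  }

  public static void main(String[] args) {
    long clusterMemMB = 102400, clusterVcores = 100;
    // Node: 8 GB free but 0 vcores left.
    double avail = dominantShare(8192, 0, clusterMemMB, clusterVcores);
    // Minimum allocation: 1 GB and 1 vcore.
    double min = dominantShare(1024, 1, clusterMemMB, clusterVcores);
    // 0.08 >= 0.01, so the check passes although no vcore is available.
    System.out.println("available >= minimum ? " + (avail >= min));
  }
}
{code}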



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2629) Make distributed shell use the domain-based timeline ACLs

2014-09-30 Thread Zhijie Shen (JIRA)
Zhijie Shen created YARN-2629:
-

 Summary: Make distributed shell use the domain-based timeline ACLs
 Key: YARN-2629
 URL: https://issues.apache.org/jira/browse/YARN-2629
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen


To demonstrate the usage of this feature (YARN-2102), it's good to make the 
distributed shell create the domain and post its timeline entities into this 
private space.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2629) Make distributed shell use the domain-based timeline ACLs

2014-09-30 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2629:
--
 Component/s: timelineserver
Target Version/s: 2.6.0

 Make distributed shell use the domain-based timeline ACLs
 -

 Key: YARN-2629
 URL: https://issues.apache.org/jira/browse/YARN-2629
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen

 To demonstrate the usage of this feature (YARN-2102), it's good to make 
 the distributed shell create the domain and post its timeline entities into 
 this private space.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport

2014-09-30 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153771#comment-14153771
 ] 

Karthik Kambatla commented on YARN-2594:


Fair enough. We could improve the locking in RMAppImpl further, but I guess the 
follow-up JIRA to fix SchedulerApplicationAttempt would take care of things in 
a better way.

+1, pending Jenkins. 

 Potential deadlock in RM when querying ApplicationResourceUsageReport
 -

 Key: YARN-2594
 URL: https://issues.apache.org/jira/browse/YARN-2594
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Karam Singh
Assignee: Wangda Tan
Priority: Blocker
 Attachments: YARN-2594.patch, YARN-2594.patch, YARN-2594.patch


 The ResourceManager sometimes becomes unresponsive.
 There was no exception in the ResourceManager log; it contains only the 
 following type of messages:
 {code}
 2014-09-19 19:13:45,241 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000
 2014-09-19 19:30:26,312 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000
 2014-09-19 19:47:07,351 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000
 2014-09-19 20:03:48,460 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000
 2014-09-19 20:20:29,542 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000
 2014-09-19 20:37:10,635 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000
 2014-09-19 20:53:51,722 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport

2014-09-30 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153779#comment-14153779
 ] 

zhihai xu commented on YARN-2594:
-

The new patch looks good to me.

 Potential deadlock in RM when querying ApplicationResourceUsageReport
 -

 Key: YARN-2594
 URL: https://issues.apache.org/jira/browse/YARN-2594
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Karam Singh
Assignee: Wangda Tan
Priority: Blocker
 Attachments: YARN-2594.patch, YARN-2594.patch, YARN-2594.patch


 The ResourceManager sometimes becomes unresponsive.
 There was no exception in the ResourceManager log; it contains only the 
 following type of messages:
 {code}
 2014-09-19 19:13:45,241 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000
 2014-09-19 19:30:26,312 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000
 2014-09-19 19:47:07,351 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000
 2014-09-19 20:03:48,460 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000
 2014-09-19 20:20:29,542 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000
 2014-09-19 20:37:10,635 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000
 2014-09-19 20:53:51,722 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2179) Initial cache manager structure and context

2014-09-30 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153824#comment-14153824
 ] 

Karthik Kambatla commented on YARN-2179:


Extending YarnClientImpl for the test seems reasonable to me. 

+1, assuming TestRemoteAppChecker is the only file changed in the latest patch. 

 Initial cache manager structure and context
 ---

 Key: YARN-2179
 URL: https://issues.apache.org/jira/browse/YARN-2179
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Chris Trezzo
Assignee: Chris Trezzo
 Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v10.patch, 
 YARN-2179-trunk-v2.patch, YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, 
 YARN-2179-trunk-v5.patch, YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, 
 YARN-2179-trunk-v8.patch, YARN-2179-trunk-v9.patch


 Implement the initial shared cache manager structure and context. The 
 SCMContext will be used by a number of manager services (i.e., the backing 
 store and the cleaner service). The AppChecker is used to gather the 
 currently running applications on SCM startup (necessary for an SCM that is 
 backed by an in-memory store).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2630) TestDistributedShell#testDSRestartWithPreviousRunningContainers fails

2014-09-30 Thread Jian He (JIRA)
Jian He created YARN-2630:
-

 Summary: 
TestDistributedShell#testDSRestartWithPreviousRunningContainers fails
 Key: YARN-2630
 URL: https://issues.apache.org/jira/browse/YARN-2630
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He


The problem is that after YARN-1372, the re-launched AM will also receive the 
previously failed AM container, and the DistributedShell logic is not expecting 
this extra completed container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2630) TestDistributedShell#testDSRestartWithPreviousRunningContainers fails

2014-09-30 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2630:
--
Description: The problem is that after YARN-1372, in work-preserving AM 
restart, the re-launched AM will also receive the previously failed AM 
container, but the DistributedShell logic is not expecting this extra 
completed container.
(was: The problem is that after YARN-1372, the re-launched AM will also receive 
previously failed AM container. And DistributedShell logic is not expecting 
this extra completed container.  )

 TestDistributedShell#testDSRestartWithPreviousRunningContainers fails
 -

 Key: YARN-2630
 URL: https://issues.apache.org/jira/browse/YARN-2630
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He

 The problem is that after YARN-1372, in work-preserving AM restart, the 
 re-launched AM will also receive the previously failed AM container, but the 
 DistributedShell logic is not expecting this extra completed container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2630) TestDistributedShell#testDSRestartWithPreviousRunningContainers fails

2014-09-30 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2630:
--
Attachment: YARN-2630.1.patch

Uploaded a patch to make RMAppAttempt not return the AM container.

 TestDistributedShell#testDSRestartWithPreviousRunningContainers fails
 -

 Key: YARN-2630
 URL: https://issues.apache.org/jira/browse/YARN-2630
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2630.1.patch


 The problem is that after YARN-1372, in work-preserving AM restart, the 
 re-launched AM will also receive the previously failed AM container, but the 
 DistributedShell logic is not expecting this extra completed container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2387) Resource Manager crashes with NPE due to lack of synchronization

2014-09-30 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153888#comment-14153888
 ] 

Jason Lowe commented on YARN-2387:
--

+1 lgtm.  Committing this.

 Resource Manager crashes with NPE due to lack of synchronization
 

 Key: YARN-2387
 URL: https://issues.apache.org/jira/browse/YARN-2387
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0, 2.5.0
Reporter: Mit Desai
Assignee: Mit Desai
Priority: Blocker
 Attachments: YARN-2387.patch, YARN-2387.patch, YARN-2387.patch


 We recently came across a 0.23 RM crashing with an NPE. Here is the 
 stacktrace for it.
 {noformat}
 2014-08-06 05:56:52,165 [ResourceManager Event Processor] FATAL
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in
 handling event type NODE_UPDATE to the scheduler
 java.lang.NullPointerException
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToBuilder(ContainerStatusPBImpl.java:61)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToProto(ContainerStatusPBImpl.java:68)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:53)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:34)
 at
 org.apache.hadoop.yarn.api.records.ProtoBase.toString(ProtoBase.java:55)
 at java.lang.String.valueOf(String.java:2854)
 at java.lang.StringBuilder.append(StringBuilder.java:128)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl.toString(ContainerPBImpl.java:353)
 at java.lang.String.valueOf(String.java:2854)
 at java.lang.StringBuilder.append(StringBuilder.java:128)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1405)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:790)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:602)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:688)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:82)
 at
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:339)
 at java.lang.Thread.run(Thread.java:722)
 2014-08-06 05:56:52,166 [ResourceManager Event Processor] INFO
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
 {noformat}
 On investigating the issue, we found that ContainerStatusPBImpl has 
 methods that are called from different threads but are not synchronized; the 
 2.x code looks the same.
 We need to make these methods synchronized so that we do not encounter this 
 problem in the future.
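
A minimal sketch of the locking shape being proposed (stand-in fields only, not 
the real protobuf-backed class): the merge-and-build path goes behind the 
object lock so a concurrent toString() cannot race the builder.
{code}
// Stand-in fields; the real ContainerStatusPBImpl merges several fields
// into a generated protobuf builder.
public class StatusPBImplSketch {
  private StringBuilder builder = new StringBuilder();
  private String proto;
  private boolean viaProto;

  private synchronized void mergeLocalToProto() {
    builder.append("local-state");  // racy when two threads merge at once
    proto = builder.toString();
    viaProto = true;
  }

  public synchronized String getProto() {
    if (!viaProto) {
      mergeLocalToProto();
    }
    return proto;
  }

  @Override
  public synchronized String toString() {
    return getProto();  // the toString() path is what hit the NPE above
  }
}
{code}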



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2179) Initial cache manager structure and context

2014-09-30 Thread Chris Trezzo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153907#comment-14153907
 ] 

Chris Trezzo commented on YARN-2179:


Yes that was the only file changed.

 Initial cache manager structure and context
 ---

 Key: YARN-2179
 URL: https://issues.apache.org/jira/browse/YARN-2179
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Chris Trezzo
Assignee: Chris Trezzo
 Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v10.patch, 
 YARN-2179-trunk-v2.patch, YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, 
 YARN-2179-trunk-v5.patch, YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, 
 YARN-2179-trunk-v8.patch, YARN-2179-trunk-v9.patch


 Implement the initial shared cache manager structure and context. The 
 SCMContext will be used by a number of manager services (i.e., the backing 
 store and the cleaner service). The AppChecker is used to gather the 
 currently running applications on SCM startup (necessary for an SCM that is 
 backed by an in-memory store).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2610) Hamlet should close table tags

2014-09-30 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153921#comment-14153921
 ] 

Jason Lowe commented on YARN-2610:
--

Looks like branch-2.6 was just cut as this was going in, and it missed that 
branch.  Karthik, could you cherry-pick to that branch as well?

 Hamlet should close table tags
 --

 Key: YARN-2610
 URL: https://issues.apache.org/jira/browse/YARN-2610
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ray Chiang
Assignee: Ray Chiang
  Labels: supportability
 Fix For: 2.6.0

 Attachments: YARN-2610-01.patch, YARN-2610-02.patch


 Revisiting a subset of MAPREDUCE-2993.
 The <th>, <td>, <thead>, <tfoot>, and <tr> tags are not configured to close 
 properly in Hamlet.  While this is allowed in HTML 4.01, missing closing 
 table tags tend to wreak havoc with a lot of HTML processors (although not 
 usually browsers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2610) Hamlet should close table tags

2014-09-30 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153941#comment-14153941
 ] 

Karthik Kambatla commented on YARN-2610:


Thanks for catching it, Jason. Just cherry-picked to branch-2.6 as well. 

 Hamlet should close table tags
 --

 Key: YARN-2610
 URL: https://issues.apache.org/jira/browse/YARN-2610
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ray Chiang
Assignee: Ray Chiang
  Labels: supportability
 Fix For: 2.6.0

 Attachments: YARN-2610-01.patch, YARN-2610-02.patch


 Revisiting a subset of MAPREDUCE-2993.
 The <th>, <td>, <thead>, <tfoot>, and <tr> tags are not configured to close 
 properly in Hamlet.  While this is allowed in HTML 4.01, missing closing 
 table tags tend to wreak havoc with a lot of HTML processors (although not 
 usually browsers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport

2014-09-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153968#comment-14153968
 ] 

Hadoop QA commented on YARN-2594:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12672114/YARN-2594.patch
  against trunk revision 9582a50.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5188//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5188//console

This message is automatically generated.

 Potential deadlock in RM when querying ApplicationResourceUsageReport
 -

 Key: YARN-2594
 URL: https://issues.apache.org/jira/browse/YARN-2594
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Karam Singh
Assignee: Wangda Tan
Priority: Blocker
 Attachments: YARN-2594.patch, YARN-2594.patch, YARN-2594.patch


 The ResourceManager sometimes becomes unresponsive.
 There was no exception in the ResourceManager log; it contains only the 
 following type of messages:
 {code}
 2014-09-19 19:13:45,241 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000
 2014-09-19 19:30:26,312 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000
 2014-09-19 19:47:07,351 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000
 2014-09-19 20:03:48,460 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000
 2014-09-19 20:20:29,542 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000
 2014-09-19 20:37:10,635 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000
 2014-09-19 20:53:51,722 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2631) Modify DistributedShell to enable LogAggregationContext

2014-09-30 Thread Xuan Gong (JIRA)
Xuan Gong created YARN-2631:
---

 Summary: Modify DistributedShell to enable LogAggregationContext
 Key: YARN-2631
 URL: https://issues.apache.org/jira/browse/YARN-2631
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Xuan Gong
Assignee: Xuan Gong






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2387) Resource Manager crashes with NPE due to lack of synchronization

2014-09-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153975#comment-14153975
 ] 

Hudson commented on YARN-2387:
--

FAILURE: Integrated in Hadoop-trunk-Commit #6156 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6156/])
YARN-2387. Resource Manager crashes with NPE due to lack of synchronization. 
Contributed by Mit Desai (jlowe: rev feaf139b4f327d33011e5a4424c06fb44c630955)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerStatusPBImpl.java
* hadoop-yarn-project/CHANGES.txt


 Resource Manager crashes with NPE due to lack of synchronization
 

 Key: YARN-2387
 URL: https://issues.apache.org/jira/browse/YARN-2387
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0, 2.5.0
Reporter: Mit Desai
Assignee: Mit Desai
Priority: Blocker
 Fix For: 2.6.0

 Attachments: YARN-2387.patch, YARN-2387.patch, YARN-2387.patch


 We recently came across a 0.23 RM crashing with an NPE. Here is the 
 stacktrace for it.
 {noformat}
 2014-08-06 05:56:52,165 [ResourceManager Event Processor] FATAL
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in
 handling event type NODE_UPDATE to the scheduler
 java.lang.NullPointerException
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToBuilder(ContainerStatusPBImpl.java:61)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToProto(ContainerStatusPBImpl.java:68)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:53)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:34)
 at
 org.apache.hadoop.yarn.api.records.ProtoBase.toString(ProtoBase.java:55)
 at java.lang.String.valueOf(String.java:2854)
 at java.lang.StringBuilder.append(StringBuilder.java:128)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl.toString(ContainerPBImpl.java:353)
 at java.lang.String.valueOf(String.java:2854)
 at java.lang.StringBuilder.append(StringBuilder.java:128)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1405)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:790)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:602)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:688)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:82)
 at
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:339)
 at java.lang.Thread.run(Thread.java:722)
 2014-08-06 05:56:52,166 [ResourceManager Event Processor] INFO
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
 {noformat}
 On investigating the issue, we found that ContainerStatusPBImpl has 
 methods that are called from different threads but are not synchronized; the 
 2.x code looks the same.
 We need to make these methods synchronized so that we do not encounter this 
 problem in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport

2014-09-30 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2594:
---
Target Version/s: 2.6.0

 Potential deadlock in RM when querying ApplicationResourceUsageReport
 -

 Key: YARN-2594
 URL: https://issues.apache.org/jira/browse/YARN-2594
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Karam Singh
Assignee: Wangda Tan
Priority: Blocker
 Attachments: YARN-2594.patch, YARN-2594.patch, YARN-2594.patch


 The ResourceManager sometimes becomes unresponsive.
 There was no exception in the ResourceManager log; it contains only the 
 following type of messages:
 {code}
 2014-09-19 19:13:45,241 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000
 2014-09-19 19:30:26,312 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000
 2014-09-19 19:47:07,351 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000
 2014-09-19 20:03:48,460 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000
 2014-09-19 20:20:29,542 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000
 2014-09-19 20:37:10,635 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000
 2014-09-19 20:53:51,722 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-09-30 Thread Craig Welch (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153982#comment-14153982
 ] 

Craig Welch commented on YARN-2198:
---

Re: pom.xml - maybe I'm just confused; I saw a reference to this in the pom and 
assumed it needed to be somewhere in the project. I see it builds fine, so I 
guess no worries there.

TestWinUtils - what I had in mind was mocking the native bit and having some 
tests for the proper behavior of the Java components under various conditions. 
I realize this won't test the native code, which is significant, but it will 
test the Java code against expected native behavior, and there's non-trivial 
Java code here; this strikes me as possible and worthwhile.

WindowsSecureContainerExecutor - understandable as a tactical approach, but I'm 
concerned about leaving it that way. Among other things, there is quite a lot 
more testing opportunity in the non-secure code paths, as they will be 
exercised much more frequently in testing (doubly so with reference to your 
comment above...). By having the non-secure and secure paths line up more 
closely, the secure path will end up being higher quality, as most of its code 
paths will see a good deal more use, exercise, and testing, especially when new 
functionality is added.  Also, changes going forward should require less effort 
if the Windows path is mostly shared between secure and unsecure execution.

 Remove the need to run NodeManager as privileged account for Windows Secure 
 Container Executor
 --

 Key: YARN-2198
 URL: https://issues.apache.org/jira/browse/YARN-2198
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Remus Rusanu
Assignee: Remus Rusanu
  Labels: security, windows
 Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, 
 YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, 
 YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, 
 YARN-2198.separation.patch, YARN-2198.trunk.10.patch, 
 YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, 
 YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch


 YARN-1972 introduces a Secure Windows Container Executor. However, this 
 executor requires the process launching the container to be LocalSystem or a 
 member of the local Administrators group. Since the process in question is 
 the NodeManager, the requirement translates to running the entire NM as a 
 privileged account, a very large surface area to review and protect.
 This proposal is to move the privileged operations into a dedicated NT 
 service. The NM can run as a low privilege account and communicate with the 
 privileged NT service when it needs to launch a container. This would reduce 
 the surface exposed to the high privileges. 
 There has to exist a secure, authenticated and authorized channel of 
 communication between the NM and the privileged NT service. Possible 
 alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
 be to use Windows LPC (Local Procedure Calls), which is a Windows platform 
 specific inter-process communication channel that satisfies all requirements 
 and is easy to deploy. The privileged NT service would register and listen on 
 an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop 
 with libwinutils which would host the LPC client code. The client would 
 connect to the LPC port (NtConnectPort) and send a message requesting a 
 container launch (NtRequestWaitReplyPort). LPC provides authentication and 
 the privileged NT service can use authorization API (AuthZ) to validate the 
 caller.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport

2014-09-30 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153981#comment-14153981
 ] 

Karthik Kambatla commented on YARN-2594:


Committing this. 

 Potential deadlock in RM when querying ApplicationResourceUsageReport
 -

 Key: YARN-2594
 URL: https://issues.apache.org/jira/browse/YARN-2594
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Karam Singh
Assignee: Wangda Tan
Priority: Blocker
 Attachments: YARN-2594.patch, YARN-2594.patch, YARN-2594.patch


 The ResourceManager sometimes becomes unresponsive.
 There was no exception in the ResourceManager log; it contains only the 
 following type of messages:
 {code}
 2014-09-19 19:13:45,241 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000
 2014-09-19 19:30:26,312 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000
 2014-09-19 19:47:07,351 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000
 2014-09-19 20:03:48,460 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000
 2014-09-19 20:20:29,542 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000
 2014-09-19 20:37:10,635 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000
 2014-09-19 20:53:51,722 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-09-30 Thread Craig Welch (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153983#comment-14153983
 ] 

Craig Welch commented on YARN-2198:
---

There are a number of changes which impact common multi-platform code; has this 
been tested on non-Windows (Linux) with security enabled, as well as on Windows?

It looks like this is only a 64-bit build now, where it used to be both 64- and 
32-bit. I assume this is intentional and OK?

It would be really nice if we could start to separate out some of this new 
functionality from winutils, e.g., make the elevated service functionality 
independent.  I see that there is a JIRA for doing so down the road, which is 
good... it looks like the elevated privileges are just around creating local 
directories and (obviously) spawning the process.  If a stand-alone service 
just created and set permissions on those directories, and the Java code simply 
checked for their existence and then moved on if they were present, I think 
that a lot of the back-and-forth of the elevation could be dropped to just one 
call to create the base directory and a second to spawn/hand back the output 
handles.  Is that correct?  

service.c

  // We're now transfering ownership of the duplicated handles to the caller
+  // If the RPC call fails *after* this point the handles are leaked inside 
the NM process

This is a little alarming.  Doesn't the close() call clean this up, regardless 
of success or failure?

Have we done any profiling to make sure we're not leaking threads, thread 
stacks, memory, etc., in at least the happy case (and preferably some unhappy 
cases also)?  I think we need to; there's a fair bit of additional native 
code, and running it for a while with a profiler could tell us quite a bit 
about whether or not we may be leaking something... 

Why is this conditional check different from all the others?
+  dwError = ValidateConfigurationFile();
+  if (dwError) {

Nit: spelling, "anonimous" should be "anonymous".

hadoop-common-project/hadoop-common/src/main/native/src/org_apache_hadoop.h
just a line added; please revert

ElevatedFileSystem

delete()
It appears that the tests for existence, etc., are run in a non-elevated way, 
while the actions are elevated.  Is it possible for permissions to be such that 
the non-elevated tests do not see files/directories which are present, for 
permission reasons? Should those checks not be elevated also?

streamReaderThread.run - using readLine() instead of following the simple 
buffer-copy idiom in ShellCommandExecutor has some efficiency issues. Granted, 
it looks to be reading memory-sized data, so it may be no big deal, but it 
would be nice to follow the buffer-copy pattern instead (sketched below).
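
For reference, the buffer-copy idiom being suggested looks like this (a generic 
sketch, not the patch code):
{code}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamCopy {
  /** Copy raw bytes in fixed-size chunks instead of line-buffering. */
  static void copy(InputStream in, OutputStream out) throws IOException {
    byte[] buf = new byte[8192];
    int n;
    while ((n = in.read(buf)) != -1) {
      out.write(buf, 0, n);
    }
  }
}
{code}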

ContainerExecutor

comment on comment:

On Windows the ContainerLaunch creates a temporary empty jar to workaround the 
CLASSPATH length

Not exactly: it looks like it creates a jar with a special manifest referencing 
the other jars. It would be helpful to explain that in the comment so it's 
clear what's going on.

ContainerLaunch

public void sanitizeEnv(...)

Can we please move the process of generating a new reference jar out of the 
sanitizeEnv method into its own method (called, conditionally, after 
sanitizeEnv)?  While there's a clear connection in that it's setting up the 
environment, it's building a new jar, so I think it is doing more than just 
manipulating variables; it belongs in a dedicated method, which can be called 
in call() after sanitizeEnv.  I believe this also means that Path 
nmPrivateClasspathJarDir can be pulled from the sanitizeEnv signature.

ContainerLocalizer

LOG.info(String.format("nRet: %d", nRet)); - not sure this should be at info 
level




 Remove the need to run NodeManager as privileged account for Windows Secure 
 Container Executor
 --

 Key: YARN-2198
 URL: https://issues.apache.org/jira/browse/YARN-2198
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Remus Rusanu
Assignee: Remus Rusanu
  Labels: security, windows
 Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, 
 YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, 
 YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, 
 YARN-2198.separation.patch, YARN-2198.trunk.10.patch, 
 YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, 
 YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch


 YARN-1972 introduces a Secure Windows Container Executor. However, this 
 executor requires the process launching the container to be LocalSystem or a 
 member of the local Administrators group. Since the process in question is 
 the NodeManager, the requirement translates to running the entire NM as a 
 privileged account, a very large surface area to review and protect.
 This proposal is 

[jira] [Commented] (YARN-2254) change TestRMWebServicesAppsModification to support FairScheduler.

2014-09-30 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154007#comment-14154007
 ] 

zhihai xu commented on YARN-2254:
-

I uploaded a new patch, YARN-2254.003.patch, which rebases onto the latest code base.

 change TestRMWebServicesAppsModification to support FairScheduler.
 --

 Key: YARN-2254
 URL: https://issues.apache.org/jira/browse/YARN-2254
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor
  Labels: test
 Attachments: YARN-2254.000.patch, YARN-2254.001.patch, 
 YARN-2254.002.patch, YARN-2254.003.patch


 TestRMWebServicesAppsModification skips the test if the scheduler is not 
 CapacityScheduler.
 Change TestRMWebServicesAppsModification to support both CapacityScheduler 
 and FairScheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2602) Generic History Service of TimelineServer sometimes not able to handle NPE

2014-09-30 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154011#comment-14154011
 ] 

Jian He commented on YARN-2602:
---

looks good to me.

 Generic History Service of TimelineServer sometimes not able to handle NPE
 --

 Key: YARN-2602
 URL: https://issues.apache.org/jira/browse/YARN-2602
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 2.6.0
 Environment: ATS is running with AHS/GHS enabled to use TimelineStore.
 Running for 4-5 days, with many random example jobs running
Reporter: Karam Singh
Assignee: Zhijie Shen
 Attachments: YARN-2602.1.patch


 ATS is running with AHS/GHS enabled to use TimelineStore.
 Running for 4-5 days, with many random example jobs running.
 When I ran the WS API for AHS/GHS:
 {code}
 curl --negotiate -u : 
 'http://TIMELINE_SERFVER_WEPBAPP_ADDR/v1/applicationhistory/apps/application_1411579118376_0001'
 {code}
 it ran successfully. However,
 {code}
 curl --negotiate -u : 
 'http://TIMELINE_SERFVER_WEPBAPP_ADDR/ws/v1/applicationhistory/apps'
 {"exception":"WebApplicationException","message":"java.lang.NullPointerException","javaClassName":"javax.ws.rs.WebApplicationException"}
 {code}
 failed with an internal server error (500).
 After looking at the TimelineServer logs, we found that there was an NPE:



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport

2014-09-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154018#comment-14154018
 ] 

Hudson commented on YARN-2594:
--

FAILURE: Integrated in Hadoop-trunk-Commit #6157 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6157/])
YARN-2594. Potential deadlock in RM when querying 
ApplicationResourceUsageReport. (Wangda Tan via kasha) (kasha: rev 
14d60dadc25b044a2887bf912ba5872367f2dffb)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
* hadoop-yarn-project/CHANGES.txt


 Potential deadlock in RM when querying ApplicationResourceUsageReport
 -

 Key: YARN-2594
 URL: https://issues.apache.org/jira/browse/YARN-2594
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Karam Singh
Assignee: Wangda Tan
Priority: Blocker
 Fix For: 2.6.0

 Attachments: YARN-2594.patch, YARN-2594.patch, YARN-2594.patch


 The ResourceManager sometimes becomes unresponsive.
 There was no exception in the ResourceManager log; it contains only the 
 following type of messages:
 {code}
 2014-09-19 19:13:45,241 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000
 2014-09-19 19:30:26,312 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000
 2014-09-19 19:47:07,351 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000
 2014-09-19 20:03:48,460 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000
 2014-09-19 20:20:29,542 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000
 2014-09-19 20:37:10,635 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000
 2014-09-19 20:53:51,722 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2180) In-memory backing store for cache manager

2014-09-30 Thread Chris Trezzo (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Trezzo updated YARN-2180:
---
Attachment: YARN-2180-trunk-v6.patch

[~kasha] [~vinodkv] [~sjlee0]

Attached is v6. Here are the major changes:
1. Moved the in-memory-implementation-specific logic that checks for initial 
apps from the cleaner service to the InMemorySCMStore. Also updated unit tests.
2. Got rid of InMemorySCMStoreConfiguration and added the settings back to 
YarnConfiguration with an in-memory store prefix.
3. Added configuration around the AppChecker implementation in the in-memory 
store.
4. Changed synchronization of initialApps to use a separate lock object.
5. Annotated classes with private/evolving.
6. Addressed various notes from Karthik.

One specific comment:
bq. For resources that are not in the store, isn't the access time trivially 
zero? I am okay with returning -1 for those cases, but will returning zero help 
at call sites?

I am going through and trying to verify whether everything would be OK 
returning an access time of 0 instead of -1. If I remember correctly, this 
covered a case around the SCM crashing and the uploader service on the node 
manager. I will jog my memory and come up with a better response. The only 
place this method is called is in the isResourceEvictable method.

 In-memory backing store for cache manager
 -

 Key: YARN-2180
 URL: https://issues.apache.org/jira/browse/YARN-2180
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Chris Trezzo
Assignee: Chris Trezzo
 Attachments: YARN-2180-trunk-v1.patch, YARN-2180-trunk-v2.patch, 
 YARN-2180-trunk-v3.patch, YARN-2180-trunk-v4.patch, YARN-2180-trunk-v5.patch, 
 YARN-2180-trunk-v6.patch


 Implement an in-memory backing store for the cache manager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected

2014-09-30 Thread Craig Welch (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154057#comment-14154057
 ] 

Craig Welch commented on YARN-1198:
---

Sorry, the above was off; the conversation happened offline. Here is the tweak 
to .7 that Jian suggested:

Hi Craig, I looked at your patch again. It's similar to what I thought. One 
thing is that now that headRoom is not application specific, it doesn't belong 
to the application any more. We may make it a member of LeafQueue#User. From 
CapacityScheduler#allocate, directly call LeafQueue#getAndCalculateHeadRoom, 
not going through the SchedulerApplicationAttempt route to get the headroom. I 
think this is simpler. Do you think this will work?

 We may make it a member of LeafQueue#User. To clarify: make the headRoom a 
 variable of LeafQueue#User, and remove it from SchedulerAttempt.

We might, in this approach, do what we are doing in .7 but without the 
HeadroomProvider at all... I'm going to give this a go...

 Capacity Scheduler headroom calculation does not work as expected
 -

 Key: YARN-1198
 URL: https://issues.apache.org/jira/browse/YARN-1198
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Craig Welch
 Attachments: YARN-1198.1.patch, YARN-1198.2.patch, YARN-1198.3.patch, 
 YARN-1198.4.patch, YARN-1198.5.patch, YARN-1198.6.patch, YARN-1198.7.patch, 
 YARN-1198.8.patch


 Today headroom calculation (for the app) takes place only when:
 * a new node is added to or removed from the cluster
 * a new container is assigned to the application.
 However, there are potentially a lot of situations which are not considered in 
 this calculation:
 * If a container finishes, then the headroom for that application will change 
 and the AM should be notified accordingly.
 * If a single user has submitted multiple applications (app1 and app2) to the 
 same queue, then:
 ** If app1's container finishes, then not only app1's but also app2's AM 
 should be notified about the change in headroom.
 ** Similarly, if a container is assigned to either application (app1/app2), 
 then both AMs should be notified about their headroom.
 ** To simplify the whole communication process, it is ideal to keep headroom 
 per user per LeafQueue, so that everyone gets the same picture (apps belonging 
 to the same user and submitted to the same queue).
 * If a new user submits an application to the queue, then all applications 
 submitted by all users in that queue should be notified of the headroom 
 change.
 * Also, today headroom is an absolute number (I think it should be normalized, 
 but that would not be backward compatible..).
 * Also, when an admin refreshes a queue, the headroom has to be updated.
 These are all potential bugs in the headroom calculation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2180) In-memory backing store for cache manager

2014-09-30 Thread Chris Trezzo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154060#comment-14154060
 ] 

Chris Trezzo commented on YARN-2180:


I also removed the clearCache() method from SCMStore and InMemorySCMStore.

 In-memory backing store for cache manager
 -

 Key: YARN-2180
 URL: https://issues.apache.org/jira/browse/YARN-2180
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Chris Trezzo
Assignee: Chris Trezzo
 Attachments: YARN-2180-trunk-v1.patch, YARN-2180-trunk-v2.patch, 
 YARN-2180-trunk-v3.patch, YARN-2180-trunk-v4.patch, YARN-2180-trunk-v5.patch, 
 YARN-2180-trunk-v6.patch


 Implement an in-memory backing store for the cache manager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-913) Add a way to register long-lived services in a YARN cluster

2014-09-30 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated YARN-913:

Attachment: YARN-913-014.patch

Updated patch:
# comments and renames the {{HOSTNAME_PATTERN}} field (for AW)
# the registryOperationsStatus record holds the shortname of the stat'd record, 
not the full path (for Sanjay)
# moves the {{listFull}} operation (list, then stat the children) out of the 
core {{RegistryOperations}} API and into {{RegistryUtils}}, as it is a utility 
action built from the lower-level operations; migrated to this across the 
codebase
# made that stat operation robust against child entries being deleted during 
the action (see the sketch below)
# same for the registry purge: there may be race conditions with overlapping 
delete operations... this is not an error
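
A sketch of that list-then-stat robustness point (hypothetical interfaces 
standing in for the registry API): a child deleted between the list and the 
stat is skipped rather than failing the whole listing.
{code}
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ListFullSketch {
  /** Stand-in for the registry operations used here. */
  interface Ops {
    List<String> list(String path) throws IOException;
    String stat(String path) throws IOException; // FNFE if the entry is gone
  }

  /** List children, stat each; tolerate children deleted mid-operation. */
  static List<String> listFull(Ops ops, String path) throws IOException {
    List<String> records = new ArrayList<String>();
    for (String child : ops.list(path)) {
      try {
        records.add(ops.stat(path + "/" + child));
      } catch (FileNotFoundException deletedMeanwhile) {
        // Raced with a delete: not an error, just skip the entry.
      }
    }
    return records;
  }
}
{code}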

 Add a way to register long-lived services in a YARN cluster
 ---

 Key: YARN-913
 URL: https://issues.apache.org/jira/browse/YARN-913
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api, resourcemanager
Affects Versions: 2.5.0, 2.4.1
Reporter: Steve Loughran
Assignee: Steve Loughran
 Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 
 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, 
 YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, 
 YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, 
 YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, 
 YARN-913-010.patch, YARN-913-011.patch, YARN-913-012.patch, 
 YARN-913-013.patch, YARN-913-014.patch, yarnregistry.pdf, yarnregistry.tla


 In a YARN cluster you can't predict where services will come up -or on what 
 ports. The services need to work those things out as they come up and then 
 publish them somewhere.
 Applications need to be able to find the service instance they are to bond to 
 -and not any others in the cluster.
 Some kind of service registry -in the RM, in ZK, could do this. If the RM 
 held the write access to the ZK nodes, it would be more secure than having 
 apps register with ZK themselves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport

2014-09-30 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154069#comment-14154069
 ] 

Wangda Tan commented on YARN-2594:
--

Thanks [~kasha], [~jianhe] and [~zxu] for review and commit!  

 Potential deadlock in RM when querying ApplicationResourceUsageReport
 -

 Key: YARN-2594
 URL: https://issues.apache.org/jira/browse/YARN-2594
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Karam Singh
Assignee: Wangda Tan
Priority: Blocker
 Fix For: 2.6.0

 Attachments: YARN-2594.patch, YARN-2594.patch, YARN-2594.patch


 The ResourceManager sometimes becomes unresponsive.
 There was no exception in the ResourceManager log; it contains only the 
 following type of messages:
 {code}
 2014-09-19 19:13:45,241 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000
 2014-09-19 19:30:26,312 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000
 2014-09-19 19:47:07,351 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000
 2014-09-19 20:03:48,460 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000
 2014-09-19 20:20:29,542 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000
 2014-09-19 20:37:10,635 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000
 2014-09-19 20:53:51,722 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2180) In-memory backing store for cache manager

2014-09-30 Thread Chris Trezzo (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Trezzo updated YARN-2180:
---
Attachment: (was: YARN-2180-trunk-v6.patch)

 In-memory backing store for cache manager
 -

 Key: YARN-2180
 URL: https://issues.apache.org/jira/browse/YARN-2180
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Chris Trezzo
Assignee: Chris Trezzo
 Attachments: YARN-2180-trunk-v1.patch, YARN-2180-trunk-v2.patch, 
 YARN-2180-trunk-v3.patch, YARN-2180-trunk-v4.patch, YARN-2180-trunk-v5.patch


 Implement an in-memory backing store for the cache manager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2180) In-memory backing store for cache manager

2014-09-30 Thread Chris Trezzo (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Trezzo updated YARN-2180:
---
Attachment: YARN-2180-trunk-v6.patch

Re-attached v6.

 In-memory backing store for cache manager
 -

 Key: YARN-2180
 URL: https://issues.apache.org/jira/browse/YARN-2180
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Chris Trezzo
Assignee: Chris Trezzo
 Attachments: YARN-2180-trunk-v1.patch, YARN-2180-trunk-v2.patch, 
 YARN-2180-trunk-v3.patch, YARN-2180-trunk-v4.patch, YARN-2180-trunk-v5.patch, 
 YARN-2180-trunk-v6.patch


 Implement an in-memory backing store for the cache manager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2602) Generic History Service of TimelineServer sometimes not able to handle NPE

2014-09-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154087#comment-14154087
 ] 

Hudson commented on YARN-2602:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #6158 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6158/])
YARN-2602. Fixed possible NPE in ApplicationHistoryManagerOnTimelineStore. 
Contributed by Zhijie Shen (jianhe: rev 
bbff96be48119774688981d04baf444639135977)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryManagerOnTimelineStore.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/TestSystemMetricsPublisher.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryManagerOnTimelineStore.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/SystemMetricsPublisher.java


 Generic History Service of TimelineServer sometimes not able to handle NPE
 --

 Key: YARN-2602
 URL: https://issues.apache.org/jira/browse/YARN-2602
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 2.6.0
 Environment: ATS is running with AHS/GHS enabled to use TimelineStore.
 Running for 4-5 days, with many random example jobs running
Reporter: Karam Singh
Assignee: Zhijie Shen
 Fix For: 2.6.0

 Attachments: YARN-2602.1.patch


 ATS is running with AHS/GHS enabled to use TimelineStore.
 Running for 4-5 days, with many random example jobs running.
 When I ran the WS API for AHS/GHS:
 {code}
 curl --negotiate -u : 
 'http://TIMELINE_SERVER_WEBAPP_ADDR/v1/applicationhistory/apps/application_1411579118376_0001'
 {code}
 It ran successfully.
 However
 {code}
 curl --negotiate -u : 
 'http://TIMELINE_SERVER_WEBAPP_ADDR/ws/v1/applicationhistory/apps'
 {"exception":"WebApplicationException","message":"java.lang.NullPointerException","javaClassName":"javax.ws.rs.WebApplicationException"}
 {code}
 It failed with an internal server error (500).
 Looking at the TimelineServer logs, I found that there was an NPE:



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails

2014-09-30 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154084#comment-14154084
 ] 

Vinod Kumar Vavilapalli commented on YARN-2578:
---

bq. Instead of fixing it everywhere, how about we fix this in RPC itself? In 
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488,
 instead of using 0 as the default value, the default could be looked up in the 
Configuration. No? 
+1. The default from conf is 1 min. Assuming it all boils down to the ping 
interval, we should fix it in common.
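
A minimal sketch of that idea, assuming a conf-based default; the key name and 
the 1-minute fallback below are illustrative, not the committed constant:

{code}
import org.apache.hadoop.conf.Configuration;

public class RpcTimeoutDefaultSketch {
  // Look the default rpcTimeout up in the Configuration instead of
  // hard-coding 0. The key "ipc.client.rpc-timeout.ms" and the 1-minute
  // fallback are assumptions for illustration only.
  public static int getRpcTimeout(Configuration conf) {
    return conf.getInt("ipc.client.rpc-timeout.ms", 60 * 1000);
  }
}
{code}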

 NM does not failover timely if RM node network connection fails
 ---

 Key: YARN-2578
 URL: https://issues.apache.org/jira/browse/YARN-2578
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.1
Reporter: Wilfred Spiegelenburg
 Attachments: YARN-2578.patch


 The NM does not fail over correctly when the network cable of the RM is 
 unplugged or the failure is simulated by a service network stop or a 
 firewall that drops all traffic on the node. The RM fails over to the standby 
 node when the failure is detected, as expected. The NM should then re-register 
 with the new active RM. This re-register takes a long time (15 minutes or 
 more). Until then the cluster has no nodes for processing and applications 
 are stuck.
 Reproduction test case which can be used in any environment:
 - create a cluster with 3 nodes
 node 1: ZK, NN, JN, ZKFC, DN, RM, NM
 node 2: ZK, NN, JN, ZKFC, DN, RM, NM
 node 3: ZK, JN, DN, NM
 - start all services make sure they are in good health
 - kill the network connection of the RM that is active using one of the 
 network kills from above
 - observe the NN and RM failover
 - the DN's fail over to the new active NN
 - the NM does not recover for a long time
 - the logs show a long delay and traces show no change at all
 The stack traces of the NM all show the same set of threads. The main thread 
 which should be used in the re-register is the Node Status Updater. This 
 thread is stuck in:
 {code}
 Node Status Updater prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in 
 Object.wait() [0x7f5a51fc1000]
java.lang.Thread.State: WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on 0xed62f488 (a org.apache.hadoop.ipc.Client$Call)
   at java.lang.Object.wait(Object.java:503)
   at org.apache.hadoop.ipc.Client.call(Client.java:1395)
   - locked 0xed62f488 (a org.apache.hadoop.ipc.Client$Call)
   at org.apache.hadoop.ipc.Client.call(Client.java:1362)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
   at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source)
   at 
 org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
 {code}
 The client connection which goes through the proxy can be traced back to the 
 ResourceTrackerPBClientImpl. The generated proxy does not time out and we 
 should be using a version which takes the RPC timeout (from the 
 configuration) as a parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2630) TestDistributedShell#testDSRestartWithPreviousRunningContainers fails

2014-09-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154098#comment-14154098
 ] 

Hadoop QA commented on YARN-2630:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12672165/YARN-2630.1.patch
  against trunk revision 14d60da.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler
  
org.apache.hadoop.yarn.server.resourcemanager.security.TestClientToAMTokens
  
org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer
  
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.TestRMAppAttemptTransitions

  The following test timeouts occurred in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5189//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5189//console

This message is automatically generated.

 TestDistributedShell#testDSRestartWithPreviousRunningContainers fails
 -

 Key: YARN-2630
 URL: https://issues.apache.org/jira/browse/YARN-2630
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2630.1.patch


 The problem is that after YARN-1372, in a work-preserving AM restart, the 
 re-launched AM will also receive the previously failed AM container, but the 
 DistributedShell logic is not expecting this extra completed container.
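 A hedged guess at the kind of guard such a fix needs; the class name and the 
 launchedContainers bookkeeping below are illustrative, not the actual patch:
 {code}
 import java.util.HashSet;
 import java.util.List;
 import java.util.Set;
 import org.apache.hadoop.yarn.api.records.ContainerId;
 import org.apache.hadoop.yarn.api.records.ContainerStatus;

 public class CompletedContainerGuardSketch {
   private final Set<ContainerId> launchedContainers = new HashSet<ContainerId>();

   // Skip completed containers this attempt never launched, such as the
   // previous attempt's failed AM container.
   public void onContainersCompleted(List<ContainerStatus> statuses) {
     for (ContainerStatus status : statuses) {
       if (!launchedContainers.contains(status.getContainerId())) {
         continue; // stray completion from a previous attempt
       }
       // ... normal DistributedShell accounting would go here ...
     }
   }
 }
 {code}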



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2632) Document NM Restart feature

2014-09-30 Thread Junping Du (JIRA)
Junping Du created YARN-2632:


 Summary: Document NM Restart feature
 Key: YARN-2632
 URL: https://issues.apache.org/jira/browse/YARN-2632
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Junping Du


As a new feature to YARN, we should document this feature's behavior, 
configuration, and things to pay attention to.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM

2014-09-30 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154128#comment-14154128
 ] 

Jian He commented on YARN-2617:
---

Thanks for updating!
- {{containerStatuses.add(status);}} is moved after the 
{{status.getContainerState() == ContainerState.COMPLETE}} check. In some cases 
(e.g. NM decommission), I think we still need to send the completed containers 
across so that the RM knows these containers have completed.
- We may not need to change {{getNMContainerStatuses}}, as this method is 
invoked only once on re-register. I'm afraid that not sending all the 
containers for recovery will hit some other race conditions.
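
A minimal sketch of what I mean for the first point; the names here are 
illustrative, not the actual NodeStatusUpdaterImpl code:

{code}
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import org.apache.hadoop.yarn.api.records.ContainerState;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

public class HeartbeatStatusSketch {
  // Report every status, including COMPLETE ones whose app is no longer
  // running, so the RM still learns about completions (e.g. on NM
  // decommission). Completed statuses are remembered so they can be
  // purged from the NM context once the RM acknowledges them.
  public static List<ContainerStatus> collect(
      Collection<ContainerStatus> current, List<ContainerStatus> pendingAck) {
    List<ContainerStatus> toSend = new ArrayList<ContainerStatus>();
    for (ContainerStatus status : current) {
      toSend.add(status);                 // add before the COMPLETE check
      if (status.getState() == ContainerState.COMPLETE) {
        pendingAck.add(status);           // drop only after the RM ack
      }
    }
    return toSend;
  }
}
{code}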

 NM does not need to send finished container whose APP is not running to RM
 --

 Key: YARN-2617
 URL: https://issues.apache.org/jira/browse/YARN-2617
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Jun Gong
Assignee: Jun Gong
 Fix For: 2.6.0

 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.patch


 We ([~chenchun]) were testing RM work-preserving restart and found the 
 following logs when we ran a simple MapReduce PI job. The NM continuously 
 reported completed containers whose application had already finished, even 
 after the AM had finished.
 {code}
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 {code}
 In the patch for YARN-1372, ApplicationImpl on the NM should guarantee to 
 clean up already completed applications. But it will only remove the appId 
 from 'app.context.getApplications()' when ApplicationImpl receives the event 
 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED'; however, the NM 
 might not receive this event for a long time, or might never receive it.
 * For NonAggregatingLogHandler, it waits for 
 YarnConfiguration.NM_LOG_RETAIN_SECONDS, which is 3 * 60 * 60 sec by default, 
 before the application logs are deleted and the event is sent.
 * For LogAggregationService, it might fail (e.g. if the user does not have 
 HDFS write permission), and then it will not send the event.
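 A hedged sketch of the first bullet's timing; the method and parameter names 
 are illustrative, as the real handler schedules an internal deleter task:
 {code}
 import java.util.concurrent.ScheduledExecutorService;
 import java.util.concurrent.TimeUnit;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.yarn.conf.YarnConfiguration;

 public class RetainDelaySketch {
   // Nothing is deleted, and no APPLICATION_LOG_HANDLING_FINISHED event is
   // sent, until the retention delay elapses (3 * 60 * 60 s by default).
   public static void scheduleDeletion(Configuration conf,
       ScheduledExecutorService sched, Runnable deleteAppLogsAndSendEvent) {
     long retainSecs = conf.getLong(YarnConfiguration.NM_LOG_RETAIN_SECONDS,
         YarnConfiguration.DEFAULT_NM_LOG_RETAIN_SECONDS);
     sched.schedule(deleteAppLogsAndSendEvent, retainSecs, TimeUnit.SECONDS);
   }
 }
 {code}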



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2629) Make distributed shell use the domain-based timeline ACLs

2014-09-30 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2629:
--
Attachment: YARN-2629.1.patch

Created a patch which enables the timeline domain feature for the distributed 
shell. The user can specify the domain ID, the readers, and the writers via 
options when submitting a DS job. The DS client will automatically create the 
domain before submitting the app to YARN. The AM will get the domain ID from 
the env and put all its entities into the domain with this ID.
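
To illustrate the client-side step, a minimal sketch assuming the 
TimelineDomain record and the putDomain API from YARN-2446; the option names 
in the comments are assumptions:

{code}
import org.apache.hadoop.yarn.api.records.timeline.TimelineDomain;

public class DomainSketch {
  // Build the domain the DS client would put to the timeline server
  // before submitting the app; the AM later reads the domain ID from env.
  public static TimelineDomain buildDomain(
      String domainId, String readers, String writers) {
    TimelineDomain domain = new TimelineDomain();
    domain.setId(domainId);      // e.g. from a hypothetical -domain option
    domain.setReaders(readers);  // e.g. "user1,user2 group1"
    domain.setWriters(writers);
    return domain;               // client would then call putDomain(domain)
  }
}
{code}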

 Make distributed shell use the domain-based timeline ACLs
 -

 Key: YARN-2629
 URL: https://issues.apache.org/jira/browse/YARN-2629
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: YARN-2629.1.patch


 To demonstrate the usage of this feature (YARN-2102), it's good to make 
 the distributed shell create the domain and post its timeline entities into 
 this private space.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2629) Make distributed shell use the domain-based timeline ACLs

2014-09-30 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154135#comment-14154135
 ] 

Zhijie Shen commented on YARN-2629:
---

The patch depends on the one on YARN-2446

 Make distributed shell use the domain-based timeline ACLs
 -

 Key: YARN-2629
 URL: https://issues.apache.org/jira/browse/YARN-2629
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: YARN-2629.1.patch


 To demonstrate the usage of this feature (YARN-2102), it's good to make 
 the distributed shell create the domain and post its timeline entities into 
 this private space.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster

2014-09-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154140#comment-14154140
 ] 

Hadoop QA commented on YARN-913:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12672203/YARN-913-014.patch
  against trunk revision a469833.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5191//console

This message is automatically generated.

 Add a way to register long-lived services in a YARN cluster
 ---

 Key: YARN-913
 URL: https://issues.apache.org/jira/browse/YARN-913
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api, resourcemanager
Affects Versions: 2.5.0, 2.4.1
Reporter: Steve Loughran
Assignee: Steve Loughran
 Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 
 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, 
 YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, 
 YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, 
 YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, 
 YARN-913-010.patch, YARN-913-011.patch, YARN-913-012.patch, 
 YARN-913-013.patch, YARN-913-014.patch, yarnregistry.pdf, yarnregistry.tla


 In a YARN cluster you can't predict where services will come up, or on what 
 ports. The services need to work those things out as they come up and then 
 publish them somewhere.
 Applications need to be able to find the service instance they are to bond 
 to, and not any others in the cluster.
 Some kind of service registry (in the RM, or in ZK) could do this. If the RM 
 held the write access to the ZK nodes, it would be more secure than having 
 apps register with ZK themselves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2630) TestDistributedShell#testDSRestartWithPreviousRunningContainers fails

2014-09-30 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2630:
--
Attachment: YARN-2630.2.patch

Fixed test failures.

 TestDistributedShell#testDSRestartWithPreviousRunningContainers fails
 -

 Key: YARN-2630
 URL: https://issues.apache.org/jira/browse/YARN-2630
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2630.1.patch, YARN-2630.2.patch


 The problem is that after YARN-1372, in a work-preserving AM restart, the 
 re-launched AM will also receive the previously failed AM container, but the 
 DistributedShell logic is not expecting this extra completed container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2254) change TestRMWebServicesAppsModification to support FairScheduler.

2014-09-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154173#comment-14154173
 ] 

Hadoop QA commented on YARN-2254:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12672188/YARN-2254.003.patch
  against trunk revision a4c9b80.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5190//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5190//console

This message is automatically generated.

 change TestRMWebServicesAppsModification to support FairScheduler.
 --

 Key: YARN-2254
 URL: https://issues.apache.org/jira/browse/YARN-2254
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor
  Labels: test
 Attachments: YARN-2254.000.patch, YARN-2254.001.patch, 
 YARN-2254.002.patch, YARN-2254.003.patch


 TestRMWebServicesAppsModification skips the test if the scheduler is not 
 CapacityScheduler.
 Change TestRMWebServicesAppsModification to support both CapacityScheduler 
 and FairScheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-09-30 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154211#comment-14154211
 ] 

Ming Ma commented on YARN-90:
-

Thanks, Varun, Jason. A couple of comments:

1. What if a dir transitions from the DISK_FULL state to the OTHER state? 
DirectoryCollection.checkDirs doesn't seem to update errorDirs and fullDirs 
properly. We could use a state machine for each dir and make sure each 
transition is covered.

2. The DISK_FULL state is counted toward the failed-disk threshold by 
LocalDirsHandlerService.areDisksHealthy; later the RM could mark the NM 
NODE_UNUSABLE. If we believe DISK_FULL is mostly a temporary issue, should we 
consider disks healthy if they only stay in DISK_FULL for a short period of 
time?

3. In AppLogAggregatorImpl.java, (Path[]) localAppLogDirs.toArray(new 
Path[localAppLogDirs.size()]): it seems the (Path[]) cast isn't necessary 
(see the sketch after this list).

4. What is the intention of numFailures? The method getNumFailures isn't used.

5. Nit: It is better to expand import java.util.*; in 
DirectoryCollection.java and LocalDirsHandlerService.java.
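
For point 3, a minimal sketch (the class and method names are illustrative); 
toArray(new Path[size]) is already typed, so the cast adds nothing:

{code}
import java.util.List;
import org.apache.hadoop.fs.Path;

public class ToArraySketch {
  // Equivalent to the existing call minus the redundant (Path[]) cast.
  public static Path[] toPathArray(List<Path> localAppLogDirs) {
    return localAppLogDirs.toArray(new Path[localAppLogDirs.size()]);
  }
}
{code}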

 NodeManager should identify failed disks becoming good back again
 -

 Key: YARN-90
 URL: https://issues.apache.org/jira/browse/YARN-90
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Ravi Gummadi
Assignee: Varun Vasudev
 Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
 YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, 
 apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, 
 apache-yarn-90.5.patch, apache-yarn-90.6.patch, apache-yarn-90.7.patch


 MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
 down, it is marked as failed forever. To reuse that disk (after it becomes 
 good), NodeManager needs a restart. This JIRA is to improve NodeManager to 
 reuse good disks (which could have been bad some time back).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2179) Initial cache manager structure and context

2014-09-30 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154222#comment-14154222
 ] 

Karthik Kambatla commented on YARN-2179:


Committing this.

 Initial cache manager structure and context
 ---

 Key: YARN-2179
 URL: https://issues.apache.org/jira/browse/YARN-2179
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Chris Trezzo
Assignee: Chris Trezzo
 Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v10.patch, 
 YARN-2179-trunk-v2.patch, YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, 
 YARN-2179-trunk-v5.patch, YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, 
 YARN-2179-trunk-v8.patch, YARN-2179-trunk-v9.patch


 Implement the initial shared cache manager structure and context. The 
 SCMContext will be used by a number of manager services (i.e. the backing 
 store and the cleaner service). The AppChecker is used to gather the 
 currently running applications on SCM startup (necessary for an SCM that is 
 backed by an in-memory store).
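 A hedged sketch of the AppChecker role described here; the committed class 
 may differ (for instance, it could be an abstract service rather than an 
 interface):
 {code}
 import java.util.Collection;
 import org.apache.hadoop.yarn.api.records.ApplicationId;
 import org.apache.hadoop.yarn.exceptions.YarnException;

 public interface AppCheckerSketch {
   // Whether the given app is still active in the cluster.
   boolean isApplicationActive(ApplicationId id) throws YarnException;

   // All currently active apps, gathered once on SCM startup to seed
   // an in-memory store.
   Collection<ApplicationId> getActiveApplications() throws YarnException;
 }
 {code}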



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1492) truly shared cache for jars (jobjar/libjar)

2014-09-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154239#comment-14154239
 ] 

Hudson commented on YARN-1492:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #6161 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6161/])
YARN-2179. [YARN-1492] Initial cache manager structure and context. (Chris 
Trezzo via kasha) (kasha: rev 17d1202c35a1992eab66ea05dfd2baf219a17aec)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/main/java/org/apache/hadoop/yarn/server/sharedcachemanager/RemoteAppChecker.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/test/java/org/apache/hadoop/yarn/server/sharedcachemanager/TestRemoteAppChecker.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/main/java/org/apache/hadoop/yarn/server/sharedcachemanager/AppChecker.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/pom.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/main/java/org/apache/hadoop/yarn/server/sharedcachemanager/SharedCacheManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/pom.xml
* hadoop-yarn-project/hadoop-yarn/bin/yarn
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/sharedcache/SharedCacheStructureUtil.java
* hadoop-yarn-project/CHANGES.txt


 truly shared cache for jars (jobjar/libjar)
 ---

 Key: YARN-1492
 URL: https://issues.apache.org/jira/browse/YARN-1492
 Project: Hadoop YARN
  Issue Type: New Feature
Affects Versions: 2.0.4-alpha
Reporter: Sangjin Lee
Assignee: Chris Trezzo
Priority: Critical
 Attachments: YARN-1492-all-trunk-v1.patch, 
 YARN-1492-all-trunk-v2.patch, YARN-1492-all-trunk-v3.patch, 
 YARN-1492-all-trunk-v4.patch, YARN-1492-all-trunk-v5.patch, 
 shared_cache_design.pdf, shared_cache_design_v2.pdf, 
 shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, 
 shared_cache_design_v5.pdf, shared_cache_design_v6.pdf


 Currently there is the distributed cache that enables you to cache jars and 
 files so that attempts from the same job can reuse them. However, sharing is 
 limited with the distributed cache because it is normally on a per-job basis. 
 On a large cluster, sometimes copying of jobjars and libjars becomes so 
 prevalent that it consumes a large portion of the network bandwidth, not to 
 speak of defeating the purpose of "bringing compute to where data is". This 
 is wasteful because in most cases code doesn't change much across many jobs.
 I'd like to propose and discuss the feasibility of introducing a truly shared 
 cache so that multiple jobs from multiple users can share and cache jars. 
 This JIRA is to open the discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2179) Initial cache manager structure and context

2014-09-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154240#comment-14154240
 ] 

Hudson commented on YARN-2179:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #6161 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6161/])
YARN-2179. [YARN-1492] Initial cache manager structure and context. (Chris 
Trezzo via kasha) (kasha: rev 17d1202c35a1992eab66ea05dfd2baf219a17aec)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/main/java/org/apache/hadoop/yarn/server/sharedcachemanager/RemoteAppChecker.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/test/java/org/apache/hadoop/yarn/server/sharedcachemanager/TestRemoteAppChecker.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/main/java/org/apache/hadoop/yarn/server/sharedcachemanager/AppChecker.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/pom.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/main/java/org/apache/hadoop/yarn/server/sharedcachemanager/SharedCacheManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/pom.xml
* hadoop-yarn-project/hadoop-yarn/bin/yarn
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/sharedcache/SharedCacheStructureUtil.java
* hadoop-yarn-project/CHANGES.txt


 Initial cache manager structure and context
 ---

 Key: YARN-2179
 URL: https://issues.apache.org/jira/browse/YARN-2179
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Chris Trezzo
Assignee: Chris Trezzo
 Fix For: 2.7.0

 Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v10.patch, 
 YARN-2179-trunk-v2.patch, YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, 
 YARN-2179-trunk-v5.patch, YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, 
 YARN-2179-trunk-v8.patch, YARN-2179-trunk-v9.patch


 Implement the initial shared cache manager structure and context. The 
 SCMContext will be used by a number of manager services (i.e. the backing 
 store and the cleaner service). The AppChecker is used to gather the 
 currently running applications on SCM startup (necessary for an SCM that is 
 backed by an in-memory store).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2630) TestDistributedShell#testDSRestartWithPreviousRunningContainers fails

2014-09-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154247#comment-14154247
 ] 

Hadoop QA commented on YARN-2630:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12672219/YARN-2630.2.patch
  against trunk revision 9e9e9cf.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5192//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5192//console

This message is automatically generated.

 TestDistributedShell#testDSRestartWithPreviousRunningContainers fails
 -

 Key: YARN-2630
 URL: https://issues.apache.org/jira/browse/YARN-2630
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2630.1.patch, YARN-2630.2.patch


 The problem is that after YARN-1372, in a work-preserving AM restart, the 
 re-launched AM will also receive the previously failed AM container, but the 
 DistributedShell logic is not expecting this extra completed container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2617) NM does not need to send finished container whose APP is not running to RM

2014-09-30 Thread Jun Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Gong updated YARN-2617:
---
Attachment: YARN-2617.4.patch

Updated the patch.

I am not sure whether I caught your point: send a completed container once, 
even if its corresponding application is stopped, then delete it from 
context.getContainers(). If so, we need to modify the corresponding test cases.
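
A small sketch of the "send once, then remove" flow I have in mind 
(illustrative names, not the actual patch):

{code}
import java.util.List;
import java.util.Map;
import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

public class SendOnceSketch {
  // Once the RM has acknowledged a heartbeat, remove the completed
  // containers reported in it so each completion is sent only once.
  public static void purgeAcked(Map<ContainerId, ?> nmContainers,
      List<ContainerStatus> ackedCompleted) {
    for (ContainerStatus status : ackedCompleted) {
      nmContainers.remove(status.getContainerId());
    }
  }
}
{code}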

 NM does not need to send finished container whose APP is not running to RM
 --

 Key: YARN-2617
 URL: https://issues.apache.org/jira/browse/YARN-2617
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Jun Gong
Assignee: Jun Gong
 Fix For: 2.6.0

 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, 
 YARN-2617.patch


 We ([~chenchun]) were testing RM work-preserving restart and found the 
 following logs when we ran a simple MapReduce PI job. The NM continuously 
 reported completed containers whose application had already finished, even 
 after the AM had finished.
 {code}
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 {code}
 In the patch for YARN-1372, ApplicationImpl on the NM should guarantee to 
 clean up already completed applications. But it will only remove the appId 
 from 'app.context.getApplications()' when ApplicationImpl receives the event 
 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED'; however, the NM 
 might not receive this event for a long time, or might never receive it.
 * For NonAggregatingLogHandler, it waits for 
 YarnConfiguration.NM_LOG_RETAIN_SECONDS, which is 3 * 60 * 60 sec by default, 
 before the application logs are deleted and the event is sent.
 * For LogAggregationService, it might fail (e.g. if the user does not have 
 HDFS write permission), and then it will not send the event.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM

2014-09-30 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154334#comment-14154334
 ] 

Jian He commented on YARN-2617:
---

bq. send a completed container once, even if its corresponding application is 
stopped, then delete it from context.getContainers()
Yep, because if we gracefully decommission a node, we also need to notify the 
RM that the containers running on this node have completed.
BTW, once you have uploaded a patch, you can click Submit Patch, which will 
trigger Jenkins to run the corresponding unit tests.

 NM does not need to send finished container whose APP is not running to RM
 --

 Key: YARN-2617
 URL: https://issues.apache.org/jira/browse/YARN-2617
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Jun Gong
Assignee: Jun Gong
 Fix For: 2.6.0

 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, 
 YARN-2617.patch


 We ([~chenchun]) were testing RM work-preserving restart and found the 
 following logs when we ran a simple MapReduce PI job. The NM continuously 
 reported completed containers whose application had already finished, even 
 after the AM had finished.
 {code}
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 {code}
 In the patch for YARN-1372, ApplicationImpl on the NM should guarantee to 
 clean up already completed applications. But it will only remove the appId 
 from 'app.context.getApplications()' when ApplicationImpl receives the event 
 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED'; however, the NM 
 might not receive this event for a long time, or might never receive it.
 * For NonAggregatingLogHandler, it waits for 
 YarnConfiguration.NM_LOG_RETAIN_SECONDS, which is 3 * 60 * 60 sec by default, 
 before the application logs are deleted and the event is sent.
 * For LogAggregationService, it might fail (e.g. if the user does not have 
 HDFS write permission), and then it will not send the event.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM

2014-09-30 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154337#comment-14154337
 ] 

Jun Gong commented on YARN-2617:


Get it. Thank you!

 NM does not need to send finished container whose APP is not running to RM
 --

 Key: YARN-2617
 URL: https://issues.apache.org/jira/browse/YARN-2617
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Jun Gong
Assignee: Jun Gong
 Fix For: 2.6.0

 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, 
 YARN-2617.patch


 We ([~chenchun]) were testing RM work-preserving restart and found the 
 following logs when we ran a simple MapReduce PI job. The NM continuously 
 reported completed containers whose application had already finished, even 
 after the AM had finished.
 {code}
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 {code}
 In the patch for YARN-1372, ApplicationImpl on the NM should guarantee to 
 clean up already completed applications. But it will only remove the appId 
 from 'app.context.getApplications()' when ApplicationImpl receives the event 
 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED'; however, the NM 
 might not receive this event for a long time, or might never receive it.
 * For NonAggregatingLogHandler, it waits for 
 YarnConfiguration.NM_LOG_RETAIN_SECONDS, which is 3 * 60 * 60 sec by default, 
 before the application logs are deleted and the event is sent.
 * For LogAggregationService, it might fail (e.g. if the user does not have 
 HDFS write permission), and then it will not send the event.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM

2014-09-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154369#comment-14154369
 ] 

Hadoop QA commented on YARN-2617:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12672239/YARN-2617.4.patch
  against trunk revision 17d1202.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5193//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5193//console

This message is automatically generated.

 NM does not need to send finished container whose APP is not running to RM
 --

 Key: YARN-2617
 URL: https://issues.apache.org/jira/browse/YARN-2617
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Jun Gong
Assignee: Jun Gong
 Fix For: 2.6.0

 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, 
 YARN-2617.patch


 We ([~chenchun]) were testing RM work-preserving restart and found the 
 following logs when we ran a simple MapReduce PI job. The NM continuously 
 reported completed containers whose application had already finished, even 
 after the AM had finished.
 {code}
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 {code}
 In the patch for YARN-1372, ApplicationImpl on the NM should guarantee to 
 clean up already completed applications. But it will only remove the appId 
 from 'app.context.getApplications()' when ApplicationImpl receives the event 
 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED'; however, the NM 
 might not receive this event for a long time, or might never receive it.
 * For NonAggregatingLogHandler, it waits for 
 YarnConfiguration.NM_LOG_RETAIN_SECONDS, which is 3 * 60 * 60 sec by default, 
 before the application logs are deleted and the event is sent.
 * For LogAggregationService, it might fail (e.g. if the user does not have 
 HDFS write permission), and then it will not send the event.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

