[jira] [Updated] (YARN-1885) RM may not send the finished signal to some nodes where the application ran after RM restarts

2014-04-29 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-1885:
-

Attachment: YARN-1885.patch

Attached a new patch addressing the discussions above:
1) Included integration tests
2) Removed ContainerAcquiredEvent in RMAppAttempt
3) Added NodeAddedEvent in RMApp

 RM may not send the finished signal to some nodes where the application ran 
 after RM restarts
 -

 Key: YARN-1885
 URL: https://issues.apache.org/jira/browse/YARN-1885
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Wangda Tan
 Attachments: YARN-1885.patch, YARN-1885.patch, YARN-1885.patch


 During our HA testing we have seen cases where YARN application logs are not
 available through the CLI, but I can look at AM logs through the UI. The RM was
 also being restarted in the background while the application was running.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins

2014-04-29 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984151#comment-13984151
 ] 

Devaraj K commented on YARN-1408:
-

bq. So in some race conditions, it is possible that a container can get KILLED
by preemption even before it reaches the RUNNING state.
This scenario can be avoided if we skip such containers, which didn't reach
the RUNNING state, during preemption.
Maybe in the following cycles this container will reach the RUNNING state and
can then be considered for preemption.

I think we don't need to wait for the container to move to the RUNNING state
before preempting it if it is eligible. If the container is eligible for
preemption, its resources can be released in the current preemption cycle
instead of waiting for the next cycle for the container to reach RUNNING;
that avoids the waste of launching the container only to kill it.
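
A minimal sketch of the quoted "skip containers that have not reached RUNNING" idea, for readers following the thread (illustrative only: RMContainer and RMContainerState are real YARN classes, but the candidate list and the preempt() helper are hypothetical stand-ins, and this is not code from any attached patch):
{code}
import java.util.List;
import org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainer;
import org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerState;

class SkipNonRunningSketch {
  // Only consider RUNNING containers for preemption; containers that have not
  // reached RUNNING are left for a later cycle (the suggestion quoted above,
  // which the comment argues against).
  void preemptRunningOnly(List<RMContainer> candidates) {
    for (RMContainer container : candidates) {
      if (container.getState() != RMContainerState.RUNNING) {
        continue; // may reach RUNNING and be reconsidered in the next cycle
      }
      preempt(container);
    }
  }

  // Hypothetical stand-in for however the policy actually issues a preemption.
  void preempt(RMContainer container) {
  }
}
{code}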

 Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task 
 timeout for 30mins
 --

 Key: YARN-1408
 URL: https://issues.apache.org/jira/browse/YARN-1408
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.2.0
Reporter: Sunil G
 Fix For: 2.5.0

 Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, 
 Yarn-1408.4.patch, Yarn-1408.patch


 Capacity preemption is enabled as follows:
  * yarn.resourcemanager.scheduler.monitor.enable = true
  * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
 Queue = a,b
 Capacity of Queue A = 80%
 Capacity of Queue B = 20%
 Step 1: Assign a big jobA to queue a, which uses the full cluster capacity.
 Step 2: Submit a jobB to queue b, which would use less than 20% of the cluster
 capacity.
 A jobA task that uses queue b's capacity is preempted and killed.
 This caused the following problem:
 1. A new container was allocated for jobA in Queue A as per a node update
 from an NM.
 2. This container was preempted immediately by the preemption policy.
 The "ACQUIRED at KILLED" invalid-state exception occurred when the next AM
 heartbeat reached the RM:
 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
 Can't handle this event at current state
 org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
 ACQUIRED at KILLED
 This also caused the task to time out after 30 minutes, as this container
 was already killed by preemption:
 attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins

2014-04-29 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984152#comment-13984152
 ] 

Devaraj K commented on YARN-1408:
-

Correction to the above comment.

Sorry for the delay Sunil.

{quote} So in some race conditions, it is possible that a container can get
KILLED by preemption even before it reaches the RUNNING state.
This scenario can be avoided if we skip such containers, which didn't reach
the RUNNING state, during preemption.
Maybe in the following cycles this container will reach the RUNNING state and
can then be considered for preemption.
{quote}
I think we don't need to wait for the container to move to the RUNNING state
before preempting it if it is eligible. If the container is eligible for
preemption, its resources can be released in the current preemption cycle
instead of waiting for the next cycle for the container to reach RUNNING;
that avoids the waste of launching the container only to kill it.

 Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task 
 timeout for 30mins
 --

 Key: YARN-1408
 URL: https://issues.apache.org/jira/browse/YARN-1408
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.2.0
Reporter: Sunil G
 Fix For: 2.5.0

 Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, 
 Yarn-1408.4.patch, Yarn-1408.patch


 Capacity preemption is enabled as follows:
  * yarn.resourcemanager.scheduler.monitor.enable = true
  * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
 Queue = a,b
 Capacity of Queue A = 80%
 Capacity of Queue B = 20%
 Step 1: Assign a big jobA to queue a, which uses the full cluster capacity.
 Step 2: Submit a jobB to queue b, which would use less than 20% of the cluster
 capacity.
 A jobA task that uses queue b's capacity is preempted and killed.
 This caused the following problem:
 1. A new container was allocated for jobA in Queue A as per a node update
 from an NM.
 2. This container was preempted immediately by the preemption policy.
 The "ACQUIRED at KILLED" invalid-state exception occurred when the next AM
 heartbeat reached the RM:
 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
 Can't handle this event at current state
 org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
 ACQUIRED at KILLED
 This also caused the task to time out after 30 minutes, as this container
 was already killed by preemption:
 attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-1996) Provide alternative policies for UNHEALTHY nodes.

2014-04-29 Thread Gera Shegalov (JIRA)
Gera Shegalov created YARN-1996:
---

 Summary: Provide alternative policies for UNHEALTHY nodes.
 Key: YARN-1996
 URL: https://issues.apache.org/jira/browse/YARN-1996
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, scheduler
Affects Versions: 2.4.0
Reporter: Gera Shegalov
Assignee: Gera Shegalov


Currently, UNHEALTHY nodes can significantly prolong the execution of large,
expensive jobs, as demonstrated by MAPREDUCE-5817, and degrade cluster
health even further due to [positive
feedback|http://en.wikipedia.org/wiki/Positive_feedback]. The set of containers
that may have made the node unhealthy in the first place starts spreading across
the cluster, because the node is declared unusable and all of its
containers are killed and rescheduled on different nodes.

To mitigate this, we are experimenting with a patch that allows containers
already running on a node that turns UNHEALTHY to complete (drain), while no new
containers can be assigned to it until it turns healthy again.

This mechanism can also be used for graceful decommissioning of an NM. To this
end, we have to write a health script that can deterministically
report UNHEALTHY. For example:
{code}
if [ -e "$1" ]; then
  echo "ERROR Node decommissioning via health script hack"
fi
{code}

In the current version of the patch, the behavior is controlled by a boolean
property {{yarn.nodemanager.unheathy.drain.containers}}. More versatile policies
are possible in future work. Currently, the health state of a node is determined
in a binary fashion based on the disk checker and the health script's ERROR
output. However, we could also interpret health-script output similarly to Java
logging levels (one of which is ERROR), such as WARN and FATAL. Each level could
then be treated differently, e.g.,
- FATAL: unusable, like today
- ERROR: drain
- WARN: halve the node capacity,
complemented with equivalence rules such as 3 WARN messages == ERROR,
2*ERROR == FATAL, etc.
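
As a rough illustration of how a NodeManager component might consult the proposed flag, here is a sketch that only assumes the property name quoted above; the default value and the surrounding logic are assumptions, not the attached patch:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class DrainFlagSketch {
  public static void main(String[] args) {
    // Read the proposed switch the same way other yarn-site.xml flags are read.
    Configuration conf = new YarnConfiguration();
    boolean drainOnUnhealthy =
        conf.getBoolean("yarn.nodemanager.unheathy.drain.containers", false);
    if (drainOnUnhealthy) {
      System.out.println("UNHEALTHY: drain running containers, assign no new ones");
    } else {
      System.out.println("UNHEALTHY: mark the node unusable (current behavior)");
    }
  }
}
{code}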









--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1996) Provide alternative policies for UNHEALTHY nodes.

2014-04-29 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated YARN-1996:


Attachment: YARN-1996.v01.patch

 Provide alternative policies for UNHEALTHY nodes.
 -

 Key: YARN-1996
 URL: https://issues.apache.org/jira/browse/YARN-1996
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, scheduler
Affects Versions: 2.4.0
Reporter: Gera Shegalov
Assignee: Gera Shegalov
 Attachments: YARN-1996.v01.patch


 Currently, UNHEALTHY nodes can significantly prolong the execution of large,
 expensive jobs, as demonstrated by MAPREDUCE-5817, and degrade cluster
 health even further due to [positive
 feedback|http://en.wikipedia.org/wiki/Positive_feedback]. The set of containers
 that may have made the node unhealthy in the first place starts spreading
 across the cluster, because the node is declared unusable and all of its
 containers are killed and rescheduled on different nodes.
 To mitigate this, we are experimenting with a patch that allows containers
 already running on a node that turns UNHEALTHY to complete (drain), while no
 new containers can be assigned to it until it turns healthy again.
 This mechanism can also be used for graceful decommissioning of an NM. To this
 end, we have to write a health script that can deterministically
 report UNHEALTHY. For example:
 {code}
 if [ -e "$1" ]; then
   echo "ERROR Node decommissioning via health script hack"
 fi
 {code}
 In the current version of the patch, the behavior is controlled by a boolean
 property {{yarn.nodemanager.unheathy.drain.containers}}. More versatile
 policies are possible in future work. Currently, the health state of a node is
 determined in a binary fashion based on the disk checker and the health
 script's ERROR output. However, we could also interpret health-script output
 similarly to Java logging levels (one of which is ERROR), such as WARN and
 FATAL. Each level could then be treated differently, e.g.,
 - FATAL: unusable, like today
 - ERROR: drain
 - WARN: halve the node capacity,
 complemented with equivalence rules such as 3 WARN messages == ERROR,
 2*ERROR == FATAL, etc.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1996) Provide alternative policies for UNHEALTHY nodes.

2014-04-29 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984167#comment-13984167
 ] 

Steve Loughran commented on YARN-1996:
--

This sounds good, not just as failure handling but for cluster management.

This may be a duplicate of YARN-914 (graceful decommission of NM) and/or
YARN-671.

For long-lived services:
# it'd be nice to have a notification from the NM to the AM that it's draining
and that they should react: YARN-1394
# the drain process must have a (configurable?) timeout, then kill all
outstanding containers, without adding them as any kind of failure (i.e. the
container loss event from the NM to the AM should indicate this)
# the AM itself needs to receive a "your own node is being drained" event and do
any best-effort pre-restart operations (e.g. transition to passive), and the RM
should not count the AM termination/restart as an AM failure


 Provide alternative policies for UNHEALTHY nodes.
 -

 Key: YARN-1996
 URL: https://issues.apache.org/jira/browse/YARN-1996
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, scheduler
Affects Versions: 2.4.0
Reporter: Gera Shegalov
Assignee: Gera Shegalov
 Attachments: YARN-1996.v01.patch


 Currently, UNHEALTHY nodes can significantly prolong the execution of large,
 expensive jobs, as demonstrated by MAPREDUCE-5817, and degrade cluster
 health even further due to [positive
 feedback|http://en.wikipedia.org/wiki/Positive_feedback]. The set of containers
 that may have made the node unhealthy in the first place starts spreading
 across the cluster, because the node is declared unusable and all of its
 containers are killed and rescheduled on different nodes.
 To mitigate this, we are experimenting with a patch that allows containers
 already running on a node that turns UNHEALTHY to complete (drain), while no
 new containers can be assigned to it until it turns healthy again.
 This mechanism can also be used for graceful decommissioning of an NM. To this
 end, we have to write a health script that can deterministically
 report UNHEALTHY. For example:
 {code}
 if [ -e "$1" ]; then
   echo "ERROR Node decommissioning via health script hack"
 fi
 {code}
 In the current version of the patch, the behavior is controlled by a boolean
 property {{yarn.nodemanager.unheathy.drain.containers}}. More versatile
 policies are possible in future work. Currently, the health state of a node is
 determined in a binary fashion based on the disk checker and the health
 script's ERROR output. However, we could also interpret health-script output
 similarly to Java logging levels (one of which is ERROR), such as WARN and
 FATAL. Each level could then be treated differently, e.g.,
 - FATAL: unusable, like today
 - ERROR: drain
 - WARN: halve the node capacity,
 complemented with equivalence rules such as 3 WARN messages == ERROR,
 2*ERROR == FATAL, etc.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1980) Possible NPE in KillAMPreemptionPolicy related to ProportionalCapacityPreemptionPolicy

2014-04-29 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984188#comment-13984188
 ] 

Devaraj K commented on YARN-1980:
-

The changes involved are in the MapReduce project; moving this to MAPREDUCE.

 Possible NPE in KillAMPreemptionPolicy related to 
 ProportionalCapacityPreemptionPolicy
 --

 Key: YARN-1980
 URL: https://issues.apache.org/jira/browse/YARN-1980
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.3.0
Reporter: Sunil G
 Attachments: Yarn-1980.1.patch


 I configured KillAMPreemptionPolicy for my Application Master and tried to
 check preemption of queues.
 In one scenario I saw the NPE below in my AM:
 2014-04-24 15:11:08,860 ERROR [RMCommunicator Allocator] 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN 
 CONTACTING RM. 
 java.lang.NullPointerException
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.preemption.KillAMPreemptionPolicy.preempt(KillAMPreemptionPolicy.java:57)
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:662)
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:246)
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:267)
   at java.lang.Thread.run(Thread.java:662)
 I was using 2.2.0 and merged MAPREDUCE-5189 to see how AM preemption works.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-1997) input split size

2014-04-29 Thread rrim (JIRA)
rrim created YARN-1997:
--

 Summary: input split size
 Key: YARN-1997
 URL: https://issues.apache.org/jira/browse/YARN-1997
 Project: Hadoop YARN
  Issue Type: Test
  Components: api
Affects Versions: 2.2.0
Reporter: rrim


Hi,
I am using Hadoop 2.2 and don't know how to set the max input split size. I
would like to decrease this value in order to create more mappers. I tried
updating yarn-site.xml, but it does not work.

Indeed, Hadoop 2.2 / YARN does not pick up any of the following settings:

<property>
  <name>mapreduce.input.fileinputformat.split.minsize</name>
  <value>1</value>
</property>
<property>
  <name>mapreduce.input.fileinputformat.split.maxsize</name>
  <value>16777216</value>
</property>

<property>
  <name>mapred.min.split.size</name>
  <value>1</value>
</property>
<property>
  <name>mapred.max.split.size</name>
  <value>16777216</value>
</property>
best,
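
For context, these four properties are MapReduce job-level settings (mapred-site.xml or the job's own configuration), not yarn-site.xml settings. A minimal sketch of setting the split bounds programmatically with the new mapreduce API (the job name below is arbitrary):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "more-mappers");
    // These set mapreduce.input.fileinputformat.split.minsize / split.maxsize,
    // which are read by the job's FileInputFormat when computing splits.
    FileInputFormat.setMinInputSplitSize(job, 1L);
    FileInputFormat.setMaxInputSplitSize(job, 16L * 1024 * 1024); // 16 MB
  }
}
{code}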



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1982) Rename the daemon name to timelineserver

2014-04-29 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984268#comment-13984268
 ] 

Junping Du commented on YARN-1982:
--

+1. Patch looks good to me.

 Rename the daemon name to timelineserver
 

 Key: YARN-1982
 URL: https://issues.apache.org/jira/browse/YARN-1982
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0, 2.4.0
Reporter: Zhijie Shen
Assignee: Zhijie Shen
  Labels: cli
 Attachments: YARN-1982.1.patch


 Nowadays, it's confusing that we call the new component the timeline server, but
 we use
 {code}
 yarn historyserver
 yarn-daemon.sh start historyserver
 {code}
 to start the daemon.
 Before the confusion propagates further, we'd better modify the command
 line ASAP.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1912) ResourceLocalizer started without any jvm memory control

2014-04-29 Thread Masatake Iwasaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Masatake Iwasaki updated YARN-1912:
---

Attachment: YARN-1912-1.patch

Updated the patch based on the findbugs warnings.

 ResourceLocalizer started without any jvm memory control
 

 Key: YARN-1912
 URL: https://issues.apache.org/jira/browse/YARN-1912
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.2.0
Reporter: stanley shi
 Attachments: YARN-1912-0.patch, YARN-1912-1.patch


 LinuxContainerExecutor.java#startLocalizer does not specify any -Xmx setting in
 the command it builds, so the ResourceLocalizer is started with the default JVM
 memory settings.
 On server-class hardware, the JVM will use 25% of the system memory as the max
 heap size, which can cause memory issues in some cases.
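
Purely as an illustration of the point (not the attached patch), the localizer command could cap the heap explicitly instead of inheriting the JVM default; the heapMb parameter below is a hypothetical configuration value:
{code}
import java.util.ArrayList;
import java.util.List;

class LocalizerCommandSketch {
  // Build the localizer command with an explicit -Xmx so it does not fall back
  // to the JVM default (~25% of physical memory on server-class machines).
  static List<String> buildCommand(int heapMb) {
    List<String> command = new ArrayList<String>();
    command.add("java");
    command.add("-Xmx" + heapMb + "m");
    command.add("org.apache.hadoop.yarn.server.nodemanager"
        + ".containermanager.localizer.ContainerLocalizer");
    return command;
  }
}
{code}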



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1987) Wrapper for leveldb DBIterator to aid in handling database exceptions

2014-04-29 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-1987:
-

Attachment: YARN-1987.patch

 Wrapper for leveldb DBIterator to aid in handling database exceptions
 -

 Key: YARN-1987
 URL: https://issues.apache.org/jira/browse/YARN-1987
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.4.0
Reporter: Jason Lowe
Assignee: Jason Lowe
 Attachments: YARN-1987.patch


 Per discussions in YARN-1984 and MAPREDUCE-5652, it would be nice to have a 
 utility wrapper around leveldb's DBIterator to translate the raw 
 RuntimeExceptions it can throw into DBExceptions to make it easier to handle 
 database errors while iterating.
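
A hedged sketch of such a wrapper (the class and method shapes here are assumptions, not the attached patch; org.iq80.leveldb.DBIterator and DBException are the real leveldb types):
{code}
import java.util.Map;
import org.iq80.leveldb.DBException;
import org.iq80.leveldb.DBIterator;

class LeveldbIteratorSketch {
  private final DBIterator iter;

  LeveldbIteratorSketch(DBIterator iter) {
    this.iter = iter;
  }

  // Translate any raw RuntimeException from leveldb into DBException so callers
  // can handle a single database-specific exception type while iterating.
  public boolean hasNext() {
    try {
      return iter.hasNext();
    } catch (DBException e) {
      throw e;
    } catch (RuntimeException e) {
      throw new DBException(e.getMessage(), e);
    }
  }

  public Map.Entry<byte[], byte[]> next() {
    try {
      return iter.next();
    } catch (DBException e) {
      throw e;
    } catch (RuntimeException e) {
      throw new DBException(e.getMessage(), e);
    }
  }
}
{code}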



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1996) Provide alternative policies for UNHEALTHY nodes.

2014-04-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984432#comment-13984432
 ] 

Hadoop QA commented on YARN-1996:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12642433/YARN-1996.v01.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3654//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3654//console

This message is automatically generated.

 Provide alternative policies for UNHEALTHY nodes.
 -

 Key: YARN-1996
 URL: https://issues.apache.org/jira/browse/YARN-1996
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, scheduler
Affects Versions: 2.4.0
Reporter: Gera Shegalov
Assignee: Gera Shegalov
 Attachments: YARN-1996.v01.patch


 Currently, UNHEALTHY nodes can significantly prolong the execution of large,
 expensive jobs, as demonstrated by MAPREDUCE-5817, and degrade cluster
 health even further due to [positive
 feedback|http://en.wikipedia.org/wiki/Positive_feedback]. The set of containers
 that may have made the node unhealthy in the first place starts spreading
 across the cluster, because the node is declared unusable and all of its
 containers are killed and rescheduled on different nodes.
 To mitigate this, we are experimenting with a patch that allows containers
 already running on a node that turns UNHEALTHY to complete (drain), while no
 new containers can be assigned to it until it turns healthy again.
 This mechanism can also be used for graceful decommissioning of an NM. To this
 end, we have to write a health script that can deterministically
 report UNHEALTHY. For example:
 {code}
 if [ -e "$1" ]; then
   echo "ERROR Node decommissioning via health script hack"
 fi
 {code}
 In the current version of the patch, the behavior is controlled by a boolean
 property {{yarn.nodemanager.unheathy.drain.containers}}. More versatile
 policies are possible in future work. Currently, the health state of a node is
 determined in a binary fashion based on the disk checker and the health
 script's ERROR output. However, we could also interpret health-script output
 similarly to Java logging levels (one of which is ERROR), such as WARN and
 FATAL. Each level could then be treated differently, e.g.,
 - FATAL: unusable, like today
 - ERROR: drain
 - WARN: halve the node capacity,
 complemented with equivalence rules such as 3 WARN messages == ERROR,
 2*ERROR == FATAL, etc.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1987) Wrapper for leveldb DBIterator to aid in handling database exceptions

2014-04-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984490#comment-13984490
 ] 

Hadoop QA commented on YARN-1987:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12642479/YARN-1987.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3657//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3657//console

This message is automatically generated.

 Wrapper for leveldb DBIterator to aid in handling database exceptions
 -

 Key: YARN-1987
 URL: https://issues.apache.org/jira/browse/YARN-1987
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.4.0
Reporter: Jason Lowe
Assignee: Jason Lowe
 Attachments: YARN-1987.patch


 Per discussions in YARN-1984 and MAPREDUCE-5652, it would be nice to have a 
 utility wrapper around leveldb's DBIterator to translate the raw 
 RuntimeExceptions it can throw into DBExceptions to make it easier to handle 
 database errors while iterating.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1912) ResourceLocalizer started without any jvm memory control

2014-04-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984485#comment-13984485
 ] 

Hadoop QA commented on YARN-1912:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12642478/YARN-1912-1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3656//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3656//console

This message is automatically generated.

 ResourceLocalizer started without any jvm memory control
 

 Key: YARN-1912
 URL: https://issues.apache.org/jira/browse/YARN-1912
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.2.0
Reporter: stanley shi
 Attachments: YARN-1912-0.patch, YARN-1912-1.patch


 LinuxContainerExecutor.java#startLocalizer does not specify any -Xmx setting in
 the command it builds, so the ResourceLocalizer is started with the default JVM
 memory settings.
 On server-class hardware, the JVM will use 25% of the system memory as the max
 heap size, which can cause memory issues in some cases.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-738) TestClientRMTokens is failing irregularly while running all yarn tests

2014-04-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984587#comment-13984587
 ] 

Hudson commented on YARN-738:
-

SUCCESS: Integrated in Hadoop-trunk-Commit #5584 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5584/])
YARN-738. TestClientRMTokens is failing irregularly while running all yarn 
tests. Contributed by Ming Ma (jlowe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1591030)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestClientRMTokens.java


 TestClientRMTokens is failing irregularly while running all yarn tests
 --

 Key: YARN-738
 URL: https://issues.apache.org/jira/browse/YARN-738
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Ming Ma
 Fix For: 3.0.0, 2.5.0

 Attachments: YARN-738.patch


 Running org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens
 Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 16.787 sec
 <<< FAILURE!
 testShortCircuitRenewCancel(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens)
   Time elapsed: 186 sec  <<< ERROR!
 java.lang.RuntimeException: getProxy
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens$YarnBadRPC.getProxy(TestClientRMTokens.java:334)
   at 
 org.apache.hadoop.yarn.security.client.RMDelegationTokenIdentifier$Renewer.getRmClient(RMDelegationTokenIdentifier.java:157)
   at 
 org.apache.hadoop.yarn.security.client.RMDelegationTokenIdentifier$Renewer.renew(RMDelegationTokenIdentifier.java:102)
   at org.apache.hadoop.security.token.Token.renew(Token.java:372)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.checkShortCircuitRenewCancel(TestClientRMTokens.java:306)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.testShortCircuitRenewCancel(TestClientRMTokens.java:240)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
   at 
 org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
   at 
 org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
   at 
 org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
   at 
 org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
   at 
 org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
   at 
 org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
   at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
   at 
 org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
   at 
 org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
   at 
 org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
   at 
 org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
   at 
 org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
   at 
 org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
   at 
 org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1996) Provide alternative policies for UNHEALTHY nodes.

2014-04-29 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984664#comment-13984664
 ] 

Gera Shegalov commented on YARN-1996:
-

[~ste...@apache.org], thanks for pointing out the JIRAs about decommissioning.
I'll link them to this JIRA. The main point of this JIRA is to gracefully deal
with the UNHEALTHY state determined by the health script.

 Provide alternative policies for UNHEALTHY nodes.
 -

 Key: YARN-1996
 URL: https://issues.apache.org/jira/browse/YARN-1996
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, scheduler
Affects Versions: 2.4.0
Reporter: Gera Shegalov
Assignee: Gera Shegalov
 Attachments: YARN-1996.v01.patch


 Currently, UNHEALTHY nodes can significantly prolong the execution of large,
 expensive jobs, as demonstrated by MAPREDUCE-5817, and degrade cluster
 health even further due to [positive
 feedback|http://en.wikipedia.org/wiki/Positive_feedback]. The set of containers
 that may have made the node unhealthy in the first place starts spreading
 across the cluster, because the node is declared unusable and all of its
 containers are killed and rescheduled on different nodes.
 To mitigate this, we are experimenting with a patch that allows containers
 already running on a node that turns UNHEALTHY to complete (drain), while no
 new containers can be assigned to it until it turns healthy again.
 This mechanism can also be used for graceful decommissioning of an NM. To this
 end, we have to write a health script that can deterministically
 report UNHEALTHY. For example:
 {code}
 if [ -e "$1" ]; then
   echo "ERROR Node decommissioning via health script hack"
 fi
 {code}
 In the current version of the patch, the behavior is controlled by a boolean
 property {{yarn.nodemanager.unheathy.drain.containers}}. More versatile
 policies are possible in future work. Currently, the health state of a node is
 determined in a binary fashion based on the disk checker and the health
 script's ERROR output. However, we could also interpret health-script output
 similarly to Java logging levels (one of which is ERROR), such as WARN and
 FATAL. Each level could then be treated differently, e.g.,
 - FATAL: unusable, like today
 - ERROR: drain
 - WARN: halve the node capacity,
 complemented with equivalence rules such as 3 WARN messages == ERROR,
 2*ERROR == FATAL, etc.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1362) Distinguish between nodemanager shutdown for decommission vs shutdown for restart

2014-04-29 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-1362:
-

Attachment: YARN-1362.patch

Small patch that enhances the NM context to provide a getter/setter for a
decommission flag. This allows code to query whether the NM has been told to
decommission and to act accordingly during shutdown.
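
A sketch of the idea (the method names are illustrative, not necessarily those in the patch): the NM context exposes a flag that shutdown code can consult.
{code}
// Hypothetical shape of the context addition described above.
interface DecommissionAware {
  boolean getDecommissioned();

  void setDecommissioned(boolean decommissioned);
}
{code}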

 Distinguish between nodemanager shutdown for decommission vs shutdown for 
 restart
 -

 Key: YARN-1362
 URL: https://issues.apache.org/jira/browse/YARN-1362
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
 Attachments: YARN-1362.patch


 When a nodemanager shuts down it needs to determine if it is likely to be 
 restarted.  If a restart is likely then it needs to preserve container 
 directories, logs, distributed cache entries, etc.  If it is being shutdown 
 more permanently (e.g.: like a decommission) then the nodemanager should 
 cleanup directories and logs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1929) DeadLock in RM when automatic failover is enabled.

2014-04-29 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984699#comment-13984699
 ] 

Vinod Kumar Vavilapalli commented on YARN-1929:
---

Seems 'fine' to me. It is one of those
fine-for-now-but-not-sure-if-anything-else-is-broken changes.

OTOH, we aren't getting rid of the remaining locking in CompositeService; that's
something we should fix separately. I don't want this patch to blow up more.

The test looks fine except for the 1-second sleep. I can see that causing issues
on VMs, but let's see.

Checking this in.

 DeadLock in RM when automatic failover is enabled.
 --

 Key: YARN-1929
 URL: https://issues.apache.org/jira/browse/YARN-1929
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
 Environment: Yarn HA cluster
Reporter: Rohith
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: yarn-1929-1.patch, yarn-1929-2.patch


 Deadlock detected in RM when automatic failover is enabled.
 {noformat}
 Found one Java-level deadlock:
 =
 Thread-2:
   waiting to lock monitor 0x7fb514303cf0 (object 0xef153fd0, a 
 org.apache.hadoop.ha.ActiveStandbyElector),
   which is held by main-EventThread
 main-EventThread:
   waiting to lock monitor 0x7fb514750a48 (object 0xef154020, a 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService),
   which is held by Thread-2
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1612) Change Fair Scheduler to not disable delay scheduling by default

2014-04-29 Thread Chen He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen He updated YARN-1612:
--

Attachment: YARN-1612-v2.patch

 Change Fair Scheduler to not disable delay scheduling by default
 

 Key: YARN-1612
 URL: https://issues.apache.org/jira/browse/YARN-1612
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Sandy Ryza
Assignee: Chen He
 Attachments: YARN-1612-v2.patch, YARN-1612.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1885) RM may not send the finished signal to some nodes where the application ran after RM restarts

2014-04-29 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984722#comment-13984722
 ] 

Jian He commented on YARN-1885:
---

Thanks for the update!
- Some places exceed the 80-column limit, like the RMAppImpl transitions.
- app.isAppFinalStateStored(): better to use isAppInFinalState() instead?
- Sleeping for a fixed amount of time is not deterministic; the test may fail
randomly. It's better to do it in a while loop with heartbeats and exit the
loop once the condition is met (see the sketch after this list).
{code}
// sleep for a while before do next heartbeat
Thread.sleep(1000);
NodeHeartbeatResponse response = nm1.nodeHeartbeat(true);
{code}
- timeout = 60: the timeout is too long.
- Can these two transitions ever happen? Generally, we should not add events to
states where the transitions can never happen; that will hide bugs.
{code}
.addTransition(RMAppState.NEW, RMAppState.NEW, RMAppEventType.NODE_ADDED,
    new NodeAddedTransition())
.addTransition(RMAppState.NEW_SAVING, RMAppState.NEW_SAVING,
    RMAppEventType.NODE_ADDED,
    new NodeAddedTransition())
{code}
- These two loops may block the register RPC call for a while; I think we could
send them as the payload of RMNodeStartEvent and handle them in
RMNodeAddTransition?
{code}
// Handle container statuses reported by NM
if (!request.getContainerStatuses().isEmpty()) {
  LOG.info("received container statuses on node manager register :"
      + request.getContainerStatuses());
  for (ContainerStatus containerStatus : request.getContainerStatuses()) {
    handleContainerStatus(containerStatus);
  }
}

// Handle running applications reported by NM
if (null != request.getRunningApplications()) {
  for (ApplicationId appId : request.getRunningApplications()) {
    handleRunningAppOnNode(appId, request.getNodeId());
  }
}
{code}
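
A minimal sketch of that polling-loop suggestion, assuming the {{nm1}} MockNM handle from the quoted test; isConditionMet() is a hypothetical stand-in for whatever the test actually waits for in the heartbeat response:
{code}
// Poll with heartbeats up to a deadline instead of sleeping a fixed amount once.
long deadline = System.currentTimeMillis() + 10000; // overall cap, e.g. 10 seconds
NodeHeartbeatResponse response = nm1.nodeHeartbeat(true);
while (!isConditionMet(response) && System.currentTimeMillis() < deadline) {
  Thread.sleep(100); // short pause between heartbeats
  response = nm1.nodeHeartbeat(true);
}
Assert.assertTrue(isConditionMet(response));
{code}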

 RM may not send the finished signal to some nodes where the application ran 
 after RM restarts
 -

 Key: YARN-1885
 URL: https://issues.apache.org/jira/browse/YARN-1885
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Wangda Tan
 Attachments: YARN-1885.patch, YARN-1885.patch, YARN-1885.patch


 During our HA testing we have seen cases where YARN application logs are not
 available through the CLI, but I can look at AM logs through the UI. The RM was
 also being restarted in the background while the application was running.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1929) DeadLock in RM when automatic failover is enabled.

2014-04-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984843#comment-13984843
 ] 

Hudson commented on YARN-1929:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #5585 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5585/])
YARN-1929. Fixed a deadlock in ResourceManager that occurs when failover 
happens right at the time of shutdown. Contributed by Karthik Kambatla. 
(vinodkv: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1591071)
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/service/CompositeService.java
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/EmbeddedElectorService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMEmbeddedElector.java


 DeadLock in RM when automatic failover is enabled.
 --

 Key: YARN-1929
 URL: https://issues.apache.org/jira/browse/YARN-1929
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
 Environment: Yarn HA cluster
Reporter: Rohith
Assignee: Karthik Kambatla
Priority: Blocker
 Fix For: 2.4.1

 Attachments: yarn-1929-1.patch, yarn-1929-2.patch


 Deadlock detected in RM when automatic failover is enabled.
 {noformat}
 Found one Java-level deadlock:
 =
 Thread-2:
   waiting to lock monitor 0x7fb514303cf0 (object 0xef153fd0, a 
 org.apache.hadoop.ha.ActiveStandbyElector),
   which is held by main-EventThread
 main-EventThread:
   waiting to lock monitor 0x7fb514750a48 (object 0xef154020, a 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService),
   which is held by Thread-2
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1362) Distinguish between nodemanager shutdown for decommission vs shutdown for restart

2014-04-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984884#comment-13984884
 ] 

Hadoop QA commented on YARN-1362:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12642514/YARN-1362.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3659//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3659//console

This message is automatically generated.

 Distinguish between nodemanager shutdown for decommission vs shutdown for 
 restart
 -

 Key: YARN-1362
 URL: https://issues.apache.org/jira/browse/YARN-1362
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
Assignee: Jason Lowe
 Attachments: YARN-1362.patch


 When a nodemanager shuts down it needs to determine if it is likely to be 
 restarted.  If a restart is likely then it needs to preserve container 
 directories, logs, distributed cache entries, etc.  If it is being shutdown 
 more permanently (e.g.: like a decommission) then the nodemanager should 
 cleanup directories and logs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1612) Change Fair Scheduler to not disable delay scheduling by default

2014-04-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984896#comment-13984896
 ] 

Hadoop QA commented on YARN-1612:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12642521/YARN-1612-v2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3658//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3658//console

This message is automatically generated.

 Change Fair Scheduler to not disable delay scheduling by default
 

 Key: YARN-1612
 URL: https://issues.apache.org/jira/browse/YARN-1612
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Sandy Ryza
Assignee: Chen He
 Attachments: YARN-1612-v2.patch, YARN-1612.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1885) RM may not send the finished signal to some nodes where the application ran after RM restarts

2014-04-29 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984945#comment-13984945
 ] 

Wangda Tan commented on YARN-1885:
--

[~jianhe], thanks for your review!
bq. Some places exceed the 80-column limit, like the RMAppImpl transitions.
Will correct this later.
bq. app.isAppFinalStateStored(): better to use isAppInFinalState() instead?
Agreed, it's a bug to use isAppFinalStateStored().
bq. Sleeping for a fixed amount of time is not deterministic; the test may fail
randomly. It's better to do it in a while loop with heartbeats and exit the
loop once the condition is met.
Agreed.
bq. timeout = 60: the timeout is too long.
Sorry for this typo :)
bq. Can these two transitions ever happen? Generally, we should not add events to
states where the transitions can never happen; that will hide bugs.
Agreed, and I think SUBMITTED also cannot happen, because an app in the
SUBMITTED state hasn't launched any containers, so NMs will not have the app in
their running-applications list. Do you agree?
bq. These two loops may block the register RPC call for a while; I think we could
send them as the payload of RMNodeStartEvent and handle them in
RMNodeAddTransition?
IMO this shouldn't be a big problem, because there are no blocking calls in
handleRunningAppOnNode/handleContainerStatus, so the additional microseconds of
latency (just looping over an array) should be fine. Is it?
Attached a new patch.

 RM may not send the finished signal to some nodes where the application ran 
 after RM restarts
 -

 Key: YARN-1885
 URL: https://issues.apache.org/jira/browse/YARN-1885
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Wangda Tan
 Attachments: YARN-1885.patch, YARN-1885.patch, YARN-1885.patch


 During our HA testing we have seen cases where YARN application logs are not
 available through the CLI, but I can look at AM logs through the UI. The RM was
 also being restarted in the background while the application was running.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1885) RM may not send the finished signal to some nodes where the application ran after RM restarts

2014-04-29 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-1885:
-

Attachment: YARN-1885.patch

 RM may not send the finished signal to some nodes where the application ran 
 after RM restarts
 -

 Key: YARN-1885
 URL: https://issues.apache.org/jira/browse/YARN-1885
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Wangda Tan
 Attachments: YARN-1885.patch, YARN-1885.patch, YARN-1885.patch, 
 YARN-1885.patch


 During our HA testing we have seen cases where YARN application logs are not
 available through the CLI, but I can look at AM logs through the UI. The RM was
 also being restarted in the background while the application was running.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1696) Document RM HA

2014-04-29 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-1696:
--

Attachment: YARN-1696.6.patch

Same patch as before but with a few edits to make it better.

Will check this in once Jenkins says okay.

 Document RM HA
 --

 Key: YARN-1696
 URL: https://issues.apache.org/jira/browse/YARN-1696
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.3.0
Reporter: Karthik Kambatla
Assignee: Tsuyoshi OZAWA
Priority: Blocker
 Attachments: YARN-1676.5.patch, YARN-1696-3.patch, YARN-1696.2.patch, 
 YARN-1696.4.patch, YARN-1696.6.patch, rm-ha-overview.png, rm-ha-overview.svg, 
 yarn-1696-1.patch


 Add documentation for RM HA. Marking this a blocker for 2.4 as this is 
 required to call RM HA Stable and ready for public consumption. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1696) Document RM HA

2014-04-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985067#comment-13985067
 ] 

Hadoop QA commented on YARN-1696:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12642567/YARN-1696.6.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3660//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3660//console

This message is automatically generated.

 Document RM HA
 --

 Key: YARN-1696
 URL: https://issues.apache.org/jira/browse/YARN-1696
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.3.0
Reporter: Karthik Kambatla
Assignee: Tsuyoshi OZAWA
Priority: Blocker
 Attachments: YARN-1676.5.patch, YARN-1696-3.patch, YARN-1696.2.patch, 
 YARN-1696.4.patch, YARN-1696.6.patch, rm-ha-overview.png, rm-ha-overview.svg, 
 yarn-1696-1.patch


 Add documentation for RM HA. Marking this a blocker for 2.4 as this is 
 required to call RM HA Stable and ready for public consumption. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-1998) Change the time zone on the Yarn UI to the local time zone

2014-04-29 Thread Fengdong Yu (JIRA)
Fengdong Yu created YARN-1998:
-

 Summary: Change the time zone on the Yarn UI to the local time zone
 Key: YARN-1998
 URL: https://issues.apache.org/jira/browse/YARN-1998
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Fengdong Yu
Assignee: Fengdong Yu
Priority: Minor


It shows the GMT time zone for 'startTime' and 'finishTime' on the RM web UI;
we should show the local time zone.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1998) Change the time zone on the Yarn UI to the local time zone

2014-04-29 Thread Fengdong Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fengdong Yu updated YARN-1998:
--

Attachment: YARN-1998.patch

 Change the time zone on the Yarn UI to the local time zone
 --

 Key: YARN-1998
 URL: https://issues.apache.org/jira/browse/YARN-1998
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Fengdong Yu
Assignee: Fengdong Yu
Priority: Minor
 Attachments: YARN-1998.patch


 It shows the GMT time zone for 'startTime' and 'finishTime' on the RM web UI;
 we should show the local time zone.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1998) Change the time zone on the RM web UI to the local time zone

2014-04-29 Thread Fengdong Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fengdong Yu updated YARN-1998:
--

Summary: Change the time zone on the RM web UI to the local time zone  
(was: Change the time zone on the Yarn UI to the local time zone)

 Change the time zone on the RM web UI to the local time zone
 

 Key: YARN-1998
 URL: https://issues.apache.org/jira/browse/YARN-1998
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Fengdong Yu
Assignee: Fengdong Yu
Priority: Minor
 Attachments: YARN-1998.patch


 It shows the GMT time zone for 'startTime' and 'finishTime' on the RM web UI; we 
 should show the local time zone.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1999) Move HistoryServerRest.apt.vm into the Mapreduce section

2014-04-29 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-1999:
---

Affects Version/s: 2.4.0

 Move HistoryServerRest.apt.vm into the Mapreduce section
 

 Key: YARN-1999
 URL: https://issues.apache.org/jira/browse/YARN-1999
 Project: Hadoop YARN
  Issue Type: Bug
  Components: documentation
Affects Versions: 2.4.0
Reporter: Ravi Prakash

 Now that we have the YARN HistoryServer, perhaps we should move 
 HistoryServerRest.apt.vm into the MapReduce section where it really belongs?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1999) Move HistoryServerRest.apt.vm into the Mapreduce section

2014-04-29 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-1999:
---

Component/s: documentation

 Move HistoryServerRest.apt.vm into the Mapreduce section
 

 Key: YARN-1999
 URL: https://issues.apache.org/jira/browse/YARN-1999
 Project: Hadoop YARN
  Issue Type: Bug
  Components: documentation
Affects Versions: 2.4.0
Reporter: Ravi Prakash

 Now that we have the YARN HistoryServer, perhaps we should move 
 HistoryServerRest.apt.vm into the MapReduce section where it really belongs?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1999) Move HistoryServerRest.apt.vm into the Mapreduce section

2014-04-29 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-1999:
---

Target Version/s: 2.5.0

 Move HistoryServerRest.apt.vm into the Mapreduce section
 

 Key: YARN-1999
 URL: https://issues.apache.org/jira/browse/YARN-1999
 Project: Hadoop YARN
  Issue Type: Bug
  Components: documentation
Affects Versions: 2.4.0
Reporter: Ravi Prakash

 Now that we have the YARN HistoryServer, perhaps we should move 
 HistoryServerRest.apt.vm into the MapReduce section where it really belongs?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-1999) Move HistoryServerRest.apt.vm into the Mapreduce section

2014-04-29 Thread Ravi Prakash (JIRA)
Ravi Prakash created YARN-1999:
--

 Summary: Move HistoryServerRest.apt.vm into the Mapreduce section
 Key: YARN-1999
 URL: https://issues.apache.org/jira/browse/YARN-1999
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ravi Prakash


Now that we have the YARN HistoryServer, perhaps we should move 
HistoryServerRest.apt.vm into the MapReduce section where it really belongs?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1885) RM may not send the finished signal to some nodes where the application ran after RM restarts

2014-04-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985083#comment-13985083
 ] 

Hadoop QA commented on YARN-1885:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12642551/YARN-1885.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 9 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3661//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3661//console

This message is automatically generated.

 RM may not send the finished signal to some nodes where the application ran 
 after RM restarts
 -

 Key: YARN-1885
 URL: https://issues.apache.org/jira/browse/YARN-1885
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Wangda Tan
 Attachments: YARN-1885.patch, YARN-1885.patch, YARN-1885.patch, 
 YARN-1885.patch


 During our HA testing we have seen cases where YARN application logs are not 
 available through the CLI, but I can look at the AM logs through the UI. The RM was 
 also being restarted in the background while the application was running.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1998) Change the time zone on the RM web UI to the local time zone

2014-04-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985105#comment-13985105
 ] 

Hadoop QA commented on YARN-1998:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12642581/YARN-1998.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3662//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3662//console

This message is automatically generated.

 Change the time zone on the RM web UI to the local time zone
 

 Key: YARN-1998
 URL: https://issues.apache.org/jira/browse/YARN-1998
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Fengdong Yu
Assignee: Fengdong Yu
Priority: Minor
 Attachments: YARN-1998.patch


 It shows the GMT time zone for 'startTime' and 'finishTime' on the RM web UI; we 
 should show the local time zone.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2001) Persist NMs info for RM restart

2014-04-29 Thread Jian He (JIRA)
Jian He created YARN-2001:
-

 Summary: Persist NMs info for RM restart
 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Jian He
Assignee: Jian He


The RM should not accept allocate requests from AMs until all the NMs have 
re-registered with the RM after a restart. For that, the RM needs to persist the set 
of previously registered NMs and wait for all of them to re-register.
This is also useful for remembering decommissioned nodes across restarts.
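
A hypothetical sketch of the idea (class and method names are illustrative only, not actual YARN APIs): the RM restores the set of previously known NMs from its state store and defers allocate requests until all of them have re-registered.

{code:java}
// Hypothetical sketch; none of these names are real YARN classes.
import java.util.HashSet;
import java.util.Set;

public class NodeReRegistrationTracker {
  // NM identifiers (host:port) persisted in the RM state store before the restart.
  private final Set<String> expectedNodes = new HashSet<String>();
  // NMs that have re-registered since the RM came back up.
  private final Set<String> reRegisteredNodes = new HashSet<String>();

  public NodeReRegistrationTracker(Set<String> nodesFromStateStore) {
    expectedNodes.addAll(nodesFromStateStore);
  }

  /** Called when an NM registers after the RM restart. */
  public synchronized void nodeRegistered(String nodeId) {
    if (expectedNodes.contains(nodeId)) {
      reRegisteredNodes.add(nodeId);
    }
  }

  /** Allocate requests from AMs would be deferred until this returns true. */
  public synchronized boolean allExpectedNodesBack() {
    return reRegisteredNodes.containsAll(expectedNodes);
  }
}
{code}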



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-04-29 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985148#comment-13985148
 ] 

Jian He commented on YARN-556:
--

Hi Anubhav,
I looked at the prototype patch. Regarding the approach, it is better to have a 
scheduler-agnostic recovery mechanism with no (or minimal) scheduler-specific 
changes, instead of implementing recovery separately for each scheduler. YARN-1368 
can be renamed to cover the common changes needed by all schedulers. Also, adding 
the cluster timestamp to the container ID does not seem right, and it would also 
break compatibility.


 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf, 
 WorkPreservingRestartPrototype.001.patch


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1368) Common work to re-populate containers’ state into scheduler

2014-04-29 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-1368:
--

Summary: Common work to re-populate containers’ state into scheduler  (was: 
RM should populate running container allocation information from NM resync)

 Common work to re-populate containers’ state into scheduler
 ---

 Key: YARN-1368
 URL: https://issues.apache.org/jira/browse/YARN-1368
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Bikas Saha
Assignee: Anubhav Dhoot

 YARN-1367 adds support for the NM to tell the RM about all currently running 
 containers upon registration. The RM needs to send this information to the 
 schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover 
 the current allocation state of the cluster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1368) Common work to re-populate containers’ state into scheduler

2014-04-29 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985151#comment-13985151
 ] 

Jian He commented on YARN-1368:
---

Hi [~adhoot], mind if I take this over? I have a preliminary patch that does the 
bulk of the work and can upload it very soon. Thanks.

 Common work to re-populate containers’ state into scheduler
 ---

 Key: YARN-1368
 URL: https://issues.apache.org/jira/browse/YARN-1368
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Bikas Saha
Assignee: Anubhav Dhoot

 YARN-1367 adds support for the NM to tell the RM about all currently running 
 containers upon registration. The RM needs to send this information to the 
 schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover 
 the current allocation state of the cluster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1963) Support priorities across applications within the same queue

2014-04-29 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985160#comment-13985160
 ] 

Sunil G commented on YARN-1963:
---

We have done some analysis and implemented support for application priorities.
I would like to share the design thoughts here; kindly review them.

Design thoughts:
1. Configuration
We plan to reuse the existing priority configuration listed below to set a job's 
priority:
a.  JobConf.getJobPriority() and Job.setPriority(JobPriority priority)
b.  The configuration property mapreduce.job.priority can also be used.

The priority values are VERY_HIGH, HIGH, NORMAL, LOW and VERY_LOW.

2. Scheduler Side
If a Capacity Scheduler queue has multiple applications (jobs) with different 
priorities, the CS will allocate containers for the highest-priority application 
first, then for the next priority, and so on. When multiple queues are configured 
with different capacities, this priority ordering applies within each queue.

For this, we plan to add a priority comparison to the following data structure:
Comparator<FiCaSchedulerApp> applicationComparator

We added a priority check in compare() of applicationComparator while selecting 
applications. The updated ordering is:
1.  Compare priorities first; if they differ, prefer the higher-priority application.
2.  Otherwise, continue with the existing logic (application ID and timestamp 
comparison).

With these changes, the highest-priority job gets preference within a queue (a 
minimal sketch of such a comparator follows below).

NB: In addition, we added a preemption module so that high-priority jobs can obtain 
resources quickly by preempting lower-priority ones.

I will upload a patch if this approach looks fine.
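
As an illustration of the comparison described above, here is a minimal sketch (FiCaSchedulerApp is a stand-in interface, not the real Capacity Scheduler class, and the integer priority mapping is an assumption):

{code:java}
// Minimal sketch only; not the actual Capacity Scheduler code.
import java.util.Comparator;

public class PriorityComparatorSketch {

  interface FiCaSchedulerApp {
    int getPriority();        // assumed mapping, e.g. VERY_LOW=0 ... VERY_HIGH=4
    long getApplicationId();  // existing tie-breaker: earlier submission wins
  }

  static final Comparator<FiCaSchedulerApp> APPLICATION_COMPARATOR =
      new Comparator<FiCaSchedulerApp>() {
        @Override
        public int compare(FiCaSchedulerApp a1, FiCaSchedulerApp a2) {
          // 1. Higher priority first.
          if (a1.getPriority() != a2.getPriority()) {
            return Integer.compare(a2.getPriority(), a1.getPriority());
          }
          // 2. Otherwise fall back to the existing application-id ordering.
          return Long.compare(a1.getApplicationId(), a2.getApplicationId());
        }
      };
}
{code}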

 Support priorities across applications within the same queue 
 -

 Key: YARN-1963
 URL: https://issues.apache.org/jira/browse/YARN-1963
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api, resourcemanager
Reporter: Arun C Murthy
Assignee: Arun C Murthy

 It will be very useful to support priorities among applications within the 
 same queue, particularly in production scenarios. It allows for finer-grained 
 controls without having to force admins to create a multitude of queues, plus 
 allows existing applications to continue using existing queues which are 
 usually part of institutional memory.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1368) Common work to re-populate containers’ state into scheduler

2014-04-29 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985164#comment-13985164
 ] 

Jian He commented on YARN-1368:
---

It would be good to have a scheduler-agnostic way to recover the containers and all 
the other scheduler state for applications/attempts. I have renamed the title 
accordingly.
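
As a hypothetical, scheduler-agnostic illustration of the recovery flow (none of these types are real YARN classes; they only show a NODE_ADDED-style event carrying the containers the NM reported when it re-registered):

{code:java}
// Illustrative sketch only.
import java.util.List;

public class NodeAddedRecoverySketch {

  static class ContainerReport {
    final String containerId;
    final String applicationId;
    ContainerReport(String containerId, String applicationId) {
      this.containerId = containerId;
      this.applicationId = applicationId;
    }
  }

  static class NodeAddedEvent {
    final String nodeId;
    final List<ContainerReport> runningContainers;
    NodeAddedEvent(String nodeId, List<ContainerReport> runningContainers) {
      this.nodeId = nodeId;
      this.runningContainers = runningContainers;
    }
  }

  /** Common recovery step a scheduler would run when handling the event. */
  static void recoverNode(NodeAddedEvent event) {
    for (ContainerReport report : event.runningContainers) {
      // Re-attach each reported container to its application's bookkeeping so the
      // scheduler's allocation state matches what is actually running on the node.
      System.out.println("Recovering " + report.containerId
          + " of " + report.applicationId + " on " + event.nodeId);
    }
  }
}
{code}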

 Common work to re-populate containers’ state into scheduler
 ---

 Key: YARN-1368
 URL: https://issues.apache.org/jira/browse/YARN-1368
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Bikas Saha
Assignee: Anubhav Dhoot

 YARN-1367 adds support for the NM to tell the RM about all currently running 
 containers upon registration. The RM needs to send this information to the 
 schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover 
 the current allocation state of the cluster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1963) Support priorities across applications within the same queue

2014-04-29 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985193#comment-13985193
 ] 

Sandy Ryza commented on YARN-1963:
--

Thanks for picking this up, Sunil. Can we separate this into a couple of JIRAs? 
One for the ResourceManager and protocol changes, one for the MapReduce changes, 
and one for the Capacity Scheduler changes.

 Support priorities across applications within the same queue 
 -

 Key: YARN-1963
 URL: https://issues.apache.org/jira/browse/YARN-1963
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api, resourcemanager
Reporter: Arun C Murthy
Assignee: Arun C Murthy

 It will be very useful to support priorities among applications within the 
 same queue, particularly in production scenarios. It allows for finer-grained 
 controls without having to force admins to create a multitude of queues, plus 
 allows existing applications to continue using existing queues which are 
 usually part of institutional memory.



--
This message was sent by Atlassian JIRA
(v6.2#6252)