[jira] [Commented] (YARN-1904) Uniform the XXXXNotFound messages from ClientRMService and ApplicationHistoryClientService

2014-04-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960988#comment-13960988
 ] 

Hadoop QA commented on YARN-1904:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12638841/YARN-1904.1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3515//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3515//console

This message is automatically generated.

> Uniform the NotFound messages from ClientRMService and 
> ApplicationHistoryClientService
> --
>
> Key: YARN-1904
> URL: https://issues.apache.org/jira/browse/YARN-1904
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-1904.1.patch
>
>
> It's good to make ClientRMService and ApplicationHistoryClientService throw 
> NotFoundException with similar messages



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1904) Uniform the XXXXNotFound messages from ClientRMService and ApplicationHistoryClientService

2014-04-04 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-1904:
--

Target Version/s: 2.4.1

> Uniform the NotFound messages from ClientRMService and 
> ApplicationHistoryClientService
> --
>
> Key: YARN-1904
> URL: https://issues.apache.org/jira/browse/YARN-1904
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-1904.1.patch
>
>
> It's good to make ClientRMService and ApplicationHistoryClientService throw 
> NotFoundException with similar messages



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1904) Uniform the XXXXNotFound messages from ClientRMService and ApplicationHistoryClientService

2014-04-04 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-1904:
--

Attachment: YARN-1904.1.patch

Created a patch; it's a simple message edit, so no new test cases are included.

> Uniform the NotFound messages from ClientRMService and 
> ApplicationHistoryClientService
> --
>
> Key: YARN-1904
> URL: https://issues.apache.org/jira/browse/YARN-1904
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-1904.1.patch
>
>
> It's good to make ClientRMService and ApplicationHistoryClientService throw 
> NotFoundException with similar messages



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-1904) Uniform the XXXXNotFound messages from ClientRMService and ApplicationHistoryClientService

2014-04-04 Thread Zhijie Shen (JIRA)
Zhijie Shen created YARN-1904:
-

 Summary: Uniform the NotFound messages from ClientRMService 
and ApplicationHistoryClientService
 Key: YARN-1904
 URL: https://issues.apache.org/jira/browse/YARN-1904
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen


It's good to make ClientRMService and ApplicationHistoryClientService throw 
NotFoundException with similar messages
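
For illustration only, a minimal sketch of what a shared message format could look like (NotFoundMessages is a hypothetical helper and not part of the attached patch; the exact wording is up to the patch):

{code}
// Hypothetical helper (not the actual YARN-1904.1.patch): both ClientRMService
// and ApplicationHistoryClientService could phrase the error the same way,
// varying only the source they looked in.
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException;

final class NotFoundMessages {
  static ApplicationNotFoundException appNotFound(ApplicationId appId, String source) {
    // source is e.g. "RM" or "the timeline server"
    return new ApplicationNotFoundException(
        "Application with id '" + appId + "' doesn't exist in " + source + ".");
  }
}
{code}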



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1701) Improve default paths of timeline store and generic history store

2014-04-04 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-1701:
--

Issue Type: Bug  (was: Sub-task)
Parent: (was: YARN-321)

> Improve default paths of timeline store and generic history store
> -
>
> Key: YARN-1701
> URL: https://issues.apache.org/jira/browse/YARN-1701
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.1
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
> Attachments: YARN-1701.v01.patch
>
>
> When I enable AHS via yarn.ahs.enabled, the app history is still not visible 
> in the AHS web UI. This is due to NullApplicationHistoryStore being used as 
> yarn.resourcemanager.history-writer.class. It would be good to have just one 
> key to enable the basic functionality.
> yarn.ahs.fs-history-store.uri uses {code}${hadoop.log.dir}{code}, which is a 
> local file system location. However, FileSystemApplicationHistoryStore uses 
> DFS by default.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1701) Improve default paths of timeline store and generic history store

2014-04-04 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-1701:
--

Summary: Improve default paths of timeline store and generic history store  
(was: More intuitive defaults for AHS)

> Improve default paths of timeline store and generic history store
> -
>
> Key: YARN-1701
> URL: https://issues.apache.org/jira/browse/YARN-1701
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.4.1
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
> Attachments: YARN-1701.v01.patch
>
>
> When I enable AHS via yarn.ahs.enabled, the app history is still not visible 
> in the AHS web UI. This is due to NullApplicationHistoryStore being used as 
> yarn.resourcemanager.history-writer.class. It would be good to have just one 
> key to enable the basic functionality.
> yarn.ahs.fs-history-store.uri uses {code}${hadoop.log.dir}{code}, which is a 
> local file system location. However, FileSystemApplicationHistoryStore uses 
> DFS by default.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1701) More intuitive defaults for AHS

2014-04-04 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960967#comment-13960967
 ] 

Zhijie Shen commented on YARN-1701:
---

[~jira.shegalov], would you mind updating the patch? It no longer applies. And 
can we have a one-shot fix for both the timeline store and the generic history 
store paths? Thanks!

> More intuitive defaults for AHS
> ---
>
> Key: YARN-1701
> URL: https://issues.apache.org/jira/browse/YARN-1701
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.4.1
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
> Attachments: YARN-1701.v01.patch
>
>
> When I enable AHS via yarn.ahs.enabled, the app history is still not visible 
> in the AHS web UI. This is due to NullApplicationHistoryStore being used as 
> yarn.resourcemanager.history-writer.class. It would be good to have just one 
> key to enable the basic functionality.
> yarn.ahs.fs-history-store.uri uses {code}${hadoop.log.dir}{code}, which is a 
> local file system location. However, FileSystemApplicationHistoryStore uses 
> DFS by default.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1701) More intuitive defaults for AHS

2014-04-04 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-1701:
--

Affects Version/s: (was: 2.4.0)
   2.4.1

> More intuitive defaults for AHS
> ---
>
> Key: YARN-1701
> URL: https://issues.apache.org/jira/browse/YARN-1701
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.4.1
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
> Attachments: YARN-1701.v01.patch
>
>
> When I enable AHS via yarn.ahs.enabled, the app history is still not visible 
> in the AHS web UI. This is due to NullApplicationHistoryStore being used as 
> yarn.resourcemanager.history-writer.class. It would be good to have just one 
> key to enable the basic functionality.
> yarn.ahs.fs-history-store.uri uses {code}${hadoop.log.dir}{code}, which is a 
> local file system location. However, FileSystemApplicationHistoryStore uses 
> DFS by default.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1898) Standby RM's conf, stacks, logLevel, metrics, jmx and logs links are redirecting to Active RM

2014-04-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960941#comment-13960941
 ] 

Hudson commented on YARN-1898:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #5460 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5460/])
YARN-1898. Addendum patch to ensure /jmx and /metrics are re-directed to Active 
RM. (acmurthy: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1584954)
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebAppFilter.java


> Standby RM's conf, stacks, logLevel, metrics, jmx and logs links are 
> redirecting to Active RM
> -
>
> Key: YARN-1898
> URL: https://issues.apache.org/jira/browse/YARN-1898
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Yesha Vora
>Assignee: Xuan Gong
> Fix For: 2.4.1
>
> Attachments: YARN-1898.1.patch, YARN-1898.2.patch, YARN-1898.3.patch, 
> YARN-1898.addendum.patch, YARN-1898.addendum.patch
>
>
> The Standby RM's /conf, /stacks, /logLevel, /metrics, and /jmx links are 
> redirected to the Active RM.
> They should not be redirected to the Active RM.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1878) Yarn standby RM taking long to transition to active

2014-04-04 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960929#comment-13960929
 ] 

Arun C Murthy commented on YARN-1878:
-

[~xgong] is this ready to go? Let's get this into 2.4.1. Tx.

> Yarn standby RM taking long to transition to active
> ---
>
> Key: YARN-1878
> URL: https://issues.apache.org/jira/browse/YARN-1878
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Xuan Gong
> Attachments: YARN-1878.1.patch
>
>
> In our HA tests we are noticing that sometimes it can take up to 10s for the 
> standby RM to transition to active.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1878) Yarn standby RM taking long to transition to active

2014-04-04 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated YARN-1878:


Target Version/s: 2.4.1

> Yarn standby RM taking long to transition to active
> ---
>
> Key: YARN-1878
> URL: https://issues.apache.org/jira/browse/YARN-1878
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Xuan Gong
> Attachments: YARN-1878.1.patch
>
>
> In our HA tests we are noticing that sometimes it can take up to 10s for the 
> standby RM to transition to active.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1878) Yarn standby RM taking long to transition to active

2014-04-04 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated YARN-1878:


Priority: Blocker  (was: Major)

> Yarn standby RM taking long to transition to active
> ---
>
> Key: YARN-1878
> URL: https://issues.apache.org/jira/browse/YARN-1878
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Xuan Gong
>Priority: Blocker
> Attachments: YARN-1878.1.patch
>
>
> In our HA tests we are noticing that sometimes it can take up to 10s for the 
> standby RM to transition to active.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1898) Standby RM's conf, stacks, logLevel, metrics, jmx and logs links are redirecting to Active RM

2014-04-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960826#comment-13960826
 ] 

Hadoop QA commented on YARN-1898:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12638792/YARN-1898.addendum.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3514//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3514//console

This message is automatically generated.

> Standby RM's conf, stacks, logLevel, metrics, jmx and logs links are 
> redirecting to Active RM
> -
>
> Key: YARN-1898
> URL: https://issues.apache.org/jira/browse/YARN-1898
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Yesha Vora
>Assignee: Xuan Gong
> Fix For: 2.4.1
>
> Attachments: YARN-1898.1.patch, YARN-1898.2.patch, YARN-1898.3.patch, 
> YARN-1898.addendum.patch, YARN-1898.addendum.patch
>
>
> The Standby RM's /conf, /stacks, /logLevel, /metrics, and /jmx links are 
> redirected to the Active RM.
> They should not be redirected to the Active RM.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1903) Killing Container on NEW and LOCALIZING will result in exitCode and diagnostics not set

2014-04-04 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960809#comment-13960809
 ] 

Xuan Gong commented on YARN-1903:
-

+1 LGTM

Also, I ran TestNMClient with this patch applied on Windows several times, and 
all of the runs passed.

> Killing Container on NEW and LOCALIZING will result in exitCode and 
> diagnostics not set
> ---
>
> Key: YARN-1903
> URL: https://issues.apache.org/jira/browse/YARN-1903
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-1903.1.patch
>
>
> The container status after stopping container is not expected.
> {code}
> java.lang.AssertionError: 4: 
>   at org.junit.Assert.fail(Assert.java:93)
>   at org.junit.Assert.assertTrue(Assert.java:43)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:382)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:346)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:226)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1872) TestDistributedShell occasionally fails in trunk

2014-04-04 Thread Hong Zhiguo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960797#comment-13960797
 ] 

Hong Zhiguo commented on YARN-1872:
---

Yes, it is. And the MapReduce v2 AM contains some code to work around this 
strange behavior.
I'll review the YARN-1902 patch later.
In any case, it's better to move the check inside the loop, which is what this 
patch does.
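
As an aside, a rough fragment of what "checking inside the loop" means for an AM callback (illustrative only; names like numAllocatedContainers, numTotalContainers, and launchContainer mirror the DistributedShell AM but this is not the actual YARN-1872.patch):

{code}
// Fragment of a hypothetical AMRMClientAsync callback: by checking the limit per
// container inside the loop, any surplus containers the RM hands out are released
// instead of being launched.
@Override
public void onContainersAllocated(java.util.List<Container> allocatedContainers) {
  for (Container container : allocatedContainers) {
    if (numAllocatedContainers.incrementAndGet() > numTotalContainers) {
      // surplus allocation beyond what the AM asked for: give it back
      amRMClient.releaseAssignedContainer(container.getId());
      continue;
    }
    launchContainer(container); // hypothetical helper that starts the container
  }
}
{code}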


> TestDistributedShell occasionally fails in trunk
> 
>
> Key: YARN-1872
> URL: https://issues.apache.org/jira/browse/YARN-1872
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Ted Yu
>Assignee: Hong Zhiguo
>Priority: Blocker
> Attachments: TestDistributedShell.out, YARN-1872.patch
>
>
> From https://builds.apache.org/job/Hadoop-Yarn-trunk/520/console :
> TestDistributedShell#testDSShellWithCustomLogPropertyFile failed and 
> TestDistributedShell#testDSShell timed out.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1903) Killing Container on NEW and LOCALIZING will result in exitCode and diagnostics not set

2014-04-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960776#comment-13960776
 ] 

Hadoop QA commented on YARN-1903:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12638788/YARN-1903.1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3513//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3513//console

This message is automatically generated.

> Killing Container on NEW and LOCALIZING will result in exitCode and 
> diagnostics not set
> ---
>
> Key: YARN-1903
> URL: https://issues.apache.org/jira/browse/YARN-1903
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-1903.1.patch
>
>
> The container status after stopping container is not expected.
> {code}
> java.lang.AssertionError: 4: 
>   at org.junit.Assert.fail(Assert.java:93)
>   at org.junit.Assert.assertTrue(Assert.java:43)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:382)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:346)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:226)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations

2014-04-04 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960541#comment-13960541
 ] 

Jason Lowe commented on YARN-1769:
--

The patch no longer applies cleanly after YARN-1512.  Other comments on the 
patch:

- Nit: In LeafQueue.assignToQueue we could cache Resource.add(usedResources, 
required) in a local variable when computing potentialNewCapacity so we don't 
have to recompute it as part of the potentialNewWithoutReservedCapacity 
computation (see the sketch after this list).
- LeafQueue.assignToQueue and LeafQueue.assignToUser don't seem to need the new 
priority argument, and therefore LeafQueue.checkLimitsToReserve wouldn't seem 
to need it either once those others are updated.
- Should FiCaSchedulerApp.getAppToUnreserve really be called 
getNodeIdToUnreserve or getNodeToUnreserve, since it's returning a node ID 
rather than an app?
- In LeafQueue.findNodeToUnreserve, isn't it kind of bad if the app thinks it 
has reservations on the node but the scheduler doesn't know about it? I wonder 
whether the bookkeeping is messed up at that point, in which case something more 
severe than debug would be an appropriate log level, and whether further fixup 
is needed.
- LeafQueue.findNodeToUnreserve is adjusting the headroom when it unreserves, 
but I don't see other unreservations doing a similar calculation.  Wondering if 
this fixup is something that should have been in completedContainer or needs to 
be done elsewhere?  I could easily be missing something here but asking just in 
case other unreservation situations also need to have the headroom fixed.
- LeafQueue.assignContainer uses the much more expensive 
scheduler.getConfiguration().getReservationContinueLook() when it should be 
able to use the reservationsContinueLooking member instead.
- LeafQueue.getReservationContinueLooking should be package private
- Nit: LeafQueue.assignContainer has some reformatting of the log message after 
the "// Inform the node" comment which was clearer to read/maintain before 
since the label and the value were always on a line by themselves.  Same goes 
for the "Reserved container" log towards the end of the method.
- Ultra-Nit: ParentQueue.setupQueueConfig's log message should have the 
reservationsContinueLooking on the previous line to match the style of other 
label/value pairs in the log message.
- ParentQueue.getReservationContinueLooking should be package private.
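
To make the first nit concrete, here is a rough sketch of the caching suggestion (variable names such as usedResources, required, reservedResources, resourceCalculator, and clusterResource are assumptions and do not match LeafQueue exactly):

{code}
// Sketch only: compute the sum once and reuse it for both derived capacities,
// instead of recomputing Resources.add(usedResources, required) twice.
Resource totalWithRequest = Resources.add(usedResources, required);
float potentialNewCapacity =
    Resources.divide(resourceCalculator, clusterResource,
        totalWithRequest, clusterResource);
float potentialNewWithoutReservedCapacity =
    Resources.divide(resourceCalculator, clusterResource,
        Resources.subtract(totalWithRequest, reservedResources), clusterResource);
{code}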

> CapacityScheduler:  Improve reservations
> 
>
> Key: YARN-1769
> URL: https://issues.apache.org/jira/browse/YARN-1769
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch
>
>
> Currently the CapacityScheduler uses reservations in order to handle requests 
> for large containers and the fact there might not currently be enough space 
> available on a single host.
> The current algorithm for reservations is to reserve as many containers as 
> currently required and then it will start to reserve more above that after a 
> certain number of re-reservations (currently biased against larger 
> containers).  Anytime it hits the limit of number reserved it stops looking 
> at any other nodes. This results in potentially missing nodes that have 
> enough space to fulfill the request.
> The other place for improvement is currently reservations count against your 
> queue capacity.  If you have reservations you could hit the various limits 
> which would then stop you from looking further at that node.  
> The above 2 cases can cause an application requesting a larger container to 
> take a long time to get its resources.
> We could improve upon both of those by simply continuing to look at incoming 
> nodes to see if we could potentially swap out a reservation for an actual 
> allocation. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1898) Standby RM's conf, stacks, logLevel, metrics, jmx and logs links are redirecting to Active RM

2014-04-04 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-1898:


Attachment: YARN-1898.addendum.patch

Submitting the same patch again to kick off Jenkins.

> Standby RM's conf, stacks, logLevel, metrics, jmx and logs links are 
> redirecting to Active RM
> -
>
> Key: YARN-1898
> URL: https://issues.apache.org/jira/browse/YARN-1898
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Yesha Vora
>Assignee: Xuan Gong
> Fix For: 2.4.1
>
> Attachments: YARN-1898.1.patch, YARN-1898.2.patch, YARN-1898.3.patch, 
> YARN-1898.addendum.patch, YARN-1898.addendum.patch
>
>
> The Standby RM's /conf, /stacks, /logLevel, /metrics, and /jmx links are 
> redirected to the Active RM.
> They should not be redirected to the Active RM.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1903) Killing Container on NEW and LOCALIZING will result in exitCode and diagnostics not set

2014-04-04 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-1903:
--

Attachment: YARN-1903.1.patch

Upload a patch to fix these issues

> Killing Container on NEW and LOCALIZING will result in exitCode and 
> diagnostics not set
> ---
>
> Key: YARN-1903
> URL: https://issues.apache.org/jira/browse/YARN-1903
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-1903.1.patch
>
>
> The container status after stopping container is not expected.
> {code}
> java.lang.AssertionError: 4: 
>   at org.junit.Assert.fail(Assert.java:93)
>   at org.junit.Assert.assertTrue(Assert.java:43)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:382)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:346)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:226)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (YARN-1901) All tasks restart during RM failover on Hive

2014-04-04 Thread Fengdong Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fengdong Yu resolved YARN-1901.
---

Resolution: Duplicate

> All tasks restart during RM failover on Hive
> 
>
> Key: YARN-1901
> URL: https://issues.apache.org/jira/browse/YARN-1901
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Fengdong Yu
>
> I built from trunk and configured RM HA, then submitted a Hive job.
> There are 11 maps in total; I stopped the active RM when 6 maps had finished,
> but Hive shows all map tasks restarting again. This conflicts with the 
> design description.
> job progress:
> {code}
> 2014-03-31 18:44:14,088 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 713.84 sec
> 2014-03-31 18:44:15,128 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 722.83 sec
> 2014-03-31 18:44:16,160 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 731.95 sec
>  2014-03-31 18:44:17,191 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 744.17 sec
> 2014-03-31 18:44:18,220 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 756.22 sec
> 2014-03-31 18:44:19,250 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 762.4 
> sec
>  2014-03-31 18:44:20,281 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 774.64 sec
> 2014-03-31 18:44:21,306 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU 
> 786.49 sec
> 2014-03-31 18:44:22,334 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU 
> 792.59 sec
>  2014-03-31 18:44:23,363 Stage-1 map = 73%,  reduce = 0%, Cumulative CPU 
> 807.58 sec
> 2014-03-31 18:44:24,392 Stage-1 map = 77%,  reduce = 0%, Cumulative CPU 
> 815.96 sec
> 2014-03-31 18:44:25,416 Stage-1 map = 80%,  reduce = 0%, Cumulative CPU 
> 823.83 sec
>  2014-03-31 18:44:26,443 Stage-1 map = 80%,  reduce = 0%, Cumulative CPU 
> 826.84 sec
> 2014-03-31 18:44:27,472 Stage-1 map = 82%,  reduce = 0%, Cumulative CPU 
> 832.16 sec
> 2014-03-31 18:44:28,501 Stage-1 map = 84%,  reduce = 0%, Cumulative CPU 
> 839.73 sec
>  2014-03-31 18:44:29,531 Stage-1 map = 86%,  reduce = 0%, Cumulative CPU 
> 844.45 sec
> 2014-03-31 18:44:30,564 Stage-1 map = 82%,  reduce = 0%, Cumulative CPU 
> 760.34 sec
> 2014-03-31 18:44:31,728 Stage-1 map = 0%,  reduce = 0%
>  2014-03-31 18:45:06,918 Stage-1 map = 2%,  reduce = 0%, Cumulative CPU 
> 213.81 sec
> 2014-03-31 18:45:07,952 Stage-1 map = 2%,  reduce = 0%, Cumulative CPU 216.83 
> sec
> 2014-03-31 18:45:08,979 Stage-1 map = 7%,  reduce = 0%, Cumulative CPU 229.15 
> sec
>  2014-03-31 18:45:10,007 Stage-1 map = 11%,  reduce = 0%, Cumulative CPU 
> 244.42 sec
> 2014-03-31 18:45:11,040 Stage-1 map = 14%,  reduce = 0%, Cumulative CPU 
> 247.31 sec
> 2014-03-31 18:45:12,072 Stage-1 map = 18%,  reduce = 0%, Cumulative CPU 259.5 
> sec
>  2014-03-31 18:45:13,105 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 274.72 sec
> 2014-03-31 18:45:14,135 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 280.76 sec
> 2014-03-31 18:45:15,170 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 292.9 
> sec
>  2014-03-31 18:45:16,202 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 305.16 sec
> 2014-03-31 18:45:17,233 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 314.21 sec
> 2014-03-31 18:45:18,264 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 323.34 sec
>  2014-03-31 18:45:19,294 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 335.6 sec
> 2014-03-31 18:45:20,325 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 344.71 sec
> 2014-03-31 18:45:21,355 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 353.8 
> sec
>  2014-03-31 18:45:22,385 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 366.06 sec
> 2014-03-31 18:45:23,415 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 375.2 
> sec
> 2014-03-31 18:45:24,449 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 384.28 sec
> {code}
> I am using hive-0.12.0 and ZKRMStateStore as the RM store class. Hive is using 
> a simple external table (only one column).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1901) All tasks restart during RM failover on Hive

2014-04-04 Thread Fengdong Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960515#comment-13960515
 ] 

Fengdong Yu commented on YARN-1901:
---

Yes, it's an exact duplicate. Thanks, I've closed it.

> All tasks restart during RM failover on Hive
> 
>
> Key: YARN-1901
> URL: https://issues.apache.org/jira/browse/YARN-1901
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Fengdong Yu
>
> I built from trunk and configured RM HA, then submitted a Hive job.
> There are 11 maps in total; I stopped the active RM when 6 maps had finished,
> but Hive shows all map tasks restarting again. This conflicts with the 
> design description.
> job progress:
> {code}
> 2014-03-31 18:44:14,088 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 713.84 sec
> 2014-03-31 18:44:15,128 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 722.83 sec
> 2014-03-31 18:44:16,160 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 731.95 sec
>  2014-03-31 18:44:17,191 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 744.17 sec
> 2014-03-31 18:44:18,220 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 756.22 sec
> 2014-03-31 18:44:19,250 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 762.4 
> sec
>  2014-03-31 18:44:20,281 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 774.64 sec
> 2014-03-31 18:44:21,306 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU 
> 786.49 sec
> 2014-03-31 18:44:22,334 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU 
> 792.59 sec
>  2014-03-31 18:44:23,363 Stage-1 map = 73%,  reduce = 0%, Cumulative CPU 
> 807.58 sec
> 2014-03-31 18:44:24,392 Stage-1 map = 77%,  reduce = 0%, Cumulative CPU 
> 815.96 sec
> 2014-03-31 18:44:25,416 Stage-1 map = 80%,  reduce = 0%, Cumulative CPU 
> 823.83 sec
>  2014-03-31 18:44:26,443 Stage-1 map = 80%,  reduce = 0%, Cumulative CPU 
> 826.84 sec
> 2014-03-31 18:44:27,472 Stage-1 map = 82%,  reduce = 0%, Cumulative CPU 
> 832.16 sec
> 2014-03-31 18:44:28,501 Stage-1 map = 84%,  reduce = 0%, Cumulative CPU 
> 839.73 sec
>  2014-03-31 18:44:29,531 Stage-1 map = 86%,  reduce = 0%, Cumulative CPU 
> 844.45 sec
> 2014-03-31 18:44:30,564 Stage-1 map = 82%,  reduce = 0%, Cumulative CPU 
> 760.34 sec
> 2014-03-31 18:44:31,728 Stage-1 map = 0%,  reduce = 0%
>  2014-03-31 18:45:06,918 Stage-1 map = 2%,  reduce = 0%, Cumulative CPU 
> 213.81 sec
> 2014-03-31 18:45:07,952 Stage-1 map = 2%,  reduce = 0%, Cumulative CPU 216.83 
> sec
> 2014-03-31 18:45:08,979 Stage-1 map = 7%,  reduce = 0%, Cumulative CPU 229.15 
> sec
>  2014-03-31 18:45:10,007 Stage-1 map = 11%,  reduce = 0%, Cumulative CPU 
> 244.42 sec
> 2014-03-31 18:45:11,040 Stage-1 map = 14%,  reduce = 0%, Cumulative CPU 
> 247.31 sec
> 2014-03-31 18:45:12,072 Stage-1 map = 18%,  reduce = 0%, Cumulative CPU 259.5 
> sec
>  2014-03-31 18:45:13,105 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 274.72 sec
> 2014-03-31 18:45:14,135 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 280.76 sec
> 2014-03-31 18:45:15,170 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 292.9 
> sec
>  2014-03-31 18:45:16,202 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 305.16 sec
> 2014-03-31 18:45:17,233 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 314.21 sec
> 2014-03-31 18:45:18,264 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 323.34 sec
>  2014-03-31 18:45:19,294 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 335.6 sec
> 2014-03-31 18:45:20,325 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 344.71 sec
> 2014-03-31 18:45:21,355 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 353.8 
> sec
>  2014-03-31 18:45:22,385 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 366.06 sec
> 2014-03-31 18:45:23,415 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 375.2 
> sec
> 2014-03-31 18:45:24,449 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 384.28 sec
> {code}
> I am using hive-0.12.0 and ZKRMStateStore as the RM store class. Hive is using 
> a simple external table (only one column).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1903) Killing Container on NEW and LOCALIZING will result in exitCode and diagnostics not set

2014-04-04 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-1903:
--

Summary: Killing Container on NEW and LOCALIZING will result in exitCode 
and diagnostics not set  (was: TestNMClient fails occasionally)

> Killing Container on NEW and LOCALIZING will result in exitCode and 
> diagnostics not set
> ---
>
> Key: YARN-1903
> URL: https://issues.apache.org/jira/browse/YARN-1903
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
>
> The container status after stopping container is not expected.
> {code}
> java.lang.AssertionError: 4: 
>   at org.junit.Assert.fail(Assert.java:93)
>   at org.junit.Assert.assertTrue(Assert.java:43)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:382)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:346)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:226)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1903) TestNMClient fails occasionally

2014-04-04 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960480#comment-13960480
 ] 

Zhijie Shen commented on YARN-1903:
---

I did more investigation. Rather than a test failure, it sounds more like a bug 
in the container life cycle to me:

1. If a container is killed at NEW, the exit code and diagnostics will never be 
set.
2. If a container is killed at LOCALIZING, the exit code will never be set.
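
Seen from the client side, the symptom looks roughly like the following (a sketch of what the test observes, not the actual TestNMClient code; -1000 is the default "invalid" exit status the NM reports):

{code}
// Sketch of the observable symptom: after stopContainer(), a container that was
// killed while still at NEW/LOCALIZING comes back COMPLETE but with the default
// exit status (-1000) and, for the NEW case, empty diagnostics.
ContainerStatus status = nmClient.getContainerStatus(containerId, nodeId);
if (status.getState() == ContainerState.COMPLETE) {
  int exit = status.getExitStatus();      // stays -1000 with the bug
  String diag = status.getDiagnostics();  // stays "" when killed at NEW
  // after a fix, both should reflect that the container was killed
}
{code}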

> TestNMClient fails occasionally
> ---
>
> Key: YARN-1903
> URL: https://issues.apache.org/jira/browse/YARN-1903
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
>
> The container status after stopping container is not expected.
> {code}
> java.lang.AssertionError: 4: 
>   at org.junit.Assert.fail(Assert.java:93)
>   at org.junit.Assert.assertTrue(Assert.java:43)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:382)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:346)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:226)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1903) TestNMClient fails occasionally

2014-04-04 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960418#comment-13960418
 ] 

Zhijie Shen commented on YARN-1903:
---

I found the following log:
{code}
2014-04-04 05:08:01,361 INFO  containermanager.ContainerManagerImpl 
(ContainerManagerImpl.java:getContainerStatusInternal(785)) - Returning 
ContainerStatus: [ContainerId: container_1396613275302_0001_01_04, State: 
RUNNING, Diagnostics: , ExitStatus: -1000, ]
2014-04-04 05:08:01,365 INFO  containermanager.ContainerManagerImpl 
(ContainerManagerImpl.java:stopContainerInternal(718)) - Stopping container 
with container Id: container_1396613275302_0001_01_04
2014-04-04 05:08:01,366 INFO  nodemanager.NMAuditLogger 
(NMAuditLogger.java:logSuccess(89)) - USER=jenkins  IP=10.79.62.28  
OPERATION=Stop Container RequestTARGET=ContainerManageImpl  
RESULT=SUCCESS  APPID=application_1396613275302_0001
CONTAINERID=container_1396613275302_0001_01_04
2014-04-04 05:08:01,387 INFO  monitor.ContainersMonitorImpl 
(ContainersMonitorImpl.java:isEnabled(169)) - Neither virutal-memory nor 
physical-memory monitoring is needed. Not running the monitor-thread
2014-04-04 05:08:01,387 INFO  containermanager.AuxServices 
(AuxServices.java:handle(175)) - Got event CONTAINER_STOP for appId 
application_1396613275302_0001
2014-04-04 05:08:01,389 INFO  application.Application 
(ApplicationImpl.java:transition(296)) - Adding 
container_1396613275302_0001_01_04 to application 
application_1396613275302_0001
2014-04-04 05:08:01,389 INFO  nodemanager.NMAuditLogger 
(NMAuditLogger.java:logSuccess(89)) - USER=jenkins  OPERATION=Container 
Finished - Killed   TARGET=ContainerImplRESULT=SUCCESS  
APPID=application_1396613275302_0001
CONTAINERID=container_1396613275302_0001_01_04
2014-04-04 05:08:01,389 INFO  container.Container 
(ContainerImpl.java:handle(884)) - Container 
container_1396613275302_0001_01_04 transitioned from NEW to DONE
2014-04-04 05:08:01,389 INFO  application.Application 
(ApplicationImpl.java:transition(339)) - Removing 
container_1396613275302_0001_01_04 from application 
application_1396613275302_0001
2014-04-04 05:08:01,390 INFO  util.ProcfsBasedProcessTree 
(ProcfsBasedProcessTree.java:isAvailable(182)) - ProcfsBasedProcessTree 
currently is supported only on Linux.
2014-04-04 05:08:01,392 INFO  rmcontainer.RMContainerImpl 
(RMContainerImpl.java:handle(321)) - container_1396613275302_0001_01_04 
Container Transitioned from ACQUIRED to RUNNING
2014-04-04 05:08:01,393 INFO  containermanager.ContainerManagerImpl 
(ContainerManagerImpl.java:getContainerStatusInternal(771)) - Getting 
container-status for container_1396613275302_0001_01_04
2014-04-04 05:08:01,393 INFO  containermanager.ContainerManagerImpl 
(ContainerManagerImpl.java:getContainerStatusInternal(785)) - Returning 
ContainerStatus: [ContainerId: container_1396613275302_0001_01_04, State: 
COMPLETE, Diagnostics: , ExitStatus: -1000, ]
{code}

When the kill event is received, the container is still at NEW; it is moved to 
DONE via ContainerDoneTransition, which doesn't set the kill-related exit code 
and diagnostics.
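
For illustration, a kill-aware transition could take roughly the following shape (a sketch under assumptions about ContainerImpl's internals, such as its exitCode and diagnostics fields and the chosen exit status; this is not the attached patch):

{code}
// Sketch only: a transition out of NEW/LOCALIZING that records why the container
// ended, so getContainerStatus() can report it (field names are assumptions).
static class KillBeforeRunningTransition
    implements SingleArcTransition<ContainerImpl, ContainerEvent> {
  @Override
  public void transition(ContainerImpl container, ContainerEvent event) {
    ContainerKillEvent killEvent = (ContainerKillEvent) event;
    container.exitCode = ContainerExitStatus.ABORTED;  // assumed exit code
    container.diagnostics.append(killEvent.getDiagnostic()).append("\n");
    container.diagnostics.append("Container is killed before being launched.\n");
  }
}
{code}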

> TestNMClient fails occasionally
> ---
>
> Key: YARN-1903
> URL: https://issues.apache.org/jira/browse/YARN-1903
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
>
> The container status after stopping container is not expected.
> {code}
> java.lang.AssertionError: 4: 
>   at org.junit.Assert.fail(Assert.java:93)
>   at org.junit.Assert.assertTrue(Assert.java:43)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:382)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:346)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:226)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-1903) TestNMClient fails occasionally

2014-04-04 Thread Zhijie Shen (JIRA)
Zhijie Shen created YARN-1903:
-

 Summary: TestNMClient fails occasionally
 Key: YARN-1903
 URL: https://issues.apache.org/jira/browse/YARN-1903
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Zhijie Shen
Assignee: Zhijie Shen


The container status after stopping container is not expected.
{code}
java.lang.AssertionError: 4: 
at org.junit.Assert.fail(Assert.java:93)
at org.junit.Assert.assertTrue(Assert.java:43)
at 
org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:382)
at 
org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:346)
at 
org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:226)
{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1872) TestDistributedShell occasionally fails in trunk

2014-04-04 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960357#comment-13960357
 ] 

Zhijie Shen commented on YARN-1872:
---

bq. After the DistributedShell AM requested numTotalContainers containers, RM 
may allocate more than that.

[~zhiguohong], thanks for working on the test failure. Do you know why the RM is 
likely to allocate more containers than the AM requested? Is it related to what 
YARN-1902 describes?

> TestDistributedShell occasionally fails in trunk
> 
>
> Key: YARN-1872
> URL: https://issues.apache.org/jira/browse/YARN-1872
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Ted Yu
>Assignee: Hong Zhiguo
>Priority: Blocker
> Attachments: TestDistributedShell.out, YARN-1872.patch
>
>
> From https://builds.apache.org/job/Hadoop-Yarn-trunk/520/console :
> TestDistributedShell#testDSShellWithCustomLogPropertyFile failed and 
> TestDistributedShell#testDSShell timed out.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1872) TestDistributedShell occasionally fails in trunk

2014-04-04 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-1872:
--

Priority: Blocker  (was: Major)
Target Version/s: 2.4.1
  Labels:   (was: patch)

> TestDistributedShell occasionally fails in trunk
> 
>
> Key: YARN-1872
> URL: https://issues.apache.org/jira/browse/YARN-1872
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Ted Yu
>Assignee: Hong Zhiguo
>Priority: Blocker
> Attachments: TestDistributedShell.out, YARN-1872.patch
>
>
> From https://builds.apache.org/job/Hadoop-Yarn-trunk/520/console :
> TestDistributedShell#testDSShellWithCustomLogPropertyFile failed and 
> TestDistributedShell#testDSShell timed out.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1837) TestMoveApplication.testMoveRejectedByScheduler randomly fails

2014-04-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960281#comment-13960281
 ] 

Hudson commented on YARN-1837:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #5458 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5458/])
YARN-1837. Fixed TestMoveApplication#testMoveRejectedByScheduler failure. 
Contributed by Hong Zhiguo (jianhe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1584862)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestMoveApplication.java


> TestMoveApplication.testMoveRejectedByScheduler randomly fails
> --
>
> Key: YARN-1837
> URL: https://issues.apache.org/jira/browse/YARN-1837
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.3.0
>Reporter: Tsuyoshi OZAWA
>Assignee: Hong Zhiguo
> Fix For: 2.4.1
>
> Attachments: YARN-1837.patch
>
>
> TestMoveApplication#testMoveRejectedByScheduler fails because of a 
> NullPointerException. It looks like it's caused by unhandled exceptions on the 
> server side.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1837) TestMoveApplication.testMoveRejectedByScheduler randomly fails

2014-04-04 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960259#comment-13960259
 ] 

Jian He commented on YARN-1837:
---

One more observation: the move is allowed in the SUBMITTED state? Not sure 
whether that's expected or not; it's unrelated to this patch, though.

Checking this in.



> TestMoveApplication.testMoveRejectedByScheduler randomly fails
> --
>
> Key: YARN-1837
> URL: https://issues.apache.org/jira/browse/YARN-1837
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.3.0
>Reporter: Tsuyoshi OZAWA
>Assignee: Hong Zhiguo
> Attachments: YARN-1837.patch
>
>
> TestMoveApplication#testMoveRejectedByScheduler fails because of a 
> NullPointerException. It looks like it's caused by unhandled exceptions on the 
> server side.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1837) TestMoveApplication.testMoveRejectedByScheduler randomly fails

2014-04-04 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960255#comment-13960255
 ] 

Jian He commented on YARN-1837:
---

looks good to me, +1

> TestMoveApplication.testMoveRejectedByScheduler randomly fails
> --
>
> Key: YARN-1837
> URL: https://issues.apache.org/jira/browse/YARN-1837
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.3.0
>Reporter: Tsuyoshi OZAWA
>Assignee: Hong Zhiguo
> Attachments: YARN-1837.patch
>
>
> TestMoveApplication#testMoveRejectedByScheduler fails because of a 
> NullPointerException. It looks like it's caused by unhandled exceptions on the 
> server side.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2014-04-04 Thread Sietse T. Au (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sietse T. Au updated YARN-1902:
---

Description: 
Regarding AMRMClientImpl

Scenario 1:
Given a ContainerRequest x with Resource y, when addContainerRequest is called 
z times with x, allocate is called and at least one of the z allocated 
containers is started, then if another addContainerRequest call is done and 
subsequently an allocate call to the RM, (z+1) containers will be allocated, 
where 1 container is expected.

Scenario 2:
No containers are started between the allocate calls. 

Analyzing debug logs of the AMRMClientImpl, I have found that indeed (z+1) 
containers are requested in both scenarios, but only in the second scenario is 
the correct behavior observed.

Looking at the implementation, I have found that this (z+1) request is caused by 
the structure of the remoteRequestsTable. The consequence of 
Map<Resource, ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
information about whether a request has been sent to the RM yet or not.

There are workarounds for this, such as releasing the excess containers 
received.

The solution implemented is to initialize a new ResourceRequest in 
ResourceRequestInfo when a request has been successfully sent to the RM.

The patch includes a test in which scenario one is tested.
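
A compact illustration of scenario 1 against the synchronous AMRMClient API (resourceY, z, startOneOf, and the priority are placeholders; the actual test in the patch may be structured differently):

{code}
// Illustration of scenario 1 (not the patch's test): request z containers, start
// one of them, then request one more with the same capability and priority.
AMRMClient.ContainerRequest x =
    new AMRMClient.ContainerRequest(resourceY, null, null, Priority.newInstance(0));
for (int i = 0; i < z; i++) {
  amRMClient.addContainerRequest(x);
}
AllocateResponse first = amRMClient.allocate(0.1f);  // eventually yields z containers
startOneOf(first.getAllocatedContainers());          // placeholder: launch one container

amRMClient.addContainerRequest(x);                   // one more request
AllocateResponse second = amRMClient.allocate(0.2f);
// Expected: 1 additional container; observed before the fix: z + 1, because the
// cached ResourceRequest still carries the count that was already sent to the RM.
{code}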

  was:
Regarding AMRMClientImpl

Scenario 1:
Given a ContainerRequest x with Resource y, when addContainerRequest is called 
z times with x, allocate is called and at least one of the z allocated 
containers is started, then if another addContainerRequest call is done and 
subsequently an allocate call to the RM, (z+1) containers will be allocated, 
where 1 container is expected.

Scenario 2:
No containers are started between the allocate calls. 

Analyzing debug logs of the AMRMClientImpl, I have found that indeed (z+1) 
containers are requested in both scenarios, but only in the second scenario is 
the correct behavior observed.

Looking at the implementation, I have found that this (z+1) request is caused by 
the structure of the remoteRequestsTable. The consequence of 
Map<Resource, ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
information about whether a request has been sent to the RM yet or not.

There are workarounds for this, such as releasing the excess containers 
received.

The solution implemented is to initialize a new ResourceRequest in 
ResourceRequestInfo when a request has been successfully sent to the RM.




> Allocation of too many containers when a second request is done with the same 
> resource capability
> -
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Sietse T. Au
>  Labels: patch
> Attachments: YARN-1902.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called and at least one of the z allocated 
> containers is started, then if another addContainerRequest call is done and 
> subsequently an allocate call to the RM, (z+1) containers will be allocated, 
> where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that indeed (z+1) 
> containers are requested in both scenarios, but only in the second scenario is 
> the correct behavior observed.
> Looking at the implementation, I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of 
> Map<Resource, ResourceRequestInfo> is that ResourceRequestInfo does not hold 
> any information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2014-04-04 Thread Sietse T. Au (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sietse T. Au updated YARN-1902:
---

Description: 
Regarding AMRMClientImpl

Scenario 1:
Given a ContainerRequest x with Resource y, when addContainerRequest is called 
z times with x, allocate is called and at least one of the z allocated 
containers is started, then if another addContainerRequest call is done and 
subsequently an allocate call to the RM, (z+1) containers will be allocated, 
where 1 container is expected.

Scenario 2:
No containers are started between the allocate calls. 

Analyzing debug logs of the AMRMClientImpl, I have found that indeed (z+1) 
containers are requested in both scenarios, but only in the second scenario is 
the correct behavior observed.

Looking at the implementation, I have found that this (z+1) request is caused by 
the structure of the remoteRequestsTable. The consequence of 
Map<Resource, ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
information about whether a request has been sent to the RM yet or not.

There are workarounds for this, such as releasing the excess containers 
received.

The solution implemented is to initialize a new ResourceRequest in 
ResourceRequestInfo when a request has been successfully sent to the RM.



  was:
Regarding AMRMClientImpl

Scenario 1:
Given a ContainerRequest x with Resource y, when addContainerRequest is called 
z times with x, allocate is called and at least one of the z allocated 
containers is started, then if another addContainerRequest call is done and 
subsequently an allocate call to the RM, (z+1) containers will be allocated, 
where 1 container is expected.

Scenario 2:
This behavior does not occur when no containers are started between the 
allocate calls. 

Analyzing debug logs of the AMRMClientImpl, I have found that indeed (z+1) 
containers are requested in both scenarios, but only in the second scenario is 
the correct behavior observed.

Looking at the implementation, I have found that this (z+1) request is caused by 
the structure of the remoteRequestsTable. The consequence of 
Map<Resource, ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
information about whether a request has been sent to the RM yet or not.

There are workarounds for this, such as releasing the excess containers 
received.

The solution implemented is to initialize a new ResourceRequest in 
ResourceRequestInfo when a request has been successfully sent to the RM.




> Allocation of too many containers when a second request is done with the same 
> resource capability
> -
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Sietse T. Au
>  Labels: patch
> Attachments: YARN-1902.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called and at least one of the z allocated 
> containers is started, then if another addContainerRequest call is done and 
> subsequently an allocate call to the RM, (z+1) containers will be allocated, 
> where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that indeed (z+1) 
> containers are requested in both scenarios, but only in the second scenario is 
> the correct behavior observed.
> Looking at the implementation, I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of 
> Map<Resource, ResourceRequestInfo> is that ResourceRequestInfo does not hold 
> any information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2014-04-04 Thread Sietse T. Au (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sietse T. Au updated YARN-1902:
---

Attachment: YARN-1902.patch

> Allocation of too many containers when a second request is done with the same 
> resource capability
> -
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Sietse T. Au
>  Labels: patch
> Attachments: YARN-1902.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called and at least one of the z allocated 
> containers is started, then if another addContainerRequest call is done and 
> subsequently an allocate call to the RM, (z+1) containers will be allocated, 
> where 1 container is expected.
> Scenario 2:
> This behavior does not occur when no containers are started between the 
> allocate calls. 
> Analyzing the debug logs of AMRMClientImpl, I have found that (z+1) containers 
> are indeed requested in both scenarios, but that the correct behavior is 
> observed only in the second scenario.
> Looking at the implementation, I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of 
> Map<Resource, ResourceRequestInfo> is that ResourceRequestInfo does not hold 
> any information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1901) All tasks restart during RM failover on Hive

2014-04-04 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960026#comment-13960026
 ] 

Jason Lowe commented on YARN-1901:
--

This appears to be a duplicate of HIVE-6638.  As [~ozawa] mentioned, AMs are 
restarted when the RM restarts until YARN-556 is addressed.  When an AM 
restarts, it is not automatically the case that completed tasks will be 
recovered -- it must be supported by the output committer.  HIVE-6638 is 
updating Hive's OutputCommitter so it can support task recovery upon AM restart.
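
For reference, the hooks involved look roughly like the sketch below. The class 
is purely illustrative -- it is not the HIVE-6638 change itself, only the shape 
of what an OutputCommitter has to provide before a restarted AM can keep the 
output of already-completed tasks:

{code}
import java.io.IOException;

import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class RecoverableCommitterSketch extends OutputCommitter {

  @Override public void setupJob(JobContext jobContext) throws IOException { }
  @Override public void setupTask(TaskAttemptContext taskContext) throws IOException { }
  @Override public boolean needsTaskCommit(TaskAttemptContext taskContext) { return true; }
  @Override public void abortTask(TaskAttemptContext taskContext) throws IOException { }

  @Override
  public void commitTask(TaskAttemptContext taskContext) throws IOException {
    // Move the task output somewhere the restarted AM can find it again.
  }

  // Without this returning true, a restarted AM re-runs completed tasks.
  @Override
  public boolean isRecoverySupported() {
    return true;
  }

  // Called by the restarted AM for tasks that completed before the restart;
  // it must re-establish (or verify) that task's committed output.
  @Override
  public void recoverTask(TaskAttemptContext taskContext) throws IOException {
    // e.g. check that the previously committed output still exists.
  }
}
{code}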

> All tasks restart during RM failover on Hive
> 
>
> Key: YARN-1901
> URL: https://issues.apache.org/jira/browse/YARN-1901
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Fengdong Yu
>
> I built from trunk and configured RM HA, then I submitted a Hive job.
> There are 11 maps in total; I stopped the active RM when 6 maps had finished,
> but Hive shows all the map tasks restarting again. This conflicts with the 
> design description.
> job progress:
> {code}
> 2014-03-31 18:44:14,088 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 713.84 sec
> 2014-03-31 18:44:15,128 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 722.83 sec
> 2014-03-31 18:44:16,160 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 731.95 sec
>  2014-03-31 18:44:17,191 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 744.17 sec
> 2014-03-31 18:44:18,220 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 756.22 sec
> 2014-03-31 18:44:19,250 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 762.4 
> sec
>  2014-03-31 18:44:20,281 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 774.64 sec
> 2014-03-31 18:44:21,306 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU 
> 786.49 sec
> 2014-03-31 18:44:22,334 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU 
> 792.59 sec
>  2014-03-31 18:44:23,363 Stage-1 map = 73%,  reduce = 0%, Cumulative CPU 
> 807.58 sec
> 2014-03-31 18:44:24,392 Stage-1 map = 77%,  reduce = 0%, Cumulative CPU 
> 815.96 sec
> 2014-03-31 18:44:25,416 Stage-1 map = 80%,  reduce = 0%, Cumulative CPU 
> 823.83 sec
>  2014-03-31 18:44:26,443 Stage-1 map = 80%,  reduce = 0%, Cumulative CPU 
> 826.84 sec
> 2014-03-31 18:44:27,472 Stage-1 map = 82%,  reduce = 0%, Cumulative CPU 
> 832.16 sec
> 2014-03-31 18:44:28,501 Stage-1 map = 84%,  reduce = 0%, Cumulative CPU 
> 839.73 sec
>  2014-03-31 18:44:29,531 Stage-1 map = 86%,  reduce = 0%, Cumulative CPU 
> 844.45 sec
> 2014-03-31 18:44:30,564 Stage-1 map = 82%,  reduce = 0%, Cumulative CPU 
> 760.34 sec
> 2014-03-31 18:44:31,728 Stage-1 map = 0%,  reduce = 0%
>  2014-03-31 18:45:06,918 Stage-1 map = 2%,  reduce = 0%, Cumulative CPU 
> 213.81 sec
> 2014-03-31 18:45:07,952 Stage-1 map = 2%,  reduce = 0%, Cumulative CPU 216.83 
> sec
> 2014-03-31 18:45:08,979 Stage-1 map = 7%,  reduce = 0%, Cumulative CPU 229.15 
> sec
>  2014-03-31 18:45:10,007 Stage-1 map = 11%,  reduce = 0%, Cumulative CPU 
> 244.42 sec
> 2014-03-31 18:45:11,040 Stage-1 map = 14%,  reduce = 0%, Cumulative CPU 
> 247.31 sec
> 2014-03-31 18:45:12,072 Stage-1 map = 18%,  reduce = 0%, Cumulative CPU 259.5 
> sec
>  2014-03-31 18:45:13,105 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 274.72 sec
> 2014-03-31 18:45:14,135 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 280.76 sec
> 2014-03-31 18:45:15,170 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 292.9 
> sec
>  2014-03-31 18:45:16,202 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 305.16 sec
> 2014-03-31 18:45:17,233 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 314.21 sec
> 2014-03-31 18:45:18,264 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 323.34 sec
>  2014-03-31 18:45:19,294 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 335.6 sec
> 2014-03-31 18:45:20,325 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 344.71 sec
> 2014-03-31 18:45:21,355 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 353.8 
> sec
>  2014-03-31 18:45:22,385 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 366.06 sec
> 2014-03-31 18:45:23,415 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 375.2 
> sec
> 2014-03-31 18:45:24,449 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 384.28 sec
> {code}
> I am using hive-0.12.0, and ZKRMStateStore as the RM store class. Hive is using 
> a simple external table (only one column).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2014-04-04 Thread Sietse T. Au (JIRA)
Sietse T. Au created YARN-1902:
--

 Summary: Allocation of too many containers when a second request 
is done with the same resource capability
 Key: YARN-1902
 URL: https://issues.apache.org/jira/browse/YARN-1902
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Affects Versions: 2.3.0, 2.2.0
Reporter: Sietse T. Au


Regarding AMRMClientImpl

Scenario 1:
Given a ContainerRequest x with Resource y, when addContainerRequest is called 
z times with x, allocate is called and at least one of the z allocated 
containers is started, then if another addContainerRequest call is done and 
subsequently an allocate call to the RM, (z+1) containers will be allocated, 
where 1 container is expected.

Scenario 2:
This behavior does not occur when no containers are started between the 
allocate calls. 

Analyzing the debug logs of AMRMClientImpl, I have found that (z+1) containers 
are indeed requested in both scenarios, but that the correct behavior is 
observed only in the second scenario.

Looking at the implementation, I have found that this (z+1) request is caused by 
the structure of the remoteRequestsTable. The consequence of 
Map<Resource, ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
information about whether a request has been sent to the RM yet or not.

There are workarounds for this, such as releasing the excess containers 
received.

The solution implemented is to initialize a new ResourceRequest in 
ResourceRequestInfo when a request has been successfully sent to the RM.
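
To illustrate only the direction of the proposed fix (this is a toy model of 
the bookkeeping idea above, not the real AMRMClientImpl structures and not a 
patch), tracking what has already been sent to the RM could look like:

{code}
import java.util.HashMap;
import java.util.Map;

public class RemoteRequestsSketch {

  static class ResourceRequestInfoSketch {
    int asked; // accumulated by addContainerRequest
    int sent;  // already shipped to the RM by a successful allocate
  }

  // Stands in for Map<Resource, ResourceRequestInfo>; a String key keeps the sketch small.
  private final Map<String, ResourceRequestInfoSketch> table =
      new HashMap<String, ResourceRequestInfoSketch>();

  public void addContainerRequest(String capability) {
    ResourceRequestInfoSketch info = table.get(capability);
    if (info == null) {
      info = new ResourceRequestInfoSketch();
      table.put(capability, info);
    }
    info.asked++;
  }

  // What the next allocate call still needs to ask for: only the not-yet-sent part.
  public int pendingAsk(String capability) {
    ResourceRequestInfoSketch info = table.get(capability);
    return info == null ? 0 : info.asked - info.sent;
  }

  // Models the "initialize a new ResourceRequest once it has been sent" step
  // from the description: remember what was sent so later requests only add a delta.
  public void onSuccessfulAllocate(String capability) {
    ResourceRequestInfoSketch info = table.get(capability);
    if (info != null) {
      info.sent = info.asked;
    }
  }
}
{code}

Under this toy model, z addContainerRequest calls, one successful allocate, and 
one more addContainerRequest leave pendingAsk at 1 rather than (z+1).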





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1901) All tasks restart during RM failover on Hive

2014-04-04 Thread Fengdong Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959815#comment-13959815
 ] 

Fengdong Yu commented on YARN-1901:
---

Hi [~ozawa],
Could you search the yarn-dev mailing list? I sent a mail about this issue.

This issue only occurs for Hive jobs. It works well for general MR jobs (only 
unfinished tasks restart; finished tasks are not re-run).


> All tasks restart during RM failover on Hive
> 
>
> Key: YARN-1901
> URL: https://issues.apache.org/jira/browse/YARN-1901
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Fengdong Yu
>
> I built from trunk and configured RM HA, then I submitted a Hive job.
> There are 11 maps in total; I stopped the active RM when 6 maps had finished,
> but Hive shows all the map tasks restarting again. This conflicts with the 
> design description.
> job progress:
> {code}
> 2014-03-31 18:44:14,088 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 713.84 sec
> 2014-03-31 18:44:15,128 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 722.83 sec
> 2014-03-31 18:44:16,160 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 731.95 sec
>  2014-03-31 18:44:17,191 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 744.17 sec
> 2014-03-31 18:44:18,220 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 756.22 sec
> 2014-03-31 18:44:19,250 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 762.4 
> sec
>  2014-03-31 18:44:20,281 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 774.64 sec
> 2014-03-31 18:44:21,306 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU 
> 786.49 sec
> 2014-03-31 18:44:22,334 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU 
> 792.59 sec
>  2014-03-31 18:44:23,363 Stage-1 map = 73%,  reduce = 0%, Cumulative CPU 
> 807.58 sec
> 2014-03-31 18:44:24,392 Stage-1 map = 77%,  reduce = 0%, Cumulative CPU 
> 815.96 sec
> 2014-03-31 18:44:25,416 Stage-1 map = 80%,  reduce = 0%, Cumulative CPU 
> 823.83 sec
>  2014-03-31 18:44:26,443 Stage-1 map = 80%,  reduce = 0%, Cumulative CPU 
> 826.84 sec
> 2014-03-31 18:44:27,472 Stage-1 map = 82%,  reduce = 0%, Cumulative CPU 
> 832.16 sec
> 2014-03-31 18:44:28,501 Stage-1 map = 84%,  reduce = 0%, Cumulative CPU 
> 839.73 sec
>  2014-03-31 18:44:29,531 Stage-1 map = 86%,  reduce = 0%, Cumulative CPU 
> 844.45 sec
> 2014-03-31 18:44:30,564 Stage-1 map = 82%,  reduce = 0%, Cumulative CPU 
> 760.34 sec
> 2014-03-31 18:44:31,728 Stage-1 map = 0%,  reduce = 0%
>  2014-03-31 18:45:06,918 Stage-1 map = 2%,  reduce = 0%, Cumulative CPU 
> 213.81 sec
> 2014-03-31 18:45:07,952 Stage-1 map = 2%,  reduce = 0%, Cumulative CPU 216.83 
> sec
> 2014-03-31 18:45:08,979 Stage-1 map = 7%,  reduce = 0%, Cumulative CPU 229.15 
> sec
>  2014-03-31 18:45:10,007 Stage-1 map = 11%,  reduce = 0%, Cumulative CPU 
> 244.42 sec
> 2014-03-31 18:45:11,040 Stage-1 map = 14%,  reduce = 0%, Cumulative CPU 
> 247.31 sec
> 2014-03-31 18:45:12,072 Stage-1 map = 18%,  reduce = 0%, Cumulative CPU 259.5 
> sec
>  2014-03-31 18:45:13,105 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 274.72 sec
> 2014-03-31 18:45:14,135 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 280.76 sec
> 2014-03-31 18:45:15,170 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 292.9 
> sec
>  2014-03-31 18:45:16,202 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 305.16 sec
> 2014-03-31 18:45:17,233 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 314.21 sec
> 2014-03-31 18:45:18,264 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 323.34 sec
>  2014-03-31 18:45:19,294 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 335.6 sec
> 2014-03-31 18:45:20,325 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 344.71 sec
> 2014-03-31 18:45:21,355 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 353.8 
> sec
>  2014-03-31 18:45:22,385 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 366.06 sec
> 2014-03-31 18:45:23,415 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 375.2 
> sec
> 2014-03-31 18:45:24,449 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 384.28 sec
> {code}
> I am using hive-0.12.0, and ZKRMStateStore as the RM store class. Hive is using 
> a simple external table (only one column).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1901) All tasks restart during RM failover on Hive

2014-04-04 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959731#comment-13959731
 ] 

Tsuyoshi OZAWA commented on YARN-1901:
--

Sorry, I made a typo that may confuse you.

The current Hadoop supports: "RM can be able to continue running existing 
applications on cluster after the RM has been restarted. Clients should not 
have to re-submit currently running/submitted apps."

Work-preserving restart is under development in YARN-556. 

> All tasks restart during RM failover on Hive
> 
>
> Key: YARN-1901
> URL: https://issues.apache.org/jira/browse/YARN-1901
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Fengdong Yu
>
> I built from trunk and configured RM HA, then I submitted a Hive job.
> There are 11 maps in total; I stopped the active RM when 6 maps had finished,
> but Hive shows all the map tasks restarting again. This conflicts with the 
> design description.
> job progress:
> {code}
> 2014-03-31 18:44:14,088 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 713.84 sec
> 2014-03-31 18:44:15,128 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 722.83 sec
> 2014-03-31 18:44:16,160 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 731.95 sec
>  2014-03-31 18:44:17,191 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 744.17 sec
> 2014-03-31 18:44:18,220 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 756.22 sec
> 2014-03-31 18:44:19,250 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 762.4 
> sec
>  2014-03-31 18:44:20,281 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 774.64 sec
> 2014-03-31 18:44:21,306 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU 
> 786.49 sec
> 2014-03-31 18:44:22,334 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU 
> 792.59 sec
>  2014-03-31 18:44:23,363 Stage-1 map = 73%,  reduce = 0%, Cumulative CPU 
> 807.58 sec
> 2014-03-31 18:44:24,392 Stage-1 map = 77%,  reduce = 0%, Cumulative CPU 
> 815.96 sec
> 2014-03-31 18:44:25,416 Stage-1 map = 80%,  reduce = 0%, Cumulative CPU 
> 823.83 sec
>  2014-03-31 18:44:26,443 Stage-1 map = 80%,  reduce = 0%, Cumulative CPU 
> 826.84 sec
> 2014-03-31 18:44:27,472 Stage-1 map = 82%,  reduce = 0%, Cumulative CPU 
> 832.16 sec
> 2014-03-31 18:44:28,501 Stage-1 map = 84%,  reduce = 0%, Cumulative CPU 
> 839.73 sec
>  2014-03-31 18:44:29,531 Stage-1 map = 86%,  reduce = 0%, Cumulative CPU 
> 844.45 sec
> 2014-03-31 18:44:30,564 Stage-1 map = 82%,  reduce = 0%, Cumulative CPU 
> 760.34 sec
> 2014-03-31 18:44:31,728 Stage-1 map = 0%,  reduce = 0%
>  2014-03-31 18:45:06,918 Stage-1 map = 2%,  reduce = 0%, Cumulative CPU 
> 213.81 sec
> 2014-03-31 18:45:07,952 Stage-1 map = 2%,  reduce = 0%, Cumulative CPU 216.83 
> sec
> 2014-03-31 18:45:08,979 Stage-1 map = 7%,  reduce = 0%, Cumulative CPU 229.15 
> sec
>  2014-03-31 18:45:10,007 Stage-1 map = 11%,  reduce = 0%, Cumulative CPU 
> 244.42 sec
> 2014-03-31 18:45:11,040 Stage-1 map = 14%,  reduce = 0%, Cumulative CPU 
> 247.31 sec
> 2014-03-31 18:45:12,072 Stage-1 map = 18%,  reduce = 0%, Cumulative CPU 259.5 
> sec
>  2014-03-31 18:45:13,105 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 274.72 sec
> 2014-03-31 18:45:14,135 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 280.76 sec
> 2014-03-31 18:45:15,170 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 292.9 
> sec
>  2014-03-31 18:45:16,202 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 305.16 sec
> 2014-03-31 18:45:17,233 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 314.21 sec
> 2014-03-31 18:45:18,264 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 323.34 sec
>  2014-03-31 18:45:19,294 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 335.6 sec
> 2014-03-31 18:45:20,325 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 344.71 sec
> 2014-03-31 18:45:21,355 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 353.8 
> sec
>  2014-03-31 18:45:22,385 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 366.06 sec
> 2014-03-31 18:45:23,415 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 375.2 
> sec
> 2014-03-31 18:45:24,449 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 384.28 sec
> {code}
> I am using hive-0.12.0, and ZKRMStateStore as the RM store class. Hive is using 
> a simple external table (only one column).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1901) All tasks restart during RM failover on Hive

2014-04-04 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959726#comment-13959726
 ] 

Tsuyoshi OZAWA commented on YARN-1901:
--

Thank you for reporting, [~azuryy]. Currently, the AM restarts after the RM is 
restarted. To address the problem, there is a discussion under YARN-556.

The current Hadoop supports: "RM can be able to continue running existing 
applications on cluster after the RM has been restarted." For more detail, 
please see the design note on YARN-128:
https://issues.apache.org/jira/secure/attachment/12552867/RMRestartPhase1.pdf
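
For completeness, the restart behavior above relies on RM recovery being enabled 
and a state store being configured. A minimal sketch of the relevant settings is 
below (values are illustrative; in practice these keys live in yarn-site.xml 
rather than being set programmatically):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RmRestartConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    // Enable RM recovery and point it at a ZooKeeper-backed state store.
    conf.setBoolean("yarn.resourcemanager.recovery.enabled", true);
    conf.set("yarn.resourcemanager.store.class",
        "org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore");
    conf.set("yarn.resourcemanager.zk-address", "zk1:2181,zk2:2181,zk3:2181");

    System.out.println("recovery enabled = "
        + conf.getBoolean("yarn.resourcemanager.recovery.enabled", false));
  }
}
{code}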

> All tasks restart during RM failover on Hive
> 
>
> Key: YARN-1901
> URL: https://issues.apache.org/jira/browse/YARN-1901
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Fengdong Yu
>
> I built from trunk and configured RM HA, then I submitted a Hive job.
> There are 11 maps in total; I stopped the active RM when 6 maps had finished,
> but Hive shows all the map tasks restarting again. This conflicts with the 
> design description.
> job progress:
> {code}
> 2014-03-31 18:44:14,088 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 713.84 sec
> 2014-03-31 18:44:15,128 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 722.83 sec
> 2014-03-31 18:44:16,160 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 731.95 sec
>  2014-03-31 18:44:17,191 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 744.17 sec
> 2014-03-31 18:44:18,220 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 756.22 sec
> 2014-03-31 18:44:19,250 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 762.4 
> sec
>  2014-03-31 18:44:20,281 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
> 774.64 sec
> 2014-03-31 18:44:21,306 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU 
> 786.49 sec
> 2014-03-31 18:44:22,334 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU 
> 792.59 sec
>  2014-03-31 18:44:23,363 Stage-1 map = 73%,  reduce = 0%, Cumulative CPU 
> 807.58 sec
> 2014-03-31 18:44:24,392 Stage-1 map = 77%,  reduce = 0%, Cumulative CPU 
> 815.96 sec
> 2014-03-31 18:44:25,416 Stage-1 map = 80%,  reduce = 0%, Cumulative CPU 
> 823.83 sec
>  2014-03-31 18:44:26,443 Stage-1 map = 80%,  reduce = 0%, Cumulative CPU 
> 826.84 sec
> 2014-03-31 18:44:27,472 Stage-1 map = 82%,  reduce = 0%, Cumulative CPU 
> 832.16 sec
> 2014-03-31 18:44:28,501 Stage-1 map = 84%,  reduce = 0%, Cumulative CPU 
> 839.73 sec
>  2014-03-31 18:44:29,531 Stage-1 map = 86%,  reduce = 0%, Cumulative CPU 
> 844.45 sec
> 2014-03-31 18:44:30,564 Stage-1 map = 82%,  reduce = 0%, Cumulative CPU 
> 760.34 sec
> 2014-03-31 18:44:31,728 Stage-1 map = 0%,  reduce = 0%
>  2014-03-31 18:45:06,918 Stage-1 map = 2%,  reduce = 0%, Cumulative CPU 
> 213.81 sec
> 2014-03-31 18:45:07,952 Stage-1 map = 2%,  reduce = 0%, Cumulative CPU 216.83 
> sec
> 2014-03-31 18:45:08,979 Stage-1 map = 7%,  reduce = 0%, Cumulative CPU 229.15 
> sec
>  2014-03-31 18:45:10,007 Stage-1 map = 11%,  reduce = 0%, Cumulative CPU 
> 244.42 sec
> 2014-03-31 18:45:11,040 Stage-1 map = 14%,  reduce = 0%, Cumulative CPU 
> 247.31 sec
> 2014-03-31 18:45:12,072 Stage-1 map = 18%,  reduce = 0%, Cumulative CPU 259.5 
> sec
>  2014-03-31 18:45:13,105 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 274.72 sec
> 2014-03-31 18:45:14,135 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 280.76 sec
> 2014-03-31 18:45:15,170 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 292.9 
> sec
>  2014-03-31 18:45:16,202 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 305.16 sec
> 2014-03-31 18:45:17,233 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 314.21 sec
> 2014-03-31 18:45:18,264 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 323.34 sec
>  2014-03-31 18:45:19,294 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 335.6 sec
> 2014-03-31 18:45:20,325 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 344.71 sec
> 2014-03-31 18:45:21,355 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 353.8 
> sec
>  2014-03-31 18:45:22,385 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 366.06 sec
> 2014-03-31 18:45:23,415 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 375.2 
> sec
> 2014-03-31 18:45:24,449 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
> 384.28 sec
> {code}
> I am using hive-0.12.0, and ZKRMStateStore as the RM store class. Hive is using 
> a simple external table (only one column).



--
This message was sent by Atlassian JIRA
(v6.2#6252)