[jira] [Commented] (YARN-4165) An outstanding container request makes all nodes to be reserved causing all jobs pending
[ https://issues.apache.org/jira/browse/YARN-4165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904033#comment-14904033 ] Weiwei Yang commented on YARN-4165: --- Thanks Jason, this doesn't look like YARN-957; the reserved memory was less than what the node manager reports. > An outstanding container request makes all nodes to be reserved causing all > jobs pending > > > Key: YARN-4165 > URL: https://issues.apache.org/jira/browse/YARN-4165 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler, resourcemanager, scheduler >Affects Versions: 2.7.1 >Reporter: Weiwei Yang >Assignee: Weiwei Yang > > We have a long running service in YARN, it has a outstanding container > request that YARN cannot satisfy (require more memory that nodemanager can > supply). Then YARN reserves all nodes for this application, when I submit > other jobs (require relative small memory that nodemanager can supply), all > jobs are pending because YARN skips scheduling containers on the nodes that > have been reserved. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4165) An outstanding container request makes all nodes to be reserved causing all jobs pending
[ https://issues.apache.org/jira/browse/YARN-4165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904028#comment-14904028 ] Weiwei Yang commented on YARN-4165: --- Hello Jason, thanks for looking into this. I checked YARN-957 but I think this is a different problem. I have 3 nodes: NM1 8G, NM2 8G, NM3 8G. I submitted an application that requires 4 containers, each with relatively large memory (5G), and its app master requires 1G. RM placed 3 containers and 1 app master but left 1 outstanding request, and *unexpectedly* RM reserved 1 container on all 3 nodes: NM1 - 1 container, 1 app master - 6G used - 2G left - 5G reserved; NM2 - 1 container - 5G used - 3G left - 5G reserved; NM3 - 1 container - 5G used - 3G left - 5G reserved. I am not sure yet why we run into this situation, but it might be related to YARN-1769. I am still investigating; if you have any pointers or comments, please let me know. Thanks. > An outstanding container request makes all nodes to be reserved causing all > jobs pending > > > Key: YARN-4165 > URL: https://issues.apache.org/jira/browse/YARN-4165 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler, resourcemanager, scheduler >Affects Versions: 2.7.1 >Reporter: Weiwei Yang >Assignee: Weiwei Yang > > We have a long running service in YARN, it has a outstanding container > request that YARN cannot satisfy (require more memory that nodemanager can > supply). Then YARN reserves all nodes for this application, when I submit > other jobs (require relative small memory that nodemanager can supply), all > jobs are pending because YARN skips scheduling containers on the nodes that > have been reserved. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4075) [reader REST API] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903979#comment-14903979 ] Varun Saxena commented on YARN-4075: bq. With proper overloading I don't see significant code duplication problems. Yeah, there aren't too many code duplication problems. It's just that the methods have so many parameters that the code itself looks quite big. Anyway, that's not a major issue. bq. TimelineEntities is not only used by the reader. It is also used by the writer and aggregation logic. Enforcing an order on this class will introduce unnecessary overhead to both writers and aggregators. If the reader needs it, we should derive it and make an ordered version, if possible. Yes, that's a fair point. We can use setEntities on the read path. Explicitly defining a derived class for readers might be an option as well (say, something like SortedTimelineEntities). I think we should make this return type explicit so that reader implementations use it. > [reader REST API] implement support for querying for flows and flow runs > > > Key: YARN-4075 > URL: https://issues.apache.org/jira/browse/YARN-4075 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4075-YARN-2928.POC.1.patch, > YARN-4075-YARN-2928.POC.2.patch > > > We need to be able to query for flows and flow runs via REST. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
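For illustration, a minimal sketch of the derived ordered-entities idea floated above. The class name SortedTimelineEntities, the TreeSet-with-comparator approach, and the TimelineEntity import path are assumptions based on this discussion, not existing code.
{code}
import java.util.Collections;
import java.util.Comparator;
import java.util.NavigableSet;
import java.util.Set;
import java.util.TreeSet;

import org.apache.hadoop.yarn.api.records.timelineservice.TimelineEntity;

// Hypothetical reader-side variant of TimelineEntities that keeps entities
// ordered by created time (newest first), so REST responses preserve order.
public class SortedTimelineEntities {
  private final NavigableSet<TimelineEntity> entities =
      new TreeSet<TimelineEntity>(new Comparator<TimelineEntity>() {
        @Override
        public int compare(TimelineEntity e1, TimelineEntity e2) {
          int cmp = Long.compare(e2.getCreatedTime(), e1.getCreatedTime()); // descending by created time
          return cmp != 0 ? cmp : e1.getId().compareTo(e2.getId());         // break ties by entity id
        }
      });

  public void addEntity(TimelineEntity entity) {
    entities.add(entity);
  }

  public Set<TimelineEntity> getEntities() {
    return Collections.unmodifiableSet(entities);
  }
}
{code}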
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903893#comment-14903893 ] Naganarasimha G R commented on YARN-1994: - Thanks for the explanation [~arpitagarwal]. I had the same feeling that a boolean configuration like NM_BIND_WILDCARD would have sufficed, but I thought I might be missing something, hence the query. Maybe we can capture the explanation you gave in the documentation jira YARN-2384 too? > Expose YARN/MR endpoints on multiple interfaces > --- > > Key: YARN-1994 > URL: https://issues.apache.org/jira/browse/YARN-1994 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager, webapp >Affects Versions: 2.4.0 >Reporter: Arpit Agarwal >Assignee: Craig Welch > Fix For: 2.6.0 > > Attachments: YARN-1994.0.patch, YARN-1994.1.patch, > YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.12.patch, > YARN-1994.13.patch, YARN-1994.14.patch, YARN-1994.15-branch2.patch, > YARN-1994.15.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, > YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch > > > YARN and MapReduce daemons currently do not support specifying a wildcard > address for the server endpoints. This prevents the endpoints from being > accessible from all interfaces on a multihomed machine. > Note that if we do specify INADDR_ANY for any of the options, it will break > clients as they will attempt to connect to 0.0.0.0. We need a solution that > allows specifying a hostname or IP-address for clients while requesting > wildcard bind for the servers. > (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4157) Merge YARN-1197 back to trunk
[ https://issues.apache.org/jira/browse/YARN-4157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903832#comment-14903832 ] Hadoop QA commented on YARN-4157: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 24m 29s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 59 new or modified test files. | | {color:green}+1{color} | javac | 7m 55s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 2s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 3m 41s | The applied patch generated 7 new checkstyle issues (total was 29, now 27). | | {color:red}-1{color} | whitespace | 284m 14s | The patch has 180 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 31s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 35s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 9m 42s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | mapreduce tests | 9m 34s | Tests passed in hadoop-mapreduce-client-app. | | {color:green}+1{color} | tools/hadoop tests | 0m 52s | Tests passed in hadoop-sls. | | {color:green}+1{color} | yarn tests | 0m 23s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 6m 55s | Tests passed in hadoop-yarn-client. | | {color:green}+1{color} | yarn tests | 2m 2s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 0m 25s | Tests passed in hadoop-yarn-server-common. | | {color:green}+1{color} | yarn tests | 8m 27s | Tests passed in hadoop-yarn-server-nodemanager. | | {color:green}+1{color} | yarn tests | 55m 49s | Tests passed in hadoop-yarn-server-resourcemanager. 
| | | | 428m 10s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12761700/YARN-1197.diff.6.patch | | Optional Tests | javac unit findbugs checkstyle javadoc | | git revision | trunk / cc2b473 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/9234/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/9234/artifact/patchprocess/whitespace.txt | | hadoop-mapreduce-client-app test log | https://builds.apache.org/job/PreCommit-YARN-Build/9234/artifact/patchprocess/testrun_hadoop-mapreduce-client-app.txt | | hadoop-sls test log | https://builds.apache.org/job/PreCommit-YARN-Build/9234/artifact/patchprocess/testrun_hadoop-sls.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/9234/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-client test log | https://builds.apache.org/job/PreCommit-YARN-Build/9234/artifact/patchprocess/testrun_hadoop-yarn-client.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/9234/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/9234/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9234/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9234/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9234/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9234/console | This message was automatically generated. > Merge YARN-1197 back to trunk > - > > Key: YARN-4157 > URL: https://issues.apache.org/jira/browse/YARN-4157 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-1197.diff.1.patch, YARN-1197.diff.2.patch, > YARN-1197.diff.3.patch, YARN-1197.diff.4.patch, YARN-1197.diff.5.patch, > YARN-1197.diff.6.patch > > > The purpose of this jira is to generate a uber patch from c
[jira] [Commented] (YARN-3367) Replace starting a separate thread for post entity with event loop in TimelineClient
[ https://issues.apache.org/jira/browse/YARN-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903754#comment-14903754 ] Naganarasimha G R commented on YARN-3367: - Thanks [~gtCarrera9] for looking into this. There were a lot of open questions for this jira from my end, which I mentioned [earlier|https://issues.apache.org/jira/browse/YARN-3367?focusedCommentId=14732065&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14732065]; hence I just started with an initial approach. Mainly, to refactor and reuse the existing {{AsyncDispatcher}} we need to discuss the specific points below: {quote} * 3 Is it important to maintain the order of events which are sent from sync and async calls? i.e. is it required to ensure all the async events are also pushed along with the current sync event, or is it ok to send only the sync one? (The current patch just ensures async events are in order.) * 4 Is it required to merge entities of multiple async calls, as they belong to the same application? {quote} Once these are concluded I can analyze further and report back. > Replace starting a separate thread for post entity with event loop in > TimelineClient > > > Key: YARN-3367 > URL: https://issues.apache.org/jira/browse/YARN-3367 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Junping Du >Assignee: Naganarasimha G R > Attachments: YARN-3367.YARN-2928.001.patch > > > Since YARN-3039, we add loop in TimelineClient to wait for > collectorServiceAddress ready before posting any entity. In consumer of > TimelineClient (like AM), we are starting a new thread for each call to get > rid of potential deadlock in main thread. This way has at least 3 major > defects: > 1. The consumer need some additional code to wrap a thread before calling > putEntities() in TimelineClient. > 2. It cost many thread resources which is unnecessary. > 3. The sequence of events could be out of order because each posting > operation thread get out of waiting loop randomly. > We should have something like event loop in TimelineClient side, > putEntities() only put related entities into a queue of entities and a > separated thread handle to deliver entities in queue to collector via REST > call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
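For reference, a minimal sketch of the queue-plus-single-dispatcher-thread design described in this jira. The class and method names below are illustrative stand-ins rather than the actual TimelineClient API, and the REST delivery is stubbed out.
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative only: putEntitiesAsync() just enqueues; one dispatcher thread
// drains the queue in FIFO order and performs the (stubbed) delivery, so
// callers never block and event order is preserved.
public class TimelineEventLoopSketch {
  private final BlockingQueue<Object> queue = new LinkedBlockingQueue<Object>();
  private volatile boolean stopped = false;

  private final Thread dispatcher = new Thread(new Runnable() {
    @Override
    public void run() {
      while (!stopped) {
        try {
          Object entities = queue.take();   // blocks until work arrives, keeps FIFO order
          postToCollector(entities);        // stub for the REST call to the collector
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          break;
        }
      }
    }
  }, "timeline-entity-dispatcher");

  public void start() {
    dispatcher.start();
  }

  public void putEntitiesAsync(Object entities) {
    queue.offer(entities);                  // non-blocking for the caller
  }

  public void stop() throws InterruptedException {
    stopped = true;
    dispatcher.interrupt();
    dispatcher.join();
  }

  private void postToCollector(Object entities) {
    // In the real client this would be a REST call to the collector.
    System.out.println("delivering " + entities);
  }
}
{code}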
[jira] [Commented] (YARN-4199) Minimize lock time in LeveldbTimelineStore.discardOldEntities
[ https://issues.apache.org/jira/browse/YARN-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903729#comment-14903729 ] Shiwei Guo commented on YARN-4199: -- Sorry, I hadn't noticed [YARN-3448|https://issues.apache.org/jira/browse/YARN-3448] before. I think [YARN-3448|https://issues.apache.org/jira/browse/YARN-3448] solves the problem in a better way, so I have marked this issue as a duplicate of [YARN-3448|https://issues.apache.org/jira/browse/YARN-3448]. Thanks for the reminder. > Minimize lock time in LeveldbTimelineStore.discardOldEntities > - > > Key: YARN-4199 > URL: https://issues.apache.org/jira/browse/YARN-4199 > Project: Hadoop YARN > Issue Type: Improvement > Components: timelineserver, yarn >Reporter: Shiwei Guo > > In current implementation, LeveldbTimelineStore.discardOldEntities holds a > writeLock on deleteLock, which will block other put operation, which > eventually block the execution of YARN jobs(e.g. TEZ). When there is lots of > history jobs in timelinestore, the block time will be very long. In our > observation, it block all the TEZ jobs for several hours or longer. > The possible solutions are: > - Optimize leveldb configuration, so a full scan won't take long time. > - Take a snapshot of leveldb, and scan the snapshot, so we only need to hold > lock while getSnapshot. One question is that whether snapshot will take long > time or not, cause I have no experience with leveldb. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
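For context, a rough sketch of the snapshot idea from the description, written against the org.iq80.leveldb API. The lock object, key layout, and age check are placeholders; the point is that only the cheap getSnapshot() call needs to hold the write lock, while the long scan runs against the immutable snapshot.
{code}
import java.util.concurrent.locks.ReentrantReadWriteLock;

import org.iq80.leveldb.DB;
import org.iq80.leveldb.DBIterator;
import org.iq80.leveldb.ReadOptions;
import org.iq80.leveldb.Snapshot;

public class SnapshotScanSketch {
  private final DB db;
  private final ReentrantReadWriteLock deleteLock = new ReentrantReadWriteLock();

  public SnapshotScanSketch(DB db) {
    this.db = db;
  }

  public void discardOldEntities(long retentionThresholdMillis) throws Exception {
    Snapshot snapshot;
    deleteLock.writeLock().lock();
    try {
      // Only the snapshot creation happens under the lock; it is a cheap operation.
      snapshot = db.getSnapshot();
    } finally {
      deleteLock.writeLock().unlock();
    }

    ReadOptions readOptions = new ReadOptions().snapshot(snapshot);
    DBIterator iterator = db.iterator(readOptions);
    try {
      // The long scan reads from the immutable snapshot, so concurrent puts are not blocked.
      for (iterator.seekToFirst(); iterator.hasNext(); iterator.next()) {
        byte[] key = iterator.peekNext().getKey();
        // ... decide from 'key' whether the entity is older than retentionThresholdMillis
        //     and schedule it for deletion outside this scan ...
      }
    } finally {
      iterator.close();
      snapshot.close();
    }
  }
}
{code}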
[jira] [Resolved] (YARN-4199) Minimize lock time in LeveldbTimelineStore.discardOldEntities
[ https://issues.apache.org/jira/browse/YARN-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiwei Guo resolved YARN-4199. -- Resolution: Duplicate > Minimize lock time in LeveldbTimelineStore.discardOldEntities > - > > Key: YARN-4199 > URL: https://issues.apache.org/jira/browse/YARN-4199 > Project: Hadoop YARN > Issue Type: Improvement > Components: timelineserver, yarn >Reporter: Shiwei Guo > > In current implementation, LeveldbTimelineStore.discardOldEntities holds a > writeLock on deleteLock, which will block other put operation, which > eventually block the execution of YARN jobs(e.g. TEZ). When there is lots of > history jobs in timelinestore, the block time will be very long. In our > observation, it block all the TEZ jobs for several hours or longer. > The possible solutions are: > - Optimize leveldb configuration, so a full scan won't take long time. > - Take a snapshot of leveldb, and scan the snapshot, so we only need to hold > lock while getSnapshot. One question is that whether snapshot will take long > time or not, cause I have no experience with leveldb. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903614#comment-14903614 ] Arpit Agarwal commented on YARN-1994: - bq. Is it assumed that NM_BIND_HOST is configured to specific IP then NM_ADDRESS is also configured to the same IP ? Hi [~Naganarasimha], if NM_BIND_HOST is an IP address other than 0.0.0.0, then NM_ADDRESS should be set to a host that resolves to that address. Think of NM_BIND_HOST as the server side setting and NM_ADDRESS as a client side setting. If they are different the client cannot connect. I don't think we have tested setting NM_BIND_HOST to anything other than 0.0.0.0. In hindsight it may have been simpler to expose a boolean setting like NM_BIND_WILDCARD. bq. May be this a layman question why is it required to bind to all/multiple interfaces ? Depending on the routing and DNS configs, the client may connect on a different interface than the one bound by the server. Listening on all interfaces ensures connectivity. > Expose YARN/MR endpoints on multiple interfaces > --- > > Key: YARN-1994 > URL: https://issues.apache.org/jira/browse/YARN-1994 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager, webapp >Affects Versions: 2.4.0 >Reporter: Arpit Agarwal >Assignee: Craig Welch > Fix For: 2.6.0 > > Attachments: YARN-1994.0.patch, YARN-1994.1.patch, > YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.12.patch, > YARN-1994.13.patch, YARN-1994.14.patch, YARN-1994.15-branch2.patch, > YARN-1994.15.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, > YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch > > > YARN and MapReduce daemons currently do not support specifying a wildcard > address for the server endpoints. This prevents the endpoints from being > accessible from all interfaces on a multihomed machine. > Note that if we do specify INADDR_ANY for any of the options, it will break > clients as they will attempt to connect to 0.0.0.0. We need a solution that > allows specifying a hostname or IP-address for clients while requesting > wildcard bind for the servers. > (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
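A small illustration of the server-side vs. client-side split described above, using the Hadoop {{Configuration}} API. The property names {{yarn.nodemanager.bind-host}} and {{yarn.nodemanager.address}} correspond to NM_BIND_HOST and NM_ADDRESS; the host name and port below are made-up example values.
{code}
import org.apache.hadoop.conf.Configuration;

public class NodeManagerBindHostExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Server side: bind the NM endpoint to all interfaces (wildcard).
    conf.set("yarn.nodemanager.bind-host", "0.0.0.0");

    // Client side: the address clients resolve and connect to; must not be 0.0.0.0.
    conf.set("yarn.nodemanager.address", "nm-host.example.com:45454");

    System.out.println("bind host      = " + conf.get("yarn.nodemanager.bind-host"));
    System.out.println("client address = " + conf.get("yarn.nodemanager.address"));
  }
}
{code}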
[jira] [Commented] (YARN-3975) WebAppProxyServlet should not redirect to RM page if AHS is enabled
[ https://issues.apache.org/jira/browse/YARN-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903569#comment-14903569 ] Jason Lowe commented on YARN-3975: -- Latest patch looks good to me, however it does not apply cleanly to branch-2.7. Could you provide a branch-2.7 patch as well? > WebAppProxyServlet should not redirect to RM page if AHS is enabled > --- > > Key: YARN-3975 > URL: https://issues.apache.org/jira/browse/YARN-3975 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.1 >Reporter: Mit Desai >Assignee: Mit Desai > Attachments: YARN-3975.2.b2.patch, YARN-3975.3.patch, > YARN-3975.4.patch, YARN-3975.5.patch, YARN-3975.6.patch, YARN-3975.7.patch, > YARN-3975.8.patch, YARN-3975.9.patch > > > WebAppProxyServlet should be updated to handle the case when the appreport > doesn't have a tracking URL and the Application History Server is eanbled. > As we would have already tried the RM and got the > ApplicationNotFoundException we should not direct the user to the RM app page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3367) Replace starting a separate thread for post entity with event loop in TimelineClient
[ https://issues.apache.org/jira/browse/YARN-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903527#comment-14903527 ] Li Lu commented on YARN-3367: - Also, I looked at the patch. One general comment is that the logic of {{TimelineEntityAsyncDispatcher}} is pretty similar to {{AsyncDispatcher}}. Since code that handles concurrency is normally considered non-trivial, maybe we should refactor {{AsyncDispatcher}}'s code and reuse it, rather than duplicating the logic here? Will there be any unforeseen challenges with this? Thanks! > Replace starting a separate thread for post entity with event loop in > TimelineClient > > > Key: YARN-3367 > URL: https://issues.apache.org/jira/browse/YARN-3367 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Junping Du >Assignee: Naganarasimha G R > Attachments: YARN-3367.YARN-2928.001.patch > > > Since YARN-3039, we add loop in TimelineClient to wait for > collectorServiceAddress ready before posting any entity. In consumer of > TimelineClient (like AM), we are starting a new thread for each call to get > rid of potential deadlock in main thread. This way has at least 3 major > defects: > 1. The consumer need some additional code to wrap a thread before calling > putEntities() in TimelineClient. > 2. It cost many thread resources which is unnecessary. > 3. The sequence of events could be out of order because each posting > operation thread get out of waiting loop randomly. > We should have something like event loop in TimelineClient side, > putEntities() only put related entities into a queue of entities and a > separated thread handle to deliver entities in queue to collector via REST > call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4169) jenkins trunk+java build failed in TestNodeStatusUpdaterForLabels
[ https://issues.apache.org/jira/browse/YARN-4169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903483#comment-14903483 ] Hadoop QA commented on YARN-4169: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 10m 26s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 4 new or modified test files. | | {color:green}+1{color} | javac | 8m 54s | There were no new javac warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 2m 35s | There were no new checkstyle issues. | | {color:red}-1{color} | whitespace | 0m 1s | The patch has 3 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 38s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 36s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 4m 41s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 2m 9s | Tests passed in hadoop-yarn-common. | | {color:red}-1{color} | yarn tests | 7m 48s | Tests failed in hadoop-yarn-server-nodemanager. | | {color:green}+1{color} | yarn tests | 56m 20s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 95m 35s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12761714/YARN-4169.v1.001.patch | | Optional Tests | javac unit findbugs checkstyle | | git revision | trunk / cc2b473 | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/9237/artifact/patchprocess/whitespace.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/9237/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9237/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9237/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9237/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9237/console | This message was automatically generated. 
> jenkins trunk+java build failed in TestNodeStatusUpdaterForLabels > - > > Key: YARN-4169 > URL: https://issues.apache.org/jira/browse/YARN-4169 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 3.0.0 > Environment: Jenkins >Reporter: Steve Loughran >Assignee: Naganarasimha G R >Priority: Critical > Attachments: YARN-4169.v1.001.patch > > > Test failing in [[Jenkins build > 402|https://builds.apache.org/view/H-L/view/Hadoop/job/Hadoop-Yarn-trunk-Java8/402/testReport/junit/org.apache.hadoop.yarn.server.nodemanager/TestNodeStatusUpdaterForLabels/testNodeStatusUpdaterForNodeLabels/] > {code} > java.lang.NullPointerException: null > at java.util.HashSet.(HashSet.java:118) > at > org.apache.hadoop.yarn.nodelabels.NodeLabelTestBase.assertNLCollectionEquals(NodeLabelTestBase.java:103) > at > org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdaterForLabels.testNodeStatusUpdaterForNodeLabels(TestNodeStatusUpdaterForLabels.java:268) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3367) Replace starting a separate thread for post entity with event loop in TimelineClient
[ https://issues.apache.org/jira/browse/YARN-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903461#comment-14903461 ] Li Lu commented on YARN-3367: - Hi [~Naganarasimha], I'm trying to go over all pending JIRAs for 2928 branch, and seems like we're close on this one? Any recent progress on this JIRA? Thanks! > Replace starting a separate thread for post entity with event loop in > TimelineClient > > > Key: YARN-3367 > URL: https://issues.apache.org/jira/browse/YARN-3367 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Junping Du >Assignee: Naganarasimha G R > Attachments: YARN-3367.YARN-2928.001.patch > > > Since YARN-3039, we add loop in TimelineClient to wait for > collectorServiceAddress ready before posting any entity. In consumer of > TimelineClient (like AM), we are starting a new thread for each call to get > rid of potential deadlock in main thread. This way has at least 3 major > defects: > 1. The consumer need some additional code to wrap a thread before calling > putEntities() in TimelineClient. > 2. It cost many thread resources which is unnecessary. > 3. The sequence of events could be out of order because each posting > operation thread get out of waiting loop randomly. > We should have something like event loop in TimelineClient side, > putEntities() only put related entities into a queue of entities and a > separated thread handle to deliver entities in queue to collector via REST > call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4074) [timeline reader] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903411#comment-14903411 ] Vrushali C commented on YARN-4074: -- Committed patch v8. Thanks [~sjlee0] for the contribution and everyone for the review! > [timeline reader] implement support for querying for flows and flow runs > > > Key: YARN-4074 > URL: https://issues.apache.org/jira/browse/YARN-4074 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Sangjin Lee > Attachments: YARN-4074-YARN-2928.007.patch, > YARN-4074-YARN-2928.008.patch, YARN-4074-YARN-2928.POC.001.patch, > YARN-4074-YARN-2928.POC.002.patch, YARN-4074-YARN-2928.POC.003.patch, > YARN-4074-YARN-2928.POC.004.patch, YARN-4074-YARN-2928.POC.005.patch, > YARN-4074-YARN-2928.POC.006.patch > > > Implement support for querying for flows and flow runs. > We should be able to query for the most recent N flows, etc. > This includes changes to the {{TimelineReader}} API if necessary, as well as > implementation of the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4000) RM crashes with NPE if leaf queue becomes parent queue during restart
[ https://issues.apache.org/jira/browse/YARN-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903400#comment-14903400 ] Varun Saxena commented on YARN-4000: bq. Is this the case? I think in current code, RM is still ignoring these orphan containers? In recoverContainersOnNode, if we do not find the application in the scheduler, the flow in the RM (looking at trunk code) is as under: # AbstractYarnScheduler#killOrphanContainerOnNode will be called if the application is not found in the scheduler, which will in turn post a CLEANUP_CONTAINER event (for containers which have not finished). This event will be handled by RMNodeImpl. Although here we will be sending one CLEANUP_CONTAINER event for each container, even though all containers for a running app will have to be cleaned up. Maybe this can be refactored to send only one event with all the containers for an app and node, but cleaning up a lot of containers like this may be a rare scenario. # Anyway, going further, in RMNodeImpl this event will be processed in CleanUpContainerTransition. Here the container will be added to a set, containersToClean. # When a heartbeat from the NM comes, ResourceTrackerService#nodeHeartbeat will call RMNodeImpl#updateNodeHeartbeatResponseForCleanup. In this method, the response will be populated with containers to clean up from the set containersToClean, and hence these containers are reported back to the NM in the HB response. On the NM side, the flow is as under: # In NodeStatusUpdaterImpl, these containers to clean up will be retrieved from the HB response and a CMgrCompletedContainersEvent will be dispatched. # In ContainerManagerImpl, this event will be processed and a ContainerKillEvent created for each container. # Now, depending on the state of the container, ContainerImpl will send a CLEANUP_CONTAINER event to ContainersLauncher, which will then send a TERM/KILL signal to the container. > RM crashes with NPE if leaf queue becomes parent queue during restart > - > > Key: YARN-4000 > URL: https://issues.apache.org/jira/browse/YARN-4000 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler, resourcemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-4000.01.patch, YARN-4000.02.patch, > YARN-4000.03.patch, YARN-4000.04.patch, YARN-4000.05.patch > > > This is a similar situation to YARN-2308. If an application is active in > queue A and then the RM restarts with a changed capacity scheduler > configuration where queue A becomes a parent queue to other subqueues then > the RM will crash with a NullPointerException. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
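A simplified sketch of step 3 of the RM-side flow above, i.e. draining {{containersToClean}} into the heartbeat response on each NM heartbeat. The types below are illustrative stand-ins, not the actual RMNodeImpl or NodeHeartbeatResponse signatures.
{code}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Stand-in for the heartbeat response object sent back to the NM.
class HeartbeatResponseSketch {
  private final List<String> containersToCleanup = new ArrayList<String>();

  void addAllContainersToCleanup(Set<String> containerIds) {
    containersToCleanup.addAll(containerIds);
  }

  List<String> getContainersToCleanup() {
    return containersToCleanup;
  }
}

// Stand-in for RMNodeImpl's handling of containersToClean.
class RMNodeSketch {
  // Populated by the CLEANUP_CONTAINER transition, one entry per orphan container.
  private final Set<String> containersToClean = new HashSet<String>();

  synchronized void cleanupContainer(String containerId) {
    containersToClean.add(containerId);
  }

  // Called on each NM heartbeat: move pending cleanups into the response so the
  // NM kills those containers, then clear the set.
  synchronized void updateHeartbeatResponseForCleanup(HeartbeatResponseSketch response) {
    response.addAllContainersToCleanup(containersToClean);
    containersToClean.clear();
  }
}
{code}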
[jira] [Commented] (YARN-4075) [reader REST API] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903396#comment-14903396 ] Li Lu commented on YARN-4075: - bq. This can lead to code bloating with similar methods repeating again and again. OK, here are the possible solutions: - Have two endpoints, one with a cluster name and one without. Both methods would be redirected to the same internal method getFlows(clusterId). For the endpoint that does not have a cluster id, we can figure it out on the server side. With proper overloading I don't see significant code duplication problems. Or - Always require the cluster id. Then, to allow the web apps to figure out the cluster name, we have to either implement another "client" in javascript, or let the user input the cluster name (because the web app cannot figure it out). The first approach does not introduce any duplicated code, but it does introduce duplicated logic, in two different programming languages. The second approach will cause usability problems. Am I missing anything here? bq. Well currently the entities are returned in order, sorted by created time. That is how we have documented our reader API as well. TimelineReader#getEntities is supposed to return entities sorted descendingly by created time. TimelineEntities is not only used by the reader. It is also used by the writer and aggregation logic. Enforcing an order on this class will introduce unnecessary overhead to both writers and aggregators. If the reader needs it, we should derive it and make an ordered version, if possible. bq. You want me to do this refactoring in this JIRA ? To be clear, I'm only asking for Private and VisibleForTesting annotations in this JIRA. We should refactor the UTs in the future JIRA that fully disables fs. > [reader REST API] implement support for querying for flows and flow runs > > > Key: YARN-4075 > URL: https://issues.apache.org/jira/browse/YARN-4075 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4075-YARN-2928.POC.1.patch, > YARN-4075-YARN-2928.POC.2.patch > > > We need to be able to query for flows and flow runs via REST. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
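A rough sketch of the first option above (two REST paths delegating to one internal method). The paths, class name, and the way the default cluster is resolved are assumptions for illustration, not the actual timeline reader web services code.
{code}
import java.util.Collections;
import java.util.Set;

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

// Illustrative only: both endpoints funnel into the same internal method, so a
// URL without a cluster id falls back to a server-side default.
@Path("/ws/v2/timeline")
public class FlowsEndpointSketch {

  @GET
  @Path("/flows")
  @Produces(MediaType.APPLICATION_JSON)
  public Set<String> getFlows() {
    // Cluster id omitted by the caller: resolve it on the server side.
    return getFlowsInternal(getDefaultClusterId());
  }

  @GET
  @Path("/flows/{clusterid}")
  @Produces(MediaType.APPLICATION_JSON)
  public Set<String> getFlows(@PathParam("clusterid") String clusterId) {
    return getFlowsInternal(clusterId);
  }

  private Set<String> getFlowsInternal(String clusterId) {
    // Stub: a real implementation would query the timeline reader storage.
    return Collections.singleton("flows-for-" + clusterId);
  }

  private String getDefaultClusterId() {
    // Stub: could read something like yarn.resourcemanager.cluster-id from configuration.
    return "default-cluster";
  }
}
{code}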
[jira] [Commented] (YARN-4000) RM crashes with NPE if leaf queue becomes parent queue during restart
[ https://issues.apache.org/jira/browse/YARN-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903387#comment-14903387 ] Varun Saxena commented on YARN-4000: [~jianhe] bq. actually, I think this will be a problem in regular case. Application is being killed by user right on RM restart. This is an existing problem though. Do you think so? Do you mean the user killing the application while we are also killing it at the same time? But the RM will first do the recovery and only then open any of the ports while transitioning to active, so ClientRMService or ResourceTrackerService won't even start until recovery is done. So most probably, by the time the kill from the user comes, all the recovery-related events should have been processed. Even if they are not processed, they will be ahead in the dispatcher queue. A KILL event would be ignored by RMAppImpl if the app is already in the KILLING state. > RM crashes with NPE if leaf queue becomes parent queue during restart > - > > Key: YARN-4000 > URL: https://issues.apache.org/jira/browse/YARN-4000 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler, resourcemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-4000.01.patch, YARN-4000.02.patch, > YARN-4000.03.patch, YARN-4000.04.patch, YARN-4000.05.patch > > > This is a similar situation to YARN-2308. If an application is active in > queue A and then the RM restarts with a changed capacity scheduler > configuration where queue A becomes a parent queue to other subqueues then > the RM will crash with a NullPointerException. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3975) WebAppProxyServlet should not redirect to RM page if AHS is enabled
[ https://issues.apache.org/jira/browse/YARN-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903385#comment-14903385 ] Hadoop QA commented on YARN-3975: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 17m 3s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 51s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 12s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 51s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 30s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 39s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 6m 54s | Tests passed in hadoop-yarn-client. | | {color:green}+1{color} | yarn tests | 0m 24s | Tests passed in hadoop-yarn-server-web-proxy. | | | | 47m 26s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12761709/YARN-3975.9.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / cc2b473 | | hadoop-yarn-client test log | https://builds.apache.org/job/PreCommit-YARN-Build/9235/artifact/patchprocess/testrun_hadoop-yarn-client.txt | | hadoop-yarn-server-web-proxy test log | https://builds.apache.org/job/PreCommit-YARN-Build/9235/artifact/patchprocess/testrun_hadoop-yarn-server-web-proxy.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9235/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9235/console | This message was automatically generated. > WebAppProxyServlet should not redirect to RM page if AHS is enabled > --- > > Key: YARN-3975 > URL: https://issues.apache.org/jira/browse/YARN-3975 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.1 >Reporter: Mit Desai >Assignee: Mit Desai > Attachments: YARN-3975.2.b2.patch, YARN-3975.3.patch, > YARN-3975.4.patch, YARN-3975.5.patch, YARN-3975.6.patch, YARN-3975.7.patch, > YARN-3975.8.patch, YARN-3975.9.patch > > > WebAppProxyServlet should be updated to handle the case when the appreport > doesn't have a tracking URL and the Application History Server is eanbled. > As we would have already tried the RM and got the > ApplicationNotFoundException we should not direct the user to the RM app page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition
[ https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903374#comment-14903374 ] Hadoop QA commented on YARN-4140: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 18m 44s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 8m 7s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 10s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 49s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 29s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 29s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 54m 23s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 96m 15s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12761696/0009-YARN-4140.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / cc2b473 | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9233/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9233/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9233/console | This message was automatically generated. > RM container allocation delayed incase of app submitted to Nodelabel partition > -- > > Key: YARN-4140 > URL: https://issues.apache.org/jira/browse/YARN-4140 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, > 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, > 0006-YARN-4140.patch, 0007-YARN-4140.patch, 0008-YARN-4140.patch, > 0009-YARN-4140.patch > > > Trying to run application on Nodelabel partition I found that the > application execution time is delayed by 5 – 10 min for 500 containers . > Total 3 machines 2 machines were in same partition and app submitted to same. > After enabling debug was able to find the below > # From AM the container ask is for OFF-SWITCH > # RM allocating all containers to NODE_LOCAL as shown in logs below. > # So since I was having about 500 containers time taken was about – 6 minutes > to allocate 1st map after AM allocation. 
> # Tested with about 1K maps using PI job took 17 minutes to allocate next > container after AM allocation > Once 500 container allocation on NODE_LOCAL is done the next container > allocation is done on OFF_SWITCH > {code} > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > /default-rack, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: *, Relax > Locality: true, Node Label Expression: 3} > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > host-10-19-92-143, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showReque
[jira] [Commented] (YARN-4075) [reader REST API] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903363#comment-14903363 ] Varun Saxena commented on YARN-4075: {quote} Or we can have 2 separate REST endpoints, with and without cluster ID. This looks good to me. Are there any specific challenges to implement this? {quote} This can lead to code bloat, with similar methods repeating again and again. Otherwise there is no other concern. I had tried doing a regex match for paths at the time, but that doesn't seem to work if the matching path element is not at the end, which in the case of cluster ID it won't be. {quote} Let's not enforce an order by default since this may be slightly more expensive? The programmer can always sort them on the client side if needed. {quote} Well, currently the entities are returned in order, sorted by created time. That is how we have documented our reader API as well. TimelineReader#getEntities is supposed to return entities in descending order of created time. We will be breaking this behavior if we use TimelineEntities and do not change the set within. {quote} However we do have problems with supporting more features with the old fs storage, so yes it's fine to make the change here. Maybe we'd like to mark them as test only? {quote} You want me to do this refactoring in this JIRA? I think I can handle this refactoring alongside some other JIRA, and we can get this in ASAP for UI-related work. I will rebase it and update the patch tomorrow morning India time. I think 4074 should be in by then. > [reader REST API] implement support for querying for flows and flow runs > > > Key: YARN-4075 > URL: https://issues.apache.org/jira/browse/YARN-4075 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4075-YARN-2928.POC.1.patch, > YARN-4075-YARN-2928.POC.2.patch > > > We need to be able to query for flows and flow runs via REST. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4075) [reader REST API] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903342#comment-14903342 ] Li Lu commented on YARN-4075: - Hi [~varun_saxena], thanks for the note! bq. Infact in initial patches in YARN-3814 I was taking cluster ID from config if it was not supplied by user i.e. it was an optional query parameter. But Zhijie was of the opinion that this handling should be done at TimelineClient side and that is what seemed to be the consensus at that time. Hence I removed it. Sure. However, at that time we did not think about web apps. It will be a little bit non-trivial for a front-end web page to figure out which cluster it's pointing to without user input. I noticed you raised quite a helpful point in the discussion: bq. Or we can have 2 separate REST endpoints, with and without cluster ID. This looks good to me. Are there any specific challenges to implement this? bq. For ordering should we change the set inside TimelineEntities to TreeSet with comparator based on created time ? Let's not enforce an order by default since this may be slightly more expensive? The programmer can always sort them on the client side if needed. bq. I plan to combine to test webservices classes to use HBase The UTs on webservices should be independent of the storage implementations. However we do have problems with supporting more features with the old fs storage, so yes it's fine to make the change here. Maybe we'd like to mark them as test only? > [reader REST API] implement support for querying for flows and flow runs > > > Key: YARN-4075 > URL: https://issues.apache.org/jira/browse/YARN-4075 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4075-YARN-2928.POC.1.patch, > YARN-4075-YARN-2928.POC.2.patch > > > We need to be able to query for flows and flow runs via REST. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4075) [reader REST API] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903329#comment-14903329 ] Varun Saxena commented on YARN-4075: I mean "I plan to combine two of the test webservices classes to use HBase" > [reader REST API] implement support for querying for flows and flow runs > > > Key: YARN-4075 > URL: https://issues.apache.org/jira/browse/YARN-4075 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4075-YARN-2928.POC.1.patch, > YARN-4075-YARN-2928.POC.2.patch > > > We need to be able to query for flows and flow runs via REST. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4075) [reader REST API] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903305#comment-14903305 ] Varun Saxena commented on YARN-4075: [~gtCarrera9], thanks for the review. bq. Maybe we'd like to return the "default" cluster, or the cluster the reader runs on (or a reader farm associates to), if the given clusterId is empty? In fact, in the initial patches in YARN-3814 I was taking the cluster ID from config if it was not supplied by the user, i.e. it was an optional query parameter. But Zhijie was of the opinion that this handling should be done on the TimelineClient side, and that seemed to be the consensus at the time. Hence I removed it. If the consensus is now centering around handling it on the server side, we can do that. I am fine either way. bq. I just noticed that we're returning Set rather than TimelineEntities in timeline reader. Ok. Again, I had initially kept the API returning TimelineEntities in 3051, but opinion differed then. I would in fact prefer using TimelineEntities. For ordering, should we change the set inside TimelineEntities to a TreeSet with a comparator based on created time? Ordering might be useful on the client side. bq. In TestTimelineReaderWebServicesFlowRun#testGetFlowRun, why do we compare equality through toString and comparing two strings For the sake of simplicity; the toString outputs values as well. Anyway, I can write a static function in the test class to do the comparison if the toString approach seems confusing. That seems to be the case, so I will change it. bq. Any special reasons to refactor TestHBaseTimelineStorage Due to the visibility of TimelineSchemaCreator#createAllTables. I saw no real need to make it public; the WebServices-related test class shouldn't really need to access it directly. As I said in one of the comments above, I plan to combine to test webservices classes to use HBase. For that I will have a test class for the HBase reader implementation which will create the tables and load the data (in some before-class method). The webservices class will merely call that; it is the same arrangement as the one which exists for TestTimelineReaderWebServices and TestFileSystemTimelineReaderImpl. Then I won't need to call createAllTables from this test class. I will do that refactoring in some reader-related JIRA. At this time, getting this JIRA in is more important than that refactoring, which anyway is just for tests. > [reader REST API] implement support for querying for flows and flow runs > > > Key: YARN-4075 > URL: https://issues.apache.org/jira/browse/YARN-4075 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4075-YARN-2928.POC.1.patch, > YARN-4075-YARN-2928.POC.2.patch > > > We need to be able to query for flows and flow runs via REST. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
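A small sketch of the static comparison helper mentioned above, as an alternative to comparing toString() output. The getters used here (getId(), getCreatedTime()) are assumed from the v2 TimelineEntity API; a real helper would compare whichever fields the test cares about.
{code}
import static org.junit.Assert.assertEquals;

import org.apache.hadoop.yarn.api.records.timelineservice.TimelineEntity;

public final class TimelineEntityAsserts {

  private TimelineEntityAsserts() {
  }

  // Compare the fields the test actually cares about instead of relying on toString().
  public static void assertEntityEquals(TimelineEntity expected, TimelineEntity actual) {
    assertEquals("entity id mismatch", expected.getId(), actual.getId());
    assertEquals("created time mismatch", expected.getCreatedTime(), actual.getCreatedTime());
  }
}
{code}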
[jira] [Commented] (YARN-4200) Refactor reader classes in storage to nest under hbase specific package name
[ https://issues.apache.org/jira/browse/YARN-4200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903285#comment-14903285 ] Li Lu commented on YARN-4200: - Will flip the code quickly when there is no other interference. Right now our priority goes to YARN-4075. > Refactor reader classes in storage to nest under hbase specific package name > > > Key: YARN-4200 > URL: https://issues.apache.org/jira/browse/YARN-4200 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Vrushali C >Assignee: Li Lu >Priority: Minor > > As suggested by [~gtCarrera9] in YARN-4074, filing jira to refactor the code > to group together the reader classes under a package in storage that > indicates these are hbase specific. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4074) [timeline reader] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903284#comment-14903284 ] Li Lu commented on YARN-4074: - Sure, please go ahead with the current patch. Thanks for the work folks! > [timeline reader] implement support for querying for flows and flow runs > > > Key: YARN-4074 > URL: https://issues.apache.org/jira/browse/YARN-4074 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Sangjin Lee > Attachments: YARN-4074-YARN-2928.007.patch, > YARN-4074-YARN-2928.008.patch, YARN-4074-YARN-2928.POC.001.patch, > YARN-4074-YARN-2928.POC.002.patch, YARN-4074-YARN-2928.POC.003.patch, > YARN-4074-YARN-2928.POC.004.patch, YARN-4074-YARN-2928.POC.005.patch, > YARN-4074-YARN-2928.POC.006.patch > > > Implement support for querying for flows and flow runs. > We should be able to query for the most recent N flows, etc. > This includes changes to the {{TimelineReader}} API if necessary, as well as > implementation of the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-4200) Refactor reader classes in storage to nest under hbase specific package name
[ https://issues.apache.org/jira/browse/YARN-4200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu reassigned YARN-4200: --- Assignee: Li Lu > Refactor reader classes in storage to nest under hbase specific package name > > > Key: YARN-4200 > URL: https://issues.apache.org/jira/browse/YARN-4200 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Vrushali C >Assignee: Li Lu >Priority: Minor > > As suggested by [~gtCarrera9] in YARN-4074, filing jira to refactor the code > to group together the reader classes under a package in storage that > indicates these are hbase specific. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4074) [timeline reader] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903246#comment-14903246 ] Vrushali C commented on YARN-4074: -- Chatted with Li offline and decided to file https://issues.apache.org/jira/browse/YARN-4200 to deal with the refactoring of package names and proceed with this patch. > [timeline reader] implement support for querying for flows and flow runs > > > Key: YARN-4074 > URL: https://issues.apache.org/jira/browse/YARN-4074 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Sangjin Lee > Attachments: YARN-4074-YARN-2928.007.patch, > YARN-4074-YARN-2928.008.patch, YARN-4074-YARN-2928.POC.001.patch, > YARN-4074-YARN-2928.POC.002.patch, YARN-4074-YARN-2928.POC.003.patch, > YARN-4074-YARN-2928.POC.004.patch, YARN-4074-YARN-2928.POC.005.patch, > YARN-4074-YARN-2928.POC.006.patch > > > Implement support for querying for flows and flow runs. > We should be able to query for the most recent N flows, etc. > This includes changes to the {{TimelineReader}} API if necessary, as well as > implementation of the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4169) jenkins trunk+java build failed in TestNodeStatusUpdaterForLabels
[ https://issues.apache.org/jira/browse/YARN-4169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-4169: Attachment: YARN-4169.v1.001.patch Hi [~ste...@apache.org], I was able to reproduce this test failure (NPE) during debug testing; it is caused by an improperly handled race condition. After sending the {{heartbeat}}, the test case needs to wait for a short duration for the heartbeat (HB) thread in the node status updater to go into its wait state. If no such wait is added, the notify from {{sendOutofBandHeartBeat}} is issued before the HB thread reaches the wait state and is lost. I have also addressed the other review comments you mentioned. > jenkins trunk+java build failed in TestNodeStatusUpdaterForLabels > - > > Key: YARN-4169 > URL: https://issues.apache.org/jira/browse/YARN-4169 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 3.0.0 > Environment: Jenkins >Reporter: Steve Loughran >Assignee: Naganarasimha G R >Priority: Critical > Attachments: YARN-4169.v1.001.patch > > > Test failing in [[Jenkins build > 402|https://builds.apache.org/view/H-L/view/Hadoop/job/Hadoop-Yarn-trunk-Java8/402/testReport/junit/org.apache.hadoop.yarn.server.nodemanager/TestNodeStatusUpdaterForLabels/testNodeStatusUpdaterForNodeLabels/] > {code} > java.lang.NullPointerException: null > at java.util.HashSet.(HashSet.java:118) > at > org.apache.hadoop.yarn.nodelabels.NodeLabelTestBase.assertNLCollectionEquals(NodeLabelTestBase.java:103) > at > org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdaterForLabels.testNodeStatusUpdaterForNodeLabels(TestNodeStatusUpdaterForLabels.java:268) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
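The race described above is a notify-before-wait: if the test triggers {{sendOutofBandHeartBeat}} before the heartbeat thread has actually parked in wait(), the notify is lost, the heartbeat never happens, and the assertion ends up comparing against a null label set. A minimal sketch of the kind of guard the comment suggests, assuming the test can reach the heartbeat thread; the helper name, polling step, and timeout are illustrative and not taken from the actual patch:
{code}
// Poll until the heartbeat thread is parked in wait() (or the timeout expires)
// before triggering the out-of-band heartbeat, so the notify cannot be lost.
private static void waitUntilThreadIsWaiting(Thread hbThread, long timeoutMs)
    throws InterruptedException {
  long deadline = System.currentTimeMillis() + timeoutMs;
  while (hbThread.getState() != Thread.State.WAITING
      && System.currentTimeMillis() < deadline) {
    Thread.sleep(50);
  }
}
{code}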
[jira] [Created] (YARN-4200) Refactor reader classes in storage to nest under hbase specific package name
Vrushali C created YARN-4200: Summary: Refactor reader classes in storage to nest under hbase specific package name Key: YARN-4200 URL: https://issues.apache.org/jira/browse/YARN-4200 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vrushali C Priority: Minor As suggested by [~gtCarrera9] in YARN-4074, filing jira to refactor the code to group together the reader classes under a package in storage that indicates these are hbase specific. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3975) WebAppProxyServlet should not redirect to RM page if AHS is enabled
[ https://issues.apache.org/jira/browse/YARN-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai updated YARN-3975: Attachment: YARN-3975.9.patch Somehow attached the wrong version of the patch previously. Attached the patch with the checkstyle issues fixed. > WebAppProxyServlet should not redirect to RM page if AHS is enabled > --- > > Key: YARN-3975 > URL: https://issues.apache.org/jira/browse/YARN-3975 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.1 >Reporter: Mit Desai >Assignee: Mit Desai > Attachments: YARN-3975.2.b2.patch, YARN-3975.3.patch, > YARN-3975.4.patch, YARN-3975.5.patch, YARN-3975.6.patch, YARN-3975.7.patch, > YARN-3975.8.patch, YARN-3975.9.patch > > > WebAppProxyServlet should be updated to handle the case when the appreport > doesn't have a tracking URL and the Application History Server is enabled. > As we would have already tried the RM and got the > ApplicationNotFoundException we should not direct the user to the RM app page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4180) AMLauncher does not retry on failures when talking to NM
[ https://issues.apache.org/jira/browse/YARN-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-4180: Attachment: YARN-4180.002.patch Addressed feedback > AMLauncher does not retry on failures when talking to NM > - > > Key: YARN-4180 > URL: https://issues.apache.org/jira/browse/YARN-4180 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot >Priority: Critical > Attachments: YARN-4180.001.patch, YARN-4180.002.patch > > > We see issues with RM trying to launch a container while a NM is restarting > and we get exceptions like NMNotReadyException. While YARN-3842 added retry > for other clients of NM (AMs mainly) its not used by AMLauncher in RM causing > there intermittent errors to cause job failures. This can manifest during > rolling restart of NMs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
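The fix direction here (reusing the retry behavior that YARN-3842 introduced for NM clients) amounts to talking to the NM through a proxy wrapped with a retry policy, so a transient NMNotReadyException during a rolling restart is retried instead of failing the AM launch. A rough sketch under the assumption that a raw {{ContainerManagementProtocol}} proxy is already in hand; the retry count and interval below are illustrative, not the values used by the patch:
{code}
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;
import org.apache.hadoop.io.retry.RetryProxy;
import org.apache.hadoop.yarn.api.ContainerManagementProtocol;

public final class RetryingNMProxySketch {
  // Wrap an existing NM proxy so transient failures are retried with a pause
  // between attempts rather than surfacing immediately to the caller.
  static ContainerManagementProtocol withRetries(ContainerManagementProtocol rawProxy) {
    RetryPolicy policy =
        RetryPolicies.retryUpToMaximumCountWithFixedSleep(10, 1, TimeUnit.SECONDS);
    return (ContainerManagementProtocol)
        RetryProxy.create(ContainerManagementProtocol.class, rawProxy, policy);
  }
}
{code}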
[jira] [Commented] (YARN-4180) AMLauncher does not retry on failures when talking to NM
[ https://issues.apache.org/jira/browse/YARN-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903217#comment-14903217 ] Anubhav Dhoot commented on YARN-4180: - The test failure looks unrelated. > AMLauncher does not retry on failures when talking to NM > - > > Key: YARN-4180 > URL: https://issues.apache.org/jira/browse/YARN-4180 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot >Priority: Critical > Attachments: YARN-4180.001.patch > > > We see issues with RM trying to launch a container while a NM is restarting > and we get exceptions like NMNotReadyException. While YARN-3842 added retry > for other clients of NM (AMs mainly) its not used by AMLauncher in RM causing > there intermittent errors to cause job failures. This can manifest during > rolling restart of NMs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4141) Runtime Application Priority change should not throw exception for applications at finishing states
[ https://issues.apache.org/jira/browse/YARN-4141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903168#comment-14903168 ] Jason Lowe commented on YARN-4141: -- Thanks for updating the patch. The new constants should be marked final. Also using "active" instead of "accepted" may be a bit more clear since accepted directly maps to an existing app state. > Runtime Application Priority change should not throw exception for > applications at finishing states > --- > > Key: YARN-4141 > URL: https://issues.apache.org/jira/browse/YARN-4141 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Sunil G >Assignee: Sunil G > Attachments: 0001-YARN-4141.patch, 0002-YARN-4141.patch, > 0003-YARN-4141.patch, 0004-YARN-4141.patch, 0005-YARN-4141.patch > > > As suggested by [~jlowe] in > [MAPREDUCE-5870-comment|https://issues.apache.org/jira/browse/MAPREDUCE-5870?focusedCommentId=14737035&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14737035] > , its good that if YARN can suppress exceptions during change application > priority calls for applications at its finishing stages. > Currently it will be difficult for clients to handle this. This will be > similar to kill application behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4157) Merge YARN-1197 back to trunk
[ https://issues.apache.org/jira/browse/YARN-4157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-4157: - Attachment: YARN-1197.diff.6.patch Rebased to latest trunk (diff.6) > Merge YARN-1197 back to trunk > - > > Key: YARN-4157 > URL: https://issues.apache.org/jira/browse/YARN-4157 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-1197.diff.1.patch, YARN-1197.diff.2.patch, > YARN-1197.diff.3.patch, YARN-1197.diff.4.patch, YARN-1197.diff.5.patch, > YARN-1197.diff.6.patch > > > The purpose of this jira is to generate a uber patch from current YARN-1197 > branch and run against trunk to fix any uncaught warnings and test failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4075) [reader REST API] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903162#comment-14903162 ] Li Lu commented on YARN-4075: - Sorry folks we're a little bit delayed on YARN-4074, but once that is in we can move forward with this JIRA quickly. > [reader REST API] implement support for querying for flows and flow runs > > > Key: YARN-4075 > URL: https://issues.apache.org/jira/browse/YARN-4075 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4075-YARN-2928.POC.1.patch, > YARN-4075-YARN-2928.POC.2.patch > > > We need to be able to query for flows and flow runs via REST. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4074) [timeline reader] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903158#comment-14903158 ] Li Lu commented on YARN-4074: - Sorry I missed your message yesterday... I was thinking about putting those hbase reader classes (like ApplicationEntityReader) to a sub dir to indicate they only work with HBase. It's also fine to commit the patch as-is if that's troublesome. I'm OK with both. > [timeline reader] implement support for querying for flows and flow runs > > > Key: YARN-4074 > URL: https://issues.apache.org/jira/browse/YARN-4074 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Sangjin Lee > Attachments: YARN-4074-YARN-2928.007.patch, > YARN-4074-YARN-2928.008.patch, YARN-4074-YARN-2928.POC.001.patch, > YARN-4074-YARN-2928.POC.002.patch, YARN-4074-YARN-2928.POC.003.patch, > YARN-4074-YARN-2928.POC.004.patch, YARN-4074-YARN-2928.POC.005.patch, > YARN-4074-YARN-2928.POC.006.patch > > > Implement support for querying for flows and flow runs. > We should be able to query for the most recent N flows, etc. > This includes changes to the {{TimelineReader}} API if necessary, as well as > implementation of the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4113) RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER
[ https://issues.apache.org/jira/browse/YARN-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903139#comment-14903139 ] Sunil G commented on YARN-4113: --- Thank you [~leftnoteasy] for the review and commit and thank you Karthik for the review. > RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER > -- > > Key: YARN-4113 > URL: https://issues.apache.org/jira/browse/YARN-4113 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wangda Tan >Assignee: Sunil G >Priority: Critical > Attachments: 0001-YARN-4113.patch > > > Found one issue in RMProxy how to initialize RetryPolicy: In > RMProxy#createRetryPolicy. When rmConnectWaitMS is set to -1 (wait forever), > it uses RetryPolicies.RETRY_FOREVER which doesn't respect > {{yarn.resourcemanager.connect.retry-interval.ms}} setting. > RetryPolicies.RETRY_FOREVER uses 0 as the interval, when I run the test > without properly setup localhost name: > {{TestYarnClient#testShouldNotRetryForeverForNonNetworkExceptions}}, it wrote > 14G DEBUG exception message to system before it dies. This will be very bad > if we do the same thing in a production cluster. > We should fix two places: > - Make RETRY_FOREVER can take retry-interval as constructor parameter. > - Respect retry-interval when we uses RETRY_FOREVER policy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
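For readers following the description: the problem is that {{RETRY_FOREVER}} retries with no pause at all, so the configured connect-retry interval is ignored and failures can spin in a tight loop. The shape of the remedy is a forever-style policy that still sleeps between attempts; a hedged sketch using the long-standing fixed-sleep policy (the committed fix adds a proper interval-aware forever policy, which may be named differently):
{code}
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

public final class RmConnectRetrySketch {
  // "Retry forever" approximated as a very large retry count, with a sleep
  // between attempts so yarn.resourcemanager.connect.retry-interval.ms is honored.
  static RetryPolicy foreverWithInterval(long retryIntervalMs) {
    return RetryPolicies.retryUpToMaximumCountWithFixedSleep(
        Integer.MAX_VALUE, retryIntervalMs, TimeUnit.MILLISECONDS);
  }
}
{code}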
[jira] [Commented] (YARN-4176) Resync NM nodelabels with RM every x interval for distributed nodelabels
[ https://issues.apache.org/jira/browse/YARN-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903133#comment-14903133 ] Bibin A Chundatt commented on YARN-4176: Hi [~leftnoteasy] Could you please look into this issue > Resync NM nodelabels with RM every x interval for distributed nodelabels > > > Key: YARN-4176 > URL: https://issues.apache.org/jira/browse/YARN-4176 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4176.patch, 0002-YARN-4176.patch, > 0003-YARN-4176.patch, 0004-YARN-4176.patch > > > This JIRA is for handling the below set of issue > # Distributed nodelabels after NM registered with RM if cluster nodelabels > are removed and added then NM doesnt resend labels in heartbeat again untils > any change in labels > # NM registration failed with Nodelabels should resend labels again to RM > The above cases can be handled by resync nodeLabels with RM every x interval > # Add property {{yarn.nodemanager.node-labels.provider.resync-interval-ms}} > and will resend nodelabels to RM based on config no matter what the > registration fails or success. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
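The resync described in the summary is essentially a periodic timer inside the NM that re-reports the provider's labels regardless of whether the previous registration or report succeeded. A small sketch of how the proposed property could be consumed; the property name comes from the description above, and the default value here is purely illustrative:
{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;

public final class NodeLabelResyncSketch {
  // Periodically re-send the NM's node labels to the RM so both sides converge
  // even if a registration failed or cluster labels were removed and re-added.
  static void scheduleResync(Configuration conf, Runnable sendLabelsToRM) {
    long intervalMs = conf.getLong(
        "yarn.nodemanager.node-labels.provider.resync-interval-ms", 120000L);
    ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    timer.scheduleWithFixedDelay(sendLabelsToRM, intervalMs, intervalMs,
        TimeUnit.MILLISECONDS);
  }
}
{code}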
[jira] [Commented] (YARN-4189) Capacity Scheduler : Improve location preference waiting mechanism
[ https://issues.apache.org/jira/browse/YARN-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903125#comment-14903125 ] Wangda Tan commented on YARN-4189: -- [~xinxianyin], I mentioned this in design doc: bq. To avoid application set a very high delay (such as 10 min), we shall have a global max-container-delay to cap the delay to avoid resource wastage. > Capacity Scheduler : Improve location preference waiting mechanism > -- > > Key: YARN-4189 > URL: https://issues.apache.org/jira/browse/YARN-4189 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-4189 design v1.pdf > > > There're some issues with current Capacity Scheduler implementation of delay > scheduling: > *1) Waiting time to allocate each container highly depends on cluster > availability* > Currently, app can only increase missed-opportunity when a node has available > resource AND it gets traversed by a scheduler. There’re lots of possibilities > that an app doesn’t get traversed by a scheduler, for example: > A cluster has 2 racks (rack1/2), each rack has 40 nodes. > Node-locality-delay=40. An application prefers rack1. > Node-heartbeat-interval=1s. > Assume there are 2 nodes available on rack1, delay to allocate one container > = 40 sec. > If there are 20 nodes available on rack1, delay of allocating one container = > 2 sec. > *2) It could violate scheduling policies (Fifo/Priority/Fair)* > Assume a cluster is highly utilized, an app (app1) has higher priority, it > wants locality. And there’s another app (app2) has lower priority, but it > doesn’t care about locality. When node heartbeats with available resource, > app1 decides to wait, so app2 gets the available slot. This should be > considered as a bug that we need to fix. > The same problem could happen when we use FIFO/Fair queue policies. > Another problem similar to this is related to preemption: when preemption > policy preempts some resources from queue-A for queue-B (queue-A is > over-satisfied and queue-B is under-satisfied). But queue-B is waiting for > the node-locality-delay so queue-A will get resources back. In next round, > preemption policy could preempt this resources again from queue-A. > This JIRA is target to solve these problems. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition
[ https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-4140: --- Attachment: 0009-YARN-4140.patch Hi [~sunilg] Thanks for the review and comments. Have updated the test cases and the patch as per the comments. > RM container allocation delayed incase of app submitted to Nodelabel partition > -- > > Key: YARN-4140 > URL: https://issues.apache.org/jira/browse/YARN-4140 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, > 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, > 0006-YARN-4140.patch, 0007-YARN-4140.patch, 0008-YARN-4140.patch, > 0009-YARN-4140.patch > > > Trying to run application on Nodelabel partition I found that the > application execution time is delayed by 5 – 10 min for 500 containers . > Total 3 machines 2 machines were in same partition and app submitted to same. > After enabling debug was able to find the below > # From AM the container ask is for OFF-SWITCH > # RM allocating all containers to NODE_LOCAL as shown in logs below. > # So since I was having about 500 containers time taken was about – 6 minutes > to allocate 1st map after AM allocation. > # Tested with about 1K maps using PI job took 17 minutes to allocate next > container after AM allocation > Once 500 container allocation on NODE_LOCAL is done the next container > allocation is done on OFF_SWITCH > {code} > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > /default-rack, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: *, Relax > Locality: true, Node Label Expression: 3} > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > host-10-19-92-143, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > host-10-19-92-117, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > {code} > > {code} > 2015-09-09 14:35:45,467 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > 2015-09-09 14:35:45,831 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: 
root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > 2015-09-09 14:35:46,469 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > 2015-09-09 14:35:46,832 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > {code} > {code} > dsperf@host-127:/opt/bibin/dsperf/HAINSTALL/install/hadoop/resourcemanager/logs1> > cat hadoop-dsperf-resourcemanager-host-127.log | grep "NODE_LOCAL" | grep > "root.b.b1" | wc -l > 500 > {code} > > (Consumes about 6 minutes) > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4180) AMLauncher does not retry on failures when talking to NM
[ https://issues.apache.org/jira/browse/YARN-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903096#comment-14903096 ] Robert Kanter commented on YARN-4180: - +1 after doing those. > AMLauncher does not retry on failures when talking to NM > - > > Key: YARN-4180 > URL: https://issues.apache.org/jira/browse/YARN-4180 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot >Priority: Critical > Attachments: YARN-4180.001.patch > > > We see issues with RM trying to launch a container while a NM is restarting > and we get exceptions like NMNotReadyException. While YARN-3842 added retry > for other clients of NM (AMs mainly) its not used by AMLauncher in RM causing > there intermittent errors to cause job failures. This can manifest during > rolling restart of NMs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4075) [reader REST API] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902902#comment-14902902 ] Joep Rottinghuis commented on YARN-4075: Agreed with comments from [~gtCarrera] > [reader REST API] implement support for querying for flows and flow runs > > > Key: YARN-4075 > URL: https://issues.apache.org/jira/browse/YARN-4075 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4075-YARN-2928.POC.1.patch, > YARN-4075-YARN-2928.POC.2.patch > > > We need to be able to query for flows and flow runs via REST. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902808#comment-14902808 ] Varun Saxena commented on YARN-2902: Ok. Then what I will do is NOT wait for the completion of running tasks that have been cancelled. The localizer will only try to delete directories for which the download was complete. For tasks that have failed, the directories will be deleted by FSDownload anyway. We may, however, need a config in the NM for the deletion task delay (the one I have added in the current patch), or we can simply hardcode a value of 2 minutes. Regarding System exit, it will be called after ExecutorService#shutdownNow (which only interrupts running tasks and does not wait for them) anyway. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
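The shutdown behavior described in this comment (interrupt in-flight downloads, do not block on them, and let FSDownload or a delayed NM deletion task clean up the directories) is the standard {{shutdownNow}} pattern. A minimal sketch, assuming the localizer's download pool is a plain ExecutorService; the method and parameter names are illustrative:
{code}
import java.util.List;
import java.util.concurrent.ExecutorService;

public final class LocalizerShutdownSketch {
  // Interrupt running download tasks without waiting for them to finish.
  // Deliberately no awaitTermination(): directories of failed downloads are
  // removed by FSDownload, and completed ones by a (possibly delayed)
  // deletion task, so blocking here before the process exits buys nothing.
  static void stopDownloads(ExecutorService downloadPool) {
    List<Runnable> neverStarted = downloadPool.shutdownNow();
    // neverStarted holds tasks that were queued but never began executing.
  }
}
{code}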
[jira] [Commented] (YARN-4165) An outstanding container request makes all nodes to be reserved causing all jobs pending
[ https://issues.apache.org/jira/browse/YARN-4165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902771#comment-14902771 ] Jason Lowe commented on YARN-4165: -- Something must be amiss then, since the capacity scheduler should not be making reservations when the node has insufficient memory to ever fill the request after YARN-957. As for reservations in general, the capacity scheduler applies reservations against the user limits within the queue. If the user has the ability to fully use the queue then yes, reservations can stall other applications within the queue since the user is allowed to fill the queue. Without that behavior the application with large requests could end up in a situation where it never runs due to indefinite postponement problems. > An outstanding container request makes all nodes to be reserved causing all > jobs pending > > > Key: YARN-4165 > URL: https://issues.apache.org/jira/browse/YARN-4165 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler, resourcemanager, scheduler >Affects Versions: 2.7.1 >Reporter: Weiwei Yang >Assignee: Weiwei Yang > > We have a long running service in YARN, it has a outstanding container > request that YARN cannot satisfy (require more memory that nodemanager can > supply). Then YARN reserves all nodes for this application, when I submit > other jobs (require relative small memory that nodemanager can supply), all > jobs are pending because YARN skips scheduling containers on the nodes that > have been reserved. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4199) Minimize lock time in LeveldbTimelineStore.discardOldEntities
[ https://issues.apache.org/jira/browse/YARN-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902757#comment-14902757 ] Jason Lowe commented on YARN-4199: -- Have you looked at the rolling leveldb implementation from YARN-3448? One of its design goals was to solve this same problem. > Minimize lock time in LeveldbTimelineStore.discardOldEntities > - > > Key: YARN-4199 > URL: https://issues.apache.org/jira/browse/YARN-4199 > Project: Hadoop YARN > Issue Type: Improvement > Components: timelineserver, yarn >Reporter: Shiwei Guo > > In current implementation, LeveldbTimelineStore.discardOldEntities holds a > writeLock on deleteLock, which will block other put operation, which > eventually block the execution of YARN jobs(e.g. TEZ). When there is lots of > history jobs in timelinestore, the block time will be very long. In our > observation, it block all the TEZ jobs for several hours or longer. > The possible solutions are: > - Optimize leveldb configuration, so a full scan won't take long time. > - Take a snapshot of leveldb, and scan the snapshot, so we only need to hold > lock while getSnapshot. One question is that whether snapshot will take long > time or not, cause I have no experience with leveldb. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
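On the snapshot option from the description: with the iq80 leveldb API that the timeline store builds on, {{getSnapshot()}} itself is a short operation, so the lock would only need to cover taking the snapshot while the long scan runs against it afterwards. A rough sketch of that shape; whether the snapshot is cheap enough in practice is exactly the open question raised in the description:
{code}
import java.io.IOException;

import org.iq80.leveldb.DB;
import org.iq80.leveldb.DBIterator;
import org.iq80.leveldb.ReadOptions;
import org.iq80.leveldb.Snapshot;

public final class SnapshotScanSketch {
  // Take a snapshot (briefly, under the store's delete lock), then iterate the
  // snapshot outside the lock so puts from running jobs are not blocked.
  static void scanForOldEntities(DB db) throws IOException {
    Snapshot snapshot = db.getSnapshot();
    try (DBIterator it = db.iterator(new ReadOptions().snapshot(snapshot))) {
      for (it.seekToFirst(); it.hasNext(); it.next()) {
        // Inspect it.peekNext() and collect keys of expired entities here;
        // issue the actual deletes in small batches afterwards.
      }
    } finally {
      snapshot.close();
    }
  }
}
{code}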
[jira] [Commented] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition
[ https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902756#comment-14902756 ] Hadoop QA commented on YARN-4140: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 41s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 59s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 16s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 25s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 49s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 28s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 28s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 57m 33s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 97m 18s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens | | | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler | | | hadoop.yarn.server.resourcemanager.scheduler.fifo.TestFifoScheduler | | Timed out tests | org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12761638/0008-YARN-4140.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 57003fa | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9232/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9232/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9232/console | This message was automatically generated. > RM container allocation delayed incase of app submitted to Nodelabel partition > -- > > Key: YARN-4140 > URL: https://issues.apache.org/jira/browse/YARN-4140 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, > 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, > 0006-YARN-4140.patch, 0007-YARN-4140.patch, 0008-YARN-4140.patch > > > Trying to run application on Nodelabel partition I found that the > application execution time is delayed by 5 – 10 min for 500 containers . > Total 3 machines 2 machines were in same partition and app submitted to same. 
> After enabling debug was able to find the below > # From AM the container ask is for OFF-SWITCH > # RM allocating all containers to NODE_LOCAL as shown in logs below. > # So since I was having about 500 containers time taken was about – 6 minutes > to allocate 1st map after AM allocation. > # Tested with about 1K maps using PI job took 17 minutes to allocate next > container after AM allocation > Once 500 container allocation on NODE_LOCAL is done the next container > allocation is done on OFF_SWITCH > {code} > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > /default-rack, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: *, Relax > Locality: true, Node Label Expression: 3} > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority:
[jira] [Commented] (YARN-4011) Jobs fail since nm-local-dir not cleaned up when rogue job fills up disk
[ https://issues.apache.org/jira/browse/YARN-4011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902750#comment-14902750 ] Jason Lowe commented on YARN-4011: -- bq. The mapreduce task can check for BYTES_WRITTEN counter and fail fast if it is above the configured limit. I think having the MR framework provide an optional limit for local filesystem output is a reasonable request until a more sophisticated solution can be implemented by YARN directly. > Jobs fail since nm-local-dir not cleaned up when rogue job fills up disk > > > Key: YARN-4011 > URL: https://issues.apache.org/jira/browse/YARN-4011 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.4.0 >Reporter: Ashwin Shankar > > We observed jobs failed since tasks couldn't launch on nodes due to > "java.io.IOException No space left on device". > On digging in further, we found a rogue job which filled up disk. > Specifically it was wrote a lot of map spills(like > attempt_1432082376223_461647_m_000421_0_spill_1.out) to nm-local-dir > causing disk to fill up, and it failed/got killed, but didn't clean up these > files in nm-local-dir. > So the disk remained full, causing subsequent jobs to fail. > This jira is created to address why files under nm-local-dir doesn't get > cleaned up when job fails after filling up disk. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
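One way to picture the fail-fast idea being discussed here: a task can periodically compare the bytes it has written to the local filesystem against a configured cap and abort once the cap is exceeded. The sketch below reads the per-scheme FileSystem statistics that back the FILE_BYTES_WRITTEN counter rather than the counter API itself, and the cap ({{limitBytes}}) is an assumed setting, not an existing MapReduce configuration:
{code}
import org.apache.hadoop.fs.FileSystem;

public final class LocalWriteLimitSketch {
  // Sum bytes written to the "file" scheme (local disk) by this JVM and fail
  // fast once the assumed cap is exceeded, instead of filling nm-local-dir.
  static void checkLocalWriteLimit(long limitBytes) {
    long written = 0;
    for (FileSystem.Statistics stats : FileSystem.getAllStatistics()) {
      if ("file".equals(stats.getScheme())) {
        written += stats.getBytesWritten();
      }
    }
    if (written > limitBytes) {
      throw new RuntimeException("Local filesystem writes (" + written
          + " bytes) exceeded the configured limit of " + limitBytes + " bytes");
    }
  }
}
{code}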
[jira] [Commented] (YARN-4119) Expose the NM bind address as an env, so that AM can make use of it for exposing tracking URL
[ https://issues.apache.org/jira/browse/YARN-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902709#comment-14902709 ] Naganarasimha G R commented on YARN-4119: - The only problem I can see with the above approach is that, by default, it binds to {{NM_WEBAPP_ADDRESS}} and not to all IPs, which is what we had thought of as the default behavior! > Expose the NM bind address as an env, so that AM can make use of it for > exposing tracking URL > -- > > Key: YARN-4119 > URL: https://issues.apache.org/jira/browse/YARN-4119 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > > As described in MAPREDUCE-5938, In many security scanning tools its not > advisable to bind on all network addresses and would be good to bind only on > the desired address. As AM's can run on any of the nodes it would be better > for NM to share its bind address as part of Environment variables to the > container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition
[ https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902662#comment-14902662 ] Sunil G commented on YARN-4140: --- Hi [~bibinchundatt] Thanks for updating patch. Some minor nits: 1. {{incPendingResourcesForLabel}} and {{decPendingResourceForLabel}} need not have to take ResourceRequest as argument. Only label expression is to be passed along with resource. 2. In below code {code} } else { ResourceRequest anyRequest = getResourceRequest(priority, ResourceRequest.ANY); if (anyRequest != null) { request.setNodeLabelExpression(anyRequest.getNodeLabelExpression()); } } {code} for any other resource requests, label expression is set as from anyRequest. One of point here - If user is not specified any label expression, then also we forcefully set {{anyRequest.getNodeLabelExpression()}} in all requests. It can be null too. Such cases can be invalidated. 3. In testResourceRequestUpdateNodePartitions, before sending second changed AM resource request, could you also add few more NODE_LOCAL or RACK_LOCAL (some priority to ANY, and some after ANY). This can help in hitting some more areas in code. > RM container allocation delayed incase of app submitted to Nodelabel partition > -- > > Key: YARN-4140 > URL: https://issues.apache.org/jira/browse/YARN-4140 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, > 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, > 0006-YARN-4140.patch, 0007-YARN-4140.patch, 0008-YARN-4140.patch > > > Trying to run application on Nodelabel partition I found that the > application execution time is delayed by 5 – 10 min for 500 containers . > Total 3 machines 2 machines were in same partition and app submitted to same. > After enabling debug was able to find the below > # From AM the container ask is for OFF-SWITCH > # RM allocating all containers to NODE_LOCAL as shown in logs below. > # So since I was having about 500 containers time taken was about – 6 minutes > to allocate 1st map after AM allocation. 
> # Tested with about 1K maps using PI job took 17 minutes to allocate next > container after AM allocation > Once 500 container allocation on NODE_LOCAL is done the next container > allocation is done on OFF_SWITCH > {code} > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > /default-rack, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: *, Relax > Locality: true, Node Label Expression: 3} > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > host-10-19-92-143, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > host-10-19-92-117, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > {code} > > {code} > 2015-09-09 14:35:45,467 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > 2015-09-09 14:35:45,831 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > 2015-09-09 14:35:46,469 DEBUG > org.apache.hadoop.yarn.server.res
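The second review point in the comment above (only inherit the ANY request's label expression when one is actually set, since a null expression is legitimate) can be sketched roughly as follows; this illustrates the review comment, not the code in the attached patch:
{code}
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public final class LabelExpressionFallbackSketch {
  // Copy the label expression from the ANY request onto a node/rack request
  // only when the request has none of its own and the ANY request's is non-null.
  static void inheritLabelFromAny(ResourceRequest request, ResourceRequest anyRequest) {
    if (request.getNodeLabelExpression() == null && anyRequest != null
        && anyRequest.getNodeLabelExpression() != null) {
      request.setNodeLabelExpression(anyRequest.getNodeLabelExpression());
    }
  }
}
{code}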
[jira] [Commented] (YARN-4119) Expose the NM bind address as an env, so that AM can make use of it for exposing tracking URL
[ https://issues.apache.org/jira/browse/YARN-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902631#comment-14902631 ] Naganarasimha G R commented on YARN-4119: - Hi [~vvasudev] & [~rohithsharma], While looking into the modifications I came across a few things: # {{ContainerLaunch.sanitizeEnv}} already adds {{NM_HOST}} to the environment of a container launch script. The {{NM_HOST}} value added as an env is obtained from the NM's NodeId.getHost(), and the NodeID is set in {{ContainerManagerImpl.serviceStart}} using {{yarn.nodemanager.address}}. So I was a little skeptical about reusing this existing env param, because even when a bind address is configured, containers would still get NM_HOST's address. # As per YARN-1994, {{NM_BIND_HOST}} is generally used to set {{0.0.0.0}} in a {{Multi homing/interface}} environment on the server side, but a user can set an individual address too. So it would be ideal to expose this, but one concern I have is: what if it is not set? As per my understanding we would then need to use the address part of {{NM_WEBAPP_ADDRESS/NM_WEBAPP_HTTPS_ADDRESS}} based on the scheme. So my idea is: * expose a new ENV, {{AM_BIND_ADDR}} * set it to {{NM_BIND_HOST}} if that is set * if not set, fall back to {{NM_WEBAPP_ADDRESS/NM_WEBAPP_HTTPS_ADDRESS}} based on the scheme. Thoughts? > Expose the NM bind address as an env, so that AM can make use of it for > exposing tracking URL > -- > > Key: YARN-4119 > URL: https://issues.apache.org/jira/browse/YARN-4119 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > > As described in MAPREDUCE-5938, In many security scanning tools its not > advisable to bind on all network addresses and would be good to bind only on > the desired address. As AM's can run on any of the nodes it would be better > for NM to share its bind address as part of Environment variables to the > container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
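On the consumer side, the proposal above would let an AM choose its tracking-URL bind address from its environment instead of binding to all interfaces. A small usage sketch; {{AM_BIND_ADDR}} is only the name proposed in this comment (not an existing YARN constant), and falling back to {{NM_HOST}} and then the wildcard address is an assumption for illustration:
{code}
public final class AmBindAddressSketch {
  // Resolve the address the AM's web server should bind to: the proposed
  // AM_BIND_ADDR env if present, else the NM_HOST env the NM already exports,
  // else the wildcard address.
  static String resolveBindAddress() {
    String bindAddr = System.getenv("AM_BIND_ADDR");
    if (bindAddr == null || bindAddr.isEmpty()) {
      bindAddr = System.getenv("NM_HOST");
    }
    return (bindAddr == null || bindAddr.isEmpty()) ? "0.0.0.0" : bindAddr;
  }
}
{code}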
[jira] [Updated] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition
[ https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-4140: --- Attachment: 0008-YARN-4140.patch Hi [~leftnoteasy] Could you please review the attached patch? When labels are updated for *any*, the pending resource usage for the queue and the app also needs to be updated, right? I have changed the patch based on that. > RM container allocation delayed incase of app submitted to Nodelabel partition > -- > > Key: YARN-4140 > URL: https://issues.apache.org/jira/browse/YARN-4140 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, > 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, > 0006-YARN-4140.patch, 0007-YARN-4140.patch, 0008-YARN-4140.patch > > > Trying to run application on Nodelabel partition I found that the > application execution time is delayed by 5 – 10 min for 500 containers . > Total 3 machines 2 machines were in same partition and app submitted to same. > After enabling debug was able to find the below > # From AM the container ask is for OFF-SWITCH > # RM allocating all containers to NODE_LOCAL as shown in logs below. > # So since I was having about 500 containers time taken was about – 6 minutes > to allocate 1st map after AM allocation. > # Tested with about 1K maps using PI job took 17 minutes to allocate next > container after AM allocation > Once 500 container allocation on NODE_LOCAL is done the next container > allocation is done on OFF_SWITCH > {code} > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > /default-rack, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: *, Relax > Locality: true, Node Label Expression: 3} > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > host-10-19-92-143, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > host-10-19-92-117, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > {code} > > {code} > 2015-09-09 14:35:45,467 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > 2015-09-09 14:35:45,831 DEBUG > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > 2015-09-09 14:35:46,469 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > 2015-09-09 14:35:46,832 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > {code} > {code} > dsperf@host-127:/opt/bibin/dsperf/HAINSTALL/install/hadoop/resourcemanager/logs1> > cat hadoop-dsperf-resourcemanager-host-127.log | grep "NODE_LOCAL" | grep > "root.b.b1" | wc -l > 500 > {code} > > (Consumes about 6 minutes) > -- This message was sent by Atlassian JIRA (v6.3.4#6332
[jira] [Commented] (YARN-4141) Runtime Application Priority change should not throw exception for applications at finishing states
[ https://issues.apache.org/jira/browse/YARN-4141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902503#comment-14902503 ] Sunil G commented on YARN-4141: --- Hi [~jlowe] and [~rohithsharma] Could you please help to check the updated patch. > Runtime Application Priority change should not throw exception for > applications at finishing states > --- > > Key: YARN-4141 > URL: https://issues.apache.org/jira/browse/YARN-4141 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Sunil G >Assignee: Sunil G > Attachments: 0001-YARN-4141.patch, 0002-YARN-4141.patch, > 0003-YARN-4141.patch, 0004-YARN-4141.patch, 0005-YARN-4141.patch > > > As suggested by [~jlowe] in > [MAPREDUCE-5870-comment|https://issues.apache.org/jira/browse/MAPREDUCE-5870?focusedCommentId=14737035&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14737035] > , its good that if YARN can suppress exceptions during change application > priority calls for applications at its finishing stages. > Currently it will be difficult for clients to handle this. This will be > similar to kill application behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902369#comment-14902369 ] Naganarasimha G R commented on YARN-1994: - Hi [~cwelch] & [~arpitagarwal], I have few doubts for configuring NM_BIND_HOST & NM_ADDRESS as per the existing trunk/branch 2 code {code} if (bindHost != null && !bindHost.isEmpty() && nmAddress != null && !nmAddress.isEmpty()) { hostOverride = nmAddress.split(":")[0]; } // setup node ID InetSocketAddress connectAddress; if (delayedRpcServerStart) { connectAddress = NetUtils.getConnectAddress(initialAddress); } else { server.start(); connectAddress = NetUtils.getConnectAddress(server); } NodeId nodeId = buildNodeId(connectAddress, hostOverride); {code} # IIUC if NM_BIND_HOST is 0.0.0.0 then NM_ADDRESS's host part needs to be used for NODE_ID but what if proper IP is configured for NM_BIND_HOST then is it correct to take NM_ADDRESS's host part ? Is it assumed that NM_BIND_HOST is configured to specific IP then NM_ADDRESS is also configured to the same IP ? # May be this a layman question why is it required to bind to all/multiple interfaces ? > Expose YARN/MR endpoints on multiple interfaces > --- > > Key: YARN-1994 > URL: https://issues.apache.org/jira/browse/YARN-1994 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager, webapp >Affects Versions: 2.4.0 >Reporter: Arpit Agarwal >Assignee: Craig Welch > Fix For: 2.6.0 > > Attachments: YARN-1994.0.patch, YARN-1994.1.patch, > YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.12.patch, > YARN-1994.13.patch, YARN-1994.14.patch, YARN-1994.15-branch2.patch, > YARN-1994.15.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, > YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch > > > YARN and MapReduce daemons currently do not support specifying a wildcard > address for the server endpoints. This prevents the endpoints from being > accessible from all interfaces on a multihomed machine. > Note that if we do specify INADDR_ANY for any of the options, it will break > clients as they will attempt to connect to 0.0.0.0. We need a solution that > allows specifying a hostname or IP-address for clients while requesting > wildcard bind for the servers. > (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.3.4#6332)