[jira] [Commented] (YARN-3656) LowCost: A Cost-Based Placement Agent for YARN Reservations

2015-06-23 Thread Carlo Curino (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598922#comment-14598922
 ] 

Carlo Curino commented on YARN-3656:


[~jyaniv] please address the checkstyle and whitespace -1 above. The rest is 
looking good.
[~subru] can you comment on the test failure? Is this something that is going 
to be addressed by the work on making the reservation subsystem HA? 

> LowCost: A Cost-Based Placement Agent for YARN Reservations
> ---
>
> Key: YARN-3656
> URL: https://issues.apache.org/jira/browse/YARN-3656
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Ishai Menache
>Assignee: Jonathan Yaniv
>  Labels: capacity-scheduler, resourcemanager
> Attachments: LowCostRayonExternal.pdf, YARN-3656-v1.1.patch, 
> YARN-3656-v1.patch, lowcostrayonexternal_v2.pdf
>
>
> YARN-1051 enables SLA support by allowing users to reserve cluster capacity 
> ahead of time. YARN-1710 introduced a greedy agent for placing user 
> reservations. The greedy agent makes fast placement decisions but at the cost 
> of ignoring the cluster committed resources, which might result in blocking 
> the cluster resources for certain periods of time, and in turn rejecting some 
> arriving jobs.
> We propose LowCost – a new cost-based planning algorithm. LowCost “spreads” 
> the demand of the job throughout the allowed time-window according to a 
> global, load-based cost function. 
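
For readers skimming the thread, a toy sketch of the high-level idea only. This is not the LowCost algorithm from the attached patch and design doc, and all names below are made up: each unit of demand is placed at the currently least-loaded step of the allowed window, so demand gets spread according to a load-based cost.

{code}
import java.util.Arrays;

// Toy illustration of load-based spreading: place each unit of demand at the
// least-loaded step of the allowed window. Not the LowCost algorithm itself.
public class LoadBasedSpreadSketch {
  // load[t] = capacity already committed at time step t (hypothetical input)
  static int[] spread(double[] load, int windowStart, int windowEnd, int demandUnits) {
    int[] placement = new int[demandUnits];
    double[] l = Arrays.copyOf(load, load.length);
    for (int u = 0; u < demandUnits; u++) {
      int best = windowStart;
      for (int t = windowStart; t < windowEnd; t++) {
        if (l[t] < l[best]) {   // the "cost function" here is simply current load
          best = t;
        }
      }
      l[best] += 1.0;           // commit one unit at the cheapest step
      placement[u] = best;
    }
    return placement;
  }

  public static void main(String[] args) {
    double[] committed = {5, 1, 3, 0, 4};
    System.out.println(Arrays.toString(spread(committed, 0, committed.length, 4)));
  }
}
{code}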



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3800) Simplify inmemory state for ReservationAllocation

2015-06-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598860#comment-14598860
 ] 

Hadoop QA commented on YARN-3800:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m 17s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 7 new or modified test files. |
| {color:green}+1{color} | javac |   7m 38s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 36s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   0m 48s | The applied patch generated  1 
new checkstyle issues (total was 54, now 49). |
| {color:green}+1{color} | whitespace |   0m  3s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 37s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 34s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 25s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |  50m 56s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | |  89m 23s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12741412/YARN-3800.004.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 49dfad9 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8330/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8330/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8330/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8330/console |


This message was automatically generated.

> Simplify inmemory state for ReservationAllocation
> -
>
> Key: YARN-3800
> URL: https://issues.apache.org/jira/browse/YARN-3800
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler, fairscheduler, resourcemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-3800.001.patch, YARN-3800.002.patch, 
> YARN-3800.002.patch, YARN-3800.003.patch, YARN-3800.004.patch
>
>
> Instead of storing the ReservationRequest, we store the Resource for 
> allocations, as that's the only thing we need. Ultimately we convert 
> everything to resources anyway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3656) LowCost: A Cost-Based Placement Agent for YARN Reservations

2015-06-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598786#comment-14598786
 ] 

Hadoop QA commented on YARN-3656:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  17m  0s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 2 new or modified test files. |
| {color:green}+1{color} | javac |   8m  1s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m  5s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 26s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   0m 50s | The applied patch generated  2 
new checkstyle issues (total was 12, now 12). |
| {color:red}-1{color} | whitespace |   0m  3s | The patch has 2  line(s) that 
end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 39s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 28s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests |  51m 55s | Tests failed in 
hadoop-yarn-server-resourcemanager. |
| | |  92m  4s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | 
hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12741406/YARN-3656-v1.1.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 49dfad9 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8329/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt
 |
| whitespace | 
https://builds.apache.org/job/PreCommit-YARN-Build/8329/artifact/patchprocess/whitespace.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8329/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8329/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8329/console |


This message was automatically generated.

> LowCost: A Cost-Based Placement Agent for YARN Reservations
> ---
>
> Key: YARN-3656
> URL: https://issues.apache.org/jira/browse/YARN-3656
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Ishai Menache
>Assignee: Jonathan Yaniv
>  Labels: capacity-scheduler, resourcemanager
> Attachments: LowCostRayonExternal.pdf, YARN-3656-v1.1.patch, 
> YARN-3656-v1.patch, lowcostrayonexternal_v2.pdf
>
>
> YARN-1051 enables SLA support by allowing users to reserve cluster capacity 
> ahead of time. YARN-1710 introduced a greedy agent for placing user 
> reservations. The greedy agent makes fast placement decisions but at the cost 
> of ignoring the cluster committed resources, which might result in blocking 
> the cluster resources for certain periods of time, and in turn rejecting some 
> arriving jobs.
> We propose LowCost – a new cost-based planning algorithm. LowCost “spreads” 
> the demand of the job throughout the allowed time-window according to a 
> global, load-based cost function. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations

2015-06-23 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598781#comment-14598781
 ] 

Ted Yu commented on YARN-3815:
--

[~jrottinghuis]:
Your description makes sense.
Cell tags are supported since HBase 0.98+, so we can use them to mark completion.

> [Aggregation] Application/Flow/User/Queue Level Aggregations
> 
>
> Key: YARN-3815
> URL: https://issues.apache.org/jira/browse/YARN-3815
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: Timeline Service Nextgen Flow, User, Queue Level 
> Aggregations (v1).pdf
>
>
> Per previous discussions in some design documents for YARN-2928, the basic 
> scenario is the query for stats can happen on:
> - Application level, expect return: an application with aggregated stats
> - Flow level, expect return: aggregated stats for a flow_run, flow_version 
> and flow 
> - User level, expect return: aggregated stats for applications submitted by 
> user
> - Queue level, expect return: aggregated stats for applications within the 
> Queue
> Application states is the basic building block for all other level 
> aggregations. We can provide Flow/User/Queue level aggregated statistics info 
> based on application states (a dedicated table for application states is 
> needed which is missing from previous design documents like HBase/Phoenix 
> schema design). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations

2015-06-23 Thread Joep Rottinghuis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598759#comment-14598759
 ] 

Joep Rottinghuis commented on YARN-3815:


Thanks [~ted_yu] for that link. I did find that code and I'm reading through it.
Yes, it uses a coprocessor on the read side to "collapse" values together, and 
permanently "collapses" them on compaction.

I want to use a similar approach here. We cannot use the delta write directly 
as-is for the following reasons:
- For running applications, if we wanted to write only the increment, the AM (or 
ATS writer) would have to keep track of the previous values in order to write 
the increment only. When the AM crashes and/or the ATS writer restarts, we won't 
know what previous value we had written (or what has already been aggregated). 
So we'd have to write the increment plus the latest value.
- Ergo, why don't we just write the latest value to begin with and leave off 
the increment? But then we cannot "collapse" the deltas / latest values until the 
application is done. Otherwise we would again lose track of what was 
previously aggregated.
So the new approach would be to write the latest value for an app and indicate 
(using a cell tag) that the app is done and that it can be collapsed. We 
would use the coprocessor only on the read side, just like with the delta write, 
and that coprocessor would aggregate values on the fly for reads and collapse 
them during writes. Those writes would be limited to one single row, so we wouldn't 
have any weird cross-region locking issues, nor delays and hiccups in the write 
throughput.
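
To make the proposed read path concrete, a minimal sketch of just the aggregation logic, in plain Java rather than the HBase coprocessor API; the boolean "done" flag stands in for the proposed cell tag, and all names are illustrative:

{code}
import java.util.HashMap;
import java.util.Map;

// Sketch of the read-side "collapse" idea: running apps contribute their latest
// value on every read; values tagged as final are folded into a single total.
public class FlowAggregationSketch {
  private double collapsedTotal = 0.0;                  // already-collapsed (final) apps
  private final Map<String, Double> latestByApp = new HashMap<String, Double>();

  // Called as cells are scanned: value is the latest metric value written for appId,
  // done mirrors the proposed cell tag marking that the app has finished.
  void accept(String appId, double value, boolean done) {
    if (done) {
      latestByApp.remove(appId);
      collapsedTotal += value;                          // safe to fold in permanently
    } else {
      latestByApp.put(appId, value);                    // keep only the latest value
    }
  }

  // What a read would return: collapsed total plus the latest value of each running app.
  double aggregatedValue() {
    double sum = collapsedTotal;
    for (double v : latestByApp.values()) {
      sum += v;
    }
    return sum;
  }
}
{code}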

> [Aggregation] Application/Flow/User/Queue Level Aggregations
> 
>
> Key: YARN-3815
> URL: https://issues.apache.org/jira/browse/YARN-3815
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: Timeline Service Nextgen Flow, User, Queue Level 
> Aggregations (v1).pdf
>
>
> Per previous discussions in some design documents for YARN-2928, the basic 
> scenario is the query for stats can happen on:
> - Application level, expect return: an application with aggregated stats
> - Flow level, expect return: aggregated stats for a flow_run, flow_version 
> and flow 
> - User level, expect return: aggregated stats for applications submitted by 
> user
> - Queue level, expect return: aggregated stats for applications within the 
> Queue
> Application states is the basic building block for all other level 
> aggregations. We can provide Flow/User/Queue level aggregated statistics info 
> based on application states (a dedicated table for application states is 
> needed which is missing from previous design documents like HBase/Phoenix 
> schema design). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-221) NM should provide a way for AM to tell it not to aggregate logs.

2015-06-23 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598710#comment-14598710
 ] 

Xuan Gong commented on YARN-221:


I think that we could have this configuration
{code}
<property>
  <name>yarn.container-log-aggregation-policy.class</name>
  <value>org.apache.hadoop.yarn.container-log-aggregation-policy.SampleRateContainerLogAggregationPolicy</value>
</property>
{code}
which can be used as the default log-aggregation policy. If users do not 
specify the policy class in the ASC, the default policy will be used.

But maybe we do not need this one to specify the policy parameters:
{code}
<property>
  <name>yarn.container-log-aggregation-policy.class.SampleRateContainerLogAggregationPolicy</name>
  <value>SR:0.2</value>
</property>
{code}
Instead, we could set the default value for the policy. 

Also, in AppLogAggregator.java (from the NM), after we parse the policy from the ASC, 
we should call 
ContainerLogAggregationPolicy.parseParameters(ASC.logAggregationContext.getParameters()).

Others are fine to me.
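
To illustrate the parameter handling being discussed, a minimal sketch of how a sample-rate policy might parse the proposed "SR:0.2" parameter string and fall back to a default; the class, method, and default below are hypothetical, since the policy API is still only a proposal in this thread:

{code}
// Sketch of how a sample-rate policy might parse its parameter string, assuming
// the "SR:0.2" format from the example above; these names are proposals from
// this thread, not an existing YARN API.
public class SampleRateParsingSketch {
  static final double DEFAULT_SAMPLE_RATE = 1.0;   // hypothetical default when no parameter is set

  static double parseSampleRate(String parameters) {
    if (parameters == null || !parameters.startsWith("SR:")) {
      return DEFAULT_SAMPLE_RATE;
    }
    try {
      double rate = Double.parseDouble(parameters.substring("SR:".length()));
      return (rate >= 0.0 && rate <= 1.0) ? rate : DEFAULT_SAMPLE_RATE;
    } catch (NumberFormatException e) {
      return DEFAULT_SAMPLE_RATE;
    }
  }

  public static void main(String[] args) {
    System.out.println(parseSampleRate("SR:0.2"));   // 0.2
    System.out.println(parseSampleRate(null));       // falls back to the default
  }
}
{code}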

> NM should provide a way for AM to tell it not to aggregate logs.
> 
>
> Key: YARN-221
> URL: https://issues.apache.org/jira/browse/YARN-221
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, nodemanager
>Reporter: Robert Joseph Evans
>Assignee: Ming Ma
> Attachments: YARN-221-trunk-v1.patch, YARN-221-trunk-v2.patch, 
> YARN-221-trunk-v3.patch, YARN-221-trunk-v4.patch, YARN-221-trunk-v5.patch
>
>
> The NodeManager should provide a way for an AM to tell it that either the 
> logs should not be aggregated, that they should be aggregated with a high 
> priority, or that they should be aggregated but with a lower priority.  The 
> AM should be able to do this in the ContainerLaunch context to provide a 
> default value, but should also be able to update the value when the container 
> is released.
> This would allow for the NM to not aggregate logs in some cases, and avoid 
> connection to the NN at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3843) Fair Scheduler should not accept apps with space keys as queue name

2015-06-23 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598674#comment-14598674
 ] 

zhihai xu commented on YARN-3843:
-

[~dongwook], thanks for the confirmation!

> Fair Scheduler should not accept apps with space keys as queue name
> ---
>
> Key: YARN-3843
> URL: https://issues.apache.org/jira/browse/YARN-3843
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.4.0, 2.5.0
>Reporter: Dongwook Kwon
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: YARN-3843.01.patch
>
>
> As in YARN-461, since an empty-string queue name is not valid, queue names made 
> of spaces such as " " or "   " should not be accepted either, nor should spaces 
> be accepted as a prefix or suffix. 
> e.g.) "root.test.queuename  ", or "root.test. queuename"
> I have 2 specific cases that kill the RM with spaces as part of the queue name.
> 1) Without a placement policy (hadoop 2.4.0 and above): 
> a job is submitted with " " (a space) as the queue name,
> e.g.) mapreduce.job.queuename=" "
> 2) With a placement policy (hadoop 2.5.0 and above): 
> a job is first submitted with a queue name without spaces, and then another 
> job is submitted with a space in the queue name.
> e.g.) 1st time: mapreduce.job.queuename="root.test.user1" 
> 2nd time: mapreduce.job.queuename="root.test.user1 "
> {code}
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.974 sec <<< 
> FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler
> testQueueNameWithSpace(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler)
>   Time elapsed: 0.724 sec  <<< ERROR!
> org.apache.hadoop.metrics2.MetricsException: Metrics source 
> QueueMetrics,q0=root,q1=adhoc,q2=birvine already exists!
>   at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:135)
>   at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:112)
>   at 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:218)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:96)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueue.(FSQueue.java:56)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.(FSLeafQueue.java:66)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.createQueue(QueueManager.java:169)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getQueue(QueueManager.java:120)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:88)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.assignToQueue(FairScheduler.java:660)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplication(FairScheduler.java:569)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1127)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler.testQueueNameWithSpace(TestFairScheduler.java:627)
> {code}
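
A minimal sketch of the kind of validation the summary asks for, not the attached YARN-3843.01.patch; the class and method names are made up:

{code}
// Sketch of the validation described above: reject queue names that are empty
// or contain leading/trailing/embedded whitespace.
public class QueueNameValidationSketch {
  static boolean isValidQueueName(String name) {
    if (name == null || name.isEmpty()) {
      return false;
    }
    // " ", "root.test.queuename  " and "root.test. queuename" are all rejected
    return name.equals(name.trim()) && !name.contains(" ");
  }

  public static void main(String[] args) {
    System.out.println(isValidQueueName("root.test.user1"));   // true
    System.out.println(isValidQueueName("root.test.user1 "));  // false
    System.out.println(isValidQueueName(" "));                 // false
  }
}
{code}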



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-221) NM should provide a way for AM to tell it not to aggregate logs.

2015-06-23 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598663#comment-14598663
 ] 

Ming Ma commented on YARN-221:
--

Thanks [~xgong]. How about the following?

* Allow applications to specify the policy parameter via LogAggregationContext 
along with the policy class.

{noformat}
public abstract class LogAggregationContext {
  public void setContainerLogPolicyClass(Class logPolicy);
  public Class getContainerLogPolicyClass();
  public void setParameters(String parameters);
  public String getParameters();
}
{noformat}

* The NM uses default cluster-wide settings via the following configurations. MR 
can override these configurations on a per-application basis.

{noformat}

<property>
  <name>yarn.container-log-aggregation-policy.class</name>
  <value>org.apache.hadoop.yarn.container-log-aggregation-policy.SampleRateContainerLogAggregationPolicy</value>
</property>

<property>
  <name>yarn.container-log-aggregation-policy.class.SampleRateContainerLogAggregationPolicy</name>
  <value>SR:0.2</value>
</property>

{noformat}

* To support per-application policy, modify MR YarnRunner. We can also modify 
YarnClientImpl to read these configurations and set 
ApplicationSubmissionContext accordingly.

* The log aggregation policy object loaded in the NM can be shared among different 
applications as long as they use the same policy class with the same 
parameters.
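
To illustrate the last bullet, a minimal sketch of sharing one policy instance per (class, parameters) pair; ContainerLogAggregationPolicy and parseParameters are modeled after the proposal above and do not exist in YARN yet:

{code}
import java.util.HashMap;
import java.util.Map;

// Sketch of the "share one policy object per (class, parameters) pair" idea;
// the policy interface is a placeholder for the API proposed in this thread.
public class PolicyCacheSketch {
  interface ContainerLogAggregationPolicy {
    void parseParameters(String parameters);
  }

  private final Map<String, ContainerLogAggregationPolicy> cache =
      new HashMap<String, ContainerLogAggregationPolicy>();

  synchronized ContainerLogAggregationPolicy getOrCreate(
      Class<? extends ContainerLogAggregationPolicy> clazz, String parameters)
      throws InstantiationException, IllegalAccessException {
    String key = clazz.getName() + "#" + (parameters == null ? "" : parameters);
    ContainerLogAggregationPolicy policy = cache.get(key);
    if (policy == null) {
      policy = clazz.newInstance();           // one shared instance per (class, parameters) pair
      policy.parseParameters(parameters);     // e.g. "SR:0.2" for the sample-rate policy
      cache.put(key, policy);
    }
    return policy;
  }
}
{code}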

> NM should provide a way for AM to tell it not to aggregate logs.
> 
>
> Key: YARN-221
> URL: https://issues.apache.org/jira/browse/YARN-221
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, nodemanager
>Reporter: Robert Joseph Evans
>Assignee: Ming Ma
> Attachments: YARN-221-trunk-v1.patch, YARN-221-trunk-v2.patch, 
> YARN-221-trunk-v3.patch, YARN-221-trunk-v4.patch, YARN-221-trunk-v5.patch
>
>
> The NodeManager should provide a way for an AM to tell it that either the 
> logs should not be aggregated, that they should be aggregated with a high 
> priority, or that they should be aggregated but with a lower priority.  The 
> AM should be able to do this in the ContainerLaunch context to provide a 
> default value, but should also be able to update the value when the container 
> is released.
> This would allow for the NM to not aggregate logs in some cases, and avoid 
> connection to the NN at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3800) Simplify inmemory state for ReservationAllocation

2015-06-23 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-3800:

Attachment: YARN-3800.004.patch

> Simplify inmemory state for ReservationAllocation
> -
>
> Key: YARN-3800
> URL: https://issues.apache.org/jira/browse/YARN-3800
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler, fairscheduler, resourcemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-3800.001.patch, YARN-3800.002.patch, 
> YARN-3800.002.patch, YARN-3800.003.patch, YARN-3800.004.patch
>
>
> Instead of storing the ReservationRequest, we store the Resource for 
> allocations, as that's the only thing we need. Ultimately we convert 
> everything to resources anyway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3800) Simplify inmemory state for ReservationAllocation

2015-06-23 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598647#comment-14598647
 ] 

Anubhav Dhoot commented on YARN-3800:
-

Addressed feedback

> Simplify inmemory state for ReservationAllocation
> -
>
> Key: YARN-3800
> URL: https://issues.apache.org/jira/browse/YARN-3800
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler, fairscheduler, resourcemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-3800.001.patch, YARN-3800.002.patch, 
> YARN-3800.002.patch, YARN-3800.003.patch, YARN-3800.004.patch
>
>
> Instead of storing the ReservationRequest, we store the Resource for 
> allocations, as that's the only thing we need. Ultimately we convert 
> everything to resources anyway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3656) LowCost: A Cost-Based Placement Agent for YARN Reservations

2015-06-23 Thread Jonathan Yaniv (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598623#comment-14598623
 ] 

Jonathan Yaniv commented on YARN-3656:
--

Thanks Carlo. I attached a new version of the patch (v1.1), in which we also 
implement GreedyReservationAgent using our algorithmic framework.
We verified that the behavior of the new version is identical to the original 
via simulations (= the implementations generated identical allocations) and 
unit tests (= the implementations behaved similarly on corner cases). We also 
ran test-patch locally on v1.1 of the patch and got +1.

> LowCost: A Cost-Based Placement Agent for YARN Reservations
> ---
>
> Key: YARN-3656
> URL: https://issues.apache.org/jira/browse/YARN-3656
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Ishai Menache
>Assignee: Jonathan Yaniv
>  Labels: capacity-scheduler, resourcemanager
> Attachments: LowCostRayonExternal.pdf, YARN-3656-v1.1.patch, 
> YARN-3656-v1.patch, lowcostrayonexternal_v2.pdf
>
>
> YARN-1051 enables SLA support by allowing users to reserve cluster capacity 
> ahead of time. YARN-1710 introduced a greedy agent for placing user 
> reservations. The greedy agent makes fast placement decisions but at the cost 
> of ignoring the cluster committed resources, which might result in blocking 
> the cluster resources for certain periods of time, and in turn rejecting some 
> arriving jobs.
> We propose LowCost – a new cost-based planning algorithm. LowCost “spreads” 
> the demand of the job throughout the allowed time-window according to a 
> global, load-based cost function. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3656) LowCost: A Cost-Based Placement Agent for YARN Reservations

2015-06-23 Thread Jonathan Yaniv (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Yaniv updated YARN-3656:
-
Attachment: YARN-3656-v1.1.patch

> LowCost: A Cost-Based Placement Agent for YARN Reservations
> ---
>
> Key: YARN-3656
> URL: https://issues.apache.org/jira/browse/YARN-3656
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Ishai Menache
>Assignee: Jonathan Yaniv
>  Labels: capacity-scheduler, resourcemanager
> Attachments: LowCostRayonExternal.pdf, YARN-3656-v1.1.patch, 
> YARN-3656-v1.patch, lowcostrayonexternal_v2.pdf
>
>
> YARN-1051 enables SLA support by allowing users to reserve cluster capacity 
> ahead of time. YARN-1710 introduced a greedy agent for placing user 
> reservations. The greedy agent makes fast placement decisions but at the cost 
> of ignoring the cluster committed resources, which might result in blocking 
> the cluster resources for certain periods of time, and in turn rejecting some 
> arriving jobs.
> We propose LowCost – a new cost-based planning algorithm. LowCost “spreads” 
> the demand of the job throughout the allowed time-window according to a 
> global, load-based cost function. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3800) Simplify inmemory state for ReservationAllocation

2015-06-23 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598612#comment-14598612
 ] 

Subru Krishnan commented on YARN-3800:
--

Thanks [~adhoot] for the updated patch. Overall it looks good, a few minor nits:
   * Can we rename _ReservationUtil_ to _ReservationSystemUtil_ to avoid 
confusion.
   * In _TestInMemoryPlan_, can we use *allocations* instead of *allocs* to 
minimize the diff.
   * In _TestInMemoryReservationAllocation_, we can continue using the previous 
constructor for non-gang allocations as the flag is required only for gang.
   * There is a redundant format change in _TestInMemoryReservationAllocation_ :
bq. -Assert.assertEquals(allocations, rAllocation.getAllocationRequests());
+Assert.assertEquals(allocations,
+rAllocation.getAllocationRequests());

> Simplify inmemory state for ReservationAllocation
> -
>
> Key: YARN-3800
> URL: https://issues.apache.org/jira/browse/YARN-3800
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler, fairscheduler, resourcemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-3800.001.patch, YARN-3800.002.patch, 
> YARN-3800.002.patch, YARN-3800.003.patch
>
>
> Instead of storing the ReservationRequest, we store the Resource for 
> allocations, as that's the only thing we need. Ultimately we convert 
> everything to resources anyway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations

2015-06-23 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598590#comment-14598590
 ] 

Sangjin Lee commented on YARN-3815:
---

Moving from offline discussions...

Now aggregation of *time series metrics* is rather tricky, and needs to be 
defined. Would an aggregated metric (e.g. at the flow level) of time series 
metrics (e.g. at the app level) be a time series itself? I see several problems 
with defining that as a time series. Individual app time series may be sampled 
at different times, and it's not clear what time series the aggregated flow 
metric would be.

I think it might be simpler to say that an aggregated flow metric of time 
series may not need to be a time series itself.

To begin with, there is a general issue of what time the aggregated values 
belong to, regardless of whether they are time series or not. If all leaf values 
were recorded at the same time, it would be unambiguous: the aggregated metric 
value would carry that same time. However, that is rarely the case.

I think the current implicit behavior in hadoop is simply to take the latest 
values and add them up. One example is the MR counters (task level and job 
level). The task level counters are obtained at different times. Still, the 
corresponding job counters are simply sums of all the latest task counters, 
although they may have been taken at different times. We're basically taking 
that as an approximation that's "good enough". In the end, the final numbers 
will become accurate. In other words, the final values would truly be the 
accurate aggregate values.

Time series basically add another wrinkle to this. In the case of a simple 
value, the final values are going to be correct, so this problem is less of an 
issue, but time series will retain intermediate values. Furthermore, their 
publishing interval may have no relationship with the publishing interval of 
the leaf values. I think the baseline approach should be either (1) do not use 
time series for the aggregated metrics, or (2) just do a best-effort 
approximation by adding up the latest leaf values and storing the result with 
its own timestamp.
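
A minimal sketch of option (2), using placeholder types rather than the timeline service API: sum the latest value of each leaf series and stamp the result with the aggregation time.

{code}
import java.util.Map;

// Sketch of the best-effort aggregate point: sum the latest value of each leaf
// (e.g. per-app) series, stamped with the aggregation time. All names are
// illustrative; this is not the timeline service API.
public class BestEffortAggregationSketch {
  static class AggregatedPoint {
    final long timestamp;
    final double value;
    AggregatedPoint(long timestamp, double value) {
      this.timestamp = timestamp;
      this.value = value;
    }
  }

  // latestLeafValues: appId -> latest metric value, possibly sampled at different times
  static AggregatedPoint aggregate(Map<String, Double> latestLeafValues) {
    double sum = 0.0;
    for (double v : latestLeafValues.values()) {
      sum += v;                                  // approximation: values may be from different times
    }
    return new AggregatedPoint(System.currentTimeMillis(), sum);  // its own timestamp, not the leaves'
  }
}
{code}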

> [Aggregation] Application/Flow/User/Queue Level Aggregations
> 
>
> Key: YARN-3815
> URL: https://issues.apache.org/jira/browse/YARN-3815
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: Timeline Service Nextgen Flow, User, Queue Level 
> Aggregations (v1).pdf
>
>
> Per previous discussions in some design documents for YARN-2928, the basic 
> scenario is the query for stats can happen on:
> - Application level, expect return: an application with aggregated stats
> - Flow level, expect return: aggregated stats for a flow_run, flow_version 
> and flow 
> - User level, expect return: aggregated stats for applications submitted by 
> user
> - Queue level, expect return: aggregated stats for applications within the 
> Queue
> Application states is the basic building block for all other level 
> aggregations. We can provide Flow/User/Queue level aggregated statistics info 
> based on application states (a dedicated table for application states is 
> needed which is missing from previous design documents like HBase/Phoenix 
> schema design). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations

2015-06-23 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598577#comment-14598577
 ] 

Sangjin Lee commented on YARN-3815:
---

{quote}
About flow online aggregation, I am not quite sure on requirement yet. Do we 
really want real time for flow aggregated data or some fine-grained time 
interval (like 15 secs) should be good enough - if we want to show some nice 
metrics chart for flow, this should be fine.
{quote}

Yes, I agree with that. When I said "real time", I didn't mean real time in 
the sense that every metric is accurate to the second. Most likely the raw data 
themselves (e.g. container data) are written on an interval anyway. Some type 
of time interval for aggregation is implied.

{quote}
Any special reason not to handle it in the same way above - as HBase 
coprocessor? It just sound like gross-grained time interval. Isn't it?
{quote}

I do see your point in that what I called the "real time" aggregation can be 
considered the same type of aggregation as the "offline" aggregation only on a 
shorter time interval. However, we also need to think about the use cases of 
such aggregated data.

The former type of aggregation is very much something that can be plugged into 
a UI such as the RM UI or Ambari to show more immediate data. These data may 
change as the user refreshes the UI. So this is closer to the raw data.

On the other hand, the latter type of aggregation lends itself to more 
analytical and ad-hoc analysis of data. These can be used for calculating 
chargebacks, usage trending, reporting, etc. Perhaps it could even contain more 
detailed info than the "real time" aggregated data for the reporting/data 
mining purposes. And that's where we would like to consider using phoenix to 
enable arbitrary ad-hoc SQL queries.

One analogy [~jrottinghuis] brings up regarding this is OLTP v. OLAP.

That's why we also think it makes sense to do only "offline" (time-based) 
aggregation for users and queues. At least in our case with hRaven, there 
hasn't been a compelling reason to show user- or queue-aggregated data in 
semi-real time. It has been perfectly adequate to show time-based aggregation, 
as data like this tend to be used more for reporting and analysis.

> [Aggregation] Application/Flow/User/Queue Level Aggregations
> 
>
> Key: YARN-3815
> URL: https://issues.apache.org/jira/browse/YARN-3815
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: Timeline Service Nextgen Flow, User, Queue Level 
> Aggregations (v1).pdf
>
>
> Per previous discussions in some design documents for YARN-2928, the basic 
> scenario is the query for stats can happen on:
> - Application level, expect return: an application with aggregated stats
> - Flow level, expect return: aggregated stats for a flow_run, flow_version 
> and flow 
> - User level, expect return: aggregated stats for applications submitted by 
> user
> - Queue level, expect return: aggregated stats for applications within the 
> Queue
> Application states is the basic building block for all other level 
> aggregations. We can provide Flow/User/Queue level aggregated statistics info 
> based on application states (a dedicated table for application states is 
> needed which is missing from previous design documents like HBase/Phoenix 
> schema design). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3069) Document missing properties in yarn-default.xml

2015-06-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598565#comment-14598565
 ] 

Hadoop QA commented on YARN-3069:
-

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  21m 25s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 48s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 52s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | site |   3m  4s | Site still builds. |
| {color:green}+1{color} | checkstyle |   2m  4s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  1s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 36s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   3m 26s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | common tests |  23m 11s | Tests passed in 
hadoop-common. |
| {color:green}+1{color} | yarn tests |   1m 57s | Tests passed in 
hadoop-yarn-common. |
| | |  75m 23s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12741380/YARN-3069.013.patch |
| Optional Tests | site javadoc javac unit findbugs checkstyle |
| git revision | trunk / 122cad6 |
| hadoop-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8328/artifact/patchprocess/testrun_hadoop-common.txt
 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8328/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8328/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf908.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8328/console |


This message was automatically generated.

> Document missing properties in yarn-default.xml
> ---
>
> Key: YARN-3069
> URL: https://issues.apache.org/jira/browse/YARN-3069
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Reporter: Ray Chiang
>Assignee: Ray Chiang
>  Labels: BB2015-05-TBR, supportability
> Attachments: YARN-3069.001.patch, YARN-3069.002.patch, 
> YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, 
> YARN-3069.006.patch, YARN-3069.007.patch, YARN-3069.008.patch, 
> YARN-3069.009.patch, YARN-3069.010.patch, YARN-3069.011.patch, 
> YARN-3069.012.patch, YARN-3069.013.patch
>
>
> The following properties are currently not defined in yarn-default.xml.  
> These properties should either be
>   A) documented in yarn-default.xml OR
>   B)  listed as an exception (with comments, e.g. for internal use) in the 
> TestYarnConfigurationFields unit test
> Any comments for any of the properties below are welcome.
>   org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker
>   org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore
>   security.applicationhistory.protocol.acl
>   yarn.app.container.log.backups
>   yarn.app.container.log.dir
>   yarn.app.container.log.filesize
>   yarn.client.app-submission.poll-interval
>   yarn.client.application-client-protocol.poll-timeout-ms
>   yarn.is.minicluster
>   yarn.log.server.url
>   yarn.minicluster.control-resource-monitoring
>   yarn.minicluster.fixed.ports
>   yarn.minicluster.use-rpc
>   yarn.node-labels.fs-store.retry-policy-spec
>   yarn.node-labels.fs-store.root-dir
>   yarn.node-labels.manager-class
>   yarn.nodemanager.container-executor.os.sched.priority.adjustment
>   yarn.nodemanager.container-monitor.process-tree.class
>   yarn.nodemanager.disk-health-checker.enable
>   yarn.nodemanager.docker-container-executor.image-name
>   yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms
>   yarn.nodemanager.linux-container-executor.group
>   yarn.nodemanager.log.deletion-threads-count
>   yarn.nodemanager.user-home-dir
>   yarn.nodemanager.webapp.https.address
>   yarn.nodemanager.webapp.spnego-keytab-file
>   yarn.nodemanager.webapp.spnego-principal
>   yarn.nodemanager.windows-secure-container-executor.group

[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations

2015-06-23 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598556#comment-14598556
 ] 

Sangjin Lee commented on YARN-3815:
---

{quote}
AM currently leverage YARN's AppTimelineCollector to forward entities to 
backend storage, so making AM talk directly to backend storage is not 
considered to be safe.
{quote}

Just to be clear, I'm *not* proposing AMs writing directly to the backend 
storage. AMs continue to write through the app-level timeline collector. My 
proposal is that the AMs are responsible for setting the aggregated 
framework-specific metric values on the *YARN application entities*.

Let's consider the example of MR. MR itself would have its own entities such as 
job, tasks, and task attempts. These are distinct entities from the YARN 
entities such as application, app attempts, and containers. We can either (1) 
have the MR AM set framework-specific metric values at the YARN container 
entities and have YARN aggregate them to applications, or (2) have the MR AM 
set the aggregated values on the applications for itself.

I feel the latter approach is conceptually cleaner. The framework is ultimately 
responsible for its metrics (YARN doesn't even know what metrics there are). We 
could decide that YARN looks at the framework-specific metrics at the app 
level and aggregates them from the app level onward to flows, users, and queues.

In addition, most frameworks already have an aggregated view of the metrics. It 
would be very straightforward to emit them at the app level.

In summary, option (1) asks the framework to write metrics on its own entities 
(job, tasks, task attempts) plus YARN container entities. Option (2) asks the 
framework to write metrics on its own entities (job, tasks, task attempts) plus 
YARN app entities. IMO, the latter is a more reliable approach. We can discuss 
this further...
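
A minimal sketch of option (2) with placeholder types (the actual YARN-2928 entity and metric classes are still evolving): the framework AM aggregates its own task-level counters and sets the result on the YARN application entity that it publishes through the app-level collector.

{code}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Placeholder types; the actual YARN-2928 timeline entity/metric classes differ.
public class AppLevelMetricSketch {
  static class ApplicationEntity {
    final String appId;
    final Map<String, Long> metrics = new HashMap<String, Long>();
    ApplicationEntity(String appId) { this.appId = appId; }
  }

  // Option (2): the framework (e.g. the MR AM) aggregates its own task counters
  // and sets the result directly on the YARN application entity.
  static ApplicationEntity publishAggregated(String appId, String metricName,
                                             List<Long> taskCounterValues) {
    long total = 0;
    for (long v : taskCounterValues) {
      total += v;                                // framework-specific aggregation
    }
    ApplicationEntity appEntity = new ApplicationEntity(appId);
    appEntity.metrics.put(metricName, total);    // written via the app-level collector
    return appEntity;
  }
}
{code}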

> [Aggregation] Application/Flow/User/Queue Level Aggregations
> 
>
> Key: YARN-3815
> URL: https://issues.apache.org/jira/browse/YARN-3815
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: Timeline Service Nextgen Flow, User, Queue Level 
> Aggregations (v1).pdf
>
>
> Per previous discussions in some design documents for YARN-2928, the basic 
> scenario is the query for stats can happen on:
> - Application level, expect return: an application with aggregated stats
> - Flow level, expect return: aggregated stats for a flow_run, flow_version 
> and flow 
> - User level, expect return: aggregated stats for applications submitted by 
> user
> - Queue level, expect return: aggregated stats for applications within the 
> Queue
> Application states is the basic building block for all other level 
> aggregations. We can provide Flow/User/Queue level aggregated statistics info 
> based on application states (a dedicated table for application states is 
> needed which is missing from previous design documents like HBase/Phoenix 
> schema design). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3798) ZKRMStateStore shouldn't create new session without occurrance of SESSIONEXPIED

2015-06-23 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598529#comment-14598529
 ] 

Tsuyoshi Ozawa commented on YARN-3798:
--

[~zxu] Do you have any scenarios the latest patch doesn't cover? 

> ZKRMStateStore shouldn't create new session without occurrance of 
> SESSIONEXPIED
> ---
>
> Key: YARN-3798
> URL: https://issues.apache.org/jira/browse/YARN-3798
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3
>Reporter: Bibin A Chundatt
>Assignee: Varun Saxena
>Priority: Blocker
> Attachments: RM.log, YARN-3798-2.7.002.patch, 
> YARN-3798-branch-2.7.002.patch, YARN-3798-branch-2.7.patch
>
>
> RM going down with NoNode exception during create of znode for appattempt
> *Please find the exception logs*
> {code}
> 2015-06-09 10:09:44,732 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> ZKRMStateStore Session connected
> 2015-06-09 10:09:44,732 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> ZKRMStateStore Session restored
> 2015-06-09 10:09:44,886 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
>   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405)
>   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-06-09 10:09:44,887 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed 
> out ZK retries. Giving up!
> 2015-06-09 10:09:44,887 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error 
> updating appAttempt: appattempt_1433764310492_7152_01
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
>   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405)
>   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926)
>   at

[jira] [Updated] (YARN-3840) Resource Manager web ui issue when sorting application by id (with application having id > 9999)

2015-06-23 Thread LINTE (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LINTE updated YARN-3840:

Summary: Resource Manager web ui issue when sorting application by id (with 
application having id > 9999)  (was: Resource Manager web ui issue when sorting 
application by id with id higher than 9999)

> Resource Manager web ui issue when sorting application by id (with 
> application having id > 9999)
> 
>
> Key: YARN-3840
> URL: https://issues.apache.org/jira/browse/YARN-3840
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Centos 6.6
> Java 1.7
>Reporter: LINTE
> Attachments: RMApps.png
>
>
> On the web UI, the global main view page 
> http://resourcemanager:8088/cluster/apps doesn't display applications over 
> 9999.
> With command line it works (# yarn application -list).
> Regards,
> Alexandre



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3840) Resource Manager web ui issue when sorting application by id with id higher than 9999

2015-06-23 Thread LINTE (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LINTE updated YARN-3840:

Summary: Resource Manager web ui issue when sorting application by id with 
id higher than 9999  (was: Resource Manager web ui bug on main view after 
application number 9999)

> Resource Manager web ui issue when sorting application by id with id higher 
> than 9999
> --
>
> Key: YARN-3840
> URL: https://issues.apache.org/jira/browse/YARN-3840
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Centos 6.6
> Java 1.7
>Reporter: LINTE
> Attachments: RMApps.png
>
>
> On the web UI, the global main view page 
> http://resourcemanager:8088/cluster/apps doesn't display applications over 
> 9999.
> With command line it works (# yarn application -list).
> Regards,
> Alexandre



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3840) Resource Manager web ui bug on main view after application number 9999

2015-06-23 Thread LINTE (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598419#comment-14598419
 ] 

LINTE commented on YARN-3840:
-

Hi,

xgong, yes, with YARN version 2.7.0.

devarak.j, yes, I confirm this is an asc/desc sorting issue with application ids 
over 9999.

Regards,




> Resource Manager web ui bug on main view after application number 9999
> --
>
> Key: YARN-3840
> URL: https://issues.apache.org/jira/browse/YARN-3840
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Centos 6.6
> Java 1.7
>Reporter: LINTE
> Attachments: RMApps.png
>
>
> On the web UI, the global main view page 
> http://resourcemanager:8088/cluster/apps doesn't display applications over 
> 9999.
> With command line it works (# yarn application -list).
> Regards,
> Alexandre



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state

2015-06-23 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598384#comment-14598384
 ] 

Jason Lowe commented on YARN-2902:
--

Thanks for updating the patch, Varun!

Is one second enough time for the localizer to tear down if the system is 
heavily loaded, disks are slow, etc.?  I think it would be better for the 
executor to let us know when a localizer has completed rather than assuming 1 
second will be enough time (or too much time).  We can tackle this in a 
followup JIRA since it's a more significant change, as I'm not sure executors 
are tracking localizers today.

There are a number of sleeps in the unit test which we should try to avoid if 
possible.  Is there a reason dispatcher.await() isn't sufficient to avoid the 
races?  At a minimum there should be a comment for each one explaining what 
we're trying to avoid by sleeping.

Nit: I've always interpreted the debug delay to be a delay to execute in 
debugging just before the NM deletes a file.  To be consistent it seems that we 
should be adding the debug delay to any requested delay.  That way the NM will 
always preserve a file for debugDelay seconds _beyond_ what an NM with 
debugDelay=0 seconds would do.

Nit: The TODO in DeletionService about parent being owned by NM, etc. probably 
only needs to be in the delete method that actually does the work rather than 
duplicated in veneer methods.

Nit: Should "Container killed while downloading" be "Container killed while 
localizing"?  We use localizing elsewhere (e.g.: NM log UI when trying to get 
logs of a container that is still localizing).

Nit: "Inorrect path for PRIVATE localization." should be "Incorrect path for 
PRIVATE localization: " to fix typo and add trailing space for subsequent 
filename.  Missing a trailing space on the next log message as well.  Realize 
this was just a pre-existing bug, but it would be nice to fix as part of moving 
the code.
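
As a hedged sketch of the wait-for-condition pattern suggested above for the unit tests (GenericTestUtils.waitFor comes from hadoop-common's test utilities; the localizer-finished check itself is a hypothetical stand-in for whatever the test can actually observe):

{code}
import java.util.concurrent.TimeoutException;
import com.google.common.base.Supplier;
import org.apache.hadoop.test.GenericTestUtils;

// Instead of Thread.sleep(...), drain the dispatcher and then poll for the condition
// the test actually cares about; isLocalizerDone() is a made-up observable condition.
public class WaitInsteadOfSleepSketch {
  interface FakeLocalizerTracker {
    boolean isLocalizerDone();
  }

  static void waitForLocalizerTeardown(final FakeLocalizerTracker tracker)
      throws TimeoutException, InterruptedException {
    GenericTestUtils.waitFor(new Supplier<Boolean>() {
      @Override
      public Boolean get() {
        return tracker.isLocalizerDone();
      }
    }, 100, 10000);   // poll every 100 ms, give up after 10 s
  }
}
{code}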



> Killing a container that is localizing can orphan resources in the 
> DOWNLOADING state
> 
>
> Key: YARN-2902
> URL: https://issues.apache.org/jira/browse/YARN-2902
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: Jason Lowe
>Assignee: Varun Saxena
> Attachments: YARN-2902.002.patch, YARN-2902.03.patch, 
> YARN-2902.04.patch, YARN-2902.patch
>
>
> If a container is in the process of localizing when it is stopped/killed then 
> resources are left in the DOWNLOADING state.  If no other container comes 
> along and requests these resources they linger around with no reference 
> counts but aren't cleaned up during normal cache cleanup scans since it will 
> never delete resources in the DOWNLOADING state even if their reference count 
> is zero.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3069) Document missing properties in yarn-default.xml

2015-06-23 Thread Ray Chiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Chiang updated YARN-3069:
-
Attachment: YARN-3069.013.patch

- Fix whitespace
- Update against trunk

> Document missing properties in yarn-default.xml
> ---
>
> Key: YARN-3069
> URL: https://issues.apache.org/jira/browse/YARN-3069
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Reporter: Ray Chiang
>Assignee: Ray Chiang
>  Labels: BB2015-05-TBR, supportability
> Attachments: YARN-3069.001.patch, YARN-3069.002.patch, 
> YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, 
> YARN-3069.006.patch, YARN-3069.007.patch, YARN-3069.008.patch, 
> YARN-3069.009.patch, YARN-3069.010.patch, YARN-3069.011.patch, 
> YARN-3069.012.patch, YARN-3069.013.patch
>
>
> The following properties are currently not defined in yarn-default.xml.  
> These properties should either be
>   A) documented in yarn-default.xml OR
>   B)  listed as an exception (with comments, e.g. for internal use) in the 
> TestYarnConfigurationFields unit test
> Any comments for any of the properties below are welcome.
>   org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker
>   org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore
>   security.applicationhistory.protocol.acl
>   yarn.app.container.log.backups
>   yarn.app.container.log.dir
>   yarn.app.container.log.filesize
>   yarn.client.app-submission.poll-interval
>   yarn.client.application-client-protocol.poll-timeout-ms
>   yarn.is.minicluster
>   yarn.log.server.url
>   yarn.minicluster.control-resource-monitoring
>   yarn.minicluster.fixed.ports
>   yarn.minicluster.use-rpc
>   yarn.node-labels.fs-store.retry-policy-spec
>   yarn.node-labels.fs-store.root-dir
>   yarn.node-labels.manager-class
>   yarn.nodemanager.container-executor.os.sched.priority.adjustment
>   yarn.nodemanager.container-monitor.process-tree.class
>   yarn.nodemanager.disk-health-checker.enable
>   yarn.nodemanager.docker-container-executor.image-name
>   yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms
>   yarn.nodemanager.linux-container-executor.group
>   yarn.nodemanager.log.deletion-threads-count
>   yarn.nodemanager.user-home-dir
>   yarn.nodemanager.webapp.https.address
>   yarn.nodemanager.webapp.spnego-keytab-file
>   yarn.nodemanager.webapp.spnego-principal
>   yarn.nodemanager.windows-secure-container-executor.group
>   yarn.resourcemanager.configuration.file-system-based-store
>   yarn.resourcemanager.delegation-token-renewer.thread-count
>   yarn.resourcemanager.delegation.key.update-interval
>   yarn.resourcemanager.delegation.token.max-lifetime
>   yarn.resourcemanager.delegation.token.renew-interval
>   yarn.resourcemanager.history-writer.multi-threaded-dispatcher.pool-size
>   yarn.resourcemanager.metrics.runtime.buckets
>   yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs
>   yarn.resourcemanager.reservation-system.class
>   yarn.resourcemanager.reservation-system.enable
>   yarn.resourcemanager.reservation-system.plan.follower
>   yarn.resourcemanager.reservation-system.planfollower.time-step
>   yarn.resourcemanager.rm.container-allocation.expiry-interval-ms
>   yarn.resourcemanager.webapp.spnego-keytab-file
>   yarn.resourcemanager.webapp.spnego-principal
>   yarn.scheduler.include-port-in-node-name
>   yarn.timeline-service.delegation.key.update-interval
>   yarn.timeline-service.delegation.token.max-lifetime
>   yarn.timeline-service.delegation.token.renew-interval
>   yarn.timeline-service.generic-application-history.enabled
>   
> yarn.timeline-service.generic-application-history.fs-history-store.compression-type
>   yarn.timeline-service.generic-application-history.fs-history-store.uri
>   yarn.timeline-service.generic-application-history.store-class
>   yarn.timeline-service.http-cross-origin.enabled
>   yarn.tracking.url.generator



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3793) Several NPEs when deleting local files on NM recovery

2015-06-23 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-3793:
---
Assignee: Varun Saxena  (was: Karthik Kambatla)

> Several NPEs when deleting local files on NM recovery
> -
>
> Key: YARN-3793
> URL: https://issues.apache.org/jira/browse/YARN-3793
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: Karthik Kambatla
>Assignee: Varun Saxena
>
> When NM work-preserving restart is enabled, we see several NPEs on recovery. 
> These seem to correspond to sub-directories that need to be deleted. I wonder 
> if null pointers here mean incorrect tracking of these resources and a 
> potential leak. This JIRA is to investigate and fix anything required.
> Logs show:
> {noformat}
> 2015-05-18 07:06:10,225 INFO 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting 
> absolute path : null
> 2015-05-18 07:06:10,224 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during 
> execution of task in DeletionService
> java.lang.NullPointerException
> at 
> org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274)
> at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755)
> at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458)
> at 
> org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3793) Several NPEs when deleting local files on NM recovery

2015-06-23 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598326#comment-14598326
 ] 

Karthik Kambatla commented on YARN-3793:


[~varun_saxena] - all yours. 

> Several NPEs when deleting local files on NM recovery
> -
>
> Key: YARN-3793
> URL: https://issues.apache.org/jira/browse/YARN-3793
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: Karthik Kambatla
>Assignee: Varun Saxena
>
> When NM work-preserving restart is enabled, we see several NPEs on recovery. 
> These seem to correspond to sub-directories that need to be deleted. I wonder 
> if null pointers here mean incorrect tracking of these resources and a 
> potential leak. This JIRA is to investigate and fix anything required.
> Logs show:
> {noformat}
> 2015-05-18 07:06:10,225 INFO 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting 
> absolute path : null
> 2015-05-18 07:06:10,224 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during 
> execution of task in DeletionService
> java.lang.NullPointerException
> at 
> org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274)
> at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755)
> at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458)
> at 
> org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3045) [Event producers] Implement NM writing container lifecycle events to ATS

2015-06-23 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598313#comment-14598313
 ] 

Sangjin Lee commented on YARN-3045:
---

I took a quick pass at the latest patch. First, could you look at the 
checkstyle issue and the unit test failure?

I think the unit test failure is an "existing" issue, but since you looked at 
it for YARN-3792, it'd be great if you could take another look. It looks like 
even the APPLICATION_CREATED_EVENT might be seeing the race condition?

(NMTimelinePublisher.java)
- I'm not 100% clear on the naming convention, but I was under the 
impression that we're sticking with "timelineservice" as the package name? Is 
that not the case?
- l.223: minor nit, but let's make inner classes static unless they need to be 
non-static
- l.252: I'm a bit puzzled by the hashCode override; is it necessary? If so, 
then we should also override equals (see the sketch below). Also, why does it 
key only on the app id?
- l.296: the same question here
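
For reference, the equals/hashCode contract being asked about would look roughly 
like this (class and field names are illustrative, not from the patch):

{code}
// If hashCode() keys only on the app id, equals() must compare the same
// field, otherwise the equals/hashCode contract is broken.
// ContainerEventKey and appId are illustrative names.
@Override
public int hashCode() {
  return appId.hashCode();
}

@Override
public boolean equals(Object obj) {
  if (this == obj) {
    return true;
  }
  if (!(obj instanceof ContainerEventKey)) {
    return false;
  }
  return appId.equals(((ContainerEventKey) obj).appId);
}
{code}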


> [Event producers] Implement NM writing container lifecycle events to ATS
> 
>
> Key: YARN-3045
> URL: https://issues.apache.org/jira/browse/YARN-3045
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Sangjin Lee
>Assignee: Naganarasimha G R
> Attachments: YARN-3045-YARN-2928.002.patch, 
> YARN-3045-YARN-2928.003.patch, YARN-3045-YARN-2928.004.patch, 
> YARN-3045.20150420-1.patch
>
>
> Per design in YARN-2928, implement NM writing container lifecycle events and 
> container system metrics to ATS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3793) Several NPEs when deleting local files on NM recovery

2015-06-23 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598301#comment-14598301
 ] 

Varun Saxena commented on YARN-3793:


Thanks for pointing this out. I looked for scenarios where a disk becomes bad 
and found one issue.

> Several NPEs when deleting local files on NM recovery
> -
>
> Key: YARN-3793
> URL: https://issues.apache.org/jira/browse/YARN-3793
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>
> When NM work-preserving restart is enabled, we see several NPEs on recovery. 
> These seem to correspond to sub-directories that need to be deleted. I wonder 
> if null pointers here mean incorrect tracking of these resources and a 
> potential leak. This JIRA is to investigate and fix anything required.
> Logs show:
> {noformat}
> 2015-05-18 07:06:10,225 INFO 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting 
> absolute path : null
> 2015-05-18 07:06:10,224 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during 
> execution of task in DeletionService
> java.lang.NullPointerException
> at 
> org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274)
> at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755)
> at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458)
> at 
> org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3793) Several NPEs when deleting local files on NM recovery

2015-06-23 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598300#comment-14598300
 ] 

Varun Saxena commented on YARN-3793:


[~kasha], can I work on this JIRA?

> Several NPEs when deleting local files on NM recovery
> -
>
> Key: YARN-3793
> URL: https://issues.apache.org/jira/browse/YARN-3793
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>
> When NM work-preserving restart is enabled, we see several NPEs on recovery. 
> These seem to correspond to sub-directories that need to be deleted. I wonder 
> if null pointers here mean incorrect tracking of these resources and a 
> potential leak. This JIRA is to investigate and fix anything required.
> Logs show:
> {noformat}
> 2015-05-18 07:06:10,225 INFO 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting 
> absolute path : null
> 2015-05-18 07:06:10,224 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during 
> execution of task in DeletionService
> java.lang.NullPointerException
> at 
> org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274)
> at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755)
> at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458)
> at 
> org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3793) Several NPEs when deleting local files on NM recovery

2015-06-23 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598299#comment-14598299
 ] 

Varun Saxena commented on YARN-3793:


[~kasha], I think I know what's happening.
When disks become bad (say, due to a full disk), there is a problem when 
uploading container logs.

In {{AppLogAggregatorImpl#doContainerLogAggregation}} only good log directories 
are considered for log aggregation. This leads to 
{{AggregatedLogFormat#getPendingLogFilesToUploadForThisContainer}} returning no 
log files to be uploaded.

The caller of {{doContainerLogAggregation}} is 
{{AppLogAggregatorImpl#uploadLogsForContainers}}, which, as can be seen below, 
will call {{DeletionService#delete}}. If {{uploadedFilePathsInThisCycle}} is 
empty *(which it will be if the disks are full)*, both the sub directory and 
the base directories end up null. This explains the NPEs being thrown.
When these deletion tasks are stored in the state store, they are stored with 
the nulls as well, which explains why it also happens on recovery.
{code}
  boolean uploadedLogsInThisCycle = false;
  for (ContainerId container : pendingContainerInThisCycle) {
    ContainerLogAggregator aggregator = null;
    if (containerLogAggregators.containsKey(container)) {
      aggregator = containerLogAggregators.get(container);
    } else {
      aggregator = new ContainerLogAggregator(container);
      containerLogAggregators.put(container, aggregator);
    }
    Set<Path> uploadedFilePathsInThisCycle =
        aggregator.doContainerLogAggregation(writer, appFinished);
    if (uploadedFilePathsInThisCycle.size() > 0) {
      uploadedLogsInThisCycle = true;
    }
    this.delService.delete(this.userUgi.getShortUserName(), null,
        uploadedFilePathsInThisCycle
            .toArray(new Path[uploadedFilePathsInThisCycle.size()]));
    ..
  }
{code}

Log aggregation should consider full disks as well; otherwise there is nothing 
to aggregate when the disks are full. In any case, log aggregation would then 
lead to deletion of the local logs.

I verified the occurrence of this issue via 
TestLogAggregationService#testLocalFileDeletionAfterUpload by making the good 
log directories return nothing.
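
For illustration only (this is not taken from any attached patch), one way to 
avoid handing {{DeletionService}} a null base dir together with an empty path 
list would be a guard such as:

{code}
// Illustrative guard only: skip scheduling the deletion task for a container
// when nothing was uploaded in this cycle, so the task never carries a null
// base dir with no file paths.
if (!uploadedFilePathsInThisCycle.isEmpty()) {
  this.delService.delete(this.userUgi.getShortUserName(), null,
      uploadedFilePathsInThisCycle
          .toArray(new Path[uploadedFilePathsInThisCycle.size()]));
}
{code}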


> Several NPEs when deleting local files on NM recovery
> -
>
> Key: YARN-3793
> URL: https://issues.apache.org/jira/browse/YARN-3793
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>
> When NM work-preserving restart is enabled, we see several NPEs on recovery. 
> These seem to correspond to sub-directories that need to be deleted. I wonder 
> if null pointers here mean incorrect tracking of these resources and a 
> potential leak. This JIRA is to investigate and fix anything required.
> Logs show:
> {noformat}
> 2015-05-18 07:06:10,225 INFO 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting 
> absolute path : null
> 2015-05-18 07:06:10,224 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during 
> execution of task in DeletionService
> java.lang.NullPointerException
> at 
> org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274)
> at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755)
> at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458)
> at 
> org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3806) Proposal of Generic Scheduling Framework for YARN

2015-06-23 Thread Chris Douglas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598163#comment-14598163
 ] 

Chris Douglas commented on YARN-3806:
-

[~wshao] Please don't delete obsoleted versions of the design doc, as it 
orphans discussion about them. Also, as you're making updates, please note the 
changes so people don't have to diff the docs.

> Proposal of Generic Scheduling Framework for YARN
> -
>
> Key: YARN-3806
> URL: https://issues.apache.org/jira/browse/YARN-3806
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Reporter: Wei Shao
> Attachments: ProposalOfGenericSchedulingFrameworkForYARN-V1.05.pdf, 
> ProposalOfGenericSchedulingFrameworkForYARN-V1.06.pdf
>
>
> Currently, a typical YARN cluster runs many different kinds of applications: 
> production applications, ad hoc user applications, long running services and 
> so on. Different YARN scheduling policies may be suitable for different 
> applications. For example, capacity scheduling can manage production 
> applications well since application can get guaranteed resource share, fair 
> scheduling can manage ad hoc user applications well since it can enforce 
> fairness among users. However, current YARN scheduling framework doesn’t have 
> a mechanism for multiple scheduling policies work hierarchically in one 
> cluster.
> YARN-3306 talked about many issues of today’s YARN scheduling framework, and 
> proposed a per-queue policy driven framework. In detail, it supported 
> different scheduling policies for leaf queues. However, support of different 
> scheduling policies for upper level queues is not seriously considered yet. 
> A generic scheduling framework is proposed here to address these limitations. 
> It supports different policies (fair, capacity, fifo and so on) for any queue 
> consistently. The proposal tries to solve many other issues in current YARN 
> scheduling framework as well.
> Two new proposed scheduling policies YARN-3807 & YARN-3808 are based on 
> generic scheduling framework brought up in this proposal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3045) [Event producers] Implement NM writing container lifecycle events to ATS

2015-06-23 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598054#comment-14598054
 ] 

Sangjin Lee commented on YARN-3045:
---

{quote}
The lifecycle management of the app collector is a little tricky here: it gets 
registered when the first container (the AM) is launched, but it should not be 
unregistered immediately when the AM container stops. Waiting for the 
application-finish event to arrive at the NM should work for most cases. For 
the corner case where the NM publisher takes too long (the queue is busy) to 
publish the event, it still has a chance to fail (a very low chance should be 
acceptable here). Later, we will run into a similar issue again when we do 
app-level aggregation in the app collector, since the aggregation process 
could still be running. In any case, we should pay special attention to 
lifecycle management for the collector - we have a separate JIRA to move it 
out of the auxiliary service. I think we can discuss this more together 
with/in that JIRA.
{quote}

It's a good point. I think some amount of "linger" after the AM container 
completes should be a fine solution. Note that not only does the collector need 
to stay up, but the mapping must also not be removed from the RM for this to 
work.

As [~djp] pointed out, having multiple app attempts (AMs) is another case. 
Perhaps the same linger can apply there, so that the collector can stick 
around to handle some writes until the next collector, belonging to the next 
AM, comes online and registers itself. We need to hash out the details of the 
multiple-AM scenario, preferably in a different JIRA.
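
As a rough illustration of the linger idea (collectorManager, LINGER_MS and 
onAmContainerFinished are made-up names for this sketch, not from any patch):

{code}
// Illustrative sketch only: delay removal of the per-app collector for a
// linger period after the AM container finishes, instead of removing it
// immediately.
private static final long LINGER_MS = 60 * 1000;
private final ScheduledExecutorService lingerScheduler =
    Executors.newSingleThreadScheduledExecutor();

void onAmContainerFinished(final ApplicationId appId) {
  lingerScheduler.schedule(new Runnable() {
    @Override
    public void run() {
      collectorManager.remove(appId);   // unregister the app collector
    }
  }, LINGER_MS, TimeUnit.MILLISECONDS);
}
{code}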

> [Event producers] Implement NM writing container lifecycle events to ATS
> 
>
> Key: YARN-3045
> URL: https://issues.apache.org/jira/browse/YARN-3045
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Sangjin Lee
>Assignee: Naganarasimha G R
> Attachments: YARN-3045-YARN-2928.002.patch, 
> YARN-3045-YARN-2928.003.patch, YARN-3045-YARN-2928.004.patch, 
> YARN-3045.20150420-1.patch
>
>
> Per design in YARN-2928, implement NM writing container lifecycle events and 
> container system metrics to ATS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3831) Localization failed when a local disk turns from bad to good without NM initializes it

2015-06-23 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598050#comment-14598050
 ] 

zhihai xu commented on YARN-3831:
-

[~hex108], thanks for the confirmation!

> Localization failed when a local disk turns from bad to good without NM 
> initializes it
> --
>
> Key: YARN-3831
> URL: https://issues.apache.org/jira/browse/YARN-3831
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Jun Gong
>Assignee: Jun Gong
>
> A local disk turns from bad to good without NM initializes it(create 
> /path-to-local-dir/usercache and /path-to-local-dir/filecache). When 
> localizing a container, container-executor will try to create directories 
> under /path-to-local-dir/usercache, and it will fail. Then container's 
> localization will fail. 
> Related log is as following:
> {noformat}
> 2015-06-19 18:00:01,205 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Created localizer for container_1431957472783_38706012_01_000465
> 2015-06-19 18:00:01,212 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Writing credentials to the nmPrivate file 
> /data8/yarnenv/local/nmPrivate/container_1431957472783_38706012_01_000465.tokens.
>  Credentials list: 
> 2015-06-19 18:00:01,216 WARN 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code 
> from container container_1431957472783_38706012_01_000465 startLocalizer is : 
> 20
> org.apache.hadoop.util.Shell$ExitCodeException: 
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
> at org.apache.hadoop.util.Shell.run(Shell.java:379)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:205)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:981)
> 2015-06-19 18:00:01,216 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : command 
> provided 0
> 2015-06-19 18:00:01,216 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : user is 
> tdwadmin
> 2015-06-19 18:00:01,216 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Failed to create 
> directory /data2/yarnenv/local/usercache/tdwadmin - No such file or directory
> 2015-06-19 18:00:01,216 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Localizer failed
> java.io.IOException: Application application_1431957472783_38706012 
> initialization failed (exitCode=20) with output: main : command provided 0
> main : user is tdwadmin
> Failed to create directory /data2/yarnenv/local/usercache/tdwadmin - No such 
> file or directory
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:214)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:981)
> Caused by: org.apache.hadoop.util.Shell$ExitCodeException: 
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
> at org.apache.hadoop.util.Shell.run(Shell.java:379)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:205)
> ... 1 more
> 2015-06-19 18:00:01,216 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1431957472783_38706012_01_000465 transitioned from 
> LOCALIZING to LOCALIZATION_FAILED
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler

2015-06-23 Thread Masatake Iwasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598044#comment-14598044
 ] 

Masatake Iwasaki commented on YARN-3790:


I'm +1 (non-binding) too. Thanks for working on this. I saw this test failure 
twice on YARN-3705 and would like this to get in.

> TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in 
> trunk for FS scheduler
> 
>
> Key: YARN-3790
> URL: https://issues.apache.org/jira/browse/YARN-3790
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, test
>Reporter: Rohith Sharma K S
>Assignee: zhihai xu
> Attachments: YARN-3790.000.patch
>
>
> Failure trace is as follows
> {noformat}
> Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
> testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
>   Time elapsed: 6.502 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<6144> but was:<8192>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2801) Documentation development for Node labels requirment

2015-06-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598036#comment-14598036
 ] 

Hadoop QA commented on YARN-2801:
-

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |   2m 55s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | release audit |   0m 19s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | site |   3m  0s | Site still builds. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| | |   6m 17s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12741328/YARN-2801.3.patch |
| Optional Tests | site |
| git revision | trunk / 41ae776 |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8327/console |


This message was automatically generated.

> Documentation development for Node labels requirment
> 
>
> Key: YARN-2801
> URL: https://issues.apache.org/jira/browse/YARN-2801
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: documentation
>Reporter: Gururaj Shetty
>Assignee: Wangda Tan
> Attachments: YARN-2801.1.patch, YARN-2801.2.patch, YARN-2801.3.patch
>
>
> Documentation needs to be developed for the node label requirements.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state

2015-06-23 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598027#comment-14598027
 ] 

Sangjin Lee commented on YARN-2902:
---

I'm OK with this JIRA proceeding as is. We'll need to isolate the public 
resource case more, and it won't be too late to file a separate issue if we do 
that later.

> Killing a container that is localizing can orphan resources in the 
> DOWNLOADING state
> 
>
> Key: YARN-2902
> URL: https://issues.apache.org/jira/browse/YARN-2902
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: Jason Lowe
>Assignee: Varun Saxena
> Attachments: YARN-2902.002.patch, YARN-2902.03.patch, 
> YARN-2902.04.patch, YARN-2902.patch
>
>
> If a container is in the process of localizing when it is stopped/killed then 
> resources are left in the DOWNLOADING state.  If no other container comes 
> along and requests these resources they linger around with no reference 
> counts but aren't cleaned up during normal cache cleanup scans since it will 
> never delete resources in the DOWNLOADING state even if their reference count 
> is zero.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2801) Documentation development for Node labels requirment

2015-06-23 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-2801:
-
Attachment: YARN-2801.3.patch

Thanks [~Naganarasimha] for additional review, attached ver.3 patch.

> Documentation development for Node labels requirment
> 
>
> Key: YARN-2801
> URL: https://issues.apache.org/jira/browse/YARN-2801
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: documentation
>Reporter: Gururaj Shetty
>Assignee: Wangda Tan
> Attachments: YARN-2801.1.patch, YARN-2801.2.patch, YARN-2801.3.patch
>
>
> Documentation needs to be developed for the node label requirements.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3045) [Event producers] Implement NM writing container lifecycle events to ATS

2015-06-23 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598012#comment-14598012
 ] 

Junping Du commented on YARN-3045:
--

Thanks [~Naganarasimha] for updating the patch! Looking into it now; some 
comments will follow.
Some quick thoughts on your questions above.
bq. I prefer to have all the container related events and entities to be 
published by NMTimelinePublisher, so wanted push container usage metrics also 
to NMTimelinePublisher. This will ensure all NM timeline stuff are put in one 
place and remove thread pool handling in ContainerMonitorImpl.
I am generally fine with consolidating the publishing of events and metrics in 
NMTimelinePublisher. However, we may need to check later whether a separate 
event queue is required, to make sure a burst of container metrics won't 
affect events getting published.

bq. When the AM container finishes and removes the collector for the app, still 
there is possibility that all the events published for the app by the current 
NM and other NM are still in pipeline, so was wondering whether we can have 
timer task which periodically cleans up collector after some period and not imm 
remove it when AM container is finished.
The lifecycle management of the app collector is a little tricky here: it gets 
registered when the first container (the AM) is launched, but it should not be 
unregistered immediately when the AM container stops. Waiting for the 
application-finish event to arrive at the NM should work for most cases. For 
the corner case where the NM publisher takes too long (the queue is busy) to 
publish the event, it still has a chance to fail (a very low chance should be 
acceptable here). Later, we will run into a similar issue again when we do 
app-level aggregation in the app collector, since the aggregation process 
could still be running. In any case, we should pay special attention to 
lifecycle management for the collector - we have a separate JIRA to move it 
out of the auxiliary service. I think we can discuss this more together 
with/in that JIRA.

> [Event producers] Implement NM writing container lifecycle events to ATS
> 
>
> Key: YARN-3045
> URL: https://issues.apache.org/jira/browse/YARN-3045
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Sangjin Lee
>Assignee: Naganarasimha G R
> Attachments: YARN-3045-YARN-2928.002.patch, 
> YARN-3045-YARN-2928.003.patch, YARN-3045-YARN-2928.004.patch, 
> YARN-3045.20150420-1.patch
>
>
> Per design in YARN-2928, implement NM writing container lifecycle events and 
> container system metrics to ATS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk

2015-06-23 Thread Masatake Iwasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597971#comment-14597971
 ] 

Masatake Iwasaki commented on YARN-2871:


Thanks for working on this, [~zxu]! These intermittent test failures have been 
annoying me lately.

{code}
980 Thread.sleep(1000);
{code}

Is it possible to use {{MockRM#waitForState}} to wait until the application 
state is recovered? Sleeping for a fixed time is not deterministic and makes 
the test unnecessarily long, though there are many lines calling Thread#sleep 
in this test...
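
For example, something along these lines might work here (a sketch only; the 
variable names and the exact {{MockRM#waitForState}} overload are assumptions):

{code}
// Instead of a fixed Thread.sleep(1000), wait until the recovered application
// reaches a known state. rm2 and app1 are illustrative names for the
// restarted MockRM and the recovered app.
rm2.waitForState(app1.getApplicationId(), RMAppState.ACCEPTED);
{code}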

> TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
> -
>
> Key: YARN-2871
> URL: https://issues.apache.org/jira/browse/YARN-2871
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Ted Yu
>Assignee: zhihai xu
>Priority: Minor
> Attachments: YARN-2871.000.patch
>
>
> From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746):
> {code}
> Failed tests:
>   TestRMRestart.testRMRestartGetApplicationList:957
> rMAppManager.logApplicationSummary(
> isA(org.apache.hadoop.yarn.api.records.ApplicationId)
> );
> Wanted 3 times:
> -> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957)
> But was 2 times:
> -> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state

2015-06-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597938#comment-14597938
 ] 

Hadoop QA commented on YARN-2902:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  15m 47s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 37s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 36s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   0m 37s | The applied patch generated  9 
new checkstyle issues (total was 168, now 138). |
| {color:green}+1{color} | whitespace |   0m  3s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 33s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 34s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 13s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   6m  6s | Tests passed in 
hadoop-yarn-server-nodemanager. |
| | |  43m 32s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12741309/YARN-2902.04.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 41ae776 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8326/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt
 |
| hadoop-yarn-server-nodemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8326/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8326/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8326/console |


This message was automatically generated.

> Killing a container that is localizing can orphan resources in the 
> DOWNLOADING state
> 
>
> Key: YARN-2902
> URL: https://issues.apache.org/jira/browse/YARN-2902
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: Jason Lowe
>Assignee: Varun Saxena
> Attachments: YARN-2902.002.patch, YARN-2902.03.patch, 
> YARN-2902.04.patch, YARN-2902.patch
>
>
> If a container is in the process of localizing when it is stopped/killed then 
> resources are left in the DOWNLOADING state.  If no other container comes 
> along and requests these resources they linger around with no reference 
> counts but aren't cleaned up during normal cache cleanup scans since it will 
> never delete resources in the DOWNLOADING state even if their reference count 
> is zero.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1488) Allow containers to delegate resources to another container

2015-06-23 Thread Lei Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597920#comment-14597920
 ] 

Lei Guo commented on YARN-1488:
---

Based on the information from the [Stinger.next track | 
http://hortonworks.com/blog/evolving-apache-hadoop-yarn-provide-resource-workload-management-services/],
 this JIRA should be the foundation of the YARN/LLAP integration. Is there any 
plan/design for this JIRA?

> Allow containers to delegate resources to another container
> ---
>
> Key: YARN-1488
> URL: https://issues.apache.org/jira/browse/YARN-1488
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Arun C Murthy
>Assignee: Arun C Murthy
>
> We should allow containers to delegate resources to another container. This 
> would allow external frameworks to share not just YARN's resource-management 
> capabilities but also it's workload-management capabilities.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3832) Resource Localization fails on a cluster due to existing cache directories

2015-06-23 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597873#comment-14597873
 ] 

Jason Lowe commented on YARN-3832:
--

Ah, I think that might be the clue as to what went wrong.  If the NM recreated 
the state store on startup then ResourceLocalizationService will try to cleanup 
the localized resources to prevent them from getting out of sync with the state 
store.  Unfortunately the code does this:
{code}
  private void cleanUpLocalDirs(FileContext lfs, DeletionService del) {
for (String localDir : dirsHandler.getLocalDirs()) {
  cleanUpLocalDir(lfs, del, localDir);
    }
  }
{code}

It should be calling dirsHandler.getLocalDirsForCleanup, since getLocalDirs 
will not include any disks that are full.  Since the disk was too full, it 
probably wasn't in the list of local dirs and therefore we avoided cleaning up 
the localized resources on the disk.  Later when the disk became good it tried 
to use it, but at that point the state store and localized resources on that 
disk are out of sync and new localizations can collide with old ones.
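
For illustration, the suggested change would look roughly like this (a sketch 
only, not an actual patch):

{code}
  // Iterate over all local dirs eligible for cleanup, including ones currently
  // marked full, so stale localized resources on a full disk are also removed
  // when the state store had to be recreated.
  private void cleanUpLocalDirs(FileContext lfs, DeletionService del) {
    for (String localDir : dirsHandler.getLocalDirsForCleanup()) {
      cleanUpLocalDir(lfs, del, localDir);
    }
  }
{code}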

> Resource Localization fails on a cluster due to existing cache directories
> --
>
> Key: YARN-3832
> URL: https://issues.apache.org/jira/browse/YARN-3832
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.0
>Reporter: Ranga Swamy
>Assignee: Brahma Reddy Battula
>
>  *We have found resource localization fails on a cluster with following 
> error.* 
>  
> Got this error in hadoop-2.7.0 release which was fixed in 2.6.0 (YARN-2624)
> {noformat}
> Application application_1434703279149_0057 failed 2 times due to AM Container 
> for appattempt_1434703279149_0057_02 exited with exitCode: -1000
> For more detailed output, check application tracking 
> page:http://S0559LDPag68:45020/cluster/app/application_1434703279149_0057Then,
>  click on links to logs of each attempt.
> Diagnostics: Rename cannot overwrite non empty destination directory 
> /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39
> java.io.IOException: Rename cannot overwrite non empty destination directory 
> /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39
> at 
> org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:735)
> at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:244)
> at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:678)
> at org.apache.hadoop.fs.FileContext.rename(FileContext.java:958)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Failing this attempt. Failing the application.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state

2015-06-23 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597863#comment-14597863
 ] 

Varun Saxena commented on YARN-2902:


Fixed checkstyle issues. A lot of the changes in 
{{ResourceLocalizationService#findNextResource}} are due to indentation issues 
reported by checkstyle. Hence, I had to re-indent code that I had not written.

> Killing a container that is localizing can orphan resources in the 
> DOWNLOADING state
> 
>
> Key: YARN-2902
> URL: https://issues.apache.org/jira/browse/YARN-2902
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: Jason Lowe
>Assignee: Varun Saxena
> Attachments: YARN-2902.002.patch, YARN-2902.03.patch, 
> YARN-2902.04.patch, YARN-2902.patch
>
>
> If a container is in the process of localizing when it is stopped/killed then 
> resources are left in the DOWNLOADING state.  If no other container comes 
> along and requests these resources they linger around with no reference 
> counts but aren't cleaned up during normal cache cleanup scans since it will 
> never delete resources in the DOWNLOADING state even if their reference count 
> is zero.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state

2015-06-23 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-2902:
---
Attachment: YARN-2902.04.patch

> Killing a container that is localizing can orphan resources in the 
> DOWNLOADING state
> 
>
> Key: YARN-2902
> URL: https://issues.apache.org/jira/browse/YARN-2902
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: Jason Lowe
>Assignee: Varun Saxena
> Attachments: YARN-2902.002.patch, YARN-2902.03.patch, 
> YARN-2902.04.patch, YARN-2902.patch
>
>
> If a container is in the process of localizing when it is stopped/killed then 
> resources are left in the DOWNLOADING state.  If no other container comes 
> along and requests these resources they linger around with no reference 
> counts but aren't cleaned up during normal cache cleanup scans since it will 
> never delete resources in the DOWNLOADING state even if their reference count 
> is zero.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3842) NMProxy should retry on NMNotYetReadyException

2015-06-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597825#comment-14597825
 ] 

Hudson commented on YARN-3842:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2183 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2183/])
YARN-3842. NMProxy should retry on NMNotYetReadyException. (Robert Kanter via 
kasha) (kasha: rev 5ebf2817e58e1be8214dc1916a694a912075aa0a)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestNMProxy.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java


> NMProxy should retry on NMNotYetReadyException
> --
>
> Key: YARN-3842
> URL: https://issues.apache.org/jira/browse/YARN-3842
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Karthik Kambatla
>Assignee: Robert Kanter
>Priority: Critical
> Fix For: 2.7.1
>
> Attachments: MAPREDUCE-6409.001.patch, MAPREDUCE-6409.002.patch, 
> YARN-3842.001.patch, YARN-3842.002.patch
>
>
> Consider the following scenario:
> 1. RM assigns a container on node N to an app A.
> 2. Node N is restarted
> 3. A tries to launch container on node N.
> 3 could lead to an NMNotYetReadyException depending on whether NM N has 
> registered with the RM. In MR, this is considered a task attempt failure. A 
> few of these could lead to a task/job failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3835) hadoop-yarn-server-resourcemanager test package bundles core-site.xml, yarn-site.xml

2015-06-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597823#comment-14597823
 ] 

Hudson commented on YARN-3835:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2183 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2183/])
YARN-3835. hadoop-yarn-server-resourcemanager test package bundles 
core-site.xml, yarn-site.xml (vamsee via rkanter) (rkanter: rev 
99271b762129d78c86f3c9733a24c77962b0b3f7)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml
* hadoop-yarn-project/CHANGES.txt


> hadoop-yarn-server-resourcemanager test package bundles core-site.xml, 
> yarn-site.xml
> 
>
> Key: YARN-3835
> URL: https://issues.apache.org/jira/browse/YARN-3835
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Vamsee Yarlagadda
>Assignee: Vamsee Yarlagadda
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: YARN-3835.patch
>
>
> It looks like by default yarn is bundling core-site.xml, yarn-site.xml in 
> test artifact of hadoop-yarn-server-resourcemanager which means that any 
> downstream project which uses this a dependency can have a problem in picking 
> up the user supplied/environment supplied core-site.xml, yarn-site.xml
> So we should ideally exclude these .xml files from being bundled into the 
> test-jar. (Similar to YARN-1748)
> I also proactively looked at other YARN modules where this might be 
> happening. 
> {code}
> vamsee-MBP:hadoop-yarn-project vamsee$ find . -name "*-site.xml"
> ./hadoop-yarn/conf/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/resources/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher/src/test/resources/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-client/src/test/resources/core-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/resources/core-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/resources/core-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/resources/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-classes/core-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-classes/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/resources/core-site.xml
> {code}
> And out of these only two modules (hadoop-yarn-server-resourcemanager, 
> hadoop-yarn-server-tests) are building test-jars. In future, if we start 
> building test-jar of other modules, we should exclude these xml files from 
> being bundled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3793) Several NPEs when deleting local files on NM recovery

2015-06-23 Thread Brahma Reddy Battula (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597817#comment-14597817
 ] 

Brahma Reddy Battula commented on YARN-3793:


[~kasha], one possible scenario is: when a disk became bad and the NM stopped, 
I have seen this NPE (where the good dirs will be null):

{noformat}
2015-06-19 03:09:10,528 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
 Uploading logs for container container_1434452428753_0522_01_000162. Current 
good log dirs are 
2015-06-19 03:09:10,528 ERROR 
org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during 
execution of task in DeletionService
java.lang.NullPointerException
at 
org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274)
at org.apache.hadoop.fs.FileContext.delete(FileContext.java:761)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458)
at 
org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}

> Several NPEs when deleting local files on NM recovery
> -
>
> Key: YARN-3793
> URL: https://issues.apache.org/jira/browse/YARN-3793
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>
> When NM work-preserving restart is enabled, we see several NPEs on recovery. 
> These seem to correspond to sub-directories that need to be deleted. I wonder 
> if null pointers here mean incorrect tracking of these resources and a 
> potential leak. This JIRA is to investigate and fix anything required.
> Logs show:
> {noformat}
> 2015-05-18 07:06:10,225 INFO 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting 
> absolute path : null
> 2015-05-18 07:06:10,224 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during 
> execution of task in DeletionService
> java.lang.NullPointerException
> at 
> org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274)
> at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755)
> at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458)
> at 
> org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang

2015-06-23 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597810#comment-14597810
 ] 

Jason Lowe commented on YARN-3809:
--

+1 latest patch lgtm.  Will commit this later today if there are no objections.

> Failed to launch new attempts because ApplicationMasterLauncher's threads all 
> hang
> --
>
> Key: YARN-3809
> URL: https://issues.apache.org/jira/browse/YARN-3809
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3809.01.patch, YARN-3809.02.patch, 
> YARN-3809.03.patch
>
>
> ApplicationMasterLauncher create a thread pool whose size is 10 to deal with 
> AMLauncherEventType(LAUNCH and CLEANUP).
> In our cluster, there was many NM with 10+ AM running on it, and one shut 
> down for some reason. After RM found the NM LOST, it cleaned up AMs running 
> on it. Then ApplicationMasterLauncher need handle these 10+ CLEANUP event. 
> ApplicationMasterLauncher's thread pool would be filled up, and they all hang 
> in the code containerMgrProxy.stopContainers(stopRequest) because NM was 
> down, the default RPC time out is 15 mins. It means that in 15 mins 
> ApplicationMasterLauncher could not handle new event such as LAUNCH, then new 
> attempts will fails to launch because of time out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3832) Resource Localization fails on a cluster due to existing cache directories

2015-06-23 Thread Brahma Reddy Battula (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597805#comment-14597805
 ] 

Brahma Reddy Battula commented on YARN-3832:


[~jlowe] Sorry for the late reply. After looking into the logs, I found that 
the *disk was declared bad (since it reached the 90% threshold) and the node 
became unhealthy*:

{noformat}
2015-06-19 04:39:18,498 WARN 
org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection: Directory 
/opt/hdfsdata/HA/nmlocal error, used space above threshold of 90.0%, removing 
from list of valid directories
2015-06-19 04:39:18,498 WARN 
org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection: Directory 
/opt/hdfsdata/HA/nmlog error, used space above threshold of 90.0%, removing 
from list of valid directories
2015-06-19 04:39:18,498 INFO 
org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService: Disk(s) 
failed: 1/1 local-dirs are bad: /opt/hdfsdata/HA/nmlocal; 1/1 log-dirs are bad: 
/opt/hdfsdata/HA/nmlog
2015-06-19 04:39:18,499 ERROR 
org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService: Most of the 
disks failed. 1/1 local-dirs are bad: /opt/hdfsdata/HA/nmlocal; 1/1 log-dirs 
are bad: /opt/hdfsdata/HA/nmlog
{noformat}

On restart of the NM, those disks turned good again:

{noformat}
2015-06-19 04:47:18,765 INFO 
org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService: Disk(s) 
turned good: 1/1 local-dirs are good: /opt/hdfsdata/HA/nmlocal; 1/1 log-dirs 
are good: /opt/hdfsdata/HA/nmlog
{noformat}



> Resource Localization fails on a cluster due to existing cache directories
> --
>
> Key: YARN-3832
> URL: https://issues.apache.org/jira/browse/YARN-3832
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.0
>Reporter: Ranga Swamy
>Assignee: Brahma Reddy Battula
>
>  *We have found resource localization fails on a cluster with following 
> error.* 
>  
> Got this error in hadoop-2.7.0 release which was fixed in 2.6.0 (YARN-2624)
> {noformat}
> Application application_1434703279149_0057 failed 2 times due to AM Container 
> for appattempt_1434703279149_0057_02 exited with exitCode: -1000
> For more detailed output, check application tracking 
> page:http://S0559LDPag68:45020/cluster/app/application_1434703279149_0057Then,
>  click on links to logs of each attempt.
> Diagnostics: Rename cannot overwrite non empty destination directory 
> /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39
> java.io.IOException: Rename cannot overwrite non empty destination directory 
> /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39
> at 
> org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:735)
> at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:244)
> at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:678)
> at org.apache.hadoop.fs.FileContext.rename(FileContext.java:958)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Failing this attempt. Failing the application.
> {noformat}
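As a side note, below is a hedged sketch of one way such a rename can be guarded against a stale, non-empty destination. It is only an illustration of the failing FileContext.rename call, not the actual FSDownload logic or the fix from YARN-2624; the paths and the helper name are made up.

{code}
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Options;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class SafeRename {
  /**
   * Moves a freshly downloaded resource into the local filecache, removing any
   * stale, partially populated destination left behind by an earlier run so
   * that rename() does not fail with "cannot overwrite non empty destination".
   */
  static void moveIntoCache(FileContext lfs, Path downloaded, Path cacheDir) throws Exception {
    if (lfs.util().exists(cacheDir)) {
      lfs.delete(cacheDir, true); // clear the stale directory first
    }
    lfs.rename(downloaded, cacheDir, Options.Rename.NONE);
  }

  public static void main(String[] args) throws Exception {
    FileContext lfs = FileContext.getLocalFSFileContext();
    Path src = new Path("/tmp/saferename-demo/download/39");
    Path dst = new Path("/tmp/saferename-demo/filecache/39");
    lfs.mkdir(src, FsPermission.getDirDefault(), true); // simulate a completed download
    lfs.mkdir(dst, FsPermission.getDirDefault(), true); // simulate a stale, leftover destination
    moveIntoCache(lfs, src, dst);
    System.out.println("moved " + src + " -> " + dst);
  }
}
{code}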



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3844) Make hadoop-yarn-project Native code -Wall-clean

2015-06-23 Thread Alan Burlison (JIRA)
Alan Burlison created YARN-3844:
---

 Summary: Make hadoop-yarn-project Native code -Wall-clean
 Key: YARN-3844
 URL: https://issues.apache.org/jira/browse/YARN-3844
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.7.0
 Environment: As we specify -Wall as a default compilation flag, it 
would be helpful if the Native code was -Wall-clean
Reporter: Alan Burlison
Assignee: Alan Burlison






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3844) Make hadoop-yarn-project Native code -Wall-clean

2015-06-23 Thread Alan Burlison (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Burlison updated YARN-3844:

Description: As we specify -Wall as a default compilation flag, it would be 
helpful if the Native code was -Wall-clean

> Make hadoop-yarn-project Native code -Wall-clean
> 
>
> Key: YARN-3844
> URL: https://issues.apache.org/jira/browse/YARN-3844
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: build
>Affects Versions: 2.7.0
> Environment: As we specify -Wall as a default compilation flag, it 
> would be helpful if the Native code was -Wall-clean
>Reporter: Alan Burlison
>Assignee: Alan Burlison
>
> As we specify -Wall as a default compilation flag, it would be helpful if the 
> Native code was -Wall-clean



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3842) NMProxy should retry on NMNotYetReadyException

2015-06-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597783#comment-14597783
 ] 

Hudson commented on YARN-3842:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #235 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/235/])
YARN-3842. NMProxy should retry on NMNotYetReadyException. (Robert Kanter via 
kasha) (kasha: rev 5ebf2817e58e1be8214dc1916a694a912075aa0a)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestNMProxy.java
* hadoop-yarn-project/CHANGES.txt


> NMProxy should retry on NMNotYetReadyException
> --
>
> Key: YARN-3842
> URL: https://issues.apache.org/jira/browse/YARN-3842
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Karthik Kambatla
>Assignee: Robert Kanter
>Priority: Critical
> Fix For: 2.7.1
>
> Attachments: MAPREDUCE-6409.001.patch, MAPREDUCE-6409.002.patch, 
> YARN-3842.001.patch, YARN-3842.002.patch
>
>
> Consider the following scenario:
> 1. RM assigns a container on node N to an app A.
> 2. Node N is restarted
> 3. A tries to launch container on node N.
> Step 3 could lead to an NMNotYetReadyException depending on whether NM N has 
> registered with the RM. In MR, this is considered a task attempt failure. A 
> few of these could lead to a task/job failure.
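For context, here is a minimal sketch of the kind of retry loop such a fix introduces around NM calls. It is not the actual ServerProxy/RetryPolicies code; the exception class, attempt count, and delay are assumptions chosen only to show the retry-on-not-yet-ready pattern.

{code}
import java.util.concurrent.TimeUnit;

public class RetryOnNotYetReady {
  /** Stand-in for NMNotYetReadyException. */
  static class NotYetReadyException extends RuntimeException {}

  interface Call<T> { T run() throws Exception; }

  /** Retries the call while the NM reports that it is not yet ready. */
  static <T> T callWithRetries(Call<T> call, int maxAttempts, long delayMs) throws Exception {
    for (int attempt = 1; ; attempt++) {
      try {
        return call.run();
      } catch (NotYetReadyException e) {
        if (attempt >= maxAttempts) {
          throw e; // give up after the configured number of attempts
        }
        TimeUnit.MILLISECONDS.sleep(delayMs); // wait for the NM to finish registering
      }
    }
  }

  public static void main(String[] args) throws Exception {
    final int[] calls = {0};
    String result = callWithRetries(() -> {
      if (++calls[0] < 3) {
        throw new NotYetReadyException(); // NM not registered with the RM yet
      }
      return "container started";
    }, 5, 100);
    System.out.println(result + " after " + calls[0] + " attempts");
  }
}
{code}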



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3835) hadoop-yarn-server-resourcemanager test package bundles core-site.xml, yarn-site.xml

2015-06-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597781#comment-14597781
 ] 

Hudson commented on YARN-3835:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #235 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/235/])
YARN-3835. hadoop-yarn-server-resourcemanager test package bundles 
core-site.xml, yarn-site.xml (vamsee via rkanter) (rkanter: rev 
99271b762129d78c86f3c9733a24c77962b0b3f7)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml
* hadoop-yarn-project/CHANGES.txt


> hadoop-yarn-server-resourcemanager test package bundles core-site.xml, 
> yarn-site.xml
> 
>
> Key: YARN-3835
> URL: https://issues.apache.org/jira/browse/YARN-3835
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Vamsee Yarlagadda
>Assignee: Vamsee Yarlagadda
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: YARN-3835.patch
>
>
> It looks like by default yarn is bundling core-site.xml, yarn-site.xml in 
> test artifact of hadoop-yarn-server-resourcemanager which means that any 
> downstream project which uses this as a dependency can have a problem in picking 
> up the user supplied/environment supplied core-site.xml, yarn-site.xml
> So we should ideally exclude these .xml files from being bundled into the 
> test-jar. (Similar to YARN-1748)
> I also proactively looked at other YARN modules where this might be 
> happening. 
> {code}
> vamsee-MBP:hadoop-yarn-project vamsee$ find . -name "*-site.xml"
> ./hadoop-yarn/conf/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/resources/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher/src/test/resources/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-client/src/test/resources/core-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/resources/core-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/resources/core-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/resources/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-classes/core-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-classes/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/resources/core-site.xml
> {code}
> And out of these only two modules (hadoop-yarn-server-resourcemanager, 
> hadoop-yarn-server-tests) are building test-jars. In future, if we start 
> building test-jar of other modules, we should exclude these xml files from 
> being bundled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3842) NMProxy should retry on NMNotYetReadyException

2015-06-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597722#comment-14597722
 ] 

Hudson commented on YARN-3842:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #226 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/226/])
YARN-3842. NMProxy should retry on NMNotYetReadyException. (Robert Kanter via 
kasha) (kasha: rev 5ebf2817e58e1be8214dc1916a694a912075aa0a)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestNMProxy.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
* hadoop-yarn-project/CHANGES.txt


> NMProxy should retry on NMNotYetReadyException
> --
>
> Key: YARN-3842
> URL: https://issues.apache.org/jira/browse/YARN-3842
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Karthik Kambatla
>Assignee: Robert Kanter
>Priority: Critical
> Fix For: 2.7.1
>
> Attachments: MAPREDUCE-6409.001.patch, MAPREDUCE-6409.002.patch, 
> YARN-3842.001.patch, YARN-3842.002.patch
>
>
> Consider the following scenario:
> 1. RM assigns a container on node N to an app A.
> 2. Node N is restarted
> 3. A tries to launch container on node N.
> Step 3 could lead to an NMNotYetReadyException depending on whether NM N has 
> registered with the RM. In MR, this is considered a task attempt failure. A 
> few of these could lead to a task/job failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3835) hadoop-yarn-server-resourcemanager test package bundles core-site.xml, yarn-site.xml

2015-06-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597720#comment-14597720
 ] 

Hudson commented on YARN-3835:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #226 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/226/])
YARN-3835. hadoop-yarn-server-resourcemanager test package bundles 
core-site.xml, yarn-site.xml (vamsee via rkanter) (rkanter: rev 
99271b762129d78c86f3c9733a24c77962b0b3f7)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml


> hadoop-yarn-server-resourcemanager test package bundles core-site.xml, 
> yarn-site.xml
> 
>
> Key: YARN-3835
> URL: https://issues.apache.org/jira/browse/YARN-3835
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Vamsee Yarlagadda
>Assignee: Vamsee Yarlagadda
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: YARN-3835.patch
>
>
> It looks like by default yarn is bundling core-site.xml, yarn-site.xml in 
> test artifact of hadoop-yarn-server-resourcemanager which means that any 
> downstream project which uses this as a dependency can have a problem in picking 
> up the user supplied/environment supplied core-site.xml, yarn-site.xml
> So we should ideally exclude these .xml files from being bundled into the 
> test-jar. (Similar to YARN-1748)
> I also proactively looked at other YARN modules where this might be 
> happening. 
> {code}
> vamsee-MBP:hadoop-yarn-project vamsee$ find . -name "*-site.xml"
> ./hadoop-yarn/conf/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/resources/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher/src/test/resources/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-client/src/test/resources/core-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/resources/core-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/resources/core-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/resources/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-classes/core-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-classes/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/resources/core-site.xml
> {code}
> And out of these only two modules (hadoop-yarn-server-resourcemanager, 
> hadoop-yarn-server-tests) are building test-jars. In future, if we start 
> building test-jar of other modules, we should exclude these xml files from 
> being bundled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3835) hadoop-yarn-server-resourcemanager test package bundles core-site.xml, yarn-site.xml

2015-06-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597706#comment-14597706
 ] 

Hudson commented on YARN-3835:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2165 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2165/])
YARN-3835. hadoop-yarn-server-resourcemanager test package bundles 
core-site.xml, yarn-site.xml (vamsee via rkanter) (rkanter: rev 
99271b762129d78c86f3c9733a24c77962b0b3f7)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml
* hadoop-yarn-project/CHANGES.txt


> hadoop-yarn-server-resourcemanager test package bundles core-site.xml, 
> yarn-site.xml
> 
>
> Key: YARN-3835
> URL: https://issues.apache.org/jira/browse/YARN-3835
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Vamsee Yarlagadda
>Assignee: Vamsee Yarlagadda
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: YARN-3835.patch
>
>
> It looks like by default yarn is bundling core-site.xml, yarn-site.xml in 
> test artifact of hadoop-yarn-server-resourcemanager which means that any 
> downstream project which uses this as a dependency can have a problem in picking 
> up the user supplied/environment supplied core-site.xml, yarn-site.xml
> So we should ideally exclude these .xml files from being bundled into the 
> test-jar. (Similar to YARN-1748)
> I also proactively looked at other YARN modules where this might be 
> happening. 
> {code}
> vamsee-MBP:hadoop-yarn-project vamsee$ find . -name "*-site.xml"
> ./hadoop-yarn/conf/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/resources/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher/src/test/resources/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-client/src/test/resources/core-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/resources/core-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/resources/core-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/resources/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-classes/core-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-classes/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/resources/core-site.xml
> {code}
> And out of these only two modules (hadoop-yarn-server-resourcemanager, 
> hadoop-yarn-server-tests) are building test-jars. In future, if we start 
> building test-jar of other modules, we should exclude these xml files from 
> being bundled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3842) NMProxy should retry on NMNotYetReadyException

2015-06-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597708#comment-14597708
 ] 

Hudson commented on YARN-3842:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2165 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2165/])
YARN-3842. NMProxy should retry on NMNotYetReadyException. (Robert Kanter via 
kasha) (kasha: rev 5ebf2817e58e1be8214dc1916a694a912075aa0a)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestNMProxy.java
* hadoop-yarn-project/CHANGES.txt


> NMProxy should retry on NMNotYetReadyException
> --
>
> Key: YARN-3842
> URL: https://issues.apache.org/jira/browse/YARN-3842
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Karthik Kambatla
>Assignee: Robert Kanter
>Priority: Critical
> Fix For: 2.7.1
>
> Attachments: MAPREDUCE-6409.001.patch, MAPREDUCE-6409.002.patch, 
> YARN-3842.001.patch, YARN-3842.002.patch
>
>
> Consider the following scenario:
> 1. RM assigns a container on node N to an app A.
> 2. Node N is restarted
> 3. A tries to launch container on node N.
> Step 3 could lead to an NMNotYetReadyException depending on whether NM N has 
> registered with the RM. In MR, this is considered a task attempt failure. A 
> few of these could lead to a task/job failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-1965) Interrupted exception when closing YarnClient

2015-06-23 Thread Kuhu Shukla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla reassigned YARN-1965:
-

Assignee: Kuhu Shukla

> Interrupted exception when closing YarnClient
> -
>
> Key: YARN-1965
> URL: https://issues.apache.org/jira/browse/YARN-1965
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: api
>Affects Versions: 2.3.0
>Reporter: Oleg Zhurakousky
>Assignee: Kuhu Shukla
>Priority: Minor
>  Labels: newbie
>
> It's more of a nuisance than a bug, but nevertheless: 
> {code}
> 16:16:48,709 ERROR pool-1-thread-1 ipc.Client:195 - Interrupted while waiting 
> for clientExecutorto stop
> java.lang.InterruptedException
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2072)
>   at 
> java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1468)
>   at 
> org.apache.hadoop.ipc.Client$ClientExecutorServiceFactory.unrefAndCleanup(Client.java:191)
>   at org.apache.hadoop.ipc.Client.stop(Client.java:1235)
>   at org.apache.hadoop.ipc.ClientCache.stopClient(ClientCache.java:100)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.close(ProtobufRpcEngine.java:251)
>   at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:626)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.close(ApplicationClientProtocolPBClientImpl.java:112)
>   at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:621)
>   at 
> org.apache.hadoop.io.retry.DefaultFailoverProxyProvider.close(DefaultFailoverProxyProvider.java:57)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.close(RetryInvocationHandler.java:206)
>   at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:626)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStop(YarnClientImpl.java:124)
>   at 
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
> . . .
> {code}
> It happens sporadically when stopping YarnClient. 
> Looking at the code in Client's 'unrefAndCleanup', it's not immediately obvious 
> why and by whom the interrupt is thrown, but in any event it should not be 
> logged as an ERROR; a WARN with no stack trace is probably enough.
> Also, for consistency and correctness, you may want to interrupt the current 
> thread as well.
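A small sketch of the handling suggested above (WARN without a stack trace, then re-interrupt) is shown here; it is not the Client.unrefAndCleanup code, and the class name and the java.util.logging logger are assumptions for the example.

{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;

public class QuietShutdown {
  private static final Logger LOG = Logger.getLogger(QuietShutdown.class.getName());

  /** Waits for the executor to stop; on interrupt, warns (no stack trace) and re-interrupts. */
  static void awaitQuietly(ExecutorService executor, long timeoutSeconds) {
    try {
      executor.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      LOG.warning("Interrupted while waiting for clientExecutor to stop");
      Thread.currentThread().interrupt(); // preserve the interrupt status for callers
    }
  }

  public static void main(String[] args) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    executor.shutdown();
    awaitQuietly(executor, 5);
  }
}
{code}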



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang

2015-06-23 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597622#comment-14597622
 ] 

Jun Gong commented on YARN-3809:


As with the previous explanation, the checkstyle and test case errors are not related to this patch.

> Failed to launch new attempts because ApplicationMasterLauncher's threads all 
> hang
> --
>
> Key: YARN-3809
> URL: https://issues.apache.org/jira/browse/YARN-3809
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3809.01.patch, YARN-3809.02.patch, 
> YARN-3809.03.patch
>
>
> ApplicationMasterLauncher creates a thread pool of size 10 to handle 
> AMLauncherEventType events (LAUNCH and CLEANUP).
> In our cluster, there were many NMs with 10+ AMs running on them, and one NM 
> shut down for some reason. After the RM marked that NM as LOST, it cleaned up 
> the AMs running on it, so ApplicationMasterLauncher had to handle these 10+ 
> CLEANUP events. Its thread pool filled up, and every thread hung in 
> containerMgrProxy.stopContainers(stopRequest) because the NM was down and the 
> default RPC timeout is 15 minutes. For those 15 minutes 
> ApplicationMasterLauncher could not handle new events such as LAUNCH, so new 
> attempts failed to launch because of the timeout.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3831) Localization failed when a local disk turns from bad to good without NM initializes it

2015-06-23 Thread Jun Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Gong resolved YARN-3831.

Resolution: Not A Problem

> Localization failed when a local disk turns from bad to good without NM 
> initializes it
> --
>
> Key: YARN-3831
> URL: https://issues.apache.org/jira/browse/YARN-3831
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Jun Gong
>Assignee: Jun Gong
>
> A local disk can turn from bad to good without the NM initializing it 
> (creating /path-to-local-dir/usercache and /path-to-local-dir/filecache). When 
> localizing a container, container-executor will then try to create directories 
> under /path-to-local-dir/usercache and fail, so the container's localization 
> fails. 
> Related log is as following:
> {noformat}
> 2015-06-19 18:00:01,205 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Created localizer for container_1431957472783_38706012_01_000465
> 2015-06-19 18:00:01,212 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Writing credentials to the nmPrivate file 
> /data8/yarnenv/local/nmPrivate/container_1431957472783_38706012_01_000465.tokens.
>  Credentials list: 
> 2015-06-19 18:00:01,216 WARN 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code 
> from container container_1431957472783_38706012_01_000465 startLocalizer is : 
> 20
> org.apache.hadoop.util.Shell$ExitCodeException: 
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
> at org.apache.hadoop.util.Shell.run(Shell.java:379)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:205)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:981)
> 2015-06-19 18:00:01,216 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : command 
> provided 0
> 2015-06-19 18:00:01,216 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : user is 
> tdwadmin
> 2015-06-19 18:00:01,216 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Failed to create 
> directory /data2/yarnenv/local/usercache/tdwadmin - No such file or directory
> 2015-06-19 18:00:01,216 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Localizer failed
> java.io.IOException: Application application_1431957472783_38706012 
> initialization failed (exitCode=20) with output: main : command provided 0
> main : user is tdwadmin
> Failed to create directory /data2/yarnenv/local/usercache/tdwadmin - No such 
> file or directory
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:214)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:981)
> Caused by: org.apache.hadoop.util.Shell$ExitCodeException: 
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
> at org.apache.hadoop.util.Shell.run(Shell.java:379)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:205)
> ... 1 more
> 2015-06-19 18:00:01,216 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1431957472783_38706012_01_000465 transitioned from 
> LOCALIZING to LOCALIZATION_FAILED
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3831) Localization failed when a local disk turns from bad to good without NM initializes it

2015-06-23 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597616#comment-14597616
 ] 

Jun Gong commented on YARN-3831:


[~zxu], thank you for the reminder, and sorry for the late reply.

The bug was found in version 2.2.0. I checked the latest code and it seems to 
have been fixed: there is a 'localDirsChangeListener' that handles 
'onDirsChanged', so when a local disk turns from bad to good, 
'localDirsChangeListener' will initialize it.

Closing this now.
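To illustrate the mechanism mentioned above, here is a hedged sketch of a dirs-change listener that re-initializes a directory when it turns good again. It is not the actual LocalDirsHandlerService/localDirsChangeListener code; the interface, class names, and sub-directory layout are assumptions.

{code}
import java.io.File;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DirsChangeListenerSketch {
  interface DirsChangeListener {
    void onDirsChanged(List<File> currentGoodDirs);
  }

  /** Re-creates the per-disk layout for any directory that newly turned good. */
  static class LocalDirsInitializer implements DirsChangeListener {
    private final Set<File> knownGood = new HashSet<>();

    @Override
    public void onDirsChanged(List<File> currentGoodDirs) {
      for (File dir : currentGoodDirs) {
        if (knownGood.add(dir)) {
          // Directory just turned good: recreate the expected sub-directories
          // so container-executor can create user directories under them.
          new File(dir, "usercache").mkdirs();
          new File(dir, "filecache").mkdirs();
          new File(dir, "nmPrivate").mkdirs();
        }
      }
      knownGood.retainAll(currentGoodDirs); // forget dirs that went bad again
    }
  }

  public static void main(String[] args) {
    LocalDirsInitializer listener = new LocalDirsInitializer();
    listener.onDirsChanged(List.of(new File("/tmp/local1")));
    // Later, /tmp/local2 recovers and shows up in the good list again:
    listener.onDirsChanged(List.of(new File("/tmp/local1"), new File("/tmp/local2")));
  }
}
{code}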

> Localization failed when a local disk turns from bad to good without NM 
> initializes it
> --
>
> Key: YARN-3831
> URL: https://issues.apache.org/jira/browse/YARN-3831
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Jun Gong
>Assignee: Jun Gong
>
> A local disk can turn from bad to good without the NM initializing it 
> (creating /path-to-local-dir/usercache and /path-to-local-dir/filecache). When 
> localizing a container, container-executor will then try to create directories 
> under /path-to-local-dir/usercache and fail, so the container's localization 
> fails. 
> Related log is as following:
> {noformat}
> 2015-06-19 18:00:01,205 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Created localizer for container_1431957472783_38706012_01_000465
> 2015-06-19 18:00:01,212 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Writing credentials to the nmPrivate file 
> /data8/yarnenv/local/nmPrivate/container_1431957472783_38706012_01_000465.tokens.
>  Credentials list: 
> 2015-06-19 18:00:01,216 WARN 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code 
> from container container_1431957472783_38706012_01_000465 startLocalizer is : 
> 20
> org.apache.hadoop.util.Shell$ExitCodeException: 
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
> at org.apache.hadoop.util.Shell.run(Shell.java:379)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:205)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:981)
> 2015-06-19 18:00:01,216 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : command 
> provided 0
> 2015-06-19 18:00:01,216 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : user is 
> tdwadmin
> 2015-06-19 18:00:01,216 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Failed to create 
> directory /data2/yarnenv/local/usercache/tdwadmin - No such file or directory
> 2015-06-19 18:00:01,216 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Localizer failed
> java.io.IOException: Application application_1431957472783_38706012 
> initialization failed (exitCode=20) with output: main : command provided 0
> main : user is tdwadmin
> Failed to create directory /data2/yarnenv/local/usercache/tdwadmin - No such 
> file or directory
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:214)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:981)
> Caused by: org.apache.hadoop.util.Shell$ExitCodeException: 
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
> at org.apache.hadoop.util.Shell.run(Shell.java:379)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:205)
> ... 1 more
> 2015-06-19 18:00:01,216 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1431957472783_38706012_01_000465 transitioned from 
> LOCALIZING to LOCALIZATION_FAILED
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3835) hadoop-yarn-server-resourcemanager test package bundles core-site.xml, yarn-site.xml

2015-06-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597532#comment-14597532
 ] 

Hudson commented on YARN-3835:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #237 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/237/])
YARN-3835. hadoop-yarn-server-resourcemanager test package bundles 
core-site.xml, yarn-site.xml (vamsee via rkanter) (rkanter: rev 
99271b762129d78c86f3c9733a24c77962b0b3f7)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml
* hadoop-yarn-project/CHANGES.txt


> hadoop-yarn-server-resourcemanager test package bundles core-site.xml, 
> yarn-site.xml
> 
>
> Key: YARN-3835
> URL: https://issues.apache.org/jira/browse/YARN-3835
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Vamsee Yarlagadda
>Assignee: Vamsee Yarlagadda
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: YARN-3835.patch
>
>
> It looks like by default yarn is bundling core-site.xml, yarn-site.xml in 
> test artifact of hadoop-yarn-server-resourcemanager which means that any 
> downstream project which uses this as a dependency can have a problem in picking 
> up the user supplied/environment supplied core-site.xml, yarn-site.xml
> So we should ideally exclude these .xml files from being bundled into the 
> test-jar. (Similar to YARN-1748)
> I also proactively looked at other YARN modules where this might be 
> happening. 
> {code}
> vamsee-MBP:hadoop-yarn-project vamsee$ find . -name "*-site.xml"
> ./hadoop-yarn/conf/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/resources/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher/src/test/resources/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-client/src/test/resources/core-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/resources/core-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/resources/core-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/resources/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-classes/core-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-classes/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/resources/core-site.xml
> {code}
> And out of these only two modules (hadoop-yarn-server-resourcemanager, 
> hadoop-yarn-server-tests) are building test-jars. In future, if we start 
> building test-jar of other modules, we should exclude these xml files from 
> being bundled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3842) NMProxy should retry on NMNotYetReadyException

2015-06-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597534#comment-14597534
 ] 

Hudson commented on YARN-3842:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #237 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/237/])
YARN-3842. NMProxy should retry on NMNotYetReadyException. (Robert Kanter via 
kasha) (kasha: rev 5ebf2817e58e1be8214dc1916a694a912075aa0a)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestNMProxy.java
* hadoop-yarn-project/CHANGES.txt


> NMProxy should retry on NMNotYetReadyException
> --
>
> Key: YARN-3842
> URL: https://issues.apache.org/jira/browse/YARN-3842
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Karthik Kambatla
>Assignee: Robert Kanter
>Priority: Critical
> Fix For: 2.7.1
>
> Attachments: MAPREDUCE-6409.001.patch, MAPREDUCE-6409.002.patch, 
> YARN-3842.001.patch, YARN-3842.002.patch
>
>
> Consider the following scenario:
> 1. RM assigns a container on node N to an app A.
> 2. Node N is restarted
> 3. A tries to launch container on node N.
> Step 3 could lead to an NMNotYetReadyException depending on whether NM N has 
> registered with the RM. In MR, this is considered a task attempt failure. A 
> few of these could lead to a task/job failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3840) Resource Manager web ui bug on main view after application number 9999

2015-06-23 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597527#comment-14597527
 ] 

Devaraj K commented on YARN-3840:
-

[~Alexandre LINTE], this seems to be a sorting issue with respect to the app 
ids. The UI considers only the first four digits of the application number when 
sorting in ascending/descending order, so application ids above 9999 are not 
shown in the expected order and get mixed in with the other apps. You can see 
this in the attached image RMApps.png, which shows apps with id > 9999 
displayed among the other apps.

!RMApps.png|thumbnail!

Please check and confirm whether the same thing is happening in your case by 
searching for the specific app id in the search box. Thanks.
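For illustration, the sketch below contrasts lexicographic sorting of application ids with sorting on the parsed application number, which is the kind of comparison the UI would need; it is a standalone example, not the RM web UI code, and the sample ids are made up.

{code}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class AppIdSortDemo {
  /** Parses the trailing application number of an id like application_1434703279149_10001. */
  static long appNumber(String appId) {
    return Long.parseLong(appId.substring(appId.lastIndexOf('_') + 1));
  }

  public static void main(String[] args) {
    List<String> ids = new ArrayList<>(List.of(
        "application_1434703279149_9998",
        "application_1434703279149_10001",
        "application_1434703279149_9999",
        "application_1434703279149_10000"));

    // Plain string sorting interleaves 5-digit ids with 4-digit ones,
    // which is the kind of mix-up visible in RMApps.png.
    ids.sort(Comparator.naturalOrder());
    System.out.println("lexicographic: " + ids);

    // Sorting on the parsed number keeps ids past 9999 in the right place.
    ids.sort(Comparator.comparingLong(AppIdSortDemo::appNumber));
    System.out.println("numeric:       " + ids);
  }
}
{code}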

> Resource Manager web ui bug on main view after application number 9999
> --
>
> Key: YARN-3840
> URL: https://issues.apache.org/jira/browse/YARN-3840
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Centos 6.6
> Java 1.7
>Reporter: LINTE
> Attachments: RMApps.png
>
>
> On the WEBUI, the global main view page 
> http://resourcemanager:8088/cluster/apps doesn't display applications over 
> 9999.
> With command line it works (# yarn application -list).
> Regards,
> Alexandre



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3840) Resource Manager web ui bug on main view after application number 9999

2015-06-23 Thread Devaraj K (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj K updated YARN-3840:

Attachment: RMApps.png

> Resource Manager web ui bug on main view after application number 9999
> --
>
> Key: YARN-3840
> URL: https://issues.apache.org/jira/browse/YARN-3840
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Centos 6.6
> Java 1.7
>Reporter: LINTE
> Attachments: RMApps.png
>
>
> On the WEBUI, the global main view page 
> http://resourcemanager:8088/cluster/apps doesn't display applications over 
> 9999.
> With command line it works (# yarn application -list).
> Regards,
> Alexandre



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3842) NMProxy should retry on NMNotYetReadyException

2015-06-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597512#comment-14597512
 ] 

Hudson commented on YARN-3842:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #967 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/967/])
YARN-3842. NMProxy should retry on NMNotYetReadyException. (Robert Kanter via 
kasha) (kasha: rev 5ebf2817e58e1be8214dc1916a694a912075aa0a)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestNMProxy.java


> NMProxy should retry on NMNotYetReadyException
> --
>
> Key: YARN-3842
> URL: https://issues.apache.org/jira/browse/YARN-3842
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Karthik Kambatla
>Assignee: Robert Kanter
>Priority: Critical
> Fix For: 2.7.1
>
> Attachments: MAPREDUCE-6409.001.patch, MAPREDUCE-6409.002.patch, 
> YARN-3842.001.patch, YARN-3842.002.patch
>
>
> Consider the following scenario:
> 1. RM assigns a container on node N to an app A.
> 2. Node N is restarted
> 3. A tries to launch container on node N.
> Step 3 could lead to an NMNotYetReadyException depending on whether NM N has 
> registered with the RM. In MR, this is considered a task attempt failure. A 
> few of these could lead to a task/job failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3835) hadoop-yarn-server-resourcemanager test package bundles core-site.xml, yarn-site.xml

2015-06-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597510#comment-14597510
 ] 

Hudson commented on YARN-3835:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #967 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/967/])
YARN-3835. hadoop-yarn-server-resourcemanager test package bundles 
core-site.xml, yarn-site.xml (vamsee via rkanter) (rkanter: rev 
99271b762129d78c86f3c9733a24c77962b0b3f7)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml


> hadoop-yarn-server-resourcemanager test package bundles core-site.xml, 
> yarn-site.xml
> 
>
> Key: YARN-3835
> URL: https://issues.apache.org/jira/browse/YARN-3835
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Vamsee Yarlagadda
>Assignee: Vamsee Yarlagadda
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: YARN-3835.patch
>
>
> It looks like by default yarn is bundling core-site.xml, yarn-site.xml in 
> test artifact of hadoop-yarn-server-resourcemanager which means that any 
> downstream project which uses this as a dependency can have a problem in picking 
> up the user supplied/environment supplied core-site.xml, yarn-site.xml
> So we should ideally exclude these .xml files from being bundled into the 
> test-jar. (Similar to YARN-1748)
> I also proactively looked at other YARN modules where this might be 
> happening. 
> {code}
> vamsee-MBP:hadoop-yarn-project vamsee$ find . -name "*-site.xml"
> ./hadoop-yarn/conf/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/resources/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher/src/test/resources/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-client/src/test/resources/core-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/resources/core-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/resources/core-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/resources/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-classes/core-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-classes/yarn-site.xml
> ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/resources/core-site.xml
> {code}
> And out of these only two modules (hadoop-yarn-server-resourcemanager, 
> hadoop-yarn-server-tests) are building test-jars. In future, if we start 
> building test-jar of other modules, we should exclude these xml files from 
> being bundled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3045) [Event producers] Implement NM writing container lifecycle events to ATS

2015-06-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597488#comment-14597488
 ] 

Hadoop QA commented on YARN-3045:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  15m 48s | Findbugs (version ) appears to 
be broken on YARN-2928. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 7 new or modified test files. |
| {color:green}+1{color} | javac |   7m 58s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 49s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 24s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 32s | There were no new checkstyle 
issues. |
| {color:red}-1{color} | whitespace |   0m  2s | The patch has 1  line(s) that 
end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 38s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 41s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 59s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests |   9m 17s | Tests failed in 
hadoop-yarn-applications-distributedshell. |
| {color:green}+1{color} | yarn tests |   6m 10s | Tests passed in 
hadoop-yarn-server-nodemanager. |
| | |  54m 27s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | 
hadoop.yarn.applications.distributedshell.TestDistributedShell |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12740912/YARN-3045-YARN-2928.004.patch
 |
| Optional Tests | javac unit findbugs checkstyle javadoc |
| git revision | YARN-2928 / 84f37f1 |
| whitespace | 
https://builds.apache.org/job/PreCommit-YARN-Build/8325/artifact/patchprocess/whitespace.txt
 |
| hadoop-yarn-applications-distributedshell test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8325/artifact/patchprocess/testrun_hadoop-yarn-applications-distributedshell.txt
 |
| hadoop-yarn-server-nodemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8325/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8325/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf908.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8325/console |


This message was automatically generated.

> [Event producers] Implement NM writing container lifecycle events to ATS
> 
>
> Key: YARN-3045
> URL: https://issues.apache.org/jira/browse/YARN-3045
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Sangjin Lee
>Assignee: Naganarasimha G R
> Attachments: YARN-3045-YARN-2928.002.patch, 
> YARN-3045-YARN-2928.003.patch, YARN-3045-YARN-2928.004.patch, 
> YARN-3045.20150420-1.patch
>
>
> Per design in YARN-2928, implement NM writing container lifecycle events and 
> container system metrics to ATS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3045) [Event producers] Implement NM writing container lifecycle events to ATS

2015-06-23 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-3045:

Labels:   (was: BB2015-05-TBR)

> [Event producers] Implement NM writing container lifecycle events to ATS
> 
>
> Key: YARN-3045
> URL: https://issues.apache.org/jira/browse/YARN-3045
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Sangjin Lee
>Assignee: Naganarasimha G R
> Attachments: YARN-3045-YARN-2928.002.patch, 
> YARN-3045-YARN-2928.003.patch, YARN-3045-YARN-2928.004.patch, 
> YARN-3045.20150420-1.patch
>
>
> Per design in YARN-2928, implement NM writing container lifecycle events and 
> container system metrics to ATS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk

2015-06-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597459#comment-14597459
 ] 

Hadoop QA commented on YARN-2871:
-

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |   6m 41s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 41s | There were no new javac warning 
messages. |
| {color:green}+1{color} | release audit |   0m 20s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 45s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 31s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 32s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 25s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |  50m 40s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | |  69m 38s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12741254/YARN-2871.000.patch |
| Optional Tests | javac unit findbugs checkstyle |
| git revision | trunk / 41ae776 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8324/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8324/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8324/console |


This message was automatically generated.

> TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
> -
>
> Key: YARN-2871
> URL: https://issues.apache.org/jira/browse/YARN-2871
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Ted Yu
>Assignee: zhihai xu
>Priority: Minor
> Attachments: YARN-2871.000.patch
>
>
> From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746):
> {code}
> Failed tests:
>   TestRMRestart.testRMRestartGetApplicationList:957
> rMAppManager.logApplicationSummary(
> isA(org.apache.hadoop.yarn.api.records.ApplicationId)
> );
> Wanted 3 times:
> -> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957)
> But was 2 times:
> -> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk

2015-06-23 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597392#comment-14597392
 ] 

zhihai xu commented on YARN-2871:
-

I uploaded a patch YARN-2871.000.patch for review.

> TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
> -
>
> Key: YARN-2871
> URL: https://issues.apache.org/jira/browse/YARN-2871
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Ted Yu
>Assignee: zhihai xu
>Priority: Minor
> Attachments: YARN-2871.000.patch
>
>
> From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746):
> {code}
> Failed tests:
>   TestRMRestart.testRMRestartGetApplicationList:957
> rMAppManager.logApplicationSummary(
> isA(org.apache.hadoop.yarn.api.records.ApplicationId)
> );
> Wanted 3 times:
> -> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957)
> But was 2 times:
> -> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk

2015-06-23 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-2871:

Attachment: YARN-2871.000.patch

> TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
> -
>
> Key: YARN-2871
> URL: https://issues.apache.org/jira/browse/YARN-2871
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Ted Yu
>Assignee: zhihai xu
>Priority: Minor
> Attachments: YARN-2871.000.patch
>
>
> From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746):
> {code}
> Failed tests:
>   TestRMRestart.testRMRestartGetApplicationList:957
> rMAppManager.logApplicationSummary(
> isA(org.apache.hadoop.yarn.api.records.ApplicationId)
> );
> Wanted 3 times:
> -> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957)
> But was 2 times:
> -> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk

2015-06-23 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597368#comment-14597368
 ] 

zhihai xu commented on YARN-2871:
-

I can work on this issue. Based on the failure logs at 
https://builds.apache.org/job/PreCommit-YARN-Build/8323/testReport/org.apache.hadoop.yarn.server.resourcemanager/TestRMRestart/testRMRestartGetApplicationList_1_/,
the root cause is a race condition in the test. 
{{logApplicationSummary}} is called when RMAppManager handles the APP_COMPLETED 
RMAppManagerEvent, and RMAppImpl sends the APP_COMPLETED event to the 
AsyncDispatcher thread. If the AsyncDispatcher thread doesn't process the 
APP_COMPLETED event in time, the test fails. I think adding some delay before 
the verification will fix this issue.
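Rather than a fixed sleep, one option is a timeout-based verification that polls until the expected calls arrive. The sketch below is only an illustration of that idea using Mockito's timeout() (assuming Mockito is on the classpath) with a made-up AppManager interface; it is not the attached patch.

{code}
import static org.mockito.ArgumentMatchers.anyString;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.timeout;
import static org.mockito.Mockito.verify;

public class AsyncVerifyDemo {
  interface AppManager {
    void logApplicationSummary(String applicationId);
  }

  public static void main(String[] args) {
    AppManager rmAppManager = mock(AppManager.class);

    // Simulate an AsyncDispatcher delivering the APP_COMPLETED events on
    // another thread, some time after the test's main thread moves on.
    new Thread(() -> {
      try {
        Thread.sleep(200);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      for (int i = 0; i < 3; i++) {
        rmAppManager.logApplicationSummary("application_1435039562888_000" + (i + 1));
      }
    }).start();

    // Poll for up to 5 seconds instead of asserting immediately, so the
    // verification no longer races with the dispatcher thread.
    verify(rmAppManager, timeout(5000).times(3)).logApplicationSummary(anyString());
    System.out.println("saw 3 logApplicationSummary calls");
  }
}
{code}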
The important logs from failed test:
{code}
2015-06-23 06:06:20,484 INFO  [Thread-693] resourcemanager.ResourceManager 
(ResourceManager.java:serviceStart(572)) - Recovery started
2015-06-23 06:06:20,484 INFO  [Thread-693] 
security.RMDelegationTokenSecretManager 
(RMDelegationTokenSecretManager.java:recover(178)) - recovering 
RMDelegationTokenSecretManager.
2015-06-23 06:06:20,484 INFO  [Thread-693] resourcemanager.RMAppManager 
(RMAppManager.java:recover(425)) - Recovering 3 applications
2015-06-23 06:06:20,485 DEBUG [Thread-693] rmapp.RMAppImpl 
(RMAppImpl.java:handle(756)) - Processing event for 
application_1435039562888_0001 of type RECOVER
2015-06-23 06:06:20,485 INFO  [Thread-693] rmapp.RMAppImpl 
(RMAppImpl.java:recover(781)) - Recovering app: application_1435039562888_0001 
with 1 attempts and final state = FINISHED
2015-06-23 06:06:20,485 INFO  [Thread-693] attempt.RMAppAttemptImpl 
(RMAppAttemptImpl.java:recover(827)) - Recovering attempt: 
appattempt_1435039562888_0001_01 with final state: FINISHED
2015-06-23 06:06:20,485 DEBUG [Thread-693] attempt.RMAppAttemptImpl 
(RMAppAttemptImpl.java:handle(781)) - Processing event for 
appattempt_1435039562888_0001_01 of type RECOVER
2015-06-23 06:06:20,486 INFO  [Thread-693] attempt.RMAppAttemptImpl 
(RMAppAttemptImpl.java:handle(793)) - appattempt_1435039562888_0001_01 
State change from NEW to FINISHED
2015-06-23 06:06:20,486 INFO  [Thread-693] rmapp.RMAppImpl 
(RMAppImpl.java:handle(768)) - application_1435039562888_0001 State change from 
NEW to FINISHED
2015-06-23 06:06:20,486 DEBUG [Thread-693] rmapp.RMAppImpl 
(RMAppImpl.java:handle(756)) - Processing event for 
application_1435039562888_0002 of type RECOVER
2015-06-23 06:06:20,486 INFO  [Thread-693] rmapp.RMAppImpl 
(RMAppImpl.java:recover(781)) - Recovering app: application_1435039562888_0002 
with 1 attempts and final state = FAILED
2015-06-23 06:06:20,487 INFO  [Thread-693] attempt.RMAppAttemptImpl 
(RMAppAttemptImpl.java:recover(827)) - Recovering attempt: 
appattempt_1435039562888_0002_01 with final state: FAILED
2015-06-23 06:06:20,487 DEBUG [Thread-693] attempt.RMAppAttemptImpl 
(RMAppAttemptImpl.java:handle(781)) - Processing event for 
appattempt_1435039562888_0002_01 of type RECOVER
2015-06-23 06:06:20,487 INFO  [Thread-693] attempt.RMAppAttemptImpl 
(RMAppAttemptImpl.java:handle(793)) - appattempt_1435039562888_0002_01 
State change from NEW to FAILED
2015-06-23 06:06:20,487 INFO  [Thread-693] rmapp.RMAppImpl 
(RMAppImpl.java:handle(768)) - application_1435039562888_0002 State change from 
NEW to FAILED
2015-06-23 06:06:20,488 DEBUG [Thread-693] rmapp.RMAppImpl 
(RMAppImpl.java:handle(756)) - Processing event for 
application_1435039562888_0003 of type RECOVER
2015-06-23 06:06:20,488 INFO  [Thread-693] rmapp.RMAppImpl 
(RMAppImpl.java:recover(781)) - Recovering app: application_1435039562888_0003 
with 1 attempts and final state = KILLED
2015-06-23 06:06:20,488 INFO  [Thread-693] attempt.RMAppAttemptImpl 
(RMAppAttemptImpl.java:recover(827)) - Recovering attempt: 
appattempt_1435039562888_0003_01 with final state: KILLED
2015-06-23 06:06:20,489 DEBUG [Thread-693] attempt.RMAppAttemptImpl 
(RMAppAttemptImpl.java:handle(781)) - Processing event for 
appattempt_1435039562888_0003_01 of type RECOVER
2015-06-23 06:06:20,489 INFO  [Thread-693] attempt.RMAppAttemptImpl 
(RMAppAttemptImpl.java:handle(793)) - appattempt_1435039562888_0003_01 
State change from NEW to KILLED
2015-06-23 06:06:20,489 INFO  [Thread-693] rmapp.RMAppImpl 
(RMAppImpl.java:handle(768)) - application_1435039562888_0003 State change from 
NEW to KILLED
2015-06-23 06:06:20,489 INFO  [Thread-693] resourcemanager.ResourceManager 
(ResourceManager.java:serviceStart(579)) - Recovery ended
2015-06-23 06:06:20,489 DEBUG [Thread-693] service.CompositeService 
(CompositeService.java:serviceStart(115)) - RMActiveServices: starting 
services, size=15
2015-06-23 06:06:20,489 INFO  [Thread-693] 
security.RMContainerTokenSecretManager 
(RMContainerTokenSecretManager.java:rollMasterKey(105)) - Rolling master-key 
for container-tokens
2015-06-23 06:06:20,4

[jira] [Assigned] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk

2015-06-23 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu reassigned YARN-2871:
---

Assignee: zhihai xu

> TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
> -
>
> Key: YARN-2871
> URL: https://issues.apache.org/jira/browse/YARN-2871
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Ted Yu
>Assignee: zhihai xu
>Priority: Minor
>
> From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746):
> {code}
> Failed tests:
>   TestRMRestart.testRMRestartGetApplicationList:957
> rMAppManager.logApplicationSummary(
> isA(org.apache.hadoop.yarn.api.records.ApplicationId)
> );
> Wanted 3 times:
> -> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957)
> But was 2 times:
> -> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66)
> {code}
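For context, the failed assertion above is a Mockito verification. A minimal sketch of 
that shape is below; it is illustrative only, the real assertion lives in 
TestRMRestart#testRMRestartGetApplicationList, and the class and field names here are 
assumptions.

{code}
// Illustrative sketch only: the class and field names are assumptions, not
// the actual TestRMRestart code.
package org.apache.hadoop.yarn.server.resourcemanager;

import static org.mockito.Matchers.isA;
import static org.mockito.Mockito.times;
import static org.mockito.Mockito.verify;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.junit.Test;

public class RecoverySummarySketch {
  private RMAppManager rMAppManager; // assumed: a Mockito spy injected into the RM under test

  @Test
  public void verifiesOneSummaryPerRecoveredApp() {
    // ... restart the RM and let it recover 3 finished applications ...
    // One logApplicationSummary call is expected per recovered application
    // (3 in total); the flaky run above observed only 2, hence the failure.
    verify(rMAppManager, times(3))
        .logApplicationSummary(isA(ApplicationId.class));
  }
}
{code}

The 2-vs-3 count in the report above is what such a verification looks like when one 
expected invocation has not happened by the time the check runs.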



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3798) ZKRMStateStore shouldn't create new session without occurrance of SESSIONEXPIED

2015-06-23 Thread Rakesh R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597321#comment-14597321
 ] 

Rakesh R commented on YARN-3798:


Sorry, I missed your comment. If Curator syncs up the data, it would be fine. 
Otherwise there could be a chance of lag, as we discussed earlier. Honestly, I 
haven't tried Curator yet, so perhaps someone can cross-check this part.
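
For reference, forcing a client-side sync through Curator looks roughly like the 
sketch below. This is a minimal sketch assuming the plain CuratorFramework API; the 
connect string and path are placeholders, and whether the RM state store should issue 
such a sync is exactly the open question here.

{code}
// Hedged sketch of Curator's sync(); connect string and path are placeholders.
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CuratorSyncSketch {
  public static void main(String[] args) throws Exception {
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "localhost:2181", new ExponentialBackoffRetry(1000, 3));
    client.start();
    // sync() asks ZooKeeper to bring this session's view up to date with the
    // leader before subsequent reads, which is what would close the lag
    // discussed above.
    client.sync().forPath("/rmstore");
    client.close();
  }
}
{code}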

> ZKRMStateStore shouldn't create new session without occurrance of 
> SESSIONEXPIED
> ---
>
> Key: YARN-3798
> URL: https://issues.apache.org/jira/browse/YARN-3798
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3
>Reporter: Bibin A Chundatt
>Assignee: Varun Saxena
>Priority: Blocker
> Attachments: RM.log, YARN-3798-2.7.002.patch, 
> YARN-3798-branch-2.7.002.patch, YARN-3798-branch-2.7.patch
>
>
> RM going down with NoNode exception during create of znode for appattempt
> *Please find the exception logs*
> {code}
> 2015-06-09 10:09:44,732 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> ZKRMStateStore Session connected
> 2015-06-09 10:09:44,732 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> ZKRMStateStore Session restored
> 2015-06-09 10:09:44,886 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
>   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405)
>   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-06-09 10:09:44,887 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed 
> out ZK retries. Giving up!
> 2015-06-09 10:09:44,887 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error 
> updating appAttempt: appattempt_1433764310492_7152_01
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
>   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405)
>   at org.apache.zookeeper.ZooKeeper.mult

[jira] [Commented] (YARN-3798) ZKRMStateStore shouldn't create new session without occurrance of SESSIONEXPIED

2015-06-23 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597275#comment-14597275
 ] 

Tsuyoshi Ozawa commented on YARN-3798:
--

The result of running test-patch.sh against branch-2.7 is as follows:
{quote}
$ dev-support/test-patch.sh ../YARN-3798-2.7.002.patch 
...
-1 overall.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

-1 javadoc.  The javadoc tool appears to have generated 48 warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 eclipse:eclipse.  The patch built with eclipse:eclipse.

+1 findbugs.  The patch does not introduce any new Findbugs (version ) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.
{quote}

The javadoc warnings are not related to the patch, since it doesn't change any 
signatures or javadoc.

> ZKRMStateStore shouldn't create new session without occurrance of 
> SESSIONEXPIED
> ---
>
> Key: YARN-3798
> URL: https://issues.apache.org/jira/browse/YARN-3798
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3
>Reporter: Bibin A Chundatt
>Assignee: Varun Saxena
>Priority: Blocker
> Attachments: RM.log, YARN-3798-2.7.002.patch, 
> YARN-3798-branch-2.7.002.patch, YARN-3798-branch-2.7.patch
>
>
> RM going down with NoNode exception during create of znode for appattempt
> *Please find the exception logs*
> {code}
> 2015-06-09 10:09:44,732 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> ZKRMStateStore Session connected
> 2015-06-09 10:09:44,732 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> ZKRMStateStore Session restored
> 2015-06-09 10:09:44,886 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
>   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405)
>   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
>   at 
> org.apache.hado

[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state

2015-06-23 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597268#comment-14597268
 ] 

Varun Saxena commented on YARN-2902:


Sorry, I meant the below.
2. On heartbeat from the container localizer, if the localizer runner is already 
stopped, we can instruct the {color:red}container localizer{color} to clean up 
the resources it is downloading.
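
A self-contained, hypothetical sketch of that idea is below; none of these types are 
the real NodeManager classes, they only illustrate the intended control flow on 
heartbeat.

{code}
// Hypothetical sketch only: LocalizerAction here is a stand-in enum, not the
// NodeManager's protocol record.
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class LocalizerHeartbeatSketch {
  enum LocalizerAction { LIVE, DIE_AND_CLEANUP }

  private final Set<String> stoppedRunners = ConcurrentHashMap.newKeySet();

  /** Called when a LocalizerRunner is stopped, e.g. its container was killed. */
  public void markRunnerStopped(String localizerId) {
    stoppedRunners.add(localizerId);
  }

  /** Heartbeat from the container localizer process. */
  public LocalizerAction onHeartbeat(String localizerId) {
    // If the runner is already stopped, tell the container localizer to
    // delete its partially downloaded (DOWNLOADING) resources instead of
    // leaving them orphaned in the local cache.
    return stoppedRunners.contains(localizerId)
        ? LocalizerAction.DIE_AND_CLEANUP
        : LocalizerAction.LIVE;
  }
}
{code}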

> Killing a container that is localizing can orphan resources in the 
> DOWNLOADING state
> 
>
> Key: YARN-2902
> URL: https://issues.apache.org/jira/browse/YARN-2902
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: Jason Lowe
>Assignee: Varun Saxena
> Attachments: YARN-2902.002.patch, YARN-2902.03.patch, YARN-2902.patch
>
>
> If a container is in the process of localizing when it is stopped/killed, then 
> resources are left in the DOWNLOADING state. If no other container comes 
> along and requests these resources, they linger around with no references, 
> but they aren't cleaned up during normal cache cleanup scans, since the scan 
> never deletes resources in the DOWNLOADING state even if their reference 
> count is zero.
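
Put differently, the cleanup scan amounts to a check of roughly the shape below, 
which is why zero-reference DOWNLOADING entries are never reclaimed. The names are 
hypothetical, not the actual NodeManager cleanup code.

{code}
// Hypothetical predicate: only illustrates the behaviour described above.
enum ResourceState { DOWNLOADING, LOCALIZED }

class CachedResourceSketch {
  ResourceState state;
  int refCount;

  boolean isRemovableByCleanup() {
    // Only fully LOCALIZED entries with no references are eligible, so a
    // resource left in DOWNLOADING after a kill is never cleaned up.
    return refCount == 0 && state == ResourceState.LOCALIZED;
  }
}
{code}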



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)