[jira] [Updated] (YARN-4687) Document Reservation ACLs

2016-03-07 Thread Sean Po (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Po updated YARN-4687:
--
Attachment: YARN-4687.v3.patch

Addressing the mvn site issue by escaping the > and < characters.

> Document Reservation ACLs
> -
>
> Key: YARN-4687
> URL: https://issues.apache.org/jira/browse/YARN-4687
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Sean Po
>Assignee: Sean Po
>Priority: Minor
> Attachments: YARN-4687.v1.patch, YARN-4687.v2.patch, 
> YARN-4687.v3.patch
>
>
> YARN-2575 introduces ACLs for ReservationSystem. This JIRA is for adding 
> documentation on how to configure the ACLs for Capacity/Fair schedulers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4715) Add support to read resource types from a config file

2016-03-07 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184497#comment-15184497
 ] 

Wangda Tan commented on YARN-4715:
--

Patch looks good, +1, thanks [~vvasudev].

> Add support to read resource types from a config file
> -
>
> Key: YARN-4715
> URL: https://issues.apache.org/jira/browse/YARN-4715
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Attachments: YARN-4715-YARN-3926.001.patch, 
> YARN-4715-YARN-3926.002.patch, YARN-4715-YARN-3926.003.patch, 
> YARN-4715-YARN-3926.004.patch, YARN-4715-YARN-3926.005.patch
>
>
> This ticket is to add support to allow the RM to read the resource types to 
> be used for scheduling from a config file. I'll file follow up tickets to add 
> similar support in the NM as well as to handle the RM-NM handshake protocol 
> issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4764) Application submission fails when submitted queue is not available in scheduler xml

2016-03-07 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184479#comment-15184479
 ] 

Jian He commented on YARN-4764:
---

Makes sense to me, thank you for the summarization!

> Application submission fails when submitted queue is not available in 
> scheduler xml
> ---
>
> Key: YARN-4764
> URL: https://issues.apache.org/jira/browse/YARN-4764
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4764.patch, 0002-YARN-4764.patch
>
>
> Available queues in capacity scheduler 
> -root
> --queue1
> --queue2
> Submit application with queue3
> {noformat}
> 16/03/04 16:40:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: 
> job_1457077554812_1901
> 16/03/04 16:40:08 INFO mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN, 
> Service: ha-hdfs:hacluster, Ident: (HDFS_DELEGATION_TOKEN token 3938 for 
> mapred with renewer yarn)
> 16/03/04 16:40:08 WARN retry.RetryInvocationHandler: Exception while invoking 
> class 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.submitApplication
>  over rm2. Not retrying because try once and fail.
> java.lang.NullPointerException: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:366)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:289)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:618)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:252)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:483)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2305)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2301)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1742)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2301)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateRuntimeException(RPCUtil.java:85)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:122)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.submitApplication(ApplicationClientProtocolPBClientImpl.java:272)
> {noformat}
> The error should indicate that the queue doesn't exist.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4764) Application submission fails when submitted queue is not available in scheduler xml

2016-03-07 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184445#comment-15184445
 ] 

Bibin A Chundatt commented on YARN-4764:


[~sunilg]/[~jianhe]
Thank you for the discussion and the summary. So the currently implemented patch 
is fine, right?

> Application submission fails when submitted queue is not available in 
> scheduler xml
> ---
>
> Key: YARN-4764
> URL: https://issues.apache.org/jira/browse/YARN-4764
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4764.patch, 0002-YARN-4764.patch
>
>
> Available queues in capacity scheduler 
> -root
> --queue1
> --queue2
> Submit application with queue3
> {noformat}
> 16/03/04 16:40:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: 
> job_1457077554812_1901
> 16/03/04 16:40:08 INFO mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN, 
> Service: ha-hdfs:hacluster, Ident: (HDFS_DELEGATION_TOKEN token 3938 for 
> mapred with renewer yarn)
> 16/03/04 16:40:08 WARN retry.RetryInvocationHandler: Exception while invoking 
> class 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.submitApplication
>  over rm2. Not retrying because try once and fail.
> java.lang.NullPointerException: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:366)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:289)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:618)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:252)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:483)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2305)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2301)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1742)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2301)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateRuntimeException(RPCUtil.java:85)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:122)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.submitApplication(ApplicationClientProtocolPBClientImpl.java:272)
> {noformat}
> The error should indicate that the queue doesn't exist.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4764) Application submission fails when submitted queue is not available in scheduler xml

2016-03-07 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184440#comment-15184440
 ] 

Sunil G commented on YARN-4764:
---

Thanks [~bibinchundatt] for the analysis and suggestions.

I will try to summarize the discussion so far. Agreeing that the current 
behavior is inconsistent, we have to handle the non-existent-queue scenario the 
same way with and without ACLs. So we had 2 options.

1. For CS alone, the non-existent-queue check could be done in 
{{createAndPopulateRMApp}}. That way it would be common for apps with or w/o 
ACLs enabled. It would cause the app to be rejected before the RMApp is even 
created, hence audit logging is needed.
2. Or we could skip the ACL check if the queue is non-existent and pass the app 
on to the scheduler so that it can send {{APP_REJECT}}. This would be in line 
with the old behavior. A minor drawback is that we already know the queue does 
not exist, yet we still send the app to the scheduler to handle a known failure.

I also had an offline talk with [~jianhe] on this. Maybe we can go with option 
2 for now; that gives us consistent behavior. But we need to improve here, so I 
can raise an improvement ticket so that all queue-related validation checks can 
be done in a new YarnScheduler API. We can then see how much of it can be made 
common for Fair and CS too.
Thoughts?
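
For illustration, here is a minimal self-contained sketch of option 2; every class and 
method name below is a hypothetical stand-in, not the actual RMAppManager/CapacityScheduler 
code.
{code}
import java.util.HashMap;
import java.util.Map;

// Illustrative only: when the queue does not exist, skip the ACL check and let the
// scheduler path reject the app (APP_REJECT), matching the old behaviour instead of an NPE.
public class QueueSubmissionSketch {
  interface Queue { boolean canSubmit(String user); }

  static String submit(Map<String, Queue> queues, String queueName, String user) {
    Queue queue = queues.get(queueName);
    if (queue == null) {
      // No NullPointerException on the ACL check; defer to the scheduler,
      // which would emit an APP_REJECT-style event for the unknown queue.
      return "APP_REJECT: queue '" + queueName + "' does not exist";
    }
    if (!queue.canSubmit(user)) {
      return "REJECTED: user " + user + " lacks submit access to " + queueName;
    }
    return "ACCEPTED";
  }

  public static void main(String[] args) {
    Map<String, Queue> queues = new HashMap<>();
    queues.put("queue1", user -> true);
    queues.put("queue2", user -> false);
    System.out.println(submit(queues, "queue3", "mapred")); // unknown queue, rejected cleanly
    System.out.println(submit(queues, "queue1", "mapred")); // ACCEPTED
  }
}
{code}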


> Application submission fails when submitted queue is not available in 
> scheduler xml
> ---
>
> Key: YARN-4764
> URL: https://issues.apache.org/jira/browse/YARN-4764
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4764.patch, 0002-YARN-4764.patch
>
>
> Available queues in capacity scheduler 
> -root
> --queue1
> --queue2
> Submit application with queue3
> {noformat}
> 16/03/04 16:40:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: 
> job_1457077554812_1901
> 16/03/04 16:40:08 INFO mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN, 
> Service: ha-hdfs:hacluster, Ident: (HDFS_DELEGATION_TOKEN token 3938 for 
> mapred with renewer yarn)
> 16/03/04 16:40:08 WARN retry.RetryInvocationHandler: Exception while invoking 
> class 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.submitApplication
>  over rm2. Not retrying because try once and fail.
> java.lang.NullPointerException: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:366)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:289)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:618)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:252)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:483)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2305)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2301)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1742)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2301)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateRuntimeException(RPCUtil.java:85)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:122)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.submitApplication(ApplicationClientProtocolPBClientImpl.java:272)
> {noformat}
> The error should indicate that the queue doesn't exist.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4117) End to end unit test with mini YARN cluster for AMRMProxy Service

2016-03-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184420#comment-15184420
 ] 

Hadoop QA commented on YARN-4117:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 19s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 23s 
{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 
59s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 47s 
{color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 52s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
41s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 21s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
48s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 
11s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 5s 
{color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 9s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 11s 
{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
15s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 46s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 46s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 27s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 27s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 34s 
{color} | {color:red} hadoop-yarn-project/hadoop-yarn: patch generated 3 new + 
55 unchanged - 0 fixed = 58 total (was 55) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 12s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
37s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 
37s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 56s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 58s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 10m 12s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.8.0_74. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 7m 14s {color} 
| {color:red} hadoop-yarn-server-tests in the patch failed with JDK v1.8.0_74. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 66m 27s {color} 
| {color:red} hadoop-yarn-client in the patch failed with JDK v1.8.0_74. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 10m 2s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.7.0_95. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 7m 12s {color} 
| {color:red} hadoop-yarn-server-tests in the patch failed with JDK v1.7.0_95. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | 

[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928

2016-03-07 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184400#comment-15184400
 ] 

Naganarasimha G R commented on YARN-4712:
-

Thanks [~sjlee0] for the clarification with the example. I agree that there is 
no point in summing up {{cpuUsageTotalCoresPercentage}}, but there are issues 
with the other approach (aggregating *cpuUsagePercentPerCore*) which you have 
also mentioned:
# The total number of cores in the cluster would be required to actually arrive 
at a percentage. It also involves complexities when nodes go down intermittently 
before the aggregation, and the effective usage should not exceed the actual 
cluster cores.
# The total number of cores should not be just the machine's cores but the 
share actually made available to YARN (*nodeCpuPercentageForYARN*).
# It still might not be a fair comparison, as the type of core may differ in a 
heterogeneous cluster. Usually we apply some multiplication factor so that 
vcores match overall, so maybe we need to consider this factor too?

But anyway this requires more discussion on which metric is the right one to 
choose, so I will raise a new JIRA, as the bug we currently have will block 
anyone testing ATSv2.
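
For context, a minimal self-contained sketch of the missing guard described in the first 
bullet of the issue description below. UNAVAILABLE mirrors 
{{ResourceCalculatorProcessTree.UNAVAILABLE}}; everything else is a hypothetical stand-in, 
not the actual ContainersMonitor or NMTimelinePublisher code.
{code}
// Illustrative only: guard against UNAVAILABLE (-1) before deriving/publishing CPU usage.
public class CpuUsageGuardSketch {
  static final float UNAVAILABLE = -1.0f; // same value as ResourceCalculatorProcessTree.UNAVAILABLE

  static void maybePublish(float cpuUsagePercentPerCore, int numProcessors) {
    if (cpuUsagePercentPerCore == UNAVAILABLE) {
      // Without this check, -1 / numProcessors becomes a small negative value that
      // no longer equals UNAVAILABLE, so a later publisher-side check never fires.
      return;
    }
    float cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore / numProcessors;
    System.out.println("publish CPU usage: " + cpuUsageTotalCoresPercentage);
  }

  public static void main(String[] args) {
    maybePublish(UNAVAILABLE, 8); // skipped
    maybePublish(250.0f, 8);      // published as 31.25 (percent of total cores)
  }
}
{code}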


> CPU Usage Metric is not captured properly in YARN-2928
> --
>
> Key: YARN-4712
> URL: https://issues.apache.org/jira/browse/YARN-4712
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-4712-YARN-2928.v1.001.patch, 
> YARN-4712-YARN-2928.v1.002.patch, YARN-4712-YARN-2928.v1.003.patch, 
> YARN-4712-YARN-2928.v1.004.patch
>
>
> There are 2 issues with CPU usage collection:
> * I was able to observe that, many times, the CPU usage obtained from 
> {{pTree.getCpuUsagePercent()}} is 
> ResourceCalculatorProcessTree.UNAVAILABLE (i.e. -1), but ContainersMonitor still 
> does the calculation {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore 
> / resourceCalculatorPlugin.getNumProcessors()}}, because of which the UNAVAILABLE 
> check in {{NMTimelinePublisher.reportContainerResourceUsage}} is never 
> triggered. Proper checks need to be added.
> * {{EntityColumnPrefix.METRIC}} always uses LongConverter, but 
> ContainersMonitor publishes decimal values for the CPU usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4062) Add the flush and compaction functionality via coprocessors and scanners for flow run table

2016-03-07 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184269#comment-15184269
 ] 

Sangjin Lee commented on YARN-4062:
---

Sorry [~vrushalic], it took a while to get to this. It looks good for the most 
part. I do have some questions and minor comments.

(YarnConfiguration.java)
- please define it in yarn-default.xml as well

(FlowScannerOperation.java)
- l.38, 43: typo: "Comapction" -> "compaction"

(FlowRunColumnPrefix.java)
- l.43: if we're removing the aggregation operation from the only enum entry, 
should we remove the aggregation operation ({{aggOp}}) completely from this 
class?

(FlowRunCoprocessor.java)
- l.268: Logging {{request.isMajor()}} directly seems a bit cryptic. It would print 
strings like "Compactionrequest= ... true for ObserverContext ...". Should we 
do something like {{request.isMajor() ? "major compaction" : "minor 
compaction"}} instead (see the small sketch after these notes)? And did you 
mean it to be an INFO logging statement?

(TimelineStorageUtils.java)
- l.524: nit: we probably don't need that beginning space

(FlowScanner.java)
- l.79: we should not set the action by default, as it is always passed in 
through the constructor
- l.178: Shouldn't this be {{cellLimit}} still? {{limit}} is not an argument of 
this method.
- l.220: could you elaborate on why this change is needed? I'm generally not 
too clear on the difference between {{cellLimit}} and {{limit}}.  How are these 
values different and under what conditions?
- l.233: missing the row key value?
- l.381-382: should we throw an exception as this is not possible (currently)?
- l.484: Although this would be true for the most part, {{sum.longValue() > 
0L}} is not equivalent to summations being performed on compaction here, 
especially if there were negative individual values for example. Should we 
introduce a boolean flag to denote the case where summations were performed 
instead?

(TestHBaseStorageFlowRunCompaction.java)
- there are 3 unused imports
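
Regarding the FlowRunCoprocessor.java logging note above, a tiny illustrative sketch of the 
suggested message format; the helper and the region name here are hypothetical stand-ins, 
not the actual coprocessor code.
{code}
// Illustrative only: render the compaction type instead of logging request.isMajor() raw.
public class CompactionLogSketch {
  static String describe(boolean isMajor, String region) {
    return (isMajor ? "major compaction" : "minor compaction") + " requested for " + region;
  }

  public static void main(String[] args) {
    System.out.println(describe(true, "flowrun-region-1"));  // major compaction requested for ...
    System.out.println(describe(false, "flowrun-region-2")); // minor compaction requested for ...
  }
}
{code}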


> Add the flush and compaction functionality via coprocessors and scanners for 
> flow run table
> ---
>
> Key: YARN-4062
> URL: https://issues.apache.org/jira/browse/YARN-4062
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Vrushali C
>Assignee: Vrushali C
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-4062-YARN-2928.04.patch, 
> YARN-4062-YARN-2928.05.patch, YARN-4062-YARN-2928.06.patch, 
> YARN-4062-YARN-2928.07.patch, YARN-4062-YARN-2928.1.patch, 
> YARN-4062-feature-YARN-2928.01.patch, YARN-4062-feature-YARN-2928.02.patch, 
> YARN-4062-feature-YARN-2928.03.patch
>
>
> As part of YARN-3901, coprocessor and scanner is being added for storing into 
> the flow_run table. It also needs a flush & compaction processing in the 
> coprocessor and perhaps a new scanner to deal with the data during flushing 
> and compaction stages. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4764) Application submission fails when submitted queue is not available in scheduler xml

2016-03-07 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184254#comment-15184254
 ] 

Bibin A Chundatt commented on YARN-4764:


# Also, only for the capacity scheduler would we be doing validation in 
{{createAndPopulateRMApp}}.
I think {{APP_REJECT}} would be the better approach.

> Application submission fails when submitted queue is not available in 
> scheduler xml
> ---
>
> Key: YARN-4764
> URL: https://issues.apache.org/jira/browse/YARN-4764
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4764.patch, 0002-YARN-4764.patch
>
>
> Available queues in capacity scheduler 
> -root
> --queue1
> --queue2
> Submit application with queue3
> {noformat}
> 16/03/04 16:40:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: 
> job_1457077554812_1901
> 16/03/04 16:40:08 INFO mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN, 
> Service: ha-hdfs:hacluster, Ident: (HDFS_DELEGATION_TOKEN token 3938 for 
> mapred with renewer yarn)
> 16/03/04 16:40:08 WARN retry.RetryInvocationHandler: Exception while invoking 
> class 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.submitApplication
>  over rm2. Not retrying because try once and fail.
> java.lang.NullPointerException: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:366)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:289)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:618)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:252)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:483)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2305)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2301)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1742)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2301)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateRuntimeException(RPCUtil.java:85)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:122)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.submitApplication(ApplicationClientProtocolPBClientImpl.java:272)
> {noformat}
> The error should indicate that the queue doesn't exist.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4764) Application submission fails when submitted queue is not available in scheduler xml

2016-03-07 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184239#comment-15184239
 ] 

Bibin A Chundatt commented on YARN-4764:


[~jianhe]

# The behaviour would not be the same as in previous versions.
# If we are planning to add validation in {{createAndPopulateRMApp}}, then cases 
like *invalid queue, submit to parent queue, etc.* need to be added in 
{{createAndPopulateRMApp}} as well, so the same set of checks ends up being 
added in too many places.

I think we should keep the old behaviour and handle only the ACL check in 
{{createAndPopulateRMApp}}.
 

> Application submission fails when submitted queue is not available in 
> scheduler xml
> ---
>
> Key: YARN-4764
> URL: https://issues.apache.org/jira/browse/YARN-4764
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4764.patch, 0002-YARN-4764.patch
>
>
> Available queues in capacity scheduler 
> -root
> --queue1
> --queue2
> Submit application with queue3
> {noformat}
> 16/03/04 16:40:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: 
> job_1457077554812_1901
> 16/03/04 16:40:08 INFO mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN, 
> Service: ha-hdfs:hacluster, Ident: (HDFS_DELEGATION_TOKEN token 3938 for 
> mapred with renewer yarn)
> 16/03/04 16:40:08 WARN retry.RetryInvocationHandler: Exception while invoking 
> class 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.submitApplication
>  over rm2. Not retrying because try once and fail.
> java.lang.NullPointerException: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:366)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:289)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:618)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:252)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:483)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2305)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2301)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1742)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2301)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateRuntimeException(RPCUtil.java:85)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:122)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.submitApplication(ApplicationClientProtocolPBClientImpl.java:272)
> {noformat}
> The error should indicate that the queue doesn't exist.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4770) Auto-restart of containers should work across NM restarts.

2016-03-07 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184195#comment-15184195
 ] 

Jun Gong commented on YARN-4770:


Thanks [~vinodkv] for reporting the issue. The patch in YARN-3998 should have 
handled this case.

{quote}
The relaunch feature needs to work across NM restarts, so we should save the 
retry-context and policy per container into the state-store and reload it to 
continue relaunching after NM restart.
{quote}
As [~vvasudev] said, "The container retry policy details are already stored in 
the state-store as part of the ContainerLaunchContext", so we do not need to 
handle it separately.

{quote}
We should also handle restarting of any containers that may have crashed during 
the NM reboot.
{quote}
If a container crashed during the NM reboot, it would transition to the 
RELAUNCHING state. I will check this again.

> Auto-restart of containers should work across NM restarts.
> --
>
> Key: YARN-4770
> URL: https://issues.apache.org/jira/browse/YARN-4770
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Vinod Kumar Vavilapalli
>
> See my comment 
> [here|https://issues.apache.org/jira/browse/YARN-3998?focusedCommentId=15133367=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15133367]
>  on YARN-3998. We need to take care of two things:
>  - The relaunch feature needs to work across NM restarts, so we should save 
> the retry-context and policy per container into the state-store and reload it 
> to continue relaunching after NM restart.
>  - We should also handle restarting of any containers that may have crashed 
> during the NM reboot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-03-07 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184167#comment-15184167
 ] 

Jun Gong commented on YARN-3998:


Thanks [~vvasudev] and [~vinodkv] for the comments and suggestions. I will 
update the patch.

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field (retry-times) in ContainerLaunchContext. When the AM 
> launches containers, it could specify the value. The NM will then re-launch the 
> container 'retry-times' times when it fails to run (e.g. exit code is not 0). 
> This saves a lot of time: it avoids container localization, the RM does not 
> need to re-schedule the container, and local files in the container's working 
> directory are left for re-use (if a container has downloaded some big files, 
> it does not need to re-download them when running again). 
> We find this useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4687) Document Reservation ACLs

2016-03-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184153#comment-15184153
 ] 

Hadoop QA commented on YARN-4687:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 
54s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 14s 
{color} | {color:green} trunk passed {color} |
| {color:red}-1{color} | {color:red} mvnsite {color} | {color:red} 0m 8s 
{color} | {color:red} hadoop-yarn-site in the patch failed. {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
17s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 8m 0s {color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:0ca8df7 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12791868/YARN-4687.v2.patch |
| JIRA Issue | YARN-4687 |
| Optional Tests |  asflicense  mvnsite  |
| uname | Linux 5dc6dd5c48e0 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed 
Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh 
|
| git revision | trunk / 391da36 |
| mvnsite | 
https://builds.apache.org/job/PreCommit-YARN-Build/10718/artifact/patchprocess/patch-mvnsite-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-site.txt
 |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/10718/console |
| Powered by | Apache Yetus 0.2.0   http://yetus.apache.org |


This message was automatically generated.



> Document Reservation ACLs
> -
>
> Key: YARN-4687
> URL: https://issues.apache.org/jira/browse/YARN-4687
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Sean Po
>Assignee: Sean Po
>Priority: Minor
> Attachments: YARN-4687.v1.patch, YARN-4687.v2.patch
>
>
> YARN-2575 introduces ACLs for ReservationSystem. This JIRA is for adding 
> documentation on how to configure the ACLs for Capacity/Fair schedulers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4117) End to end unit test with mini YARN cluster for AMRMProxy Service

2016-03-07 Thread Giovanni Matteo Fumarola (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giovanni Matteo Fumarola updated YARN-4117:
---
Attachment: YARN-4117.v2.patch

> End to end unit test with mini YARN cluster for AMRMProxy Service
> -
>
> Key: YARN-4117
> URL: https://issues.apache.org/jira/browse/YARN-4117
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager, resourcemanager
>Reporter: Kishore Chaliparambil
>Assignee: Giovanni Matteo Fumarola
> Attachments: YARN-4117.v0.patch, YARN-4117.v1.patch, 
> YARN-4117.v2.patch
>
>
> YARN-2884 introduces a proxy between the AM and RM. This JIRA proposes an 
> end-to-end unit test of the AMRMProxy service using a mini YARN cluster. The 
> test will validate register, allocate, finish-application, and token renewal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4687) Document Reservation ACLs

2016-03-07 Thread Sean Po (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184047#comment-15184047
 ] 

Sean Po commented on YARN-4687:
---

Thanks for the review and comments, [~asuresh]. I've updated the patch to 
address them.

> Document Reservation ACLs
> -
>
> Key: YARN-4687
> URL: https://issues.apache.org/jira/browse/YARN-4687
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Sean Po
>Assignee: Sean Po
>Priority: Minor
> Attachments: YARN-4687.v1.patch, YARN-4687.v2.patch
>
>
> YARN-2575 introduces ACLs for ReservationSystem. This JIRA is for adding 
> documentation on how to configure the ACLs for Capacity/Fair schedulers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4687) Document Reservation ACLs

2016-03-07 Thread Sean Po (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Po updated YARN-4687:
--
Attachment: YARN-4687.v2.patch

> Document Reservation ACLs
> -
>
> Key: YARN-4687
> URL: https://issues.apache.org/jira/browse/YARN-4687
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Sean Po
>Assignee: Sean Po
>Priority: Minor
> Attachments: YARN-4687.v1.patch, YARN-4687.v2.patch
>
>
> YARN-2575 introduces ACLs for ReservationSystem. This JIRA is for adding 
> documentation on how to configure the ACLs for Capacity/Fair schedulers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4696) EntityGroupFSTimelineStore to work in the absence of an RM

2016-03-07 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183990#comment-15183990
 ] 

Li Lu commented on YARN-4696:
-

Reformatting the previous comment:

FileSystemTimelineWriter.java
- Should TIMELINE_SERVICE_ENTITYFILE_FS_SUPPORT_APPEND move to YarnConfiguration?
- Why was LogFDsCache#flush changed into a synchronized method? I believe we're 
doing fine-grained locking here (with each of the FDs), and only flush in 
LogFDsCache is marked as synchronized. What am I missing here? (A small sketch 
contrasting the two locking styles follows these notes.)

TimelineWriter.java
- Not sure if "Direct timeline writer" is clear enough to indicate where the 
data goes and which pattern the writer follows. By saying "direct" here, do we 
mean we're using a write-through strategy?

EntityGroupFSTimelineStore.java
- In scanActiveLogs, the new variable "scanned" looks a little confusing: when 
we return the variable scanned, the actual scanning jobs are not guaranteed to 
be done, so it reads like something "to be scanned" at return time. My only 
concern is that this naming may give people the false impression that, by the 
time this method returns, that many logs have already been scanned. This also 
applies to EntityLogScanner.
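
To make the locking question concrete, here is a small self-contained sketch contrasting 
the two styles being discussed. It is a generic illustration with hypothetical names, not 
the actual FileSystemTimelineWriter/LogFDsCache code.
{code}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative only: coarse-grained vs fine-grained flushing of cached descriptors.
public class FlushLockingSketch {
  static class Fd {
    synchronized void flush() { /* flush this one descriptor */ }
  }

  private final List<Fd> fds = new CopyOnWriteArrayList<>();

  // Coarse-grained: one cache-level lock serializes every flush, even of unrelated FDs.
  public synchronized void flushCoarse() {
    for (Fd fd : fds) { fd.flush(); }
  }

  // Fine-grained: only each FD's own lock is taken, so flushes of different FDs can
  // proceed concurrently; the question above is what the coarse form buys on top of this.
  public void flushFine() {
    for (Fd fd : fds) { fd.flush(); }
  }

  public static void main(String[] args) {
    FlushLockingSketch cache = new FlushLockingSketch();
    cache.fds.add(new Fd());
    cache.flushCoarse();
    cache.flushFine();
  }
}
{code}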

> EntityGroupFSTimelineStore to work in the absence of an RM
> --
>
> Key: YARN-4696
> URL: https://issues.apache.org/jira/browse/YARN-4696
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
> Attachments: YARN-4696-001.patch, YARN-4696-002.patch, 
> YARN-4696-003.patch, YARN-4696-005.patch, YARN-4696-006.patch, 
> YARN-4696-007.patch, YARN-4696-008.patch, YARN-4696-009.patch, 
> YARN-4696-010.patch, YARN-4696-012.patch
>
>
> {{EntityGroupFSTimelineStore}} now depends on an RM being up and running, with 
> the configuration pointing to it. This is a new change, and it impacts testing, 
> where you have historically been able to test without an RM running.
> The sole purpose of the probe is to automatically determine if an app is 
> running; it falls back to "unknown" if not. If the RM connection were 
> optional, the "unknown" codepath could be called directly, relying on the age 
> of the file as a measure of completion.
> Options:
> # add a flag to disable the RM connect
> # skip it automatically if the RM is not defined/set to 0.0.0.0
> # disable retries on the yarn client IPC; if it fails, tag the app as unknown.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4696) EntityGroupFSTimelineStore to work in the absence of an RM

2016-03-07 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183986#comment-15183986
 ] 

Li Lu commented on YARN-4696:
-

Thanks [~ste...@apache.org] for the work! Mostly LGTM, only a few nits:

FileSystemTimelineWriter.java
- Should TIMELINE_SERVICE_ENTITYFILE_FS_SUPPORT_APPEND move to YarnConfiguration?
- Why was LogFDsCache#flush changed into a synchronized method? I believe we're 
doing fine-grained locking here (with each of the FDs), and only flush in 
LogFDsCache is marked as synchronized. What am I missing here? cc/[~xgong]
TimelineWriter.java
- Not sure if "Direct timeline writer" is clear enough to indicate where the 
data goes and which pattern the writer follows. By saying "direct" here, do we 
mean we're using a write-through strategy?
EntityGroupFSTimelineStore.java
- In scanActiveLogs, the new variable "scanned" looks a little confusing: when 
we return the variable scanned, the actual scanning jobs are not guaranteed to 
be done, so it reads like something "to be scanned" at return time. My only 
concern is that this naming may give people the false impression that, by the 
time this method returns, that many logs have already been scanned. This also 
applies to EntityLogScanner.

> EntityGroupFSTimelineStore to work in the absence of an RM
> --
>
> Key: YARN-4696
> URL: https://issues.apache.org/jira/browse/YARN-4696
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
> Attachments: YARN-4696-001.patch, YARN-4696-002.patch, 
> YARN-4696-003.patch, YARN-4696-005.patch, YARN-4696-006.patch, 
> YARN-4696-007.patch, YARN-4696-008.patch, YARN-4696-009.patch, 
> YARN-4696-010.patch, YARN-4696-012.patch
>
>
> {{EntityGroupFSTimelineStore}} now depends on an RM being up and running, with 
> the configuration pointing to it. This is a new change, and it impacts testing, 
> where you have historically been able to test without an RM running.
> The sole purpose of the probe is to automatically determine if an app is 
> running; it falls back to "unknown" if not. If the RM connection were 
> optional, the "unknown" codepath could be called directly, relying on the age 
> of the file as a measure of completion.
> Options:
> # add a flag to disable the RM connect
> # skip it automatically if the RM is not defined/set to 0.0.0.0
> # disable retries on the yarn client IPC; if it fails, tag the app as unknown.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1547) Prevent DoS of ApplicationMasterProtocol by putting in limits

2016-03-07 Thread Giovanni Matteo Fumarola (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giovanni Matteo Fumarola updated YARN-1547:
---
Attachment: YARN-1547.pdf

> Prevent DoS of ApplicationMasterProtocol by putting in limits
> -
>
> Key: YARN-1547
> URL: https://issues.apache.org/jira/browse/YARN-1547
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Giovanni Matteo Fumarola
> Attachments: YARN-1547.pdf
>
>
> Points of DoS in ApplicationMasterProtocol
>  - Host and trackingURL in RegisterApplicationMasterRequest
>  - Diagnostics, final trackingURL in FinishApplicationMasterRequest
>  - Unlimited number of resourceAsks, containersToBeReleased and 
> resourceBlacklistRequest in AllocateRequest
> -- Unbounded number of priorities and/or resourceRequests in each ask.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4721) RM to try to auth with HDFS on startup, retry with max diagnostics on failure

2016-03-07 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183684#comment-15183684
 ] 

Vinod Kumar Vavilapalli commented on YARN-4721:
---

bq. rather than the more fundamental "your RM doesn't have the credentials to 
talk to HDFS"
The thing is that YARN is built agnostic of file-systems, and your proposal of 
"ls /" breaks this very fundamental assumption; that is why I am against it. One 
could argue for a stand-alone service (outside of YARN) that does these 
validations.

There are apps that do not depend on file-systems at all, e.g. Samza. And there 
are apps that depend on multiple file-systems, e.g. distcp. So the notion of 
"this cluster cannot talk to my HDFS" doesn't generalize. It is context 
dependent and almost always "my app cannot talk to this and that HDFS 
instance".

> RM to try to auth with HDFS on startup, retry with max diagnostics on failure
> -
>
> Key: YARN-4721
> URL: https://issues.apache.org/jira/browse/YARN-4721
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
> Attachments: HADOOP-12889-001.patch
>
>
> If the RM can't auth with HDFS, this can first surface during job submission, 
> which can cause confusion about what's wrong and whose credentials are 
> playing up.
> Instead, the RM could try to talk to HDFS on launch, {{ls /}} should suffice. 
> If it can't auth, it can then tell UGI to log more and retry.
> I don't know what the policy should be if the RM can't auth to HDFS at this 
> point. Certainly it can't currently accept work. But should it fail fast or 
> keep going in the hope that the problem is in the KDC or NN and will fix 
> itself without an RM restart?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-03-07 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183634#comment-15183634
 ] 

Vinod Kumar Vavilapalli commented on YARN-3998:
---

Okay, it seems like we are in general agreement.
 - Let's try to look at the unification part in this JIRA itself, as it concerns 
public-facing APIs, even if we don't unify them completely here.
 - I filed YARN-4770 and assigned it to myself. [~hex108] / [~vvasudev], feel 
free to take it over as you start working on it.
 - [~vvasudev], please also file a sub-task under YARN-4725 for the LCE-related 
work. Tx.

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field (retry-times) in ContainerLaunchContext. When the AM 
> launches containers, it could specify the value. The NM will then re-launch the 
> container 'retry-times' times when it fails to run (e.g. exit code is not 0). 
> This saves a lot of time: it avoids container localization, the RM does not 
> need to re-schedule the container, and local files in the container's working 
> directory are left for re-use (if a container has downloaded some big files, 
> it does not need to re-download them when running again). 
> We find this useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4770) Auto-restart of containers should work across NM restarts.

2016-03-07 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-4770:
-

 Summary: Auto-restart of containers should work across NM restarts.
 Key: YARN-4770
 URL: https://issues.apache.org/jira/browse/YARN-4770
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli


See my comment 
[here|https://issues.apache.org/jira/browse/YARN-3998?focusedCommentId=15133367=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15133367]
 on YARN-3998. We need to take care of two things:
 - The relaunch feature needs to work across NM restarts, so we should save the 
retry-context and policy per container into the state-store and reload it to 
continue relaunching after NM restart.
 - We should also handle restarting of any containers that may have crashed 
during the NM reboot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928

2016-03-07 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183611#comment-15183611
 ] 

Sangjin Lee commented on YARN-4712:
---

My main concern with using {{cpuUsageTotalCoresPercentage}} is about 
*aggregation*, and I think using {{cpuUsageTotalCoresPercentage}} breaks down 
in a heterogeneous cluster. Here is an illustrative example.

Suppose you have a 2-node cluster, where the first node has 4 cores and the 
second node has 8 cores. Furthermore, suppose that the container on the 4-core 
node is utilizing all 4 cores and the container on the 8-core node is utilizing 
1 core. Since the entire cluster has 12 cores and the app is using 5 cores, the 
utilization of this app should be 42% (5/12 cores).

However, if we use {{cpuUsageTotalCoresPercentage}}, we have a problem. The 
container on the 4-core node will report 100% utilization on that node, and the 
other container on the 8-core node will report 12.5% utilization. Then, if we 
aggregated the container metrics to the app, the app would have 112.5% 
utilization of the cluster or 56% per node. IMO this is not correct, or at best 
misleading.

If the node capacity in terms of cores is homogeneous, it makes no difference 
which one we use. However, if we have a heterogeneous cluster, the latter 
essentially under-weighs larger machines by using *simple* averages. This would 
result in a misleading and confusing aggregated result.

I do recognize using {{cpuUsagePercentPerCore}} would require the total number 
of cores for the cluster when aggregated to arrive at a relative percentage 
number. But overall I do feel that {{cpuUsagePercentPerCore}} would be a more 
accurate measure of the cluster utilization when aggregated.

I am OK with separating that discussion to another JIRA.
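
For concreteness, a tiny self-contained program that reproduces the arithmetic in the 
example above (not YARN code; the variable names just mirror the two metrics being 
compared):
{code}
// Two-node cluster: a 4-core node fully used by one container, an 8-core node
// with one core used by another container of the same app.
public class CpuAggregationExample {
  public static void main(String[] args) {
    double[] coresUsed = {4.0, 1.0};  // per-container usage, in cores
    double[] nodeCores = {4.0, 8.0};  // cores on the node hosting each container
    double clusterCores = 12.0;

    double sumPerCorePct = 0.0;       // aggregating cpuUsagePercentPerCore
    double sumTotalCoresPct = 0.0;    // aggregating cpuUsageTotalCoresPercentage
    for (int i = 0; i < coresUsed.length; i++) {
      sumPerCorePct += coresUsed[i] * 100.0;
      sumTotalCoresPct += coresUsed[i] / nodeCores[i] * 100.0;
    }

    // 500 / 12 = ~41.7%, the intended cluster-wide utilization (5 of 12 cores)
    System.out.println("per-core sum / cluster cores = " + (sumPerCorePct / clusterCores) + "%");
    // 100 + 12.5 = 112.5%, the misleading simple sum of per-node percentages
    System.out.println("sum of total-cores percentages = " + sumTotalCoresPct + "%");
  }
}
{code}
Aggregating the per-core numbers and normalizing by the cluster's core count recovers the 
~42% figure; summing the per-node percentages does not.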

> CPU Usage Metric is not captured properly in YARN-2928
> --
>
> Key: YARN-4712
> URL: https://issues.apache.org/jira/browse/YARN-4712
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-4712-YARN-2928.v1.001.patch, 
> YARN-4712-YARN-2928.v1.002.patch, YARN-4712-YARN-2928.v1.003.patch, 
> YARN-4712-YARN-2928.v1.004.patch
>
>
> There are 2 issues with CPU usage collection:
> * I was able to observe that, many times, the CPU usage obtained from 
> {{pTree.getCpuUsagePercent()}} is 
> ResourceCalculatorProcessTree.UNAVAILABLE (i.e. -1), but ContainersMonitor still 
> does the calculation {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore 
> / resourceCalculatorPlugin.getNumProcessors()}}, because of which the UNAVAILABLE 
> check in {{NMTimelinePublisher.reportContainerResourceUsage}} is never 
> triggered. Proper checks need to be added.
> * {{EntityColumnPrefix.METRIC}} always uses LongConverter, but 
> ContainersMonitor publishes decimal values for the CPU usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4762) NMs failing on DelegatingLinuxContainerRuntime init with LCE on

2016-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183596#comment-15183596
 ] 

Hudson commented on YARN-4762:
--

FAILURE: Integrated in Hadoop-trunk-Commit #9437 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/9437/])
YARN-4762. Fixed CgroupHandler's creation and usage to avoid (vinodkv: rev 
b2661765a5a48392a5691cee15904ed2de147b00)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/DelegatingLinuxContainerRuntime.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/DockerLinuxContainerRuntime.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/TestDockerContainerRuntime.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/ResourceHandlerModule.java


> NMs failing on DelegatingLinuxContainerRuntime init with LCE on
> ---
>
> Key: YARN-4762
> URL: https://issues.apache.org/jira/browse/YARN-4762
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Sidharta Seethana
>Priority: Blocker
> Fix For: 2.9.0
>
> Attachments: YARN-4762.001.patch, YARN-4762.002.patch
>
>
> Seeing this exception and the NMs crash.
> {code}
> 2016-03-03 16:47:57,807 DEBUG org.apache.hadoop.service.AbstractService: 
> Service 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService 
> is started
> 2016-03-03 16:47:58,027 DEBUG 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: 
> checkLinuxExecutorSetup: 
> [/hadoop/hadoop-yarn-nodemanager/bin/container-executor, --checksetup]
> 2016-03-03 16:47:58,043 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl:
>  Mount point Based on mtab file: /proc/mounts. Controller mount point not 
> writable for: cpu
> 2016-03-03 16:47:58,043 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime:
>  Unable to get cgroups handle.
> 2016-03-03 16:47:58,044 DEBUG org.apache.hadoop.service.AbstractService: 
> noteFailure org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to 
> initialize container executor
> 2016-03-03 16:47:58,044 INFO org.apache.hadoop.service.AbstractService: 
> Service NodeManager failed in state INITED; cause: 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize 
> container executor
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize 
> container executor
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:240)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:539)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:587)
> Caused by: java.io.IOException: Failed to initialize linux container 
> runtime(s)!
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:207)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:238)
> ... 3 more
> 2016-03-03 16:47:58,047 DEBUG org.apache.hadoop.service.AbstractService: 
> Service: NodeManager entered state STOPPED
> 2016-03-03 16:47:58,047 DEBUG org.apache.hadoop.service.CompositeService: 
> NodeManager: stopping services, size=0
> 2016-03-03 16:47:58,047 DEBUG org.apache.hadoop.service.AbstractService: 
> Service: 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService 
> entered state STOPPED
> 2016-03-03 16:47:58,047 FATAL 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting 
> NodeManager
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize 
> container executor
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:240)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:539)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:587)
> Caused by: java.io.IOException: Failed to initialize 

[jira] [Commented] (YARN-4762) NMs failing on DelegatingLinuxContainerRuntime init with LCE on

2016-03-07 Thread Sidharta Seethana (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183592#comment-15183592
 ] 

Sidharta Seethana commented on YARN-4762:
-

Thanks, [~vinodkv] !

> NMs failing on DelegatingLinuxContainerRuntime init with LCE on
> ---
>
> Key: YARN-4762
> URL: https://issues.apache.org/jira/browse/YARN-4762
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Sidharta Seethana
>Priority: Blocker
> Fix For: 2.9.0
>
> Attachments: YARN-4762.001.patch, YARN-4762.002.patch
>
>
> Seeing this exception and the NMs crash.
> {code}
> 2016-03-03 16:47:57,807 DEBUG org.apache.hadoop.service.AbstractService: 
> Service 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService 
> is started
> 2016-03-03 16:47:58,027 DEBUG 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: 
> checkLinuxExecutorSetup: 
> [/hadoop/hadoop-yarn-nodemanager/bin/container-executor, --checksetup]
> 2016-03-03 16:47:58,043 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl:
>  Mount point Based on mtab file: /proc/mounts. Controller mount point not 
> writable for: cpu
> 2016-03-03 16:47:58,043 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime:
>  Unable to get cgroups handle.
> 2016-03-03 16:47:58,044 DEBUG org.apache.hadoop.service.AbstractService: 
> noteFailure org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to 
> initialize container executor
> 2016-03-03 16:47:58,044 INFO org.apache.hadoop.service.AbstractService: 
> Service NodeManager failed in state INITED; cause: 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize 
> container executor
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize 
> container executor
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:240)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:539)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:587)
> Caused by: java.io.IOException: Failed to initialize linux container 
> runtime(s)!
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:207)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:238)
> ... 3 more
> 2016-03-03 16:47:58,047 DEBUG org.apache.hadoop.service.AbstractService: 
> Service: NodeManager entered state STOPPED
> 2016-03-03 16:47:58,047 DEBUG org.apache.hadoop.service.CompositeService: 
> NodeManager: stopping services, size=0
> 2016-03-03 16:47:58,047 DEBUG org.apache.hadoop.service.AbstractService: 
> Service: 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService 
> entered state STOPPED
> 2016-03-03 16:47:58,047 FATAL 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting 
> NodeManager
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize 
> container executor
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:240)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:539)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:587)
> Caused by: java.io.IOException: Failed to initialize linux container 
> runtime(s)!
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:207)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:238)
> ... 3 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4764) Application submission fails when submitted queue is not available in scheduler xml

2016-03-07 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183545#comment-15183545
 ] 

Jian He commented on YARN-4764:
---

bq. Is it better if we throw exception for CS in 
RMAppManager#createAndPopulateRMApp itself?
I think it is fine to fail the createAndPopulateRMApp call itself.

> Application submission fails when submitted queue is not available in 
> scheduler xml
> ---
>
> Key: YARN-4764
> URL: https://issues.apache.org/jira/browse/YARN-4764
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4764.patch, 0002-YARN-4764.patch
>
>
> Available queues in capacity scheduler 
> -root
> --queue1
> --queue2
> Submit application with queue3
> {noformat}
> 16/03/04 16:40:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: 
> job_1457077554812_1901
> 16/03/04 16:40:08 INFO mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN, 
> Service: ha-hdfs:hacluster, Ident: (HDFS_DELEGATION_TOKEN token 3938 for 
> mapred with renewer yarn)
> 16/03/04 16:40:08 WARN retry.RetryInvocationHandler: Exception while invoking 
> class 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.submitApplication
>  over rm2. Not retrying because try once and fail.
> java.lang.NullPointerException: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:366)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:289)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:618)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:252)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:483)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2305)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2301)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1742)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2301)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateRuntimeException(RPCUtil.java:85)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:122)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.submitApplication(ApplicationClientProtocolPBClientImpl.java:272)
> {noformat}
> Should be queue doesnt exist



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4761) NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations on fair scheduler

2016-03-07 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183466#comment-15183466
 ] 

Sangjin Lee commented on YARN-4761:
---

Thanks [~zxu]!

> NMs reconnecting with changed capabilities can lead to wrong cluster resource 
> calculations on fair scheduler
> 
>
> Key: YARN-4761
> URL: https://issues.apache.org/jira/browse/YARN-4761
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.6.4
>Reporter: Sangjin Lee
>Assignee: Sangjin Lee
> Fix For: 2.8.0, 2.7.3, 2.6.5
>
> Attachments: YARN-4761.01.patch, YARN-4761.02.patch
>
>
> YARN-3802 uncovered an issue with the scheduler where the resource 
> calculation can be incorrect due to async event handling. It was subsequently 
> fixed by YARN-4344, but it was never fixed for the fair scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4762) NMs failing on DelegatingLinuxContainerRuntime init with LCE on

2016-03-07 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183459#comment-15183459
 ] 

Vinod Kumar Vavilapalli commented on YARN-4762:
---

The latest patch looks good to me, +1. Checking this in.

> NMs failing on DelegatingLinuxContainerRuntime init with LCE on
> ---
>
> Key: YARN-4762
> URL: https://issues.apache.org/jira/browse/YARN-4762
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Sidharta Seethana
>Priority: Blocker
> Attachments: YARN-4762.001.patch, YARN-4762.002.patch
>
>
> Seeing this exception and the NMs crash.
> {code}
> 2016-03-03 16:47:57,807 DEBUG org.apache.hadoop.service.AbstractService: 
> Service 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService 
> is started
> 2016-03-03 16:47:58,027 DEBUG 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: 
> checkLinuxExecutorSetup: 
> [/hadoop/hadoop-yarn-nodemanager/bin/container-executor, --checksetup]
> 2016-03-03 16:47:58,043 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl:
>  Mount point Based on mtab file: /proc/mounts. Controller mount point not 
> writable for: cpu
> 2016-03-03 16:47:58,043 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime:
>  Unable to get cgroups handle.
> 2016-03-03 16:47:58,044 DEBUG org.apache.hadoop.service.AbstractService: 
> noteFailure org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to 
> initialize container executor
> 2016-03-03 16:47:58,044 INFO org.apache.hadoop.service.AbstractService: 
> Service NodeManager failed in state INITED; cause: 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize 
> container executor
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize 
> container executor
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:240)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:539)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:587)
> Caused by: java.io.IOException: Failed to initialize linux container 
> runtime(s)!
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:207)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:238)
> ... 3 more
> 2016-03-03 16:47:58,047 DEBUG org.apache.hadoop.service.AbstractService: 
> Service: NodeManager entered state STOPPED
> 2016-03-03 16:47:58,047 DEBUG org.apache.hadoop.service.CompositeService: 
> NodeManager: stopping services, size=0
> 2016-03-03 16:47:58,047 DEBUG org.apache.hadoop.service.AbstractService: 
> Service: 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService 
> entered state STOPPED
> 2016-03-03 16:47:58,047 FATAL 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting 
> NodeManager
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize 
> container executor
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:240)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:539)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:587)
> Caused by: java.io.IOException: Failed to initialize linux container 
> runtime(s)!
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:207)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:238)
> ... 3 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4687) Document Reservation ACLs

2016-03-07 Thread Arun Suresh (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183432#comment-15183432
 ] 

Arun Suresh commented on YARN-4687:
---

Thanks for the patch [~seanpo03]. A couple of nits:

* In the Capacity Scheduler section, give an example of 
* Don't think you need to make *ACL* bold
* In the Fair Scheduler section, replace *Anybody who may administer a queue 
may also submit* with *An administrator may also submit*
* In the Fair Scheduler section, replace *An action on a queue will be 
permitted if its user or group is in the ACL of that queue* with *Actions on a 
queue are permitted if the user/group is a member of the queue ACL*

> Document Reservation ACLs
> -
>
> Key: YARN-4687
> URL: https://issues.apache.org/jira/browse/YARN-4687
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Sean Po
>Assignee: Sean Po
>Priority: Minor
> Attachments: YARN-4687.v1.patch
>
>
> YARN-2575 introduces ACLs for ReservationSystem. This JIRA is for adding 
> documentation on how to configure the ACLs for Capacity/Fair schedulers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4696) EntityGroupFSTimelineStore to work in the absence of an RM

2016-03-07 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183401#comment-15183401
 ] 

Steve Loughran commented on YARN-4696:
--

Ignore the comment about numbers for now; it's something publish-side that I need 
to analyse more.

> EntityGroupFSTimelineStore to work in the absence of an RM
> --
>
> Key: YARN-4696
> URL: https://issues.apache.org/jira/browse/YARN-4696
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
> Attachments: YARN-4696-001.patch, YARN-4696-002.patch, 
> YARN-4696-003.patch, YARN-4696-005.patch, YARN-4696-006.patch, 
> YARN-4696-007.patch, YARN-4696-008.patch, YARN-4696-009.patch, 
> YARN-4696-010.patch, YARN-4696-012.patch
>
>
> {{EntityGroupFSTimelineStore}} now depends on an RM being up and running; the 
> configuration pointing to it. This is a new change, and impacts testing where 
> you have historically been able to test without an RM running.
> The sole purpose of the probe is to automatically determine if an app is 
> running; it falls back to "unknown" if not. If the RM connection was 
> optional, the "unknown" codepath could be called directly, relying on age of 
> file as a metric of completion
> Options
> # add a flag to disable RM connect
> # skip automatically if RM not defined/set to 0.0.0.0
> # disable retries on yarn client IPC; if it fails, tag app as unknown.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4764) Application submission fails when submitted queue is not available in scheduler xml

2016-03-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183380#comment-15183380
 ] 

Hadoop QA commented on YARN-4764:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 12m 38s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 
40s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s 
{color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
19s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 33s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
15s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 6s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s 
{color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
30s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 27s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
15s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 31s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
12s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
14s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 68m 18s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_74. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 68m 0s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
18s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 165m 0s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_74 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| JDK v1.7.0_95 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:0ca8df7 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12791775/0002-YARN-4764.patch |
| JIRA Issue | YARN-4764 |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  findbugs  checkstyle  |
| 

[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928

2016-03-07 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183343#comment-15183343
 ] 

Varun Saxena commented on YARN-4712:


Thanks [~Naganarasimha] for the patch.
bq. IMO cpuUsageTotalCoresPercentage is important to gauge how much of the 
cluster's CPU is getting utlized, if its cpuUsagePercentPerCore i beleive it 
doesnt give the cluster's CPU on aggregation from all containers. Infact we 
need to report both and also IMO cpuUsageTotalCoresPercentage is not calculated 
properly it should be
In ContainersMonitorImpl, we calculate CPU per container process.
There are 2 primary CPU values here.
{{cpuUsagePercentPerCore}} is similar to what we see in top command output, i.e. 
if we have a 4-core machine and a specific process is using 3 of the cores, we 
will see CPU% as 300%.
On Linux this value is calculated by reading {{/proc//stat}}, from which we read 
the amount of time spent (in terms of CPU jiffies) in kernel and user space by 
the process. The effective CPU% is then derived from the sample values read 
earlier and the ones read now.
{{cpuUsageTotalCoresPercentage}}, on the other hand, is a normalized value based 
on the number of processors configured for the node.
{{nodeCpuPercentageForYARN}} is a config which places an upper limit on the CPU 
to be used by containers. IIUC this config has to be used with cgroups (although 
it is not restricted in code).
This config is factored in while reporting CPU resource utilization in the node 
heartbeat to the RM.
In a heterogeneous cluster, all these 3 values may be useful. Thoughts?
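
As a hedged sketch only (names mirror the discussion above but this is not the 
actual ContainersMonitorImpl code), the relation between the two values and the 
UNAVAILABLE guard the description asks for could look like this:

{code}
public final class CpuNormalizationSketch {
  // mirrors ResourceCalculatorProcessTree.UNAVAILABLE, i.e. -1
  static final float UNAVAILABLE = -1f;

  static float totalCoresPercentage(float cpuUsagePercentPerCore, int numProcessors) {
    if (cpuUsagePercentPerCore == UNAVAILABLE || numProcessors <= 0) {
      return UNAVAILABLE;   // don't publish a bogus normalized value
    }
    // e.g. a process burning 3 of 4 cores reports 300% (top-style),
    // which normalizes to 75% of the node's total CPU.
    return cpuUsagePercentPerCore / numProcessors;
  }

  public static void main(String[] args) {
    System.out.println(totalCoresPercentage(300f, 4));          // 75.0
    System.out.println(totalCoresPercentage(UNAVAILABLE, 4));   // -1.0
  }
}
{code}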

> CPU Usage Metric is not captured properly in YARN-2928
> --
>
> Key: YARN-4712
> URL: https://issues.apache.org/jira/browse/YARN-4712
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-4712-YARN-2928.v1.001.patch, 
> YARN-4712-YARN-2928.v1.002.patch, YARN-4712-YARN-2928.v1.003.patch, 
> YARN-4712-YARN-2928.v1.004.patch
>
>
> There are 2 issues with CPU usage collection 
> * I was able to observe that that many times CPU usage got from 
> {{pTree.getCpuUsagePercent()}} is 
> ResourceCalculatorProcessTree.UNAVAILABLE(i.e. -1) but ContainersMonitor do 
> the calculation  i.e. {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore 
> /resourceCalculatorPlugin.getNumProcessors()}} because of which UNAVAILABLE 
> check in {{NMTimelinePublisher.reportContainerResourceUsage}} is not 
> encountered. so proper checks needs to be handled
> * {{EntityColumnPrefix.METRIC}} uses always LongConverter but 
> ContainerMonitor is publishing decimal values for the CPU usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4719) Add a helper library to maintain node state and allows common queries

2016-03-07 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183328#comment-15183328
 ] 

Karthik Kambatla commented on YARN-4719:


Oh, and when the NodeFilter is null or an "IdentityFilter" (that always returns 
true), we could choose to skip iteration and just populate the list with the 
keySet(). That way, we don't need a separate method to return all nodes. 
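
A minimal sketch of that shortcut, assuming the {{filter(NodeFilter)}} shape 
discussed earlier in this thread and a {{Map<NodeId, N> nodes}} field guarded by 
a read lock (names are illustrative):

{code}
  public List<NodeId> filter(NodeFilter nodeFilter) {
    readLock.lock();
    try {
      if (nodeFilter == null) {
        // Nothing to filter on: copy the key set without visiting each node.
        return new ArrayList<>(nodes.keySet());
      }
      List<NodeId> nodeIds = new ArrayList<>();
      for (N node : nodes.values()) {
        if (nodeFilter.accept(node)) {
          nodeIds.add(node.getNodeID());
        }
      }
      return nodeIds;
    } finally {
      readLock.unlock();
    }
  }
{code}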

> Add a helper library to maintain node state and allows common queries
> -
>
> Key: YARN-4719
> URL: https://issues.apache.org/jira/browse/YARN-4719
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
> Attachments: yarn-4719-1.patch, yarn-4719-2.patch, yarn-4719-3.patch
>
>
> The scheduler could use a helper library to maintain node state and allowing 
> matching/sorting queries. Several reasons for this:
> # Today, a lot of the node state management is done separately in each 
> scheduler. Having a single library will take us that much closer to reducing 
> duplication among schedulers.
> # Adding a filtering/matching API would simplify node labels and locality 
> significantly. 
> # An API that returns a sorted list for a custom comparator would help 
> YARN-1011 where we want to sort by allocation and utilization for 
> continuous/asynchronous and opportunistic scheduling respectively. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4719) Add a helper library to maintain node state and allows common queries

2016-03-07 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183324#comment-15183324
 ] 

Karthik Kambatla commented on YARN-4719:


bq. I may not understand about this, could you elaborate?
When we add or remove a node, a few things are updated:
# The map/set holding these nodes
# Total cluster capacity
# Total inflated cluster capacity (that takes allocated but not utilized 
resources into account)
# Largest container that could potentially be allocated

As long as it doesn't interfere with performance, I would like to keep all of 
this information consistent. That is, if one thread has added a node, another 
thread shouldn't see a stale value for cluster capacity. Ensuring this 
consistency requires us to hold write locks when adding or removing a node, and 
read locks when accessing any of this information. Given that the current code 
has been ensuring consistency through a lock on the scheduler (at least 
FairScheduler), I doubt adding these read/write locks would lead to performance 
issues. I haven't done any benchmarking though. 

Do you agree with the approach of consistency-over-performance to begin with? 
If yes, we need the locks. And, making the HashMap a ConcurrentHashMap wouldn't 
buy us much. 
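
For concreteness, a hedged sketch of that locking idea (class, field, and method 
names are illustrative; this is not the actual ClusterNodeTracker):

{code}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

import org.apache.hadoop.yarn.api.records.NodeId;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode;
import org.apache.hadoop.yarn.util.resource.Resources;

class NodeTrackerSketch<N extends SchedulerNode> {
  private final ReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<NodeId, N> nodes = new HashMap<>();
  private final Resource clusterCapacity = Resources.createResource(0, 0);

  public void addNode(N node) {
    lock.writeLock().lock();
    try {
      // The map and the derived total are updated under the same write lock,
      // so readers never see one without the other.
      nodes.put(node.getNodeID(), node);
      Resources.addTo(clusterCapacity, node.getTotalResource());
    } finally {
      lock.writeLock().unlock();
    }
  }

  public Resource getClusterCapacity() {
    lock.readLock().lock();
    try {
      return Resources.clone(clusterCapacity);
    } finally {
      lock.readLock().unlock();
    }
  }
}
{code}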

Coming to iterating over the nodes, I agree with your concern that we might 
proliferate ClusterNodeTracker with methods like addBlacklistedNodeIdsToList. 
And I understand your point that using a ConcurrentHashMap would allow us to 
iterate freely over the snapshot data without exposing the internals. That said, 
given that there are likely only a handful of cases where iterating through all 
the nodes is necessary, and that one of the goals for this library is to help with 
preemption, node-labeling, and over-subscription, what do you think of the 
following construct:
{code}
  public List<NodeId> filter(NodeFilter nodeFilter) {
    List<NodeId> nodeIds = new ArrayList<>();
    readLock.lock();
    try {
      for (N node : nodes.values()) {
        if (nodeFilter.accept(node)) {
          nodeIds.add(node.getNodeID());
        }
      }
    } finally {
      readLock.unlock();
    }
    return nodeIds;
  }
{code}

{{addBlackListedNodeIdsToList}} can implement a trivial {{NodeFilter.accept}} 
as {{SchedulerAppUtils.isBlacklisted(app, node, LOG)}}. We have used this 
approach to great effect with PathFilter. 
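
For instance, the blacklist case could then be expressed as a one-off filter at 
the call site (hedged sketch; {{nodeTracker}}, {{app}} and {{LOG}} are assumed to 
be in scope):

{code}
    // All nodes this application has blacklisted, via the generic filter() above.
    List<NodeId> blacklistedNodeIds = nodeTracker.filter(new NodeFilter() {
      @Override
      public boolean accept(SchedulerNode node) {
        return SchedulerAppUtils.isBlacklisted(app, node, LOG);
      }
    });
{code}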

Thoughts? 

> Add a helper library to maintain node state and allows common queries
> -
>
> Key: YARN-4719
> URL: https://issues.apache.org/jira/browse/YARN-4719
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
> Attachments: yarn-4719-1.patch, yarn-4719-2.patch, yarn-4719-3.patch
>
>
> The scheduler could use a helper library to maintain node state and allowing 
> matching/sorting queries. Several reasons for this:
> # Today, a lot of the node state management is done separately in each 
> scheduler. Having a single library will take us that much closer to reducing 
> duplication among schedulers.
> # Adding a filtering/matching API would simplify node labels and locality 
> significantly. 
> # An API that returns a sorted list for a custom comparator would help 
> YARN-1011 where we want to sort by allocation and utilization for 
> continuous/asynchronous and opportunistic scheduling respectively. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4764) Application submission fails when submitted queue is not available in scheduler xml

2016-03-07 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183256#comment-15183256
 ] 

Sunil G commented on YARN-4764:
---

[~bibinchundatt] and [~jianhe]
Is it better if we throw an exception for CS in 
RMAppManager#createAndPopulateRMApp itself? Or are there other advantages to 
considering it as a Failed Application? Maybe an audit log can help in tracking 
such failures?

> Application submission fails when submitted queue is not available in 
> scheduler xml
> ---
>
> Key: YARN-4764
> URL: https://issues.apache.org/jira/browse/YARN-4764
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4764.patch, 0002-YARN-4764.patch
>
>
> Available queues in capacity scheduler 
> -root
> --queue1
> --queue2
> Submit application with queue3
> {noformat}
> 16/03/04 16:40:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: 
> job_1457077554812_1901
> 16/03/04 16:40:08 INFO mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN, 
> Service: ha-hdfs:hacluster, Ident: (HDFS_DELEGATION_TOKEN token 3938 for 
> mapred with renewer yarn)
> 16/03/04 16:40:08 WARN retry.RetryInvocationHandler: Exception while invoking 
> class 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.submitApplication
>  over rm2. Not retrying because try once and fail.
> java.lang.NullPointerException: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:366)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:289)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:618)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:252)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:483)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2305)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2301)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1742)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2301)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateRuntimeException(RPCUtil.java:85)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:122)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.submitApplication(ApplicationClientProtocolPBClientImpl.java:272)
> {noformat}
> Should be queue doesnt exist



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4696) EntityGroupFSTimelineStore to work in the absence of an RM

2016-03-07 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183235#comment-15183235
 ] 

Steve Loughran commented on YARN-4696:
--

I should add that now that I've started some scale tests, my tests can't do a 
{{GET entity_type/appemptID}} and get more than about 750 events back. Is there 
an official upper limit on the number of events a single attempt can create? Or 
is this a limit in the web API? If it's the latter, is there a way we can do 
windowing here? Otherwise you can't store large histories, even with the new 
store.

> EntityGroupFSTimelineStore to work in the absence of an RM
> --
>
> Key: YARN-4696
> URL: https://issues.apache.org/jira/browse/YARN-4696
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
> Attachments: YARN-4696-001.patch, YARN-4696-002.patch, 
> YARN-4696-003.patch, YARN-4696-005.patch, YARN-4696-006.patch, 
> YARN-4696-007.patch, YARN-4696-008.patch, YARN-4696-009.patch, 
> YARN-4696-010.patch, YARN-4696-012.patch
>
>
> {{EntityGroupFSTimelineStore}} now depends on an RM being up and running; the 
> configuration pointing to it. This is a new change, and impacts testing where 
> you have historically been able to test without an RM running.
> The sole purpose of the probe is to automatically determine if an app is 
> running; it falls back to "unknown" if not. If the RM connection was 
> optional, the "unknown" codepath could be called directly, relying on age of 
> file as a metric of completion
> Options
> # add a flag to disable RM connect
> # skip automatically if RM not defined/set to 0.0.0.0
> # disable retries on yarn client IPC; if it fails, tag app as unknown.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4760) proxy redirect to history server uses wrong URL

2016-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183186#comment-15183186
 ] 

Hudson commented on YARN-4760:
--

FAILURE: Integrated in Hadoop-trunk-Commit #9435 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/9435/])
YARN-4760. proxy redirect to history server uses wrong URL. Contributed (jlowe: 
rev 4163e36c2be2f562545aba93c1d47643a9ff4741)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/test/java/org/apache/hadoop/yarn/server/webproxy/TestWebAppProxyServlet.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/main/java/org/apache/hadoop/yarn/server/webproxy/WebAppProxyServlet.java


> proxy redirect to history server uses wrong URL
> ---
>
> Key: YARN-4760
> URL: https://issues.apache.org/jira/browse/YARN-4760
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Affects Versions: 2.7.2
>Reporter: Jason Lowe
>Assignee: Eric Badger
> Fix For: 2.7.3
>
> Attachments: YARN-4760.001.patch
>
>
> YARN-3975 added the ability to redirect to the history server when an app 
> fails to specify a tracking URL and the RM has since forgotten about the 
> application.  However it redirects to /apps/ instead of /app/ 
> which is the wrong destination page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4760) proxy redirect to history server uses wrong URL

2016-03-07 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183169#comment-15183169
 ] 

Jason Lowe commented on YARN-4760:
--

+1 lgtm.  Committing this.

> proxy redirect to history server uses wrong URL
> ---
>
> Key: YARN-4760
> URL: https://issues.apache.org/jira/browse/YARN-4760
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Affects Versions: 2.7.2
>Reporter: Jason Lowe
>Assignee: Eric Badger
> Attachments: YARN-4760.001.patch
>
>
> YARN-3975 added the ability to redirect to the history server when an app 
> fails to specify a tracking URL and the RM has since forgotten about the 
> application.  However it redirects to /apps/ instead of /app/ 
> which is the wrong destination page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4764) Application submission fails when submitted queue is not available in scheduler xml

2016-03-07 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4764:
---
Attachment: 0002-YARN-4764.patch

Attaching patch with a test case.
Skipping the ACL check when {{csqueue==null}} so that the behaviour remains the 
same as before.
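
Roughly, the idea is the following (a sketch only, not the actual patch; 
{{scheduler}}, {{userUgi}} and {{queueName}} are assumed to be in scope):

{code}
    CSQueue csqueue = ((CapacityScheduler) scheduler).getQueue(queueName);
    // Only enforce the ACL when the queue actually exists; a null queue falls
    // through so the submission fails later with "queue does not exist"
    // instead of an NPE.
    if (csqueue != null
        && !csqueue.hasAccess(QueueACL.SUBMIT_APPLICATIONS, userUgi)
        && !csqueue.hasAccess(QueueACL.ADMINISTER_QUEUE, userUgi)) {
      throw new AccessControlException("User " + userUgi.getShortUserName()
          + " cannot submit applications to queue " + queueName);
    }
{code}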

> Application submission fails when submitted queue is not available in 
> scheduler xml
> ---
>
> Key: YARN-4764
> URL: https://issues.apache.org/jira/browse/YARN-4764
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4764.patch, 0002-YARN-4764.patch
>
>
> Available queues in capacity scheduler 
> -root
> --queue1
> --queue2
> Submit application with queue3
> {noformat}
> 16/03/04 16:40:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: 
> job_1457077554812_1901
> 16/03/04 16:40:08 INFO mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN, 
> Service: ha-hdfs:hacluster, Ident: (HDFS_DELEGATION_TOKEN token 3938 for 
> mapred with renewer yarn)
> 16/03/04 16:40:08 WARN retry.RetryInvocationHandler: Exception while invoking 
> class 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.submitApplication
>  over rm2. Not retrying because try once and fail.
> java.lang.NullPointerException: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:366)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:289)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:618)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:252)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:483)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2305)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2301)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1742)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2301)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateRuntimeException(RPCUtil.java:85)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:122)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.submitApplication(ApplicationClientProtocolPBClientImpl.java:272)
> {noformat}
> Should be queue doesnt exist



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4696) EntityGroupFSTimelineStore to work in the absence of an RM

2016-03-07 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated YARN-4696:
-
Attachment: YARN-4696-012.patch

Patch 012. This is what I've been successfully using in tests: giving up on trying 
to have incomplete apps when the FS is LocalFileSystem, and instead using a 
MiniHDFSCluster for those test cases. The tests all work.

This is ready for review and, ideally, getting into 2.8.

> EntityGroupFSTimelineStore to work in the absence of an RM
> --
>
> Key: YARN-4696
> URL: https://issues.apache.org/jira/browse/YARN-4696
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
> Attachments: YARN-4696-001.patch, YARN-4696-002.patch, 
> YARN-4696-003.patch, YARN-4696-005.patch, YARN-4696-006.patch, 
> YARN-4696-007.patch, YARN-4696-008.patch, YARN-4696-009.patch, 
> YARN-4696-010.patch, YARN-4696-012.patch
>
>
> {{EntityGroupFSTimelineStore}} now depends on an RM being up and running; the 
> configuration pointing to it. This is a new change, and impacts testing where 
> you have historically been able to test without an RM running.
> The sole purpose of the probe is to automatically determine if an app is 
> running; it falls back to "unknown" if not. If the RM connection was 
> optional, the "unknown" codepath could be called directly, relying on age of 
> file as a metric of completion
> Options
> # add a flag to disable RM connect
> # skip automatically if RM not defined/set to 0.0.0.0
> # disable retries on yarn client IPC; if it fails, tag app as unknown.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4245) Clean up container-executor invocation interface

2016-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182884#comment-15182884
 ] 

Hudson commented on YARN-4245:
--

FAILURE: Integrated in Hadoop-trunk-Commit #9432 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/9432/])
YARN-4245. Generalize config file handling in container-executor. (vvasudev: 
rev 8ed2e060e80c0def3fcb7604e0bd27c1c24d291e)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/test-container-executor.c
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/configuration.c
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.h
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/configuration.h


> Clean up container-executor invocation interface
> 
>
> Key: YARN-4245
> URL: https://issues.apache.org/jira/browse/YARN-4245
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.0.0, 2.8.0
>Reporter: Sidharta Seethana
>Assignee: Sidharta Seethana
>
> The current container-executor invocation interface (especially for launch 
> container) is cumbersome to use . Launching a container now requires 13-15 
> arguments.  This becomes especially problematic when additional, potentially 
> optional, arguments are required. We need a better mechanism to deal with 
> this. One such mechanism could be to handle this could be to use a file 
> containing key/value pairs (similar to container-executor.cfg) corresponding 
> to the arguments each invocation needs. Such a mechanism would make it easier 
> to add new optional arguments to container-executor and better manage 
> existing ones. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4737) Add CSRF filter support in YARN

2016-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182827#comment-15182827
 ] 

Hudson commented on YARN-4737:
--

FAILURE: Integrated in Hadoop-trunk-Commit #9431 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/9431/])
YARN-4737. Add CSRF filter support in YARN. Contributed by Jonathan (vvasudev: 
rev e51a8c10560e5db5cf01fd530af48825cb51c9ea)
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/HistoryClientService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/WebServer.java
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapreduce/v2/jobhistory/JHAdminConfig.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/webapp/TestRMWithCSRFFilter.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryServer.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/WebApps.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml


> Add CSRF filter support in YARN
> ---
>
> Key: YARN-4737
> URL: https://issues.apache.org/jira/browse/YARN-4737
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, resourcemanager, webapp
>Reporter: Jonathan Maron
>Assignee: Jonathan Maron
> Fix For: 2.9.0
>
> Attachments: YARN-4737.001.patch, YARN-4737.002.patch, 
> YARN-4737.003.patch, YARN-4737.004.patch
>
>
> A CSRF filter was added to hadoop common 
> (https://issues.apache.org/jira/browse/HADOOP-12691).  The aim of this JIRA 
> is to come up with a mechanism to integrate this filter into the webapps for 
> which it is applicable (web apps that may establish an authenticated 
> identity).  That includes the RM, NM, and mapreduce jobhistory web app.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4769) Add support for CSRF header in the dump capacity scheduler logs and kill app buttons in RM web UI

2016-03-07 Thread Varun Vasudev (JIRA)
Varun Vasudev created YARN-4769:
---

 Summary: Add support for CSRF header in the dump capacity 
scheduler logs and kill app buttons in RM web UI
 Key: YARN-4769
 URL: https://issues.apache.org/jira/browse/YARN-4769
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Varun Vasudev


YARN-4737 adds support for CSRF filters in YARN. If the CSRF filter is enabled, 
the current functionality to dump the capacity scheduler logs and kill an app 
from the RM web UI will not work due to the missing CSRF header.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4737) Add CSRF filter support in YARN

2016-03-07 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-4737:

Summary: Add CSRF filter support in YARN  (was: Use CSRF Filter in YARN)

> Add CSRF filter support in YARN
> ---
>
> Key: YARN-4737
> URL: https://issues.apache.org/jira/browse/YARN-4737
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, resourcemanager, webapp
>Reporter: Jonathan Maron
>Assignee: Jonathan Maron
> Attachments: YARN-4737.001.patch, YARN-4737.002.patch, 
> YARN-4737.003.patch, YARN-4737.004.patch
>
>
> A CSRF filter was added to hadoop common 
> (https://issues.apache.org/jira/browse/HADOOP-12691).  The aim of this JIRA 
> is to come up with a mechanism to integrate this filter into the webapps for 
> which it is applicable (web apps that may establish an authenticated 
> identity).  That includes the RM, NM, and mapreduce jobhistory web app.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4721) RM to try to auth with HDFS on startup, retry with max diagnostics on failure

2016-03-07 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182810#comment-15182810
 ] 

Steve Loughran commented on YARN-4721:
--

This patch, initially, sets up Kerberos diagnostics without side effects.

I'd also like to do, after this, an {{ls /}} of the filesystem. Maybe just 
make this another option that runs if yarn.resourcemanager.kdiag.enabled=true. 
Against a kerberized FS this would trigger fast negotiation and, if there are 
problems, report them. (This would have to be done async, obviously.)

The problem with the current "talk during token renewal" process is that 
if any problem surfaces, it doesn't surface until someone submits work. It then 
surfaces as "job submit failed", rather than the more fundamental "your RM 
doesn't have the credentials to talk to HDFS".

This would not be a hard-coded binding to HDFS; purely a check that the cluster 
FS is readable by the RM principal.
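
A minimal sketch of such a probe, under the assumption that it is wired in behind 
an RM-side switch (the class name and placement are illustrative):

{code}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hedged sketch: check early that the RM principal can read the cluster FS. */
final class ClusterFsProbe {
  private ClusterFsProbe() {}

  static void probe(Configuration conf) throws IOException {
    FileSystem fs = FileSystem.get(conf);   // whatever fs.defaultFS points at
    // Listing / forces authentication against a kerberized filesystem, so
    // credential problems surface at RM start rather than at job submission.
    fs.listStatus(new Path("/"));
  }
}
{code}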

> RM to try to auth with HDFS on startup, retry with max diagnostics on failure
> -
>
> Key: YARN-4721
> URL: https://issues.apache.org/jira/browse/YARN-4721
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
> Attachments: HADOOP-12889-001.patch
>
>
> If the RM can't auth with HDFS, this can first surface during job submission, 
> which can cause confusion about what's wrong and whose credentials are 
> playing up.
> Instead, the RM could try to talk to HDFS on launch, {{ls /}} should suffice. 
> If it can't auth, it can then tell UGI to log more and retry.
> I don't know what the policy should be if the RM can't auth to HDFS at this 
> point. Certainly it can't currently accept work. But should it fail fast or 
> keep going in the hope that the problem is in the KDC or NN and will fix 
> itself without an RM restart?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)