[jira] [Updated] (YARN-2716) Refactor ZKRMStateStore retry code with Apache Curator

2015-06-01 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2716:
---
Issue Type: Improvement  (was: Bug)

 Refactor ZKRMStateStore retry code with Apache Curator
 --

 Key: YARN-2716
 URL: https://issues.apache.org/jira/browse/YARN-2716
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Jian He
Assignee: Karthik Kambatla
 Attachments: yarn-2716-1.patch, yarn-2716-prelim.patch, 
 yarn-2716-prelim.patch, yarn-2716-super-prelim.patch


 Per suggestion by [~kasha] in YARN-2131, it would be nice to use Curator to 
 simplify the retry logic in ZKRMStateStore.
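
For anyone skimming, a minimal sketch of the Curator-based retry setup this proposes (the connection string, retry values, and paths below are illustrative assumptions, not the attached patch):

{code}
// Hedged sketch (not the attached patch): Curator wraps every ZooKeeper call in
// the configured retry policy, which is what would let ZKRMStateStore drop its
// hand-rolled retry loops.
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CuratorRetrySketch {
  public static void main(String[] args) throws Exception {
    CuratorFramework client = CuratorFrameworkFactory.builder()
        .connectString("localhost:2181")                    // e.g. yarn.resourcemanager.zk-address
        .retryPolicy(new ExponentialBackoffRetry(1000, 3))  // 1s base sleep, 3 retries
        .build();
    client.start();
    // Retried transparently on recoverable ZooKeeper errors:
    client.create().creatingParentsIfNeeded()
        .forPath("/rmstore/sketch", "state".getBytes("UTF-8"));
    client.close();
  }
}
{code}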



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2716) Refactor ZKRMStateStore retry code with Apache Curator

2015-06-01 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2716:
---
Attachment: yarn-2716-1.patch

Fixed TestZKRMStateStoreZKClientConnections as well. I believe the v1 patch 
could use some more eyes. Appreciate any feedback. 

 Refactor ZKRMStateStore retry code with Apache Curator
 --

 Key: YARN-2716
 URL: https://issues.apache.org/jira/browse/YARN-2716
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Karthik Kambatla
 Attachments: yarn-2716-1.patch, yarn-2716-prelim.patch, 
 yarn-2716-prelim.patch, yarn-2716-super-prelim.patch


 Per suggestion by [~kasha] in YARN-2131, it would be nice to use Curator to 
 simplify the retry logic in ZKRMStateStore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-06-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566993#comment-14566993
 ] 

Rohith commented on YARN-3585:
--

This is a race condition between NodeManager shutdown and container launch. By 
the time the container is launched and control returns to ContainerImpl, the 
NodeManager has already closed the DB connection, which results in 
{{org.iq80.leveldb.DBException: Closed}}.

 NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
 --

 Key: YARN-3585
 URL: https://issues.apache.org/jira/browse/YARN-3585
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang
Assignee: Rohith
Priority: Critical
 Attachments: YARN-3585.patch


 With NM recovery enabled, after decommission the nodemanager log shows the stop 
 but the process cannot end. 
 Non-daemon threads:
 {noformat}
 DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
 condition [0x]
 leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
 [0x]
 VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
 Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 
 nid=0x29ed runnable 
 Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 
 nid=0x29ee runnable 
 Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 
 nid=0x29ef runnable 
 Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 
 nid=0x29f0 runnable 
 Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 
 nid=0x29f1 runnable 
 Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 
 nid=0x29f2 runnable 
 Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 
 nid=0x29f3 runnable 
 Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 
 nid=0x29f4 runnable 
 Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 
 runnable 
 Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 
 nid=0x29f5 runnable 
 Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 
 nid=0x29f6 runnable 
 VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
 on condition 
 {noformat}
 and jni leveldb thread stack
 {noformat}
 Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
 #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
 #1  0x7f33dfce2a3b in leveldb::(anonymous 
 namespace)::PosixEnv::BGThreadWrapper(void*) () from 
 /tmp/libleveldbjni-64-1-6922178968300745716.8
 #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
 #3  0x003d830e811d in clone () from /lib64/libc.so.6
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3017) ContainerID in ResourceManager Log Has Slightly Different Format From AppAttemptID

2015-06-01 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566998#comment-14566998
 ] 

zhihai xu commented on YARN-3017:
-

Hi [~mufeed.usman], thanks for working on this issue.
The name for appAttemptIdAndEpochFormat will become confusing after the fix.
Could you rename appAttemptIdAndEpochFormat to epochFormat?

 ContainerID in ResourceManager Log Has Slightly Different Format From 
 AppAttemptID
 --

 Key: YARN-3017
 URL: https://issues.apache.org/jira/browse/YARN-3017
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.8.0
Reporter: MUFEED USMAN
Priority: Minor
  Labels: PatchAvailable
 Attachments: YARN-3017.patch, YARN-3017_1.patch


 Not sure if this should be filed as a bug or not.
 In the ResourceManager log in the events surrounding the creation of a new
 application attempt,
 ...
 ...
 2014-11-14 17:45:37,258 INFO
 org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching
 masterappattempt_1412150883650_0001_02
 ...
 ...
 The application attempt has the ID format _1412150883650_0001_02.
 Whereas the associated ContainerID goes by _1412150883650_0001_02_.
 ...
 ...
 2014-11-14 17:45:37,260 INFO
 org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting 
 up
 container Container: [ContainerId: container_1412150883650_0001_02_01,
 NodeId: n67:55933, NodeHttpAddress: n67:8042, Resource: memory:2048, 
 vCores:1,
 disks:0.0, Priority: 0, Token: Token { kind: ContainerToken, service:
 10.10.70.67:55933 }, ] for AM appattempt_1412150883650_0001_02
 ...
 ...
 Curious to know if this is kept like that for a reason. If not, then when using
 filtering tools to, say, grep events surrounding a specific attempt by its
 numeric ID part, information may slip out during troubleshooting.
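
As an aside, the padding mismatch can be reproduced with plain format strings; the widths below are my assumptions for illustration, not a quote of the actual YARN formatters:

{code}
// Hedged illustration of the formatting difference reported above: the attempt
// number is padded differently in the two ID strings, so grepping for one form
// of the attempt suffix will not match the other.
long clusterTs = 1412150883650L;
int appId = 1, attemptId = 2, containerId = 1;
String attempt = String.format("appattempt_%d_%04d_%06d", clusterTs, appId, attemptId);
String container = String.format("container_%d_%04d_%02d_%06d", clusterTs, appId, attemptId, containerId);
System.out.println(attempt);
System.out.println(container);
{code}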



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-06-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567189#comment-14567189
 ] 

Rohith commented on YARN-3733:
--

bq. Verify infinity by calling isInfinite(float v). Quoting from jdk7 
Since infinity is derived from lhs and rhs, the infinities cannot be differentiated 
for clusterResource=0,0, lhs=1,1, and rhs=2,2. The method 
{{getResourceAsValue()}} returns infinity for both l and r, so they cannot be compared.
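
A quick illustration of the comparison problem, using plain Java float arithmetic (independent of the YARN code):

{code}
// With clusterResource = 0,0 the per-resource share becomes resource/0:
float lhsShare = 1f / 0f;   // lhs = 1,1  -> Float.POSITIVE_INFINITY
float rhsShare = 2f / 0f;   // rhs = 2,2  -> Float.POSITIVE_INFINITY
System.out.println(Float.compare(lhsShare, rhsShare));   // prints 0: the shares are indistinguishable
System.out.println(0f / 0f);                              // NaN for an empty lhs/rhs
{code}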

  On RM restart AM getting more than maximum possible memory when many  tasks 
 in queue
 -

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Blocker
 Attachments: YARN-3733.patch


 Steps to reproduce
 =
 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
 size to 512 MB
 3. Configure capacity scheduler and AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent tasks 
 5. Switch RM
 Actual
 =
 For 12 jobs the AM gets allocated and all 12 start running
 No other YARN child is initiated, *all 12 jobs stay in RUNNING state forever*
 Expected
 ===
 Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-06-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567196#comment-14567196
 ] 

Rohith commented on YARN-3585:
--

Yes, we can raise a different Jira. [~bibinchundatt], can you raise one? We can 
validate the issue there.

 NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
 --

 Key: YARN-3585
 URL: https://issues.apache.org/jira/browse/YARN-3585
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang
Assignee: Rohith
Priority: Critical
 Attachments: YARN-3585.patch


 With NM recovery enabled, after decommission the nodemanager log shows the stop 
 but the process cannot end. 
 Non-daemon threads:
 {noformat}
 DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
 condition [0x]
 leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
 [0x]
 VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
 Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 
 nid=0x29ed runnable 
 Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 
 nid=0x29ee runnable 
 Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 
 nid=0x29ef runnable 
 Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 
 nid=0x29f0 runnable 
 Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 
 nid=0x29f1 runnable 
 Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 
 nid=0x29f2 runnable 
 Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 
 nid=0x29f3 runnable 
 Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 
 nid=0x29f4 runnable 
 Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 
 runnable 
 Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 
 nid=0x29f5 runnable 
 Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 
 nid=0x29f6 runnable 
 VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
 on condition 
 {noformat}
 and jni leveldb thread stack
 {noformat}
 Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
 #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
 #1  0x7f33dfce2a3b in leveldb::(anonymous 
 namespace)::PosixEnv::BGThreadWrapper(void*) () from 
 /tmp/libleveldbjni-64-1-6922178968300745716.8
 #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
 #3  0x003d830e811d in clone () from /lib64/libc.so.6
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-01 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-3749:

Attachment: YARN-3749.2.patch

Uploaded a new patch to fix test cases.
Lots of the previous test failures (The HA Configuration has multiple addresses 
that match local node's address.) were because I forgot to set 
YarnConfiguration.RM_HA_ID before starting the NM.

The patch also contains 2 minor fixes: it moves reading the RM_SCHEDULER_ADDRESS 
conf value from serviceStart to serviceInit in ApplicationMasterService, and it 
replaces the duplicated setRpcAddressForRM calls in tests with HAUtil.

 We should make a copy of configuration when init MiniYARNCluster with 
 multiple RMs
 --

 Key: YARN-3749
 URL: https://issues.apache.org/jira/browse/YARN-3749
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-3749.2.patch, YARN-3749.patch


 When I was trying to write a test case for YARN-2674, I found the DS client 
 trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
 when the RM failed over, even though I initially set 
 yarn.resourcemanager.address.rm1=0.0.0.0:18032 and 
 yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found it is 
 ClientRMService that changes the value of yarn.resourcemanager.address.rm2 
 to 0.0.0.0:18032. See the following code in ClientRMService:
 {code}
 clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
YarnConfiguration.RM_ADDRESS,

 YarnConfiguration.DEFAULT_RM_ADDRESS,
server.getListenerAddress());
 {code}
 Since we use the same Configuration instance for rm1 and rm2, and we init both 
 RMs before we start them, yarn.resourcemanager.ha.id is changed to rm2 during 
 the init of rm2 and is therefore already rm2 by the time rm1 starts.
 So I think it is safe to make a copy of the configuration when initializing 
 each RM.
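
A minimal sketch of the proposed fix (the array indexing and HA ids are illustrative; YarnConfiguration's copy constructor does copy the properties):

{code}
// Hedged sketch: give each RM its own Configuration copy so per-RM settings
// written during init of one RM (e.g. yarn.resourcemanager.ha.id) cannot leak
// into the other RM sharing the original Configuration object.
Configuration rm1Conf = new YarnConfiguration(conf);
rm1Conf.set(YarnConfiguration.RM_HA_ID, "rm1");
resourceManagers[0].init(rm1Conf);

Configuration rm2Conf = new YarnConfiguration(conf);
rm2Conf.set(YarnConfiguration.RM_HA_ID, "rm2");
resourceManagers[1].init(rm2Conf);
{code}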



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-06-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567186#comment-14567186
 ] 

Rohith commented on YARN-3733:
--

bq. 2. The newly added code is duplicated in two places, can you eliminate the 
duplicate code?
The second validation is not required in case of NaN; I will remove it in the next patch.

  On RM restart AM getting more than maximum possible memory when many  tasks 
 in queue
 -

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Blocker
 Attachments: YARN-3733.patch


 Steps to reproduce
 =
 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
 size to 512 MB
 3. Configure capacity scheduler and AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent tasks 
 5. Switch RM
 Actual
 =
 For 12 jobs the AM gets allocated and all 12 start running
 No other YARN child is initiated, *all 12 jobs stay in RUNNING state forever*
 Expected
 ===
 Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-06-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567184#comment-14567184
 ] 

Rohith commented on YARN-3733:
--

Thanks [~devaraj.k] and [~sunilg] for the review.

bq. Can we check for lhs/rhs emptiness and compare these before ending up with 
infinite values? 
If we check for emptiness, this would affect specific input values like 
clusterResource=0,0, lhs=1,1, and rhs=2,2. Then which one is considered 
dominant? The dominant component cannot be retrieved directly from memory or CPU.

I have listed the possible combinations of inputs that can occur in YARN (see the 
sketch after the table for cases 2 and 3). These are:
||Sl.no||clusterResource||lhs||rhs||Remark||
|1|0,0|0,0|0,0|Valid input; handled|
|2|0,0|positive integer,positive integer|0,0|NaN vs Infinity: the patch handles this scenario|
|3|0,0|0,0|positive integer,positive integer|NaN vs Infinity: the patch handles this scenario|
|4|0,0|positive integer,positive integer|positive integer,positive integer|Infinity vs Infinity: can this case occur in YARN?|
|5|0,0|positive integer,0|0,positive integer|Is this a valid input? Can this case occur in YARN?|
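
For cases 2 and 3, a minimal sketch of the kind of guard being discussed (an illustration under my own naming, not the attached patch):

{code}
// Hedged sketch: order two dominant shares when clusterResource is 0,0, so that
// a NaN share (0/0, i.e. an empty request) loses to an Infinity share
// (positive/0) instead of producing an undefined comparison.
static int compareShares(float lhsShare, float rhsShare) {
  if (Float.isNaN(lhsShare) && Float.isNaN(rhsShare)) {
    return 0;                                 // case 1: both empty
  }
  if (Float.isNaN(lhsShare)) {
    return -1;                                // case 3: empty lhs vs positive rhs
  }
  if (Float.isNaN(rhsShare)) {
    return 1;                                 // case 2: positive lhs vs empty rhs
  }
  return Float.compare(lhsShare, rhsShare);   // case 4 still compares Infinity vs Infinity
}
{code}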


  On RM restart AM getting more than maximum possible memory when many  tasks 
 in queue
 -

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Blocker
 Attachments: YARN-3733.patch


 Steps to reproduce
 =
 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
 size to 512 MB
 3. Configure capacity scheduler and AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent tasks 
 5. Switch RM
 Actual
 =
 For 12 jobs the AM gets allocated and all 12 start running
 No other YARN child is initiated, *all 12 jobs stay in RUNNING state forever*
 Expected
 ===
 Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-06-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567201#comment-14567201
 ] 

Rohith commented on YARN-3585:
--

The findbugs -1 does not show any error report, so I am not sure why the -1 was given.
The test failure is unrelated to this patch.

[~jlowe] Kindly review the patch. 

 NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
 --

 Key: YARN-3585
 URL: https://issues.apache.org/jira/browse/YARN-3585
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang
Assignee: Rohith
Priority: Critical
 Attachments: YARN-3585.patch


 With NM recovery enabled, after decommission the nodemanager log shows the stop 
 but the process cannot end. 
 Non-daemon threads:
 {noformat}
 DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
 condition [0x]
 leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
 [0x]
 VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
 Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 
 nid=0x29ed runnable 
 Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 
 nid=0x29ee runnable 
 Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 
 nid=0x29ef runnable 
 Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 
 nid=0x29f0 runnable 
 Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 
 nid=0x29f1 runnable 
 Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 
 nid=0x29f2 runnable 
 Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 
 nid=0x29f3 runnable 
 Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 
 nid=0x29f4 runnable 
 Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 
 runnable 
 Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 
 nid=0x29f5 runnable 
 Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 
 nid=0x29f6 runnable 
 VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
 on condition 
 {noformat}
 and jni leveldb thread stack
 {noformat}
 Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
 #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
 #1  0x7f33dfce2a3b in leveldb::(anonymous 
 namespace)::PosixEnv::BGThreadWrapper(void*) () from 
 /tmp/libleveldbjni-64-1-6922178968300745716.8
 #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
 #3  0x003d830e811d in clone () from /lib64/libc.so.6
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3069) Document missing properties in yarn-default.xml

2015-06-01 Thread Akira AJISAKA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567038#comment-14567038
 ] 

Akira AJISAKA commented on YARN-3069:
-

Thanks [~rchiang] for updating the patch.
# Would you reflect the previous comment for 
{{yarn.node-labels.fs-store.retry-policy-spec}}?
# For YARN registry, the parameters are written in core-site.xml. Can we remove 
them from the patch?

My review is almost done.
@Watchers: I would appreciate it if you could review this patch. It includes a lot 
of descriptions for parameters, so it should be reviewed by many developers.

 Document missing properties in yarn-default.xml
 ---

 Key: YARN-3069
 URL: https://issues.apache.org/jira/browse/YARN-3069
 Project: Hadoop YARN
  Issue Type: Bug
  Components: documentation
Reporter: Ray Chiang
Assignee: Ray Chiang
  Labels: BB2015-05-TBR, supportability
 Attachments: YARN-3069.001.patch, YARN-3069.002.patch, 
 YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, 
 YARN-3069.006.patch, YARN-3069.007.patch, YARN-3069.008.patch, 
 YARN-3069.009.patch, YARN-3069.010.patch


 The following properties are currently not defined in yarn-default.xml.  
 These properties should either be
   A) documented in yarn-default.xml OR
   B)  listed as an exception (with comments, e.g. for internal use) in the 
 TestYarnConfigurationFields unit test
 Any comments for any of the properties below are welcome.
   org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker
   org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore
   security.applicationhistory.protocol.acl
   yarn.app.container.log.backups
   yarn.app.container.log.dir
   yarn.app.container.log.filesize
   yarn.client.app-submission.poll-interval
   yarn.client.application-client-protocol.poll-timeout-ms
   yarn.is.minicluster
   yarn.log.server.url
   yarn.minicluster.control-resource-monitoring
   yarn.minicluster.fixed.ports
   yarn.minicluster.use-rpc
   yarn.node-labels.fs-store.retry-policy-spec
   yarn.node-labels.fs-store.root-dir
   yarn.node-labels.manager-class
   yarn.nodemanager.container-executor.os.sched.priority.adjustment
   yarn.nodemanager.container-monitor.process-tree.class
   yarn.nodemanager.disk-health-checker.enable
   yarn.nodemanager.docker-container-executor.image-name
   yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms
   yarn.nodemanager.linux-container-executor.group
   yarn.nodemanager.log.deletion-threads-count
   yarn.nodemanager.user-home-dir
   yarn.nodemanager.webapp.https.address
   yarn.nodemanager.webapp.spnego-keytab-file
   yarn.nodemanager.webapp.spnego-principal
   yarn.nodemanager.windows-secure-container-executor.group
   yarn.resourcemanager.configuration.file-system-based-store
   yarn.resourcemanager.delegation-token-renewer.thread-count
   yarn.resourcemanager.delegation.key.update-interval
   yarn.resourcemanager.delegation.token.max-lifetime
   yarn.resourcemanager.delegation.token.renew-interval
   yarn.resourcemanager.history-writer.multi-threaded-dispatcher.pool-size
   yarn.resourcemanager.metrics.runtime.buckets
   yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs
   yarn.resourcemanager.reservation-system.class
   yarn.resourcemanager.reservation-system.enable
   yarn.resourcemanager.reservation-system.plan.follower
   yarn.resourcemanager.reservation-system.planfollower.time-step
   yarn.resourcemanager.rm.container-allocation.expiry-interval-ms
   yarn.resourcemanager.webapp.spnego-keytab-file
   yarn.resourcemanager.webapp.spnego-principal
   yarn.scheduler.include-port-in-node-name
   yarn.timeline-service.delegation.key.update-interval
   yarn.timeline-service.delegation.token.max-lifetime
   yarn.timeline-service.delegation.token.renew-interval
   yarn.timeline-service.generic-application-history.enabled
   
 yarn.timeline-service.generic-application-history.fs-history-store.compression-type
   yarn.timeline-service.generic-application-history.fs-history-store.uri
   yarn.timeline-service.generic-application-history.store-class
   yarn.timeline-service.http-cross-origin.enabled
   yarn.tracking.url.generator



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty

2015-06-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567157#comment-14567157
 ] 

Hudson commented on YARN-3725:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #945 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/945/])
YARN-3725. App submission via REST API is broken in secure mode due to Timeline 
DT service address is empty. (Zhijie Shen via wangda) (wangda: rev 
5cc3fced957a8471733e0e9490878bd68429fe24)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/security/TestTimelineAuthenticationFilter.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java
* hadoop-yarn-project/CHANGES.txt


 App submission via REST API is broken in secure mode due to Timeline DT 
 service address is empty
 

 Key: YARN-3725
 URL: https://issues.apache.org/jira/browse/YARN-3725
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, timelineserver
Affects Versions: 2.7.0
Reporter: Zhijie Shen
Assignee: Zhijie Shen
Priority: Blocker
 Fix For: 2.7.1

 Attachments: YARN-3725.1.patch


 YARN-2971 changes TimelineClient to use the service address from the Timeline DT 
 to renew the DT instead of the configured address. This breaks the procedure of 
 submitting a YARN app via the REST API in secure mode.
 The problem is that the service address is set by the client instead of the 
 server in Java code. The REST API response is an encoded token String, so it is 
 inconvenient to deserialize it, set the service address, and serialize it again. 
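
For context, a minimal sketch of the kind of fallback the fix implies on the renew path (the surrounding method and the exact config keys used here are my assumptions, not a quote of the committed change):

{code}
// Hedged sketch: if the Timeline delegation token handed back by the REST API
// has an empty service field, fill it in from the configured timeline service
// address instead of failing the renewal.
Text service = token.getService();
if (service == null || service.toString().isEmpty()) {
  InetSocketAddress addr = conf.getSocketAddr(
      YarnConfiguration.TIMELINE_SERVICE_WEBAPP_ADDRESS,
      YarnConfiguration.DEFAULT_TIMELINE_SERVICE_WEBAPP_ADDRESS,
      YarnConfiguration.DEFAULT_TIMELINE_SERVICE_WEBAPP_PORT);
  SecurityUtil.setTokenService(token, addr);
}
{code}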



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2716) Refactor ZKRMStateStore retry code with Apache Curator

2015-06-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567024#comment-14567024
 ] 

Hadoop QA commented on YARN-2716:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m  9s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 4 new or modified test files. |
| {color:green}+1{color} | javac |   7m 34s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 42s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   0m 46s | The applied patch generated  3 
new checkstyle issues (total was 42, now 8). |
| {color:green}+1{color} | whitespace |   0m  5s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 35s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:red}-1{color} | findbugs |   1m 30s | The patch appears to introduce 2 
new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |  50m 23s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | |  88m 44s | |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-yarn-server-resourcemanager |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736498/yarn-2716-1.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 5cc3fce |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8147/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt
 |
| Findbugs warnings | 
https://builds.apache.org/job/PreCommit-YARN-Build/8147/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8147/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8147/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8147/console |


This message was automatically generated.

 Refactor ZKRMStateStore retry code with Apache Curator
 --

 Key: YARN-2716
 URL: https://issues.apache.org/jira/browse/YARN-2716
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Jian He
Assignee: Karthik Kambatla
 Attachments: yarn-2716-1.patch, yarn-2716-prelim.patch, 
 yarn-2716-prelim.patch, yarn-2716-super-prelim.patch


 Per suggestion by [~kasha] in YARN-2131, it would be nice to use Curator to 
 simplify the retry logic in ZKRMStateStore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-06-01 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567054#comment-14567054
 ] 

Sunil G commented on YARN-3585:
---

Hi [~bibinchundatt] and [~rohithsharma]
This recent exception trace is different from the focus of this Jira, and the 
root cause is given by Rohith. I feel you can separate this into another ticket.

For the DB close vs. container launch race, we can add a check for whether the DB 
is closed while we move the container out of the ACQUIRED state.
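
A minimal sketch of that check (field and method names are illustrative, not the actual NM recovery-store API):

{code}
// Hedged sketch: skip the state-store update if the LevelDB handle has already
// been closed by NM shutdown, instead of letting the racing container launch
// surface org.iq80.leveldb.DBException: Closed.
private DB db;                    // org.iq80.leveldb.DB opened by the store
private volatile boolean closed;  // set by the store's serviceStop()

private synchronized void storeContainerRecord(String key, byte[] value)
    throws IOException {
  if (closed || db == null) {
    // NM is shutting down; drop the update rather than throw from the launch path.
    return;
  }
  db.put(key.getBytes(StandardCharsets.UTF_8), value);
}
{code}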

 NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
 --

 Key: YARN-3585
 URL: https://issues.apache.org/jira/browse/YARN-3585
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang
Assignee: Rohith
Priority: Critical
 Attachments: YARN-3585.patch


 With NM recovery enabled, after decommission the nodemanager log shows the stop 
 but the process cannot end. 
 Non-daemon threads:
 {noformat}
 DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
 condition [0x]
 leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
 [0x]
 VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
 Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 
 nid=0x29ed runnable 
 Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 
 nid=0x29ee runnable 
 Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 
 nid=0x29ef runnable 
 Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 
 nid=0x29f0 runnable 
 Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 
 nid=0x29f1 runnable 
 Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 
 nid=0x29f2 runnable 
 Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 
 nid=0x29f3 runnable 
 Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 
 nid=0x29f4 runnable 
 Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 
 runnable 
 Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 
 nid=0x29f5 runnable 
 Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 
 nid=0x29f6 runnable 
 VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
 on condition 
 {noformat}
 and jni leveldb thread stack
 {noformat}
 Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
 #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
 #1  0x7f33dfce2a3b in leveldb::(anonymous 
 namespace)::PosixEnv::BGThreadWrapper(void*) () from 
 /tmp/libleveldbjni-64-1-6922178968300745716.8
 #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
 #3  0x003d830e811d in clone () from /lib64/libc.so.6
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-01 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2194:
---
  Component/s: nodemanager
  Description: 
In RHEL7, the CPU controller is named cpu,cpuacct. The comma in the 
controller name leads to container launch failure. 

RHEL7 deprecates libcgroup and recommends the use of systemd. However, systemd 
has certain shortcomings as identified in this JIRA (see comments). 

This JIRA only fixes the failure, and doesn't try to use systemd.

  was:   In previous versions of RedHat, we can build custom cgroup hierarchies 
with use of the cgconfig command from the libcgroup package. From RedHat 7, 
package libcgroup is deprecated and it is not recommended to use it since it 
can easily create conflicts with the default cgroup hierarchy. The systemd is 
provided and recommended for cgroup management. We need to add support for this.

 Priority: Critical  (was: Major)
 Target Version/s: 2.8.0
Affects Version/s: 2.7.0
   Issue Type: Bug  (was: Improvement)

 Cgroups cease to work in RHEL7
 --

 Key: YARN-2194
 URL: https://issues.apache.org/jira/browse/YARN-2194
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Wei Yan
Assignee: Wei Yan
Priority: Critical
 Attachments: YARN-2194-1.patch, YARN-2194-2.patch


 In RHEL7, the CPU controller is named cpu,cpuacct. The comma in the 
 controller name leads to container launch failure. 
 RHEL7 deprecates libcgroup and recommends the use of systemd. However, 
 systemd has certain shortcomings as identified in this JIRA (see comments). 
 This JIRA only fixes the failure, and doesn't try to use systemd.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2900) Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500)

2015-06-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567158#comment-14567158
 ] 

Hudson commented on YARN-2900:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #945 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/945/])
YARN-2900. Application (Attempt and Container) Not Found in AHS results (xgong: 
rev 9686261ecb872ad159fac3ca44f1792143c6d7db)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/WebServices.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java


 Application (Attempt and Container) Not Found in AHS results in Internal 
 Server Error (500)
 ---

 Key: YARN-2900
 URL: https://issues.apache.org/jira/browse/YARN-2900
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Jonathan Eagles
Assignee: Mit Desai
 Fix For: 2.7.1

 Attachments: YARN-2900-b2-2.patch, YARN-2900-b2.patch, 
 YARN-2900-branch-2.7.20150530.patch, YARN-2900.20150529.patch, 
 YARN-2900.20150530.patch, YARN-2900.20150530.patch, YARN-2900.patch, 
 YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, 
 YARN-2900.patch, YARN-2900.patch, YARN-2900.patch


 Caused by: java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128)
   at 
 org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118)
   at 
 org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222)
   at 
 org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
   at 
 org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218)
   ... 59 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-01 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567029#comment-14567029
 ] 

Karthik Kambatla commented on YARN-2194:


+1 otherwise. 

[~vinodkv], [~tucu00] - is this somewhat hacky approach reasonable? 

 Cgroups cease to work in RHEL7
 --

 Key: YARN-2194
 URL: https://issues.apache.org/jira/browse/YARN-2194
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Wei Yan
Assignee: Wei Yan
Priority: Critical
 Attachments: YARN-2194-1.patch, YARN-2194-2.patch


 In RHEL7, the CPU controller is named cpu,cpuacct. The comma in the 
 controller name leads to container launch failure. 
 RHEL7 deprecates libcgroup and recommends the use of systemd. However, 
 systemd has certain shortcomings as identified in this JIRA (see comments). 
 This JIRA only fixes the failure, and doesn't try to use systemd.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3432) Cluster metrics have wrong Total Memory when there is reserved memory on CS

2015-06-01 Thread Akira AJISAKA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567041#comment-14567041
 ] 

Akira AJISAKA commented on YARN-3432:
-

Thanks [~brahmareddy] for taking this issue. The patch seems to revert 
YARN-656. I think that's not fine because it will break the FairScheduler. 
This issue should fix the CapacityScheduler only.

 Cluster metrics have wrong Total Memory when there is reserved memory on CS
 ---

 Key: YARN-3432
 URL: https://issues.apache.org/jira/browse/YARN-3432
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler, resourcemanager
Affects Versions: 2.6.0
Reporter: Thomas Graves
Assignee: Brahma Reddy Battula
 Attachments: YARN-3432.patch


 I noticed that when reservations happen while using the Capacity Scheduler, 
 the UI and web services report the wrong total memory.
 For example, I have 300GB of total memory in my cluster. I allocate 50GB 
 and reserve 10GB, and the cluster metrics for total memory get reported as 290GB.
 This was broken by https://issues.apache.org/jira/browse/YARN-656 so perhaps 
 there is a difference between fair scheduler and capacity scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-01 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2194:
---
Summary: Cgroups cease to work in RHEL7  (was: Add Cgroup support for 
RedHat 7)

 Cgroups cease to work in RHEL7
 --

 Key: YARN-2194
 URL: https://issues.apache.org/jira/browse/YARN-2194
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Wei Yan
Assignee: Wei Yan
 Attachments: YARN-2194-1.patch, YARN-2194-2.patch


In previous versions of RedHat, we can build custom cgroup hierarchies 
 with use of the cgconfig command from the libcgroup package. From RedHat 7, 
 package libcgroup is deprecated and it is not recommended to use it since it 
 can easily create conflicts with the default cgroup hierarchy. The systemd is 
 provided and recommended for cgroup management. We need to add support for 
 this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-01 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567025#comment-14567025
 ] 

Karthik Kambatla commented on YARN-2194:


Verified the patch works. Can we add more comments to clarify why the patch 
replaces cpu,cpuacct with cpu? Maybe something along the lines of: In RHEL7, 
the CPU controller is named 'cpu,cpuacct'. The comma in the controller name 
leads to container launch failure. The symlinks 'cpu' and 'cpuacct' point to 
'cpu,cpuacct'. Using 'cpu' solves the issue.
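
A small sketch of the normalization described in the comment above (the method is illustrative, not the patch itself):

{code}
// Hedged sketch: map the RHEL7 combined controller name back to the plain "cpu"
// entry. On RHEL7, /sys/fs/cgroup/cpu is a symlink to cpu,cpuacct, so using
// "cpu" avoids the comma that breaks the container launch path.
private static String normalizeCpuController(String controllerName) {
  if ("cpu,cpuacct".equals(controllerName)) {
    return "cpu";
  }
  return controllerName;
}
{code}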

 Cgroups cease to work in RHEL7
 --

 Key: YARN-2194
 URL: https://issues.apache.org/jira/browse/YARN-2194
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Wei Yan
Assignee: Wei Yan
Priority: Critical
 Attachments: YARN-2194-1.patch, YARN-2194-2.patch


 In RHEL7, the CPU controller is named cpu,cpuacct. The comma in the 
 controller name leads to container launch failure. 
 RHEL7 deprecates libcgroup and recommends the use of systemd. However, 
 systemd has certain shortcomings as identified in this JIRA (see comments). 
 This JIRA only fixes the failure, and doesn't try to use systemd.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3528) Tests with 12345 as hard-coded port break jenkins

2015-06-01 Thread Brahma Reddy Battula (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567153#comment-14567153
 ] 

Brahma Reddy Battula commented on YARN-3528:


Thanks [~ste...@apache.org] and [~rkanter] for your inputs. Going to write one 
common utility with:
1) one method for port 0, which sets the resolved port back into the same config
2) another method for the places where the above is not possible, using a similar 
approach to the one [~rkanter] mentioned.
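
One possible shape for the first helper (a sketch under my own naming; the actual utility may differ):

{code}
// Hedged sketch: the test configures the address key with port 0, starts the
// service, and then writes the actually bound address back into the same config
// key so clients connect to the real port instead of 0.
public static void setBackBoundAddress(Configuration conf, String addressKey,
    InetSocketAddress boundAddress) {
  conf.set(addressKey, boundAddress.getHostName() + ":" + boundAddress.getPort());
}
{code}

The bound address would come from the started server, e.g. its RPC listener address.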

 Tests with 12345 as hard-coded port break jenkins
 -

 Key: YARN-3528
 URL: https://issues.apache.org/jira/browse/YARN-3528
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0
 Environment: ASF Jenkins
Reporter: Steve Loughran
Assignee: Brahma Reddy Battula
Priority: Blocker
  Labels: test

 A lot of the YARN tests have hard-coded the port 12345 for their services to 
 come up on.
 This makes it impossible to have scheduled or precommit tests to run 
 consistently on the ASF jenkins hosts. Instead the tests fail regularly and 
 appear to get ignored completely.
 A quick grep of 12345 shows up many places in the test suite where this 
 practise has developed.
 * All {{BaseContainerManagerTest}} subclasses
 * {{TestNodeManagerShutdown}}
 * {{TestContainerManager}}
 + others
 This needs to be addressed through portscanning and dynamic port allocation. 
 Please can someone do this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2900) Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500)

2015-06-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567151#comment-14567151
 ] 

Hudson commented on YARN-2900:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #215 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/215/])
YARN-2900. Application (Attempt and Container) Not Found in AHS results (xgong: 
rev 9686261ecb872ad159fac3ca44f1792143c6d7db)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/WebServices.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java
* hadoop-yarn-project/CHANGES.txt


 Application (Attempt and Container) Not Found in AHS results in Internal 
 Server Error (500)
 ---

 Key: YARN-2900
 URL: https://issues.apache.org/jira/browse/YARN-2900
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Jonathan Eagles
Assignee: Mit Desai
 Fix For: 2.7.1

 Attachments: YARN-2900-b2-2.patch, YARN-2900-b2.patch, 
 YARN-2900-branch-2.7.20150530.patch, YARN-2900.20150529.patch, 
 YARN-2900.20150530.patch, YARN-2900.20150530.patch, YARN-2900.patch, 
 YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, 
 YARN-2900.patch, YARN-2900.patch, YARN-2900.patch


 Caused by: java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128)
   at 
 org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118)
   at 
 org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222)
   at 
 org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
   at 
 org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218)
   ... 59 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty

2015-06-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567150#comment-14567150
 ] 

Hudson commented on YARN-3725:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #215 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/215/])
YARN-3725. App submission via REST API is broken in secure mode due to Timeline 
DT service address is empty. (Zhijie Shen via wangda) (wangda: rev 
5cc3fced957a8471733e0e9490878bd68429fe24)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/security/TestTimelineAuthenticationFilter.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java
* hadoop-yarn-project/CHANGES.txt


 App submission via REST API is broken in secure mode due to Timeline DT 
 service address is empty
 

 Key: YARN-3725
 URL: https://issues.apache.org/jira/browse/YARN-3725
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, timelineserver
Affects Versions: 2.7.0
Reporter: Zhijie Shen
Assignee: Zhijie Shen
Priority: Blocker
 Fix For: 2.7.1

 Attachments: YARN-3725.1.patch


 YARN-2971 changes TimelineClient to use the service address from the Timeline DT 
 to renew the DT instead of the configured address. This breaks the procedure of 
 submitting a YARN app via the REST API in secure mode.
 The problem is that the service address is set by the client instead of the 
 server in Java code. The REST API response is an encoded token String, so it is 
 inconvenient to deserialize it, set the service address, and serialize it again. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2900) Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500)

2015-06-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567417#comment-14567417
 ] 

Hudson commented on YARN-2900:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #213 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/213/])
YARN-2900. Application (Attempt and Container) Not Found in AHS results (xgong: 
rev 9686261ecb872ad159fac3ca44f1792143c6d7db)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/WebServices.java
* hadoop-yarn-project/CHANGES.txt


 Application (Attempt and Container) Not Found in AHS results in Internal 
 Server Error (500)
 ---

 Key: YARN-2900
 URL: https://issues.apache.org/jira/browse/YARN-2900
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Jonathan Eagles
Assignee: Mit Desai
 Fix For: 2.7.1

 Attachments: YARN-2900-b2-2.patch, YARN-2900-b2.patch, 
 YARN-2900-branch-2.7.20150530.patch, YARN-2900.20150529.patch, 
 YARN-2900.20150530.patch, YARN-2900.20150530.patch, YARN-2900.patch, 
 YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, 
 YARN-2900.patch, YARN-2900.patch, YARN-2900.patch


 Caused by: java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128)
   at 
 org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118)
   at 
 org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222)
   at 
 org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
   at 
 org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218)
   ... 59 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2900) Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500)

2015-06-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567448#comment-14567448
 ] 

Hudson commented on YARN-2900:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2161 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2161/])
YARN-2900. Application (Attempt and Container) Not Found in AHS results (xgong: 
rev 9686261ecb872ad159fac3ca44f1792143c6d7db)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/WebServices.java
* hadoop-yarn-project/CHANGES.txt


 Application (Attempt and Container) Not Found in AHS results in Internal 
 Server Error (500)
 ---

 Key: YARN-2900
 URL: https://issues.apache.org/jira/browse/YARN-2900
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Jonathan Eagles
Assignee: Mit Desai
 Fix For: 2.7.1

 Attachments: YARN-2900-b2-2.patch, YARN-2900-b2.patch, 
 YARN-2900-branch-2.7.20150530.patch, YARN-2900.20150529.patch, 
 YARN-2900.20150530.patch, YARN-2900.20150530.patch, YARN-2900.patch, 
 YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, 
 YARN-2900.patch, YARN-2900.patch, YARN-2900.patch


 Caused by: java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128)
   at 
 org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118)
   at 
 org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222)
   at 
 org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
   at 
 org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218)
   ... 59 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-01 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-2194:
--
Attachment: YARN-2194-3.patch

Thanks, [~kasha]. Updated a patch adding more comments.

 Cgroups cease to work in RHEL7
 --

 Key: YARN-2194
 URL: https://issues.apache.org/jira/browse/YARN-2194
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Wei Yan
Assignee: Wei Yan
Priority: Critical
 Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch


 In RHEL7, the CPU controller is named cpu,cpuacct. The comma in the 
 controller name leads to container launch failure. 
 RHEL7 deprecates libcgroup and recommends the use of systemd. However, 
 systemd has certain shortcomings as identified in this JIRA (see comments). 
 This JIRA only fixes the failure, and doesn't try to use systemd.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty

2015-06-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567416#comment-14567416
 ] 

Hudson commented on YARN-3725:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #213 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/213/])
YARN-3725. App submission via REST API is broken in secure mode due to Timeline 
DT service address is empty. (Zhijie Shen via wangda) (wangda: rev 
5cc3fced957a8471733e0e9490878bd68429fe24)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/security/TestTimelineAuthenticationFilter.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java
* hadoop-yarn-project/CHANGES.txt


 App submission via REST API is broken in secure mode due to Timeline DT 
 service address is empty
 

 Key: YARN-3725
 URL: https://issues.apache.org/jira/browse/YARN-3725
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, timelineserver
Affects Versions: 2.7.0
Reporter: Zhijie Shen
Assignee: Zhijie Shen
Priority: Blocker
 Fix For: 2.7.1

 Attachments: YARN-3725.1.patch


 YARN-2971 changes TimelineClient to use the service address from the Timeline DT 
 to renew the DT instead of the configured address. This breaks the procedure of 
 submitting a YARN app via the REST API in secure mode.
 The problem is that the service address is set by the client instead of the 
 server in Java code. The REST API response is an encoded token String, so it is 
 inconvenient to deserialize it, set the service address, and serialize it again. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty

2015-06-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567447#comment-14567447
 ] 

Hudson commented on YARN-3725:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2161 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2161/])
YARN-3725. App submission via REST API is broken in secure mode due to Timeline 
DT service address is empty. (Zhijie Shen via wangda) (wangda: rev 
5cc3fced957a8471733e0e9490878bd68429fe24)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/security/TestTimelineAuthenticationFilter.java
* hadoop-yarn-project/CHANGES.txt


 App submission via REST API is broken in secure mode due to Timeline DT 
 service address is empty
 

 Key: YARN-3725
 URL: https://issues.apache.org/jira/browse/YARN-3725
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, timelineserver
Affects Versions: 2.7.0
Reporter: Zhijie Shen
Assignee: Zhijie Shen
Priority: Blocker
 Fix For: 2.7.1

 Attachments: YARN-3725.1.patch


 YARN-2971 changes TimelineClient to use the service address from the Timeline DT 
 to renew the DT instead of the configured address. This breaks the procedure of 
 submitting a YARN app via the REST API in secure mode.
 The problem is that the service address is set by the client instead of the 
 server in Java code. The REST API response is an encoded token String, so it is 
 inconvenient to deserialize it, set the service address, and serialize it 
 again. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3748) Cleanup Findbugs volatile warnings

2015-06-01 Thread Gabor Liptak (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Liptak updated YARN-3748:
---
Attachment: YARN-3748.4.patch

 Cleanup Findbugs volatile warnings
 --

 Key: YARN-3748
 URL: https://issues.apache.org/jira/browse/YARN-3748
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Gabor Liptak
Priority: Minor
 Attachments: YARN-3748.1.patch, YARN-3748.2.patch, YARN-3748.3.patch, 
 YARN-3748.4.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3748) Cleanup Findbugs volatile warnings

2015-06-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567506#comment-14567506
 ] 

Hadoop QA commented on YARN-3748:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  15m 58s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:red}-1{color} | javac |   3m 19s | The patch appears to cause the 
build to fail. |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736589/YARN-3748.4.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 63e3fee |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8152/console |


This message was automatically generated.

 Cleanup Findbugs volatile warnings
 --

 Key: YARN-3748
 URL: https://issues.apache.org/jira/browse/YARN-3748
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Gabor Liptak
Priority: Minor
 Attachments: YARN-3748.1.patch, YARN-3748.2.patch, YARN-3748.3.patch, 
 YARN-3748.4.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3699) Decide if flow version should be part of row key or column

2015-06-01 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567573#comment-14567573
 ] 

Junping Du commented on YARN-3699:
--

Hi [~jrottinghuis] and [~vrushalic], thanks for your comments, and sorry for 
replying late on this as I was traveling last week.
I fully agree with Joep's comments above that there is no right or wrong schema, 
only the one that fits the priority scenarios:
- if we mostly query flow_runs under a specific flow (or flows), then making the 
flow version a column makes that query more efficient.
- if we equally (or more often) query flow_runs under specific flow version(s), 
then our decision here could be different.
To me, the tricky/interesting part is that the boundary between different flows 
and flow versions can be vague in practice: how big or small a change to a flow 
should start a new flow rather than a new flow version? Why would we keep many 
active flow versions instead of a single active version (and add more flows 
instead)? These trade-offs at the application level also shape the trade-offs in 
the schema design, which is a pretty common thing I have seen in other apps as well.
I would like to trust your priorities here given your experience with hRaven, 
which has been running well in production for years. So I agree the Phoenix 
schema should be adjusted slightly to get closer to the HBase one. 
Maybe we should open a new JIRA for this (Phoenix schema) change? We can 
either keep this JIRA open for discussion or resolve it as Later, so that in the 
future, if others in the community bring other solid scenarios from practice, we 
can continue the discussion here and try to find a better trade-off or innovation. 
Thoughts?
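To make the trade-off concrete, here is a hypothetical sketch of the two row key 
layouts being discussed; the fields, their order, and the separator are 
assumptions for illustration, not the actual timeline service schema.
{code}
// Hypothetical illustration of the two layouts; not the real schema.
public class FlowRowKeySketch {
  private static final String SEP = "!";

  // Option A: flow version is part of the row key.
  static String rowKeyWithVersion(String cluster, String user, String flow,
      String flowVersion, long runId) {
    return cluster + SEP + user + SEP + flow + SEP + flowVersion + SEP + runId;
  }

  // Option B: flow version is stored as a column; the row key stops at the run id,
  // so all runs of a flow share the prefix cluster!user!flow regardless of version.
  static String rowKeyWithoutVersion(String cluster, String user, String flow,
      long runId) {
    return cluster + SEP + user + SEP + flow + SEP + runId;
  }

  public static void main(String[] args) {
    System.out.println(rowKeyWithVersion("c1", "alice", "wordcount", "v2", 42L));
    System.out.println(rowKeyWithoutVersion("c1", "alice", "wordcount", 42L));
  }
}
{code}
With option B, scanning all runs of a flow needs only the shorter prefix, which 
matches the priority scenario above.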

 Decide if  flow version should be part of row key or column
 ---

 Key: YARN-3699
 URL: https://issues.apache.org/jira/browse/YARN-3699
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vrushali C

 Based on discussions in YARN-3411 with [~djp], filing jira for continuing 
 discussion on putting the flow version in rowkey or column. 
 Either phoenix/hbase approach will update the jira with the conclusions..



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3686) CapacityScheduler should trim default_node_label_expression

2015-06-01 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567542#comment-14567542
 ] 

Sunil G commented on YARN-3686:
---

Thank You [~leftnoteasy] for reviewing and committing the same!

 CapacityScheduler should trim default_node_label_expression
 ---

 Key: YARN-3686
 URL: https://issues.apache.org/jira/browse/YARN-3686
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api, client, resourcemanager
Reporter: Wangda Tan
Assignee: Sunil G
Priority: Critical
 Fix For: 2.7.1

 Attachments: 0001-YARN-3686.patch, 0002-YARN-3686.patch, 
 0003-YARN-3686.patch, 0004-YARN-3686.patch


 We should trim default_node_label_expression for queue before using it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3647) RMWebServices api's should use updated api from CommonNodeLabelsManager to get NodeLabel object

2015-06-01 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567546#comment-14567546
 ] 

Sunil G commented on YARN-3647:
---

Thank You [~leftnoteasy] for committing the patch.

 RMWebServices api's should use updated api from CommonNodeLabelsManager to 
 get NodeLabel object
 ---

 Key: YARN-3647
 URL: https://issues.apache.org/jira/browse/YARN-3647
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Sunil G
Assignee: Sunil G
 Fix For: 2.8.0

 Attachments: 0001-YARN-3647.patch, 0002-YARN-3647.patch


 After YARN-3579, RMWebServices apis can use the updated version of apis in 
 CommonNodeLabelsManager which gives full NodeLabel object instead of creating 
 NodeLabel object from plain label name.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-3542) Re-factor support for CPU as a resource using the new ResourceHandler mechanism

2015-06-01 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev reassigned YARN-3542:
---

Assignee: Varun Vasudev

 Re-factor support for CPU as a resource using the new ResourceHandler 
 mechanism
 ---

 Key: YARN-3542
 URL: https://issues.apache.org/jira/browse/YARN-3542
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Sidharta Seethana
Assignee: Varun Vasudev
Priority: Critical

 In YARN-3443 , a new ResourceHandler mechanism was added which enabled easier 
 addition of new resource types in the nodemanager (this was used for network 
 as a resource - See YARN-2140 ). We should refactor the existing CPU 
 implementation ( LinuxContainerExecutor/CgroupsLCEResourcesHandler ) using 
 the new ResourceHandler mechanism. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3170) YARN architecture document needs updating

2015-06-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567341#comment-14567341
 ] 

Hadoop QA commented on YARN-3170:
-

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |   2m 54s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | release audit |   0m 20s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | site |   2m 55s | Site still builds. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| | |   6m 13s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736559/YARN-3170-009.patch |
| Optional Tests | site |
| git revision | trunk / 63e3fee |
| Java | 1.7.0_55 |
| uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8150/console |


This message was automatically generated.

 YARN architecture document needs updating
 -

 Key: YARN-3170
 URL: https://issues.apache.org/jira/browse/YARN-3170
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: documentation
Reporter: Allen Wittenauer
Assignee: Brahma Reddy Battula
 Attachments: YARN-3170-002.patch, YARN-3170-003.patch, 
 YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, 
 YARN-3170-007.patch, YARN-3170-008.patch, YARN-3170-009.patch, YARN-3170.patch


 The marketing paragraph at the top, NextGen MapReduce, etc are all 
 marketing rather than actual descriptions. It also needs some general 
 updates, esp given it reads as though 0.23 was just released yesterday.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3170) YARN architecture document needs updating

2015-06-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567366#comment-14567366
 ] 

Hadoop QA commented on YARN-3170:
-

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |   3m  1s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | release audit |   0m 20s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | site |   2m 55s | Site still builds. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| | |   6m 20s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736570/YARN-3170-009.patch |
| Optional Tests | site |
| git revision | trunk / 63e3fee |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8151/console |


This message was automatically generated.

 YARN architecture document needs updating
 -

 Key: YARN-3170
 URL: https://issues.apache.org/jira/browse/YARN-3170
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: documentation
Reporter: Allen Wittenauer
Assignee: Brahma Reddy Battula
 Attachments: YARN-3170-002.patch, YARN-3170-003.patch, 
 YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, 
 YARN-3170-007.patch, YARN-3170-008.patch, YARN-3170-009.patch, YARN-3170.patch


 The marketing paragraph at the top, NextGen MapReduce, etc are all 
 marketing rather than actual descriptions. It also needs some general 
 updates, esp given it reads as though 0.23 was just released yesterday.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2900) Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500)

2015-06-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567380#comment-14567380
 ] 

Hudson commented on YARN-2900:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #204 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/204/])
YARN-2900. Application (Attempt and Container) Not Found in AHS results (xgong: 
rev 9686261ecb872ad159fac3ca44f1792143c6d7db)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/WebServices.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java
* hadoop-yarn-project/CHANGES.txt


 Application (Attempt and Container) Not Found in AHS results in Internal 
 Server Error (500)
 ---

 Key: YARN-2900
 URL: https://issues.apache.org/jira/browse/YARN-2900
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Jonathan Eagles
Assignee: Mit Desai
 Fix For: 2.7.1

 Attachments: YARN-2900-b2-2.patch, YARN-2900-b2.patch, 
 YARN-2900-branch-2.7.20150530.patch, YARN-2900.20150529.patch, 
 YARN-2900.20150530.patch, YARN-2900.20150530.patch, YARN-2900.patch, 
 YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, 
 YARN-2900.patch, YARN-2900.patch, YARN-2900.patch


 Caused by: java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128)
   at 
 org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118)
   at 
 org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222)
   at 
 org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
   at 
 org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218)
   ... 59 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3170) YARN architecture document needs updating

2015-06-01 Thread Brahma Reddy Battula (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brahma Reddy Battula updated YARN-3170:
---
Attachment: YARN-3170-009.patch

 YARN architecture document needs updating
 -

 Key: YARN-3170
 URL: https://issues.apache.org/jira/browse/YARN-3170
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: documentation
Reporter: Allen Wittenauer
Assignee: Brahma Reddy Battula
 Attachments: YARN-3170-002.patch, YARN-3170-003.patch, 
 YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, 
 YARN-3170-007.patch, YARN-3170-008.patch, YARN-3170-009.patch, YARN-3170.patch


 The marketing paragraph at the top, NextGen MapReduce, etc are all 
 marketing rather than actual descriptions. It also needs some general 
 updates, esp given it reads as though 0.23 was just released yesterday.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3170) YARN architecture document needs updating

2015-06-01 Thread Brahma Reddy Battula (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brahma Reddy Battula updated YARN-3170:
---
Attachment: (was: YARN-3170-009.patch)

 YARN architecture document needs updating
 -

 Key: YARN-3170
 URL: https://issues.apache.org/jira/browse/YARN-3170
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: documentation
Reporter: Allen Wittenauer
Assignee: Brahma Reddy Battula
 Attachments: YARN-3170-002.patch, YARN-3170-003.patch, 
 YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, 
 YARN-3170-007.patch, YARN-3170-008.patch, YARN-3170.patch


 The marketing paragraph at the top, NextGen MapReduce, etc are all 
 marketing rather than actual descriptions. It also needs some general 
 updates, esp given it reads as though 0.23 was just released yesterday.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3467) Expose allocatedMB, allocatedVCores, and runningContainers metrics on running Applications in RM Web UI

2015-06-01 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567327#comment-14567327
 ] 

Anubhav Dhoot commented on YARN-3467:
-

Did you mean it's very verbose?

 Expose allocatedMB, allocatedVCores, and runningContainers metrics on running 
 Applications in RM Web UI
 ---

 Key: YARN-3467
 URL: https://issues.apache.org/jira/browse/YARN-3467
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: webapp, yarn
Affects Versions: 2.5.0
Reporter: Anthony Rojas
Assignee: Anubhav Dhoot
Priority: Minor
 Fix For: 2.8.0

 Attachments: ApplicationAttemptPage.png, Screen Shot 2015-05-26 at 
 5.46.54 PM.png, YARN-3467.001.patch, YARN-3467.002.patch, yarn-3467-1.patch


 The YARN REST API can report on the following properties:
 *allocatedMB*: The sum of memory in MB allocated to the application's running 
 containers
 *allocatedVCores*: The sum of virtual cores allocated to the application's 
 running containers
 *runningContainers*: The number of containers currently running for the 
 application
 Currently, the RM Web UI does not report on these items (at least I couldn't 
 find any entries within the Web UI).
 It would be useful for YARN Application and Resource troubleshooting to have 
 these properties and their corresponding values exposed on the RM WebUI.
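 As a reference point, a small sketch of pulling those three fields from the 
 existing RM REST endpoint (/ws/v1/cluster/apps); the RM address is a placeholder 
 and the JSON is simply printed rather than parsed.
 {code}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: the RM REST apps listing already reports allocatedMB, allocatedVCores
// and runningContainers for each running application. "rm-host" is a placeholder.
public class RmAppsMetricsFetch {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://rm-host:8088/ws/v1/cluster/apps?states=RUNNING");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("Accept", "application/json");
    BufferedReader in =
        new BufferedReader(new InputStreamReader(conn.getInputStream()));
    String line;
    while ((line = in.readLine()) != null) {
      System.out.println(line);   // JSON containing the three fields listed above
    }
    in.close();
  }
}
 {code}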



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3170) YARN architecture document needs updating

2015-06-01 Thread Brahma Reddy Battula (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567354#comment-14567354
 ] 

Brahma Reddy Battula commented on YARN-3170:


[~aw] thanks a lot for your comments. Updated the patch based on them; kindly 
review.

 YARN architecture document needs updating
 -

 Key: YARN-3170
 URL: https://issues.apache.org/jira/browse/YARN-3170
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: documentation
Reporter: Allen Wittenauer
Assignee: Brahma Reddy Battula
 Attachments: YARN-3170-002.patch, YARN-3170-003.patch, 
 YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, 
 YARN-3170-007.patch, YARN-3170-008.patch, YARN-3170-009.patch, YARN-3170.patch


 The marketing paragraph at the top, NextGen MapReduce, etc are all 
 marketing rather than actual descriptions. It also needs some general 
 updates, esp given it reads as though 0.23 was just released yesterday.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3747) TestLocalDirsHandlerService.java: test directory logDir2 not deleted

2015-06-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567355#comment-14567355
 ] 

Hadoop QA commented on YARN-3747:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |   6m 34s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 52s | There were no new javac warning 
messages. |
| {color:green}+1{color} | release audit |   0m 20s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 29s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 34s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 32s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 12s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests |   6m 29s | Tests failed in 
hadoop-yarn-server-nodemanager. |
| | |  25m  7s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | 
hadoop.yarn.server.nodemanager.TestDockerContainerExecutor |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736555/YARN-3747.patch |
| Optional Tests | javac unit findbugs checkstyle |
| git revision | trunk / 63e3fee |
| hadoop-yarn-server-nodemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8149/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8149/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8149/console |


This message was automatically generated.

 TestLocalDirsHandlerService.java: test directory logDir2 not deleted
 

 Key: YARN-3747
 URL: https://issues.apache.org/jira/browse/YARN-3747
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test, yarn
Affects Versions: 2.7.0
Reporter: David Moore
Priority: Minor
  Labels: patch, test, yarn
 Attachments: YARN-3747.patch


 During a code review of 
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLocalDirsHandlerService.java
  I noted that logDir2 is never deleted while logDir1 is deleted twice. This 
 is not in keeping with the rest of the function and appears to be a bug. 
 I will be submitting a patch shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2900) Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500)

2015-06-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567358#comment-14567358
 ] 

Hudson commented on YARN-2900:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #2143 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2143/])
YARN-2900. Application (Attempt and Container) Not Found in AHS results (xgong: 
rev 9686261ecb872ad159fac3ca44f1792143c6d7db)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/WebServices.java


 Application (Attempt and Container) Not Found in AHS results in Internal 
 Server Error (500)
 ---

 Key: YARN-2900
 URL: https://issues.apache.org/jira/browse/YARN-2900
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Jonathan Eagles
Assignee: Mit Desai
 Fix For: 2.7.1

 Attachments: YARN-2900-b2-2.patch, YARN-2900-b2.patch, 
 YARN-2900-branch-2.7.20150530.patch, YARN-2900.20150529.patch, 
 YARN-2900.20150530.patch, YARN-2900.20150530.patch, YARN-2900.patch, 
 YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, 
 YARN-2900.patch, YARN-2900.patch, YARN-2900.patch


 Caused by: java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128)
   at 
 org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118)
   at 
 org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222)
   at 
 org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
   at 
 org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218)
   ... 59 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1042) add ability to specify affinity/anti-affinity in container requests

2015-06-01 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567291#comment-14567291
 ] 

Steve Loughran commented on YARN-1042:
--

I like this, though I'd also like PREFERRED to have two Rs in the middle :).

Thinking about how I'd use this in Slider, I'd probably want to keep the 
escalation logic, i.e. deciding when to accept shared-node placement, in my 
own code. That way the AM can choose to wait a minute or more for an 
anti-affine placement before giving up and accepting a node already in use. We 
already do that when asking for a container back on the host where an instance 
ran previously. 

 add ability to specify affinity/anti-affinity in container requests
 ---

 Key: YARN-1042
 URL: https://issues.apache.org/jira/browse/YARN-1042
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 3.0.0
Reporter: Steve Loughran
Assignee: Arun C Murthy
 Attachments: YARN-1042-demo.patch


 container requests to the AM should be able to request anti-affinity to 
 ensure that things like Region Servers don't come up on the same failure 
 zones. 
 Similarly, you may be able to want to specify affinity to same host or rack 
 without specifying which specific host/rack. Example: bringing up a small 
 giraph cluster in a large YARN cluster would benefit from having the 
 processes in the same rack purely for bandwidth reasons.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2556) Tool to measure the performance of the timeline server

2015-06-01 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567340#comment-14567340
 ] 

Chang Li commented on YARN-2556:


[~jeagles] could you please help review the latest patch? Thanks!

 Tool to measure the performance of the timeline server
 --

 Key: YARN-2556
 URL: https://issues.apache.org/jira/browse/YARN-2556
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Jonathan Eagles
Assignee: Chang Li
  Labels: BB2015-05-TBR
 Attachments: YARN-2556-WIP.patch, YARN-2556-WIP.patch, 
 YARN-2556.1.patch, YARN-2556.10.patch, YARN-2556.11.patch, 
 YARN-2556.12.patch, YARN-2556.13.patch, YARN-2556.13.whitespacefix.patch, 
 YARN-2556.14.patch, YARN-2556.14.whitespacefix.patch, YARN-2556.2.patch, 
 YARN-2556.3.patch, YARN-2556.4.patch, YARN-2556.5.patch, YARN-2556.6.patch, 
 YARN-2556.7.patch, YARN-2556.8.patch, YARN-2556.9.patch, YARN-2556.patch, 
 yarn2556.patch, yarn2556.patch, yarn2556_wip.patch


 We need to be able to understand the capacity model for the timeline server 
 to give users the tools they need to deploy a timeline server with the 
 correct capacity.
 I propose we create a mapreduce job that can measure timeline server write 
 and read performance. Transactions per second, I/O for both read and write 
 would be a good start.
 This could be done as an example or test job that could be tied into gridmix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty

2015-06-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567357#comment-14567357
 ] 

Hudson commented on YARN-3725:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #2143 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2143/])
YARN-3725. App submission via REST API is broken in secure mode due to Timeline 
DT service address is empty. (Zhijie Shen via wangda) (wangda: rev 
5cc3fced957a8471733e0e9490878bd68429fe24)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/security/TestTimelineAuthenticationFilter.java
* hadoop-yarn-project/CHANGES.txt


 App submission via REST API is broken in secure mode due to Timeline DT 
 service address is empty
 

 Key: YARN-3725
 URL: https://issues.apache.org/jira/browse/YARN-3725
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, timelineserver
Affects Versions: 2.7.0
Reporter: Zhijie Shen
Assignee: Zhijie Shen
Priority: Blocker
 Fix For: 2.7.1

 Attachments: YARN-3725.1.patch


 YARN-2971 changes TimelineClient to use the service address from the Timeline DT 
 to renew the DT instead of the configured address. This breaks the procedure of 
 submitting a YARN app via the REST API in secure mode.
 The problem is that the service address is set by the client instead of the 
 server in Java code. The REST API response is an encoded token String, so it is 
 inconvenient to deserialize it, set the service address, and serialize it 
 again. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567369#comment-14567369
 ] 

Hadoop QA commented on YARN-3749:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  18m 39s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 6 new or modified test files. |
| {color:green}+1{color} | javac |   7m 34s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 36s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   2m 20s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 35s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 32s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   4m 35s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   0m 26s | Tests passed in 
hadoop-yarn-api. |
| {color:red}-1{color} | yarn tests |  49m 27s | Tests failed in 
hadoop-yarn-client. |
| {color:green}+1{color} | yarn tests |  50m 14s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| {color:green}+1{color} | yarn tests |   1m 52s | Tests passed in 
hadoop-yarn-server-tests. |
| | | 147m 20s | |
\\
\\
|| Reason || Tests ||
| Timed out tests | org.apache.hadoop.yarn.client.api.impl.TestAMRMClient |
|   | org.apache.hadoop.yarn.client.api.impl.TestYarnClient |
|   | org.apache.hadoop.yarn.client.api.impl.TestNMClient |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736544/YARN-3749.2.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 5cc3fce |
| hadoop-yarn-api test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8148/artifact/patchprocess/testrun_hadoop-yarn-api.txt
 |
| hadoop-yarn-client test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8148/artifact/patchprocess/testrun_hadoop-yarn-client.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8148/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| hadoop-yarn-server-tests test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8148/artifact/patchprocess/testrun_hadoop-yarn-server-tests.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8148/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8148/console |


This message was automatically generated.

 We should make a copy of configuration when init MiniYARNCluster with 
 multiple RMs
 --

 Key: YARN-3749
 URL: https://issues.apache.org/jira/browse/YARN-3749
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-3749.2.patch, YARN-3749.patch


 When I was trying to write a test case for YARN-2674, I found the DS client 
 trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
 on RM failover, even though I had initially set 
 yarn.resourcemanager.address.rm1=0.0.0.0:18032 and 
 yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found that it 
 is in ClientRMService where the value of yarn.resourcemanager.address.rm2 
 gets changed to 0.0.0.0:18032. See the following code in ClientRMService:
 {code}
 clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
     YarnConfiguration.RM_ADDRESS,
     YarnConfiguration.DEFAULT_RM_ADDRESS,
     server.getListenerAddress());
 {code}
 Since we use the same Configuration instance for rm1 and rm2, and we init both 
 RMs before starting either of them, yarn.resourcemanager.ha.id is changed to 
 rm2 during the init of rm2 and is still rm2 while rm1 is starting.
 So I think it is safe to make a copy of the configuration when initializing 
 each RM.
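 A minimal sketch of that idea, assuming MiniYARNCluster can hand each RM its 
 own copy of the configuration; the helper method name is hypothetical.
 {code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Sketch: give each RM its own Configuration copy so that per-RM keys set while
// initializing one RM (e.g. yarn.resourcemanager.ha.id) cannot leak into the other.
// The helper method name is hypothetical.
public class PerRmConfCopy {
  static Configuration confForRm(Configuration base, String rmId) {
    Configuration copy = new YarnConfiguration(base);  // copy constructor
    copy.set(YarnConfiguration.RM_HA_ID, rmId);        // e.g. "rm1" or "rm2"
    return copy;
  }
}
 {code}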



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3747) TestLocalDirsHandlerService.java: test directory logDir2 not deleted

2015-06-01 Thread David Moore (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Moore updated YARN-3747:
--
Attachment: YARN-3747.patch

Please review this patch - 
Thank you


Copyright 2015 David Moore

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

 TestLocalDirsHandlerService.java: test directory logDir2 not deleted
 

 Key: YARN-3747
 URL: https://issues.apache.org/jira/browse/YARN-3747
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test, yarn
Affects Versions: 2.7.0
Reporter: David Moore
Priority: Minor
 Fix For: 2.7.0

 Attachments: YARN-3747.patch


 During a code review of 
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLocalDirsHandlerService.java
  I noted that logDir2 is never deleted while logDir1 is deleted twice. This 
 is not in keeping with the rest of the function and appears to be a bug. 
 I will be submitting a patch shortly.
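 For context, a sketch of the symmetric cleanup the description points at, 
 assuming the test removes its directories with FileUtil; the paths and variable 
 names are illustrative, not the actual test code.
 {code}
import java.io.File;
import org.apache.hadoop.fs.FileUtil;

// Sketch: delete each test directory exactly once, instead of deleting
// logDir1 twice and logDir2 never. Paths are placeholders.
public class TestDirCleanupSketch {
  public static void main(String[] args) {
    File logDir1 = new File("/tmp/logDir1");
    File logDir2 = new File("/tmp/logDir2");
    FileUtil.fullyDelete(logDir1);
    FileUtil.fullyDelete(logDir2);
  }
}
 {code}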



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3751) TestAHSWebServices fails after YARN-3467

2015-06-01 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567323#comment-14567323
 ] 

Anubhav Dhoot commented on YARN-3751:
-

There was no failure for this class in the jenkins run for YARN-3467.
The change LGTM.

 TestAHSWebServices fails after YARN-3467
 

 Key: YARN-3751
 URL: https://issues.apache.org/jira/browse/YARN-3751
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Zhijie Shen
Assignee: Sunil G
 Attachments: 0001-YARN-3751.patch


 YARN-3467 changed AppInfo and assumed that used resource is not null. That is 
 not always true, as this information is not published to the timeline server.
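 A minimal sketch of the defensive check the description implies, with assumed 
 accessor usage rather than the actual AppInfo code:
 {code}
import org.apache.hadoop.yarn.api.records.Resource;

// Sketch: guard against a null "used resources" report (the timeline server does
// not publish it) before reading memory or vcores. Names are assumptions.
public class UsedResourceGuard {
  static int allocatedMb(Resource usedResources) {
    return usedResources == null ? -1 : usedResources.getMemory();
  }

  static int allocatedVcores(Resource usedResources) {
    return usedResources == null ? -1 : usedResources.getVirtualCores();
  }
}
 {code}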



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty

2015-06-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567379#comment-14567379
 ] 

Hudson commented on YARN-3725:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #204 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/204/])
YARN-3725. App submission via REST API is broken in secure mode due to Timeline 
DT service address is empty. (Zhijie Shen via wangda) (wangda: rev 
5cc3fced957a8471733e0e9490878bd68429fe24)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/security/TestTimelineAuthenticationFilter.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java


 App submission via REST API is broken in secure mode due to Timeline DT 
 service address is empty
 

 Key: YARN-3725
 URL: https://issues.apache.org/jira/browse/YARN-3725
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, timelineserver
Affects Versions: 2.7.0
Reporter: Zhijie Shen
Assignee: Zhijie Shen
Priority: Blocker
 Fix For: 2.7.1

 Attachments: YARN-3725.1.patch


 YARN-2971 changes TimelineClient to use the service address from the Timeline DT 
 to renew the DT instead of the configured address. This breaks the procedure of 
 submitting a YARN app via the REST API in secure mode.
 The problem is that the service address is set by the client instead of the 
 server in Java code. The REST API response is an encoded token String, so it is 
 inconvenient to deserialize it, set the service address, and serialize it 
 again. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3748) Cleanup Findbugs volatile warnings

2015-06-01 Thread Sean Busbey (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567343#comment-14567343
 ] 

Sean Busbey commented on YARN-3748:
---

+1 lgtm, presuming that failed test passes locally.

nit: I think the numContainers in AbstractCSQueue can be made private now maybe?

 Cleanup Findbugs volatile warnings
 --

 Key: YARN-3748
 URL: https://issues.apache.org/jira/browse/YARN-3748
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Gabor Liptak
Priority: Minor
 Attachments: YARN-3748.1.patch, YARN-3748.2.patch, YARN-3748.3.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3170) YARN architecture document needs updating

2015-06-01 Thread Brahma Reddy Battula (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brahma Reddy Battula updated YARN-3170:
---
Attachment: YARN-3170-009.patch

 YARN architecture document needs updating
 -

 Key: YARN-3170
 URL: https://issues.apache.org/jira/browse/YARN-3170
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: documentation
Reporter: Allen Wittenauer
Assignee: Brahma Reddy Battula
 Attachments: YARN-3170-002.patch, YARN-3170-003.patch, 
 YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, 
 YARN-3170-007.patch, YARN-3170-008.patch, YARN-3170-009.patch, YARN-3170.patch


 The marketing paragraph at the top, NextGen MapReduce, etc are all 
 marketing rather than actual descriptions. It also needs some general 
 updates, esp given it reads as though 0.23 was just released yesterday.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3699) Decide if flow version should be part of row key or column

2015-06-01 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567744#comment-14567744
 ] 

Sangjin Lee commented on YARN-3699:
---

Thanks [~djp] for your comments!

Once everyone's comfortable with the decision of not making the flow version 
part of the row key, then we could resolve this JIRA by recording that decision 
(+1 or -1). Then we could open a separate JIRA for the phoenix writer to 
relocate the flow version (remove it from the PK). But a bigger question there 
is, if we're going with the native HBase schema, what is the status of the 
phoenix writer implementation?

For me (if it wasn't obvious from my previous comments), I'm +1 on the flow 
version *not* being in the row key.

 Decide if  flow version should be part of row key or column
 ---

 Key: YARN-3699
 URL: https://issues.apache.org/jira/browse/YARN-3699
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vrushali C

 Based on discussions in YARN-3411 with [~djp], filing jira for continuing 
 discussion on putting the flow version in rowkey or column. 
 Either phoenix/hbase approach will update the jira with the conclusions..



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3699) Decide if flow version should be part of row key or column

2015-06-01 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567756#comment-14567756
 ] 

Li Lu commented on YARN-3699:
-

I'm +1 on removing the flow version section from the row key. I can make the 
change to our Phoenix writer. However, I agree with [~sjlee0] that we're not 
sure about the next steps for the Phoenix writer. I'm OK with leaving it as 
an aggregation-only writer for now. 

 Decide if  flow version should be part of row key or column
 ---

 Key: YARN-3699
 URL: https://issues.apache.org/jira/browse/YARN-3699
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vrushali C

 Based on discussions in YARN-3411 with [~djp], filing jira for continuing 
 discussion on putting the flow version in rowkey or column. 
 Either phoenix/hbase approach will update the jira with the conclusions..



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3492) AM fails to come up because RM and NM can't connect to each other

2015-06-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-3492.
---
Resolution: Cannot Reproduce

Closing this based on previous comments. Please reopen this in case you run 
into it again.

 AM fails to come up because RM and NM can't connect to each other
 -

 Key: YARN-3492
 URL: https://issues.apache.org/jira/browse/YARN-3492
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.7.0
 Environment: pseudo-distributed cluster on a mac
Reporter: Karthik Kambatla
Priority: Blocker
 Attachments: mapred-site.xml, 
 yarn-kasha-nodemanager-kasha-mbp.local.log, 
 yarn-kasha-resourcemanager-kasha-mbp.local.log, yarn-site.xml


 Stood up a pseudo-distributed cluster with 2.7.0 RC0. Submitted a pi job. The 
 container gets allocated, but doesn't get launched. The NM can't talk to the 
 RM. Logs to follow. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-2872) CapacityScheduler: Add disk I/O resource to DRF

2015-06-01 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G reassigned YARN-2872:
-

Assignee: Sunil G

 CapacityScheduler: Add disk I/O resource to DRF
 ---

 Key: YARN-2872
 URL: https://issues.apache.org/jira/browse/YARN-2872
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Karthik Kambatla
Assignee: Sunil G





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3751) TestAHSWebServices fails after YARN-3467

2015-06-01 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567682#comment-14567682
 ] 

Sunil G commented on YARN-3751:
---

The Findbugs warning can be skipped: exception handling is done in another method 
in WebServices.java, so it is a false positive here. 

Existing test case in TestAHSWebServices covers this scenario.

 TestAHSWebServices fails after YARN-3467
 

 Key: YARN-3751
 URL: https://issues.apache.org/jira/browse/YARN-3751
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Zhijie Shen
Assignee: Sunil G
 Attachments: 0001-YARN-3751.patch


 YARN-3467 changed AppInfo and assumed that used resource is not null. That is 
 not always true, as this information is not published to the timeline server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2872) CapacityScheduler: Add disk I/O resource to DRF

2015-06-01 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567621#comment-14567621
 ] 

Sunil G commented on YARN-2872:
---

Hi [~kasha]
I would like to take this up for CS and will work on a patch for it.
Kindly reassign if you'd prefer otherwise.

 CapacityScheduler: Add disk I/O resource to DRF
 ---

 Key: YARN-2872
 URL: https://issues.apache.org/jira/browse/YARN-2872
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Karthik Kambatla





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs

2015-06-01 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567609#comment-14567609
 ] 

Vinod Kumar Vavilapalli commented on YARN-1462:
---

bq. Committed to trunk/branch-2/branch-2.7.
[~zjshen]/[~xgong], why are we putting this in 2.7? It looks more like an 
enhancement to me. Unless there is a strong requirement, we should revert it 
from branch-2.7.

 AHS API and other AHS changes to handle tags for completed MR jobs
 --

 Key: YARN-1462
 URL: https://issues.apache.org/jira/browse/YARN-1462
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.2.0
Reporter: Karthik Kambatla
Assignee: Xuan Gong
 Fix For: 2.7.1

 Attachments: YARN-1462-branch-2.7-1.2.patch, 
 YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, 
 YARN-1462.3.patch


 AHS related work for tags. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3751) TestAHSWebServices fails after YARN-3467

2015-06-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567672#comment-14567672
 ] 

Hadoop QA commented on YARN-3751:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  15m 56s | Pre-patch trunk has 3 extant 
Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 54s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 55s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 35s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 31s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m  1s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   0m 23s | Tests passed in 
hadoop-yarn-server-common. |
| | |  38m 16s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736418/0001-YARN-3751.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 63e3fee |
| Pre-patch Findbugs warnings | 
https://builds.apache.org/job/PreCommit-YARN-Build/8153/artifact/patchprocess/trunkFindbugsWarningshadoop-yarn-server-common.html
 |
| hadoop-yarn-server-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8153/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8153/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf908.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8153/console |


This message was automatically generated.

 TestAHSWebServices fails after YARN-3467
 

 Key: YARN-3751
 URL: https://issues.apache.org/jira/browse/YARN-3751
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Zhijie Shen
Assignee: Sunil G
 Attachments: 0001-YARN-3751.patch


 YARN-3467 changed AppInfo and assumed that used resource is not null. That is 
 not always true, as this information is not published to the timeline server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3752) TestRMFailover fails due to intermittent UnknownHostException

2015-06-01 Thread Masatake Iwasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567762#comment-14567762
 ] 

Masatake Iwasaki commented on YARN-3752:


My /etc/hosts works for a pseudo-distributed cluster and for the NameNode HA 
unit tests in HDFS.

Hostnames are successfully resolved in TestRMFailover at first, too.
{noformat}
java.net.ConnectException: Call From centos7/127.0.0.1 to 0.0.0.0:28031 failed 
on connection exception: java.net.ConnectException: Connection refused; For 
more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
...
java.io.EOFException: End of File Exception between local host is: 
centos7/127.0.0.1; destination host is: 0.0.0.0:18031; : 
java.io.EOFException; For more details see:  
http://wiki.apache.org/hadoop/EOFException
...
{noformat}

The client fails to create a connection due to UnknownHostException while it 
retries connecting to the next RM after failover.
{noformat}
java.io.IOException: java.util.concurrent.ExecutionException: 
java.net.UnknownHostException: Invalid host name: local host is: (unknown); 
destination host is: centos7:28032; java.net.UnknownHostException; For more 
details see:  http://wiki.apache.org/hadoop/UnknownHost
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1487)
at org.apache.hadoop.ipc.Client.call(Client.java:1410)
at org.apache.hadoop.ipc.Client.call(Client.java:1371)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy15.getApplications(Unknown Source)
at 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplications(ApplicationClientProtocolPBClientImpl.java:251)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
at com.sun.proxy.$Proxy16.getApplications(Unknown Source)
at 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplications(YarnClientImpl.java:484)
at 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplications(YarnClientImpl.java:461)
at 
org.apache.hadoop.yarn.client.TestRMFailover.verifyClientConnection(TestRMFailover.java:119)
at 
org.apache.hadoop.yarn.client.TestRMFailover.verifyConnections(TestRMFailover.java:133)
at 
org.apache.hadoop.yarn.client.TestRMFailover.testExplicitFailover(TestRMFailover.java:168
.

Caused by: java.net.UnknownHostException: Invalid host name: local host is: 
(unknown); destination host is: centos7:28032; java.net.UnknownHostException; 
For more details see:  http://wiki.apache.org/hadoop/UnknownHost
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:744)
at org.apache.hadoop.ipc.Client$Connection.init(Client.java:408)
at org.apache.hadoop.ipc.Client$1.call(Client.java:1483)
at org.apache.hadoop.ipc.Client$1.call(Client.java:1480)
at 
com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4767)
at 
com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568)
at 
com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350)
... 49 more
{noformat}

It may be a timing/environment issue, because the test seems to succeed in the 
QA runs.


 TestRMFailover fails due to intermittent UnknownHostException
 -

 Key: YARN-3752
 URL: https://issues.apache.org/jira/browse/YARN-3752
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Reporter: Masatake Iwasaki
Assignee: Masatake Iwasaki
Priority: Minor

 Client fails to create connection due to UnknownHostException while client 
 retries to connect to next RM after failover in unit test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs

2015-06-01 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-1462:

Fix Version/s: (was: 2.7.1)
   2.8.0

 AHS API and other AHS changes to handle tags for completed MR jobs
 --

 Key: YARN-1462
 URL: https://issues.apache.org/jira/browse/YARN-1462
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.2.0
Reporter: Karthik Kambatla
Assignee: Xuan Gong
 Fix For: 2.8.0

 Attachments: YARN-1462-branch-2.7-1.2.patch, 
 YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, 
 YARN-1462.3.patch


 AHS related work for tags. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs

2015-06-01 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567758#comment-14567758
 ] 

Xuan Gong commented on YARN-1462:
-

Okay. Reverted it from branch-2.7 and changed the fix version from 2.7.1 to 
2.8.0.

 AHS API and other AHS changes to handle tags for completed MR jobs
 --

 Key: YARN-1462
 URL: https://issues.apache.org/jira/browse/YARN-1462
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.2.0
Reporter: Karthik Kambatla
Assignee: Xuan Gong
 Fix For: 2.8.0

 Attachments: YARN-1462-branch-2.7-1.2.patch, 
 YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, 
 YARN-1462.3.patch


 AHS related work for tags. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-06-01 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567851#comment-14567851
 ] 

Jason Lowe commented on YARN-3585:
--

Thanks for the patch, Rohith!

I think it would be safer/simpler to assume we shouldn't be calling exit unless 
NodeManager.main() was invoked (i.e., we're likely running in a JVM whose sole 
purpose is to be the nodemanager).  In that sense I'm wondering if we should 
flip the logic to not exit by default but then have NodeManager.main override 
that.  This probably precludes the need to update existing tests.

We should be using ExitUtil instead of System.exit directly.

Nit: setexitOnShutdownEvent s/b setExitOnShutdownEvent
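
A minimal sketch of that suggestion, with hypothetical names and wiring (not 
the actual patch):
{code}
// Hypothetical sketch only: default to NOT exiting on a SHUTDOWN event, let
// main() opt in (the JVM there exists solely to run the NM), and go through
// ExitUtil rather than System.exit. All names here are made up.
import org.apache.hadoop.util.ExitUtil;

public class ShutdownExitSketch {
  private volatile boolean exitOnShutdownEvent = false; // safe default for embedded/test JVMs

  public void setExitOnShutdownEvent(boolean exit) {    // flipped only from main()
    this.exitOnShutdownEvent = exit;
  }

  void onShutdownEvent() {
    // ... stop services, close the recovery store, etc. ...
    if (exitOnShutdownEvent) {
      ExitUtil.terminate(0);                            // honors ExitUtil's test hooks
    }
  }

  public static void main(String[] args) {
    ShutdownExitSketch nm = new ShutdownExitSketch();
    nm.setExitOnShutdownEvent(true);                    // sole-purpose NM JVM may exit
    nm.onShutdownEvent();
  }
}
{code}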


 NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
 --

 Key: YARN-3585
 URL: https://issues.apache.org/jira/browse/YARN-3585
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang
Assignee: Rohith
Priority: Critical
 Attachments: YARN-3585.patch


 With NM recovery enabled, after decommission, nodemanager log show stop but 
 process cannot end. 
 non daemon thread:
 {noformat}
 DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
 condition [0x]
 leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
 [0x]
 VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
 Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 
 nid=0x29ed runnable 
 Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 
 nid=0x29ee runnable 
 Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 
 nid=0x29ef runnable 
 Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 
 nid=0x29f0 runnable 
 Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 
 nid=0x29f1 runnable 
 Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 
 nid=0x29f2 runnable 
 Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 
 nid=0x29f3 runnable 
 Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 
 nid=0x29f4 runnable 
 Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 
 runnable 
 Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 
 nid=0x29f5 runnable 
 Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 
 nid=0x29f6 runnable 
 VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
 on condition 
 {noformat}
 and jni leveldb thread stack
 {noformat}
 Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
 #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
 #1  0x7f33dfce2a3b in leveldb::(anonymous 
 namespace)::PosixEnv::BGThreadWrapper(void*) () from 
 /tmp/libleveldbjni-64-1-6922178968300745716.8
 #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
 #3  0x003d830e811d in clone () from /lib64/libc.so.6
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs

2015-06-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567778#comment-14567778
 ] 

Hudson commented on YARN-1462:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #7940 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/7940/])
YARN-1462. Correct fix version from branch-2.7.1 to branch-2.8 in (xgong: rev 
0b5cfacde638bc25cc010cd9236369237b4e51a8)
* hadoop-yarn-project/CHANGES.txt


 AHS API and other AHS changes to handle tags for completed MR jobs
 --

 Key: YARN-1462
 URL: https://issues.apache.org/jira/browse/YARN-1462
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.2.0
Reporter: Karthik Kambatla
Assignee: Xuan Gong
 Fix For: 2.8.0

 Attachments: YARN-1462-branch-2.7-1.2.patch, 
 YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, 
 YARN-1462.3.patch


 AHS related work for tags. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3752) TestRMFailover fails due to intermittent UnknownHostException

2015-06-01 Thread Masatake Iwasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567763#comment-14567763
 ] 

Masatake Iwasaki commented on YARN-3752:


While doing some trial and error, I found that using 127.0.0.1 instead of 
0.0.0.0 for the server addresses fixed this.
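
For illustration only, a rough sketch of that change in a test setup (the keys 
and ports are the ones quoted in YARN-3749's description; the exact wiring is 
an assumption, not a committed fix):
{code}
// Hypothetical sketch: bind the per-RM RPC addresses to 127.0.0.1 instead of
// the 0.0.0.0 wildcard so clients resolve a concrete host after failover.
import org.apache.hadoop.conf.Configuration;

public class RmTestAddressSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("yarn.resourcemanager.address.rm1", "127.0.0.1:18032");
    conf.set("yarn.resourcemanager.address.rm2", "127.0.0.1:28032");
    System.out.println(conf.get("yarn.resourcemanager.address.rm2"));
  }
}
{code}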

 TestRMFailover fails due to intermittent UnknownHostException
 -

 Key: YARN-3752
 URL: https://issues.apache.org/jira/browse/YARN-3752
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Reporter: Masatake Iwasaki
Assignee: Masatake Iwasaki
Priority: Minor

 Client fails to create connection due to UnknownHostException while client 
 retries to connect to next RM after failover in unit test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3752) TestRMFailover fails due to intermittent UnknownHostException

2015-06-01 Thread Masatake Iwasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567775#comment-14567775
 ] 

Masatake Iwasaki commented on YARN-3752:


I don't know the exact reason yet, but using independent Configuration 
instances for each RM in MiniYARNCluster, as YARN-3749 does, also worked.
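
A minimal sketch of that alternative, i.e. giving each RM its own copy of the 
configuration (illustrative only, not the YARN-3749 patch itself):
{code}
// Hypothetical sketch: copy the base Configuration per RM so that one RM's
// updateConnectAddr/ha.id changes cannot leak into the other RM or the client.
import org.apache.hadoop.conf.Configuration;

public class PerRmConfSketch {
  public static void main(String[] args) {
    Configuration base = new Configuration();
    base.set("yarn.resourcemanager.address.rm1", "0.0.0.0:18032");
    base.set("yarn.resourcemanager.address.rm2", "0.0.0.0:28032");

    Configuration rm1Conf = new Configuration(base);  // copy constructor clones the properties
    Configuration rm2Conf = new Configuration(base);

    rm1Conf.set("yarn.resourcemanager.ha.id", "rm1"); // stays local to rm1's copy
    rm2Conf.set("yarn.resourcemanager.ha.id", "rm2");

    System.out.println(base.get("yarn.resourcemanager.ha.id")); // base is untouched by the copies
  }
}
{code}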

 TestRMFailover fails due to intermittent UnknownHostException
 -

 Key: YARN-3752
 URL: https://issues.apache.org/jira/browse/YARN-3752
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Reporter: Masatake Iwasaki
Assignee: Masatake Iwasaki
Priority: Minor

 Client fails to create connection due to UnknownHostException while client 
 retries to connect to next RM after failover in unit test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-01 Thread Matthew Jacobs (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568005#comment-14568005
 ] 

Matthew Jacobs commented on YARN-2194:
--

While this may work for the default RHEL7 configuration, it will break if 
someone happens to have mounted the same controllers at 
/sys/fs/cgroup/cpuacct,cpu, or if the user mounted other controllers at the 
same path as well. What do you think about creating the symlink from 
/sys/fs/cgroup/cpu to the mounted path for cpu in all cases (unless it was 
actually mounted at /sys/fs/cgroup/cpu, of course)?
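
As an illustration only (not the actual patch; the combined mount point below 
is just the RHEL7 default), the symlink idea amounts to something like:
{code}
// Hypothetical sketch of the symlink suggestion: if the cpu controller is
// mounted at a combined path such as /sys/fs/cgroup/cpu,cpuacct, expose it
// under /sys/fs/cgroup/cpu as well, unless cpu is already mounted there.
// Needs sufficient privileges; paths here are assumptions for illustration.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CpuCgroupSymlinkSketch {
  public static void main(String[] args) throws IOException {
    Path expected = Paths.get("/sys/fs/cgroup/cpu");
    Path mounted  = Paths.get("/sys/fs/cgroup/cpu,cpuacct"); // wherever cpu is really mounted
    if (!Files.exists(expected, LinkOption.NOFOLLOW_LINKS)) {
      Files.createSymbolicLink(expected, mounted);
    }
  }
}
{code}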

 Cgroups cease to work in RHEL7
 --

 Key: YARN-2194
 URL: https://issues.apache.org/jira/browse/YARN-2194
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Wei Yan
Assignee: Wei Yan
Priority: Critical
 Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch


 In RHEL7, the CPU controller is named cpu,cpuacct. The comma in the 
 controller name leads to container launch failure. 
 RHEL7 deprecates libcgroup and recommends the use of systemd. However, 
 systemd has certain shortcomings as identified in this JIRA (see comments). 
 This JIRA only fixes the failure, and doesn't try to use systemd.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-01 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567969#comment-14567969
 ] 

Vinod Kumar Vavilapalli commented on YARN-2194:
---

Thinking out loud, should we do OS-specific checks for this?

Also, does the newer CGroupsHandlerImpl also need to change? /cc [~vvasudev].

 Cgroups cease to work in RHEL7
 --

 Key: YARN-2194
 URL: https://issues.apache.org/jira/browse/YARN-2194
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Wei Yan
Assignee: Wei Yan
Priority: Critical
 Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch


 In RHEL7, the CPU controller is named cpu,cpuacct. The comma in the 
 controller name leads to container launch failure. 
 RHEL7 deprecates libcgroup and recommends the use of systemd. However, 
 systemd has certain shortcomings as identified in this JIRA (see comments). 
 This JIRA only fixes the failure, and doesn't try to use systemd.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-01 Thread Sidharta Seethana (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568063#comment-14568063
 ] 

Sidharta Seethana commented on YARN-2194:
-

Isn't it better to use a different separator that is less likely to be in use 
(e.g. ':' or '|' instead of ',') when invoking container-executor? Granted that 
this is a (slightly) bigger change, but it seems like the right thing to do. 

 Cgroups cease to work in RHEL7
 --

 Key: YARN-2194
 URL: https://issues.apache.org/jira/browse/YARN-2194
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Wei Yan
Assignee: Wei Yan
Priority: Critical
 Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch


 In RHEL7, the CPU controller is named cpu,cpuacct. The comma in the 
 controller name leads to container launch failure. 
 RHEL7 deprecates libcgroup and recommends the use of systemd. However, 
 systemd has certain shortcomings as identified in this JIRA (see comments). 
 This JIRA only fixes the failure, and doesn't try to use systemd.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers

2015-06-01 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568128#comment-14568128
 ] 

Li Lu commented on YARN-3051:
-

Hi [~varun_saxena], thanks for the work! Not sure if you've already made 
progress since the latest patch, but I'm posting some of my comments and 
questions w.r.t. the reader API design in the 003 patch. I may have more 
comments in the near future, and I won't mind seeing a new patch before I post 
them. 

# I noticed there is a _readerLimit_ for read operations, which works for ATS 
v1. I'm wondering if it's fine to use -1 to indicate there's no such limit? Not 
sure if this feature is already there. 
# For the {{fromId}} parameter, we may need to be careful about the concept of 
an id. In timeline v2 we need context information to identify each entity, 
such as cluster, user, flow, and run. When querying with {{fromId}}, what kind 
of assumptions should we make about the id here? Are we assuming all entities 
are of the same cluster, user, and/or flow, or is the id a concatenation of 
all this information, or is it something else? 
# For all filter-related parameters, I'm not sure if the current object model 
and storage implementation support a trivial solution. I'd certainly welcome 
any comments/suggestions on this problem. 
# Based on the previous two issues, a more general question is: shall we focus 
on an evolution of the v1 API here, or start a v2 reader API design from 
scratch and then try to make it compatible with the v1 APIs? The current patch 
looks to be pursuing the evolution approach. 
# In some APIs, we're requiring clusterID and appID, but not having flow/run 
information. In the current writer implementations, this indicates a full table 
scan. Maybe we can have flow and run information as optional parameters so that 
we can avoid full table scans when the caller does have flow and run 
information?
# The current APIs require a pretty long list of parameters. For most of the 
use cases, I think we can abstract something much simpler. Do we plan to add 
those simpler APIs in a higher layer? Having a lot of nulls when calling the 
reader API looks suboptimal, but with only these few APIs we may need to do 
this frequently (a possible shape is sketched below). 
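
Purely as a hypothetical illustration of such a higher-layer shape (every name 
below is invented; this is not the API in the patch), a small builder could 
keep the common case free of nulls while flow/run stay optional:
{code}
// Hypothetical sketch only: a builder-style query context that hides the long
// parameter list of the reader API for the common cases. Invented names.
import java.util.Objects;

public class TimelineReaderQuerySketch {
  private String cluster, user, flowName, flowRunId, appId, entityType;
  private long limit = -1; // -1 meaning "no limit", as discussed above

  public TimelineReaderQuerySketch cluster(String c)    { this.cluster = c; return this; }
  public TimelineReaderQuerySketch user(String u)       { this.user = u; return this; }
  public TimelineReaderQuerySketch flow(String f)       { this.flowName = f; return this; }
  public TimelineReaderQuerySketch flowRun(String r)    { this.flowRunId = r; return this; }
  public TimelineReaderQuerySketch app(String a)        { this.appId = a; return this; }
  public TimelineReaderQuerySketch entityType(String t) { this.entityType = t; return this; }
  public TimelineReaderQuerySketch limit(long l)        { this.limit = l; return this; }

  @Override
  public String toString() {
    // unset parts default to "*" rather than forcing callers to pass nulls
    return String.join("/",
        Objects.toString(cluster, "*"), Objects.toString(user, "*"),
        Objects.toString(flowName, "*"), Objects.toString(flowRunId, "*"),
        Objects.toString(appId, "*"), Objects.toString(entityType, "*"))
        + " limit=" + limit;
  }

  public static void main(String[] args) {
    // common case: no explicit nulls; providing flow/run narrows the scan
    System.out.println(new TimelineReaderQuerySketch()
        .cluster("c1").user("alice").flow("wordcount").entityType("YARN_CONTAINER").limit(100));
  }
}
{code}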

 [Storage abstraction] Create backing storage read interface for ATS readers
 ---

 Key: YARN-3051
 URL: https://issues.apache.org/jira/browse/YARN-3051
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Affects Versions: YARN-2928
Reporter: Sangjin Lee
Assignee: Varun Saxena
 Attachments: YARN-3051-YARN-2928.003.patch, 
 YARN-3051-YARN-2928.03.patch, YARN-3051.wip.02.YARN-2928.patch, 
 YARN-3051.wip.patch, YARN-3051_temp.patch


 Per design in YARN-2928, create backing storage read interface that can be 
 implemented by multiple backing storage implementations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3542) Re-factor support for CPU as a resource using the new ResourceHandler mechanism

2015-06-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-3542:
--
Target Version/s: 2.8.0

Let's do this in the 2.8 timeline before the two implementations diverge more.

 Re-factor support for CPU as a resource using the new ResourceHandler 
 mechanism
 ---

 Key: YARN-3542
 URL: https://issues.apache.org/jira/browse/YARN-3542
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Sidharta Seethana
Assignee: Varun Vasudev
Priority: Critical

 In YARN-3443 , a new ResourceHandler mechanism was added which enabled easier 
 addition of new resource types in the nodemanager (this was used for network 
 as a resource - See YARN-2140 ). We should refactor the existing CPU 
 implementation ( LinuxContainerExecutor/CgroupsLCEResourcesHandler ) using 
 the new ResourceHandler mechanism. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3748) Cleanup Findbugs volatile warnings

2015-06-01 Thread Gabor Liptak (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Liptak updated YARN-3748:
---
Attachment: (was: YARN-3748.4.patch)

 Cleanup Findbugs volatile warnings
 --

 Key: YARN-3748
 URL: https://issues.apache.org/jira/browse/YARN-3748
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Gabor Liptak
Priority: Minor
 Attachments: YARN-3748.1.patch, YARN-3748.2.patch, YARN-3748.3.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3542) Re-factor support for CPU as a resource using the new ResourceHandler mechanism

2015-06-01 Thread Sidharta Seethana (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568083#comment-14568083
 ] 

Sidharta Seethana commented on YARN-3542:
-

+1 to this

 Re-factor support for CPU as a resource using the new ResourceHandler 
 mechanism
 ---

 Key: YARN-3542
 URL: https://issues.apache.org/jira/browse/YARN-3542
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Sidharta Seethana
Assignee: Varun Vasudev
Priority: Critical

 In YARN-3443 , a new ResourceHandler mechanism was added which enabled easier 
 addition of new resource types in the nodemanager (this was used for network 
 as a resource - See YARN-2140 ). We should refactor the existing CPU 
 implementation ( LinuxContainerExecutor/CgroupsLCEResourcesHandler ) using 
 the new ResourceHandler mechanism. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3542) Re-factor support for CPU as a resource using the new ResourceHandler mechanism

2015-06-01 Thread Sidharta Seethana (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568084#comment-14568084
 ] 

Sidharta Seethana commented on YARN-3542:
-

+1 to this

 Re-factor support for CPU as a resource using the new ResourceHandler 
 mechanism
 ---

 Key: YARN-3542
 URL: https://issues.apache.org/jira/browse/YARN-3542
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Sidharta Seethana
Assignee: Varun Vasudev
Priority: Critical

 In YARN-3443 , a new ResourceHandler mechanism was added which enabled easier 
 addition of new resource types in the nodemanager (this was used for network 
 as a resource - See YARN-2140 ). We should refactor the existing CPU 
 implementation ( LinuxContainerExecutor/CgroupsLCEResourcesHandler ) using 
 the new ResourceHandler mechanism. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3753) RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out

2015-06-01 Thread Sumana Sathish (JIRA)
Sumana Sathish created YARN-3753:


 Summary: RM failed to come up with java.io.IOException: Wait for 
ZKClient creation timed out
 Key: YARN-3753
 URL: https://issues.apache.org/jira/browse/YARN-3753
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: Sumana Sathish
Priority: Critical


RM failed to come up with the following error while submitting a MapReduce job.
{code:title=RM log}
015-05-30 03:40:12,190 ERROR recovery.RMStateStore 
(RMStateStore.java:transition(179)) - Error storing app: 
application_1432956515242_0006
java.io.IOException: Wait for ZKClient creation timed out
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
at java.lang.Thread.run(Thread.java:745)
2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager 
(ResourceManager.java:handle(750)) - Received a 
org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
STATE_STORE_OP_FAILED. Cause:
java.io.IOException: Wait for ZKClient creation timed out
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
at 

[jira] [Assigned] (YARN-3753) RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out

2015-06-01 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He reassigned YARN-3753:
-

Assignee: Jian He

 RM failed to come up with java.io.IOException: Wait for ZKClient creation 
 timed out
 -

 Key: YARN-3753
 URL: https://issues.apache.org/jira/browse/YARN-3753
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: Sumana Sathish
Assignee: Jian He
Priority: Critical

 RM failed to come up with the following error while submitting an mapreduce 
 job.
 {code:title=RM log}
 015-05-30 03:40:12,190 ERROR recovery.RMStateStore 
 (RMStateStore.java:transition(179)) - Error storing app: 
 application_1432956515242_0006
 java.io.IOException: Wait for ZKClient creation timed out
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
   at java.lang.Thread.run(Thread.java:745)
 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager 
 (ResourceManager.java:handle(750)) - Received a 
 org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause:
 java.io.IOException: Wait for ZKClient creation timed out
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
 

[jira] [Commented] (YARN-3753) RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out

2015-06-01 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568145#comment-14568145
 ] 

Jian He commented on YARN-3753:
---

This happens because this exception, {{new IOException("Wait for ZKClient 
creation timed out")}}, is not retried by the upper-level runWithRetries 
method, which causes the RM to fail. We've seen quite a few issues regarding 
the retry logic of the zk-store; YARN-2716 should be the long-term solution to 
fix all of these. In the interim, I'm writing a quick work-around patch for 
this, as this problem makes the RM unavailable. 

 RM failed to come up with java.io.IOException: Wait for ZKClient creation 
 timed out
 -

 Key: YARN-3753
 URL: https://issues.apache.org/jira/browse/YARN-3753
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: Sumana Sathish
Assignee: Jian He
Priority: Critical

 RM failed to come up with the following error while submitting an mapreduce 
 job.
 {code:title=RM log}
 015-05-30 03:40:12,190 ERROR recovery.RMStateStore 
 (RMStateStore.java:transition(179)) - Error storing app: 
 application_1432956515242_0006
 java.io.IOException: Wait for ZKClient creation timed out
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
   at java.lang.Thread.run(Thread.java:745)
 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager 
 (ResourceManager.java:handle(750)) - Received a 
 org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause:
 java.io.IOException: Wait for ZKClient creation timed out
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
   at 

[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-01 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568187#comment-14568187
 ] 

zhihai xu commented on YARN-3749:
-

Hi [~chenchun], thanks for filing and working on this issue.
The patch seems reasonable to me.
Some nits: 
1. It looks like setRpcAddressForRM and setConfForRM are only used by test 
code. Should we create a new HA test utility file to include these functions?

2. Do we really need the following change in {{MiniYARNCluster#serviceInit}}?
{code}
conf.set(YarnConfiguration.RM_HA_ID, "rm0");
{code}
I ask because I saw that {{initResourceManager}} will also configure {{RM_HA_ID}}.

3. Is there any particular reason to configure {{YarnConfiguration.RM_HA_ID}} as 
{{RM2_NODE_ID}} instead of {{RM1_NODE_ID}} in ProtocolHATestBase?
{code}
conf.set(YarnConfiguration.RM_HA_ID, RM2_NODE_ID);
{code}

 We should make a copy of configuration when init MiniYARNCluster with 
 multiple RMs
 --

 Key: YARN-3749
 URL: https://issues.apache.org/jira/browse/YARN-3749
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-3749.2.patch, YARN-3749.patch


 When I was trying to write a test case for YARN-2674, I found DS client 
 trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
 when RM failover. But I initially set 
 yarn.resourcemanager.address.rm1=0.0.0.0:18032, 
 yarn.resourcemanager.address.rm2=0.0.0.0:28032  After digging, I found it is 
 in ClientRMService where the value of yarn.resourcemanager.address.rm2 
 changed to 0.0.0.0:18032. See the following code in ClientRMService:
 {code}
 clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
YarnConfiguration.RM_ADDRESS,

 YarnConfiguration.DEFAULT_RM_ADDRESS,
server.getListenerAddress());
 {code}
 Since we use the same instance of configuration in rm1 and rm2 and init both 
 RM before we start both RM, we will change yarn.resourcemanager.ha.id to rm2 
 during init of rm2 and yarn.resourcemanager.ha.id will become rm2 during 
 starting of rm1.
 So I think it is safe to make a copy of configuration when init both of the 
 rm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs

2015-06-01 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568189#comment-14568189
 ] 

Sergey Shelukhin commented on YARN-1462:


This commit changes the newInstance API, breaking the Tez build. It is hard to 
make Tez compatible with both pre-2.8 and 2.8... is it possible to preserve 
both versions of the method?

 AHS API and other AHS changes to handle tags for completed MR jobs
 --

 Key: YARN-1462
 URL: https://issues.apache.org/jira/browse/YARN-1462
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.2.0
Reporter: Karthik Kambatla
Assignee: Xuan Gong
 Fix For: 2.8.0

 Attachments: YARN-1462-branch-2.7-1.2.patch, 
 YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, 
 YARN-1462.3.patch


 AHS related work for tags. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3753) RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out

2015-06-01 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568222#comment-14568222
 ] 

Karthik Kambatla commented on YARN-3753:


[~jianhe] - YARN-2716 is ready for review. I can make time for addressing any 
comments to get this in for trunk and branch-2. Given that, would it make sense 
to limit this fix to branch-2.7? 

 RM failed to come up with java.io.IOException: Wait for ZKClient creation 
 timed out
 -

 Key: YARN-3753
 URL: https://issues.apache.org/jira/browse/YARN-3753
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: Sumana Sathish
Assignee: Jian He
Priority: Critical

 RM failed to come up with the following error while submitting an mapreduce 
 job.
 {code:title=RM log}
 015-05-30 03:40:12,190 ERROR recovery.RMStateStore 
 (RMStateStore.java:transition(179)) - Error storing app: 
 application_1432956515242_0006
 java.io.IOException: Wait for ZKClient creation timed out
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
   at java.lang.Thread.run(Thread.java:745)
 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager 
 (ResourceManager.java:handle(750)) - Received a 
 org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause:
 java.io.IOException: Wait for ZKClient creation timed out
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
  

[jira] [Commented] (YARN-3753) RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out

2015-06-01 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568284#comment-14568284
 ] 

Jian He commented on YARN-3753:
---

[~kasha], sure, makes sense. This can go into branch-2.7 only, and YARN-2716 
can go in for trunk and branch-2. 

 RM failed to come up with java.io.IOException: Wait for ZKClient creation 
 timed out
 -

 Key: YARN-3753
 URL: https://issues.apache.org/jira/browse/YARN-3753
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: Sumana Sathish
Assignee: Jian He
Priority: Critical

 RM failed to come up with the following error while submitting an mapreduce 
 job.
 {code:title=RM log}
 015-05-30 03:40:12,190 ERROR recovery.RMStateStore 
 (RMStateStore.java:transition(179)) - Error storing app: 
 application_1432956515242_0006
 java.io.IOException: Wait for ZKClient creation timed out
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
   at java.lang.Thread.run(Thread.java:745)
 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager 
 (ResourceManager.java:handle(750)) - Received a 
 org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause:
 java.io.IOException: Wait for ZKClient creation timed out
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 

[jira] [Comment Edited] (YARN-3753) RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out

2015-06-01 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568292#comment-14568292
 ] 

Jian He edited comment on YARN-3753 at 6/2/15 12:32 AM:


Uploaded a patch to set the wait time based on numRetries * retry-interval. 

I reproduced this issue locally in the following way.
1. start ZK
2. start RM
3. kill ZK
4. submit a job 
  - without the patch, the RM fails with the same IOException (Wait for 
ZKClient creation timed out) 
  - with the patch, after restarting the ZK server, the RM and the job continue 
to run successfully. 


was (Author: jianhe):
Upload a patch to set the wait time based on numRetries*retry-interval. 

I reproduced this issue locally in following way.
1. start RM.
2. start ZK.
3. kill ZK.
4. submit a job 
  - without the patch, RM will fail with the same IOException(Wait for 
ZKClient creation timed out) 
 - with the patch, after re-start ZK server, RM and job can continue run 
successfully. 

 RM failed to come up with java.io.IOException: Wait for ZKClient creation 
 timed out
 -

 Key: YARN-3753
 URL: https://issues.apache.org/jira/browse/YARN-3753
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: Sumana Sathish
Assignee: Jian He
Priority: Critical
 Attachments: YARN-3753.patch


 RM failed to come up with the following error while submitting an mapreduce 
 job.
 {code:title=RM log}
 015-05-30 03:40:12,190 ERROR recovery.RMStateStore 
 (RMStateStore.java:transition(179)) - Error storing app: 
 application_1432956515242_0006
 java.io.IOException: Wait for ZKClient creation timed out
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
   at java.lang.Thread.run(Thread.java:745)
 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager 
 (ResourceManager.java:handle(750)) - Received a 
 org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause:
 java.io.IOException: Wait for ZKClient creation timed out
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
   at 
 

[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-01 Thread Masatake Iwasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568429#comment-14568429
 ] 

Masatake Iwasaki commented on YARN-3749:


Thanks for working on this, [~chenchun]. I would like this fix to come in 
because it seems to affect YARN-3752, which I'm looking into.

{quote}
2. Do we really need the following change at MiniYARNCluster#serviceInit

   conf.set(YarnConfiguration.RM_HA_ID, "rm0");

Because I saw initResourceManager will also configure RM_HA_ID.
{quote}

When I tried something similar to the patch, I got the error below, because 
{{HAUtil#getRMHAId}}, called from {{YarnConfiguration#updateConnectAddr}}, 
expects that there is at most one RM id matching the node.

{noformat}
  2015-06-02 10:14:23,648 INFO  [Thread-284] service.AbstractService 
(AbstractService.java:noteFailure(272)) - Service 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService
 failed in state STARTED; cause: 
org.apache.hadoop.HadoopIllegalArgumentException: The HA Configuration has 
multiple addresses that match local node's address.
  org.apache.hadoop.HadoopIllegalArgumentException: The HA Configuration has 
multiple addresses that match local node's address.
  at org.apache.hadoop.yarn.conf.HAUtil.getRMHAId(HAUtil.java:204)
  at 
org.apache.hadoop.yarn.conf.YarnConfiguration.updateConnectAddr(YarnConfiguration.java:1971)
  at 
org.apache.hadoop.conf.Configuration.updateConnectAddr(Configuration.java:2129)
  at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.serviceStart(ResourceLocalizationService.java:357)
  at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
  at 
org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
  at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:467)
  at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
  at 
org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
  at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:321)
  at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
  at 
org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper$1.run(MiniYARNCluster.java:562)
{noformat}

The check can be bypassed by setting a dummy value for 
{{yarn.resourcemanager.ha.id}} in the configuration *used by the NodeManager 
instance*. I think there should at least be a comment explaining that it is a 
dummy value for unit tests.
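
A minimal sketch of that bypass, for unit tests only (the "rm-dummy" value is 
made up for illustration):
{code}
// Hypothetical sketch: give the NodeManager's own copy of the configuration a
// dummy yarn.resourcemanager.ha.id so HAUtil#getRMHAId does not try to match
// the local address against multiple RM ids. Test-only; never meaningful on a
// real cluster.
import org.apache.hadoop.conf.Configuration;

public class NmDummyHaIdSketch {
  public static void main(String[] args) {
    Configuration nmConf = new Configuration();
    nmConf.set("yarn.resourcemanager.ha.id", "rm-dummy"); // dummy value for the NM's config copy
    System.out.println(nmConf.get("yarn.resourcemanager.ha.id"));
  }
}
{code}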


 We should make a copy of configuration when init MiniYARNCluster with 
 multiple RMs
 --

 Key: YARN-3749
 URL: https://issues.apache.org/jira/browse/YARN-3749
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-3749.2.patch, YARN-3749.patch


 When I was trying to write a test case for YARN-2674, I found DS client 
 trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
 when RM failover. But I initially set 
 yarn.resourcemanager.address.rm1=0.0.0.0:18032, 
 yarn.resourcemanager.address.rm2=0.0.0.0:28032  After digging, I found it is 
 in ClientRMService where the value of yarn.resourcemanager.address.rm2 
 changed to 0.0.0.0:18032. See the following code in ClientRMService:
 {code}
 clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
YarnConfiguration.RM_ADDRESS,

 YarnConfiguration.DEFAULT_RM_ADDRESS,
server.getListenerAddress());
 {code}
 Since we use the same instance of configuration in rm1 and rm2 and init both 
 RM before we start both RM, we will change yarn.resourcemanager.ha.id to rm2 
 during init of rm2 and yarn.resourcemanager.ha.id will become rm2 during 
 starting of rm1.
 So I think it is safe to make a copy of configuration when init both of the 
 rm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-06-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568462#comment-14568462
 ] 

Rohith commented on YARN-3733:
--

This fix needs to go into 2.7.1. Updated the target version to 2.7.1.

  On RM restart AM getting more than maximum possible memory when many  tasks 
 in queue
 -

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Blocker
 Attachments: YARN-3733.patch


 Steps to reproduce
 =
 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
 size to 512 MB
 3. Configure capacity scheduler and AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent tasks 
 5. Switch RM
 Actual
 =
 For 12 jobs the AM gets allocated and all 12 start running
 No other Yarn child is initiated, *all 12 jobs stay in the Running state forever*
 Expected
 ===
 Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched

2015-06-01 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-3754:
--

 Summary: Race condition when the NodeManager is shutting down and 
container is launched
 Key: YARN-3754
 URL: https://issues.apache.org/jira/browse/YARN-3754
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
 Environment: Suse 11 Sp3
Reporter: Bibin A Chundatt



The container is launched and returned to ContainerImpl after the NodeManager 
has already closed the DB connection, which results in 
{{org.iq80.leveldb.DBException: Closed}}. 


*Attaching the exception trace*
{code}
2015-05-30 02:11:49,122 WARN 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
 Unable to update state store diagnostics for 
container_e310_1432817693365_3338_01_02
java.io.IOException: org.iq80.leveldb.DBException: Closed
at 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.iq80.leveldb.DBException: Closed
at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123)
at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106)
at 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259)
... 15 more

{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched

2015-06-01 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt reassigned YARN-3754:
--

Assignee: Bibin A Chundatt

 Race condition when the NodeManager is shutting down and container is launched
 --

 Key: YARN-3754
 URL: https://issues.apache.org/jira/browse/YARN-3754
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
 Environment: Suse 11 Sp3
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt

 The container is launched and returned to ContainerImpl after the NodeManager 
 has already closed the DB connection, which results in 
 {{org.iq80.leveldb.DBException: Closed}}. 
 *Attaching the exception trace*
 {code}
 2015-05-30 02:11:49,122 WARN 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
  Unable to update state store diagnostics for 
 container_e310_1432817693365_3338_01_02
 java.io.IOException: org.iq80.leveldb.DBException: Closed
 at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83)
 at 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: org.iq80.leveldb.DBException: Closed
 at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123)
 at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106)
 at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259)
 ... 15 more
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched

2015-06-01 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G reassigned YARN-3754:
-

Assignee: Sunil G  (was: Bibin A Chundatt)

 Race condition when the NodeManager is shutting down and container is launched
 --

 Key: YARN-3754
 URL: https://issues.apache.org/jira/browse/YARN-3754
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
 Environment: Suse 11 Sp3
Reporter: Bibin A Chundatt
Assignee: Sunil G

 The container is launched and returned to ContainerImpl after the NodeManager 
 has already closed the DB connection, which results in 
 {{org.iq80.leveldb.DBException: Closed}}. 
 *Attaching the exception trace*
 {code}
 2015-05-30 02:11:49,122 WARN 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
  Unable to update state store diagnostics for 
 container_e310_1432817693365_3338_01_02
 java.io.IOException: org.iq80.leveldb.DBException: Closed
 at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83)
 at 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: org.iq80.leveldb.DBException: Closed
 at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123)
 at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106)
 at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259)
 ... 15 more
 {code}
 We can add a check for whether the DB is closed while we move the container 
 out of the ACQUIRED state.
 As per the discussion in YARN-3585, the same has been added here.
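
A minimal sketch of the kind of guard being proposed; the isClosed() accessor and the interface below are hypothetical stand-ins, not the real NM state store API:
{code}
import java.io.IOException;

// Illustrative only: skip the state-store update if the NM is already shutting down
// and has closed its leveldb handle, instead of surfacing DBException: Closed.
public class DiagnosticsStoreGuardSketch {
  interface StateStore {               // stand-in for the NM state store service
    boolean isClosed();                // hypothetical accessor
    void storeContainerDiagnostics(String containerId, String diagnostics) throws IOException;
  }

  static void safeStoreDiagnostics(StateStore store, String containerId, String diagnostics) {
    if (store.isClosed()) {
      return; // NM shutdown in progress; dropping the diagnostics update is harmless
    }
    try {
      store.storeContainerDiagnostics(containerId, diagnostics);
    } catch (IOException e) {
      // log and continue; diagnostics updates are best-effort
    }
  }
}
{code}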



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3170) YARN architecture document needs updating

2015-06-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568573#comment-14568573
 ] 

Hadoop QA commented on YARN-3170:
-

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |   2m 58s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | release audit |   0m 20s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | site |   2m 55s | Site still builds. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| | |   6m 18s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736746/YARN-3170-010.patch |
| Optional Tests | site |
| git revision | trunk / 990078b |
| Java | 1.7.0_55 |
| uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8162/console |


This message was automatically generated.

 YARN architecture document needs updating
 -

 Key: YARN-3170
 URL: https://issues.apache.org/jira/browse/YARN-3170
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: documentation
Reporter: Allen Wittenauer
Assignee: Brahma Reddy Battula
 Attachments: YARN-3170-002.patch, YARN-3170-003.patch, 
 YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, 
 YARN-3170-007.patch, YARN-3170-008.patch, YARN-3170-009.patch, 
 YARN-3170-010.patch, YARN-3170.patch


 The marketing paragraph at the top, NextGen MapReduce, etc are all 
 marketing rather than actual descriptions. It also needs some general 
 updates, esp given it reads as though 0.23 was just released yesterday.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-01 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-3749:

Attachment: YARN-3749.7.patch

 We should make a copy of configuration when init MiniYARNCluster with 
 multiple RMs
 --

 Key: YARN-3749
 URL: https://issues.apache.org/jira/browse/YARN-3749
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, 
 YARN-3749.5.patch, YARN-3749.6.patch, YARN-3749.7.patch, YARN-3749.patch


 When I was trying to write a test case for YARN-2674, I found DS client 
 trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
 when RM failover. But I initially set 
 yarn.resourcemanager.address.rm1=0.0.0.0:18032, 
 yarn.resourcemanager.address.rm2=0.0.0.0:28032  After digging, I found it is 
 in ClientRMService where the value of yarn.resourcemanager.address.rm2 
 changed to 0.0.0.0:18032. See the following code in ClientRMService:
 {code}
 clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
YarnConfiguration.RM_ADDRESS,

 YarnConfiguration.DEFAULT_RM_ADDRESS,
server.getListenerAddress());
 {code}
 Since we use the same instance of configuration in rm1 and rm2 and init both 
 RM before we start both RM, we will change yarn.resourcemanager.ha.id to rm2 
 during init of rm2 and yarn.resourcemanager.ha.id will become rm2 during 
 starting of rm1.
 So I think it is safe to make a copy of configuration when init both of the 
 rm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3753) RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out

2015-06-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568409#comment-14568409
 ] 

Hadoop QA commented on YARN-3753:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m  8s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 37s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 36s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 47s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 36s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 26s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests |  50m 33s | Tests failed in 
hadoop-yarn-server-resourcemanager. |
| | |  88m 42s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | 
hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStoreZKClientConnections
 |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736694/YARN-3753.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / cdc13ef |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8154/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8154/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8154/console |


This message was automatically generated.

 RM failed to come up with java.io.IOException: Wait for ZKClient creation 
 timed out
 -

 Key: YARN-3753
 URL: https://issues.apache.org/jira/browse/YARN-3753
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: Sumana Sathish
Assignee: Jian He
Priority: Critical
 Attachments: YARN-3753.patch


 RM failed to come up with the following error while submitting a MapReduce 
 job.
 {code:title=RM log}
 015-05-30 03:40:12,190 ERROR recovery.RMStateStore 
 (RMStateStore.java:transition(179)) - Error storing app: 
 application_1432956515242_0006
 java.io.IOException: Wait for ZKClient creation timed out
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 

[jira] [Commented] (YARN-3170) YARN architecture document needs updating

2015-06-01 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568428#comment-14568428
 ] 

Tsuyoshi Ozawa commented on YARN-3170:
--

{quote}
The Scheduler has a pluggable policy plug-in
{quote}

I think Allen means the sentence is awkward, since "pluggable ... plug-in" sounds 
redundant. Could you fix it?



 YARN architecture document needs updating
 -

 Key: YARN-3170
 URL: https://issues.apache.org/jira/browse/YARN-3170
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: documentation
Reporter: Allen Wittenauer
Assignee: Brahma Reddy Battula
 Attachments: YARN-3170-002.patch, YARN-3170-003.patch, 
 YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, 
 YARN-3170-007.patch, YARN-3170-008.patch, YARN-3170-009.patch, YARN-3170.patch


 The marketing paragraph at the top, NextGen MapReduce, etc are all 
 marketing rather than actual descriptions. It also needs some general 
 updates, esp given it reads as though 0.23 was just released yesterday.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-06-01 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568466#comment-14568466
 ] 

Sunil G commented on YARN-3733:
---

I feel clusterResource=0,0, lhs=1,1 and rhs=2,2 may happen. But we 
cannot differentiate which infinity is bigger here, and that is not correct. Why 
don't we check for clusterResource=0,0 prior to the *getResourceAsValue()* check 
and handle it there? 
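
A rough sketch of the zero-cluster-resource guard being discussed; this is illustrative only (not the actual patch), and the fallback comparison shown is just one possible choice:
{code}
import org.apache.hadoop.yarn.api.records.Resource;

// Illustrative guard: when the cluster has no resources yet (e.g. right after an RM
// restart, before NMs re-register), avoid computing shares against a zero-sized
// cluster resource and fall back to a plain comparison of the two resources.
public class CompareGuardSketch {
  static int compareWithGuard(Resource clusterResource, Resource lhs, Resource rhs) {
    if (clusterResource.getMemory() == 0 && clusterResource.getVirtualCores() == 0) {
      int cmp = Integer.compare(lhs.getMemory(), rhs.getMemory());
      return cmp != 0 ? cmp : Integer.compare(lhs.getVirtualCores(), rhs.getVirtualCores());
    }
    // otherwise defer to the normal dominant-share based comparison (not shown)
    return 0;
  }
}
{code}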

  On RM restart AM getting more than maximum possible memory when many  tasks 
 in queue
 -

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Blocker
 Attachments: YARN-3733.patch


 Steps to reproduce
 =
 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
 size to 512 MB
 3. Configure capacity scheduler and AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent tasks 
 5. Switch RM
 Actual
 =
 For 12 jobs the AM gets allocated and all 12 start running
 No other Yarn child is initiated, *all 12 jobs stay in the Running state forever*
 Expected
 ===
 Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched

2015-06-01 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3754:
---
Description: 
The container is launched and returned to ContainerImpl after the NodeManager 
has already closed the DB connection, which results in 
{{org.iq80.leveldb.DBException: Closed}}. 


*Attaching the exception trace*
{code}
2015-05-30 02:11:49,122 WARN 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
 Unable to update state store diagnostics for 
container_e310_1432817693365_3338_01_02
java.io.IOException: org.iq80.leveldb.DBException: Closed
at 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.iq80.leveldb.DBException: Closed
at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123)
at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106)
at 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259)
... 15 more

{code}


We can add a check for whether the DB is closed while we move the container 
out of the ACQUIRED state.

As per the discussion in YARN-3585, the same has been added here.

  was:

Container is launched and returned to ContainerImpl
NodeManager closed the DB connection which resulting in 
{{org.iq80.leveldb.DBException: Closed}}. 


*Attaching the exception trace*
{code}
2015-05-30 02:11:49,122 WARN 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
 Unable to update state store diagnostics for 
container_e310_1432817693365_3338_01_02
java.io.IOException: org.iq80.leveldb.DBException: Closed
at 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at 

[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-06-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568539#comment-14568539
 ] 

Rohith commented on YARN-3585:
--

Thanks [~jlowe] for the review.

bq. if we should flip the logic to not exit but then have NodeManager.main 
override that. This probably precludes the need to update existing tests.
Makes sense to me. Changed the logic to call JVM exit only when the NodeManager 
is instantiated from the main function.

bq. We should be using ExitUtil instead of System.exit directly.
Done

bq. Nit: setexitOnShutdownEvent s/b setExitOnShutdownEvent
This method is not necessary now, since the patch assumes the flag is true only 
when called from the main function. I have removed it.

Kindly review the updated patch
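
A rough sketch of the shape described above, assuming a flag that only NodeManager.main() flips; the class and field names are illustrative, not the actual patch:
{code}
import org.apache.hadoop.util.ExitUtil;

// Illustrative only: the JVM is terminated on a SHUTDOWN event only when the
// NodeManager was started from main(); tests that embed a NodeManager are unaffected.
public class NmShutdownSketch {
  private boolean shouldExitOnShutdownEvent = false; // default: do not kill the JVM

  void onShutdownEvent() {
    // stop services, close the state store, etc. (not shown)
    if (shouldExitOnShutdownEvent) {
      // ExitUtil.terminate is preferred over System.exit; it cooperates with
      // ExitUtil.disableSystemExit() in tests.
      ExitUtil.terminate(-1);
    }
  }

  public static void main(String[] args) {
    NmShutdownSketch nm = new NmShutdownSketch();
    nm.shouldExitOnShutdownEvent = true; // only the real daemon exits the JVM
    // ... init and start services (not shown)
  }
}
{code}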

 NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
 --

 Key: YARN-3585
 URL: https://issues.apache.org/jira/browse/YARN-3585
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang
Assignee: Rohith
Priority: Critical
 Attachments: 0001-YARN-3585.patch, YARN-3585.patch


 With NM recovery enabled, after decommission, nodemanager log show stop but 
 process cannot end. 
 non daemon thread:
 {noformat}
 DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
 condition [0x]
 leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
 [0x]
 VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
 Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 
 nid=0x29ed runnable 
 Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 
 nid=0x29ee runnable 
 Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 
 nid=0x29ef runnable 
 Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 
 nid=0x29f0 runnable 
 Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 
 nid=0x29f1 runnable 
 Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 
 nid=0x29f2 runnable 
 Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 
 nid=0x29f3 runnable 
 Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 
 nid=0x29f4 runnable 
 Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 
 runnable 
 Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 
 nid=0x29f5 runnable 
 Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 
 nid=0x29f6 runnable 
 VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
 on condition 
 {noformat}
 and jni leveldb thread stack
 {noformat}
 Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
 #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
 #1  0x7f33dfce2a3b in leveldb::(anonymous 
 namespace)::PosixEnv::BGThreadWrapper(void*) () from 
 /tmp/libleveldbjni-64-1-6922178968300745716.8
 #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
 #3  0x003d830e811d in clone () from /lib64/libc.so.6
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-01 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-3749:

Attachment: YARN-3749.4.patch

 We should make a copy of configuration when init MiniYARNCluster with 
 multiple RMs
 --

 Key: YARN-3749
 URL: https://issues.apache.org/jira/browse/YARN-3749
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, 
 YARN-3749.patch


 When I was trying to write a test case for YARN-2674, I found DS client 
 trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
 when RM failover. But I initially set 
 yarn.resourcemanager.address.rm1=0.0.0.0:18032, 
 yarn.resourcemanager.address.rm2=0.0.0.0:28032  After digging, I found it is 
 in ClientRMService where the value of yarn.resourcemanager.address.rm2 
 changed to 0.0.0.0:18032. See the following code in ClientRMService:
 {code}
 clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
YarnConfiguration.RM_ADDRESS,

 YarnConfiguration.DEFAULT_RM_ADDRESS,
server.getListenerAddress());
 {code}
 Since we use the same instance of configuration in rm1 and rm2 and init both 
 RM before we start both RM, we will change yarn.resourcemanager.ha.id to rm2 
 during init of rm2 and yarn.resourcemanager.ha.id will become rm2 during 
 starting of rm1.
 So I think it is safe to make a copy of configuration when init both of the 
 rm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-06-01 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568467#comment-14568467
 ] 

Sunil G commented on YARN-3733:
---

I feel clusterResource=0,0, lhs=1,1 and rhs=2,2 may happen. But we 
cannot differentiate which infinity is bigger here, and that is not correct. Why 
don't we check for clusterResource=0,0 prior to the *getResourceAsValue()* check 
and handle it there? 

  On RM restart AM getting more than maximum possible memory when many  tasks 
 in queue
 -

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Blocker
 Attachments: YARN-3733.patch


 Steps to reproduce
 =
 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
 size to 512 MB
 3. Configure capacity scheduler and AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent tasks 
 5. Switch RM
 Actual
 =
 For 12 jobs the AM gets allocated and all 12 start running
 No other Yarn child is initiated, *all 12 jobs stay in the Running state forever*
 Expected
 ===
 Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs

2015-06-01 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568433#comment-14568433
 ] 

Zhijie Shen commented on YARN-1462:
---

bq. This commit changes newInstance API, breaking Tez build.

{{newInstance}} is marked as \@Private, and it's not supposed to be used outside 
Hadoop. What's the use case in Tez?

bq. is it possible to preserve both versions of the method?

It's possible, but the question is whether we should do it. Theoretically, 
compatibility is not required for a private method. If there is a strong use case 
for creating app reports outside Hadoop, we should mark this method \@Public and 
keep it compatible across releases.

 AHS API and other AHS changes to handle tags for completed MR jobs
 --

 Key: YARN-1462
 URL: https://issues.apache.org/jira/browse/YARN-1462
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.2.0
Reporter: Karthik Kambatla
Assignee: Xuan Gong
 Fix For: 2.8.0

 Attachments: YARN-1462-branch-2.7-1.2.patch, 
 YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, 
 YARN-1462.3.patch


 AHS related work for tags. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-01 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-3749:

Attachment: YARN-3749.5.patch

 We should make a copy of configuration when init MiniYARNCluster with 
 multiple RMs
 --

 Key: YARN-3749
 URL: https://issues.apache.org/jira/browse/YARN-3749
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, 
 YARN-3749.5.patch, YARN-3749.patch


 When I was trying to write a test case for YARN-2674, I found DS client 
 trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
 when RM failover. But I initially set 
 yarn.resourcemanager.address.rm1=0.0.0.0:18032, 
 yarn.resourcemanager.address.rm2=0.0.0.0:28032  After digging, I found it is 
 in ClientRMService where the value of yarn.resourcemanager.address.rm2 
 changed to 0.0.0.0:18032. See the following code in ClientRMService:
 {code}
 clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
YarnConfiguration.RM_ADDRESS,

 YarnConfiguration.DEFAULT_RM_ADDRESS,
server.getListenerAddress());
 {code}
 Since we use the same instance of configuration in rm1 and rm2 and init both 
 RM before we start both RM, we will change yarn.resourcemanager.ha.id to rm2 
 during init of rm2 and yarn.resourcemanager.ha.id will become rm2 during 
 starting of rm1.
 So I think it is safe to make a copy of configuration when init both of the 
 rm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-06-01 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568499#comment-14568499
 ] 

Bibin A Chundatt commented on YARN-3585:


[~rohithsharma] and [~sunilg], I have added JIRA YARN-3754 to track the DB 
connection close issue.

 NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
 --

 Key: YARN-3585
 URL: https://issues.apache.org/jira/browse/YARN-3585
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang
Assignee: Rohith
Priority: Critical
 Attachments: YARN-3585.patch


 With NM recovery enabled, after decommission, nodemanager log show stop but 
 process cannot end. 
 non daemon thread:
 {noformat}
 DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
 condition [0x]
 leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
 [0x]
 VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
 Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 
 nid=0x29ed runnable 
 Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 
 nid=0x29ee runnable 
 Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 
 nid=0x29ef runnable 
 Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 
 nid=0x29f0 runnable 
 Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 
 nid=0x29f1 runnable 
 Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 
 nid=0x29f2 runnable 
 Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 
 nid=0x29f3 runnable 
 Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 
 nid=0x29f4 runnable 
 Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 
 runnable 
 Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 
 nid=0x29f5 runnable 
 Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 
 nid=0x29f6 runnable 
 VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
 on condition 
 {noformat}
 and jni leveldb thread stack
 {noformat}
 Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
 #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
 #1  0x7f33dfce2a3b in leveldb::(anonymous 
 namespace)::PosixEnv::BGThreadWrapper(void*) () from 
 /tmp/libleveldbjni-64-1-6922178968300745716.8
 #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
 #3  0x003d830e811d in clone () from /lib64/libc.so.6
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

