[jira] [Resolved] (YARN-218) Distinguish between "failed" and "killed" app attempts

2014-09-15 Thread Tom White (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom White resolved YARN-218.

Resolution: Duplicate

Fixed in YARN-614.

> Distinguish between "failed" and "killed" app attempts
> -
>
> Key: YARN-218
> URL: https://issues.apache.org/jira/browse/YARN-218
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Reporter: Tom White
>Assignee: Tom White
>
> A "failed" app attempt is one that failed due to an error in the user 
> program, as opposed to one that was "killed" by the system. Like in MapReduce 
> task attempts, we should distinguish the two so that killed attempts do not 
> count against the number of retries (yarn.resourcemanager.am.max-retries).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1029) Allow embedding leader election into the RM

2013-12-16 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849016#comment-13849016
 ] 

Tom White commented on YARN-1029:
-

> RM HA uses ZK itself for shared storage, so it already has a dependency on ZK.

This is true when using the ZKRMStateStore, but there are other stores, like 
the FileSystemRMStateStore, which don't introduce a ZK dependency. However, I 
agree with your and Karthik's argument for not needing an external ZKFC option, 
or at least for doing this JIRA before YARN-1177. Supporting different 
RMStateStore implementations for RM HA is more work and potentially confusing 
for users, so we could simply tell users that RM HA requires the 
ZKRMStateStore, and since leader election is embedded in the RM there is no 
external ZKFC to run.
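
For illustration, a minimal yarn-site.xml sketch of the ZK-backed store being 
described (property names are from memory and may vary between releases, so 
treat this as an assumption rather than a reference):

{noformat}
<!-- Persist RM state in ZooKeeper so a restarted/standby RM can recover it -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
{noformat}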

> Allow embedding leader election into the RM
> ---
>
> Key: YARN-1029
> URL: https://issues.apache.org/jira/browse/YARN-1029
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bikas Saha
>Assignee: Karthik Kambatla
> Attachments: embedded-zkfc-approach.patch, yarn-1029-0.patch, 
> yarn-1029-0.patch, yarn-1029-approach.patch
>
>
> It should be possible to embed common ActiveStandyElector into the RM such 
> that ZooKeeper based leader election and notification is in-built. In 
> conjunction with a ZK state store, this configuration will be a simple 
> deployment option.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1028) Add FailoverProxyProvider like capability to RMProxy

2013-12-13 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13847432#comment-13847432
 ] 

Tom White commented on YARN-1028:
-

Thanks for addressing the feedback I gave. +1 for the latest patch.

> Add FailoverProxyProvider like capability to RMProxy
> 
>
> Key: YARN-1028
> URL: https://issues.apache.org/jira/browse/YARN-1028
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bikas Saha
>Assignee: Karthik Kambatla
> Attachments: yarn-1028-1.patch, yarn-1028-2.patch, yarn-1028-3.patch, 
> yarn-1028-4.patch, yarn-1028-5.patch, yarn-1028-6.patch, yarn-1028-7.patch, 
> yarn-1028-8.patch, yarn-1028-draft-cumulative.patch
>
>
> RMProxy layer currently abstracts RM discovery and implements it by looking 
> up service information from configuration. Motivated by HDFS and using 
> existing classes from Common, we can add failover proxy providers that may 
> provide RM discovery in extensible ways.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1028) Add FailoverProxyProvider like capability to RMProxy

2013-12-12 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13846395#comment-13846395
 ] 

Tom White commented on YARN-1028:
-

Thanks for the explanation of how failover works, Karthik. I think the failover 
configuration is much better now - the patch is very close. Just a few minor 
comments:

* The YarnFailoverProxyProvider interface is an improvement. It might be good 
to have "RM" in its name since it is about RM failover. Ditto for 
ConfiguredFailoverProxyProvider.
* It would be nice to have YarnClientImpl still report which RM it submitted to 
- the logical name when HA is enabled, the host/port when not.
* Nit: TestRMFailover has a spurious log message LOG.error("KK").
* Nit: YARN_MINI_CLUSTER_USE_RPC and DEFAULT_YARN_MINI_CLUSTER_USE_RPC - should 
be "MINICLUSTER" (as one word) for consistency with existing names.

> Add FailoverProxyProvider like capability to RMProxy
> 
>
> Key: YARN-1028
> URL: https://issues.apache.org/jira/browse/YARN-1028
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bikas Saha
>Assignee: Karthik Kambatla
> Attachments: yarn-1028-1.patch, yarn-1028-2.patch, yarn-1028-3.patch, 
> yarn-1028-4.patch, yarn-1028-5.patch, yarn-1028-6.patch, 
> yarn-1028-draft-cumulative.patch
>
>
> RMProxy layer currently abstracts RM discovery and implements it by looking 
> up service information from configuration. Motivated by HDFS and using 
> existing classes from Common, we can add failover proxy providers that may 
> provide RM discovery in extensible ways.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1029) Allow embedding leader election into the RM

2013-12-11 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845491#comment-13845491
 ] 

Tom White commented on YARN-1029:
-

Implementing ActiveStandbyElector sounds like a good approach, and the patch is 
a good start. From a work-sequencing point of view, wouldn't it be preferable to 
implement the standalone ZKFC first, since it will share a lot of the code with 
HDFS (i.e. implement the equivalent of DFSZKFailoverController)?

> Allow embedding leader election into the RM
> ---
>
> Key: YARN-1029
> URL: https://issues.apache.org/jira/browse/YARN-1029
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bikas Saha
>Assignee: Karthik Kambatla
> Attachments: yarn-1029-approach.patch
>
>
> It should be possible to embed common ActiveStandyElector into the RM such 
> that ZooKeeper based leader election and notification is in-built. In 
> conjunction with a ZK state store, this configuration will be a simple 
> deployment option.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1028) Add FailoverProxyProvider like capability to RMProxy

2013-12-11 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845471#comment-13845471
 ] 

Tom White commented on YARN-1028:
-

It looks like the behaviour in this patch differs from the way failover is 
implemented for HDFS HA, where it is controlled by the dfs.client.failover 
settings (e.g. dfs.client.failover.max.attempts is configured explicitly rather 
than being calculated from the IPC settings). Would having corresponding 
settings for RM HA make sense, e.g. for configuration consistency and a 
well-tested code path?

Why do you need both YarnFailoverProxyProvider and 
ConfiguredFailoverProxyProvider? The latter should be sufficient; it might also 
be called RMFailoverProxyProvider.
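
For reference, the HDFS client failover settings being referred to look roughly 
like this (values shown are illustrative, not recommendations); a corresponding 
yarn.client.failover.* family is the hypothetical analogue being suggested:

{noformat}
<!-- HDFS HA client-side failover settings (illustrative values) -->
<property>
  <name>dfs.client.failover.max.attempts</name>
  <value>15</value>
</property>
<property>
  <name>dfs.client.failover.sleep.base.millis</name>
  <value>500</value>
</property>
<property>
  <name>dfs.client.failover.sleep.max.millis</name>
  <value>15000</value>
</property>
{noformat}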

> Add FailoverProxyProvider like capability to RMProxy
> 
>
> Key: YARN-1028
> URL: https://issues.apache.org/jira/browse/YARN-1028
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bikas Saha
>Assignee: Karthik Kambatla
> Attachments: yarn-1028-1.patch, yarn-1028-2.patch, yarn-1028-3.patch, 
> yarn-1028-4.patch, yarn-1028-5.patch, yarn-1028-draft-cumulative.patch
>
>
> RMProxy layer currently abstracts RM discovery and implements it by looking 
> up service information from configuration. Motivated by HDFS and using 
> existing classes from Common, we can add failover proxy providers that may 
> provide RM discovery in extensible ways.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1144) Unmanaged AMs registering a tracking URI should not be proxy-fied

2013-09-09 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13761732#comment-13761732
 ] 

Tom White commented on YARN-1144:
-

+1

> Unmanaged AMs registering a tracking URI should not be proxy-fied
> -
>
> Key: YARN-1144
> URL: https://issues.apache.org/jira/browse/YARN-1144
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.1.0-beta
>Reporter: Alejandro Abdelnur
>Assignee: Alejandro Abdelnur
>Priority: Critical
> Fix For: 2.1.1-beta
>
> Attachments: YARN-1144.patch, YARN-1144.patch, YARN-1144.patch
>
>
> Unmanaged AMs do not run in the cluster, their tracking URL should not be 
> proxy-fied.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-884) AM expiry interval should be set to smaller of {am, nm}.liveness-monitor.expiry-interval-ms

2013-07-12 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13706826#comment-13706826
 ] 

Tom White commented on YARN-884:


The approach seems reasonable to me. +1 for the patch.

> AM expiry interval should be set to smaller of {am, 
> nm}.liveness-monitor.expiry-interval-ms
> ---
>
> Key: YARN-884
> URL: https://issues.apache.org/jira/browse/YARN-884
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.0.4-alpha
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>  Labels: configuration
> Attachments: yarn-884-1.patch
>
>
> As the AM can't outlive the NM on which it is running, it is a good idea to 
> disallow setting the am.liveness-monitor.expiry-interval-ms to a value higher 
> than nm.liveness-monitor.expiry-interval-ms

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-789) Enable zero capabilities resource requests in fair scheduler

2013-06-14 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13683466#comment-13683466
 ] 

Tom White commented on YARN-789:


This looks good to me, +1. The increment setting only applies to the fair 
scheduler now.

> Enable zero capabilities resource requests in fair scheduler
> 
>
> Key: YARN-789
> URL: https://issues.apache.org/jira/browse/YARN-789
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.0.4-alpha
>Reporter: Alejandro Abdelnur
>Assignee: Alejandro Abdelnur
> Attachments: YARN-789.patch, YARN-789.patch, YARN-789.patch
>
>
> Per discussion in YARN-689, reposting updated use case:
> 1. I have a set of services co-existing with a Yarn cluster.
> 2. These services run out of band from Yarn. They are not started as yarn 
> containers and they don't use Yarn containers for processing.
> 3. These services use, dynamically, different amounts of CPU and memory based 
> on their load. They manage their CPU and memory requirements independently. 
> In other words, depending on their load, they may require more CPU but not 
> memory or vice-versa.
> By using YARN as the RM for these services I'm able to share and utilize the 
> resources of the cluster appropriately and in a dynamic way. Yarn keeps tabs 
> on all the resources.
> These services run an AM that reserves resources on their behalf. When this 
> AM gets the requested resources, the services bump up their CPU/memory 
> utilization out of band from Yarn. If the Yarn allocations are 
> released/preempted, the services back off on their resource utilization. By 
> doing this, Yarn and these services correctly share the cluster resources, 
> with the Yarn RM being the only one that does the overall resource bookkeeping.
> The services' AM, so as not to break the lifecycle of containers, starts 
> containers in the corresponding NMs. These container processes basically 
> sleep forever (i.e. sleep 1d). They use almost no CPU or memory 
> (less than 1MB). Thus it is reasonable to assume their required CPU and 
> memory utilization is NIL (more on hard enforcement later). Because of this 
> almost-NIL utilization of CPU and memory, it is possible to specify zero as 
> one of the dimensions (CPU or memory) when making a request.
> The current limitation is that the increment is also the minimum.
> If we set the memory increment to 1MB, then when doing a pure CPU request we 
> would have to specify 1MB of memory. That would work, but it would allow 
> arbitrary memory requests without the desired normalization (increments of 
> 256, 512, etc).
> If we set the CPU increment to 1 CPU, then when doing a pure memory request 
> we would have to specify 1 CPU. CPU amounts are much smaller than memory 
> amounts, and because we don't have fractional CPUs, all my pure memory 
> requests would waste 1 CPU, reducing the overall utilization of the cluster.
> Finally, on hard enforcement:
> * For CPU, hard enforcement can be done via a cgroup cpu controller. By using 
> an absolute minimum of a few CPU shares (i.e. 10) in the LinuxContainerExecutor 
> we ensure there are enough CPU cycles to run the sleep process. This absolute 
> minimum would only kick in if zero is allowed; otherwise it will never kick 
> in, as the shares for 1 CPU are 1024.
> * For memory, hard enforcement is currently done by ProcfsBasedProcessTree.java; 
> using an absolute minimum of 1 or 2 MB would take care of zero memory 
> resources. Again, this absolute minimum would only kick in if zero is allowed; 
> otherwise it will never kick in, as the memory increment is several MB if not 
> 1GB.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-803) factor out scheduler config validation from the ResourceManager to each scheduler implementation

2013-06-14 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13683454#comment-13683454
 ] 

Tom White commented on YARN-803:


Looks good to me. Is there a reason FairScheduler#validateConf is public? +1 
otherwise.

> factor out scheduler config validation from the ResourceManager to each 
> scheduler implementation
> 
>
> Key: YARN-803
> URL: https://issues.apache.org/jira/browse/YARN-803
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, scheduler
>Affects Versions: 2.0.4-alpha
>Reporter: Alejandro Abdelnur
>Assignee: Alejandro Abdelnur
> Attachments: YARN-803.patch, YARN-803.patch
>
>
> Per discussion in YARN-789 we should factor out from the ResourceManager 
> class the scheduler config validations.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-689) Add multiplier unit to resourcecapabilities

2013-06-07 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677937#comment-13677937
 ] 

Tom White commented on YARN-689:


+1 to Hitesh and Bikas' points about minimum (and increment) being an internal 
scheduling artifact, and removing it from the API (or at least making it clear 
that AMs shouldn't use it as a multiplier).

> Add multiplier unit to resourcecapabilities
> ---
>
> Key: YARN-689
> URL: https://issues.apache.org/jira/browse/YARN-689
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, scheduler
>Affects Versions: 2.0.4-alpha
>Reporter: Alejandro Abdelnur
>Assignee: Alejandro Abdelnur
> Attachments: YARN-689.patch, YARN-689.patch, YARN-689.patch, 
> YARN-689.patch, YARN-689.patch
>
>
> Currently we are overloading the minimum resource value as the actual multiplier 
> used by the scheduler.
> Today with a minimum memory set to 1GB, requests for 1.5GB are always 
> translated to allocation of 2GB.
> We should decouple the minimum allocation from the multiplier.
> The multiplier should also be exposed to the client via the 
> RegisterApplicationMasterResponse

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-689) Add multiplier unit to resourcecapabilities

2013-06-06 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13676843#comment-13676843
 ] 

Tom White commented on YARN-689:


> The multiplier seems to be one approach to do this, though as Arun says it 
> might make the code in the scheduler complicated to get right.

The change doesn't make the code more complicated. Instead of the range of 
allowable requests being [1, n] (normalized by the minimum), it becomes [m, n], 
for configurable m and n, with m = 1 as the default.

> Add multiplier unit to resourcecapabilities
> ---
>
> Key: YARN-689
> URL: https://issues.apache.org/jira/browse/YARN-689
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, scheduler
>Affects Versions: 2.0.4-alpha
>Reporter: Alejandro Abdelnur
>Assignee: Alejandro Abdelnur
> Attachments: YARN-689.patch, YARN-689.patch, YARN-689.patch, 
> YARN-689.patch, YARN-689.patch
>
>
> Currently we are overloading the minimum resource value as the actual multiplier 
> used by the scheduler.
> Today with a minimum memory set to 1GB, requests for 1.5GB are always 
> translated to allocation of 2GB.
> We should decouple the minimum allocation from the multiplier.
> The multiplier should also be exposed to the client via the 
> RegisterApplicationMasterResponse

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-689) Add multiplier unit to resourcecapabilities

2013-06-05 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13676096#comment-13676096
 ] 

Tom White commented on YARN-689:


> the change in the yarn-default.xml from 32 to 4 is because 
> YarnConfiguration.java defines the default to 4, so it is just putting them 
> in synch

That makes sense.

> Add multiplier unit to resourcecapabilities
> ---
>
> Key: YARN-689
> URL: https://issues.apache.org/jira/browse/YARN-689
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: api, scheduler
>Affects Versions: 2.0.4-alpha
>Reporter: Alejandro Abdelnur
>Assignee: Alejandro Abdelnur
> Attachments: YARN-689.patch, YARN-689.patch, YARN-689.patch, 
> YARN-689.patch, YARN-689.patch
>
>
> Currently we are overloading the minimum resource value as the actual multiplier 
> used by the scheduler.
> Today with a minimum memory set to 1GB, requests for 1.5GB are always 
> translated to allocation of 2GB.
> We should decouple the minimum allocation from the multiplier.
> The multiplier should also be exposed to the client via the 
> RegisterApplicationMasterResponse

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-689) Add multiplier unit to resourcecapabilities

2013-06-05 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675988#comment-13675988
 ] 

Tom White commented on YARN-689:



I had a look at the latest patch. Why is 
yarn.scheduler.maximum-allocation-vcores in yarn-default.xml changed from 32 to 
4? Apart from that change it looks good to me. +1

Arun, I don't know why you are so resistant to this change. It is a 
backwards-compatible change for users. Indeed it doesn't change the behaviour 
at all by default. It's not a complicated or risky patch. It adds an extra 
configuration knob so that instead of the increment being the minimum value (of 
memory or cores) it can be a different value to allow for more fine-grained 
control in some cases, such as the one that Sandy 
[outlined|https://issues.apache.org/jira/browse/YARN-689?focusedCommentId=13661159&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13661159].
 How would you configure YARN to handle this situation?

I don't understand the point about fragmentation either. In the scenario you 
[describe 
above|https://issues.apache.org/jira/browse/YARN-689?focusedCommentId=13673517&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13673517],
 I think you are talking about a situation where AMs are asking for containers 
that are greater than 1GB in size, e.g. 1.7 or 1.9GB. If so, then today if you 
set the minimum to 128MB (to give finer-grained increments) you open yourself 
up to the same possibility of fragmentation since there could be nodes with 
<1GB left if AMs ask for over 1GB. The patch in this JIRA will not change that 
situation (set minimum to 1GB, increment to 128MB) - there will still be the 
same number of nodes with <1GB left and the AMs will not be able to use them. 
So I don't see how this change makes the situation worse.

In fact I think this change could improve utilization. If you set the minimum 
to 1GB then a request for 1.7 or 1.9GB will be changed to 2GB, but this might 
not be what the user wants since the extra memory may not be needed (after all, 
the AM didn't ask for it) and is a form of poor utilization since it means 
there's certainly no way that memory can be used by other containers. In the 
case of a smaller increment (128MB) the extra memory would not be allocated to 
containers that are not using it, so it can be used by other containers, e.g. 3 
x 1.7GB requests and one 1.9GB request could fit in a node with 7GB, whereas 
only three of them could fit if the minimum was 1GB.
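
To make the arithmetic concrete, here is a minimal sketch (not the scheduler's 
actual code) of the normalization being discussed, assuming requests are 
rounded up to the increment and clamped to the configured minimum and maximum:

{noformat}
public class NormalizeSketch {
  /** Round a memory request (MB) up to the increment, clamped to [min, max]. */
  static int normalize(int requestMb, int minMb, int incrementMb, int maxMb) {
    int roundedUp = ((requestMb + incrementMb - 1) / incrementMb) * incrementMb;
    return Math.min(maxMb, Math.max(minMb, roundedUp));
  }

  public static void main(String[] args) {
    // With minimum = increment = 1024 MB, a 1.7GB (1741 MB) request becomes 2GB.
    System.out.println(normalize(1741, 1024, 1024, 8192)); // 2048
    // With the minimum still 1024 MB but a finer 128 MB increment, the same
    // request becomes 1792 MB, leaving the difference for other containers.
    System.out.println(normalize(1741, 1024, 128, 8192)); // 1792
  }
}
{noformat}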

> Add multiplier unit to resourcecapabilities
> ---
>
> Key: YARN-689
> URL: https://issues.apache.org/jira/browse/YARN-689
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: api, scheduler
>Affects Versions: 2.0.4-alpha
>Reporter: Alejandro Abdelnur
>Assignee: Alejandro Abdelnur
> Attachments: YARN-689.patch, YARN-689.patch, YARN-689.patch, 
> YARN-689.patch, YARN-689.patch
>
>
> Currently we are overloading the minimum resource value as the actual multiplier 
> used by the scheduler.
> Today with a minimum memory set to 1GB, requests for 1.5GB are always 
> translated to allocation of 2GB.
> We should decouple the minimum allocation from the multiplier.
> The multiplier should also be exposed to the client via the 
> RegisterApplicationMasterResponse

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-689) Add multiplier unit to resourcecapabilities

2013-06-03 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673582#comment-13673582
 ] 

Tom White commented on YARN-689:


It seems that there's a concern about impacting CS with this change - if so, 
then have CapacityScheduler just return minCapability for 
getIncrementResourceCapability().

Also, can you keep the two arg constructor for ClusterInfo by setting 
incrementCapability to minCapability by default? Then unrelated tests wouldn't 
need to change.


> Add multiplier unit to resourcecapabilities
> ---
>
> Key: YARN-689
> URL: https://issues.apache.org/jira/browse/YARN-689
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: api, scheduler
>Affects Versions: 2.0.4-alpha
>Reporter: Alejandro Abdelnur
>Assignee: Alejandro Abdelnur
> Attachments: YARN-689.patch, YARN-689.patch, YARN-689.patch, 
> YARN-689.patch
>
>
> Currently we are overloading the minimum resource value as the actual multiplier 
> used by the scheduler.
> Today with a minimum memory set to 1GB, requests for 1.5GB are always 
> translated to allocation of 2GB.
> We should decouple the minimum allocation from the multiplier.
> The multiplier should also be exposed to the client via the 
> RegisterApplicationMasterResponse

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-595) Refactor fair scheduler to use common Resources

2013-04-24 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13640612#comment-13640612
 ] 

Tom White commented on YARN-595:


+1

> Refactor fair scheduler to use common Resources
> ---
>
> Key: YARN-595
> URL: https://issues.apache.org/jira/browse/YARN-595
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Affects Versions: 2.0.3-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
> Attachments: YARN-595-1.patch, YARN-595.patch, YARN-595.patch
>
>
> resourcemanager.fair and resourcemanager.resources have two copies of 
> basically the same code for operations on Resource objects

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-142) Change YARN APIs to throw IOException

2013-02-07 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573516#comment-13573516
 ] 

Tom White commented on YARN-142:


> I think it'd be useful to have the APIs throw IOException and 
> YarnRemoteException. The IOException indicating errors from the RPC layer, 
> YarnException indicating errors from Yarn itself.

I see the latest patch has

{noformat}throws 
UnknownApplicationException,YarnRemoteException,IOException{noformat}

even though UnknownApplicationException is a subclass of YarnRemoteException, 
and YarnRemoteException is a subclass of IOException. It would be simpler to 
make the method signature

{noformat}throws IOException{noformat}

and draw attention to the different subclasses in the javadoc if needed.
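
For example, a method like getApplicationReport could declare just IOException 
and document the subclasses (an illustrative sketch only, not the actual patch):

{noformat}
/**
 * @throws UnknownApplicationException if the application is not known to the RM
 * @throws YarnRemoteException for other errors reported by YARN
 * @throws IOException for errors from the RPC layer
 */
GetApplicationReportResponse getApplicationReport(GetApplicationReportRequest request)
    throws IOException;
{noformat}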

> Change YARN APIs to throw IOException
> -
>
> Key: YARN-142
> URL: https://issues.apache.org/jira/browse/YARN-142
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 0.23.3, 2.0.0-alpha
>Reporter: Siddharth Seth
>Assignee: Xuan Gong
>Priority: Critical
> Attachments: YARN-142.1.patch, YARN-142.2.patch, YARN-142.3.patch, 
> YARN-142.4.patch
>
>
> Ref: MAPREDUCE-4067
> All YARN APIs currently throw YarnRemoteException.
> 1) This cannot be extended in it's current form.
> 2) The RPC layer can throw IOExceptions. These end up showing up as 
> UndeclaredThrowableExceptions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-371) Resource-centric compression in AM-RM protocol limits scheduling

2013-02-06 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572374#comment-13572374
 ] 

Tom White commented on YARN-371:


Glad this turned into a good debate :)

On the task-centric vs resource-centric approach - I agree with Arun and 
Bobby's points. Supporting two protocols is not to be done lightly, and we'd 
need a strong motivating use case. Perhaps there is one, but the change may not 
be something we need to do before 2.0 goes GA.

The restrictions on how you specify locations for resource requests are 
interesting. You have to say "node 1, rack 1, *" - it's not sufficient to say 
"node 1" or "rack 1" or even "node 1, rack 1". The schedulers use the * (ALL) 
request to calculate the total resources for an application, but could they 
support the ones without ALL with no protocol changes? E.g. calculate the total 
resources for an app by summing the lower-level requests? Perhaps this could be 
explored in another JIRA.

More generally, the request styles that Bobby mentions - gang scheduling, 
any-rack placement (where containers are on the same rack for locality, but it 
doesn't matter which one), one-per-node placement (e.g. for HBase) - are all 
beyond the current system. How well YARN supports non-MR workloads 
will be determined to some extent by how well it can support these request 
styles. Perhaps we should simply make sure that the API is flexible enough to 
accommodate changes. I see ResourceRequest is an abstract class so it would be 
possible to add a method in the future in a compatible way to support optional 
extra information that the scheduler might be able to use. Is that sufficient, 
or should some of the YARN APIs be downgraded from @Stable to provide an option 
to change them to support alternative request styles in the 2.x timeframe?
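
As a concrete reminder of the current shape of the protocol, here is a rough 
sketch of what an AM sends today to ask for a single node-local container - 
three separate requests that the scheduler treats independently. The record API 
(Records.newRecord, setHostName, etc.) is written from memory of the 2.0-era 
classes, so take it as an approximation:

{noformat}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;
import org.apache.hadoop.yarn.util.Records;

public class ResourceRequestSketch {
  static ResourceRequest request(String location, int memoryMb, int containers) {
    ResourceRequest req = Records.newRecord(ResourceRequest.class);
    req.setHostName(location);                 // a host, a rack, or "*"
    Resource capability = Records.newRecord(Resource.class);
    capability.setMemory(memoryMb);
    req.setCapability(capability);
    req.setNumContainers(containers);
    Priority priority = Records.newRecord(Priority.class);
    priority.setPriority(1);
    req.setPriority(priority);
    return req;
  }

  public static void main(String[] args) {
    // One task that prefers node1 on rack1 still needs all three entries:
    ResourceRequest nodeLocal = request("node1", 1024, 1);
    ResourceRequest rackLocal = request("/rack1", 1024, 1);
    ResourceRequest any = request("*", 1024, 1);
  }
}
{noformat}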

> Resource-centric compression in AM-RM protocol limits scheduling
> 
>
> Key: YARN-371
> URL: https://issues.apache.org/jira/browse/YARN-371
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api, resourcemanager, scheduler
>Affects Versions: 2.0.2-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>
> Each AMRM heartbeat consists of a list of resource requests. Currently, each 
> resource request consists of a container count, a resource vector, and a 
> location, which may be a node, a rack, or "*". When an application wishes to 
> request a task run in multiple locations, it must issue a request for each 
> location.  This means that for a node-local task, it must issue three 
> requests, one at the node-level, one at the rack-level, and one with * (any). 
> These requests are not linked with each other, so when a container is 
> allocated for one of them, the RM has no way of knowing which others to get 
> rid of. When a node-local container is allocated, this is handled by 
> decrementing the number of requests on that node's rack and in *. But when 
> the scheduler allocates a task with a node-local request on its rack, the 
> request on the node is left there.  This can cause delay-scheduling to try to 
> assign a container on a node that nobody cares about anymore.
> Additionally, unless I am missing something, the current model does not allow 
> requests for containers only on a specific node or specific rack. While this 
> is not a use case for MapReduce currently, it is conceivable that it might be 
> something useful to support in the future, for example to schedule 
> long-running services that persist state in a particular location, or for 
> applications that generally care less about latency than data-locality.
> Lastly, the ability to understand which requests are for the same task will 
> possibly allow future schedulers to make more intelligent scheduling 
> decisions, as well as permit a more exact understanding of request load.
> I would propose the tweak of allowing a single ResourceRequest to encapsulate 
> all the location information for a task.  So instead of just a single 
> location, a ResourceRequest would contain an array of locations, including 
> nodes that it would be happy with, racks that it would be happy with, and 
> possibly *.  Side effects of this change would be a reduction in the amount 
> of data that needs to be transferred in a heartbeat, as well as in the RM's 
> memory footprint, because what used to be different requests for the same 
> task are now able to share some common data.
> While this change breaks compatibility, if it is going to happen, it makes 
> sense to do it now, before YARN becomes beta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (YARN-371) Consolidate resource requests in AM-RM heartbeat

2013-02-04 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570109#comment-13570109
 ] 

Tom White commented on YARN-371:


Looks like there's a misunderstanding here - Sandy talks about _reducing_ the 
memory requirements of the RM. If I understand the proposal correctly, the 
number of resource request objects sent by the AM in MR would be reduced from 
five (three node-local, one rack-local, one ANY) to a single resource request 
carrying an array of five locations (the three nodes, the rack, and ANY).

BTW Arun, immediately vetoing an issue in the first comment is not conducive to 
a balanced discussion!

> Consolidate resource requests in AM-RM heartbeat
> 
>
> Key: YARN-371
> URL: https://issues.apache.org/jira/browse/YARN-371
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api, resourcemanager, scheduler
>Affects Versions: 2.0.2-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>
> Each AMRM heartbeat consists of a list of resource requests. Currently, each 
> resource request consists of a container count, a resource vector, and a 
> location, which may be a node, a rack, or "*". When an application wishes to 
> request a task run in multiple locations, it must issue a request for each 
> location.  This means that for a node-local task, it must issue three 
> requests, one at the node-level, one at the rack-level, and one with * (any). 
> These requests are not linked with each other, so when a container is 
> allocated for one of them, the RM has no way of knowing which others to get 
> rid of. When a node-local container is allocated, this is handled by 
> decrementing the number of requests on that node's rack and in *. But when 
> the scheduler allocates a task with a node-local request on its rack, the 
> request on the node is left there.  This can cause delay-scheduling to try to 
> assign a container on a node that nobody cares about anymore.
> Additionally, unless I am missing something, the current model does not allow 
> requests for containers only on a specific node or specific rack. While this 
> is not a use case for MapReduce currently, it is conceivable that it might be 
> something useful to support in the future, for example to schedule 
> long-running services that persist state in a particular location, or for 
> applications that generally care less about latency than data-locality.
> Lastly, the ability to understand which requests are for the same task will 
> possibly allow future schedulers to make more intelligent scheduling 
> decisions, as well as permit a more exact understanding of request load.
> I would propose the tweak of allowing a single ResourceRequest to encapsulate 
> all the location information for a task.  So instead of just a single 
> location, a ResourceRequest would contain an array of locations, including 
> nodes that it would be happy with, racks that it would be happy with, and 
> possibly *.  Side effects of this change would be a reduction in the amount 
> of data that needs to be transferred in a heartbeat, as well as in the RM's 
> memory footprint, because what used to be different requests for the same 
> task are now able to share some common data.
> While this change breaks compatibility, if it is going to happen, it makes 
> sense to do it now, before YARN becomes beta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-319) Submit a job to a queue that is not allowed in fairScheduler, client will hold forever.

2013-01-22 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559583#comment-13559583
 ] 

Tom White commented on YARN-319:


> When waiting for the final application status to be failed, you can use a 
> smaller sleep inside a loop. TestNodeManagerShutdown has something like this 
> on line 141.

Would it be possible to use a synchronous event handler in the tests so that we 
don't have to poll?
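
For what it's worth, a self-contained sketch of the synchronous style, using 
the DrainDispatcher test helper from yarn-common (assuming the test can inject 
its own dispatcher into the component under test):

{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.event.AbstractEvent;
import org.apache.hadoop.yarn.event.DrainDispatcher;
import org.apache.hadoop.yarn.event.EventHandler;

public class DrainDispatcherSketch {
  enum Type { APP_REJECTED }

  static class AppEvent extends AbstractEvent<Type> {
    AppEvent(Type type) { super(type); }
  }

  public static void main(String[] args) {
    DrainDispatcher dispatcher = new DrainDispatcher();
    dispatcher.init(new Configuration());
    dispatcher.start();
    final boolean[] rejected = new boolean[1];
    dispatcher.register(Type.class, new EventHandler<AppEvent>() {
      @Override
      public void handle(AppEvent event) { rejected[0] = true; }
    });
    dispatcher.getEventHandler().handle(new AppEvent(Type.APP_REJECTED));
    dispatcher.await();               // blocks until every queued event is handled
    System.out.println(rejected[0]);  // true - no sleep/poll loop needed
    dispatcher.stop();
  }
}
{noformat}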

> Submit a job to a queue that is not allowed in fairScheduler, client will 
> hold forever.
> 
>
> Key: YARN-319
> URL: https://issues.apache.org/jira/browse/YARN-319
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, scheduler
>Affects Versions: 2.0.2-alpha
>Reporter: shenhong
>Assignee: shenhong
> Fix For: 2.0.3-alpha
>
> Attachments: YARN-319-1.patch, YARN-319-2.patch, YARN-319.patch
>
>
> When the RM uses the fairScheduler and a client submits a job to a queue that 
> does not allow that user to submit jobs, the client will hang forever.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-333) On app submission, have RM ask scheduler for queue name

2013-01-22 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559548#comment-13559548
 ] 

Tom White commented on YARN-333:


This is not a problem in MR, is it, since the queue is always set? But I can 
see that it would be needed in general.

The approach looks fine, although I think it would be simpler just to have a 
getDefaultQueueName() method on YarnScheduler.


> On app submission, have RM ask scheduler for queue name
> ---
>
> Key: YARN-333
> URL: https://issues.apache.org/jira/browse/YARN-333
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.0.2-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
> Attachments: YARN-333.patch
>
>
> Currently, if an app is submitted without a queue, RMAppManager sets the 
> RMApp's queue to "default".
> A scheduler may wish to make its own decision on which queue to place an app 
> in if none is specified. For example, when the fair scheduler 
> user-as-default-queue config option is set to true, and an app is submitted 
> with no queue specified, the fair scheduler should assign the app to a queue 
> with the user's name.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-319) Submit a job to a queue that is not allowed in fairScheduler, client will hold forever.

2013-01-15 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13553942#comment-13553942
 ] 

Tom White commented on YARN-319:


Is it possible to write a unit test for this?

> Submit a job to a queue that is not allowed in fairScheduler, client will 
> hold forever.
> 
>
> Key: YARN-319
> URL: https://issues.apache.org/jira/browse/YARN-319
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, scheduler
>Affects Versions: 2.0.2-alpha
>Reporter: shenhong
> Fix For: 2.0.3-alpha
>
> Attachments: YARN-319.patch
>
>
> When the RM uses the fairScheduler and a client submits a job to a queue that 
> does not allow that user to submit jobs, the client will hang forever.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-302) Fair scheduler assignmultiple should default to false

2013-01-15 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13553948#comment-13553948
 ] 

Tom White commented on YARN-302:


+1

> Fair scheduler assignmultiple should default to false
> -
>
> Key: YARN-302
> URL: https://issues.apache.org/jira/browse/YARN-302
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, scheduler
>Affects Versions: 2.0.2-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
> Attachments: YARN-302.patch
>
>
> The MR1 default was false.  When true, it results in overloading some 
> machines with many tasks and underutilizing others.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-307) NodeManager should log container launch command.

2013-01-03 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542993#comment-13542993
 ] 

Tom White commented on YARN-307:


Lohit, if you set yarn.nodemanager.delete.debug-delay-sec then the launch 
script won't be deleted straightaway, which helps with debugging.
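
For anyone following along, that setting goes in the NodeManager's 
yarn-site.xml; the 600 below is just an example retention period:

{noformat}
<!-- Keep container launch scripts and other local files for 10 minutes after
     the container finishes, instead of deleting them immediately -->
<property>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>600</value>
</property>
{noformat}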

> NodeManager should log container launch command.
> 
>
> Key: YARN-307
> URL: https://issues.apache.org/jira/browse/YARN-307
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.0.3-alpha
>Reporter: Lohit Vijayarenu
>
> NodeManager's DefaultContainerExecutor seems to log only the path of the 
> default container executor script instead of the contents of the script. It 
> would be good to log the execution command so that one could see what is 
> being launched.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-286) Add a YARN ApplicationClassLoader

2013-01-02 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542060#comment-13542060
 ] 

Tom White commented on YARN-286:


The patch is needed by MAPREDUCE-1700 for class isolation in MR tasks.

> Add a YARN ApplicationClassLoader
> -
>
> Key: YARN-286
> URL: https://issues.apache.org/jira/browse/YARN-286
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: applications
>Affects Versions: 2.0.2-alpha
>Reporter: Tom White
>Assignee: Tom White
> Fix For: 2.0.3-alpha
>
> Attachments: YARN-286.patch
>
>
> Add a classloader that provides webapp-style class isolation for use by 
> applications. This is the YARN part of MAPREDUCE-1700 (which was already 
> developed in that JIRA).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-192) Node update causes NPE in the fair scheduler

2012-12-21 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13538112#comment-13538112
 ] 

Tom White commented on YARN-192:


If I run the new test without the fix it passes. Sandy, can you change the test 
to fail in that case?

> Node update causes NPE in the fair scheduler
> 
>
> Key: YARN-192
> URL: https://issues.apache.org/jira/browse/YARN-192
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, scheduler
>Affects Versions: 2.0.2-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
> Attachments: YARN-192.patch
>
>
> The exception occurs when unreserve is called on an FSSchedulerApp with a 
> NodeId that it does not know about.  The RM seems to have a different idea 
> about what apps are reserved for which node than the scheduler.
> 2012-10-29 22:30:52,901 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type NODE_UPDATE to the scheduler
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp.unreserve(FSSchedulerApp.java:356)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.unreserve(AppSchedulable.java:214)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.assignContainer(AppSchedulable.java:266)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.assignContainer(AppSchedulable.java:330)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueSchedulable.assignContainer(FSQueueSchedulable.java:161)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:759)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:836)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:329)
> at java.lang.Thread.run(Thread.java:662)
> 2012-10-29 22:30:52,903 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-286) Add a YARN ApplicationClassLoader

2012-12-21 Thread Tom White (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom White updated YARN-286:
---

Attachment: YARN-286.patch

> Add a YARN ApplicationClassLoader
> -
>
> Key: YARN-286
> URL: https://issues.apache.org/jira/browse/YARN-286
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: applications
>Affects Versions: 2.0.2-alpha
>Reporter: Tom White
>Assignee: Tom White
> Fix For: 2.0.3-alpha
>
> Attachments: YARN-286.patch
>
>
> Add a classloader that provides webapp-style class isolation for use by 
> applications. This is the YARN part of MAPREDUCE-1700 (which was already 
> developed in that JIRA).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-286) Add a YARN ApplicationClassLoader

2012-12-21 Thread Tom White (JIRA)
Tom White created YARN-286:
--

 Summary: Add a YARN ApplicationClassLoader
 Key: YARN-286
 URL: https://issues.apache.org/jira/browse/YARN-286
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: applications
Affects Versions: 2.0.2-alpha
Reporter: Tom White
Assignee: Tom White
 Fix For: 2.0.3-alpha


Add a classloader that provides webapp-style class isolation for use by 
applications. This is the YARN part of MAPREDUCE-1700 (which was already 
developed in that JIRA).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-103) Add a yarn AM - RM client module

2012-12-20 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13536990#comment-13536990
 ] 

Tom White commented on YARN-103:


Some feedback on the current patch:
* Why not use the existing ResourceRequest rather than creating a new type 
ContainerRequest? Having two equivalent types seems confusing.
* Is the ANY constant meant to be used by users? It looks like you specify a 
ContainerRequest with null hosts and racks in this case. If so, then it would 
be useful to add a constructor that doesn't take hosts or racks for that case, 
although given my previous point it would be easier to use ResourceRequest.
* {{assertTrue(containersRequestedRack == 2);}} appears on two successive 
lines. I think the assertion should be about containersRequestedAny.
* "// do a few iterations to ensure RM is not going send new containers" - wait 
in the loop to allow NMs to heartbeat?

> Add a yarn AM - RM client module
> 
>
> Key: YARN-103
> URL: https://issues.apache.org/jira/browse/YARN-103
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Attachments: YARN-103.1.patch, YARN-103.2.patch, YARN-103.3.patch, 
> YARN-103.4.patch, YARN-103.4.wrapper.patch, YARN-103.5.patch, 
> YARN-103.6.patch, YARN-103.7.patch
>
>
> Add a basic client wrapper library to the AM RM protocol in order to prevent 
> proliferation of code being duplicated everywhere. Provide helper functions 
> to perform reverse mapping of container requests to RM allocation resource 
> request table format.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-103) Add a yarn AM - RM client module

2012-12-20 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13536929#comment-13536929
 ] 

Tom White commented on YARN-103:


> The interface shouldn't really be implemented by anyone outside of YARN

This is the heart of the problem. We don't have a way to say (via the audience 
annotations) that an interface is meant only to be called by users - and not 
implemented by them. An interface may be @Public @Stable from the point of view 
of a user who wants to call it, but that doesn't mean that folks should 
implement it themselves, since for interfaces like the one we are discussing we 
might want to add a new method (note that such a change is compatible with 
@Stable). Adding a new method is fine for the first type of user, but not for 
the second, since their implementation breaks.

In this case, I think it's likely we'll add more methods. For example, it would 
be useful to add a waitForState method to YarnClient (which is also an 
interface), which waits for a given application to reach a particular 
YarnApplicationState. If YarnClient were a class then this would be a 
compatible change, but if it's an interface then it is not.

I think we should do one of the following:
1. Change YarnClient and AMRMClient to be concrete implementations.
2. Leave the interface/implementation distinction and make the interfaces 
@Public @Unstable.

I prefer 1. since these classes are helper classes - they are not a 
tightly-defined interface.
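
To illustrate with the waitForState example, a rough sketch of the helper as a 
method on a concrete client class (the method itself is hypothetical - only 
getApplicationReport exists today):

{noformat}
// Hypothetical helper, not an existing YarnClient method.
public void waitForState(ApplicationId appId, YarnApplicationState desired)
    throws YarnRemoteException, InterruptedException {
  while (getApplicationReport(appId).getYarnApplicationState() != desired) {
    Thread.sleep(1000);   // poll the RM until the application reaches the state
  }
}
{noformat}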

> Add a yarn AM - RM client module
> 
>
> Key: YARN-103
> URL: https://issues.apache.org/jira/browse/YARN-103
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Attachments: YARN-103.1.patch, YARN-103.2.patch, YARN-103.3.patch, 
> YARN-103.4.patch, YARN-103.4.wrapper.patch, YARN-103.5.patch, 
> YARN-103.6.patch, YARN-103.7.patch
>
>
> Add a basic client wrapper library to the AM RM protocol in order to prevent 
> proliferation of code being duplicated everywhere. Provide helper functions 
> to perform reverse mapping of container requests to RM allocation resource 
> request table format.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-103) Add a yarn AM - RM client module

2012-12-18 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535045#comment-13535045
 ] 

Tom White commented on YARN-103:


> +1, this is really helpful if we can get the APIs right.

I agree this would be very useful. Regarding getting the APIs right, making 
AMRMClient a class rather than an interface would permit us to add methods 
later on, which is not possible for interfaces. This is a helper class, so 
there shouldn't be a need for alternative implementations. 

> Add a yarn AM - RM client module
> 
>
> Key: YARN-103
> URL: https://issues.apache.org/jira/browse/YARN-103
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Attachments: YARN-103.1.patch, YARN-103.2.patch, YARN-103.3.patch, 
> YARN-103.4.patch, YARN-103.4.wrapper.patch, YARN-103.5.patch, 
> YARN-103.6.patch, YARN-103.7.patch
>
>
> Add a basic client wrapper library to the AM RM protocol in order to prevent 
> proliferation of code being duplicated everywhere. Provide helper functions 
> to perform reverse mapping of container requests to RM allocation resource 
> request table format.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-252) Unmanaged AMs should not have to set the AM's ContainerLaunchContext

2012-12-18 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535034#comment-13535034
 ] 

Tom White commented on YARN-252:


Bikas - it's a toy YARN app written in Java that runs the AM from the client. I 
agree we need both.

> Unmanaged AMs should not have to set the AM's ContainerLaunchContext
> 
>
> Key: YARN-252
> URL: https://issues.apache.org/jira/browse/YARN-252
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: applications
>Affects Versions: 2.0.2-alpha
>Reporter: Tom White
>Assignee: Tom White
> Attachments: YARN-252.patch, YARN-252.patch, YARN-252.patch
>
>
> Not calling ApplicationSubmissionContext#setAMContainerSpec causes a NPE, 
> even though the container is not used (since the AM doesn't run in a managed 
> YARN container).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-272) Fair scheduler log messages try to print objects without overridden toString methods

2012-12-18 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13534981#comment-13534981
 ] 

Tom White commented on YARN-272:


It would be better to add the toString() methods since then new messages that 
print the objects won't suffer from the same problem.
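
For illustration, the kind of override being suggested - a minimal sketch for a 
container-like class (the real RMContainerImpl would probably want to include 
more of its state):

{noformat}
@Override
public String toString() {
  // Print something readable instead of RMContainerImpl@324f0f97.
  return getContainerId().toString();
}
{noformat}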

> Fair scheduler log messages try to print objects without overridden toString 
> methods
> 
>
> Key: YARN-272
> URL: https://issues.apache.org/jira/browse/YARN-272
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.0.2-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
> Attachments: YARN-272.patch
>
>
> A lot of junk gets printed out like this:
> 2012-12-11 17:31:52,998 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: 
> Application application_1355270529654_0003 reserved container 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl@324f0f97
>  on node host: c1416.hal.cloudera.com:46356 #containers=7 available=0 
> used=8192, currently has 4 at priority 
> org.apache.hadoop.yarn.api.records.impl.pb.PriorityPBImpl@33; 
> currentReservation 4096

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-230) Make changes for RM restart phase 1

2012-12-17 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13534044#comment-13534044
 ] 

Tom White commented on YARN-230:


Arun, yes it looks good to me, +1. We can address any changes that come up in 
later JIRAs. 

> Make changes for RM restart phase 1
> ---
>
> Key: YARN-230
> URL: https://issues.apache.org/jira/browse/YARN-230
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Attachments: PB-impl.patch, Recovery.patch, Store.patch, Test.patch, 
> YARN-230.1.patch, YARN-230.4.patch, YARN-230.5.patch
>
>
> As described in YARN-128, phase 1 of RM restart puts in place mechanisms to 
> save application state and read them back after restart. Upon restart, the 
> NM's are asked to reboot and the previously running AM's are restarted.
> After this is done, RM HA and work preserving restart can continue in 
> parallel. For more details please refer to the design document in YARN-128

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-252) Unmanaged AMs should not have to set the AM's ContainerLaunchContext

2012-12-14 Thread Tom White (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom White updated YARN-252:
---

Attachment: YARN-252.patch

Another patch with some comments about ACLs.

> Would you please register your ideas in a comment on YARN-255.

Done.

> I didnt think that someone else would need to write an unmanagedAM launcher 
> client. So this would be the only client. Do you have any scenarios in mind 
> where another client would be needed?

That's interesting. I looked at the UnmanagedAMLauncher but it didn't fit my 
use case since I want to run the AM in the same process as the launcher client 
(i.e. in the same JVM). We could extend the launcher to support that case; I 
opened YARN-273 to track it.

It's still possible that folks will want to write their own unmanaged AM 
clients (e.g. in a web app context, or where class isolation needs handling 
differently), but those cases should be relatively rare.

> Unmanaged AMs should not have to set the AM's ContainerLaunchContext
> 
>
> Key: YARN-252
> URL: https://issues.apache.org/jira/browse/YARN-252
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: applications
>Affects Versions: 2.0.2-alpha
>Reporter: Tom White
>Assignee: Tom White
> Attachments: YARN-252.patch, YARN-252.patch, YARN-252.patch
>
>
> Not calling ApplicationSubmissionContext#setAMContainerSpec causes a NPE, 
> even though the container is not used (since the AM doesn't run in a managed 
> YARN container).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-273) Add an unmanaged AM client for in-process AMs

2012-12-14 Thread Tom White (JIRA)
Tom White created YARN-273:
--

 Summary: Add an unmanaged AM client for in-process AMs
 Key: YARN-273
 URL: https://issues.apache.org/jira/browse/YARN-273
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: client
Reporter: Tom White
Assignee: Tom White


UnmanagedAMLauncher assumes that the AM is launched in a separate process. We 
should add support for running the AM in the same JVM (either by extending 
UnmanagedAMLauncher or adding another class).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-255) Support secure AM launch for unmanaged AM's

2012-12-14 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13532414#comment-13532414
 ] 

Tom White commented on YARN-255:


Per discussion on YARN-252, the ACL methods might be better moved to 
ApplicationSubmissionContext, since unmanaged AMs don't have a 
ContainerLaunchContext (for the AM).

> Support secure AM launch for unmanaged AM's
> ---
>
> Key: YARN-255
> URL: https://issues.apache.org/jira/browse/YARN-255
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.0.0
>Reporter: Bikas Saha
>Assignee: Bikas Saha
>
> Currently unmanaged AM launch does not get security tokens because tokens are 
> passed by the RM to the AM via the NM during AM container launch. For 
> unmanaged AM's the RM can send tokens in the SubmitApplicationResponse to the 
> secure client. The client can then pass these onto the AM in a manner similar 
> to the NM. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-230) Make changes for RM restart phase 1

2012-12-14 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13532371#comment-13532371
 ] 

Tom White commented on YARN-230:


OK, in which case please make the default state store the filesystem one with 
the default URI discussed earlier.

> Make changes for RM restart phase 1
> ---
>
> Key: YARN-230
> URL: https://issues.apache.org/jira/browse/YARN-230
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Attachments: PB-impl.patch, Recovery.patch, Store.patch, Test.patch, 
> YARN-230.1.patch, YARN-230.4.patch, YARN-230.5.patch
>
>
> As described in YARN-128, phase 1 of RM restart puts in place mechanisms to 
> save application state and read them back after restart. Upon restart, the 
> NM's are asked to reboot and the previously running AM's are restarted.
> After this is done, RM HA and work preserving restart can continue in 
> parallel. For more details please refer to the design document in YARN-128

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-230) Make changes for RM restart phase 1

2012-12-13 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13531147#comment-13531147
 ] 

Tom White commented on YARN-230:


Thanks for addressing my feedback Bikas. The NullRMStateStore is a good idea. 
With it, there is no need for yarn.resourcemanager.recovery.enabled, instead 
make the default yarn.resourcemanager.store.class the NullRMStateStore. For 
this to work NullRMStateStore's loadState method should return an unpopulated 
RMState object rather than null.
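
Roughly this shape, using stand-in types rather than the real 
RMStateStore/RMState classes from the patch:

{code}
// Rough sketch with stand-in types (not the real RMStateStore API): the null
// store returns an empty state object instead of null, so the recovery path
// simply finds zero applications to restore.
import java.util.Collections;
import java.util.Map;

class RecoveredStateSketch {
  final Map<String, byte[]> applications;

  RecoveredStateSketch(Map<String, byte[]> applications) {
    this.applications = applications;
  }
}

class NullStateStoreSketch {
  RecoveredStateSketch loadState() {
    // Unpopulated, never null: callers can iterate without null checks.
    return new RecoveredStateSketch(Collections.<String, byte[]>emptyMap());
  }

  void storeApplicationState(String appId, byte[] data) {
    // no-op: nothing is persisted when recovery is effectively disabled
  }
}
{code}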



> Make changes for RM restart phase 1
> ---
>
> Key: YARN-230
> URL: https://issues.apache.org/jira/browse/YARN-230
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Attachments: PB-impl.patch, Recovery.patch, Store.patch, Test.patch, 
> YARN-230.1.patch, YARN-230.4.patch, YARN-230.5.patch
>
>
> As described in YARN-128, phase 1 of RM restart puts in place mechanisms to 
> save application state and read them back after restart. Upon restart, the 
> NM's are asked to reboot and the previously running AM's are restarted.
> After this is done, RM HA and work preserving restart can continue in 
> parallel. For more details please refer to the design document in YARN-128

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-252) Unmanaged AMs should not have to set the AM's ContainerLaunchContext

2012-12-13 Thread Tom White (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom White updated YARN-252:
---

Summary: Unmanaged AMs should not have to set the AM's 
ContainerLaunchContext  (was: Unmanaged AMs should not have to set the 
ContainerLaunchContext)

> Unmanaged AMs should not have to set the AM's ContainerLaunchContext
> 
>
> Key: YARN-252
> URL: https://issues.apache.org/jira/browse/YARN-252
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: applications
>Affects Versions: 2.0.2-alpha
>Reporter: Tom White
>Assignee: Tom White
> Attachments: YARN-252.patch, YARN-252.patch
>
>
> Not calling ApplicationSubmissionContext#setAMContainerSpec causes a NPE, 
> even though the container is not used (since the AM doesn't run in a managed 
> YARN container).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-252) Unmanaged AMs should not have to set the ContainerLaunchContext

2012-12-13 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13531104#comment-13531104
 ] 

Tom White commented on YARN-252:


> I think it would be good to add the assert in the other places too. e.g. to 
> catch a bug when context is set for the app but not in the attempt because of 
> some new code bug.

I'm not sure where you are suggesting such a check is needed. Which 
class/method do you mean?

> In general I am thinking of having 1 client allow users to launch apps, in 
> managed or unmanaged mode.

I think that is a good goal, but this change doesn't preclude that at all. E.g. 
if ACLs aren't needed for the AM then not setting the AM's container spec 
should be OK. If ACLs are needed, then either set them via 
ApplicationSubmissionContext#setAMContainerSpec, or perhaps move the ACL 
methods to top-level methods of ApplicationSubmissionContext. The latter 
question could be tackled in YARN-255.

> I am not clear why application writers would need to do this. Application 
> writers need to think about setting context for containers and not AM's 
> themselves, dont they?

I'm talking about the ApplicationSubmissionContext#setAMContainerSpec call, 
which folks writing YARN apps call in their client. If the AM is unmanaged then 
there is no AM container so they shouldn't have to call this method. Of course, 
the AM will typically launch containers too and they will certainly need 
ContainerLaunchContext objects, set via 
StartContainerRequest#setContainerLaunchContext. Does that make more sense?
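
To make the client-side difference concrete, a hedged sketch (it assumes an 
ApplicationSubmissionContext setter for the unmanaged flag; setAMContainerSpec 
is the method discussed above):

{code}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;

// Sketch of the client-side distinction, not a complete submission client.
public class SubmissionSketch {
  static void configure(ApplicationSubmissionContext appContext,
      boolean runUnmanaged, ContainerLaunchContext amContainer) {
    if (runUnmanaged) {
      // Unmanaged AM: the RM never launches an AM container, so the client
      // should not have to supply a ContainerLaunchContext for the AM.
      appContext.setUnmanagedAM(true);
    } else {
      // Managed AM: describe how the RM should launch the AM container.
      appContext.setAMContainerSpec(amContainer);
    }
  }
}
{code}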

> Unmanaged AMs should not have to set the ContainerLaunchContext
> ---
>
> Key: YARN-252
> URL: https://issues.apache.org/jira/browse/YARN-252
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: applications
>Affects Versions: 2.0.2-alpha
>Reporter: Tom White
>Assignee: Tom White
> Attachments: YARN-252.patch, YARN-252.patch
>
>
> Not calling ApplicationSubmissionContext#setAMContainerSpec causes a NPE, 
> even though the container is not used (since the AM doesn't run in a managed 
> YARN container).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-253) Container launch may fail if no files were localized

2012-12-12 Thread Tom White (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom White updated YARN-253:
---

Attachment: YARN-253.patch

New patch which removes code from TestContainerManagerSecurity.

> Container launch may fail if no files were localized
> 
>
> Key: YARN-253
> URL: https://issues.apache.org/jira/browse/YARN-253
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.0.2-alpha
>Reporter: Tom White
>Assignee: Tom White
>Priority: Blocker
> Attachments: YARN-253.patch, YARN-253.patch, YARN-253-test.patch
>
>
> This can be demonstrated with DistributedShell. The containers running the 
> shell do not have any files to localize (if there is no shell script to copy) 
> so if they run on a different NM to the AM (which does localize files), then 
> they will fail since the appcache directory does not exist.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-253) Container launch may fail if no files were localized

2012-12-12 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13530009#comment-13530009
 ] 

Tom White commented on YARN-253:


Thanks Vinod. Looking at LCE, it seems to handle this case already since 
create_container_directories uses mkdirs which will create parent directories 
as needed.

I'll create another patch which removes the comment and temp files.

> Container launch may fail if no files were localized
> 
>
> Key: YARN-253
> URL: https://issues.apache.org/jira/browse/YARN-253
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.0.2-alpha
>Reporter: Tom White
>Assignee: Tom White
>Priority: Blocker
> Attachments: YARN-253.patch, YARN-253-test.patch
>
>
> This can be demonstrated with DistributedShell. The containers running the 
> shell do not have any files to localize (if there is no shell script to copy) 
> so if they run on a different NM to the AM (which does localize files), then 
> they will fail since the appcache directory does not exist.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-252) Unmanaged AMs should not have to set the ContainerLaunchContext

2012-12-03 Thread Tom White (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom White updated YARN-252:
---

Attachment: YARN-252.patch

Bikas, thanks for taking a look. I was thinking that since there is no AM 
container in the unmanaged case, setting one doesn't make sense - and in fact 
requiring that one be set is very confusing for YARN application writers. For 
YARN-255 the tokens could be passed by another API call.

I've updated the patch to assert the context is null for unmanaged AMs.

> Unmanaged AMs should not have to set the ContainerLaunchContext
> ---
>
> Key: YARN-252
> URL: https://issues.apache.org/jira/browse/YARN-252
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: applications
>Affects Versions: 2.0.2-alpha
>Reporter: Tom White
>Assignee: Tom White
> Attachments: YARN-252.patch, YARN-252.patch
>
>
> Not calling ApplicationSubmissionContext#setAMContainerSpec causes a NPE, 
> even though the container is not used (since the AM doesn't run in a managed 
> YARN container).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-253) Container launch may fail if no files were localized

2012-12-03 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508685#comment-13508685
 ] 

Tom White commented on YARN-253:


I see that you removed the change in MAPREDUCE-4427 because it was an unrelated 
bug, which made sense in the context of that work. I'm curious to hear if 
there's a better fix than creating parent directories on demand.

> Container launch may fail if no files were localized
> 
>
> Key: YARN-253
> URL: https://issues.apache.org/jira/browse/YARN-253
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.0.2-alpha
>Reporter: Tom White
>Assignee: Tom White
> Attachments: YARN-253.patch, YARN-253-test.patch
>
>
> This can be demonstrated with DistributedShell. The containers running the 
> shell do not have any files to localize (if there is no shell script to copy) 
> so if they run on a different NM to the AM (which does localize files), then 
> they will fail since the appcache directory does not exist.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-253) Container launch may fail if no files were localized

2012-11-30 Thread Tom White (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom White updated YARN-253:
---

Attachment: YARN-253.patch

This fixes the problem by creating parent directories if they don't already 
exist. Without the fix the test would fail about 4 times in 10; with the fix I 
didn't see a failure.
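
The core of the idea is just this (a minimal sketch, not the actual patch, 
assuming FileContext's mkdir overload that takes a createParent flag):

{code}
import java.io.IOException;

import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

// Minimal sketch: create the container directory and any missing parents
// (e.g. the appcache directory) in a single call.
public class CreateDirSketch {
  static void createDir(FileContext lfs, Path dirPath, FsPermission perm)
      throws IOException {
    // createParent=true ensures the parent exists even when no files were
    // localized for this application on this NM.
    lfs.mkdir(dirPath, perm, true);
  }
}
{code}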

> Container launch may fail if no files were localized
> 
>
> Key: YARN-253
> URL: https://issues.apache.org/jira/browse/YARN-253
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.0.2-alpha
>Reporter: Tom White
>Assignee: Tom White
> Attachments: YARN-253.patch, YARN-253-test.patch
>
>
> This can be demonstrated with DistributedShell. The containers running the 
> shell do not have any files to localize (if there is no shell script to copy) 
> so if they run on a different NM to the AM (which does localize files), then 
> they will fail since the appcache directory does not exist.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-253) Container launch may fail if no files were localized

2012-11-30 Thread Tom White (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom White updated YARN-253:
---

Attachment: YARN-253-test.patch

This change to the DS unit test exposes the problem by changing the mini YARN 
cluster to have multiple NMs and multiple local directories so that the 
appcache created by the AM is not necessarily the same one used by the shell 
container.

Here's the stacktrace showing the failure:

{noformat}
2012-11-30 18:01:02,729 WARN  [ContainersLauncher #0] launcher.ContainerLaunch 
(ContainerLaunch.java:call(247)) - Failed to launch container.
java.io.FileNotFoundException: File 
/Users/tom/workspace/hadoop-2/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/target/org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell/org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell-localDir-nm-2_0/usercache/tom/appcache/application_1354298449311_0001
 does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:485)
at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:996)
at 
org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:150)
at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:187)
at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:712)
at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:708)
at 
org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2361)
at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:708)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:332)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:128)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:680)
{noformat}

> Container launch may fail if no files were localized
> 
>
> Key: YARN-253
> URL: https://issues.apache.org/jira/browse/YARN-253
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.0.2-alpha
>Reporter: Tom White
>Assignee: Tom White
> Attachments: YARN-253-test.patch
>
>
> This can be demonstrated with DistributedShell. The containers running the 
> shell do not have any files to localize (if there is no shell script to copy) 
> so if they run on a different NM to the AM (which does localize files), then 
> they will fail since the appcache directory does not exist.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-253) Container launch may fail if no files were localized

2012-11-30 Thread Tom White (JIRA)
Tom White created YARN-253:
--

 Summary: Container launch may fail if no files were localized
 Key: YARN-253
 URL: https://issues.apache.org/jira/browse/YARN-253
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.2-alpha
Reporter: Tom White
Assignee: Tom White


This can be demonstrated with DistributedShell. The containers running the 
shell do not have any files to localize (if there is no shell script to copy) 
so if they run on a different NM to the AM (which does localize files), then 
they will fail since the appcache directory does not exist.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-252) Unmanaged AMs should not have to set the ContainerLaunchContext

2012-11-30 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13507442#comment-13507442
 ] 

Tom White commented on YARN-252:


Testing is covered by the existing TestUnmanagedAMLauncher.

> Unmanaged AMs should not have to set the ContainerLaunchContext
> ---
>
> Key: YARN-252
> URL: https://issues.apache.org/jira/browse/YARN-252
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: applications
>Affects Versions: 2.0.2-alpha
>Reporter: Tom White
>Assignee: Tom White
> Attachments: YARN-252.patch
>
>
> Not calling ApplicationSubmissionContext#setAMContainerSpec causes a NPE, 
> even though the container is not used (since the AM doesn't run in a managed 
> YARN container).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-252) Unmanaged AMs should not have to set the ContainerLaunchContext

2012-11-30 Thread Tom White (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom White updated YARN-252:
---

Attachment: YARN-252.patch

Simple fix.

> Unmanaged AMs should not have to set the ContainerLaunchContext
> ---
>
> Key: YARN-252
> URL: https://issues.apache.org/jira/browse/YARN-252
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: applications
>Affects Versions: 2.0.2-alpha
>Reporter: Tom White
>Assignee: Tom White
> Attachments: YARN-252.patch
>
>
> Not calling ApplicationSubmissionContext#setAMContainerSpec causes a NPE, 
> even though the container is not used (since the AM doesn't run in a managed 
> YARN container).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-252) Unmanaged AMs should not have to set the ContainerLaunchContext

2012-11-30 Thread Tom White (JIRA)
Tom White created YARN-252:
--

 Summary: Unmanaged AMs should not have to set the 
ContainerLaunchContext
 Key: YARN-252
 URL: https://issues.apache.org/jira/browse/YARN-252
 Project: Hadoop YARN
  Issue Type: Bug
  Components: applications
Affects Versions: 2.0.2-alpha
Reporter: Tom White
Assignee: Tom White


Not calling ApplicationSubmissionContext#setAMContainerSpec causes a NPE, even 
though the container is not used (since the AM doesn't run in a managed YARN 
container).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-251) Proxy URI generation fails for blank tracking URIs

2012-11-30 Thread Tom White (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom White updated YARN-251:
---

Attachment: YARN-251.patch

With this fix, applications that don't set a tracking URI won't get a 
confusing warning.
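
In other words, something along these lines (a hypothetical helper, not the 
actual RM code):

{code}
// Sketch: treat a null and a blank tracking URI identically, so neither is
// fed to the proxy-URI generator and no spurious warning is logged.
public class TrackingUriSketch {
  static String normalize(String trackingUri) {
    if (trackingUri == null || trackingUri.trim().isEmpty()) {
      return null; // no tracking URI was set; skip proxification quietly
    }
    return trackingUri;
  }
}
{code}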

> Proxy URI generation fails for blank tracking URIs
> --
>
> Key: YARN-251
> URL: https://issues.apache.org/jira/browse/YARN-251
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.0.2-alpha
>Reporter: Tom White
>Assignee: Tom White
> Attachments: YARN-251.patch
>
>
> If the URI is an empty string (the default if not set), then a warning is 
> displayed. A null URI displays no such warning. These two cases should be 
> handled in the same way.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-251) Proxy URI generation fails for blank tracking URIs

2012-11-30 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13507403#comment-13507403
 ] 

Tom White commented on YARN-251:


Running TestRM produces the following warning (although the test doesn't fail):

{noformat}
2012-11-30 15:20:20,136 WARN  [AsyncDispatcher event handler] 
attempt.RMAppAttemptImpl 
(RMAppAttemptImpl.java:generateProxyUriWithoutScheme(412)) - Could not proxify 
java.net.URISyntaxException: Expected authority at index 7: http://
at java.net.URI$Parser.fail(URI.java:2810)
at java.net.URI$Parser.failExpecting(URI.java:2816)
at java.net.URI$Parser.parseHierarchical(URI.java:3064)
at java.net.URI$Parser.parse(URI.java:3015)
at java.net.URI.<init>(URI.java:577)
at 
org.apache.hadoop.yarn.server.webproxy.ProxyUriUtils.getUriFromAMUrl(ProxyUriUtils.java:144)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.generateProxyUriWithoutScheme(RMAppAttemptImpl.java:403)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$2800(RMAppAttemptImpl.java:87)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:802)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:792)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:359)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:299)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:445)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:509)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:86)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:440)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:421)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:128)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
at java.lang.Thread.run(Thread.java:680)
{noformat}

> Proxy URI generation fails for blank tracking URIs
> --
>
> Key: YARN-251
> URL: https://issues.apache.org/jira/browse/YARN-251
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.0.2-alpha
>Reporter: Tom White
>Assignee: Tom White
>
> If the URI is an empty string (the default if not set), then a warning is 
> displayed. A null URI displays no such warning. These two cases should be 
> handled in the same way.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-251) Proxy URI generation fails for blank tracking URIs

2012-11-30 Thread Tom White (JIRA)
Tom White created YARN-251:
--

 Summary: Proxy URI generation fails for blank tracking URIs
 Key: YARN-251
 URL: https://issues.apache.org/jira/browse/YARN-251
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.2-alpha
Reporter: Tom White
Assignee: Tom White


If the URI is an empty string (the default if not set), then a warning is 
displayed. A null URI displays no such warning. These two cases should be 
handled in the same way.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-72) NM should handle cleaning up containers when it shuts down ( and kill containers from an earlier instance when it comes back up after an unclean shutdown )

2012-11-30 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-72?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13507290#comment-13507290
 ] 

Tom White commented on YARN-72:
---

+1

> NM should handle cleaning up containers when it shuts down ( and kill 
> containers from an earlier instance when it comes back up after an unclean 
> shutdown )
> ---
>
> Key: YARN-72
> URL: https://issues.apache.org/jira/browse/YARN-72
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Hitesh Shah
>Assignee: Sandy Ryza
> Attachments: YARN-72-1.patch, YARN-72-2.patch, YARN-72-2.patch, 
> YARN-72.patch
>
>
> Ideally, the NM should wait for a limited amount of time when it gets a 
> shutdown signal for existing containers to complete and kill the containers ( 
> if we pick an aggressive approach ) after this time interval. 
> For NMs which come up after an unclean shutdown, the NM should look through 
> its directories for existing container.pids and try and kill an existing 
> containers matching the pids found. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-72) NM should handle cleaning up containers when it shuts down

2012-11-30 Thread Tom White (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-72?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom White updated YARN-72:
--

Summary: NM should handle cleaning up containers when it shuts down  (was: 
NM should handle cleaning up containers when it shuts down ( and kill 
containers from an earlier instance when it comes back up after an unclean 
shutdown ))

Updated summary to reflect current scope.

> NM should handle cleaning up containers when it shuts down
> --
>
> Key: YARN-72
> URL: https://issues.apache.org/jira/browse/YARN-72
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Hitesh Shah
>Assignee: Sandy Ryza
> Attachments: YARN-72-1.patch, YARN-72-2.patch, YARN-72-2.patch, 
> YARN-72.patch
>
>
> Ideally, the NM should wait for a limited amount of time when it gets a 
> shutdown signal for existing containers to complete and kill the containers ( 
> if we pick an aggressive approach ) after this time interval. 
> For NMs which come up after an unclean shutdown, the NM should look through 
> its directories for existing container.pids and try and kill an existing 
> containers matching the pids found. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-170) NodeManager stop() gets called twice on shutdown

2012-11-30 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13507289#comment-13507289
 ] 

Tom White commented on YARN-170:


+1

Vinod, do you have any more comments before we commit this?

> NodeManager stop() gets called twice on shutdown
> 
>
> Key: YARN-170
> URL: https://issues.apache.org/jira/browse/YARN-170
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.0.2-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
> Attachments: YARN-170-1.patch, YARN-170-2.patch, YARN-170.patch
>
>
> The stop method in the NodeManager gets called twice when the NodeManager is 
> shut down via the shutdown hook.
> The first is the stop that gets called directly by the shutdown hook.  The 
> second occurs when the NodeStatusUpdaterImpl is stopped.  The NodeManager 
> responds to the NodeStatusUpdaterImpl stop stateChanged event by stopping 
> itself.  This is so that NodeStatusUpdaterImpl can notify the NodeManager to 
> stop, by stopping itself in response to a request from the ResourceManager
> This could be avoided if the NodeStatusUpdaterImpl were to stop the 
> NodeManager by calling its stop method directly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-248) Restore RMDelegationTokenSecretManager state on restart

2012-11-29 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506574#comment-13506574
 ] 

Tom White commented on YARN-248:


I thought about it more, and delegation tokens from the job's credentials are 
written to the job submit directory in a file called appTokens. This is 
MR-specific, however, so we can't use it from YARN, which means ii) isn't 
appropriate.

This issue affects JT job recovery too (see MAPREDUCE-4830).


> Restore RMDelegationTokenSecretManager state on restart
> ---
>
> Key: YARN-248
> URL: https://issues.apache.org/jira/browse/YARN-248
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tom White
>
> On restart, the RM creates a new RMDelegationTokenSecretManager with fresh 
> state. This will cause problems for Oozie jobs running on secure clusters 
> since the delegation tokens stored in the job credentials (used by the Oozie 
> launcher job to submit a job to the RM) will not be recognized by the RM, and 
> recovery will fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-187) Add hierarchical queues to the fair scheduler

2012-11-29 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506408#comment-13506408
 ] 

Tom White commented on YARN-187:


Sandy, thanks for adding the javadoc. It would be good to update the 
FairScheduler.apt.vm file too to mention hierarchical queues.

Does this change affect existing allocations files now that there is a concept 
of a root queue? My reading is that no changes are needed - but I wanted to 
check before committing this.

> Add hierarchical queues to the fair scheduler
> -
>
> Key: YARN-187
> URL: https://issues.apache.org/jira/browse/YARN-187
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: scheduler
>Affects Versions: 2.0.2-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
> Attachments: YARN-187-1.patch, YARN-187-1.patch, YARN-187.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-248) Restore RMDelegationTokenSecretManager state on restart

2012-11-28 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505726#comment-13505726
 ] 

Tom White commented on YARN-248:


I can think of a couple of options: i) persist RMDelegationTokenSecretManager's 
state (secret keys, all current delegation tokens), or ii) recover the 
delegation tokens from the persisted credentials of the applications to be 
recovered (and the secret keys from a filesystem or ZK store).

> Restore RMDelegationTokenSecretManager state on restart
> ---
>
> Key: YARN-248
> URL: https://issues.apache.org/jira/browse/YARN-248
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tom White
>
> On restart, the RM creates a new RMDelegationTokenSecretManager with fresh 
> state. This will cause problems for Oozie jobs running on secure clusters 
> since the delegation tokens stored in the job credentials (used by the Oozie 
> launcher job to submit a job to the RM) will not be recognized by the RM, and 
> recovery will fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-248) Restore RMDelegationTokenSecretManager state on restart

2012-11-28 Thread Tom White (JIRA)
Tom White created YARN-248:
--

 Summary: Restore RMDelegationTokenSecretManager state on restart
 Key: YARN-248
 URL: https://issues.apache.org/jira/browse/YARN-248
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tom White


On restart, the RM creates a new RMDelegationTokenSecretManager with fresh 
state. This will cause problems for Oozie jobs running on secure clusters since 
the delegation tokens stored in the job credentials (used by the Oozie launcher 
job to submit a job to the RM) will not be recognized by the RM, and recovery 
will fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-230) Make changes for RM restart phase 1

2012-11-28 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505427#comment-13505427
 ] 

Tom White commented on YARN-230:


bq. But it could make sense to remove application attempts but not remove the 
application, couldn't it? Say we want to remove some attempt from the saved 
state before the application is done.

Let's add it when we need it then.

bq. We also need to change the AM retry default to > 1. Otherwise, even with RM 
restart enabled, the restarted attempts will fail because the previous AM will 
delete job files. What is your suggestion for that?

I think this is where the killed/failed distinction comes in. If the app 
attempt was killed (because the RM died), then the app will be retried since 
the first attempt didn't count (from the point of view of 
yarn.resourcemanager.am.max-retries). This should be taken care of in YARN-218 
- does that sound OK to you?

> Make changes for RM restart phase 1
> ---
>
> Key: YARN-230
> URL: https://issues.apache.org/jira/browse/YARN-230
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Attachments: PB-impl.patch, Recovery.patch, Store.patch, Test.patch, 
> YARN-230.1.patch
>
>
> As described in YARN-128, phase 1 of RM restart puts in place mechanisms to 
> save application state and read them back after restart. Upon restart, the 
> NM's are asked to reboot and the previously running AM's are restarted.
> After this is done, RM HA and work preserving restart can continue in 
> parallel. For more details please refer to the design document in YARN-128

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-231) Add persistent store implementation for RMStateStore

2012-11-28 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505417#comment-13505417
 ] 

Tom White commented on YARN-231:


I was assuming that the store was HA since it would be backed by HDFS or ZK.

Anyway, I'm fine with refining the set of exceptions as this evolves in 
follow-on JIRAs, even though it probably is just IOException at this point.

> Add persistent store implementation for RMStateStore
> 
>
> Key: YARN-231
> URL: https://issues.apache.org/jira/browse/YARN-231
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Attachments: YARN-231.1.patch
>
>
> Add stores that write RM state data to ZooKeeper and FileSystem 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-187) Add hierarchical queues to the fair scheduler

2012-11-28 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505402#comment-13505402
 ] 

Tom White commented on YARN-187:


License headers are missing on the new files. (BTW I filed HADOOP-9097 so that 
the release audit stops giving false positives here.)

Would it be possible to add a bit of documentation to explain how to use this 
new feature?

> Add hierarchical queues to the fair scheduler
> -
>
> Key: YARN-187
> URL: https://issues.apache.org/jira/browse/YARN-187
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: scheduler
>Affects Versions: 2.0.2-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
> Attachments: YARN-187.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-72) NM should handle cleaning up containers when it shuts down ( and kill containers from an earlier instance when it comes back up after an unclean shutdown )

2012-11-28 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-72?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505398#comment-13505398
 ] 

Tom White commented on YARN-72:
---

I don't think it needs to be configurable, since it's a best-effort cleanup 
anyway. What you have seems reasonable to me, although you might want to make 
the 1000 a constant so it's clear what it is for (slop value).

Does the test fail without the change?

Nit: I would change waitForContainersOnShutdownMs to 
waitForContainersOnShutdownMillis.
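
By the constant I mean something like this (names are hypothetical, not the 
actual patch):

{code}
// Sketch of the naming nit: give the 1000 ms slop a descriptive constant.
public class ShutdownTimingSketch {
  /** Extra slack added on top of the container-kill waits, in milliseconds. */
  private static final long SHUTDOWN_CLEANUP_SLOP_MILLIS = 1000;

  private final long waitForContainersOnShutdownMillis;

  public ShutdownTimingSketch(long processKillWaitMillis,
      long sleepDelayBeforeSigkillMillis) {
    this.waitForContainersOnShutdownMillis =
        processKillWaitMillis + sleepDelayBeforeSigkillMillis
            + SHUTDOWN_CLEANUP_SLOP_MILLIS;
  }
}
{code}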

> NM should handle cleaning up containers when it shuts down ( and kill 
> containers from an earlier instance when it comes back up after an unclean 
> shutdown )
> ---
>
> Key: YARN-72
> URL: https://issues.apache.org/jira/browse/YARN-72
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Hitesh Shah
>Assignee: Sandy Ryza
> Attachments: YARN-72-1.patch, YARN-72.patch
>
>
> Ideally, the NM should wait for a limited amount of time when it gets a 
> shutdown signal for existing containers to complete and kill the containers ( 
> if we pick an aggressive approach ) after this time interval. 
> For NMs which come up after an unclean shutdown, the NM should look through 
> its directories for existing container.pids and try and kill an existing 
> containers matching the pids found. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-170) NodeManager stop() gets called twice on shutdown

2012-11-28 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505384#comment-13505384
 ] 

Tom White commented on YARN-170:


This looks like an overall improvement. A few minor comments on the latest 
patch:

* The reboot() method doesn't need to be public.
* The new classes are missing license headers.
* Please add a test for the change.

> NodeManager stop() gets called twice on shutdown
> 
>
> Key: YARN-170
> URL: https://issues.apache.org/jira/browse/YARN-170
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.0.2-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
> Attachments: YARN-170-1.patch, YARN-170.patch
>
>
> The stop method in the NodeManager gets called twice when the NodeManager is 
> shut down via the shutdown hook.
> The first is the stop that gets called directly by the shutdown hook.  The 
> second occurs when the NodeStatusUpdaterImpl is stopped.  The NodeManager 
> responds to the NodeStatusUpdaterImpl stop stateChanged event by stopping 
> itself.  This is so that NodeStatusUpdaterImpl can notify the NodeManager to 
> stop, by stopping itself in response to a request from the ResourceManager
> This could be avoided if the NodeStatusUpdaterImpl were to stop the 
> NodeManager by calling its stop method directly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-231) Add persistent store implementation for RMStateStore

2012-11-27 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504772#comment-13504772
 ] 

Tom White commented on YARN-231:


bq. Errors could either be because of store error or loss of master status.

Do you mean loss of master status of the RM doing the store? I wouldn't expect 
the store to know about the master's status, since that would be handled at a 
higher level (i.e. by the RM itself). So IOException would be sufficient, at 
least for this JIRA.

bq. All interfaces are marked Evolving for this reason.

It would be better to use Unstable until the RM HA work is done. Evolving would 
mean that you couldn't change the exceptions between 2.0.3 and 2.0.4 (say).

> Add persistent store implementation for RMStateStore
> 
>
> Key: YARN-231
> URL: https://issues.apache.org/jira/browse/YARN-231
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Attachments: YARN-231.1.patch
>
>
> Add stores that write RM state data to ZooKeeper and FileSystem 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-230) Make changes for RM restart phase 1

2012-11-27 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504765#comment-13504765
 ] 

Tom White commented on YARN-230:


{quote}
The store and remove methods have been made mirrors because it helps maintain 
symmetry of operations that is logically clear. An actual implementation could 
choose to remove the entire app data including attempts in removeApplication() 
making removeApplicationAttempt() a no-op. So that alternative is not precluded 
in the current interface while still maintaining flexibility at the interface.
{quote}

Why is this flexibility needed? I can't see why it makes sense to remove an 
application and leave some application attempts around.

bq. I chose to not use directories for FileSystem because one could put a key 
value store behind a FileSystem interface and I am not sure how directories 
would work in them.

That's reasonable. With the orphan handling (deletion) on restart, the flat 
structure you have should work fine. (However, I don't think you need the 
removeApplicationAttempt() method.)

bq. One improvement would be to update the store with an attempts final state 
(failed/killed/succeeded) and wait for it to be recorded before completing the 
state machine.

I agree this can be done later.

bq. Could you please help by providing a good system path.

How about something like ${hadoop.tmp.dir}/yarn/system/rm-store?

> Make changes for RM restart phase 1
> ---
>
> Key: YARN-230
> URL: https://issues.apache.org/jira/browse/YARN-230
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Attachments: PB-impl.patch, Recovery.patch, Store.patch, Test.patch, 
> YARN-230.1.patch
>
>
> As described in YARN-128, phase 1 of RM restart puts in place mechanisms to 
> save application state and read them back after restart. Upon restart, the 
> NM's are asked to reboot and the previously running AM's are restarted.
> After this is done, RM HA and work preserving restart can continue in 
> parallel. For more details please refer to the design document in YARN-128

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-231) Add persistent store implementation for RMStateStore

2012-11-23 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13503229#comment-13503229
 ] 

Tom White commented on YARN-231:


Some initial feedback on FileSystemRMStateStore (I haven't looked at the ZK 
store):

* There are references to znodes in FileSystemRMStateStore.
* Failure to read a file in FileSystemRMStateStore should not cause the whole 
recovery process to fail, just that particular application or attempt.
* The literals "application_" and "appattempt_" should be made into constants 
that live in ApplicationId and ApplicationAttemptId.


> Add persistent store implementation for RMStateStore
> 
>
> Key: YARN-231
> URL: https://issues.apache.org/jira/browse/YARN-231
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Attachments: YARN-231.1.patch
>
>
> Add stores that write RM state data to ZooKeeper and FileSystem 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-230) Make changes for RM restart phase 1

2012-11-23 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13503228#comment-13503228
 ] 

Tom White commented on YARN-230:


Overall, this looks great. General feedback:

* Can we make application removal atomic? If the RM shuts down after a 
completed application is removed from the state store, but before the app 
attempts are removed from the store, then the app attempts may be orphaned. 
(There's a comment about it in FileSystemRMStateStore, but no action is taken 
so the attempt files will remain in the store.) It might be better to make 
RMStateStore#removeApplicationState responsible for removing the app attempts 
(i.e. remove removeApplicationAttemptState). This would solve the orphaning 
problem, and it would also make it possible to store the app attempts in a 
directory nested under the application directory, which would be nicer from a 
scaling point of view, and also for someone having to debug the state on the 
filesystem.
* If the RM shuts down before a (successful) completed application is removed 
from the state store, will it be rerun on restart, or will the fact that a 
successful app attempt was stored mean that it doesn't need to be rerun? 
Obviously, the second behaviour would be preferable.
* The exceptions thrown by the public methods of RMStateStore should be more 
specific than Exception.
* Let's have a default for yarn.resourcemanager.store.class in 
yarn-default.xml. StoreFactory has MemoryRMStateStore as the default, but 
that's not useful when running on a cluster; FileSystemRMStateStore would be 
better. Similarly it would be good to have the default location for the store 
be a system directory on the default file system. With these two changes folks 
would only need to set yarn.resourcemanager.recovery.enabled to true to enable 
recovery. (We might make that enabled by default at some point too.)
* MemoryRMStateStore#removeApplicationState will fail if asserts are disabled: 
the remove call should be made in a separate statement and assigned to a 
variable that can then be checked in the assert (see the sketch after this 
list). It's worth checking whether this problem exists elsewhere.
* Naming nit: Store was renamed to RMStateStore, so StoreFactory should be 
renamed to RMStateStoreFactory to match.
* Naming nit: zk.rm-state-store rather than zk.rmstatestore for consistency 
with other property names. Also for fs.rmstatestore, and 
zk.rmstatestore.parentpath (parent-path).
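
On the asserts point, a generic illustration (not the actual 
MemoryRMStateStore code) of the pattern to avoid and the fix:

{code}
import java.util.HashMap;
import java.util.Map;

// With -da (assertions disabled) an assert statement is skipped entirely,
// so any side effect inside it silently disappears.
public class AssertPitfall {
  private final Map<String, String> state = new HashMap<String, String>();

  void removeBroken(String key) {
    // BAD: the removal only happens when assertions are enabled.
    assert state.remove(key) != null;
  }

  void removeFixed(String key) {
    // GOOD: remove unconditionally, then assert on the result.
    String removed = state.remove(key);
    assert removed != null;
  }
}
{code}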


> Make changes for RM restart phase 1
> ---
>
> Key: YARN-230
> URL: https://issues.apache.org/jira/browse/YARN-230
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Attachments: PB-impl.patch, Recovery.patch, Store.patch, Test.patch, 
> YARN-230.1.patch
>
>
> As described in YARN-128, phase 1 of RM restart puts in place mechanisms to 
> save application state and read them back after restart. Upon restart, the 
> NM's are asked to reboot and the previously running AM's are restarted.
> After this is done, RM HA and work preserving restart can continue in 
> parallel. For more details please refer to the design document in YARN-128

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-229) Remove old code for restart

2012-11-23 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13503178#comment-13503178
 ] 

Tom White commented on YARN-229:


Can you make org.apache.hadoop.yarn.server.resourcemanager.recovery @Private 
@Unstable please (by adding a package-info.java file), since the contents are 
still in flux? Otherwise this looks good to me. 
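
For reference, the package-info.java would look roughly like this (a sketch 
using the standard Hadoop classification annotations; the exact layout in the 
patch may differ):

{code:java}
// package-info.java for the recovery package
@InterfaceAudience.Private
@InterfaceStability.Unstable
package org.apache.hadoop.yarn.server.resourcemanager.recovery;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
{code}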

> Remove old code for restart
> ---
>
> Key: YARN-229
> URL: https://issues.apache.org/jira/browse/YARN-229
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Fix For: 3.0.0
>
> Attachments: YARN-229.1.patch
>
>
> Much of the code is dead/commented out and is not executed. Removing it will 
> help with making and understanding new changes.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-128) Resurrect RM Restart

2012-11-19 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500434#comment-13500434
 ] 

Tom White commented on YARN-128:


I had a quick look at the new patches; FileSystemRMStateStore and 
ZKRMStateStore seem to be missing the default constructors that StoreFactory 
needs. You might change the tests to use StoreFactory to construct the store 
instances so that this code path is tested.
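
The default constructors matter because the configured store class is 
instantiated reflectively. A rough sketch of that code path, assuming 
StoreFactory does something along these lines (the actual implementation may 
differ):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ReflectionUtils;

public class StoreFactorySketch {
  // ReflectionUtils.newInstance needs a no-arg constructor on the target
  // class, which is why FileSystemRMStateStore and ZKRMStateStore must
  // provide default constructors.
  public static Object createStore(Configuration conf) {
    Class<?> storeClass =
        conf.getClass("yarn.resourcemanager.store.class", Object.class);
    return ReflectionUtils.newInstance(storeClass, conf);
  }
}
{code}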

> Resurrect RM Restart 
> -
>
> Key: YARN-128
> URL: https://issues.apache.org/jira/browse/YARN-128
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.0.0-alpha
>Reporter: Arun C Murthy
>Assignee: Bikas Saha
> Attachments: MR-4343.1.patch, restart-12-11-zkstore.patch, 
> restart-fs-store-11-17.patch, restart-zk-store-11-17.patch, 
> RM-recovery-initial-thoughts.txt, RMRestartPhase1.pdf, 
> YARN-128.full-code.3.patch, YARN-128.full-code-4.patch, 
> YARN-128.new-code-added.3.patch, YARN-128.new-code-added-4.patch, 
> YARN-128.old-code-removed.3.patch, YARN-128.old-code-removed.4.patch, 
> YARN-128.patch
>
>
> We should resurrect 'RM Restart' which we disabled sometime during the RM 
> refactor.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-72) NM should handle cleaning up containers when it shuts down ( and kill containers from an earlier instance when it comes back up after an unclean shutdown )

2012-11-19 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-72?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500195#comment-13500195
 ] 

Tom White commented on YARN-72:
---

Sandy, this looks like a good start, hooking in the code for container cleanup. 
I would focus on cleanup at shutdown in this patch, and tackle cleanup at 
startup in YARN-73.

As Bikas mentioned, there needs to be a timeout on waiting for the containers 
to shut down. The shutdown process waits for up to 
yarn.nodemanager.process-kill-wait.ms for the PID to appear, then 
yarn.nodemanager.sleep-delay-before-sigkill.ms before sending a SIGKILL signal 
(after a SIGTERM) if the process hasn't died - see 
ContainerLaunch#cleanupContainer. Waiting a little longer than the sum of these 
durations should be sufficient; a sketch of that calculation follows.
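
A rough sketch of the calculation (the property names are the ones above; the 
default values and the extra margin here are illustrative):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class ContainerCleanupTimeout {
  // Upper bound on how long NM shutdown should wait for containers to die:
  // the PID wait plus the SIGTERM-to-SIGKILL delay, plus a small margin.
  public static long maxWaitForContainersMs(Configuration conf) {
    long pidWaitMs =
        conf.getLong("yarn.nodemanager.process-kill-wait.ms", 2000);
    long sigkillDelayMs =
        conf.getLong("yarn.nodemanager.sleep-delay-before-sigkill.ms", 250);
    long marginMs = 1000;  // illustrative safety margin
    return pidWaitMs + sigkillDelayMs + marginMs;
  }
}
{code}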

Regarding testing, you could have a test like the one in 
TestContainerLaunch#testDelayedKill to test that containers are correctly 
cleaned up after stopping a NM.

> NM should handle cleaning up containers when it shuts down ( and kill 
> containers from an earlier instance when it comes back up after an unclean 
> shutdown )
> ---
>
> Key: YARN-72
> URL: https://issues.apache.org/jira/browse/YARN-72
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Hitesh Shah
>Assignee: Sandy Ryza
> Attachments: YARN-72.patch
>
>
> Ideally, the NM should wait for a limited amount of time when it gets a 
> shutdown signal for existing containers to complete and kill the containers ( 
> if we pick an aggressive approach ) after this time interval. 
> For NMs which come up after an unclean shutdown, the NM should look through 
> its directories for existing container.pids and try and kill an existing 
> containers matching the pids found. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-128) Resurrect RM Restart

2012-11-16 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498883#comment-13498883
 ] 

Tom White commented on YARN-128:


You are right about there being no race - I missed the comment! I opened 
YARN-218 for the killed/failed distinction as I agree it can be tackled 
separately.

> Resurrect RM Restart 
> -
>
> Key: YARN-128
> URL: https://issues.apache.org/jira/browse/YARN-128
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.0.0-alpha
>Reporter: Arun C Murthy
>Assignee: Bikas Saha
> Attachments: MR-4343.1.patch, restart-12-11-zkstore.patch, 
> RM-recovery-initial-thoughts.txt, RMRestartPhase1.pdf, 
> YARN-128.full-code.3.patch, YARN-128.full-code-4.patch, 
> YARN-128.new-code-added.3.patch, YARN-128.new-code-added-4.patch, 
> YARN-128.old-code-removed.3.patch, YARN-128.old-code-removed.4.patch, 
> YARN-128.patch
>
>
> We should resurrect 'RM Restart' which we disabled sometime during the RM 
> refactor.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-218) Distiguish between "failed" and "killed" app attempts

2012-11-16 Thread Tom White (JIRA)
Tom White created YARN-218:
--

 Summary: Distiguish between "failed" and "killed" app attempts
 Key: YARN-218
 URL: https://issues.apache.org/jira/browse/YARN-218
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Tom White
Assignee: Tom White


A "failed" app attempt is one that failed due to an error in the user program, 
as opposed to one that was "killed" by the system. Like in MapReduce task 
attempts, we should distinguish the two so that killed attempts do not count 
against the number of retries (yarn.resourcemanager.am.max-retries).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-128) Resurrect RM Restart

2012-11-16 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498825#comment-13498825
 ] 

Tom White commented on YARN-128:


Bikas, this looks good so far. Thanks for working on it. A few comments:

* Is there a race condition in ResourceManager#recover where RMAppImpl#recover 
is called after the StartAppAttemptTransition from resubmitting the app? The 
problem would be that the earlier app attempts (from before the restart) would 
not be the first ones, since the new attempt would get in first.
* I think we need the concept of a 'killed' app attempt (when the system is at 
fault, not the app) as well as a 'failed' attempt, like we have in MR task 
attempts. Without the distinction, a restart will count against the user's app 
attempts (default 1 retry), which is undesirable.
* Rather than change the ResourceManager constructor, you could read the 
recoveryEnabled flag from the configuration.
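
On the last point, a minimal sketch of reading the flag from configuration 
rather than threading it through the constructor (the property name is the one 
discussed in this thread; the constant and default are illustrative):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class RecoveryFlagSketch {
  // Illustrative constant; the real constant name/location may differ.
  public static final String RM_RECOVERY_ENABLED =
      "yarn.resourcemanager.recovery.enabled";

  public static boolean isRecoveryEnabled(Configuration conf) {
    // Off by default; operators opt in via yarn-site.xml.
    return conf.getBoolean(RM_RECOVERY_ENABLED, false);
  }
}
{code}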

> Resurrect RM Restart 
> -
>
> Key: YARN-128
> URL: https://issues.apache.org/jira/browse/YARN-128
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.0.0-alpha
>Reporter: Arun C Murthy
>Assignee: Bikas Saha
> Attachments: MR-4343.1.patch, RM-recovery-initial-thoughts.txt, 
> RMRestartPhase1.pdf, YARN-128.full-code.3.patch, YARN-128.full-code-4.patch, 
> YARN-128.new-code-added.3.patch, YARN-128.new-code-added-4.patch, 
> YARN-128.old-code-removed.3.patch, YARN-128.old-code-removed.4.patch, 
> YARN-128.patch
>
>
> We should resurrect 'RM Restart' which we disabled sometime during the RM 
> refactor.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-129) Simplify classpath construction for mini YARN tests

2012-11-07 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492416#comment-13492416
 ] 

Tom White commented on YARN-129:


The test failure is unrelated - see MAPREDUCE-4774.

bq. Tom, does this work when running tests from Maven? Asking because, if I 
recall correctly, Maven heavily uses classloaders and if they are used for the 
project classpath most likely the JVM java.class.path wont have the project 
classpath.

Yes it does. I checked and the JVM java.class.path is being set correctly by 
the Maven test runner.

> Simplify classpath construction for mini YARN tests
> ---
>
> Key: YARN-129
> URL: https://issues.apache.org/jira/browse/YARN-129
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: client
>Reporter: Tom White
>Assignee: Tom White
> Attachments: YARN-129.patch, YARN-129.patch, YARN-129.patch
>
>
> The test classpath includes a special file called 'mrapp-generated-classpath' 
> (or similar in distributed shell) that is constructed at build time, and 
> whose contents are a classpath with all the dependencies needed to run the 
> tests. When the classpath for a container (e.g. the AM) is constructed the 
> contents of mrapp-generated-classpath is read and added to the classpath, and 
> the file itself is then added to the classpath so that later when the AM 
> constructs a classpath for a task container it can propagate the test 
> classpath correctly.
> This mechanism can be drastically simplified by propagating the system 
> classpath of the current JVM (read from the java.class.path property) to a 
> launched JVM, but only if running in the context of the mini YARN cluster. 
> Any tests that use the mini YARN cluster will automatically work with this 
> change. Although any that explicitly deal with mrapp-generated-classpath can 
> be simplified.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-169) Update log4j.appender.EventCounter to use org.apache.hadoop.log.metrics.EventCounter

2012-11-07 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492378#comment-13492378
 ] 

Tom White commented on YARN-169:


+1 pending Jenkins.

> Update log4j.appender.EventCounter to use 
> org.apache.hadoop.log.metrics.EventCounter
> 
>
> Key: YARN-169
> URL: https://issues.apache.org/jira/browse/YARN-169
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.0.0-alpha
>Reporter: Anthony Rojas
>Priority: Minor
> Fix For: 2.0.3-alpha
>
> Attachments: YARN-169.patch
>
>
> We should update the log4j.appender.EventCounter in 
> hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/resources/container-log4j.properties
>  to use *org.apache.hadoop.log.metrics.EventCounter* rather than 
> *org.apache.hadoop.metrics.jvm.EventCounter* to avoid triggering the 
> following warning:
> {code}WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. 
> Please use org.apache.hadoop.log.metrics.EventCounter in all the 
> log4j.properties files{code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-164) Race condition in Fair Scheduler

2012-11-07 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492373#comment-13492373
 ] 

Tom White commented on YARN-164:


+1. I've just triggered Jenkins for this patch. The lack of a unit test is OK 
since it's very hard to test for the absence of a race condition.
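
The patch itself isn't reproduced here, but the race described below amounts to 
publishing the initialized flag only after the update thread has started. A 
minimal sketch of the shape of the fix, with illustrative names (set the flag 
before starting the thread, and make it visible across threads):

{code:java}
public class SchedulerInitSketch {
  private static final long UPDATE_INTERVAL = 500;  // illustrative

  // volatile so the update thread reliably sees writes from the init thread
  private volatile boolean initialized;

  public void initialize() {
    // Set the flag first, so the thread cannot observe initialized == false
    // and exit immediately.
    initialized = true;
    Thread updateThread = new Thread(new Runnable() {
      @Override
      public void run() {
        while (initialized) {
          try {
            Thread.sleep(UPDATE_INTERVAL);
            // update() and preemptTasksIfNecessary() would run here
          } catch (InterruptedException e) {
            return;
          }
        }
      }
    });
    updateThread.setDaemon(true);
    updateThread.start();
  }
}
{code}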

> Race condition in Fair Scheduler
> 
>
> Key: YARN-164
> URL: https://issues.apache.org/jira/browse/YARN-164
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.0.2-alpha
>Reporter: Devaraj K
>Assignee: Devaraj K
>Priority: Critical
> Attachments: YARN-164.patch
>
>
> {code:java}
>   Thread updateThread = new Thread(new UpdateThread());
>   updateThread.start();
>   initialized = true;
> {code}
> In the above code, initialized is set to true only after the UpdateThread has 
> been started.
> {code:java}
>   private class UpdateThread implements Runnable {
> public void run() {
>   while (initialized) {
> try {
>   Thread.sleep(UPDATE_INTERVAL);
>   update();
>   preemptTasksIfNecessary();
> } catch (Exception e) {
>   LOG.error("Exception in fair scheduler UpdateThread", e);
> }
>   }
> }
>   }
>  {code}
> In the run method of UpdateThread, the loop checks initialized and exits if 
> it is not yet true. Most of the time initialized only becomes true after the 
> UpdateThread has already exited, so the update logic never runs and all 
> submitted applications hang without making any progress.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-183) Clean up fair scheduler code

2012-11-07 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492348#comment-13492348
 ] 

Tom White commented on YARN-183:


+1, looks good to me. The patch doesn't apply any longer - could you regenerate 
it please, so we can get a clean Jenkins run? Thanks.

> Clean up fair scheduler code
> 
>
> Key: YARN-183
> URL: https://issues.apache.org/jira/browse/YARN-183
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.0.2-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>Priority: Minor
> Attachments: YARN-183.patch
>
>
> The fair scheduler code has a bunch of minor stylistic issues.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-129) Simplify classpath construction for mini YARN tests

2012-10-23 Thread Tom White (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom White updated YARN-129:
---

Attachment: YARN-129.patch

Thanks for taking a look, Vinod. I've updated the patch so it applies cleanly 
to trunk.

> Simplify classpath construction for mini YARN tests
> ---
>
> Key: YARN-129
> URL: https://issues.apache.org/jira/browse/YARN-129
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: client
>Reporter: Tom White
>Assignee: Tom White
> Attachments: YARN-129.patch, YARN-129.patch, YARN-129.patch
>
>
> The test classpath includes a special file called 'mrapp-generated-classpath' 
> (or similar in distributed shell) that is constructed at build time, and 
> whose contents are a classpath with all the dependencies needed to run the 
> tests. When the classpath for a container (e.g. the AM) is constructed the 
> contents of mrapp-generated-classpath is read and added to the classpath, and 
> the file itself is then added to the classpath so that later when the AM 
> constructs a classpath for a task container it can propagate the test 
> classpath correctly.
> This mechanism can be drastically simplified by propagating the system 
> classpath of the current JVM (read from the java.class.path property) to a 
> launched JVM, but only if running in the context of the mini YARN cluster. 
> Any tests that use the mini YARN cluster will automatically work with this 
> change. Although any that explicitly deal with mrapp-generated-classpath can 
> be simplified.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-145) Add a Web UI to the fair share scheduler

2012-10-11 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474035#comment-13474035
 ] 

Tom White commented on YARN-145:


+1. I ran a single-node cluster and monitored the fair scheduler page while 
running a job, and it looked correct.

> Add a Web UI to the fair share scheduler
> 
>
> Key: YARN-145
> URL: https://issues.apache.org/jira/browse/YARN-145
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Affects Versions: 2.0.1-alpha
>Reporter: Sandy Ryza
> Attachments: YARN-145.patch
>
>
> The fair scheduler had a UI in MR1.  Port the capacity scheduler web UI and 
> modify appropriately for the fair share scheduler.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (YARN-146) Add unit tests for computing fair share in the fair scheduler

2012-10-10 Thread Tom White (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom White reassigned YARN-146:
--

Assignee: Sandy Ryza

> Add unit tests for computing fair share in the fair scheduler
> -
>
> Key: YARN-146
> URL: https://issues.apache.org/jira/browse/YARN-146
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Affects Versions: 2.0.2-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
> Fix For: 2.0.2-alpha
>
> Attachments: YARN-146.patch
>
>
> MR1 had TestComputeFairShares.  This should go into the YARN fair scheduler.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-146) Add unit tests for computing fair share in the fair scheduler

2012-10-10 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13473329#comment-13473329
 ] 

Tom White commented on YARN-146:


Looks good to me. Nit: please change the test to use JUnit 4.

> Add unit tests for computing fair share in the fair scheduler
> -
>
> Key: YARN-146
> URL: https://issues.apache.org/jira/browse/YARN-146
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Affects Versions: 2.0.2-alpha
>Reporter: Sandy Ryza
> Fix For: 2.0.2-alpha
>
> Attachments: YARN-146.patch
>
>
> MR1 had TestComputeFairShares.  This should go into the YARN fair scheduler.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-129) Simplify classpath construction for mini YARN tests

2012-09-25 Thread Tom White (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom White updated YARN-129:
---

Attachment: YARN-129.patch

Patch to fix findbugs warnings.

> Simplify classpath construction for mini YARN tests
> ---
>
> Key: YARN-129
> URL: https://issues.apache.org/jira/browse/YARN-129
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: client
>Reporter: Tom White
>Assignee: Tom White
> Fix For: 2.0.3-alpha
>
> Attachments: YARN-129.patch, YARN-129.patch
>
>
> The test classpath includes a special file called 'mrapp-generated-classpath' 
> (or similar in distributed shell) that is constructed at build time, and 
> whose contents are a classpath with all the dependencies needed to run the 
> tests. When the classpath for a container (e.g. the AM) is constructed the 
> contents of mrapp-generated-classpath is read and added to the classpath, and 
> the file itself is then added to the classpath so that later when the AM 
> constructs a classpath for a task container it can propagate the test 
> classpath correctly.
> This mechanism can be drastically simplified by propagating the system 
> classpath of the current JVM (read from the java.class.path property) to a 
> launched JVM, but only if running in the context of the mini YARN cluster. 
> Any tests that use the mini YARN cluster will automatically work with this 
> change. Although any that explicitly deal with mrapp-generated-classpath can 
> be simplified.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-129) Simplify classpath construction for mini YARN tests

2012-09-25 Thread Tom White (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom White updated YARN-129:
---

Attachment: YARN-129.patch

Here's a patch that implements this idea.

> Simplify classpath construction for mini YARN tests
> ---
>
> Key: YARN-129
> URL: https://issues.apache.org/jira/browse/YARN-129
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: client
>Reporter: Tom White
>Assignee: Tom White
> Attachments: YARN-129.patch
>
>
> The test classpath includes a special file called 'mrapp-generated-classpath' 
> (or similar in distributed shell) that is constructed at build time, and 
> whose contents are a classpath with all the dependencies needed to run the 
> tests. When the classpath for a container (e.g. the AM) is constructed the 
> contents of mrapp-generated-classpath is read and added to the classpath, and 
> the file itself is then added to the classpath so that later when the AM 
> constructs a classpath for a task container it can propagate the test 
> classpath correctly.
> This mechanism can be drastically simplified by propagating the system 
> classpath of the current JVM (read from the java.class.path property) to a 
> launched JVM, but only if running in the context of the mini YARN cluster. 
> Any tests that use the mini YARN cluster will automatically work with this 
> change. Although any that explicitly deal with mrapp-generated-classpath can 
> be simplified.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-129) Simplify classpath construction for mini YARN tests

2012-09-25 Thread Tom White (JIRA)
Tom White created YARN-129:
--

 Summary: Simplify classpath construction for mini YARN tests
 Key: YARN-129
 URL: https://issues.apache.org/jira/browse/YARN-129
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: client
Reporter: Tom White
Assignee: Tom White


The test classpath includes a special file called 'mrapp-generated-classpath' 
(or similar in distributed shell) that is constructed at build time, and whose 
contents are a classpath with all the dependencies needed to run the tests. 
When the classpath for a container (e.g. the AM) is constructed the contents of 
mrapp-generated-classpath is read and added to the classpath, and the file 
itself is then added to the classpath so that later when the AM constructs a 
classpath for a task container it can propagate the test classpath correctly.

This mechanism can be drastically simplified by propagating the system 
classpath of the current JVM (read from the java.class.path property) to a 
launched JVM, but only if running in the context of the mini YARN cluster. Any 
tests that use the mini YARN cluster will automatically work with this change. 
Although any that explicitly deal with mrapp-generated-classpath can be 
simplified.
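
A rough sketch of that propagation (the mini-cluster flag and the 
environment-variable plumbing here are illustrative; the actual patch wires 
this into the existing YARN classpath construction):

{code:java}
import java.io.File;
import java.util.Map;

public class MiniClusterClasspathSketch {
  // Only when running under the mini YARN cluster, propagate the current
  // JVM's classpath to the launched container instead of relying on
  // mrapp-generated-classpath.
  public static void addSystemClasspath(boolean isMiniYarnCluster,
                                        Map<String, String> containerEnv) {
    if (!isMiniYarnCluster) {
      return;
    }
    String systemClasspath = System.getProperty("java.class.path");
    String existing = containerEnv.get("CLASSPATH");
    containerEnv.put("CLASSPATH",
        existing == null || existing.isEmpty()
            ? systemClasspath
            : existing + File.pathSeparator + systemClasspath);
  }
}
{code}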


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-105) DEFAULT_YARN_APPLICATION_CLASSPATH is incomplete

2012-09-13 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454763#comment-13454763
 ] 

Tom White commented on YARN-105:


Isn't this being fixed in MAPREDUCE-4421? I think the idea is that YARN 
shouldn't be managing MR JARs itself.

> DEFAULT_YARN_APPLICATION_CLASSPATH is incomplete
> 
>
> Key: YARN-105
> URL: https://issues.apache.org/jira/browse/YARN-105
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.0.0-alpha
>Reporter: Bo Wang
>Assignee: Bo Wang
> Fix For: 2.0.0-alpha
>
> Attachments: YARN-105.patch
>
>
> Currently, DEFAULT_YARN_APPLICATION_CLASSPATH only includes following:
> "$HADOOP_CONF_DIR", "$HADOOP_COMMON_HOME/share/hadoop/common/*",
> "$HADOOP_COMMON_HOME/share/hadoop/common/lib/*",
> "$HADOOP_HDFS_HOME/share/hadoop/hdfs/*",
> "$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*",
> "$YARN_HOME/share/hadoop/yarn/*",
> "$YARN_HOME/share/hadoop/yarn/lib/*"
> However, $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/* and 
> $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/* are missing. All classes of 
> mapreduce projects can't be found.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-81) Make sure YARN declares correct set of dependencies

2012-09-04 Thread Tom White (JIRA)
Tom White created YARN-81:
-

 Summary: Make sure YARN declares correct set of dependencies
 Key: YARN-81
 URL: https://issues.apache.org/jira/browse/YARN-81
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.1.0-alpha
Reporter: Tom White


This is the equivalent of HADOOP-8278 for YARN.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira