[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms

2016-04-05 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15226204#comment-15226204
 ] 

Robert Joseph Evans commented on YARN-4757:
---

I am not suggesting there is a DNS based solution.  I am not a DNS expert and 
was hopeful there could at least be a DNS based mitigation possible, but that 
hope has now faded.

I wanted to bring it up for discussion as part of the design so we go into this 
with our eyes wide open, and that at a minimum documenting it with examples for 
"fixing" it becomes a part of the final product.  That did not happen for the 
initial registry service, but probably should have.

> [Umbrella] Simplified discovery of services via DNS mechanisms
> --
>
> Key: YARN-4757
> URL: https://issues.apache.org/jira/browse/YARN-4757
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Jonathan Maron
> Attachments: YARN-4757- Simplified discovery of services via DNS 
> mechanisms.pdf
>
>
> [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track 
> all related efforts.]
> In addition to completing the present story of service­-registry (YARN-913), 
> we also need to simplify the access to the registry entries. The existing 
> read mechanisms of the YARN Service Registry are currently limited to a 
> registry specific (java) API and a REST interface. In practice, this makes it 
> very difficult for wiring up existing clients and services. For e.g, dynamic 
> configuration of dependent end­points of a service is not easy to implement 
> using the present registry­-read mechanisms, *without* code-changes to 
> existing services.
> A good solution to this is to expose the registry information through a more 
> generic and widely used discovery mechanism: DNS. Service Discovery via DNS 
> uses the well-­known DNS interfaces to browse the network for services. 
> YARN-913 in fact talked about such a DNS based mechanism but left it as a 
> future task. (Task) Having the registry information exposed via DNS 
> simplifies the life of services.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms

2016-04-05 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15226196#comment-15226196
 ] 

Robert Joseph Evans commented on YARN-4757:
---

I think that is through the naming convention, and the DNS configuration on the 
desktop in a foreign county.

I imagine that if I were doing this I would set it up so that the Hadoop DNS 
server would handle a set of sub-domains in my companies internal DNS setup.  
Then when my desktop is setup, or when my laptop connects to the VPN, the DNS 
server that it talks to would be configured to include one that also knows 
about the Hadoop setup.

But that is just my guess.

> [Umbrella] Simplified discovery of services via DNS mechanisms
> --
>
> Key: YARN-4757
> URL: https://issues.apache.org/jira/browse/YARN-4757
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Jonathan Maron
> Attachments: YARN-4757- Simplified discovery of services via DNS 
> mechanisms.pdf
>
>
> [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track 
> all related efforts.]
> In addition to completing the present story of service­-registry (YARN-913), 
> we also need to simplify the access to the registry entries. The existing 
> read mechanisms of the YARN Service Registry are currently limited to a 
> registry specific (java) API and a REST interface. In practice, this makes it 
> very difficult for wiring up existing clients and services. For e.g, dynamic 
> configuration of dependent end­points of a service is not easy to implement 
> using the present registry­-read mechanisms, *without* code-changes to 
> existing services.
> A good solution to this is to expose the registry information through a more 
> generic and widely used discovery mechanism: DNS. Service Discovery via DNS 
> uses the well-­known DNS interfaces to browse the network for services. 
> YARN-913 in fact talked about such a DNS based mechanism but left it as a 
> future task. (Task) Having the registry information exposed via DNS 
> simplifies the life of services.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms

2016-03-28 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15214258#comment-15214258
 ] 

Robert Joseph Evans commented on YARN-4757:
---

There are lots of ways to "fix" these issues on a case by case basis.  I mostly 
want to be sure that any documentation around YARN and service discovery is 
very clear that there are inherent races that can happen on shared 
infrastructure.  YARN/Slider cannot fix them for end users and any client 
talking to a secure application/server should validate that the server is the 
correct and expected server.  Concrete examples of how to do this would be 
great.  This is not a new issue.  It has existed since the registry service was 
first implemented.  We are simply making it much easier for a user to integrate 
off the shelf components that are coming from a more traditional 
infrastructure/deployment where this is not necessarily a concern.

> [Umbrella] Simplified discovery of services via DNS mechanisms
> --
>
> Key: YARN-4757
> URL: https://issues.apache.org/jira/browse/YARN-4757
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Jonathan Maron
> Attachments: YARN-4757- Simplified discovery of services via DNS 
> mechanisms.pdf
>
>
> [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track 
> all related efforts.]
> In addition to completing the present story of service­-registry (YARN-913), 
> we also need to simplify the access to the registry entries. The existing 
> read mechanisms of the YARN Service Registry are currently limited to a 
> registry specific (java) API and a REST interface. In practice, this makes it 
> very difficult for wiring up existing clients and services. For e.g, dynamic 
> configuration of dependent end­points of a service is not easy to implement 
> using the present registry­-read mechanisms, *without* code-changes to 
> existing services.
> A good solution to this is to expose the registry information through a more 
> generic and widely used discovery mechanism: DNS. Service Discovery via DNS 
> uses the well-­known DNS interfaces to browse the network for services. 
> YARN-913 in fact talked about such a DNS based mechanism but left it as a 
> future task. (Task) Having the registry information exposed via DNS 
> simplifies the life of services.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms

2016-03-24 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210513#comment-15210513
 ] 

Robert Joseph Evans commented on YARN-4757:
---

My concern about mutual authentication is mostly around documentation and 
asking if there is anything we can do to mitigate possible issues/attacks.  
Instead of talking about exact attacks lets talk about a few accidents that 
could happen, and can happen today, but are less likely because when I update 
my client to use the Registry API I make different assumptions about things.

Lets say I am running a web service on YARN, and I want my customers to be able 
to get to me in through existing tools.  So I set this all up and I have them 
go to http://api.bobby.yarncluster.myCompany.com:/ (or something else that 
matches the naming convention you had, I don't remember exactly and it is not 
relevant)  First of all I have no way to guarantee that  is open on any 
node, so it is a pain but I try to launch several web servers, finally get a 
few to come up, the others fail and get relaunched on other nodes eventually 
they are all up and running, and the AM puts all of them into the registry 
service.

Things are going well.  Customers are running using my service and everyone is 
happy.  But then I do a rolling upgrade and I kill one container and launch a 
new one on another box.  In the mean time some other container on the box I was 
running on grabs port  and brings up an internal web UI for it.  Now many 
of my customers trying to hit my web service but get this other process and see 
404 errors, etc.  Because DNS is eventual consistency, and there is a lot of 
caching happening, there is a race.  If the client does not authenticate the 
server, like with https, then someone malicious could exploit this to do all 
kinds of things.

I am simply saying that many people trust DNS a lot more than they should in 
their protocols, more so when they feel that they have DNSSEC turned on 
internally and they are going to an internal address that they can "trust".  By 
exposing YARN through DNS it did not make it any less secure, it just made it a 
lot simpler for someone to deploy something that is insecure.

> [Umbrella] Simplified discovery of services via DNS mechanisms
> --
>
> Key: YARN-4757
> URL: https://issues.apache.org/jira/browse/YARN-4757
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Jonathan Maron
> Attachments: YARN-4757- Simplified discovery of services via DNS 
> mechanisms.pdf
>
>
> [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track 
> all related efforts.]
> In addition to completing the present story of service­-registry (YARN-913), 
> we also need to simplify the access to the registry entries. The existing 
> read mechanisms of the YARN Service Registry are currently limited to a 
> registry specific (java) API and a REST interface. In practice, this makes it 
> very difficult for wiring up existing clients and services. For e.g, dynamic 
> configuration of dependent end­points of a service is not easy to implement 
> using the present registry­-read mechanisms, *without* code-changes to 
> existing services.
> A good solution to this is to expose the registry information through a more 
> generic and widely used discovery mechanism: DNS. Service Discovery via DNS 
> uses the well-­known DNS interfaces to browse the network for services. 
> YARN-913 in fact talked about such a DNS based mechanism but left it as a 
> future task. (Task) Having the registry information exposed via DNS 
> simplifies the life of services.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms

2016-03-24 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210465#comment-15210465
 ] 

Robert Joseph Evans commented on YARN-4757:
---

[~jmaron],

Thanks for the answers.

As for the SRV records and which IP address is returned it might be good to 
make it more clear in the document what you are proposing. 

Security makes since, it is almost exactly the same as what we do for storm (I 
really wish ZK had delegation tokens though).

My concern was not about do we have the ability to return multiple addresses.  
My concern was mostly about how many can we return.  Typically the address 
returned for google.com, etc are actually pointing to a very large load 
balancer and not individual web servers.  So the number of entries returned is 
on the order of the number of data centers someone has, or even more likely it 
is even higher level and it is around the number of geographic regions.  At 
Yahoo we run very large HBase clusters.  I'm not sure how well tools would 
handle getting back 2000 IP addressed for a record.  I mostly want to 
understand what if any theoretical limits there are to this technology and what 
if any practical limits there are.

> [Umbrella] Simplified discovery of services via DNS mechanisms
> --
>
> Key: YARN-4757
> URL: https://issues.apache.org/jira/browse/YARN-4757
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Jonathan Maron
> Attachments: YARN-4757- Simplified discovery of services via DNS 
> mechanisms.pdf
>
>
> [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track 
> all related efforts.]
> In addition to completing the present story of service­-registry (YARN-913), 
> we also need to simplify the access to the registry entries. The existing 
> read mechanisms of the YARN Service Registry are currently limited to a 
> registry specific (java) API and a REST interface. In practice, this makes it 
> very difficult for wiring up existing clients and services. For e.g, dynamic 
> configuration of dependent end­points of a service is not easy to implement 
> using the present registry­-read mechanisms, *without* code-changes to 
> existing services.
> A good solution to this is to expose the registry information through a more 
> generic and widely used discovery mechanism: DNS. Service Discovery via DNS 
> uses the well-­known DNS interfaces to browse the network for services. 
> YARN-913 in fact talked about such a DNS based mechanism but left it as a 
> future task. (Task) Having the registry information exposed via DNS 
> simplifies the life of services.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms

2016-03-23 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209067#comment-15209067
 ] 

Robert Joseph Evans commented on YARN-4757:
---

I also did a quick pass through the document and I wanted to clarify a few 
things.

So in some places in the document, like with names that map to containers and 
names that map to components it says something like "If Available" indicating 
that if an IP address is not assigned to the individual container no mapping 
will be made.  Am I interpreting that correctly? Are there situations where you 
would just return the IP Address of the node the container is running on? Am I 
just mistaken in my interpretation and there are different situations where we 
could launch a container that would have no IP address available.

However for the per application records there is no such conditional.  Does 
that mean that we will return records for any service API no matter how the IP 
Addresses are assigned, or there is no way for the IP Address to not be 
available?

Also I am not super familiar with the slider registry so perhaps you could 
clarify a few things there too.

How is authentication with zookeeper handled?  Is it always SASL+kerberos?  
Just because the doc mentions that the RM has to set up the base user directory 
with permissions.  Would then any secure slider app that wants to use the 
registry be required to ship a keytab with their application?

Also I am not super familiar with the existing registry API, from the example 
in the doc it shows a few different types of services that an Application 
Master can register.  Both Host/Port and URI.  Would we be exposing SRV records 
for both of these combinations?  If so how would they be named?

I am also curious about limits to various DNS fields both in the protocol and 
in practice with common implementations.  I am not an expert on DNS so if I say 
something silly after you stop laughing please let me know.  The document talks 
a lot about doing character remapping and having to have unique application 
names, but it does not talk about limits to the lengths of those names (I have 
seen some DNS servers don't support more then 254 character names).  What about 
limits on the number of IP addresses that can be returned for a given name.  I 
could not find anything specific but I have to assume that in practice most 
systems don't support a huge number of these, and large clusters on YARN can 
easily launch hundreds or even thousands of containers for a given service.

In addition to Allen's concerns the document does not seem to address/call out 
my initial concerns about requiring mutual authentication, or handling of port 
availability in scheduling.



> [Umbrella] Simplified discovery of services via DNS mechanisms
> --
>
> Key: YARN-4757
> URL: https://issues.apache.org/jira/browse/YARN-4757
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Jonathan Maron
> Attachments: YARN-4757- Simplified discovery of services via DNS 
> mechanisms.pdf
>
>
> [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track 
> all related efforts.]
> In addition to completing the present story of service­-registry (YARN-913), 
> we also need to simplify the access to the registry entries. The existing 
> read mechanisms of the YARN Service Registry are currently limited to a 
> registry specific (java) API and a REST interface. In practice, this makes it 
> very difficult for wiring up existing clients and services. For e.g, dynamic 
> configuration of dependent end­points of a service is not easy to implement 
> using the present registry­-read mechanisms, *without* code-changes to 
> existing services.
> A good solution to this is to expose the registry information through a more 
> generic and widely used discovery mechanism: DNS. Service Discovery via DNS 
> uses the well-­known DNS interfaces to browse the network for services. 
> YARN-913 in fact talked about such a DNS based mechanism but left it as a 
> future task. (Task) Having the registry information exposed via DNS 
> simplifies the life of services.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms

2016-03-14 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15193769#comment-15193769
 ] 

Robert Joseph Evans commented on YARN-4757:
---

[~aw], I am not expert on DNS so it is good to hear that you have thought 
through this and done your homework.  I read up a little on SRV records and it 
looks like a good fit.  It still does not change the need for 2 way 
authentication and making sure that we can restrict who registers for a 
service, but because SRV records are not a drop in replacement for A/CNAME 
records it should not be as big of an issue.

Clients are likely going to need to make changes to support SRV records, and 
from what I can tell java does not come with built in support, not the end of 
the world, but also likely non-trivial.  Especially when it looks like the 
industry has not decided on how they want to support http. (Although I could be 
wrong on all of that, because like I said I am not an expert here)

I just want to be sure that you are thinking things through, and it looks like 
you are so I am happy.

> [Umbrella] Simplified discovery of services via DNS mechanisms
> --
>
> Key: YARN-4757
> URL: https://issues.apache.org/jira/browse/YARN-4757
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Jonathan Maron
>
> [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track 
> all related efforts.]
> In addition to completing the present story of service­-registry (YARN-913), 
> we also need to simplify the access to the registry entries. The existing 
> read mechanisms of the YARN Service Registry are currently limited to a 
> registry specific (java) API and a REST interface. In practice, this makes it 
> very difficult for wiring up existing clients and services. For e.g, dynamic 
> configuration of dependent end­points of a service is not easy to implement 
> using the present registry­-read mechanisms, *without* code-changes to 
> existing services.
> A good solution to this is to expose the registry information through a more 
> generic and widely used discovery mechanism: DNS. Service Discovery via DNS 
> uses the well-­known DNS interfaces to browse the network for services. 
> YARN-913 in fact talked about such a DNS based mechanism but left it as a 
> future task. (Task) Having the registry information exposed via DNS 
> simplifies the life of services.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms

2016-03-14 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15193577#comment-15193577
 ] 

Robert Joseph Evans commented on YARN-4757:
---

I am +1 on the idea of using DNS for long lived service discovery, but we need 
to be very very careful about security.  If we are not all of the problems 
possible with https://en.wikipedia.org/wiki/DNS_spoofing would likely be 
possible with this too.  We need to be positive that we can restrict the names 
allowed so there are no conflicts with other servers on the network/internet.  
Additionally if we make this super simple, which is the entire goal here, then 
we are covering up some really potentially serious issues with client code, 
that a normal server running off YARN would not expect to have.  It really 
comes down to any service running on YARN that wants to be secure needs to have 
2 way authentication client authenticates server and server authenticates 
clients.  There are timing attacks and other things that can happen when a 
process crashes and lets go of a port.  Internal web services especially feel 
vulnerable because unless you enable SSL they will be insecure, something that 
many groups avoid on internal services because of the extra overhead of doing 
encryption.

Do you plan on handling ephemeral ports in some way? As far as I know there is 
no standard for including port(s) in a DNS entry.  If we do come up with 
something that is non-standard doesn't that still necessitate client side 
changes which was an expressed goal of this JIRA?  If we don't handle ephemeral 
ports are we going to add in mesos-like scheduling of ports?

  

> [Umbrella] Simplified discovery of services via DNS mechanisms
> --
>
> Key: YARN-4757
> URL: https://issues.apache.org/jira/browse/YARN-4757
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Jonathan Maron
>
> [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track 
> all related efforts.]
> In addition to completing the present story of service­-registry (YARN-913), 
> we also need to simplify the access to the registry entries. The existing 
> read mechanisms of the YARN Service Registry are currently limited to a 
> registry specific (java) API and a REST interface. In practice, this makes it 
> very difficult for wiring up existing clients and services. For e.g, dynamic 
> configuration of dependent end­points of a service is not easy to implement 
> using the present registry­-read mechanisms, *without* code-changes to 
> existing services.
> A good solution to this is to expose the registry information through a more 
> generic and widely used discovery mechanism: DNS. Service Discovery via DNS 
> uses the well-­known DNS interfaces to browse the network for services. 
> YARN-913 in fact talked about such a DNS based mechanism but left it as a 
> future task. (Task) Having the registry information exposed via DNS 
> simplifies the life of services.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3605) _ as method name may not be supported much longer

2015-05-18 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548375#comment-14548375
 ] 

Robert Joseph Evans commented on YARN-3605:
---

This is not a newbie issue.  The code that has the _ method in it is generated 
code, and the code that generates it is far from simple.  This is also 
technically a backwards incompatible change, because other YARN applications 
could be using it.

 _ as method name may not be supported much longer
 -

 Key: YARN-3605
 URL: https://issues.apache.org/jira/browse/YARN-3605
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Robert Joseph Evans

 I was trying to run the precommit test on my mac under JDK8, and I got the 
 following error related to javadocs.
  
  (use of '_' as an identifier might not be supported in releases after Java 
 SE 8)
 It looks like we need to at least change the method name to not be '_' any 
 more, or possibly replace the HTML generation with something more standard. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3605) _ as method name may not be supported much longer

2015-05-18 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated YARN-3605:
--
Labels:   (was: newbie)

 _ as method name may not be supported much longer
 -

 Key: YARN-3605
 URL: https://issues.apache.org/jira/browse/YARN-3605
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Robert Joseph Evans

 I was trying to run the precommit test on my mac under JDK8, and I got the 
 following error related to javadocs.
  
  (use of '_' as an identifier might not be supported in releases after Java 
 SE 8)
 It looks like we need to at least change the method name to not be '_' any 
 more, or possibly replace the HTML generation with something more standard. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-644) Basic null check is not performed on passed in arguments before using them in ContainerManagerImpl.startContainer

2015-05-08 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated YARN-644:
-
Labels:   (was: BB2015-05-RFC)

 Basic null check is not performed on passed in arguments before using them in 
 ContainerManagerImpl.startContainer
 -

 Key: YARN-644
 URL: https://issues.apache.org/jira/browse/YARN-644
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Omkar Vinit Joshi
Assignee: Varun Saxena
Priority: Minor
 Attachments: YARN-644.001.patch, YARN-644.002.patch, 
 YARN-644.03.patch, YARN-644.04.patch, YARN-644.05.patch


 I see that validation/ null check is not performed on passed in parameters. 
 Ex. tokenId.getContainerID().getApplicationAttemptId() inside 
 ContainerManagerImpl.authorizeRequest()
 I guess we should add these checks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3605) _ as method name may not be supported much longer

2015-05-08 Thread Robert Joseph Evans (JIRA)
Robert Joseph Evans created YARN-3605:
-

 Summary: _ as method name may not be supported much longer
 Key: YARN-3605
 URL: https://issues.apache.org/jira/browse/YARN-3605
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Robert Joseph Evans


I was trying to run the precommit test on my mac under JDK8, and I got the 
following error related to javadocs.
 
 (use of '_' as an identifier might not be supported in releases after Java SE 
8)

It looks like we need to at least change the method name to not be '_' any 
more, or possibly replace the HTML generation with something more standard. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-644) Basic null check is not performed on passed in arguments before using them in ContainerManagerImpl.startContainer

2015-05-08 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534803#comment-14534803
 ] 

Robert Joseph Evans commented on YARN-644:
--

Thanks [~varun_saxena], I agree with [~gtCarrera9] +1.  I'll check this in.

 Basic null check is not performed on passed in arguments before using them in 
 ContainerManagerImpl.startContainer
 -

 Key: YARN-644
 URL: https://issues.apache.org/jira/browse/YARN-644
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Omkar Vinit Joshi
Assignee: Varun Saxena
Priority: Minor
 Attachments: YARN-644.001.patch, YARN-644.002.patch, 
 YARN-644.03.patch, YARN-644.04.patch, YARN-644.05.patch


 I see that validation/ null check is not performed on passed in parameters. 
 Ex. tokenId.getContainerID().getApplicationAttemptId() inside 
 ContainerManagerImpl.authorizeRequest()
 I guess we should add these checks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3148) allow CORS related headers to passthrough in WebAppProxyServlet

2015-05-08 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534957#comment-14534957
 ] 

Robert Joseph Evans commented on YARN-3148:
---

The changes look fine to me.  Not sure why the patch could not apply.  The 
queue is full right now, so I will try to run things manually.

 allow CORS related headers to passthrough in WebAppProxyServlet
 ---

 Key: YARN-3148
 URL: https://issues.apache.org/jira/browse/YARN-3148
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.7.0
Reporter: Prakash Ramachandran
Assignee: Varun Saxena
  Labels: BB2015-05-RFC
 Attachments: YARN-3148.001.patch, YARN-3148.02.patch, 
 YARN-3148.03.patch


 currently the WebAppProxyServlet filters the request headers as defined by  
 passThroughHeaders. Tez UI is building a webapp which using rest api to fetch 
 data from the am via the rm tracking url. 
 for this purpose it would be nice to have additional headers allowed 
 especially the ones related to CORS. A few of them that would help are 
 * Origin
 * Access-Control-Request-Method
 * Access-Control-Request-Headers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3148) allow CORS related headers to passthrough in WebAppProxyServlet

2015-05-08 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated YARN-3148:
--
Labels:   (was: BB2015-05-RFC)

 allow CORS related headers to passthrough in WebAppProxyServlet
 ---

 Key: YARN-3148
 URL: https://issues.apache.org/jira/browse/YARN-3148
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.7.0
Reporter: Prakash Ramachandran
Assignee: Varun Saxena
 Attachments: YARN-3148.001.patch, YARN-3148.02.patch, 
 YARN-3148.03.patch


 currently the WebAppProxyServlet filters the request headers as defined by  
 passThroughHeaders. Tez UI is building a webapp which using rest api to fetch 
 data from the am via the rm tracking url. 
 for this purpose it would be nice to have additional headers allowed 
 especially the ones related to CORS. A few of them that would help are 
 * Origin
 * Access-Control-Request-Method
 * Access-Control-Request-Headers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3148) allow CORS related headers to passthrough in WebAppProxyServlet

2015-05-08 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated YARN-3148:
--
Labels: BB2015-05-RFC  (was: )

 allow CORS related headers to passthrough in WebAppProxyServlet
 ---

 Key: YARN-3148
 URL: https://issues.apache.org/jira/browse/YARN-3148
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.7.0
Reporter: Prakash Ramachandran
Assignee: Varun Saxena
  Labels: BB2015-05-RFC
 Attachments: YARN-3148.001.patch, YARN-3148.02.patch, 
 YARN-3148.03.patch


 currently the WebAppProxyServlet filters the request headers as defined by  
 passThroughHeaders. Tez UI is building a webapp which using rest api to fetch 
 data from the am via the rm tracking url. 
 for this purpose it would be nice to have additional headers allowed 
 especially the ones related to CORS. A few of them that would help are 
 * Origin
 * Access-Control-Request-Method
 * Access-Control-Request-Headers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2261) YARN should have a way to run post-application cleanup

2014-07-11 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058745#comment-14058745
 ] 

Robert Joseph Evans commented on YARN-2261:
---

+1 either approach seems fine to me.  Vinod's requires an opt in, which is nice 
from a backwards compatibility standpoint.  Also do we want to count the 
cleanup container as a running application?  We definitely need to count its 
resources against any queue it is a part of, but for a queue that is configured 
to run mostly large applications, it could have other applications back up 
behind the cleanup containers.

 YARN should have a way to run post-application cleanup
 --

 Key: YARN-2261
 URL: https://issues.apache.org/jira/browse/YARN-2261
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli

 See MAPREDUCE-5956 for context. Specific options are at 
 https://issues.apache.org/jira/browse/MAPREDUCE-5956?focusedCommentId=14054562page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14054562.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2261) YARN should have a way to run post-application cleanup

2014-07-11 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14059060#comment-14059060
 ] 

Robert Joseph Evans commented on YARN-2261:
---

Yes and that is not necessarily a good thing.  Especially if cleanup can take a 
relatively long period of time.

 YARN should have a way to run post-application cleanup
 --

 Key: YARN-2261
 URL: https://issues.apache.org/jira/browse/YARN-2261
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli

 See MAPREDUCE-5956 for context. Specific options are at 
 https://issues.apache.org/jira/browse/MAPREDUCE-5956?focusedCommentId=14054562page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14054562.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM

2014-07-08 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14055588#comment-14055588
 ] 

Robert Joseph Evans commented on YARN-611:
--

Why is the reset policy created on a per app ATTEMPT basis? Shouldn't it be on 
a per application basis.  Wouldn't having more then one 
WindowsSlideAMRetryCountResetPolicy per application be a waste as they will 
either be running in parallel racing with each other, or there will be extra 
overhead to stop and start them for each application attempt?

Inside WindowsSlideAMRetryCountResetPolicy you create a new Timer.  Timer 
instances create a new thread, I am not sure we really need a new thread for 
potentially each application, just so the thread can wakeup every few seconds 
to reset a counter.

Inside WindowsSlideAMRetryCountResetPolicy.amRetryCountReset we call 
rmApp.getCurrentAppAttempt() in a loop.  Why don't we cache it?

I also don't really like how the code handles locking.  To me it always feels 
bad to hold a lock while calling into a class that may call back into you, 
especially from a different thread.  The WindowsSlideAMRetryCountResetPolicy 
calls into getAppAttemptId, shouldCountTowardsMaxAttemptRetry, 
mayBeLastAttempt, and setMaybeLastAttemptFlag of RmAppAttemptImpl. 
RmAppAttemptImpl calls into start, stop, and recover for the resetPolicy.  
Right now I don't think there are any potential deadlocks because 
RmAppAttemptImpl never holds a lock while interacting directly with 
resetPolicy, but if it ever does then it could deadlock.  I'm not sure of a 
good way to fix this, except perhaps through comments in the ResetPolicy 
interface specifying that start/stop/recover will never be called while holding 
a lock for RMAppAttempt or RMApp.

 Add an AM retry count reset window to YARN RM
 -

 Key: YARN-611
 URL: https://issues.apache.org/jira/browse/YARN-611
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.3-alpha
Reporter: Chris Riccomini
Assignee: Xuan Gong
 Attachments: YARN-611.1.patch, YARN-611.2.patch, YARN-611.3.patch


 YARN currently has the following config:
 yarn.resourcemanager.am.max-retries
 This config defaults to 2, and defines how many times to retry a failed AM 
 before failing the whole YARN job. YARN counts an AM as failed if the node 
 that it was running on dies (the NM will timeout, which counts as a failure 
 for the AM), or if the AM dies.
 This configuration is insufficient for long running (or infinitely running) 
 YARN jobs, since the machine (or NM) that the AM is running on will 
 eventually need to be restarted (or the machine/NM will fail). In such an 
 event, the AM has not done anything wrong, but this is counted as a failure 
 by the RM. Since the retry count for the AM is never reset, eventually, at 
 some point, the number of machine/NM failures will result in the AM failure 
 count going above the configured value for 
 yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the 
 job as failed, and shut it down. This behavior is not ideal.
 I propose that we add a second configuration:
 yarn.resourcemanager.am.retry-count-window-ms
 This configuration would define a window of time that would define when an AM 
 is well behaved, and it's safe to reset its failure count back to zero. 
 Every time an AM fails the RmAppImpl would check the last time that the AM 
 failed. If the last failure was less than retry-count-window-ms ago, and the 
 new failure count is  max-retries, then the job should fail. If the AM has 
 never failed, the retry count is  max-retries, or if the last failure was 
 OUTSIDE the retry-count-window-ms, then the job should be restarted. 
 Additionally, if the last failure was outside the retry-count-window-ms, then 
 the failure count should be set back to 0.
 This would give developers a way to have well-behaved AMs run forever, while 
 still failing mis-behaving AMs after a short period of time.
 I think the work to be done here is to change the RmAppImpl to actually look 
 at app.attempts, and see if there have been more than max-retries failures in 
 the last retry-count-window-ms milliseconds. If there have, then the job 
 should fail, if not, then the job should go forward. Additionally, we might 
 also need to add an endTime in either RMAppAttemptImpl or 
 RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the 
 failure.
 Thoughts?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM

2014-07-07 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054168#comment-14054168
 ] 

Robert Joseph Evans commented on YARN-611:
--

Why are you using java serialization for the retry policy?  There are too many 
problems with java serialization, especially if we are persisting it into a DB, 
like the state store.  Please switch to using something like protocol buffers 
that will allow for forward/backward compatible modifications going forward.

in the javadocs for RMApp.setRetryCount it would be good to explain what retry 
count actually is and does.

In the constructor for RMAppAttemptImpl there is special logic to call setup 
only for WindowsSlideAMRetryCountResetPolicy.  This completely loses the 
abstraction that the AMResetCountPolicy interface should be providing.  Please 
update the interface so that you don't need special case code for a single 
implementation.

In RMAppAttemptImpl you mark setMaybeLastAttemptFlag as Private this really 
should have been done in the parent interface. In the parent interface you also 
add in myBeLastAttempt() this too should be marked as Private and both of them 
should have comments that these are for testing.

 Add an AM retry count reset window to YARN RM
 -

 Key: YARN-611
 URL: https://issues.apache.org/jira/browse/YARN-611
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.3-alpha
Reporter: Chris Riccomini
Assignee: Xuan Gong
 Attachments: YARN-611.1.patch


 YARN currently has the following config:
 yarn.resourcemanager.am.max-retries
 This config defaults to 2, and defines how many times to retry a failed AM 
 before failing the whole YARN job. YARN counts an AM as failed if the node 
 that it was running on dies (the NM will timeout, which counts as a failure 
 for the AM), or if the AM dies.
 This configuration is insufficient for long running (or infinitely running) 
 YARN jobs, since the machine (or NM) that the AM is running on will 
 eventually need to be restarted (or the machine/NM will fail). In such an 
 event, the AM has not done anything wrong, but this is counted as a failure 
 by the RM. Since the retry count for the AM is never reset, eventually, at 
 some point, the number of machine/NM failures will result in the AM failure 
 count going above the configured value for 
 yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the 
 job as failed, and shut it down. This behavior is not ideal.
 I propose that we add a second configuration:
 yarn.resourcemanager.am.retry-count-window-ms
 This configuration would define a window of time that would define when an AM 
 is well behaved, and it's safe to reset its failure count back to zero. 
 Every time an AM fails the RmAppImpl would check the last time that the AM 
 failed. If the last failure was less than retry-count-window-ms ago, and the 
 new failure count is  max-retries, then the job should fail. If the AM has 
 never failed, the retry count is  max-retries, or if the last failure was 
 OUTSIDE the retry-count-window-ms, then the job should be restarted. 
 Additionally, if the last failure was outside the retry-count-window-ms, then 
 the failure count should be set back to 0.
 This would give developers a way to have well-behaved AMs run forever, while 
 still failing mis-behaving AMs after a short period of time.
 I think the work to be done here is to change the RmAppImpl to actually look 
 at app.attempts, and see if there have been more than max-retries failures in 
 the last retry-count-window-ms milliseconds. If there have, then the job 
 should fail, if not, then the job should go forward. Additionally, we might 
 also need to add an endTime in either RMAppAttemptImpl or 
 RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the 
 failure.
 Thoughts?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2140) Add support for network IO isolation/scheduling for containers

2014-06-13 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14030643#comment-14030643
 ] 

Robert Joseph Evans commented on YARN-2140:
---

We are working on similar things for storm.  I am very interested in your 
design, because for any streaming system to truly have a chance on YARN soft 
guarantees on network I/O are critical.  There are several big problems with 
network I/O even if the user can effectively estimate what they will need.  The 
first is that the resource is not limited to a single node in the cluster.  The 
network has a topology and a bottlekneck can show up at any point in that 
topology.  So you may think you are fine because each node in a rack is not 
scheduled to be using the full bandwidth that the network card(s) can support.  
But you can easily have saturated the top of rack switch without knowing it.  
To solve this problem you effectively have to know the topology of the 
application itself.  So that you can schedule the node to node network 
connections within that application. if users don't know how much network they 
are going to use at a high level, they will never have any idea at a low level. 
 But then you also have the big problem of batch being very bursty in its 
network usage.  The only way to solve this is going to require network hardware 
support for prioritizing packets.

But I'll wait for your design before writing too much more.

 Add support for network IO isolation/scheduling for containers
 --

 Key: YARN-2140
 URL: https://issues.apache.org/jira/browse/YARN-2140
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Wei Yan
Assignee: Wei Yan





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data

2014-01-15 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13872300#comment-13872300
 ] 

Robert Joseph Evans commented on YARN-1530:
---

I agree that we need to think about load and plan for something that can handle 
at least 20x the current load but preferably 100x.  However, I am not that sure 
that the load will be a huge problem at least for current MR clusters.  We have 
seen very large jobs as well, but 700 MB history file job does not finish 
instantly.  I took a look at a 3500 node cluster we have that is under fairly 
heavy load, and looking at the done directory for yesterday, I saw what 
amounted to about 1.7MB/sec of data on average.  Gigabit Ethernet should be 
able to handle 15 to 20 times this (assuming that we read as much data as we 
write, and that the storage may require some replication).

I am fine with the proposed solution by [~lohit] so long as the history service 
always provides a restful interface and the AM can decide if it wants to use 
it, or go through a different higher load channel.  Otherwise non-java based 
AMs would not necessarily be able to write to the history service.

I am also a bit nervous about using the history service for recovery or as a 
backend for the current MR APIs if we have a pub/sub system as a link between 
the applications and the history service.  I don't think it is a show stopper, 
it just opens the door for a number of corner cases that will have to be dealt 
with, like an MR AM crashes badly and the client goes to the history service to 
get the counters/etc, when does the history service know that all of the events 
for the MR AM have been processed so it can return those counters, or perhaps 
other data?  I am not totally sure what data may be a show stopper for this, 
but the lag means all applications have to be sure that they don't use the 
history service for split brain problems or things like that.

 [Umbrella] Store, manage and serve per-framework application-timeline data
 --

 Key: YARN-1530
 URL: https://issues.apache.org/jira/browse/YARN-1530
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
 Attachments: application timeline design-20140108.pdf


 This is a sibling JIRA for YARN-321.
 Today, each application/framework has to do store, and serve per-framework 
 data all by itself as YARN doesn't have a common solution. This JIRA attempts 
 to solve the storage, management and serving of per-framework data from 
 various applications, both running and finished. The aim is to change YARN to 
 collect and store data in a generic manner with plugin points for frameworks 
 to do their own thing w.r.t interpretation and serving.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (YARN-321) Generic application history service

2013-12-23 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13855707#comment-13855707
 ] 

Robert Joseph Evans commented on YARN-321:
--

The way it currently works is based off of group permissions on a directory 
(this is from memory from a while ago so I could be off on a few things).  In 
HDFS when you create a file the group of the file is the group of the directory 
the file is a part of, similar to the sticky bit on a directory in Linux.  When 
an MR job completes it will copy it's history log file, along with a few other 
files, to a drop box like location called intermediate done and atomically 
rename it from a temp name to the final name.  The directory is world writable, 
but only readable by a special group that the history server is a part of, but 
general users are not.  The history server then wakes up periodically and will 
scan that directory for new files, when it sees new files it will move them to 
a final location that is owned by the headless history server user.  If a query 
comes in for a job that the history server is not aware of, it will also scan 
the intermediate done directory before failing.

Reading history data is done through RPC to the history server, or through the 
web interface, including RESTful APIs.  There is no supported way for an app to 
read history data directly though the file system.  I hope this helps.

 Generic application history service
 ---

 Key: YARN-321
 URL: https://issues.apache.org/jira/browse/YARN-321
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Luke Lu
Assignee: Vinod Kumar Vavilapalli
 Attachments: AHS Diagram.pdf, ApplicationHistoryServiceHighLevel.pdf, 
 Generic Application History - Design-20131219.pdf, HistoryStorageDemo.java


 The mapreduce job history server currently needs to be deployed as a trusted 
 server in sync with the mapreduce runtime. Every new application would need a 
 similar application history server. Having to deploy O(T*V) (where T is 
 number of type of application, V is number of version of application) trusted 
 servers is clearly not scalable.
 Job history storage handling itself is pretty generic: move the logs and 
 history data into a particular directory for later serving. Job history data 
 is already stored as json (or binary avro). I propose that we create only one 
 trusted application history server, which can have a generic UI (display json 
 as a tree of strings) as well. Specific application/version can deploy 
 untrusted webapps (a la AMs) to query the application history server and 
 interpret the json for its specific UI and/or analytics.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application

2013-10-25 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805618#comment-13805618
 ] 

Robert Joseph Evans commented on YARN-941:
--

That sounds like a great default.  I would like to also have a way for an AM to 
say I can handle updating tokens without being shot, but that may be something 
that shows up in a follow on JIRA.

 RM Should have a way to update the tokens it has for a running application
 --

 Key: YARN-941
 URL: https://issues.apache.org/jira/browse/YARN-941
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Robert Joseph Evans

 When an application is submitted to the RM it includes with it a set of 
 tokens that the RM will renew on behalf of the application, that will be 
 passed to the AM when the application is launched, and will be used when 
 launching the application to access HDFS to download files on behalf of the 
 application.
 For long lived applications/services these tokens can expire, and then the 
 tokens that the AM has will be invalid, and the tokens that the RM had will 
 also not work to launch a new AM.
 We need to provide an API that will allow the RM to replace the current 
 tokens for this application with a new set.  To avoid any real race issues, I 
 think this API should be something that the AM calls, so that the client can 
 connect to the AM with a new set of tokens it got using kerberos, then the AM 
 can inform the RM of the new set of tokens and quickly update its tokens 
 internally to use these new ones.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-321) Generic application history service

2013-10-09 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13790555#comment-13790555
 ] 

Robert Joseph Evans commented on YARN-321:
--

I like the diagrams, but I want to understand if the generic application 
history service is intended to replace the job history server, or to just 
augment it?

I would prefer it if we could replace the current server. Perhaps not in the 
first release, but eventually.  To make that work we would have to provide a 
way for MR specific code to come up and run inside the service, exposing both 
the current restful web service, an application specific UI, and the RPC server 
that we currently run.

 Generic application history service
 ---

 Key: YARN-321
 URL: https://issues.apache.org/jira/browse/YARN-321
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Luke Lu
Assignee: Vinod Kumar Vavilapalli
 Attachments: AHS Diagram.pdf, ApplicationHistoryServiceHighLevel.pdf, 
 HistoryStorageDemo.java


 The mapreduce job history server currently needs to be deployed as a trusted 
 server in sync with the mapreduce runtime. Every new application would need a 
 similar application history server. Having to deploy O(T*V) (where T is 
 number of type of application, V is number of version of application) trusted 
 servers is clearly not scalable.
 Job history storage handling itself is pretty generic: move the logs and 
 history data into a particular directory for later serving. Job history data 
 is already stored as json (or binary avro). I propose that we create only one 
 trusted application history server, which can have a generic UI (display json 
 as a tree of strings) as well. Specific application/version can deploy 
 untrusted webapps (a la AMs) to query the application history server and 
 interpret the json for its specific UI and/or analytics.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-913) Add a way to register long-lived services in a YARN cluster

2013-10-09 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated YARN-913:
-

Attachment: RegistrationServiceDetails.txt

Uploading a file that shows some examples of the registration service APIs.  
Any feedback on them is appreciated.

 Add a way to register long-lived services in a YARN cluster
 ---

 Key: YARN-913
 URL: https://issues.apache.org/jira/browse/YARN-913
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api
Affects Versions: 3.0.0
Reporter: Steve Loughran
Assignee: Robert Joseph Evans
 Attachments: RegistrationServiceDetails.txt


 In a YARN cluster you can't predict where services will come up -or on what 
 ports. The services need to work those things out as they come up and then 
 publish them somewhere.
 Applications need to be able to find the service instance they are to bond to 
 -and not any others in the cluster.
 Some kind of service registry -in the RM, in ZK, could do this. If the RM 
 held the write access to the ZK nodes, it would be more secure than having 
 apps register with ZK themselves.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Assigned] (YARN-913) Add a way to register long-lived services in a YARN cluster

2013-10-04 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans reassigned YARN-913:


Assignee: Robert Joseph Evans

 Add a way to register long-lived services in a YARN cluster
 ---

 Key: YARN-913
 URL: https://issues.apache.org/jira/browse/YARN-913
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api
Affects Versions: 3.0.0
Reporter: Steve Loughran
Assignee: Robert Joseph Evans

 In a YARN cluster you can't predict where services will come up -or on what 
 ports. The services need to work those things out as they come up and then 
 publish them somewhere.
 Applications need to be able to find the service instance they are to bond to 
 -and not any others in the cluster.
 Some kind of service registry -in the RM, in ZK, could do this. If the RM 
 held the write access to the ZK nodes, it would be more secure than having 
 apps register with ZK themselves.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster

2013-10-04 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13786299#comment-13786299
 ] 

Robert Joseph Evans commented on YARN-913:
--

Yes it does have plenty of races.  I'll try to get some detailed designs up 
shortly but at a high level the general idea is to have a restful web service.  
For the most common use case there just needs to be two interfaces.

  - Register/Monitor a Service
  - Query for Services

Part of the reason we need the service registry is to securely verify that a 
client is talking to the real service, and no one has grabbed the service's 
port after it registered.  To do that I want to have the concept of a verified 
service.  For that we would need an admin interface for adding, updating, and 
removing verified services.

The registry would provide a number of pluggable ways for services to 
authenticate.  Part of adding a verified service would include indicating which 
authentication models the service can use to register and which users are 
allowed to register that service.

The registry could also act like a trusted Certificate Authority.  Another part 
of adding in a verified service would include indicating how clients could 
verify they are talking to the true service.  This could include just 
publishing an application id so the client can go to the RM and get a 
delegation token. Another option would be having the service generate a 
public/private key pair.  When the service registers it would get the private 
key and the public key would be available through the discovery interface.

The plan is to also have the registry monitor the service similar to ZK.  The 
service would heartbeat in to the registry periodically (could be on the order 
of mins depending on the service) after a certain period of time of inactivity 
the service would be removed from the registry. Perhaps we should add in an 
explicit unregister as well.  

I want to make sure that the data model it is generic enough that we could 
support something like a web service on the gird where each server can register 
itself and all of them would show up in the registry, so a service could have 
one or more servers that are a part of it, and each server could have some 
separate metadata about it.

I also want to have a plug-in interface for discovery, so we could potentially 
make the registry look like a DNS server or an SSL Certificate Authority which 
would make compatibility with existing applications and clients a lot simpler. 



 Add a way to register long-lived services in a YARN cluster
 ---

 Key: YARN-913
 URL: https://issues.apache.org/jira/browse/YARN-913
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api
Affects Versions: 3.0.0
Reporter: Steve Loughran
Assignee: Robert Joseph Evans

 In a YARN cluster you can't predict where services will come up -or on what 
 ports. The services need to work those things out as they come up and then 
 publish them somewhere.
 Applications need to be able to find the service instance they are to bond to 
 -and not any others in the cluster.
 Some kind of service registry -in the RM, in ZK, could do this. If the RM 
 held the write access to the ZK nodes, it would be more secure than having 
 apps register with ZK themselves.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-624) Support gang scheduling in the AM RM protocol

2013-10-03 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13785556#comment-13785556
 ] 

Robert Joseph Evans commented on YARN-624:
--

[~curino] Sorry about the late reply.  I have not really tested this much with 
storm on YARN.  Most of our experiments it is negligible the amount of time it 
takes to get nodes.  But we have not really done anything serious with it, and 
adding new nodes right now is a manual operation.

 Support gang scheduling in the AM RM protocol
 -

 Key: YARN-624
 URL: https://issues.apache.org/jira/browse/YARN-624
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api, scheduler
Affects Versions: 2.0.4-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza

 Per discussion on YARN-392 and elsewhere, gang scheduling, in which a 
 scheduler runs a set of tasks when they can all be run at the same time, 
 would be a useful feature for YARN schedulers to support.
 Currently, AMs can approximate this by holding on to containers until they 
 get all the ones they need.  However, this lends itself to deadlocks when 
 different AMs are waiting on the same containers.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-896) Roll up for long lived YARN

2013-08-30 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13754787#comment-13754787
 ] 

Robert Joseph Evans commented on YARN-896:
--

I agree that providing a good way handle stdout and stderr is important. I 
don't know if I want the NM to be doing this for us though, but that is an 
implementation detail that we can talk about on the follow up JIRA.  Chris, 
feel free to file a JIRA for rolling of stdout and stderr and we can look into 
what it will take to support that properly.

 Roll up for long lived YARN
 ---

 Key: YARN-896
 URL: https://issues.apache.org/jira/browse/YARN-896
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Robert Joseph Evans

 YARN is intended to be general purpose, but it is missing some features to be 
 able to truly support long lived applications and long lived containers.
 This ticket is intended to
  # discuss what is needed to support long lived processes
  # track the resulting JIRA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-896) Roll up for long lived YARN

2013-08-19 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13743819#comment-13743819
 ] 

Robert Joseph Evans commented on YARN-896:
--

[~criccomini],

That is a great point.  To do this we need the application to somehow inform 
YARN that it is a long lived application.  We could do this either through some 
sort of metadata that is submitted with the application to YARN, possibly 
through the service registry, or even perhaps just setting the progress to a 
special value like -1.  I think I would prefer the first one, because then YARN 
could use that metadata later on for other things.  After that the UI change 
should not be too difficult.  If you want to file a JIRA for it, either as a 
sub task or just link it in, that would be great.

 Roll up for long lived YARN
 ---

 Key: YARN-896
 URL: https://issues.apache.org/jira/browse/YARN-896
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Robert Joseph Evans

 YARN is intended to be general purpose, but it is missing some features to be 
 able to truly support long lived applications and long lived containers.
 This ticket is intended to
  # discuss what is needed to support long lived processes
  # track the resulting JIRA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-810) Support CGroup ceiling enforcement on CPU

2013-08-15 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13741170#comment-13741170
 ] 

Robert Joseph Evans commented on YARN-810:
--

Sorry I am a bit late to this discussion.  I don't like the config to be 
global.  I think it needs to be on a per container basis.

{quote}There are certain cases where this is desirable. There are also certain 
cases where it might be desirable to have a hard limit on CPU usage, and not 
allow the process to go above the specified resource requirement, even if it's 
available.{quote}

The question is are there ever two different applications running on the same 
cluster where it is desirable for one, and not for the other.  I believe that 
is true.  I argued this in YARN-102 where you want to measure how long an 
application will take to run under a specific CPU resource request.  If I allow 
it to go over I will never know how long it would take worst case, and so I 
will never know if my config is correct unless I can artificially limit it.  
But in production I don't want to run worst case every time, and I don't want a 
special test cluster to see what the worst case is.  

 Support CGroup ceiling enforcement on CPU
 -

 Key: YARN-810
 URL: https://issues.apache.org/jira/browse/YARN-810
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.1.0-beta, 2.0.5-alpha
Reporter: Chris Riccomini
Assignee: Sandy Ryza

 Problem statement:
 YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. 
 Containers are then allowed to request vcores between the minimum and maximum 
 defined in the yarn-site.xml.
 In the case where a single-threaded container requests 1 vcore, with a 
 pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of 
 the core it's using, provided that no other container is also using it. This 
 happens, even though the only guarantee that YARN/CGroups is making is that 
 the container will get at least 1/4th of the core.
 If a second container then comes along, the second container can take 
 resources from the first, provided that the first container is still getting 
 at least its fair share (1/4th).
 There are certain cases where this is desirable. There are also certain cases 
 where it might be desirable to have a hard limit on CPU usage, and not allow 
 the process to go above the specified resource requirement, even if it's 
 available.
 Here's an RFC that describes the problem in more detail:
 http://lwn.net/Articles/336127/
 Solution:
 As it happens, when CFS is used in combination with CGroups, you can enforce 
 a ceiling using two files in cgroups:
 {noformat}
 cpu.cfs_quota_us
 cpu.cfs_period_us
 {noformat}
 The usage of these two files is documented in more detail here:
 https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html
 Testing:
 I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, 
 it behaves as described above (it is a soft cap, and allows containers to use 
 more than they asked for). I then tested CFS CPU quotas manually with YARN.
 First, you can see that CFS is in use in the CGroup, based on the file names:
 {noformat}
 [criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/
 total 0
 -r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs
 drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares
 -r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat
 -rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release
 -rw-r--r-- 1 app app 0 Jun 13 16:46 tasks
 [criccomi@eat1-qa464 ~]$ sudo -u app cat
 /cgroup/cpu/hadoop-yarn/cpu.cfs_period_us
 10
 [criccomi@eat1-qa464 ~]$ sudo -u app cat
 /cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us
 -1
 {noformat}
 Oddly, it appears that the cfs_period_us is set to .1s, not 1s.
 We can place processes in hard limits. I have process 4370 running YARN 
 container container_1371141151815_0003_01_03 on a host. By default, it's 
 running at ~300% cpu usage.
 {noformat}
 CPU
 4370 criccomi  20   0 1157m 551m  14m S 240.3  0.8  87:10.91 ...
 {noformat}
 When I set the CFS quote:
 {noformat}
 echo 1000  
 /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us
  CPU
 4370 criccomi  20   0 1157m 563m  14m S  1.0  0.8  90:08.39 

[jira] [Commented] (YARN-1024) Define a virtual core unambigiously

2013-08-15 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13741518#comment-13741518
 ] 

Robert Joseph Evans commented on YARN-1024:
---

I am fine with that too.

 Define a virtual core unambigiously
 ---

 Key: YARN-1024
 URL: https://issues.apache.org/jira/browse/YARN-1024
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Arun C Murthy
Assignee: Arun C Murthy

 We need to clearly define the meaning of a virtual core unambiguously so that 
 it's easy to migrate applications between clusters.
 For e.g. here is Amazon EC2 definition of ECU: 
 http://aws.amazon.com/ec2/faqs/#What_is_an_EC2_Compute_Unit_and_why_did_you_introduce_it
 Essentially we need to clearly define a YARN Virtual Core (YVC).
 Equivalently, we can use ECU itself: *One EC2 Compute Unit provides the 
 equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.*

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1024) Define a virtual core unambigiously

2013-08-14 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13739853#comment-13739853
 ] 

Robert Joseph Evans commented on YARN-1024:
---

{quote}Sorry for the longwindedness.{quote}

From what people have told me you still have a long ways to go before you 
approach me for longwindedness :).

My initial gut reaction is that only having two numbers to express the request 
seems too simplified, but the more I think about it the more I am OK with it, 
although I think I would change the numbers to be total YCUs requested and 
minimum YCUs per core.  This gives the user better viability into how the 
scheduler is treating these numbers so they can better reason about them. The 
total YCUs is the value used for scheduling.  The minimum YCUs per core is 
compared to the maxComputeUnitsPerCore like was suggested to reject a request 
as not possible, or in the case of a heterogeneous environment restrict the 
hosts that this container can run on.  Although I am OK with the original 
proposal too.

I would also like us to have a flag that would either limit the container to 
the requested CPU and let it have no more even when more is available, or would 
let it expand to use whatever CPU was free, but would be guaranteed to get at 
least the YCUs requested.  This is likely something that would have to be done 
on a separate JIRA though.  Without this I don't see a way to really get 
simplicity, predictability, or consistency.  1 MB of RAM is fairly simple to 
understand.  It can be measured without too much of a problem just by running 
the process.  Most user do a simple search for the correct value run with the 
default, if it does not work I increase the amount and run again.  1 YCU is 
very complex to measure for an application.  If I cannot restrict a container 
to never use more than what was requested I cannot consistently predict how 
long it will take to run later.  Without this I don't know how to answer the 
question I know will come up.

What should I set these values to?


 Define a virtual core unambigiously
 ---

 Key: YARN-1024
 URL: https://issues.apache.org/jira/browse/YARN-1024
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Arun C Murthy
Assignee: Arun C Murthy

 We need to clearly define the meaning of a virtual core unambiguously so that 
 it's easy to migrate applications between clusters.
 For e.g. here is Amazon EC2 definition of ECU: 
 http://aws.amazon.com/ec2/faqs/#What_is_an_EC2_Compute_Unit_and_why_did_you_introduce_it
 Essentially we need to clearly define a YARN Virtual Core (YVC).
 Equivalently, we can use ECU itself: *One EC2 Compute Unit provides the 
 equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.*

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-624) Support gang scheduling in the AM RM protocol

2013-08-12 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13736894#comment-13736894
 ] 

Robert Joseph Evans commented on YARN-624:
--

From my perspective it does not really solve the problem for me.  It comes 
close but is not perfect.  I am interested in gang scheduling to support 
[storm on yarn|https://github.com/yahoo/storm-yarn/]

The biggest issue I have with this design is knowing the size before the 
application is launched.  The ultimate goal with storm is to have a system 
where multiple separate, but related, storm topologies are processing data 
using the same application.  We would configure the queues so that if storm 
sees a spike in demand it can steal containers from batch processing to grow a 
topology and when the spike goes away it would release those containers back.  
If the number of containers changes dynamically, by both submitting new 
topologies and growing/shrinking existing ones it is impossible to tell YARN 
what I need at the beginning.

Gang scheduling is interesting for me because there is a specific number of 
containers that each topology is configured to need when that topology is 
launched.  Without all of those containers there is no reason to launch a 
single part of the topology. I can see this happening with a modification to 
your approach where the all or nothing happens when the AM submits a request, 
and not when the AM is submitted.

I also have a hard time seeing how this would work well with other advanced 
features like preemption.  For preemption to work well with gang scheduling it 
needs to take into account that if it shoots a container in a gang of 
containers it is likely going to get back a lot more resources then just one 
container.  If it is aware of this then it can still shoot the container, but 
avoid shooting other containers needlessly because it knows what it is going to 
get back.

 Support gang scheduling in the AM RM protocol
 -

 Key: YARN-624
 URL: https://issues.apache.org/jira/browse/YARN-624
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api, scheduler
Affects Versions: 2.0.4-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza

 Per discussion on YARN-392 and elsewhere, gang scheduling, in which a 
 scheduler runs a set of tasks when they can all be run at the same time, 
 would be a useful feature for YARN schedulers to support.
 Currently, AMs can approximate this by holding on to containers until they 
 get all the ones they need.  However, this lends itself to deadlocks when 
 different AMs are waiting on the same containers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1024) Define a virtual core unambigiously

2013-08-12 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13736919#comment-13736919
 ] 

Robert Joseph Evans commented on YARN-1024:
---

Perhaps I am missing something here.  The goals Arun has asked for are 
simplicity, predictability, and consistency.  Simplicity I totally agree with, 
but I do not totally agree with always having predictability and consistency 
after simplicity, and I do not agree that they are always required.  These two 
come with a trade-off with utilization, and this is something that Sandy 
brought up, although not directly.  For HBase guaranteed resources, in terms of 
both parallelism and raw CPU speed are important because it is using those to 
provide a service where predictability and consistency are needed. If the HBase 
AM cannot truly express to YARN what it needs because of simplicity HBase on 
YARN will not be used, because it will not behave the way users need/expect it 
to.  Similarly if HBase is allowed to steal resources from others you can 
easily request too little resources on an underutilized cluster and when the 
cluster is under load it falls apart.

This is similar for me with my desire for Storm on YARN.  I am happy to use a 
complex API to express my needs if it means that I get what I need.  On the 
other hand, if I am doing MR batch processing most of the time (but not all of 
it) I am doing single threaded processing and I really just want it to fill in 
the gaps and use as much unused CPU as it can.  Yes, some MR jobs have strict 
SLAs but most do not and it is best if we can provide a scheduler that can 
balance both.

I also don't agree that because YARN lacks the ability to schedule everything 
that impacts performance, including network and disk IO, that we should skip 
doing CPU correctly.  Some applications are truly CPU bound and they will 
benefit.  For other resources we can add them to YARN as they are needed until 
we do meet the goal of predictability and consistency.

 Define a virtual core unambigiously
 ---

 Key: YARN-1024
 URL: https://issues.apache.org/jira/browse/YARN-1024
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Arun C Murthy
Assignee: Arun C Murthy

 We need to clearly define the meaning of a virtual core unambiguously so that 
 it's easy to migrate applications between clusters.
 For e.g. here is Amazon EC2 definition of ECU: 
 http://aws.amazon.com/ec2/faqs/#What_is_an_EC2_Compute_Unit_and_why_did_you_introduce_it
 Essentially we need to clearly define a YARN Virtual Core (YVC).
 Equivalently, we can use ECU itself: *One EC2 Compute Unit provides the 
 equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.*

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-896) Roll up for long lived YARN

2013-08-09 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734983#comment-13734983
 ] 

Robert Joseph Evans commented on YARN-896:
--

Sorry I have not responded sooner.  I have been out on vacation and had a high 
severity issue that has consumed a lot of my time.

[~lmccay] and [~thw] There are many different services that long lived 
processes need to communicate with.  Many of these services use tokens and 
others may not.  Each of these tokens or other credentials are specific to the 
services being accessed.  In some cases like with HBase we probably can take 
advantage of the existing renewal feature in the RM.  With other tokens or 
credentials it may be different, and may require AM specific support for them. 
I am not really that concerned with solving the renewal problem for all 
possible credentials here, although if we can solve this for a lot of common 
tokens at the same time that would be great. What I care most about is being 
sure that a long lived YARN application does not necessarily have to stop and 
restart because an HDFS token cannot be renewed any longer.  If there are 
changes going into the HDFS security model that would make YARN-941 unnecessary 
that is great.  I have not had much time to follow the security discussion so 
thank you for pointing this out.  But it is also a question of time frames.  
YARN-941 and YARN-1041 would allow for secure, robust, long lived applications 
on YARN, and do not appear to be that difficult to accomplish.  Do you know the 
time frame for the security rework?

 Roll up for long lived YARN
 ---

 Key: YARN-896
 URL: https://issues.apache.org/jira/browse/YARN-896
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Robert Joseph Evans

 YARN is intended to be general purpose, but it is missing some features to be 
 able to truly support long lived applications and long lived containers.
 This ticket is intended to
  # discuss what is needed to support long lived processes
  # track the resulting JIRA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-896) Roll up for long lived YARN

2013-07-19 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713692#comment-13713692
 ] 

Robert Joseph Evans commented on YARN-896:
--

[~thw] I am not totally sure what you mean by app specific tokens.  Is this 
tokens that the app is going to use to connect to other services like HBase? or 
is it something else?

[~eric14] and [~enis] Rolling upgrades is a very interesting use case.  We can 
definitely add in a ticket to support this type of thing.  I agree that it 
needs to be thought through some, and is going to require help from both the AM 
and YARN to do it properly.

 Roll up for long lived YARN
 ---

 Key: YARN-896
 URL: https://issues.apache.org/jira/browse/YARN-896
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Robert Joseph Evans

 YARN is intended to be general purpose, but it is missing some features to be 
 able to truly support long lived applications and long lived containers.
 This ticket is intended to
  # discuss what is needed to support long lived processes
  # track the resulting JIRA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-941) RM Should have a way to update the tokens it has for a running application

2013-07-19 Thread Robert Joseph Evans (JIRA)
Robert Joseph Evans created YARN-941:


 Summary: RM Should have a way to update the tokens it has for a 
running application
 Key: YARN-941
 URL: https://issues.apache.org/jira/browse/YARN-941
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Robert Joseph Evans


When an application is submitted to the RM it includes with it a set of tokens 
that the RM will renew on behalf of the application, that will be passed to the 
AM when the application is launched, and will be used when launching the 
application to access HDFS to download files on behalf of the application.

For long lived applications/services these tokens can expire, and then the 
tokens that the AM has will be invalid, and the tokens that the RM had will 
also not work to launch a new AM.

We need to provide an API that will allow the RM to replace the current tokens 
for this application with a new set.  To avoid any real race issues, I think 
this API should be something that the AM calls, so that the client can connect 
to the AM with a new set of tokens it got using kerberos, then the AM can 
inform the RM of the new set of tokens and quickly update its tokens internally 
to use these new ones.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application

2013-07-19 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713961#comment-13713961
 ] 

Robert Joseph Evans commented on YARN-941:
--

I am punting on how/if we get the new HDFS token to NMs to be used for log 
aggregation.  We need to think a bit more about how logs should be handled for 
long lived services before we spend a lot of time trying to make log 
aggregation work.

 RM Should have a way to update the tokens it has for a running application
 --

 Key: YARN-941
 URL: https://issues.apache.org/jira/browse/YARN-941
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Robert Joseph Evans

 When an application is submitted to the RM it includes with it a set of 
 tokens that the RM will renew on behalf of the application, that will be 
 passed to the AM when the application is launched, and will be used when 
 launching the application to access HDFS to download files on behalf of the 
 application.
 For long lived applications/services these tokens can expire, and then the 
 tokens that the AM has will be invalid, and the tokens that the RM had will 
 also not work to launch a new AM.
 We need to provide an API that will allow the RM to replace the current 
 tokens for this application with a new set.  To avoid any real race issues, I 
 think this API should be something that the AM calls, so that the client can 
 connect to the AM with a new set of tokens it got using kerberos, then the AM 
 can inform the RM of the new set of tokens and quickly update its tokens 
 internally to use these new ones.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-896) Roll up for long lived YARN

2013-07-19 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713992#comment-13713992
 ] 

Robert Joseph Evans commented on YARN-896:
--

I filed one new JIRA for updating tokens in the RM YARN-941.

I started to file a JIRA for the AM to be informed of the location of its 
already running containers, but as I was writing it I realized that it will not 
give us enough information to be able to reattach to the containers.  The only 
thing it will give us is enough info to be able to go shoot the containers.  
Simply because there is no metadata about what port the container may be 
listening on or anything like that.  It seems to me that we would be better off 
keeping a log, similar to the MR job history log, that has in it all the data 
the AM needs to look for running containers.  If others see a different need 
for this API, I am still happy to file a JIRA for it.

I have not filed a JIRA for anti-affinity yet either.  I seem to remember 
another JIRA for something like this already, but I have not found it yet. I 
figure we can add in a long lived process flag for the scheduler when we run 
across a use case for it.

The other parts discussed here, either already have a JIRA associated with the 
same functionality, or I think need a bit more discussion about exactly what we 
want to do.  Namely log aggregation/processing and Hadoop package 
management/rolling upgrades of live applications.

If I missed something please let me know.

 Roll up for long lived YARN
 ---

 Key: YARN-896
 URL: https://issues.apache.org/jira/browse/YARN-896
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Robert Joseph Evans

 YARN is intended to be general purpose, but it is missing some features to be 
 able to truly support long lived applications and long lived containers.
 This ticket is intended to
  # discuss what is needed to support long lived processes
  # track the resulting JIRA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-896) Roll up for long lived YARN

2013-07-10 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13704623#comment-13704623
 ] 

Robert Joseph Evans commented on YARN-896:
--

Chris, Yes I missed the app master retry issue.  Those two with the discussion 
on them seem to cover what we are looking for.

 Roll up for long lived YARN
 ---

 Key: YARN-896
 URL: https://issues.apache.org/jira/browse/YARN-896
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Robert Joseph Evans

 YARN is intended to be general purpose, but it is missing some features to be 
 able to truly support long lived applications and long lived containers.
 This ticket is intended to
  # discuss what is needed to support long lived processes
  # track the resulting JIRA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-896) Roll up for long lived YARN

2013-07-09 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13703622#comment-13703622
 ] 

Robert Joseph Evans commented on YARN-896:
--

No comments in the past few days.  I would like to hear from more people 
involved, even if it is just to say that it looks like we have everything 
covered here.  Then we can start filing JIRAs and getting some work done.

 Roll up for long lived YARN
 ---

 Key: YARN-896
 URL: https://issues.apache.org/jira/browse/YARN-896
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Robert Joseph Evans

 YARN is intended to be general purpose, but it is missing some features to be 
 able to truly support long lived applications and long lived containers.
 This ticket is intended to
  # discuss what is needed to support long lived processes
  # track the resulting JIRA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-896) Roll up for long lived YARN

2013-07-02 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698500#comment-13698500
 ] 

Robert Joseph Evans commented on YARN-896:
--

During the most recent Hadoop Summit there was a developer meetup where we 
discussed some of these issues.  This is to summarize what was discussed at 
that meeting and to add in a few things that have also been discussed on 
mailing lists and other places.

HDFS delegation tokens have a maximum life time. Currently tokens submitted to 
the RM when the app master is launched will be renewed by the RM until the 
application finishes and the logs from the application have finished 
aggregating.  The only token currently used by the YARN framework is the HDFS 
delegation token.  This is used to read files from HDFS as part of the 
distributed cache and to write the aggregated logs out to HDFS.

In order to support relaunching an app master after the HDFS the maximum 
lifetime of the HDFS delegation token, we either need to allow for tokens that 
do not expire or provide an API to allow the RM to replace the old token with a 
new one.  Because removing the maximum lifetime of a token reduces the security 
of the cluster as a whole I think it would be better to provide an API to 
replace the token with a new one.

If we want to continue supporting log aggregation we also need to provide a way 
for the Node Managers to get the new token too.  It is assumed that each app 
master will also provide an API to get the new token so it can start using it.


Log aggregation is another issue, although not required for long lived 
applications to work.  Logs are aggregated into HDFS when the application 
finishes.  This is not really that useful for applications that are never 
intended to exit.  Ideally the processing of logs by the node manager should be 
pluggable so that clusters and applications can select how and when logs are 
processed/displayed to the end user.  Because many of these systems roll their 
logs to avoid filling up disks we will probably need a protocol of some sort 
for the container to communicate with the Node Manager when logs are ready to 
be processed.

Another issue is to allow containers to out live the app master that launched 
them and also to allow containers to outlive the node manager that launched 
them.  This is especially critical for the stability of applications durring 
rolling upgrades to YARN.

 Roll up for long lived YARN
 ---

 Key: YARN-896
 URL: https://issues.apache.org/jira/browse/YARN-896
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Robert Joseph Evans

 YARN is intended to be general purpose, but it is missing some features to be 
 able to truly support long lived applications and long lived containers.
 This ticket is intended to
  # discuss what is needed to support long lived processes
  # track the resulting JIRA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-896) Roll up for long lived YARN

2013-07-02 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698505#comment-13698505
 ] 

Robert Joseph Evans commented on YARN-896:
--

Another issue that has been discussed in the past is the impact that long lived 
processes can have on resource scheduling. It is possible for a long lived 
process to grab lots of resources and then never release them even though it is 
using more resources than it would be allowed to have when the cluster is full. 
 Recent preemption changes should be able to prevent this from happening 
between different queues/pools, but we may need to think if we need more 
control about this within a queue.

 Roll up for long lived YARN
 ---

 Key: YARN-896
 URL: https://issues.apache.org/jira/browse/YARN-896
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Robert Joseph Evans

 YARN is intended to be general purpose, but it is missing some features to be 
 able to truly support long lived applications and long lived containers.
 This ticket is intended to
  # discuss what is needed to support long lived processes
  # track the resulting JIRA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-624) Support gang scheduling in the AM RM protocol

2013-05-20 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13662101#comment-13662101
 ] 

Robert Joseph Evans commented on YARN-624:
--

I would love to have it right now for storm too. If you want me to sign up as a 
use case I am happy to. 

 Support gang scheduling in the AM RM protocol
 -

 Key: YARN-624
 URL: https://issues.apache.org/jira/browse/YARN-624
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api, scheduler
Affects Versions: 2.0.4-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza

 Per discussion on YARN-392 and elsewhere, gang scheduling, in which a 
 scheduler runs a set of tasks when they can all be run at the same time, 
 would be a useful feature for YARN schedulers to support.
 Currently, AMs can approximate this by holding on to containers until they 
 get all the ones they need.  However, this lends itself to deadlocks when 
 different AMs are waiting on the same containers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-690) RM exits on token cancel/renew problems

2013-05-20 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13662105#comment-13662105
 ] 

Robert Joseph Evans commented on YARN-690:
--

Vinod,

Yes creating and resolving a JIRA in 2 hours is not ideal, but this is a 
Blocker that consisted on only a handful of lines of change, also the bylaws 
explicitly state that a waiting period is not needed for this vote because 
committers can retroactively -1 and pull the change out.  I agree that waiting 
to let others look at the code is good and if it were not a Blocker I would 
have waited.

 RM exits on token cancel/renew problems
 ---

 Key: YARN-690
 URL: https://issues.apache.org/jira/browse/YARN-690
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 3.0.0, 0.23.7, 2.0.5-beta
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Blocker
 Fix For: 3.0.0, 2.0.5-beta, 0.23.8

 Attachments: YARN-690.patch, YARN-690.patch


 The DelegationTokenRenewer thread is critical to the RM.  When a 
 non-IOException occurs, the thread calls System.exit to prevent the RM from 
 running w/o the thread.  It should be exiting only on non-RuntimeExceptions.
 The problem is especially bad in 23 because the yarn protobuf layer converts 
 IOExceptions into UndeclaredThrowableExceptions (RuntimeException) which 
 causes the renewer to abort the process.  An UnknownHostException takes down 
 the RM...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-624) Support gang scheduling in the AM RM protocol

2013-05-20 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13662352#comment-13662352
 ] 

Robert Joseph Evans commented on YARN-624:
--

Storm is a real-time stream processing system.  We are working on porting this 
to run on YARN.  Storm will process one or more streams of data using a logical 
DAG of processing nodes called a topology.  This topology runs in spawned 
processes. If there are not enough processes to run a topology there is no 
point in launching any of the processes.  Hence the need for gang scheduling.

It is a very simple gang scheduling use case currently.  When a new topology is 
submitted we want to request enough resources to to run that topology.  If a 
node goes down, we are going to request enough resources to replace it, so we 
can get up and running again ASAP.  When a topology is killed we want to 
release those resources.

Long term we would like to make sure that the different containers are close to 
each other from a network topology perspective. We don't care which node or 
rack the containers are on, but we do care that they are all on the same 
node/rack as the other containers.

 Support gang scheduling in the AM RM protocol
 -

 Key: YARN-624
 URL: https://issues.apache.org/jira/browse/YARN-624
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api, scheduler
Affects Versions: 2.0.4-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza

 Per discussion on YARN-392 and elsewhere, gang scheduling, in which a 
 scheduler runs a set of tasks when they can all be run at the same time, 
 would be a useful feature for YARN schedulers to support.
 Currently, AMs can approximate this by holding on to containers until they 
 get all the ones they need.  However, this lends itself to deadlocks when 
 different AMs are waiting on the same containers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-690) RM exits on token cancel/renew problems

2013-05-16 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13660001#comment-13660001
 ] 

Robert Joseph Evans commented on YARN-690:
--

I don't think this does what you want.  Now an IOException will cause the same 
issue.  I think you need to handle runtime and IOException separately. 

 RM exits on token cancel/renew problems
 ---

 Key: YARN-690
 URL: https://issues.apache.org/jira/browse/YARN-690
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 3.0.0, 0.23.7, 2.0.5-beta
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Blocker
 Attachments: YARN-690.patch


 The DelegationTokenRenewer thread is critical to the RM.  When a 
 non-IOException occurs, the thread calls System.exit to prevent the RM from 
 running w/o the thread.  It should be exiting only on non-RuntimeExceptions.
 The problem is especially bad in 23 because the yarn protobuf layer converts 
 IOExceptions into UndeclaredThrowableExceptions (RuntimeException) which 
 causes the renewer to abort the process.  An UnknownHostException takes down 
 the RM...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-690) RM exits on token cancel/renew problems

2013-05-16 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13660035#comment-13660035
 ] 

Robert Joseph Evans commented on YARN-690:
--

The change looks fine to me now. +1

 RM exits on token cancel/renew problems
 ---

 Key: YARN-690
 URL: https://issues.apache.org/jira/browse/YARN-690
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 3.0.0, 0.23.7, 2.0.5-beta
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Blocker
 Attachments: YARN-690.patch, YARN-690.patch


 The DelegationTokenRenewer thread is critical to the RM.  When a 
 non-IOException occurs, the thread calls System.exit to prevent the RM from 
 running w/o the thread.  It should be exiting only on non-RuntimeExceptions.
 The problem is especially bad in 23 because the yarn protobuf layer converts 
 IOExceptions into UndeclaredThrowableExceptions (RuntimeException) which 
 causes the renewer to abort the process.  An UnknownHostException takes down 
 the RM...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-528) Make IDs read only

2013-04-30 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13645991#comment-13645991
 ] 

Robert Joseph Evans commented on YARN-528:
--

The approach seems OK to me, but I would rather have the impl be an even 
thinner wrapper.

{code}
  private ApplicationIdProto proto = null;
  private ApplicationIdProto.Builder builder = null;

  ApplicationIdPBImpl(ApplicationIdProto proto) {
this.proto = proto;
  }

  public ApplicationIdPBImpl() {
this.builder = ApplicationIdProto.newBuilder();
  }
 
  public ApplicationIdProto getProto() {
assert (proto != null);
return proto;
  }
 
  @Override
  public int getId() {
assert (proto != null);
return proto.getId();
  }
 
  @Override
  protected void setId(int id) {
assert (builder != null);
builder.setId((id));
  }

  @Override
  public long getClusterTimestamp() {
assert(proto != null);
return proto.getClusterTimestamp();
  }
 
  @Override
  protected void setClusterTimestamp(long clusterTimestamp) {
assert(builder != null);
builder.setClusterTimestamp((clusterTimestamp));
  }

  @Override
  protected void build() {
assert(builder != null);
proto = builder.build();
builder = null;
  }
{code}

 Make IDs read only
 --

 Key: YARN-528
 URL: https://issues.apache.org/jira/browse/YARN-528
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Robert Joseph Evans
Assignee: Robert Joseph Evans
 Attachments: y528_AppIdPart_01_Refactor.txt, 
 y528_AppIdPart_02_AppIdChanges.txt, y528_AppIdPart_03_fixUsage.txt, 
 y528_ApplicationIdComplete_WIP.txt, YARN-528.txt, YARN-528.txt


 I really would like to rip out most if not all of the abstraction layer that 
 sits in-between Protocol Buffers, the RPC, and the actual user code.  We have 
 no plans to support any other serialization type, and the abstraction layer 
 just, makes it more difficult to change protocols, makes changing them more 
 error prone, and slows down the objects themselves.  
 Completely doing that is a lot of work.  This JIRA is a first step towards 
 that.  It makes the various ID objects immutable.  If this patch is wel 
 received I will try to go through other objects/classes of objects and update 
 them in a similar way.
 This is probably the last time we will be able to make a change like this 
 before 2.0 stabilizes and YARN APIs will not be able to be changed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-528) Make IDs read only

2013-04-29 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13644784#comment-13644784
 ] 

Robert Joseph Evans commented on YARN-528:
--

Thanks for doing this Sid.  I started pulling on the string and there was just 
too much involved, so I had to stop.

 Make IDs read only
 --

 Key: YARN-528
 URL: https://issues.apache.org/jira/browse/YARN-528
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Robert Joseph Evans
Assignee: Robert Joseph Evans
 Attachments: y528_AppIdPart_01_Refactor.txt, 
 y528_AppIdPart_02_AppIdChanges.txt, y528_AppIdPart_03_fixUsage.txt, 
 y528_ApplicationIdComplete_WIP.txt, YARN-528.txt, YARN-528.txt


 I really would like to rip out most if not all of the abstraction layer that 
 sits in-between Protocol Buffers, the RPC, and the actual user code.  We have 
 no plans to support any other serialization type, and the abstraction layer 
 just, makes it more difficult to change protocols, makes changing them more 
 error prone, and slows down the objects themselves.  
 Completely doing that is a lot of work.  This JIRA is a first step towards 
 that.  It makes the various ID objects immutable.  If this patch is wel 
 received I will try to go through other objects/classes of objects and update 
 them in a similar way.
 This is probably the last time we will be able to make a change like this 
 before 2.0 stabilizes and YARN APIs will not be able to be changed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-528) Make IDs read only

2013-04-03 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13621023#comment-13621023
 ] 

Robert Joseph Evans commented on YARN-528:
--

OK, I understand now.

I will try to find some time to play around with getting the AM ID to not have 
a wrapper at all.

 Make IDs read only
 --

 Key: YARN-528
 URL: https://issues.apache.org/jira/browse/YARN-528
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Robert Joseph Evans
Assignee: Robert Joseph Evans
 Attachments: YARN-528.txt, YARN-528.txt


 I really would like to rip out most if not all of the abstraction layer that 
 sits in-between Protocol Buffers, the RPC, and the actual user code.  We have 
 no plans to support any other serialization type, and the abstraction layer 
 just, makes it more difficult to change protocols, makes changing them more 
 error prone, and slows down the objects themselves.  
 Completely doing that is a lot of work.  This JIRA is a first step towards 
 that.  It makes the various ID objects immutable.  If this patch is wel 
 received I will try to go through other objects/classes of objects and update 
 them in a similar way.
 This is probably the last time we will be able to make a change like this 
 before 2.0 stabilizes and YARN APIs will not be able to be changed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-528) Make IDs read only

2013-04-02 Thread Robert Joseph Evans (JIRA)
Robert Joseph Evans created YARN-528:


 Summary: Make IDs read only
 Key: YARN-528
 URL: https://issues.apache.org/jira/browse/YARN-528
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Robert Joseph Evans


I really would like to rip out most if not all of the abstraction layer that 
sits in-between Protocol Buffers, the RPC, and the actual user code.  We have 
no plans to support any other serialization type, and the abstraction layer 
just, makes it more difficult to change protocols, makes changing them more 
error prone, and slows down the objects themselves.  

Completely doing that is a lot of work.  This JIRA is a first step towards 
that.  It makes the various ID objects immutable.  If this patch is wel 
received I will try to go through other objects/classes of objects and update 
them in a similar way.

This is probably the last time we will be able to make a change like this 
before 2.0 stabilizes and YARN APIs will not be able to be changed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-528) Make IDs read only

2013-04-02 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated YARN-528:
-

Attachment: YARN-528.txt

This patch contains changes to both Map/Reduce IDs as well as YARN APIs.  I 
don't really want to split them up right now, but I am happy to file a separate 
JIRA for tracking purposes if the community decides this is a direction we want 
to go in.

 Make IDs read only
 --

 Key: YARN-528
 URL: https://issues.apache.org/jira/browse/YARN-528
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Robert Joseph Evans
 Attachments: YARN-528.txt


 I really would like to rip out most if not all of the abstraction layer that 
 sits in-between Protocol Buffers, the RPC, and the actual user code.  We have 
 no plans to support any other serialization type, and the abstraction layer 
 just, makes it more difficult to change protocols, makes changing them more 
 error prone, and slows down the objects themselves.  
 Completely doing that is a lot of work.  This JIRA is a first step towards 
 that.  It makes the various ID objects immutable.  If this patch is wel 
 received I will try to go through other objects/classes of objects and update 
 them in a similar way.
 This is probably the last time we will be able to make a change like this 
 before 2.0 stabilizes and YARN APIs will not be able to be changed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (YARN-528) Make IDs read only

2013-04-02 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans reassigned YARN-528:


Assignee: Robert Joseph Evans

 Make IDs read only
 --

 Key: YARN-528
 URL: https://issues.apache.org/jira/browse/YARN-528
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Robert Joseph Evans
Assignee: Robert Joseph Evans
 Attachments: YARN-528.txt


 I really would like to rip out most if not all of the abstraction layer that 
 sits in-between Protocol Buffers, the RPC, and the actual user code.  We have 
 no plans to support any other serialization type, and the abstraction layer 
 just, makes it more difficult to change protocols, makes changing them more 
 error prone, and slows down the objects themselves.  
 Completely doing that is a lot of work.  This JIRA is a first step towards 
 that.  It makes the various ID objects immutable.  If this patch is wel 
 received I will try to go through other objects/classes of objects and update 
 them in a similar way.
 This is probably the last time we will be able to make a change like this 
 before 2.0 stabilizes and YARN APIs will not be able to be changed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-528) Make IDs read only

2013-04-02 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13619911#comment-13619911
 ] 

Robert Joseph Evans commented on YARN-528:
--

The build failed, because it needs to be upmerged, again :(

 Make IDs read only
 --

 Key: YARN-528
 URL: https://issues.apache.org/jira/browse/YARN-528
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Robert Joseph Evans
Assignee: Robert Joseph Evans
 Attachments: YARN-528.txt


 I really would like to rip out most if not all of the abstraction layer that 
 sits in-between Protocol Buffers, the RPC, and the actual user code.  We have 
 no plans to support any other serialization type, and the abstraction layer 
 just, makes it more difficult to change protocols, makes changing them more 
 error prone, and slows down the objects themselves.  
 Completely doing that is a lot of work.  This JIRA is a first step towards 
 that.  It makes the various ID objects immutable.  If this patch is wel 
 received I will try to go through other objects/classes of objects and update 
 them in a similar way.
 This is probably the last time we will be able to make a change like this 
 before 2.0 stabilizes and YARN APIs will not be able to be changed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-528) Make IDs read only

2013-04-02 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated YARN-528:
-

Attachment: YARN-528.txt

Upmerged

 Make IDs read only
 --

 Key: YARN-528
 URL: https://issues.apache.org/jira/browse/YARN-528
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Robert Joseph Evans
Assignee: Robert Joseph Evans
 Attachments: YARN-528.txt, YARN-528.txt


 I really would like to rip out most if not all of the abstraction layer that 
 sits in-between Protocol Buffers, the RPC, and the actual user code.  We have 
 no plans to support any other serialization type, and the abstraction layer 
 just, makes it more difficult to change protocols, makes changing them more 
 error prone, and slows down the objects themselves.  
 Completely doing that is a lot of work.  This JIRA is a first step towards 
 that.  It makes the various ID objects immutable.  If this patch is wel 
 received I will try to go through other objects/classes of objects and update 
 them in a similar way.
 This is probably the last time we will be able to make a change like this 
 before 2.0 stabilizes and YARN APIs will not be able to be changed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-528) Make IDs read only

2013-04-02 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13620326#comment-13620326
 ] 

Robert Joseph Evans commented on YARN-528:
--

I am fine with splitting the MR changes from the YARN change like I said, I put 
this out here more to be a question of how do we want to go about implementing 
theses changes, and the test was more of a prototype example.

I personally lean more towards using the *Proto classes directly.  Why have 
something else wrapping it if we don't need it, even if it is a small and 
simple layer.  The only reason I did not go that route here is because of 
toString().  With the IDs we rely on having ID.toString() turn into something 
very specific that can be parsed and turned back into an instance of the 
object.  If I had the time I would trace down all places where we call toString 
on them and replace it with something else. I may just scale back the scope of 
the patch to look at ApplicationID to begin with and try to see if I can 
accomplish this.

bq. Wrapping the object which came over the wire - with a goal of creating 
fewer objects.

I really don't understand how this is supposed to work.  How do we create fewer 
objects by wrapping them in more objects? I can see us doing something like 
deduping the objects that come over the wire, but I don't see how wrapping 
works here.  

 Make IDs read only
 --

 Key: YARN-528
 URL: https://issues.apache.org/jira/browse/YARN-528
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Robert Joseph Evans
Assignee: Robert Joseph Evans
 Attachments: YARN-528.txt, YARN-528.txt


 I really would like to rip out most if not all of the abstraction layer that 
 sits in-between Protocol Buffers, the RPC, and the actual user code.  We have 
 no plans to support any other serialization type, and the abstraction layer 
 just, makes it more difficult to change protocols, makes changing them more 
 error prone, and slows down the objects themselves.  
 Completely doing that is a lot of work.  This JIRA is a first step towards 
 that.  It makes the various ID objects immutable.  If this patch is wel 
 received I will try to go through other objects/classes of objects and update 
 them in a similar way.
 This is probably the last time we will be able to make a change like this 
 before 2.0 stabilizes and YARN APIs will not be able to be changed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-515) Node Manager not getting the master key

2013-04-01 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13618784#comment-13618784
 ] 

Robert Joseph Evans commented on YARN-515:
--

Having people always test the patches in secure mode I think is a bit too high 
of a barrier for some.  I personally hate having to get it all set up to be 
able to test a patch.  Registration responses in general were broken.  The NM 
would never get a reboot signal either.  It was always the default enum value 
of everything is fine.  I am just glad that we caught it. 

 Node Manager not getting the master key
 ---

 Key: YARN-515
 URL: https://issues.apache.org/jira/browse/YARN-515
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.4-alpha
Reporter: Robert Joseph Evans
Assignee: Robert Joseph Evans
Priority: Blocker
 Fix For: 2.0.5-beta

 Attachments: YARN-515.txt


 On branch-2 the latest version I see the following on a secure cluster.
 {noformat}
 2013-03-28 19:21:06,243 [main] INFO 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Security 
 enabled - updating secret keys now
 2013-03-28 19:21:06,243 [main] INFO 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered 
 with ResourceManager as RM:PORT with total resource of me
 mory:12288, vCores:16
 2013-03-28 19:21:06,244 [main] INFO 
 org.apache.hadoop.yarn.service.AbstractService: 
 Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is 
 started.
 2013-03-28 19:21:06,245 [main] INFO 
 org.apache.hadoop.yarn.service.AbstractService: 
 Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started.
 2013-03-28 19:21:07,257 [Node Status Updater] ERROR 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
 exception in status-updater
 java.lang.NullPointerException
 at 
 org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:121)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:407)
 {noformat}
 The Null pointer exception just keeps repeating and all of the nodes end up 
 being lost.  It looks like it never gets the secret key when it registers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

2013-03-29 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617367#comment-13617367
 ] 

Robert Joseph Evans commented on YARN-112:
--

Vinod,

I just glanced at the latest patch, I did not read it in detail, so if you say 
it covers that case I trust you.

 Race in localization can cause containers to fail
 -

 Key: YARN-112
 URL: https://issues.apache.org/jira/browse/YARN-112
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 0.23.3
Reporter: Jason Lowe
Assignee: Omkar Vinit Joshi
 Attachments: yarn-112-20130325.1.patch, yarn-112-20130325.patch, 
 yarn-112-20130326.patch, yarn-112.20131503.patch


 On one of our 0.23 clusters, I saw a case of two containers, corresponding to 
 two map tasks of a MR job, that were launched almost simultaneously on the 
 same node.  It appears they both tried to localize job.jar and job.xml at the 
 same time.  One of the containers failed when it couldn't rename the 
 temporary job.jar directory to its final name because the target directory 
 wasn't empty.  Shortly afterwards the second container failed because job.xml 
 could not be found, presumably because the first container removed it when it 
 cleaned up.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-515) Node Manager not getting the master key

2013-03-29 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617373#comment-13617373
 ] 

Robert Joseph Evans commented on YARN-515:
--

This issue appears to be cause by a bug in RegisterNodeManagerResponsePBImpl.  
I think specifically it was caused by YARN-440. I have a unit test that can 
reproduce it.  Sid reviewed YARN-440 and he is a really smart guy.  I looked at 
it thinking that it must be the cause of the issue and I didn't see anything in 
there that was off.

I just think all this extra code to try and wrap the protocol buffers is just a 
bad idea.  It makes things difficult to change a .proto file, and it just slows 
things down.  But it is a lot of work to change it so I am done with my rant 
now, I'll go find a fix for the issue.

 Node Manager not getting the master key
 ---

 Key: YARN-515
 URL: https://issues.apache.org/jira/browse/YARN-515
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.4-alpha
Reporter: Robert Joseph Evans
Priority: Blocker

 On branch-2 the latest version I see the following on a secure cluster.
 {noformat}
 2013-03-28 19:21:06,243 [main] INFO 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Security 
 enabled - updating secret keys now
 2013-03-28 19:21:06,243 [main] INFO 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered 
 with ResourceManager as RM:PORT with total resource of me
 mory:12288, vCores:16
 2013-03-28 19:21:06,244 [main] INFO 
 org.apache.hadoop.yarn.service.AbstractService: 
 Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is 
 started.
 2013-03-28 19:21:06,245 [main] INFO 
 org.apache.hadoop.yarn.service.AbstractService: 
 Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started.
 2013-03-28 19:21:07,257 [Node Status Updater] ERROR 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
 exception in status-updater
 java.lang.NullPointerException
 at 
 org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:121)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:407)
 {noformat}
 The Null pointer exception just keeps repeating and all of the nodes end up 
 being lost.  It looks like it never gets the secret key when it registers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-515) Node Manager not getting the master key

2013-03-29 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617379#comment-13617379
 ] 

Robert Joseph Evans commented on YARN-515:
--

Yes the issue is that there is a rebuild flag in the PBImpl that is never set 
to true, so it will never rebuild the proto.

 Node Manager not getting the master key
 ---

 Key: YARN-515
 URL: https://issues.apache.org/jira/browse/YARN-515
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.4-alpha
Reporter: Robert Joseph Evans
Priority: Blocker

 On branch-2 the latest version I see the following on a secure cluster.
 {noformat}
 2013-03-28 19:21:06,243 [main] INFO 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Security 
 enabled - updating secret keys now
 2013-03-28 19:21:06,243 [main] INFO 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered 
 with ResourceManager as RM:PORT with total resource of me
 mory:12288, vCores:16
 2013-03-28 19:21:06,244 [main] INFO 
 org.apache.hadoop.yarn.service.AbstractService: 
 Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is 
 started.
 2013-03-28 19:21:06,245 [main] INFO 
 org.apache.hadoop.yarn.service.AbstractService: 
 Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started.
 2013-03-28 19:21:07,257 [Node Status Updater] ERROR 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
 exception in status-updater
 java.lang.NullPointerException
 at 
 org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:121)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:407)
 {noformat}
 The Null pointer exception just keeps repeating and all of the nodes end up 
 being lost.  It looks like it never gets the secret key when it registers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-515) Node Manager not getting the master key

2013-03-29 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated YARN-515:
-

Attachment: YARN-515.txt

This should fix the issue.  We forgot to tell the wrapper to rebuild after 
setting some values.

There is a unit test included that shows the problem.

 Node Manager not getting the master key
 ---

 Key: YARN-515
 URL: https://issues.apache.org/jira/browse/YARN-515
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.4-alpha
Reporter: Robert Joseph Evans
Priority: Blocker
 Attachments: YARN-515.txt


 On branch-2 the latest version I see the following on a secure cluster.
 {noformat}
 2013-03-28 19:21:06,243 [main] INFO 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Security 
 enabled - updating secret keys now
 2013-03-28 19:21:06,243 [main] INFO 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered 
 with ResourceManager as RM:PORT with total resource of me
 mory:12288, vCores:16
 2013-03-28 19:21:06,244 [main] INFO 
 org.apache.hadoop.yarn.service.AbstractService: 
 Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is 
 started.
 2013-03-28 19:21:06,245 [main] INFO 
 org.apache.hadoop.yarn.service.AbstractService: 
 Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started.
 2013-03-28 19:21:07,257 [Node Status Updater] ERROR 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
 exception in status-updater
 java.lang.NullPointerException
 at 
 org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:121)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:407)
 {noformat}
 The Null pointer exception just keeps repeating and all of the nodes end up 
 being lost.  It looks like it never gets the secret key when it registers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (YARN-515) Node Manager not getting the master key

2013-03-29 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans reassigned YARN-515:


Assignee: Robert Joseph Evans

 Node Manager not getting the master key
 ---

 Key: YARN-515
 URL: https://issues.apache.org/jira/browse/YARN-515
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.4-alpha
Reporter: Robert Joseph Evans
Assignee: Robert Joseph Evans
Priority: Blocker
 Attachments: YARN-515.txt


 On branch-2 the latest version I see the following on a secure cluster.
 {noformat}
 2013-03-28 19:21:06,243 [main] INFO 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Security 
 enabled - updating secret keys now
 2013-03-28 19:21:06,243 [main] INFO 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered 
 with ResourceManager as RM:PORT with total resource of me
 mory:12288, vCores:16
 2013-03-28 19:21:06,244 [main] INFO 
 org.apache.hadoop.yarn.service.AbstractService: 
 Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is 
 started.
 2013-03-28 19:21:06,245 [main] INFO 
 org.apache.hadoop.yarn.service.AbstractService: 
 Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started.
 2013-03-28 19:21:07,257 [Node Status Updater] ERROR 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
 exception in status-updater
 java.lang.NullPointerException
 at 
 org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:121)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:407)
 {noformat}
 The Null pointer exception just keeps repeating and all of the nodes end up 
 being lost.  It looks like it never gets the secret key when it registers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

2013-03-28 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616325#comment-13616325
 ] 

Robert Joseph Evans commented on YARN-112:
--

I agree that scale exposes races but, still the underlying problem is that we 
want to create a new unique directory.  This seems very simple.

{code}
File uniqueDir = null;
do {
  uniqueDir = new File(baseDir, String.valueOf(rand.nextLong()));
} while (!uniqueDir.mkdir());
{code}

I don't see why we are going through all of this complexity, simply because a 
FileContext API is broken.  Playing games to make the race less likely is fine. 
 But ultimately we still have to handle the race.

 Race in localization can cause containers to fail
 -

 Key: YARN-112
 URL: https://issues.apache.org/jira/browse/YARN-112
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 0.23.3
Reporter: Jason Lowe
Assignee: Omkar Vinit Joshi
 Attachments: yarn-112-20130325.1.patch, yarn-112-20130325.patch, 
 yarn-112-20130326.patch, yarn-112.20131503.patch


 On one of our 0.23 clusters, I saw a case of two containers, corresponding to 
 two map tasks of a MR job, that were launched almost simultaneously on the 
 same node.  It appears they both tried to localize job.jar and job.xml at the 
 same time.  One of the containers failed when it couldn't rename the 
 temporary job.jar directory to its final name because the target directory 
 wasn't empty.  Shortly afterwards the second container failed because job.xml 
 could not be found, presumably because the first container removed it when it 
 cleaned up.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

2013-03-28 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616327#comment-13616327
 ] 

Robert Joseph Evans commented on YARN-112:
--

Oh and the latest patch using a unique number will not always work, because the 
same code is used from different processes on the same box.  We would have to 
have a way to guarantee uniqueness between the different processes.  
CurrentTimeMillis helps but still could result in a race.

 Race in localization can cause containers to fail
 -

 Key: YARN-112
 URL: https://issues.apache.org/jira/browse/YARN-112
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 0.23.3
Reporter: Jason Lowe
Assignee: Omkar Vinit Joshi
 Attachments: yarn-112-20130325.1.patch, yarn-112-20130325.patch, 
 yarn-112-20130326.patch, yarn-112.20131503.patch


 On one of our 0.23 clusters, I saw a case of two containers, corresponding to 
 two map tasks of a MR job, that were launched almost simultaneously on the 
 same node.  It appears they both tried to localize job.jar and job.xml at the 
 same time.  One of the containers failed when it couldn't rename the 
 temporary job.jar directory to its final name because the target directory 
 wasn't empty.  Shortly afterwards the second container failed because job.xml 
 could not be found, presumably because the first container removed it when it 
 cleaned up.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-515) Node Manager not getting the master key

2013-03-28 Thread Robert Joseph Evans (JIRA)
Robert Joseph Evans created YARN-515:


 Summary: Node Manager not getting the master key
 Key: YARN-515
 URL: https://issues.apache.org/jira/browse/YARN-515
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.4-alpha
Reporter: Robert Joseph Evans
Priority: Blocker


On branch-2 the latest version I see the following on a secure cluster.

{noformat}
2013-03-28 19:21:06,243 [main] INFO 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Security 
enabled - updating secret keys now
2013-03-28 19:21:06,243 [main] INFO 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered 
with ResourceManager as RM:PORT with total resource of me
mory:12288, vCores:16
2013-03-28 19:21:06,244 [main] INFO 
org.apache.hadoop.yarn.service.AbstractService: 
Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is 
started.
2013-03-28 19:21:06,245 [main] INFO 
org.apache.hadoop.yarn.service.AbstractService: 
Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started.
2013-03-28 19:21:07,257 [Node Status Updater] ERROR 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
exception in status-updater
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:121)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:407)
{noformat}

The Null pointer exception just keeps repeating and all of the nodes end up 
being lost.  It looks like it never gets the secret key when it registers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-515) Node Manager not getting the master key

2013-03-28 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616628#comment-13616628
 ] 

Robert Joseph Evans commented on YARN-515:
--

OK It actually looks like the NM is trying to get the Master Key, before it 
ever has set it, which is causing the NPE.

 Node Manager not getting the master key
 ---

 Key: YARN-515
 URL: https://issues.apache.org/jira/browse/YARN-515
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.4-alpha
Reporter: Robert Joseph Evans
Priority: Blocker

 On branch-2 the latest version I see the following on a secure cluster.
 {noformat}
 2013-03-28 19:21:06,243 [main] INFO 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Security 
 enabled - updating secret keys now
 2013-03-28 19:21:06,243 [main] INFO 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered 
 with ResourceManager as RM:PORT with total resource of me
 mory:12288, vCores:16
 2013-03-28 19:21:06,244 [main] INFO 
 org.apache.hadoop.yarn.service.AbstractService: 
 Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is 
 started.
 2013-03-28 19:21:06,245 [main] INFO 
 org.apache.hadoop.yarn.service.AbstractService: 
 Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started.
 2013-03-28 19:21:07,257 [Node Status Updater] ERROR 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
 exception in status-updater
 java.lang.NullPointerException
 at 
 org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:121)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:407)
 {noformat}
 The Null pointer exception just keeps repeating and all of the nodes end up 
 being lost.  It looks like it never gets the secret key when it registers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-515) Node Manager not getting the master key

2013-03-28 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616698#comment-13616698
 ] 

Robert Joseph Evans commented on YARN-515:
--

This is really odd.  I put in logging in the ResourceTrackerService and in the 
NodeStatusUpdaterImpl.  The RM sets the secret key in the 
RegisterNodeManagerResponse, but the NM only sees a null come out for it.  
Because of that the heartbeat always fails with the NPE trying to read 
something that was never set.

 Node Manager not getting the master key
 ---

 Key: YARN-515
 URL: https://issues.apache.org/jira/browse/YARN-515
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.4-alpha
Reporter: Robert Joseph Evans
Priority: Blocker

 On branch-2 the latest version I see the following on a secure cluster.
 {noformat}
 2013-03-28 19:21:06,243 [main] INFO 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Security 
 enabled - updating secret keys now
 2013-03-28 19:21:06,243 [main] INFO 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered 
 with ResourceManager as RM:PORT with total resource of me
 mory:12288, vCores:16
 2013-03-28 19:21:06,244 [main] INFO 
 org.apache.hadoop.yarn.service.AbstractService: 
 Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is 
 started.
 2013-03-28 19:21:06,245 [main] INFO 
 org.apache.hadoop.yarn.service.AbstractService: 
 Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started.
 2013-03-28 19:21:07,257 [Node Status Updater] ERROR 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
 exception in status-updater
 java.lang.NullPointerException
 at 
 org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:121)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:407)
 {noformat}
 The Null pointer exception just keeps repeating and all of the nodes end up 
 being lost.  It looks like it never gets the secret key when it registers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-378) ApplicationMaster retry times should be set by Client

2013-03-26 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13614036#comment-13614036
 ] 

Robert Joseph Evans commented on YARN-378:
--

Hitesh and Vinod,

It is not a big deal. I realized that both were going in, and I am glad that 
this is ready and has gone in.  It is a great feature. It just would have been 
nice to either commit them at the same time, or give a heads up on the mailing 
list that you were going to break the build for a little while.

 ApplicationMaster retry times should be set by Client
 -

 Key: YARN-378
 URL: https://issues.apache.org/jira/browse/YARN-378
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client, resourcemanager
 Environment: suse
Reporter: xieguiming
Assignee: Zhijie Shen
  Labels: usability
 Fix For: 2.0.5-beta

 Attachments: YARN-378_10.patch, YARN-378_11.patch, YARN-378_1.patch, 
 YARN-378_2.patch, YARN-378_3.patch, YARN-378_4.patch, YARN-378_5.patch, 
 YARN-378_6.patch, YARN-378_6.patch, YARN-378_7.patch, YARN-378_8.patch, 
 YARN-378_9.patch, YARN_378-final-commit.patch, 
 YARN-378_MAPREDUCE-5062.2.patch, YARN-378_MAPREDUCE-5062.patch


 We should support that different client or user have different 
 ApplicationMaster retry times. It also say that 
 yarn.resourcemanager.am.max-retries should be set by client. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

2013-03-26 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13614402#comment-13614402
 ] 

Robert Joseph Evans commented on YARN-112:
--

I am not really sure that we fixed the underlying issue.  

{code}files.rename(dst_work, destDirPath, Rename.OVERWRITE);{code}

threw an exception because there was something else in that directory already, 
but files.mkdir(destDirPath, cachePerms, false) is supposed to throw a 
FileAlreadyExistsException if the directory already exists.  

http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileContext.html#mkdir%28org.apache.hadoop.fs.Path,%20org.apache.hadoop.fs.permission.FsPermission,%20boolean%29

files.rename should never get into this situation if files.rename threw the 
exception when it was supposed to.

I tested this and 
{code}
FileContext lfc = FileContext.getLocalFSFileContext(new Configuration());
Path p = new Path(/tmp/bobby.12345);
FsPermission cachePerms = new FsPermission((short) 0755);
lfc.mkdir(p, cachePerms, false);
lfc.mkdir(p, cachePerms, false);
{code}

never throws an exception.  We first need to address the bug in FileContext, 
and then we can look at how we can make FSDownload deal with mkdir throwing an 
exception, or whatever the fix ends up being.

I filed HADOOP-9438 for this.

If the fix ends up being that we do not support throwing the exception in 
FileContext, then your current solution looks OK.

I also have a hard time believing that we are getting random collisions on a 
long value that should be fairly uniformly distributed.  We need to guard 
against it either way and I suppose it is possible, but if I remember correctly 
we were seeing a significant number of these errors and my gut tells me that 
there is either something very wrong with Random, or there is something else 
also going on here.

 Race in localization can cause containers to fail
 -

 Key: YARN-112
 URL: https://issues.apache.org/jira/browse/YARN-112
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 0.23.3
Reporter: Jason Lowe
Assignee: omkar vinit joshi
 Attachments: yarn-112-20130325.1.patch, yarn-112-20130325.patch, 
 yarn-112.20131503.patch


 On one of our 0.23 clusters, I saw a case of two containers, corresponding to 
 two map tasks of a MR job, that were launched almost simultaneously on the 
 same node.  It appears they both tried to localize job.jar and job.xml at the 
 same time.  One of the containers failed when it couldn't rename the 
 temporary job.jar directory to its final name because the target directory 
 wasn't empty.  Shortly afterwards the second container failed because job.xml 
 could not be found, presumably because the first container removed it when it 
 cleaned up.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Reopened] (YARN-378) ApplicationMaster retry times should be set by Client

2013-03-25 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans reopened YARN-378:
--


Looks like something was missed

{noformat}
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:2.5.1:compile (default-compile) 
on project hadoop-mapreduce-client-app: Compilation failure: Compilation 
failure:
[ERROR] 
/home/evans/src/commit/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java:[227,52]
 cannot find symbol
[ERROR] symbol  : variable RM_AM_MAX_RETRIES
[ERROR] location: class org.apache.hadoop.yarn.conf.YarnConfiguration
[ERROR] 
/home/evans/src/commit/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java:[228,25]
 cannot find symbol
[ERROR] symbol  : variable DEFAULT_RM_AM_MAX_RETRIES
[ERROR] location: class org.apache.hadoop.yarn.conf.YarnConfiguration
{noformat}

Please fix this ASAP.

 ApplicationMaster retry times should be set by Client
 -

 Key: YARN-378
 URL: https://issues.apache.org/jira/browse/YARN-378
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client, resourcemanager
 Environment: suse
Reporter: xieguiming
Assignee: Zhijie Shen
  Labels: usability
 Fix For: 2.0.5-beta

 Attachments: YARN-378_10.patch, YARN-378_11.patch, YARN-378_1.patch, 
 YARN-378_2.patch, YARN-378_3.patch, YARN-378_4.patch, YARN-378_5.patch, 
 YARN-378_6.patch, YARN-378_6.patch, YARN-378_7.patch, YARN-378_8.patch, 
 YARN-378_9.patch, YARN_378-final-commit.patch, 
 YARN-378_MAPREDUCE-5062.2.patch, YARN-378_MAPREDUCE-5062.patch


 We should support that different client or user have different 
 ApplicationMaster retry times. It also say that 
 yarn.resourcemanager.am.max-retries should be set by client. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-109) .tmp file is not deleted for localized archives

2013-03-21 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13608919#comment-13608919
 ] 

Robert Joseph Evans commented on YARN-109:
--

The findbugs is complaining that you are ignoring the return value of the 
delete call.  It should not be a problem so either use the return value to log 
a warning when it fails or update the findbugs filter to filter out the error.

The -1 for the test timeouts is caused by a bug in the script used to detect 
these, so you can either ignore it, or add a timeout to any @Test that appears 
in the patch file, including the ones you didn't add :(.

In the test, please uncomment the lines to cleanup after the test.  Are they 
causing a problem for the test to pass? or was it just for debugging?

Also I personally would prefer to have a few small jar/tar/zip files checked 
into the repository instead of generating them on they fly for the test.  It 
will speed up the test and have less dependencies on the system being set up 
with the exact commands, i.e. bash for windows support.  Although if you don't 
feel like changing it I am fine with that too, most of those commands are used 
by FSDownload already so it is not that critical.

 .tmp file is not deleted for localized archives
 ---

 Key: YARN-109
 URL: https://issues.apache.org/jira/browse/YARN-109
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 0.23.3, 2.0.0-alpha
Reporter: Jason Lowe
Assignee: Mayank Bansal
 Attachments: YARN-109-trunk-1.patch, YARN-109-trunk-2.patch, 
 YARN-109-trunk-3.patch, YARN-109-trunk.patch


 When archives are localized they are initially created as a .tmp file and 
 unpacked from that file.  However the .tmp file is not deleted afterwards.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-109) .tmp file is not deleted for localized archives

2013-03-21 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13609509#comment-13609509
 ] 

Robert Joseph Evans commented on YARN-109:
--

That is fine with me.  My concern was mostly with Windows support. tar, zip, 
jar, etc. should be there, but bash may not be. So if you want to file a new 
JIRA that is fine, if not you can just wait for windows support to be merged in 
and see if it breaks.

 .tmp file is not deleted for localized archives
 ---

 Key: YARN-109
 URL: https://issues.apache.org/jira/browse/YARN-109
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 0.23.3, 2.0.0-alpha
Reporter: Jason Lowe
Assignee: Mayank Bansal
 Attachments: YARN-109-trunk-1.patch, YARN-109-trunk-2.patch, 
 YARN-109-trunk-3.patch, YARN-109-trunk-4.patch, YARN-109-trunk.patch


 When archives are localized they are initially created as a .tmp file and 
 unpacked from that file.  However the .tmp file is not deleted afterwards.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-378) ApplicationMaster retry times should be set by Client

2013-03-14 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13602333#comment-13602333
 ] 

Robert Joseph Evans commented on YARN-378:
--

Using the environment variables works for other applications too.  That is the 
only way to get some pieces of critical information that are needed for 
registration with the RM.  

On Windows there are limits 

http://msdn.microsoft.com/en-us/library/windows/desktop/ms682653%28v=vs.85%29.aspx

But they should not cause too much of an issue on Windows Server 2008 and above.

I would prefer for us to only return the information to the AM one way.  Either 
though thrift or through the environment variable just so there is less 
confusion, but I am not adamant about it.



 ApplicationMaster retry times should be set by Client
 -

 Key: YARN-378
 URL: https://issues.apache.org/jira/browse/YARN-378
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client, resourcemanager
 Environment: suse
Reporter: xieguiming
Assignee: Zhijie Shen
  Labels: usability
 Attachments: YARN-378_1.patch, YARN-378_2.patch, YARN-378_3.patch, 
 YARN-378_4.patch, YARN-378_5.patch, YARN-378_6.patch, YARN-378_6.patch, 
 YARN-378_7.patch


 We should support that different client or user have different 
 ApplicationMaster retry times. It also say that 
 yarn.resourcemanager.am.max-retries should be set by client. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-378) ApplicationMaster retry times should be set by Client

2013-03-14 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13602341#comment-13602341
 ] 

Robert Joseph Evans commented on YARN-378:
--

Looking at the code too I am fine with renaming retries to attempts.  But we 
need to mark this JIRA as an incompatible change or put in a deprecated config 
mapping.  We are early enough in YARN that deprecating it seems like a waste.

 ApplicationMaster retry times should be set by Client
 -

 Key: YARN-378
 URL: https://issues.apache.org/jira/browse/YARN-378
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client, resourcemanager
 Environment: suse
Reporter: xieguiming
Assignee: Zhijie Shen
  Labels: usability
 Attachments: YARN-378_1.patch, YARN-378_2.patch, YARN-378_3.patch, 
 YARN-378_4.patch, YARN-378_5.patch, YARN-378_6.patch, YARN-378_6.patch, 
 YARN-378_7.patch


 We should support that different client or user have different 
 ApplicationMaster retry times. It also say that 
 yarn.resourcemanager.am.max-retries should be set by client. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-226) Log aggregation should not assume an AppMaster will have containerId 1

2013-03-14 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13602661#comment-13602661
 ] 

Robert Joseph Evans commented on YARN-226:
--

Big means amount of memory/CPU relative to the minimum allocation size.  For 
example you ask for a 4 GB container with a min allocation size of 500MB.

 Log aggregation should not assume an AppMaster will have containerId 1
 --

 Key: YARN-226
 URL: https://issues.apache.org/jira/browse/YARN-226
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Siddharth Seth

 In case of reservcations, etc - AppMasters may not get container id 1. We 
 likely need additional info in the CLC / tokens indicating whether a 
 container is an AM or not.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-378) ApplicationMaster retry times should be set by Client

2013-03-13 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13601237#comment-13601237
 ] 

Robert Joseph Evans commented on YARN-378:
--

The patch looks good to me. The only problem I have is with how we are 
informing the AM of the maximum number of retires that it has.  This should 
work, but it is going to require a lot of changes to the MR AM to use it.  
Right now the number is used in the init of MRAppMaster, but we will not get 
that information until start() is called and we register with the RM.  I would 
much rather see a new environment variable added that can hold this 
information, because it makes MAPREDUCE-5062 much simpler.  But I am OK with 
the way it currently is.

 ApplicationMaster retry times should be set by Client
 -

 Key: YARN-378
 URL: https://issues.apache.org/jira/browse/YARN-378
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client, resourcemanager
 Environment: suse
Reporter: xieguiming
Assignee: Zhijie Shen
  Labels: usability
 Attachments: YARN-378_1.patch, YARN-378_2.patch, YARN-378_3.patch, 
 YARN-378_4.patch, YARN-378_5.patch, YARN-378_6.patch, YARN-378_6.patch


 We should support that different client or user have different 
 ApplicationMaster retry times. It also say that 
 yarn.resourcemanager.am.max-retries should be set by client. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-237) Refreshing the RM page forgets how many rows I had in my Datatables

2013-03-11 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13599236#comment-13599236
 ] 

Robert Joseph Evans commented on YARN-237:
--

Sorry to keep adding more things in here, but JQuery.java is a generic part of 
YARN.  It, in theory, can be used by others not just Map/Reduce and YARN.  
Encoding in a special case for the tasks table is not acceptable.  You should 
be able to get the same functionality by switching to the DATATABLES_SELECTOR 
for those tables.

We also need to address the find bugs issues. You are dereferencing type to 
create the ID of the tasks and type could be null, although in practice it 
should never be.  Also there is no need to call toString() on type when using + 
with another string.  This may fix the find bugs issues too, although it would 
not be super clean.

 Refreshing the RM page forgets how many rows I had in my Datatables
 ---

 Key: YARN-237
 URL: https://issues.apache.org/jira/browse/YARN-237
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.0.2-alpha, 0.23.4, 3.0.0
Reporter: Ravi Prakash
Assignee: jian he
  Labels: usability
 Attachments: YARN-237.patch, YARN-237.v2.patch, YARN-237.v3.patch


 If I choose a 100 rows, and then refresh the page, DataTables goes back to 
 showing me 20 rows.
 This user preference should be stored in a cookie.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-378) ApplicationMaster retry times should be set by Client

2013-03-11 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13599240#comment-13599240
 ] 

Robert Joseph Evans commented on YARN-378:
--

I am perfectly fine with that.  It seems like more overhead, but I am fine 
either way.

 ApplicationMaster retry times should be set by Client
 -

 Key: YARN-378
 URL: https://issues.apache.org/jira/browse/YARN-378
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client, resourcemanager
 Environment: suse
Reporter: xieguiming
Assignee: Zhijie Shen
  Labels: usability
 Attachments: YARN-378_1.patch, YARN-378_2.patch, YARN-378_3.patch, 
 YARN-378_4.patch


 We should support that different client or user have different 
 ApplicationMaster retry times. It also say that 
 yarn.resourcemanager.am.max-retries should be set by client. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-378) ApplicationMaster retry times should be set by Client

2013-03-08 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13597428#comment-13597428
 ] 

Robert Joseph Evans commented on YARN-378:
--

From a quick look it seems OK.

It would be nice for isLastAMRetry to remain private and have a getter.  That 
way it prevents unintended writes to it.

I also don't really like having the AM guess how many retries there will be.  I 
thought it was ugly when I add that code, and now that it logic is more complex 
I really know why.  Could you please file a JIRA so the RM and inform the AM 
how many AM retires it has, or if you have time just add it in as part of this 
JIRA.  That way the AM will never have to adjust its logic again.

Also could we make the code a little more robust.  In both the AM and the RM 
instead of checking for just -1 could you check for anything that is = 0.  If 
anyone sets the retries to be that small it should use the default. I am not 
sure what having a max retries of -2 means and what it would do to an 
application.

 ApplicationMaster retry times should be set by Client
 -

 Key: YARN-378
 URL: https://issues.apache.org/jira/browse/YARN-378
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client, resourcemanager
 Environment: suse
Reporter: xieguiming
Assignee: Zhijie Shen
  Labels: usability
 Attachments: YARN-378_1.patch


 We should support that different client or user have different 
 ApplicationMaster retry times. It also say that 
 yarn.resourcemanager.am.max-retries should be set by client. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-237) Refreshing the RM page forgets how many rows I had in my Datatables

2013-03-08 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13597508#comment-13597508
 ] 

Robert Joseph Evans commented on YARN-237:
--

I have a few more comments.

It is great that you fixed the issues, but now we have a leak in the browser.  
You have tied the table ID to the localStorage key, and then for a couple of 
tables you have included the jobID in the table ID.  This means that new 
entries will be placed in the localStorage for every job page I visit and those 
entires will never be deleted.

I see two ways to fix this.  We can ether change it over to be sessionStorage 
instead of localStorage, because it goes away after the session ends.  Or we 
can remove the jobID from the table names. If we remove the jobID corresponding 
tables on different pages will share a single state.  If we use sessionStorage 
the data will only be saved for a given browser session.  If I close the 
browser and reopen it the state will be lost. I tend to think the first one is 
preferable, but that is just me. 

Also could you please update the code format to meet our guidelines.  There are 
a few places where it does not meet the guidelines.

 Refreshing the RM page forgets how many rows I had in my Datatables
 ---

 Key: YARN-237
 URL: https://issues.apache.org/jira/browse/YARN-237
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.0.2-alpha, 0.23.4, 3.0.0
Reporter: Ravi Prakash
Assignee: jian he
  Labels: usability
 Attachments: YARN-237.patch, YARN-237.v2.patch


 If I choose a 100 rows, and then refresh the page, DataTables goes back to 
 showing me 20 rows.
 This user preference should be stored in a cookie.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-456) allow OS scheduling priority of NM to be different than the containers it launches for Windows

2013-03-07 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated YARN-456:
-

Summary: allow OS scheduling priority of NM to be different than the 
containers it launches for Windows  (was: Add similar support for Windows)

 allow OS scheduling priority of NM to be different than the containers it 
 launches for Windows
 --

 Key: YARN-456
 URL: https://issues.apache.org/jira/browse/YARN-456
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Bikas Saha
Assignee: Bikas Saha



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-378) ApplicationMaster retry times should be set by Client

2013-03-05 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13593736#comment-13593736
 ] 

Robert Joseph Evans commented on YARN-378:
--

I don't really want the client config to be called 
yarn.resourcemanager.am.max-retries.  That is a YARN resource manager config, 
and is intended to be used by the RM, not by the map reduce client.  I would 
much rather have a mapreduce.am.max-retries that the MR client reads and uses 
to populate the ApplicationSubmissionContext.

 ApplicationMaster retry times should be set by Client
 -

 Key: YARN-378
 URL: https://issues.apache.org/jira/browse/YARN-378
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client, resourcemanager
 Environment: suse
Reporter: xieguiming
Assignee: Zhijie Shen
  Labels: usability

 We should support that different client or user have different 
 ApplicationMaster retry times. It also say that 
 yarn.resourcemanager.am.max-retries should be set by client. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-378) ApplicationMaster retry times should be set by Client

2013-03-05 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13593863#comment-13593863
 ] 

Robert Joseph Evans commented on YARN-378:
--

But the config *is* specific to mapreduce.  Every other application client will 
have to provide their own way of putting that value into the container launch 
context.  It could be through a hadoop config or it could be through something 
else entirely.

I am in the process of porting Storm to run on top of YARN.  I don't see us 
ever using a Hadoop Configuration in the client except the default one to be 
able to access HDFS.  Storm has its own configuration object and for better 
integration with Storm I would set up a Storm conf for that, although in 
reality I would probably just never set it because I never want it to go down 
entirely, and that is how I would get the maximum number of retries allowed by 
the cluster.

I can see other applications that already exist and are being ported to run on 
YARN, like OpenMPI, to want to set that config in a way that is consistent with 
their current configuration and not in a Hadoop specific way.

 ApplicationMaster retry times should be set by Client
 -

 Key: YARN-378
 URL: https://issues.apache.org/jira/browse/YARN-378
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client, resourcemanager
 Environment: suse
Reporter: xieguiming
Assignee: Zhijie Shen
  Labels: usability

 We should support that different client or user have different 
 ApplicationMaster retry times. It also say that 
 yarn.resourcemanager.am.max-retries should be set by client. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-237) Refreshing the RM page forgets how many rows I had in my Datatables

2013-03-01 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13590842#comment-13590842
 ] 

Robert Joseph Evans commented on YARN-237:
--

localStorage is not per page it is per domain, so that means if two pages in 
the same domain have tables named the same they will share a config in local 
storage.  So for example if I run a map/reduce job and I sort the map tasks by 
elapsed time, the reduce tasks will also be sorted by elapsed time when I go to 
their page.  The good news is that if I sort the reduces by an ID that the maps 
don't know about the maps page just ignores it, but it resets the sorting for 
the reducers not too.

But this produces even stranger behavior in the counters page.  Because the 
counters use a selector, multiple tables on the same page now all share a saved 
state.  So if I sort the counters by a column and then reload all of the 
counters are now sorted by that column.

I am not positive what the best way is to fix these. We want to provide a way 
for each data table to have a unique storage key across all tables in the 
domain, even with the selector.  We don't want to use the page path or anything 
like that because that will create a new group of settings per page, and that 
would result in filling up their localStorage, unless of course we used the 
sessionStorage instead.  But using sessionStorage would mean that each time we 
opened up a new session we would have to re-do the settings.  sessionStorage 
also does not fix the issue with counters and the selector where we have 
multiple tables all sharing a single ID and single storage.

 Refreshing the RM page forgets how many rows I had in my Datatables
 ---

 Key: YARN-237
 URL: https://issues.apache.org/jira/browse/YARN-237
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.0.2-alpha, 0.23.4, 3.0.0
Reporter: Ravi Prakash
Assignee: jian he
  Labels: usability
 Attachments: YARN-237.patch


 If I choose a 100 rows, and then refresh the page, DataTables goes back to 
 showing me 20 rows.
 This user preference should be stored in a cookie.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-237) Refreshing the RM page forgets how many rows I had in my Datatables

2013-03-01 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13590999#comment-13590999
 ] 

Robert Joseph Evans commented on YARN-237:
--

The code that Jian wrote is working on the RM as well as the AM, I tested it.  
The patch changes code that is common to both of them.  The issues I mentioned 
are not theoretical.  The reason it works on the AM is because it is not using 
a cookie, instead it is using an HTML5 concept for local storage.

If we want to restrict these to just be for the RM that does seem to fix the 
issue.

 Refreshing the RM page forgets how many rows I had in my Datatables
 ---

 Key: YARN-237
 URL: https://issues.apache.org/jira/browse/YARN-237
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.0.2-alpha, 0.23.4, 3.0.0
Reporter: Ravi Prakash
Assignee: jian he
  Labels: usability
 Attachments: YARN-237.patch


 If I choose a 100 rows, and then refresh the page, DataTables goes back to 
 showing me 20 rows.
 This user preference should be stored in a cookie.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-237) Refreshing the RM page forgets how many rows I had in my Datatables

2013-02-28 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589976#comment-13589976
 ] 

Robert Joseph Evans commented on YARN-237:
--

The change looks more or less OK to me.  I am not thrilled about how we modify 
the data table's init string by looking for the first '{', but I think it is 
OK.  I just have a few concerns, and most if it deals with my lack of knowledge 
about jQuery and localStorage.  I know that localStorage is not supported on 
all browsers.  I also know that localStorage can throw a QUOTA_EXCEEDED 
exception.  What happens when we run into these situations?  Will the page stop 
working or will jQuery degrade gracefully and simply not allow us to save the 
data.  What about if the data stored in the key is not what we expect.  Will 
jQuery make the page unusable.  We currently have tables with the same name on 
different pages.  If they are not kept in sync there could be some issues with 
the data that is saved.

Which brings up another point I am also a bit concerned about the key we are 
using as part of the localStorage.  The key is the id of the data table.  I 
would prefer it if we could some how make it obvious that these values are for 
a data table, and not some other apps storage.

 Refreshing the RM page forgets how many rows I had in my Datatables
 ---

 Key: YARN-237
 URL: https://issues.apache.org/jira/browse/YARN-237
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.0.2-alpha, 0.23.4, 3.0.0
Reporter: Ravi Prakash
Assignee: jian he
  Labels: usability
 Attachments: YARN-237.patch


 If I choose a 100 rows, and then refresh the page, DataTables goes back to 
 showing me 20 rows.
 This user preference should be stored in a cookie.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-426) Failure to download a public resource on a node prevents further downloads of the resource from that node

2013-02-27 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13588424#comment-13588424
 ] 

Robert Joseph Evans commented on YARN-426:
--

The patch looks good to me. +1 I'll check it in.

 Failure to download a public resource on a node prevents further downloads of 
 the resource from that node
 -

 Key: YARN-426
 URL: https://issues.apache.org/jira/browse/YARN-426
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.3-alpha, 0.23.6
Reporter: Jason Lowe
Assignee: Jason Lowe
Priority: Critical
 Attachments: YARN-426.patch


 If the NM encounters an error while downloading a public resource, it fails 
 to empty the list of request events corresponding to the resource request in 
 {{attempts}}.  If the same public resource is subsequently requested on that 
 node, {{PublicLocalizer.addResource}} will skip the download since it will 
 mistakenly believe a download of that resource is already in progress.  At 
 that point any container that requests the public resource will just hang in 
 the {{LOCALIZING}} state.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-371) Consolidate resource requests in AM-RM heartbeat

2013-02-04 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13570329#comment-13570329
 ] 

Robert Joseph Evans commented on YARN-371:
--

Tom just like Arun said the memory usage changes based off of the size of the 
cluster vs. the size of the request.  The current approach is on the order of 
the size of the cluster where as the proposed approach is on the order of the 
number of desired containers.  If I have a 100 node cluster and I am requesting 
10 map tasks the size will be O(100 nodes + X racks + 1) possibly * 2 if 
reducers are included in it. What is more it is probably exactly the same size 
of request for 1 or even 1000 tasks.  Where as the proposed approach would 
grow without bound as the number of tasks also increased.

However, I also agree with Sandy that the current state compression is lossy 
and as such restricts what is possible in the scheduler. I would like to 
understand better what the size differences would be for various requests, both 
in memory and also over the wire.  It seems conceivable to me that if the size 
difference is not too big, especially over the wire, we could allow the 
scheduler itself to decide on its in memory representation.  This would allow 
for the Capacity Scheduler to keep its current layout and allow for others to 
experiment with more advanced scheduling options.  Different groups could 
decide which scheduler best fits their needs and workload.  If the size is 
significantly larger I would like to see hard numbers about how much 
better/worse it makes specific use cases.

I am also very concerned about adding too much complexity to the scheduler.  We 
have run into issues where the RM will get very far behind in scheduling 
because it is trying to do a lot already and eventually OOM as its event queue 
grows too large. 

I also don't want to change the scheduler protocol too much without first 
understanding how that new protocol would impact other potential scheduling 
features.  There are a number of other computing patterns that could benefit 
from specific scheduler support.  Things like gang scheduling where you need 
all of the containers at once or none of them can make any progress, or where 
you want all of the containers to be physically close to one another because 
they are very I/O intensive, but you don't really care where exactly they are.  
Or even something like HBase where you essentially want one process on every 
single node with no duplicates.  Do the proposed changes make these uses case 
trivially simple, or do they require a lot of support on the AM to implement 
them?

  

 Consolidate resource requests in AM-RM heartbeat
 

 Key: YARN-371
 URL: https://issues.apache.org/jira/browse/YARN-371
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: api, resourcemanager, scheduler
Affects Versions: 2.0.2-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza

 Each AMRM heartbeat consists of a list of resource requests. Currently, each 
 resource request consists of a container count, a resource vector, and a 
 location, which may be a node, a rack, or *. When an application wishes to 
 request a task run in multiple localtions, it must issue a request for each 
 location.  This means that for a node-local task, it must issue three 
 requests, one at the node-level, one at the rack-level, and one with * (any). 
 These requests are not linked with each other, so when a container is 
 allocated for one of them, the RM has no way of knowing which others to get 
 rid of. When a node-local container is allocated, this is handled by 
 decrementing the number of requests on that node's rack and in *. But when 
 the scheduler allocates a task with a node-local request on its rack, the 
 request on the node is left there.  This can cause delay-scheduling to try to 
 assign a container on a node that nobody cares about anymore.
 Additionally, unless I am missing something, the current model does not allow 
 requests for containers only on a specific node or specific rack. While this 
 is not a use case for MapReduce currently, it is conceivable that it might be 
 something useful to support in the future, for example to schedule 
 long-running services that persist state in a particular location, or for 
 applications that generally care less about latency than data-locality.
 Lastly, the ability to understand which requests are for the same task will 
 possibly allow future schedulers to make more intelligent scheduling 
 decisions, as well as permit a more exact understanding of request load.
 I would propose the tweak of allowing a single ResourceRequest to encapsulate 
 all the location information for a task.  So instead of just a single 
 location, a 

[jira] [Commented] (YARN-371) Resource-centric compression in AM-RM protocol limits scheduling

2013-02-04 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13570619#comment-13570619
 ] 

Robert Joseph Evans commented on YARN-371:
--

I didn't really expect them to be trivial :). So I think that there may be some 
value in having a different protocol, but we need some hard numbers to be able 
to really make an informed decision.

I would like to see the size of a request in the following table (both in 
memory size on the RM and size sent over the wire)

||nodes(down)/tasks(across)||1,000||10,000||100,000||500,000||
||100|?|?|?|?|
||1,000|?|?|?|?|
||4,000|?|?|?|?|
||10,000|?|?|?|?| 

It would also be great to see in practice how bad is the scheduling problem 
where the wrong node is sent.

 Resource-centric compression in AM-RM protocol limits scheduling
 

 Key: YARN-371
 URL: https://issues.apache.org/jira/browse/YARN-371
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: api, resourcemanager, scheduler
Affects Versions: 2.0.2-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza

 Each AMRM heartbeat consists of a list of resource requests. Currently, each 
 resource request consists of a container count, a resource vector, and a 
 location, which may be a node, a rack, or *. When an application wishes to 
 request a task run in multiple localtions, it must issue a request for each 
 location.  This means that for a node-local task, it must issue three 
 requests, one at the node-level, one at the rack-level, and one with * (any). 
 These requests are not linked with each other, so when a container is 
 allocated for one of them, the RM has no way of knowing which others to get 
 rid of. When a node-local container is allocated, this is handled by 
 decrementing the number of requests on that node's rack and in *. But when 
 the scheduler allocates a task with a node-local request on its rack, the 
 request on the node is left there.  This can cause delay-scheduling to try to 
 assign a container on a node that nobody cares about anymore.
 Additionally, unless I am missing something, the current model does not allow 
 requests for containers only on a specific node or specific rack. While this 
 is not a use case for MapReduce currently, it is conceivable that it might be 
 something useful to support in the future, for example to schedule 
 long-running services that persist state in a particular location, or for 
 applications that generally care less about latency than data-locality.
 Lastly, the ability to understand which requests are for the same task will 
 possibly allow future schedulers to make more intelligent scheduling 
 decisions, as well as permit a more exact understanding of request load.
 I would propose the tweak of allowing a single ResourceRequest to encapsulate 
 all the location information for a task.  So instead of just a single 
 location, a ResourceRequest would contain an array of locations, including 
 nodes that it would be happy with, racks that it would be happy with, and 
 possibly *.  Side effects of this change would be a reduction in the amount 
 of data that needs to be transferred in a heartbeat, as well in as the RM's 
 memory footprint, becaused what used to be different requests for the same 
 task are now able to share some common data.
 While this change breaks compatibility, if it is going to happen, it makes 
 sense to do it now, before YARN becomes beta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-225) Proxy Link in RM UI thows NPE in Secure mode

2012-12-28 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13540470#comment-13540470
 ] 

Robert Joseph Evans commented on YARN-225:
--

That does look to be the correct patch assuming that the stack trace was 
against 2.0.2 or before. Either way it is a fix that needs to go in, because I 
misread the HttpServletRequestJavadocs and missed or null if the request has 
no cookies. The fix needs to go into branch-0.23 as well. I am +1 for the fix 
and will check it in. 

 Proxy Link in RM UI thows NPE in Secure mode
 

 Key: YARN-225
 URL: https://issues.apache.org/jira/browse/YARN-225
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.2-alpha, 2.0.1-alpha, 2.0.3-alpha
Reporter: Devaraj K
Assignee: Devaraj K
Priority: Critical
 Attachments: YARN-225.patch


 {code:xml}
 java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:241)
   at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
   at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
   at 
 org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
   at 
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
   at 
 org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
   at 
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
   at 
 org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:975)
   at 
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
   at 
 org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
   at 
 org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
   at 
 org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
   at 
 org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
   at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
   at 
 org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
   at 
 org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
   at org.mortbay.jetty.Server.handle(Server.java:326)
   at 
 org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
   at 
 org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
   at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
   at 
 org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
   at 
 org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-293) Node Manager leaks LocalizerRunner object for every Container

2012-12-27 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated YARN-293:
-

 Priority: Critical  (was: Major)
 Target Version/s: 0.23.6
Affects Version/s: 0.23.3

It looks like some of the wiring is in place for this.  We just need to send an 
ABORT_LOCALIZATION event when the RM tells the NM the app is done.

 Node Manager leaks LocalizerRunner object for every Container 
 --

 Key: YARN-293
 URL: https://issues.apache.org/jira/browse/YARN-293
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.2-alpha, 0.23.3, 2.0.1-alpha
Reporter: Devaraj K
Priority: Critical

 Node Manager creates a new LocalizerRunner object for every container and 
 puts in ResourceLocalizationService.LocalizerTracker.privLocalizers map but 
 it never removes from the map.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-293) Node Manager leaks LocalizerRunner object for every Container

2012-12-27 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13540006#comment-13540006
 ] 

Robert Joseph Evans commented on YARN-293:
--

Sorry looking at it more closely it is actually per container ID, so we need to 
send an event when the container is cleaned up.

 Node Manager leaks LocalizerRunner object for every Container 
 --

 Key: YARN-293
 URL: https://issues.apache.org/jira/browse/YARN-293
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.2-alpha, 0.23.3, 2.0.1-alpha
Reporter: Devaraj K
Priority: Critical

 Node Manager creates a new LocalizerRunner object for every container and 
 puts in ResourceLocalizationService.LocalizerTracker.privLocalizers map but 
 it never removes from the map.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-293) Node Manager leaks LocalizerRunner object for every Container

2012-12-27 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated YARN-293:
-

Attachment: YARN-293-trunk.txt

This turned out to be a much smaller change then I originally thought.  I just 
added in the cleanup to a handled that was already being called for all 
containers to delete the container's resources.

 Node Manager leaks LocalizerRunner object for every Container 
 --

 Key: YARN-293
 URL: https://issues.apache.org/jira/browse/YARN-293
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.2-alpha, 0.23.3, 2.0.1-alpha
Reporter: Devaraj K
Assignee: Robert Joseph Evans
Priority: Critical
 Attachments: YARN-293-trunk.txt


 Node Manager creates a new LocalizerRunner object for every container and 
 puts in ResourceLocalizationService.LocalizerTracker.privLocalizers map but 
 it never removes from the map.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-2) Enhance CS to schedule accounting for both memory and cpu cores

2012-12-27 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13540081#comment-13540081
 ] 

Robert Joseph Evans commented on YARN-2:


I chatted with Arun off line a bit about this, and he pointed out to me that 
the APIs are marked as Evolving, I should read the patch more closely next 
time.  So I am OK with putting it in with the API as it is.  I still think that 
having a float for the API is preferable, but until we actually start using it 
in practice we will not know what the real issues are.

 Enhance CS to schedule accounting for both memory and cpu cores
 ---

 Key: YARN-2
 URL: https://issues.apache.org/jira/browse/YARN-2
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: capacityscheduler, scheduler
Reporter: Arun C Murthy
Assignee: Arun C Murthy
 Fix For: 2.0.3-alpha

 Attachments: MAPREDUCE-4327.patch, MAPREDUCE-4327.patch, 
 MAPREDUCE-4327.patch, MAPREDUCE-4327-v2.patch, MAPREDUCE-4327-v3.patch, 
 MAPREDUCE-4327-v4.patch, MAPREDUCE-4327-v5.patch, YARN-2-help.patch, 
 YARN-2.patch, YARN-2.patch, YARN-2.patch, YARN-2.patch, YARN-2.patch, 
 YARN-2.patch


 With YARN being a general purpose system, it would be useful for several 
 applications (MPI et al) to specify not just memory but also CPU (cores) for 
 their resource requirements. Thus, it would be useful to the 
 CapacityScheduler to account for both.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-276) Capacity Scheduler can hang when submit many jobs concurrently

2012-12-20 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13537168#comment-13537168
 ] 

Robert Joseph Evans commented on YARN-276:
--

I am not an expert on the scheduler code, so I have not done an in depth review 
of the patch.  My biggest concern with this is that there is no visibility in 
the UI/web services about why an app may not have been scheduled.  It would be 
great if you could update CapacitySchedulerLeafQueueInfo.java and the web page 
that uses it CapacitySchedulerPage.java.

 Capacity Scheduler can hang when submit many jobs concurrently
 --

 Key: YARN-276
 URL: https://issues.apache.org/jira/browse/YARN-276
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 3.0.0, 2.0.1-alpha
Reporter: nemon lou
 Attachments: YARN-276.patch, YARN-276.patch, YARN-276.patch, 
 YARN-276.patch, YARN-276.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 In hadoop2.0.1,When i submit many jobs concurrently at the same time,Capacity 
 scheduler can hang with most resources taken up by AM and don't have enough 
 resources for tasks.And then all applications hang there.
 The cause is that yarn.scheduler.capacity.maximum-am-resource-percent not 
 check directly.Instead ,this property only used for maxActiveApplications. 
 And maxActiveApplications is computed by minimumAllocation (not by Am 
 actually used).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-204) test coverage for org.apache.hadoop.tools

2012-11-27 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13504710#comment-13504710
 ] 

Robert Joseph Evans commented on YARN-204:
--

+1 the new changes look better. I'll check this in.

 test coverage for org.apache.hadoop.tools
 -

 Key: YARN-204
 URL: https://issues.apache.org/jira/browse/YARN-204
 Project: Hadoop YARN
  Issue Type: Bug
  Components: applications
Reporter: Aleksey Gorshkov
Assignee: Aleksey Gorshkov
 Attachments: YARN-204-branch-0.23-a.patch, 
 YARN-204-branch-0.23-b.patch, YARN-204-branch-0.23.patch, 
 YARN-204-branch-2-a.patch, YARN-204-branch-2-b.patch, 
 YARN-204-branch-2.patch, YARN-204-trunk-a.patch, YARN-204-trunk-b.patch, 
 YARN-204-trunk.patch


 Added some tests for org.apache.hadoop.tools

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-237) Refreshing the RM page forgets how many rows I had in my Datatables

2012-11-26 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13503890#comment-13503890
 ] 

Robert Joseph Evans commented on YARN-237:
--

You have to be careful with cookies because the web app proxy strips out 
cookies before sending the data to the application.

 Refreshing the RM page forgets how many rows I had in my Datatables
 ---

 Key: YARN-237
 URL: https://issues.apache.org/jira/browse/YARN-237
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.0.2-alpha, 0.23.4, 3.0.0
Reporter: Ravi Prakash

 If I choose a 100 rows, and then refresh the page, DataTables goes back to 
 showing me 20 rows.
 This user preference should be stored in a cookie.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


  1   2   >