[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15226204#comment-15226204 ] Robert Joseph Evans commented on YARN-4757: --- I am not suggesting there is a DNS based solution. I am not a DNS expert and was hopeful there could at least be a DNS based mitigation possible, but that hope has now faded. I wanted to bring it up for discussion as part of the design so that we go into this with our eyes wide open, and so that, at a minimum, documenting it with examples of how to "fix" it becomes part of the final product. That did not happen for the initial registry service, but probably should have. > [Umbrella] Simplified discovery of services via DNS mechanisms > -- > > Key: YARN-4757 > URL: https://issues.apache.org/jira/browse/YARN-4757 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Vinod Kumar Vavilapalli >Assignee: Jonathan Maron > Attachments: YARN-4757- Simplified discovery of services via DNS > mechanisms.pdf > > > [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track > all related efforts.] > In addition to completing the present story of service-registry (YARN-913), > we also need to simplify the access to the registry entries. The existing > read mechanisms of the YARN Service Registry are currently limited to a > registry specific (java) API and a REST interface. In practice, this makes it > very difficult for wiring up existing clients and services. For e.g, dynamic > configuration of dependent endpoints of a service is not easy to implement > using the present registry-read mechanisms, *without* code-changes to > existing services. > A good solution to this is to expose the registry information through a more > generic and widely used discovery mechanism: DNS. Service Discovery via DNS > uses the well-known DNS interfaces to browse the network for services. > YARN-913 in fact talked about such a DNS based mechanism but left it as a > future task. (Task) Having the registry information exposed via DNS > simplifies the life of services. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15226196#comment-15226196 ] Robert Joseph Evans commented on YARN-4757: --- I think that is through the naming convention, and the DNS configuration on the desktop in a foreign country. I imagine that if I were doing this I would set it up so that the Hadoop DNS server would handle a set of sub-domains in my company's internal DNS setup. Then when my desktop is set up, or when my laptop connects to the VPN, the DNS server that it talks to would be configured to include one that also knows about the Hadoop setup. But that is just my guess. > [Umbrella] Simplified discovery of services via DNS mechanisms > -- > > Key: YARN-4757 > URL: https://issues.apache.org/jira/browse/YARN-4757 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Vinod Kumar Vavilapalli >Assignee: Jonathan Maron > Attachments: YARN-4757- Simplified discovery of services via DNS > mechanisms.pdf > > > [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track > all related efforts.] > In addition to completing the present story of service-registry (YARN-913), > we also need to simplify the access to the registry entries. The existing > read mechanisms of the YARN Service Registry are currently limited to a > registry specific (java) API and a REST interface. In practice, this makes it > very difficult for wiring up existing clients and services. For e.g, dynamic > configuration of dependent endpoints of a service is not easy to implement > using the present registry-read mechanisms, *without* code-changes to > existing services. > A good solution to this is to expose the registry information through a more > generic and widely used discovery mechanism: DNS. Service Discovery via DNS > uses the well-known DNS interfaces to browse the network for services. > YARN-913 in fact talked about such a DNS based mechanism but left it as a > future task. (Task) Having the registry information exposed via DNS > simplifies the life of services. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15214258#comment-15214258 ] Robert Joseph Evans commented on YARN-4757: --- There are lots of ways to "fix" these issues on a case by case basis. I mostly want to be sure that any documentation around YARN and service discovery is very clear that there are inherent races that can happen on shared infrastructure. YARN/Slider cannot fix them for end users and any client talking to a secure application/server should validate that the server is the correct and expected server. Concrete examples of how to do this would be great. This is not a new issue. It has existed since the registry service was first implemented. We are simply making it much easier for a user to integrate off the shelf components that are coming from a more traditional infrastructure/deployment where this is not necessarily a concern. > [Umbrella] Simplified discovery of services via DNS mechanisms > -- > > Key: YARN-4757 > URL: https://issues.apache.org/jira/browse/YARN-4757 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Vinod Kumar Vavilapalli >Assignee: Jonathan Maron > Attachments: YARN-4757- Simplified discovery of services via DNS > mechanisms.pdf > > > [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track > all related efforts.] > In addition to completing the present story of service-registry (YARN-913), > we also need to simplify the access to the registry entries. The existing > read mechanisms of the YARN Service Registry are currently limited to a > registry specific (java) API and a REST interface. In practice, this makes it > very difficult for wiring up existing clients and services. For e.g, dynamic > configuration of dependent endpoints of a service is not easy to implement > using the present registry-read mechanisms, *without* code-changes to > existing services. > A good solution to this is to expose the registry information through a more > generic and widely used discovery mechanism: DNS. Service Discovery via DNS > uses the well-known DNS interfaces to browse the network for services. > YARN-913 in fact talked about such a DNS based mechanism but left it as a > future task. (Task) Having the registry information exposed via DNS > simplifies the life of services. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
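The server-validation point above is easy to illustrate with plain Java: a minimal sketch, assuming an HTTPS endpoint and a truststore that already contains the service's certificate (the URL below is purely illustrative). With TLS in place, a stray process that ends up behind the service's old name or port cannot present the expected certificate, so the handshake fails instead of silently returning wrong data.

```java
import java.net.URL;
import javax.net.ssl.HttpsURLConnection;

public class CheckServerIdentity {
    public static void main(String[] args) throws Exception {
        // Hypothetical service name; replace with the DNS name the registry exposes.
        URL url = new URL("https://api.myservice.example.com/");
        HttpsURLConnection conn = (HttpsURLConnection) url.openConnection();
        conn.connect();
        // The default hostname verifier has already checked the certificate against
        // the host name; getPeerPrincipal() throws if the peer was not verified.
        System.out.println("Talking to: " + conn.getPeerPrincipal());
        conn.disconnect();
    }
}
```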
[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210513#comment-15210513 ] Robert Joseph Evans commented on YARN-4757: --- My concern about mutual authentication is mostly around documentation and asking if there is anything we can do to mitigate possible issues/attacks. Instead of talking about exact attacks let's talk about a few accidents that could happen, and can happen today, but are less likely because when I update my client to use the Registry API I make different assumptions about things. Let's say I am running a web service on YARN, and I want my customers to be able to get to me through existing tools. So I set this all up and I have them go to http://api.bobby.yarncluster.myCompany.com:/ (or something else that matches the naming convention you had, I don't remember exactly and it is not relevant). First of all I have no way to guarantee that the port is open on any node, so it is a pain, but I try to launch several web servers, finally get a few to come up, the others fail and get relaunched on other nodes, and eventually they are all up and running, and the AM puts all of them into the registry service. Things are going well. Customers are using my service and everyone is happy. But then I do a rolling upgrade and I kill one container and launch a new one on another box. In the meantime some other container on the box I was running on grabs that port and brings up an internal web UI for it. Now many of my customers trying to hit my web service get this other process instead and see 404 errors, etc. Because DNS is eventually consistent, and there is a lot of caching happening, there is a race. If the client does not authenticate the server, like with https, then someone malicious could exploit this to do all kinds of things. I am simply saying that many people trust DNS a lot more than they should in their protocols, more so when they feel that they have DNSSEC turned on internally and they are going to an internal address that they can "trust". Exposing YARN through DNS did not make it any less secure, it just made it a lot simpler for someone to deploy something that is insecure. > [Umbrella] Simplified discovery of services via DNS mechanisms > -- > > Key: YARN-4757 > URL: https://issues.apache.org/jira/browse/YARN-4757 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Vinod Kumar Vavilapalli >Assignee: Jonathan Maron > Attachments: YARN-4757- Simplified discovery of services via DNS > mechanisms.pdf > > > [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track > all related efforts.] > In addition to completing the present story of service-registry (YARN-913), > we also need to simplify the access to the registry entries. The existing > read mechanisms of the YARN Service Registry are currently limited to a > registry specific (java) API and a REST interface. In practice, this makes it > very difficult for wiring up existing clients and services. For e.g, dynamic > configuration of dependent endpoints of a service is not easy to implement > using the present registry-read mechanisms, *without* code-changes to > existing services. > A good solution to this is to expose the registry information through a more > generic and widely used discovery mechanism: DNS. Service Discovery via DNS > uses the well-known DNS interfaces to browse the network for services. > YARN-913 in fact talked about such a DNS based mechanism but left it as a > future task.
(Task) Having the registry information exposed via DNS > simplifies the life of services. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
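One narrow, hedged mitigation for the caching part of the race described above: JVM-based clients can at least keep their own positive DNS cache short, so they re-resolve quickly after a container moves. This does not remove the race or the need to authenticate the server; the property values below are only an example.

```java
import java.security.Security;

public class ShortDnsCache {
    public static void main(String[] args) {
        // Cache successful lookups for only a few seconds instead of the JVM default,
        // so a client notices sooner when a service lands on a different node.
        Security.setProperty("networkaddress.cache.ttl", "5");
        Security.setProperty("networkaddress.cache.negative.ttl", "3");
        // ... application code that resolves and calls the service goes here ...
    }
}
```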
[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210465#comment-15210465 ] Robert Joseph Evans commented on YARN-4757: --- [~jmaron], Thanks for the answers. As for the SRV records and which IP address is returned, it might be good to make it more clear in the document what you are proposing. Security makes sense, it is almost exactly the same as what we do for storm (I really wish ZK had delegation tokens though). My concern was not about whether we have the ability to return multiple addresses. My concern was mostly about how many we can return. Typically the addresses returned for google.com, etc. actually point to a very large load balancer and not individual web servers, so the number of entries returned is on the order of the number of data centers someone has, or, even more likely, it is at an even higher level and is around the number of geographic regions. At Yahoo we run very large HBase clusters. I'm not sure how well tools would handle getting back 2000 IP addresses for a record. I mostly want to understand what, if any, theoretical limits there are to this technology and what, if any, practical limits there are. > [Umbrella] Simplified discovery of services via DNS mechanisms > -- > > Key: YARN-4757 > URL: https://issues.apache.org/jira/browse/YARN-4757 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Vinod Kumar Vavilapalli >Assignee: Jonathan Maron > Attachments: YARN-4757- Simplified discovery of services via DNS > mechanisms.pdf > > > [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track > all related efforts.] > In addition to completing the present story of service-registry (YARN-913), > we also need to simplify the access to the registry entries. The existing > read mechanisms of the YARN Service Registry are currently limited to a > registry specific (java) API and a REST interface. In practice, this makes it > very difficult for wiring up existing clients and services. For e.g, dynamic > configuration of dependent endpoints of a service is not easy to implement > using the present registry-read mechanisms, *without* code-changes to > existing services. > A good solution to this is to expose the registry information through a more > generic and widely used discovery mechanism: DNS. Service Discovery via DNS > uses the well-known DNS interfaces to browse the network for services. > YARN-913 in fact talked about such a DNS based mechanism but left it as a > future task. (Task) Having the registry information exposed via DNS > simplifies the life of services. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
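The practical-limits question above is easy to probe from the client side. A minimal sketch, using a hypothetical record name, that simply asks the resolver for every address behind a name; how many actually come back depends on the DNS server, UDP response size limits, and any intermediate caches, which is exactly the behavior worth measuring for records backed by thousands of containers.

```java
import java.net.InetAddress;

public class CountAddresses {
    public static void main(String[] args) throws Exception {
        // Hypothetical name; in practice this would be a registry-exposed service name.
        InetAddress[] addrs = InetAddress.getAllByName("hbase.services.example.com");
        System.out.println("Resolver returned " + addrs.length + " addresses");
        for (InetAddress a : addrs) {
            System.out.println("  " + a.getHostAddress());
        }
    }
}
```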
[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209067#comment-15209067 ] Robert Joseph Evans commented on YARN-4757: --- I also did a quick pass through the document and I wanted to clarify a few things. So in some places in the document, like with names that map to containers and names that map to components, it says something like "If Available", indicating that if an IP address is not assigned to the individual container no mapping will be made. Am I interpreting that correctly? Are there situations where you would just return the IP address of the node the container is running on? Or am I just mistaken in my interpretation and there are different situations where we could launch a container that would have no IP address available? However, for the per application records there is no such conditional. Does that mean that we will return records for any service API no matter how the IP addresses are assigned, or is there no way for the IP address to not be available? Also, I am not super familiar with the Slider registry, so perhaps you could clarify a few things there too. How is authentication with ZooKeeper handled? Is it always SASL+Kerberos? I ask just because the doc mentions that the RM has to set up the base user directory with permissions. Would any secure Slider app that wants to use the registry then be required to ship a keytab with their application? Also, I am not super familiar with the existing registry API; from the example in the doc it shows a few different types of services that an Application Master can register, both Host/Port and URI. Would we be exposing SRV records for both of these combinations? If so, how would they be named? I am also curious about limits to various DNS fields, both in the protocol and in practice with common implementations. I am not an expert on DNS, so if I say something silly, after you stop laughing please let me know. The document talks a lot about doing character remapping and having to have unique application names, but it does not talk about limits on the lengths of those names (I have seen that some DNS servers don't support more than 254 character names). What about limits on the number of IP addresses that can be returned for a given name? I could not find anything specific, but I have to assume that in practice most systems don't support a huge number of these, and large clusters on YARN can easily launch hundreds or even thousands of containers for a given service. In addition to Allen's concerns, the document does not seem to address/call out my initial concerns about requiring mutual authentication, or handling of port availability in scheduling. > [Umbrella] Simplified discovery of services via DNS mechanisms > -- > > Key: YARN-4757 > URL: https://issues.apache.org/jira/browse/YARN-4757 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Vinod Kumar Vavilapalli >Assignee: Jonathan Maron > Attachments: YARN-4757- Simplified discovery of services via DNS > mechanisms.pdf > > > [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track > all related efforts.] > In addition to completing the present story of service-registry (YARN-913), > we also need to simplify the access to the registry entries. The existing > read mechanisms of the YARN Service Registry are currently limited to a > registry specific (java) API and a REST interface. In practice, this makes it > very difficult for wiring up existing clients and services.
For e.g, dynamic > configuration of dependent endpoints of a service is not easy to implement > using the present registry-read mechanisms, *without* code-changes to > existing services. > A good solution to this is to expose the registry information through a more > generic and widely used discovery mechanism: DNS. Service Discovery via DNS > uses the well-known DNS interfaces to browse the network for services. > YARN-913 in fact talked about such a DNS based mechanism but left it as a > future task. (Task) Having the registry information exposed via DNS > simplifies the life of services. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
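On the name-length question above, the protocol-level limits are fixed by DNS itself (63 octets per label, 255 octets for a full name, per RFC 1035), whatever individual servers tolerate on top of that. A small hedged sketch of the kind of check the record-generation code would need; the example name is made up.

```java
public class DnsNameCheck {
    // Standard DNS limits: 63 octets per label, 255 octets per name
    // (commonly quoted as 253 printable characters once length bytes are counted).
    static void validate(String name) {
        if (name.length() > 253) {
            throw new IllegalArgumentException("name too long: " + name);
        }
        for (String label : name.split("\\.")) {
            if (label.isEmpty() || label.length() > 63) {
                throw new IllegalArgumentException("bad label '" + label + "' in " + name);
            }
        }
    }

    public static void main(String[] args) {
        validate("component.application.user.yarncluster.example.com"); // illustrative only
    }
}
```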
[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15193769#comment-15193769 ] Robert Joseph Evans commented on YARN-4757: --- [~aw], I am not an expert on DNS, so it is good to hear that you have thought through this and done your homework. I read up a little on SRV records and it looks like a good fit. It still does not change the need for two-way authentication and making sure that we can restrict who registers for a service, but because SRV records are not a drop-in replacement for A/CNAME records it should not be as big of an issue. Clients are likely going to need to make changes to support SRV records, and from what I can tell Java does not come with built-in support; not the end of the world, but also likely non-trivial, especially when it looks like the industry has not decided on how it wants to support HTTP. (Although I could be wrong on all of that, because like I said I am not an expert here.) I just want to be sure that you are thinking things through, and it looks like you are, so I am happy. > [Umbrella] Simplified discovery of services via DNS mechanisms > -- > > Key: YARN-4757 > URL: https://issues.apache.org/jira/browse/YARN-4757 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Vinod Kumar Vavilapalli >Assignee: Jonathan Maron > > [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track > all related efforts.] > In addition to completing the present story of service-registry (YARN-913), > we also need to simplify the access to the registry entries. The existing > read mechanisms of the YARN Service Registry are currently limited to a > registry specific (java) API and a REST interface. In practice, this makes it > very difficult for wiring up existing clients and services. For e.g, dynamic > configuration of dependent endpoints of a service is not easy to implement > using the present registry-read mechanisms, *without* code-changes to > existing services. > A good solution to this is to expose the registry information through a more > generic and widely used discovery mechanism: DNS. Service Discovery via DNS > uses the well-known DNS interfaces to browse the network for services. > YARN-913 in fact talked about such a DNS based mechanism but left it as a > future task. (Task) Having the registry information exposed via DNS > simplifies the life of services. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
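On the point above about SRV support in Java: while there is no high-level JDK API, clients can query SRV records through JNDI's DNS provider, so the client-side change is real but modest. A minimal sketch, assuming a hypothetical record name; the `_http._tcp` prefix follows the usual SRV naming convention.

```java
import java.util.Hashtable;
import javax.naming.directory.Attribute;
import javax.naming.directory.Attributes;
import javax.naming.directory.InitialDirContext;

public class SrvLookup {
    public static void main(String[] args) throws Exception {
        Hashtable<String, String> env = new Hashtable<>();
        env.put("java.naming.factory.initial", "com.sun.jndi.dns.DnsContextFactory");
        env.put("java.naming.provider.url", "dns:"); // use the platform's resolver

        InitialDirContext ctx = new InitialDirContext(env);
        // Hypothetical service name following the _service._proto convention.
        Attributes attrs = ctx.getAttributes("_http._tcp.myservice.example.com",
                                             new String[] { "SRV" });
        Attribute srv = attrs.get("SRV");
        for (int i = 0; srv != null && i < srv.size(); i++) {
            // Each value is "priority weight port target".
            System.out.println(srv.get(i));
        }
        ctx.close();
    }
}
```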
[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15193577#comment-15193577 ] Robert Joseph Evans commented on YARN-4757: --- I am +1 on the idea of using DNS for long lived service discovery, but we need to be very very careful about security. If we are not, all of the problems possible with https://en.wikipedia.org/wiki/DNS_spoofing would likely be possible with this too. We need to be positive that we can restrict the names allowed so there are no conflicts with other servers on the network/internet. Additionally, if we make this super simple, which is the entire goal here, then we are covering up some potentially really serious issues with client code that a normal server running off YARN would not expect to have. It really comes down to this: any service running on YARN that wants to be secure needs two-way authentication, where the client authenticates the server and the server authenticates the client. There are timing attacks and other things that can happen when a process crashes and lets go of a port. Internal web services feel especially vulnerable because unless you enable SSL they will be insecure, something that many groups avoid on internal services because of the extra overhead of doing encryption. Do you plan on handling ephemeral ports in some way? As far as I know there is no standard for including port(s) in a DNS entry. If we do come up with something that is non-standard, doesn't that still necessitate client side changes, when avoiding them was an expressed goal of this JIRA? If we don't handle ephemeral ports, are we going to add in Mesos-like scheduling of ports? > [Umbrella] Simplified discovery of services via DNS mechanisms > -- > > Key: YARN-4757 > URL: https://issues.apache.org/jira/browse/YARN-4757 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Vinod Kumar Vavilapalli >Assignee: Jonathan Maron > > [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track > all related efforts.] > In addition to completing the present story of service-registry (YARN-913), > we also need to simplify the access to the registry entries. The existing > read mechanisms of the YARN Service Registry are currently limited to a > registry specific (java) API and a REST interface. In practice, this makes it > very difficult for wiring up existing clients and services. For e.g, dynamic > configuration of dependent endpoints of a service is not easy to implement > using the present registry-read mechanisms, *without* code-changes to > existing services. > A good solution to this is to expose the registry information through a more > generic and widely used discovery mechanism: DNS. Service Discovery via DNS > uses the well-known DNS interfaces to browse the network for services. > YARN-913 in fact talked about such a DNS based mechanism but left it as a > future task. (Task) Having the registry information exposed via DNS > simplifies the life of services. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
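To make the two-way authentication requirement above concrete, here is a minimal sketch of building an SSLContext that can drive mutual TLS in plain Java; the keystore paths and passwords are placeholders, and a real deployment would more likely get its keys from the cluster's certificate infrastructure.

```java
import java.io.FileInputStream;
import java.security.KeyStore;
import javax.net.ssl.KeyManagerFactory;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

public class MutualTls {
    public static SSLContext build() throws Exception {
        // Our own identity (placeholder path/password): presented to the peer.
        KeyStore own = KeyStore.getInstance("JKS");
        own.load(new FileInputStream("service.jks"), "changeit".toCharArray());
        KeyManagerFactory kmf =
            KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
        kmf.init(own, "changeit".toCharArray());

        // Certificates we trust (placeholder path/password): used to verify the peer.
        KeyStore trusted = KeyStore.getInstance("JKS");
        trusted.load(new FileInputStream("truststore.jks"), "changeit".toCharArray());
        TrustManagerFactory tmf =
            TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(trusted);

        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(kmf.getKeyManagers(), tmf.getTrustManagers(), null);
        // A server socket created from ctx with setNeedClientAuth(true) authenticates
        // clients; a client socket from ctx authenticates the server. That is the
        // "client authenticates server and server authenticates client" requirement.
        return ctx;
    }
}
```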
[jira] [Commented] (YARN-3605) _ as method name may not be supported much longer
[ https://issues.apache.org/jira/browse/YARN-3605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548375#comment-14548375 ] Robert Joseph Evans commented on YARN-3605: --- This is not a newbie issue. The code that has the _ method in it is generated code, and the code that generates it is far from simple. This is also technically a backwards incompatible change, because other YARN applications could be using it. _ as method name may not be supported much longer - Key: YARN-3605 URL: https://issues.apache.org/jira/browse/YARN-3605 Project: Hadoop YARN Issue Type: Bug Reporter: Robert Joseph Evans I was trying to run the precommit test on my mac under JDK8, and I got the following error related to javadocs. (use of '_' as an identifier might not be supported in releases after Java SE 8) It looks like we need to at least change the method name to not be '_' any more, or possibly replace the HTML generation with something more standard. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3605) _ as method name may not be supported much longer
[ https://issues.apache.org/jira/browse/YARN-3605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated YARN-3605: -- Labels: (was: newbie) _ as method name may not be supported much longer - Key: YARN-3605 URL: https://issues.apache.org/jira/browse/YARN-3605 Project: Hadoop YARN Issue Type: Bug Reporter: Robert Joseph Evans I was trying to run the precommit test on my mac under JDK8, and I got the following error related to javadocs. (use of '_' as an identifier might not be supported in releases after Java SE 8) It looks like we need to at least change the method name to not be '_' any more, or possibly replace the HTML generation with something more standard. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-644) Basic null check is not performed on passed in arguments before using them in ContainerManagerImpl.startContainer
[ https://issues.apache.org/jira/browse/YARN-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated YARN-644: - Labels: (was: BB2015-05-RFC) Basic null check is not performed on passed in arguments before using them in ContainerManagerImpl.startContainer - Key: YARN-644 URL: https://issues.apache.org/jira/browse/YARN-644 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.7.0 Reporter: Omkar Vinit Joshi Assignee: Varun Saxena Priority: Minor Attachments: YARN-644.001.patch, YARN-644.002.patch, YARN-644.03.patch, YARN-644.04.patch, YARN-644.05.patch I see that validation/ null check is not performed on passed in parameters. Ex. tokenId.getContainerID().getApplicationAttemptId() inside ContainerManagerImpl.authorizeRequest() I guess we should add these checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3605) _ as method name may not be supported much longer
Robert Joseph Evans created YARN-3605: - Summary: _ as method name may not be supported much longer Key: YARN-3605 URL: https://issues.apache.org/jira/browse/YARN-3605 Project: Hadoop YARN Issue Type: Bug Reporter: Robert Joseph Evans I was trying to run the precommit test on my mac under JDK8, and I got the following error related to javadocs. (use of '_' as an identifier might not be supported in releases after Java SE 8) It looks like we need to at least change the method name to not be '_' any more, or possibly replace the HTML generation with something more standard. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-644) Basic null check is not performed on passed in arguments before using them in ContainerManagerImpl.startContainer
[ https://issues.apache.org/jira/browse/YARN-644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534803#comment-14534803 ] Robert Joseph Evans commented on YARN-644: -- Thanks [~varun_saxena], I agree with [~gtCarrera9] +1. I'll check this in. Basic null check is not performed on passed in arguments before using them in ContainerManagerImpl.startContainer - Key: YARN-644 URL: https://issues.apache.org/jira/browse/YARN-644 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.7.0 Reporter: Omkar Vinit Joshi Assignee: Varun Saxena Priority: Minor Attachments: YARN-644.001.patch, YARN-644.002.patch, YARN-644.03.patch, YARN-644.04.patch, YARN-644.05.patch I see that validation/ null check is not performed on passed in parameters. Ex. tokenId.getContainerID().getApplicationAttemptId() inside ContainerManagerImpl.authorizeRequest() I guess we should add these checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3148) allow CORS related headers to passthrough in WebAppProxyServlet
[ https://issues.apache.org/jira/browse/YARN-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534957#comment-14534957 ] Robert Joseph Evans commented on YARN-3148: --- The changes look fine to me. Not sure why the patch could not apply. The queue is full right now, so I will try to run things manually. allow CORS related headers to passthrough in WebAppProxyServlet --- Key: YARN-3148 URL: https://issues.apache.org/jira/browse/YARN-3148 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.7.0 Reporter: Prakash Ramachandran Assignee: Varun Saxena Labels: BB2015-05-RFC Attachments: YARN-3148.001.patch, YARN-3148.02.patch, YARN-3148.03.patch currently the WebAppProxyServlet filters the request headers as defined by passThroughHeaders. Tez UI is building a webapp which using rest api to fetch data from the am via the rm tracking url. for this purpose it would be nice to have additional headers allowed especially the ones related to CORS. A few of them that would help are * Origin * Access-Control-Request-Method * Access-Control-Request-Headers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
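For readers unfamiliar with the proxy, the change under review amounts to growing the set of request headers the servlet forwards. A hedged sketch of the idea only; the real field name and header list in WebAppProxyServlet may differ from this.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class PassThroughHeaders {
    // Illustrative allow-list: the usual pass-through headers plus the CORS
    // headers the Tez UI use case needs. Real code should compare header names
    // case-insensitively, since HTTP header names are not case sensitive.
    static final Set<String> PASS_THROUGH = new HashSet<>(Arrays.asList(
        "User-Agent", "Accept", "Accept-Encoding", "Accept-Language",
        "Origin", "Access-Control-Request-Method", "Access-Control-Request-Headers"));

    static boolean shouldForward(String headerName) {
        return PASS_THROUGH.contains(headerName);
    }
}
```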
[jira] [Updated] (YARN-3148) allow CORS related headers to passthrough in WebAppProxyServlet
[ https://issues.apache.org/jira/browse/YARN-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated YARN-3148: -- Labels: (was: BB2015-05-RFC) allow CORS related headers to passthrough in WebAppProxyServlet --- Key: YARN-3148 URL: https://issues.apache.org/jira/browse/YARN-3148 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.7.0 Reporter: Prakash Ramachandran Assignee: Varun Saxena Attachments: YARN-3148.001.patch, YARN-3148.02.patch, YARN-3148.03.patch currently the WebAppProxyServlet filters the request headers as defined by passThroughHeaders. Tez UI is building a webapp which using rest api to fetch data from the am via the rm tracking url. for this purpose it would be nice to have additional headers allowed especially the ones related to CORS. A few of them that would help are * Origin * Access-Control-Request-Method * Access-Control-Request-Headers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3148) allow CORS related headers to passthrough in WebAppProxyServlet
[ https://issues.apache.org/jira/browse/YARN-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated YARN-3148: -- Labels: BB2015-05-RFC (was: ) allow CORS related headers to passthrough in WebAppProxyServlet --- Key: YARN-3148 URL: https://issues.apache.org/jira/browse/YARN-3148 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.7.0 Reporter: Prakash Ramachandran Assignee: Varun Saxena Labels: BB2015-05-RFC Attachments: YARN-3148.001.patch, YARN-3148.02.patch, YARN-3148.03.patch currently the WebAppProxyServlet filters the request headers as defined by passThroughHeaders. Tez UI is building a webapp which using rest api to fetch data from the am via the rm tracking url. for this purpose it would be nice to have additional headers allowed especially the ones related to CORS. A few of them that would help are * Origin * Access-Control-Request-Method * Access-Control-Request-Headers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2261) YARN should have a way to run post-application cleanup
[ https://issues.apache.org/jira/browse/YARN-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058745#comment-14058745 ] Robert Joseph Evans commented on YARN-2261: --- +1 either approach seems fine to me. Vinod's requires an opt in, which is nice from a backwards compatibility standpoint. Also do we want to count the cleanup container as a running application? We definitely need to count its resources against any queue it is a part of, but for a queue that is configured to run mostly large applications, it could have other applications back up behind the cleanup containers. YARN should have a way to run post-application cleanup -- Key: YARN-2261 URL: https://issues.apache.org/jira/browse/YARN-2261 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli See MAPREDUCE-5956 for context. Specific options are at https://issues.apache.org/jira/browse/MAPREDUCE-5956?focusedCommentId=14054562page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14054562. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2261) YARN should have a way to run post-application cleanup
[ https://issues.apache.org/jira/browse/YARN-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14059060#comment-14059060 ] Robert Joseph Evans commented on YARN-2261: --- Yes and that is not necessarily a good thing. Especially if cleanup can take a relatively long period of time. YARN should have a way to run post-application cleanup -- Key: YARN-2261 URL: https://issues.apache.org/jira/browse/YARN-2261 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli See MAPREDUCE-5956 for context. Specific options are at https://issues.apache.org/jira/browse/MAPREDUCE-5956?focusedCommentId=14054562page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14054562. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM
[ https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14055588#comment-14055588 ] Robert Joseph Evans commented on YARN-611: -- Why is the reset policy created on a per app ATTEMPT basis? Shouldn't it be on a per application basis? Wouldn't having more than one WindowsSlideAMRetryCountResetPolicy per application be a waste, as they will either be running in parallel racing with each other, or there will be extra overhead to stop and start them for each application attempt? Inside WindowsSlideAMRetryCountResetPolicy you create a new Timer. Timer instances create a new thread, and I am not sure we really need a new thread for potentially each application just so the thread can wake up every few seconds to reset a counter. Inside WindowsSlideAMRetryCountResetPolicy.amRetryCountReset we call rmApp.getCurrentAppAttempt() in a loop. Why don't we cache it? I also don't really like how the code handles locking. To me it always feels bad to hold a lock while calling into a class that may call back into you, especially from a different thread. The WindowsSlideAMRetryCountResetPolicy calls into getAppAttemptId, shouldCountTowardsMaxAttemptRetry, mayBeLastAttempt, and setMaybeLastAttemptFlag of RmAppAttemptImpl. RmAppAttemptImpl calls into start, stop, and recover for the resetPolicy. Right now I don't think there are any potential deadlocks because RmAppAttemptImpl never holds a lock while interacting directly with resetPolicy, but if it ever does then it could deadlock. I'm not sure of a good way to fix this, except perhaps through comments in the ResetPolicy interface specifying that start/stop/recover will never be called while holding a lock for RMAppAttempt or RMApp. Add an AM retry count reset window to YARN RM - Key: YARN-611 URL: https://issues.apache.org/jira/browse/YARN-611 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.3-alpha Reporter: Chris Riccomini Assignee: Xuan Gong Attachments: YARN-611.1.patch, YARN-611.2.patch, YARN-611.3.patch YARN currently has the following config: yarn.resourcemanager.am.max-retries This config defaults to 2, and defines how many times to retry a failed AM before failing the whole YARN job. YARN counts an AM as failed if the node that it was running on dies (the NM will timeout, which counts as a failure for the AM), or if the AM dies. This configuration is insufficient for long running (or infinitely running) YARN jobs, since the machine (or NM) that the AM is running on will eventually need to be restarted (or the machine/NM will fail). In such an event, the AM has not done anything wrong, but this is counted as a failure by the RM. Since the retry count for the AM is never reset, eventually, at some point, the number of machine/NM failures will result in the AM failure count going above the configured value for yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the job as failed, and shut it down. This behavior is not ideal. I propose that we add a second configuration: yarn.resourcemanager.am.retry-count-window-ms This configuration would define a window of time that would define when an AM is well behaved, and it's safe to reset its failure count back to zero. Every time an AM fails the RmAppImpl would check the last time that the AM failed. If the last failure was less than retry-count-window-ms ago, and the new failure count is >= max-retries, then the job should fail. If the AM has never failed, the retry count is < max-retries, or if the last failure was OUTSIDE the retry-count-window-ms, then the job should be restarted. Additionally, if the last failure was outside the retry-count-window-ms, then the failure count should be set back to 0. This would give developers a way to have well-behaved AMs run forever, while still failing mis-behaving AMs after a short period of time. I think the work to be done here is to change the RmAppImpl to actually look at app.attempts, and see if there have been more than max-retries failures in the last retry-count-window-ms milliseconds. If there have, then the job should fail; if not, then the job should go forward. Additionally, we might also need to add an endTime in either RMAppAttemptImpl or RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the failure. Thoughts? -- This message was sent by Atlassian JIRA (v6.2#6252)
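For reference while reading the review comments above, a hedged sketch of the sliding-window behavior the issue describes; the class and field names here are illustrative and are not the actual RMAppImpl/reset-policy code under review.

```java
public final class AmRetryWindow {
    private final long windowMs;     // yarn.resourcemanager.am.retry-count-window-ms
    private final int maxRetries;    // yarn.resourcemanager.am.max-retries
    private int failureCount;
    private long lastFailureTime;

    public AmRetryWindow(long windowMs, int maxRetries) {
        this.windowMs = windowMs;
        this.maxRetries = maxRetries;
    }

    /** Called on each AM failure; returns true if the whole application should fail. */
    public boolean onAmFailure(long now) {
        if (failureCount > 0 && now - lastFailureTime > windowMs) {
            failureCount = 0;   // last failure is outside the window: forgive old failures
        }
        failureCount++;
        lastFailureTime = now;
        // Fail only when failures pile up inside the window; otherwise restart the AM.
        return failureCount >= maxRetries;
    }
}
```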
[jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM
[ https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054168#comment-14054168 ] Robert Joseph Evans commented on YARN-611: -- Why are you using Java serialization for the retry policy? There are too many problems with Java serialization, especially if we are persisting it into a DB, like the state store. Please switch to using something like protocol buffers that will allow for forward/backward compatible modifications going forward. In the javadocs for RMApp.setRetryCount it would be good to explain what the retry count actually is and does. In the constructor for RMAppAttemptImpl there is special logic to call setup only for WindowsSlideAMRetryCountResetPolicy. This completely loses the abstraction that the AMResetCountPolicy interface should be providing. Please update the interface so that you don't need special case code for a single implementation. In RMAppAttemptImpl you mark setMaybeLastAttemptFlag as Private; this really should have been done in the parent interface. In the parent interface you also add in mayBeLastAttempt(); this too should be marked as Private, and both of them should have comments noting that they are for testing. Add an AM retry count reset window to YARN RM - Key: YARN-611 URL: https://issues.apache.org/jira/browse/YARN-611 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.3-alpha Reporter: Chris Riccomini Assignee: Xuan Gong Attachments: YARN-611.1.patch YARN currently has the following config: yarn.resourcemanager.am.max-retries This config defaults to 2, and defines how many times to retry a failed AM before failing the whole YARN job. YARN counts an AM as failed if the node that it was running on dies (the NM will timeout, which counts as a failure for the AM), or if the AM dies. This configuration is insufficient for long running (or infinitely running) YARN jobs, since the machine (or NM) that the AM is running on will eventually need to be restarted (or the machine/NM will fail). In such an event, the AM has not done anything wrong, but this is counted as a failure by the RM. Since the retry count for the AM is never reset, eventually, at some point, the number of machine/NM failures will result in the AM failure count going above the configured value for yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the job as failed, and shut it down. This behavior is not ideal. I propose that we add a second configuration: yarn.resourcemanager.am.retry-count-window-ms This configuration would define a window of time that would define when an AM is well behaved, and it's safe to reset its failure count back to zero. Every time an AM fails the RmAppImpl would check the last time that the AM failed. If the last failure was less than retry-count-window-ms ago, and the new failure count is >= max-retries, then the job should fail. If the AM has never failed, the retry count is < max-retries, or if the last failure was OUTSIDE the retry-count-window-ms, then the job should be restarted. Additionally, if the last failure was outside the retry-count-window-ms, then the failure count should be set back to 0. This would give developers a way to have well-behaved AMs run forever, while still failing mis-behaving AMs after a short period of time. I think the work to be done here is to change the RmAppImpl to actually look at app.attempts, and see if there have been more than max-retries failures in the last retry-count-window-ms milliseconds. If there have, then the job should fail; if not, then the job should go forward. Additionally, we might also need to add an endTime in either RMAppAttemptImpl or RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the failure. Thoughts? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2140) Add support for network IO isolation/scheduling for containers
[ https://issues.apache.org/jira/browse/YARN-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14030643#comment-14030643 ] Robert Joseph Evans commented on YARN-2140: --- We are working on similar things for storm. I am very interested in your design, because for any streaming system to truly have a chance on YARN, soft guarantees on network I/O are critical. There are several big problems with network I/O even if the user can effectively estimate what they will need. The first is that the resource is not limited to a single node in the cluster. The network has a topology and a bottleneck can show up at any point in that topology. So you may think you are fine because each node in a rack is not scheduled to be using the full bandwidth that the network card(s) can support, but you can easily have saturated the top of rack switch without knowing it. To solve this problem you effectively have to know the topology of the application itself, so that you can schedule the node to node network connections within that application. If users don't know how much network they are going to use at a high level, they will never have any idea at a low level. But then you also have the big problem of batch being very bursty in its network usage. The only way to solve this is going to require network hardware support for prioritizing packets. But I'll wait for your design before writing too much more. Add support for network IO isolation/scheduling for containers -- Key: YARN-2140 URL: https://issues.apache.org/jira/browse/YARN-2140 Project: Hadoop YARN Issue Type: New Feature Reporter: Wei Yan Assignee: Wei Yan -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data
[ https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13872300#comment-13872300 ] Robert Joseph Evans commented on YARN-1530: --- I agree that we need to think about load and plan for something that can handle at least 20x the current load, but preferably 100x. However, I am not that sure that the load will be a huge problem, at least for current MR clusters. We have seen very large jobs as well, but a job with a 700 MB history file does not finish instantly. I took a look at a 3500 node cluster we have that is under fairly heavy load, and looking at the done directory for yesterday, I saw what amounted to about 1.7MB/sec of data on average. Gigabit Ethernet should be able to handle 15 to 20 times this (assuming that we read as much data as we write, and that the storage may require some replication). I am fine with the proposed solution by [~lohit] so long as the history service always provides a RESTful interface and the AM can decide if it wants to use it, or go through a different higher load channel. Otherwise non-Java based AMs would not necessarily be able to write to the history service. I am also a bit nervous about using the history service for recovery or as a backend for the current MR APIs if we have a pub/sub system as a link between the applications and the history service. I don't think it is a show stopper, it just opens the door for a number of corner cases that will have to be dealt with. For example, if an MR AM crashes badly and the client goes to the history service to get the counters/etc, when does the history service know that all of the events for the MR AM have been processed so it can return those counters, or perhaps other data? I am not totally sure what data may be a show stopper for this, but the lag means all applications have to be sure that they don't use the history service for split brain problems or things like that. [Umbrella] Store, manage and serve per-framework application-timeline data -- Key: YARN-1530 URL: https://issues.apache.org/jira/browse/YARN-1530 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Attachments: application timeline design-20140108.pdf This is a sibling JIRA for YARN-321. Today, each application/framework has to do store, and serve per-framework data all by itself as YARN doesn't have a common solution. This JIRA attempts to solve the storage, management and serving of per-framework data from various applications, both running and finished. The aim is to change YARN to collect and store data in a generic manner with plugin points for frameworks to do their own thing w.r.t interpretation and serving. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
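A rough sanity check of the 15 to 20x figure above, under one possible reading of the stated assumptions (history is read about as often as it is written, and writes are replicated three ways), with 1 GbE taken as roughly 125 MB/s:

\[
\frac{125\ \text{MB/s (1 GbE)}}{1.7\ \text{MB/s} \times (3\ \text{replicated writes} + 1\ \text{read})} \approx 18\times
\]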
[jira] [Commented] (YARN-321) Generic application history service
[ https://issues.apache.org/jira/browse/YARN-321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13855707#comment-13855707 ] Robert Joseph Evans commented on YARN-321: -- The way it currently works is based off of group permissions on a directory (this is from memory from a while ago, so I could be off on a few things). In HDFS, when you create a file, the group of the file is the group of the directory the file is a part of, similar to the setgid bit on a directory in Linux. When an MR job completes it will copy its history log file, along with a few other files, to a drop-box-like location called "intermediate done" and atomically rename it from a temp name to the final name. The directory is world writable, but only readable by a special group that the history server is a part of and general users are not. The history server then wakes up periodically and will scan that directory for new files; when it sees new files it will move them to a final location that is owned by the headless history server user. If a query comes in for a job that the history server is not aware of, it will also scan the intermediate done directory before failing. Reading history data is done through RPC to the history server, or through the web interface, including RESTful APIs. There is no supported way for an app to read history data directly through the file system. I hope this helps. Generic application history service --- Key: YARN-321 URL: https://issues.apache.org/jira/browse/YARN-321 Project: Hadoop YARN Issue Type: Improvement Reporter: Luke Lu Assignee: Vinod Kumar Vavilapalli Attachments: AHS Diagram.pdf, ApplicationHistoryServiceHighLevel.pdf, Generic Application History - Design-20131219.pdf, HistoryStorageDemo.java The mapreduce job history server currently needs to be deployed as a trusted server in sync with the mapreduce runtime. Every new application would need a similar application history server. Having to deploy O(T*V) (where T is number of type of application, V is number of version of application) trusted servers is clearly not scalable. Job history storage handling itself is pretty generic: move the logs and history data into a particular directory for later serving. Job history data is already stored as json (or binary avro). I propose that we create only one trusted application history server, which can have a generic UI (display json as a tree of strings) as well. Specific application/version can deploy untrusted webapps (a la AMs) to query the application history server and interpret the json for its specific UI and/or analytics. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
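The temp-name-then-rename hand-off described above is the key to the scheme, since the history server must never pick up a half-written file. A minimal sketch with the Hadoop FileSystem API; the directory layout and file name are illustrative, not the exact paths the MR history code uses.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HistoryDropBox {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Illustrative paths only; the real intermediate-done layout differs.
        Path tmp  = new Path("/mr-history/intermediate-done/job_0001.jhist.tmp");
        Path done = new Path("/mr-history/intermediate-done/job_0001.jhist");

        // Write the history data under a temporary name first ...
        try (FSDataOutputStream out = fs.create(tmp)) {
            out.writeBytes("...history events...");
        }
        // ... then atomically rename it so the history server only ever sees
        // complete files when it scans the directory.
        fs.rename(tmp, done);
    }
}
```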
[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application
[ https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805618#comment-13805618 ] Robert Joseph Evans commented on YARN-941: -- That sounds like a great default. I would like to also have a way for an AM to say I can handle updating tokens without being shot, but that may be something that shows up in a follow on JIRA. RM Should have a way to update the tokens it has for a running application -- Key: YARN-941 URL: https://issues.apache.org/jira/browse/YARN-941 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans When an application is submitted to the RM it includes with it a set of tokens that the RM will renew on behalf of the application, that will be passed to the AM when the application is launched, and will be used when launching the application to access HDFS to download files on behalf of the application. For long lived applications/services these tokens can expire, and then the tokens that the AM has will be invalid, and the tokens that the RM had will also not work to launch a new AM. We need to provide an API that will allow the RM to replace the current tokens for this application with a new set. To avoid any real race issues, I think this API should be something that the AM calls, so that the client can connect to the AM with a new set of tokens it got using kerberos, then the AM can inform the RM of the new set of tokens and quickly update its tokens internally to use these new ones. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-321) Generic application history service
[ https://issues.apache.org/jira/browse/YARN-321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13790555#comment-13790555 ] Robert Joseph Evans commented on YARN-321: -- I like the diagrams, but I want to understand if the generic application history service is intended to replace the job history server, or to just augment it? I would prefer it if we could replace the current server. Perhaps not in the first release, but eventually. To make that work we would have to provide a way for MR specific code to come up and run inside the service, exposing both the current restful web service, an application specific UI, and the RPC server that we currently run. Generic application history service --- Key: YARN-321 URL: https://issues.apache.org/jira/browse/YARN-321 Project: Hadoop YARN Issue Type: Improvement Reporter: Luke Lu Assignee: Vinod Kumar Vavilapalli Attachments: AHS Diagram.pdf, ApplicationHistoryServiceHighLevel.pdf, HistoryStorageDemo.java The mapreduce job history server currently needs to be deployed as a trusted server in sync with the mapreduce runtime. Every new application would need a similar application history server. Having to deploy O(T*V) (where T is number of type of application, V is number of version of application) trusted servers is clearly not scalable. Job history storage handling itself is pretty generic: move the logs and history data into a particular directory for later serving. Job history data is already stored as json (or binary avro). I propose that we create only one trusted application history server, which can have a generic UI (display json as a tree of strings) as well. Specific application/version can deploy untrusted webapps (a la AMs) to query the application history server and interpret the json for its specific UI and/or analytics. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated YARN-913: - Attachment: RegistrationServiceDetails.txt Uploading a file that shows some examples of the registration service APIs. Any feedback on them is appreciated. Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api Affects Versions: 3.0.0 Reporter: Steve Loughran Assignee: Robert Joseph Evans Attachments: RegistrationServiceDetails.txt In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Assigned] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans reassigned YARN-913: Assignee: Robert Joseph Evans Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api Affects Versions: 3.0.0 Reporter: Steve Loughran Assignee: Robert Joseph Evans In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13786299#comment-13786299 ] Robert Joseph Evans commented on YARN-913: -- Yes it does have plenty of races. I'll try to get some detailed designs up shortly, but at a high level the general idea is to have a RESTful web service. For the most common use case there just need to be two interfaces. - Register/Monitor a Service - Query for Services Part of the reason we need the service registry is to securely verify that a client is talking to the real service, and no one has grabbed the service's port after it registered. To do that I want to have the concept of a verified service. For that we would need an admin interface for adding, updating, and removing verified services. The registry would provide a number of pluggable ways for services to authenticate. Part of adding a verified service would include indicating which authentication models the service can use to register and which users are allowed to register that service. The registry could also act like a trusted Certificate Authority. Another part of adding in a verified service would include indicating how clients could verify they are talking to the true service. This could include just publishing an application id so the client can go to the RM and get a delegation token. Another option would be having the service generate a public/private key pair. When the service registers it would get the private key and the public key would be available through the discovery interface. The plan is to also have the registry monitor the service similar to ZK. The service would heartbeat in to the registry periodically (could be on the order of minutes depending on the service); after a certain period of time of inactivity the service would be removed from the registry. Perhaps we should add in an explicit unregister as well. I want to make sure that the data model is generic enough that we could support something like a web service on the grid where each server can register itself and all of them would show up in the registry, so a service could have one or more servers that are a part of it, and each server could have some separate metadata about it. I also want to have a plug-in interface for discovery, so we could potentially make the registry look like a DNS server or an SSL Certificate Authority, which would make compatibility with existing applications and clients a lot simpler. Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api Affects Versions: 3.0.0 Reporter: Steve Loughran Assignee: Robert Joseph Evans In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.1#6144)
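A hedged sketch of the two common-case interfaces described above, just to pin down the shape of the calls; the method names, parameters, and the idea of folding the heartbeat into re-registration are illustrative, not a committed API.

```java
import java.util.List;

public interface ServiceRegistry {
    /** Register (or re-register) a server under a service; doubles as the heartbeat. */
    void register(String serviceName, String serverId, String endpoint, long ttlSeconds);

    /** Query for the live servers currently registered under a service name. */
    List<String> lookup(String serviceName);

    /** Optional explicit removal instead of waiting for the heartbeat TTL to expire. */
    void unregister(String serviceName, String serverId);
}
```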
[jira] [Commented] (YARN-624) Support gang scheduling in the AM RM protocol
[ https://issues.apache.org/jira/browse/YARN-624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13785556#comment-13785556 ] Robert Joseph Evans commented on YARN-624: -- [~curino] Sorry about the late reply. I have not really tested this much with storm on YARN. In most of our experiments the amount of time it takes to get nodes is negligible. But we have not really done anything serious with it, and adding new nodes right now is a manual operation. Support gang scheduling in the AM RM protocol - Key: YARN-624 URL: https://issues.apache.org/jira/browse/YARN-624 Project: Hadoop YARN Issue Type: Sub-task Components: api, scheduler Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Per discussion on YARN-392 and elsewhere, gang scheduling, in which a scheduler runs a set of tasks when they can all be run at the same time, would be a useful feature for YARN schedulers to support. Currently, AMs can approximate this by holding on to containers until they get all the ones they need. However, this lends itself to deadlocks when different AMs are waiting on the same containers. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-896) Roll up for long lived YARN
[ https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13754787#comment-13754787 ] Robert Joseph Evans commented on YARN-896: -- I agree that providing a good way to handle stdout and stderr is important. I don't know if I want the NM to be doing this for us though, but that is an implementation detail that we can talk about on the follow-up JIRA. Chris, feel free to file a JIRA for rolling of stdout and stderr and we can look into what it will take to support that properly. Roll up for long lived YARN --- Key: YARN-896 URL: https://issues.apache.org/jira/browse/YARN-896 Project: Hadoop YARN Issue Type: New Feature Reporter: Robert Joseph Evans YARN is intended to be general purpose, but it is missing some features to be able to truly support long lived applications and long lived containers. This ticket is intended to # discuss what is needed to support long lived processes # track the resulting JIRA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-896) Roll up for long lived YARN
[ https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13743819#comment-13743819 ] Robert Joseph Evans commented on YARN-896: -- [~criccomini], That is a great point. To do this we need the application to somehow inform YARN that it is a long lived application. We could do this either through some sort of metadata that is submitted with the application to YARN, possibly through the service registry, or even perhaps just setting the progress to a special value like -1. I think I would prefer the first one, because then YARN could use that metadata later on for other things. After that the UI change should not be too difficult. If you want to file a JIRA for it, either as a sub task or just link it in, that would be great. Roll up for long lived YARN --- Key: YARN-896 URL: https://issues.apache.org/jira/browse/YARN-896 Project: Hadoop YARN Issue Type: New Feature Reporter: Robert Joseph Evans YARN is intended to be general purpose, but it is missing some features to be able to truly support long lived applications and long lived containers. This ticket is intended to # discuss what is needed to support long lived processes # track the resulting JIRA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-810) Support CGroup ceiling enforcement on CPU
[ https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13741170#comment-13741170 ] Robert Joseph Evans commented on YARN-810: -- Sorry I am a bit late to this discussion. I don't like the config to be global. I think it needs to be on a per container basis. {quote}There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if it's available.{quote} The question is are there ever two different applications running on the same cluster where it is desirable for one, and not for the other. I believe that is true. I argued this in YARN-102 where you want to measure how long an application will take to run under a specific CPU resource request. If I allow it to go over I will never know how long it would take worst case, and so I will never know if my config is correct unless I can artificially limit it. But in production I don't want to run worst case every time, and I don't want a special test cluster to see what the worst case is. Support CGroup ceiling enforcement on CPU - Key: YARN-810 URL: https://issues.apache.org/jira/browse/YARN-810 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.1.0-beta, 2.0.5-alpha Reporter: Chris Riccomini Assignee: Sandy Ryza Problem statement: YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. Containers are then allowed to request vcores between the minimum and maximum defined in the yarn-site.xml. In the case where a single-threaded container requests 1 vcore, with a pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of the core it's using, provided that no other container is also using it. This happens, even though the only guarantee that YARN/CGroups is making is that the container will get at least 1/4th of the core. If a second container then comes along, the second container can take resources from the first, provided that the first container is still getting at least its fair share (1/4th). There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if it's available. Here's an RFC that describes the problem in more detail: http://lwn.net/Articles/336127/ Solution: As it happens, when CFS is used in combination with CGroups, you can enforce a ceiling using two files in cgroups: {noformat} cpu.cfs_quota_us cpu.cfs_period_us {noformat} The usage of these two files is documented in more detail here: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html Testing: I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, it behaves as described above (it is a soft cap, and allows containers to use more than they asked for). I then tested CFS CPU quotas manually with YARN. 
First, you can see that CFS is in use in the CGroup, based on the file names: {noformat} [criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/ total 0 -r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares -r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat -rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release -rw-r--r-- 1 app app 0 Jun 13 16:46 tasks [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_period_us 10 [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us -1 {noformat} Oddly, it appears that the cfs_period_us is set to .1s, not 1s. We can place processes in hard limits. I have process 4370 running YARN container container_1371141151815_0003_01_03 on a host. By default, it's running at ~300% cpu usage. {noformat} CPU 4370 criccomi 20 0 1157m 551m 14m S 240.3 0.8 87:10.91 ... {noformat} When I set the CFS quota: {noformat} echo 1000 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us CPU 4370 criccomi 20 0 1157m 563m 14m S 1.0 0.8 90:08.39
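To make the ceiling mechanism concrete, here is a small sketch of how a quota could be derived from a container's vcore request and written into its cgroup. The 1:4 pcore:vcore ratio, the 100ms period, and the paths are illustrative assumptions taken from the description above, not what the NodeManager actually does today:
{code}
import java.io.FileWriter;
import java.io.IOException;

public class CfsCeilingSketch {
  // Illustrative assumptions: a 100ms CFS period and the 1:4 pcore:vcore
  // ratio used in the description above.
  private static final long CFS_PERIOD_US = 100000L;
  private static final int VCORES_PER_PCORE = 4;

  /** Quota in microseconds that caps a container at its vcore share of one period. */
  static long quotaForVcores(int vcores) {
    return (CFS_PERIOD_US * vcores) / VCORES_PER_PCORE;
  }

  /** Write the ceiling into the container's cgroup directory (path is illustrative). */
  static void applyCeiling(String containerCgroupDir, int vcores) throws IOException {
    try (FileWriter w = new FileWriter(containerCgroupDir + "/cpu.cfs_quota_us")) {
      w.write(Long.toString(quotaForVcores(vcores)));
    }
  }

  public static void main(String[] args) throws IOException {
    // A 1-vcore container would be capped at 25ms of CPU time per 100ms period.
    applyCeiling("/cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03", 1);
  }
}
{code}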
[jira] [Commented] (YARN-1024) Define a virtual core unambigiously
[ https://issues.apache.org/jira/browse/YARN-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13741518#comment-13741518 ] Robert Joseph Evans commented on YARN-1024: --- I am fine with that too. Define a virtual core unambigiously --- Key: YARN-1024 URL: https://issues.apache.org/jira/browse/YARN-1024 Project: Hadoop YARN Issue Type: Improvement Reporter: Arun C Murthy Assignee: Arun C Murthy We need to clearly define the meaning of a virtual core unambiguously so that it's easy to migrate applications between clusters. For e.g. here is Amazon EC2 definition of ECU: http://aws.amazon.com/ec2/faqs/#What_is_an_EC2_Compute_Unit_and_why_did_you_introduce_it Essentially we need to clearly define a YARN Virtual Core (YVC). Equivalently, we can use ECU itself: *One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.* -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1024) Define a virtual core unambigiously
[ https://issues.apache.org/jira/browse/YARN-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13739853#comment-13739853 ] Robert Joseph Evans commented on YARN-1024: --- {quote}Sorry for the longwindedness.{quote} From what people have told me you still have a long way to go before you approach me for longwindedness :). My initial gut reaction is that only having two numbers to express the request seems too simplified, but the more I think about it the more I am OK with it, although I think I would change the numbers to be total YCUs requested and minimum YCUs per core. This gives the user better visibility into how the scheduler is treating these numbers so they can better reason about them. The total YCUs is the value used for scheduling. The minimum YCUs per core is compared to the maxComputeUnitsPerCore, as was suggested, to reject a request as not possible, or in the case of a heterogeneous environment to restrict the hosts that this container can run on. Although I am OK with the original proposal too. I would also like us to have a flag that would either limit the container to the requested CPU and let it have no more even when more is available, or would let it expand to use whatever CPU was free, but would be guaranteed to get at least the YCUs requested. This is likely something that would have to be done on a separate JIRA though. Without this I don't see a way to really get simplicity, predictability, or consistency. 1 MB of RAM is fairly simple to understand. It can be measured without too much of a problem just by running the process. Most users do a simple search for the correct value: run with the default; if it does not work, increase the amount and run again. 1 YCU is very complex to measure for an application. If I cannot restrict a container to never use more than what was requested I cannot consistently predict how long it will take to run later. Without this I don't know how to answer the question I know will come up: what should I set these values to? Define a virtual core unambigiously --- Key: YARN-1024 URL: https://issues.apache.org/jira/browse/YARN-1024 Project: Hadoop YARN Issue Type: Improvement Reporter: Arun C Murthy Assignee: Arun C Murthy We need to clearly define the meaning of a virtual core unambiguously so that it's easy to migrate applications between clusters. For e.g. here is Amazon EC2 definition of ECU: http://aws.amazon.com/ec2/faqs/#What_is_an_EC2_Compute_Unit_and_why_did_you_introduce_it Essentially we need to clearly define a YARN Virtual Core (YVC). Equivalently, we can use ECU itself: *One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.* -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
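To make the shape of that request concrete, here is a tiny hypothetical value object carrying the two numbers plus the hard/soft flag discussed in the comment above. None of these names exist in YARN; this only illustrates the proposal:
{code}
/** Hypothetical request object for the scheme described above - not a real YARN class. */
public class CpuRequest {
  private final int totalYcus;      // the value the scheduler actually schedules against
  private final int minYcusPerCore; // checked against a host's maxComputeUnitsPerCore
  private final boolean hardLimit;  // true: cap at totalYcus; false: may expand into idle CPU

  public CpuRequest(int totalYcus, int minYcusPerCore, boolean hardLimit) {
    this.totalYcus = totalYcus;
    this.minYcusPerCore = minYcusPerCore;
    this.hardLimit = hardLimit;
  }

  /** Reject requests that no host (or no host in a heterogeneous cluster) could satisfy. */
  public boolean isSatisfiableOn(int hostYcusPerCore) {
    return minYcusPerCore <= hostYcusPerCore;
  }

  public int getTotalYcus() { return totalYcus; }
  public int getMinYcusPerCore() { return minYcusPerCore; }
  public boolean isHardLimit() { return hardLimit; }
}
{code}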
[jira] [Commented] (YARN-624) Support gang scheduling in the AM RM protocol
[ https://issues.apache.org/jira/browse/YARN-624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13736894#comment-13736894 ] Robert Joseph Evans commented on YARN-624: -- From my perspective it does not really solve the problem for me. It comes close but is not perfect. I am interested in gang scheduling to support [storm on yarn|https://github.com/yahoo/storm-yarn/] The biggest issue I have with this design is knowing the size before the application is launched. The ultimate goal with storm is to have a system where multiple separate, but related, storm topologies are processing data using the same application. We would configure the queues so that if storm sees a spike in demand it can steal containers from batch processing to grow a topology and when the spike goes away it would release those containers back. If the number of containers changes dynamically, by both submitting new topologies and growing/shrinking existing ones, it is impossible to tell YARN what I need at the beginning. Gang scheduling is interesting for me because there is a specific number of containers that each topology is configured to need when that topology is launched. Without all of those containers there is no reason to launch a single part of the topology. I can see this happening with a modification to your approach where the all or nothing happens when the AM submits a request, and not when the AM is submitted. I also have a hard time seeing how this would work well with other advanced features like preemption. For preemption to work well with gang scheduling it needs to take into account that if it shoots a container in a gang of containers it is likely going to get back a lot more resources than just one container. If it is aware of this then it can still shoot the container, but avoid shooting other containers needlessly because it knows what it is going to get back. Support gang scheduling in the AM RM protocol - Key: YARN-624 URL: https://issues.apache.org/jira/browse/YARN-624 Project: Hadoop YARN Issue Type: Sub-task Components: api, scheduler Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Per discussion on YARN-392 and elsewhere, gang scheduling, in which a scheduler runs a set of tasks when they can all be run at the same time, would be a useful feature for YARN schedulers to support. Currently, AMs can approximate this by holding on to containers until they get all the ones they need. However, this lends itself to deadlocks when different AMs are waiting on the same containers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
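A rough sketch of what an all-or-nothing request made by a running AM (rather than one fixed at application submission) might carry. Every name here is hypothetical; the point is only that the gang size and membership travel with the request, so a preemption policy can reason about how much it frees by shooting one member:
{code}
import java.util.List;

/** Hypothetical gang request issued by a running AM - not an existing YARN API. */
public class GangRequest {
  private final String gangId;               // lets the scheduler tie the containers together
  private final int containers;              // grant all of these at once, or none of them
  private final int memoryMbPerContainer;
  private final List<String> preferredRacks; // "keep them close" hint, not a hard constraint

  public GangRequest(String gangId, int containers, int memoryMbPerContainer,
      List<String> preferredRacks) {
    this.gangId = gangId;
    this.containers = containers;
    this.memoryMbPerContainer = memoryMbPerContainer;
    this.preferredRacks = preferredRacks;
  }

  /**
   * What a gang-aware preemption policy gains: if one member is shot the whole
   * gang's resources come back, so it should not shoot more members than needed.
   */
  public int memoryMbFreedIfOnePreempted() {
    return containers * memoryMbPerContainer;
  }
}
{code}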
[jira] [Commented] (YARN-1024) Define a virtual core unambigiously
[ https://issues.apache.org/jira/browse/YARN-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13736919#comment-13736919 ] Robert Joseph Evans commented on YARN-1024: --- Perhaps I am missing something here. The goals Arun has asked for are simplicity, predictability, and consistency. Simplicity I totally agree with, but I do not totally agree with always having predictability and consistency after simplicity, and I do not agree that they are always required. These two come with a trade-off with utilization, and this is something that Sandy brought up, although not directly. For HBase, guaranteed resources, in terms of both parallelism and raw CPU speed, are important because it is using those to provide a service where predictability and consistency are needed. If the HBase AM cannot truly express to YARN what it needs because of simplicity, HBase on YARN will not be used, because it will not behave the way users need/expect it to. Similarly, if HBase is allowed to steal resources from others, you can easily request too few resources on an underutilized cluster, and when the cluster is under load it falls apart. This is similar for me with my desire for Storm on YARN. I am happy to use a complex API to express my needs if it means that I get what I need. On the other hand, if I am doing MR batch processing most of the time (but not all of it) I am doing single threaded processing and I really just want it to fill in the gaps and use as much unused CPU as it can. Yes, some MR jobs have strict SLAs but most do not, and it is best if we can provide a scheduler that can balance both. I also don't agree that because YARN lacks the ability to schedule everything that impacts performance, including network and disk IO, we should skip doing CPU correctly. Some applications are truly CPU bound and they will benefit. For other resources we can add them to YARN as they are needed until we do meet the goal of predictability and consistency. Define a virtual core unambigiously --- Key: YARN-1024 URL: https://issues.apache.org/jira/browse/YARN-1024 Project: Hadoop YARN Issue Type: Improvement Reporter: Arun C Murthy Assignee: Arun C Murthy We need to clearly define the meaning of a virtual core unambiguously so that it's easy to migrate applications between clusters. For e.g. here is Amazon EC2 definition of ECU: http://aws.amazon.com/ec2/faqs/#What_is_an_EC2_Compute_Unit_and_why_did_you_introduce_it Essentially we need to clearly define a YARN Virtual Core (YVC). Equivalently, we can use ECU itself: *One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.* -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-896) Roll up for long lived YARN
[ https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734983#comment-13734983 ] Robert Joseph Evans commented on YARN-896: -- Sorry I have not responded sooner. I have been out on vacation and had a high severity issue that has consumed a lot of my time. [~lmccay] and [~thw] There are many different services that long lived processes need to communicate with. Many of these services use tokens and others may not. Each of these tokens or other credentials are specific to the services being accessed. In some cases like with HBase we probably can take advantage of the existing renewal feature in the RM. With other tokens or credentials it may be different, and may require AM specific support for them. I am not really that concerned with solving the renewal problem for all possible credentials here, although if we can solve this for a lot of common tokens at the same time that would be great. What I care most about is being sure that a long lived YARN application does not necessarily have to stop and restart because an HDFS token cannot be renewed any longer. If there are changes going into the HDFS security model that would make YARN-941 unnecessary that is great. I have not had much time to follow the security discussion so thank you for pointing this out. But it is also a question of time frames. YARN-941 and YARN-1041 would allow for secure, robust, long lived applications on YARN, and do not appear to be that difficult to accomplish. Do you know the time frame for the security rework? Roll up for long lived YARN --- Key: YARN-896 URL: https://issues.apache.org/jira/browse/YARN-896 Project: Hadoop YARN Issue Type: New Feature Reporter: Robert Joseph Evans YARN is intended to be general purpose, but it is missing some features to be able to truly support long lived applications and long lived containers. This ticket is intended to # discuss what is needed to support long lived processes # track the resulting JIRA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-896) Roll up for long lived YARN
[ https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713692#comment-13713692 ] Robert Joseph Evans commented on YARN-896: -- [~thw] I am not totally sure what you mean by app specific tokens. Are these tokens that the app is going to use to connect to other services like HBase, or is it something else? [~eric14] and [~enis] Rolling upgrades is a very interesting use case. We can definitely add in a ticket to support this type of thing. I agree that it needs to be thought through some, and is going to require help from both the AM and YARN to do it properly. Roll up for long lived YARN --- Key: YARN-896 URL: https://issues.apache.org/jira/browse/YARN-896 Project: Hadoop YARN Issue Type: New Feature Reporter: Robert Joseph Evans YARN is intended to be general purpose, but it is missing some features to be able to truly support long lived applications and long lived containers. This ticket is intended to # discuss what is needed to support long lived processes # track the resulting JIRA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (YARN-941) RM Should have a way to update the tokens it has for a running application
Robert Joseph Evans created YARN-941: Summary: RM Should have a way to update the tokens it has for a running application Key: YARN-941 URL: https://issues.apache.org/jira/browse/YARN-941 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans When an application is submitted to the RM it includes with it a set of tokens that the RM will renew on behalf of the application, that will be passed to the AM when the application is launched, and will be used when launching the application to access HDFS to download files on behalf of the application. For long lived applications/services these tokens can expire, and then the tokens that the AM has will be invalid, and the tokens that the RM had will also not work to launch a new AM. We need to provide an API that will allow the RM to replace the current tokens for this application with a new set. To avoid any real race issues, I think this API should be something that the AM calls, so that the client can connect to the AM with a new set of tokens it got using kerberos, then the AM can inform the RM of the new set of tokens and quickly update its tokens internally to use these new ones. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
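A sketch of the shape such an AM-driven call could take (the protocol and method names here are hypothetical; only Credentials is an existing Hadoop class). The flow described above would be: the client fetches fresh tokens using its kerberos credentials, hands them to the AM, and the AM pushes them to the RM:
{code}
import org.apache.hadoop.security.Credentials;

/** Hypothetical AM-to-RM interface for replacing an application's tokens - not an existing protocol. */
public interface ApplicationTokenUpdateProtocol {
  /**
   * Replace the tokens the RM currently holds (and renews) for this application
   * with the new set the AM just received from a kerberos-authenticated client.
   */
  void updateApplicationTokens(String applicationId, Credentials newTokens);
}
{code}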
[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application
[ https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713961#comment-13713961 ] Robert Joseph Evans commented on YARN-941: -- I am punting on how/if we get the new HDFS token to NMs to be used for log aggregation. We need to think a bit more about how logs should be handled for long lived services before we spend a lot of time trying to make log aggregation work. RM Should have a way to update the tokens it has for a running application -- Key: YARN-941 URL: https://issues.apache.org/jira/browse/YARN-941 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans When an application is submitted to the RM it includes with it a set of tokens that the RM will renew on behalf of the application, that will be passed to the AM when the application is launched, and will be used when launching the application to access HDFS to download files on behalf of the application. For long lived applications/services these tokens can expire, and then the tokens that the AM has will be invalid, and the tokens that the RM had will also not work to launch a new AM. We need to provide an API that will allow the RM to replace the current tokens for this application with a new set. To avoid any real race issues, I think this API should be something that the AM calls, so that the client can connect to the AM with a new set of tokens it got using kerberos, then the AM can inform the RM of the new set of tokens and quickly update its tokens internally to use these new ones. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-896) Roll up for long lived YARN
[ https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713992#comment-13713992 ] Robert Joseph Evans commented on YARN-896: -- I filed one new JIRA for updating tokens in the RM YARN-941. I started to file a JIRA for the AM to be informed of the location of its already running containers, but as I was writing it I realized that it will not give us enough information to be able to reattach to the containers. The only thing it will give us is enough info to be able to go shoot the containers. Simply because there is no metadata about what port the container may be listening on or anything like that. It seems to me that we would be better off keeping a log, similar to the MR job history log, that has in it all the data the AM needs to look for running containers. If others see a different need for this API, I am still happy to file a JIRA for it. I have not filed a JIRA for anti-affinity yet either. I seem to remember another JIRA for something like this already, but I have not found it yet. I figure we can add in a long lived process flag for the scheduler when we run across a use case for it. The other parts discussed here, either already have a JIRA associated with the same functionality, or I think need a bit more discussion about exactly what we want to do. Namely log aggregation/processing and Hadoop package management/rolling upgrades of live applications. If I missed something please let me know. Roll up for long lived YARN --- Key: YARN-896 URL: https://issues.apache.org/jira/browse/YARN-896 Project: Hadoop YARN Issue Type: New Feature Reporter: Robert Joseph Evans YARN is intended to be general purpose, but it is missing some features to be able to truly support long lived applications and long lived containers. This ticket is intended to # discuss what is needed to support long lived processes # track the resulting JIRA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-896) Roll up for long lived YARN
[ https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13704623#comment-13704623 ] Robert Joseph Evans commented on YARN-896: -- Chris, Yes I missed the app master retry issue. Those two with the discussion on them seem to cover what we are looking for. Roll up for long lived YARN --- Key: YARN-896 URL: https://issues.apache.org/jira/browse/YARN-896 Project: Hadoop YARN Issue Type: New Feature Reporter: Robert Joseph Evans YARN is intended to be general purpose, but it is missing some features to be able to truly support long lived applications and long lived containers. This ticket is intended to # discuss what is needed to support long lived processes # track the resulting JIRA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-896) Roll up for long lived YARN
[ https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13703622#comment-13703622 ] Robert Joseph Evans commented on YARN-896: -- No comments in the past few days. I would like to hear from more people involved, even if it is just to say that it looks like we have everything covered here. Then we can start filing JIRAs and getting some work done. Roll up for long lived YARN --- Key: YARN-896 URL: https://issues.apache.org/jira/browse/YARN-896 Project: Hadoop YARN Issue Type: New Feature Reporter: Robert Joseph Evans YARN is intended to be general purpose, but it is missing some features to be able to truly support long lived applications and long lived containers. This ticket is intended to # discuss what is needed to support long lived processes # track the resulting JIRA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-896) Roll up for long lived YARN
[ https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698500#comment-13698500 ] Robert Joseph Evans commented on YARN-896: -- During the most recent Hadoop Summit there was a developer meetup where we discussed some of these issues. This is to summarize what was discussed at that meeting and to add in a few things that have also been discussed on mailing lists and other places. HDFS delegation tokens have a maximum lifetime. Currently tokens submitted to the RM when the app master is launched will be renewed by the RM until the application finishes and the logs from the application have finished aggregating. The only token currently used by the YARN framework is the HDFS delegation token. This is used to read files from HDFS as part of the distributed cache and to write the aggregated logs out to HDFS. In order to support relaunching an app master after the maximum lifetime of the HDFS delegation token has passed, we either need to allow for tokens that do not expire or provide an API to allow the RM to replace the old token with a new one. Because removing the maximum lifetime of a token reduces the security of the cluster as a whole I think it would be better to provide an API to replace the token with a new one. If we want to continue supporting log aggregation we also need to provide a way for the Node Managers to get the new token too. It is assumed that each app master will also provide an API to get the new token so it can start using it. Log aggregation is another issue, although not required for long lived applications to work. Logs are aggregated into HDFS when the application finishes. This is not really that useful for applications that are never intended to exit. Ideally the processing of logs by the node manager should be pluggable so that clusters and applications can select how and when logs are processed/displayed to the end user. Because many of these systems roll their logs to avoid filling up disks we will probably need a protocol of some sort for the container to communicate with the Node Manager when logs are ready to be processed. Another issue is to allow containers to outlive the app master that launched them and also to allow containers to outlive the node manager that launched them. This is especially critical for the stability of applications during rolling upgrades to YARN. Roll up for long lived YARN --- Key: YARN-896 URL: https://issues.apache.org/jira/browse/YARN-896 Project: Hadoop YARN Issue Type: New Feature Reporter: Robert Joseph Evans YARN is intended to be general purpose, but it is missing some features to be able to truly support long lived applications and long lived containers. This ticket is intended to # discuss what is needed to support long lived processes # track the resulting JIRA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-896) Roll up for long lived YARN
[ https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698505#comment-13698505 ] Robert Joseph Evans commented on YARN-896: -- Another issue that has been discussed in the past is the impact that long lived processes can have on resource scheduling. It is possible for a long lived process to grab lots of resources and then never release them even though it is using more resources than it would be allowed to have when the cluster is full. Recent preemption changes should be able to prevent this from happening between different queues/pools, but we may need to think if we need more control about this within a queue. Roll up for long lived YARN --- Key: YARN-896 URL: https://issues.apache.org/jira/browse/YARN-896 Project: Hadoop YARN Issue Type: New Feature Reporter: Robert Joseph Evans YARN is intended to be general purpose, but it is missing some features to be able to truly support long lived applications and long lived containers. This ticket is intended to # discuss what is needed to support long lived processes # track the resulting JIRA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-624) Support gang scheduling in the AM RM protocol
[ https://issues.apache.org/jira/browse/YARN-624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13662101#comment-13662101 ] Robert Joseph Evans commented on YARN-624: -- I would love to have it right now for storm too. If you want me to sign up as a use case I am happy to. Support gang scheduling in the AM RM protocol - Key: YARN-624 URL: https://issues.apache.org/jira/browse/YARN-624 Project: Hadoop YARN Issue Type: Sub-task Components: api, scheduler Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Per discussion on YARN-392 and elsewhere, gang scheduling, in which a scheduler runs a set of tasks when they can all be run at the same time, would be a useful feature for YARN schedulers to support. Currently, AMs can approximate this by holding on to containers until they get all the ones they need. However, this lends itself to deadlocks when different AMs are waiting on the same containers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-690) RM exits on token cancel/renew problems
[ https://issues.apache.org/jira/browse/YARN-690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13662105#comment-13662105 ] Robert Joseph Evans commented on YARN-690: -- Vinod, Yes creating and resolving a JIRA in 2 hours is not ideal, but this is a Blocker that consisted of only a handful of lines of change, and the bylaws explicitly state that a waiting period is not needed for this vote because committers can retroactively -1 and pull the change out. I agree that waiting to let others look at the code is good and if it were not a Blocker I would have waited. RM exits on token cancel/renew problems --- Key: YARN-690 URL: https://issues.apache.org/jira/browse/YARN-690 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0, 0.23.7, 2.0.5-beta Reporter: Daryn Sharp Assignee: Daryn Sharp Priority: Blocker Fix For: 3.0.0, 2.0.5-beta, 0.23.8 Attachments: YARN-690.patch, YARN-690.patch The DelegationTokenRenewer thread is critical to the RM. When a non-IOException occurs, the thread calls System.exit to prevent the RM from running w/o the thread. It should be exiting only on non-RuntimeExceptions. The problem is especially bad in 23 because the yarn protobuf layer converts IOExceptions into UndeclaredThrowableExceptions (RuntimeException) which causes the renewer to abort the process. An UnknownHostException takes down the RM... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-624) Support gang scheduling in the AM RM protocol
[ https://issues.apache.org/jira/browse/YARN-624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13662352#comment-13662352 ] Robert Joseph Evans commented on YARN-624: -- Storm is a real-time stream processing system. We are working on porting this to run on YARN. Storm will process one or more streams of data using a logical DAG of processing nodes called a topology. This topology runs in spawned processes. If there are not enough processes to run a topology there is no point in launching any of the processes. Hence the need for gang scheduling. It is a very simple gang scheduling use case currently. When a new topology is submitted we want to request enough resources to run that topology. If a node goes down, we are going to request enough resources to replace it, so we can get up and running again ASAP. When a topology is killed we want to release those resources. Long term we would like to make sure that the different containers are close to each other from a network topology perspective. We don't care which node or rack the containers are on, but we do care that they are all on the same node/rack as the other containers. Support gang scheduling in the AM RM protocol - Key: YARN-624 URL: https://issues.apache.org/jira/browse/YARN-624 Project: Hadoop YARN Issue Type: Sub-task Components: api, scheduler Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Per discussion on YARN-392 and elsewhere, gang scheduling, in which a scheduler runs a set of tasks when they can all be run at the same time, would be a useful feature for YARN schedulers to support. Currently, AMs can approximate this by holding on to containers until they get all the ones they need. However, this lends itself to deadlocks when different AMs are waiting on the same containers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-690) RM exits on token cancel/renew problems
[ https://issues.apache.org/jira/browse/YARN-690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13660001#comment-13660001 ] Robert Joseph Evans commented on YARN-690: -- I don't think this does what you want. Now an IOException will cause the same issue. I think you need to handle runtime and IOException separately. RM exits on token cancel/renew problems --- Key: YARN-690 URL: https://issues.apache.org/jira/browse/YARN-690 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0, 0.23.7, 2.0.5-beta Reporter: Daryn Sharp Assignee: Daryn Sharp Priority: Blocker Attachments: YARN-690.patch The DelegationTokenRenewer thread is critical to the RM. When a non-IOException occurs, the thread calls System.exit to prevent the RM from running w/o the thread. It should be exiting only on non-RuntimeExceptions. The problem is especially bad in 23 because the yarn protobuf layer converts IOExceptions into UndeclaredThrowableExceptions (RuntimeException) which causes the renewer to abort the process. An UnknownHostException takes down the RM... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
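In sketch form, the distinction being asked for (this is not the actual DelegationTokenRenewer code): an IOException from a renew/cancel should be handled as a per-token failure, and only a genuinely unexpected RuntimeException should be allowed to reach the fatal path:
{code}
import java.io.IOException;

/** Sketch of the requested control flow - not the actual DelegationTokenRenewer code. */
public class RenewLoopSketch {
  interface RenewAction { void renew() throws IOException; }

  void renewOne(RenewAction action) {
    try {
      action.renew();
    } catch (IOException e) {
      // Expected failure mode (e.g. UnknownHostException): log it and keep the
      // renewer thread - and the RM - alive.
      System.err.println("Token renew failed, will retry later: " + e);
    } catch (RuntimeException e) {
      // Genuinely unexpected programming error: let it propagate so the fatal
      // handling applies only to these, not to remote/IO problems.
      throw e;
    }
  }
}
{code}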
[jira] [Commented] (YARN-690) RM exits on token cancel/renew problems
[ https://issues.apache.org/jira/browse/YARN-690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13660035#comment-13660035 ] Robert Joseph Evans commented on YARN-690: -- The change looks fine to me now. +1 RM exits on token cancel/renew problems --- Key: YARN-690 URL: https://issues.apache.org/jira/browse/YARN-690 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0, 0.23.7, 2.0.5-beta Reporter: Daryn Sharp Assignee: Daryn Sharp Priority: Blocker Attachments: YARN-690.patch, YARN-690.patch The DelegationTokenRenewer thread is critical to the RM. When a non-IOException occurs, the thread calls System.exit to prevent the RM from running w/o the thread. It should be exiting only on non-RuntimeExceptions. The problem is especially bad in 23 because the yarn protobuf layer converts IOExceptions into UndeclaredThrowableExceptions (RuntimeException) which causes the renewer to abort the process. An UnknownHostException takes down the RM... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-528) Make IDs read only
[ https://issues.apache.org/jira/browse/YARN-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13645991#comment-13645991 ] Robert Joseph Evans commented on YARN-528: -- The approach seems OK to me, but I would rather have the impl be an even thinner wrapper. {code} private ApplicationIdProto proto = null; private ApplicationIdProto.Builder builder = null; ApplicationIdPBImpl(ApplicationIdProto proto) { this.proto = proto; } public ApplicationIdPBImpl() { this.builder = ApplicationIdProto.newBuilder(); } public ApplicationIdProto getProto() { assert (proto != null); return proto; } @Override public int getId() { assert (proto != null); return proto.getId(); } @Override protected void setId(int id) { assert (builder != null); builder.setId((id)); } @Override public long getClusterTimestamp() { assert(proto != null); return proto.getClusterTimestamp(); } @Override protected void setClusterTimestamp(long clusterTimestamp) { assert(builder != null); builder.setClusterTimestamp((clusterTimestamp)); } @Override protected void build() { assert(builder != null); proto = builder.build(); builder = null; } {code} Make IDs read only -- Key: YARN-528 URL: https://issues.apache.org/jira/browse/YARN-528 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans Assignee: Robert Joseph Evans Attachments: y528_AppIdPart_01_Refactor.txt, y528_AppIdPart_02_AppIdChanges.txt, y528_AppIdPart_03_fixUsage.txt, y528_ApplicationIdComplete_WIP.txt, YARN-528.txt, YARN-528.txt I really would like to rip out most if not all of the abstraction layer that sits in-between Protocol Buffers, the RPC, and the actual user code. We have no plans to support any other serialization type, and the abstraction layer just, makes it more difficult to change protocols, makes changing them more error prone, and slows down the objects themselves. Completely doing that is a lot of work. This JIRA is a first step towards that. It makes the various ID objects immutable. If this patch is wel received I will try to go through other objects/classes of objects and update them in a similar way. This is probably the last time we will be able to make a change like this before 2.0 stabilizes and YARN APIs will not be able to be changed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-528) Make IDs read only
[ https://issues.apache.org/jira/browse/YARN-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13644784#comment-13644784 ] Robert Joseph Evans commented on YARN-528: -- Thanks for doing this Sid. I started pulling on the string and there was just too much involved, so I had to stop. Make IDs read only -- Key: YARN-528 URL: https://issues.apache.org/jira/browse/YARN-528 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans Assignee: Robert Joseph Evans Attachments: y528_AppIdPart_01_Refactor.txt, y528_AppIdPart_02_AppIdChanges.txt, y528_AppIdPart_03_fixUsage.txt, y528_ApplicationIdComplete_WIP.txt, YARN-528.txt, YARN-528.txt I really would like to rip out most if not all of the abstraction layer that sits in-between Protocol Buffers, the RPC, and the actual user code. We have no plans to support any other serialization type, and the abstraction layer just, makes it more difficult to change protocols, makes changing them more error prone, and slows down the objects themselves. Completely doing that is a lot of work. This JIRA is a first step towards that. It makes the various ID objects immutable. If this patch is wel received I will try to go through other objects/classes of objects and update them in a similar way. This is probably the last time we will be able to make a change like this before 2.0 stabilizes and YARN APIs will not be able to be changed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-528) Make IDs read only
[ https://issues.apache.org/jira/browse/YARN-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13621023#comment-13621023 ] Robert Joseph Evans commented on YARN-528: -- OK, I understand now. I will try to find some time to play around with getting the AM ID to not have a wrapper at all. Make IDs read only -- Key: YARN-528 URL: https://issues.apache.org/jira/browse/YARN-528 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans Assignee: Robert Joseph Evans Attachments: YARN-528.txt, YARN-528.txt I really would like to rip out most if not all of the abstraction layer that sits in-between Protocol Buffers, the RPC, and the actual user code. We have no plans to support any other serialization type, and the abstraction layer just, makes it more difficult to change protocols, makes changing them more error prone, and slows down the objects themselves. Completely doing that is a lot of work. This JIRA is a first step towards that. It makes the various ID objects immutable. If this patch is wel received I will try to go through other objects/classes of objects and update them in a similar way. This is probably the last time we will be able to make a change like this before 2.0 stabilizes and YARN APIs will not be able to be changed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (YARN-528) Make IDs read only
Robert Joseph Evans created YARN-528: Summary: Make IDs read only Key: YARN-528 URL: https://issues.apache.org/jira/browse/YARN-528 Project: Hadoop YARN Issue Type: Improvement Reporter: Robert Joseph Evans I really would like to rip out most if not all of the abstraction layer that sits in-between Protocol Buffers, the RPC, and the actual user code. We have no plans to support any other serialization type, and the abstraction layer just, makes it more difficult to change protocols, makes changing them more error prone, and slows down the objects themselves. Completely doing that is a lot of work. This JIRA is a first step towards that. It makes the various ID objects immutable. If this patch is wel received I will try to go through other objects/classes of objects and update them in a similar way. This is probably the last time we will be able to make a change like this before 2.0 stabilizes and YARN APIs will not be able to be changed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-528) Make IDs read only
[ https://issues.apache.org/jira/browse/YARN-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated YARN-528: - Attachment: YARN-528.txt This patch contains changes to both Map/Reduce IDs as well as YARN APIs. I don't really want to split them up right now, but I am happy to file a separate JIRA for tracking purposes if the community decides this is a direction we want to go in. Make IDs read only -- Key: YARN-528 URL: https://issues.apache.org/jira/browse/YARN-528 Project: Hadoop YARN Issue Type: Improvement Reporter: Robert Joseph Evans Attachments: YARN-528.txt I really would like to rip out most if not all of the abstraction layer that sits in-between Protocol Buffers, the RPC, and the actual user code. We have no plans to support any other serialization type, and the abstraction layer just, makes it more difficult to change protocols, makes changing them more error prone, and slows down the objects themselves. Completely doing that is a lot of work. This JIRA is a first step towards that. It makes the various ID objects immutable. If this patch is wel received I will try to go through other objects/classes of objects and update them in a similar way. This is probably the last time we will be able to make a change like this before 2.0 stabilizes and YARN APIs will not be able to be changed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (YARN-528) Make IDs read only
[ https://issues.apache.org/jira/browse/YARN-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans reassigned YARN-528: Assignee: Robert Joseph Evans Make IDs read only -- Key: YARN-528 URL: https://issues.apache.org/jira/browse/YARN-528 Project: Hadoop YARN Issue Type: Improvement Reporter: Robert Joseph Evans Assignee: Robert Joseph Evans Attachments: YARN-528.txt I really would like to rip out most if not all of the abstraction layer that sits in-between Protocol Buffers, the RPC, and the actual user code. We have no plans to support any other serialization type, and the abstraction layer just, makes it more difficult to change protocols, makes changing them more error prone, and slows down the objects themselves. Completely doing that is a lot of work. This JIRA is a first step towards that. It makes the various ID objects immutable. If this patch is wel received I will try to go through other objects/classes of objects and update them in a similar way. This is probably the last time we will be able to make a change like this before 2.0 stabilizes and YARN APIs will not be able to be changed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-528) Make IDs read only
[ https://issues.apache.org/jira/browse/YARN-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13619911#comment-13619911 ] Robert Joseph Evans commented on YARN-528: -- The build failed, because it needs to be upmerged, again :( Make IDs read only -- Key: YARN-528 URL: https://issues.apache.org/jira/browse/YARN-528 Project: Hadoop YARN Issue Type: Improvement Reporter: Robert Joseph Evans Assignee: Robert Joseph Evans Attachments: YARN-528.txt I really would like to rip out most if not all of the abstraction layer that sits in-between Protocol Buffers, the RPC, and the actual user code. We have no plans to support any other serialization type, and the abstraction layer just, makes it more difficult to change protocols, makes changing them more error prone, and slows down the objects themselves. Completely doing that is a lot of work. This JIRA is a first step towards that. It makes the various ID objects immutable. If this patch is wel received I will try to go through other objects/classes of objects and update them in a similar way. This is probably the last time we will be able to make a change like this before 2.0 stabilizes and YARN APIs will not be able to be changed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-528) Make IDs read only
[ https://issues.apache.org/jira/browse/YARN-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated YARN-528: - Attachment: YARN-528.txt Upmerged Make IDs read only -- Key: YARN-528 URL: https://issues.apache.org/jira/browse/YARN-528 Project: Hadoop YARN Issue Type: Improvement Reporter: Robert Joseph Evans Assignee: Robert Joseph Evans Attachments: YARN-528.txt, YARN-528.txt I really would like to rip out most if not all of the abstraction layer that sits in-between Protocol Buffers, the RPC, and the actual user code. We have no plans to support any other serialization type, and the abstraction layer just, makes it more difficult to change protocols, makes changing them more error prone, and slows down the objects themselves. Completely doing that is a lot of work. This JIRA is a first step towards that. It makes the various ID objects immutable. If this patch is wel received I will try to go through other objects/classes of objects and update them in a similar way. This is probably the last time we will be able to make a change like this before 2.0 stabilizes and YARN APIs will not be able to be changed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-528) Make IDs read only
[ https://issues.apache.org/jira/browse/YARN-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13620326#comment-13620326 ] Robert Joseph Evans commented on YARN-528: -- I am fine with splitting the MR changes from the YARN change. Like I said, I put this out here more as a question of how we want to go about implementing these changes, and the test was more of a prototype example. I personally lean more towards using the *Proto classes directly. Why have something else wrapping it if we don't need it, even if it is a small and simple layer. The only reason I did not go that route here is because of toString(). With the IDs we rely on having ID.toString() turn into something very specific that can be parsed and turned back into an instance of the object. If I had the time I would trace down all places where we call toString on them and replace it with something else. I may just scale back the scope of the patch to look at ApplicationID to begin with and try to see if I can accomplish this. bq. Wrapping the object which came over the wire - with a goal of creating fewer objects. I really don't understand how this is supposed to work. How do we create fewer objects by wrapping them in more objects? I can see us doing something like deduping the objects that come over the wire, but I don't see how wrapping works here. Make IDs read only -- Key: YARN-528 URL: https://issues.apache.org/jira/browse/YARN-528 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans Assignee: Robert Joseph Evans Attachments: YARN-528.txt, YARN-528.txt I really would like to rip out most if not all of the abstraction layer that sits in-between Protocol Buffers, the RPC, and the actual user code. We have no plans to support any other serialization type, and the abstraction layer just, makes it more difficult to change protocols, makes changing them more error prone, and slows down the objects themselves. Completely doing that is a lot of work. This JIRA is a first step towards that. It makes the various ID objects immutable. If this patch is wel received I will try to go through other objects/classes of objects and update them in a similar way. This is probably the last time we will be able to make a change like this before 2.0 stabilizes and YARN APIs will not be able to be changed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
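For context, the round trip the comment above is worried about looks roughly like the sketch below. The helper class is hypothetical; it only illustrates why a raw generated *Proto class, whose toString() is the protobuf text format, would break callers that parse the current "application_<clusterTimestamp>_<id>" form:
{code}
/** Hypothetical helper illustrating the toString()/parse contract discussed above. */
public final class AppIdStrings {
  /** e.g. "application_1371141151815_0003" */
  static String toStringForm(long clusterTimestamp, int id) {
    return String.format("application_%d_%04d", clusterTimestamp, id);
  }

  /** Parse back into (clusterTimestamp, id); fails on anything else. */
  static long[] parse(String s) {
    String[] parts = s.split("_");
    if (parts.length != 3 || !"application".equals(parts[0])) {
      throw new IllegalArgumentException("Not an application id: " + s);
    }
    return new long[] { Long.parseLong(parts[1]), Long.parseLong(parts[2]) };
  }
}
{code}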
[jira] [Commented] (YARN-515) Node Manager not getting the master key
[ https://issues.apache.org/jira/browse/YARN-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13618784#comment-13618784 ] Robert Joseph Evans commented on YARN-515: -- Having people always test the patches in secure mode I think is a bit too high of a barrier for some. I personally hate having to get it all set up to be able to test a patch. Registration responses in general were broken. The NM would never get a reboot signal either. It was always the default enum value of everything is fine. I am just glad that we caught it. Node Manager not getting the master key --- Key: YARN-515 URL: https://issues.apache.org/jira/browse/YARN-515 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.0.4-alpha Reporter: Robert Joseph Evans Assignee: Robert Joseph Evans Priority: Blocker Fix For: 2.0.5-beta Attachments: YARN-515.txt On branch-2 the latest version I see the following on a secure cluster. {noformat} 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Security enabled - updating secret keys now 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as RM:PORT with total resource of me mory:12288, vCores:16 2013-03-28 19:21:06,244 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is started. 2013-03-28 19:21:06,245 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started. 2013-03-28 19:21:07,257 [Node Status Updater] ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught exception in status-updater java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:121) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:407) {noformat} The Null pointer exception just keeps repeating and all of the nodes end up being lost. It looks like it never gets the secret key when it registers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617367#comment-13617367 ] Robert Joseph Evans commented on YARN-112: -- Vinod, I just glanced at the latest patch, I did not read it in detail, so if you say it covers that case I trust you. Race in localization can cause containers to fail - Key: YARN-112 URL: https://issues.apache.org/jira/browse/YARN-112 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.3 Reporter: Jason Lowe Assignee: Omkar Vinit Joshi Attachments: yarn-112-20130325.1.patch, yarn-112-20130325.patch, yarn-112-20130326.patch, yarn-112.20131503.patch On one of our 0.23 clusters, I saw a case of two containers, corresponding to two map tasks of a MR job, that were launched almost simultaneously on the same node. It appears they both tried to localize job.jar and job.xml at the same time. One of the containers failed when it couldn't rename the temporary job.jar directory to its final name because the target directory wasn't empty. Shortly afterwards the second container failed because job.xml could not be found, presumably because the first container removed it when it cleaned up. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-515) Node Manager not getting the master key
[ https://issues.apache.org/jira/browse/YARN-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617373#comment-13617373 ] Robert Joseph Evans commented on YARN-515: -- This issue appears to be caused by a bug in RegisterNodeManagerResponsePBImpl. I think specifically it was caused by YARN-440. I have a unit test that can reproduce it. Sid reviewed YARN-440 and he is a really smart guy. I looked at it thinking that it must be the cause of the issue and I didn't see anything in there that was off. I just think all this extra code to try and wrap the protocol buffers is just a bad idea. It makes it difficult to change a .proto file, and it just slows things down. But it is a lot of work to change it so I am done with my rant now, I'll go find a fix for the issue. Node Manager not getting the master key --- Key: YARN-515 URL: https://issues.apache.org/jira/browse/YARN-515 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.0.4-alpha Reporter: Robert Joseph Evans Priority: Blocker On branch-2 the latest version I see the following on a secure cluster. {noformat} 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Security enabled - updating secret keys now 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as RM:PORT with total resource of memory:12288, vCores:16 2013-03-28 19:21:06,244 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is started. 2013-03-28 19:21:06,245 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started. 2013-03-28 19:21:07,257 [Node Status Updater] ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught exception in status-updater java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:121) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:407) {noformat} The Null pointer exception just keeps repeating and all of the nodes end up being lost. It looks like it never gets the secret key when it registers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-515) Node Manager not getting the master key
[ https://issues.apache.org/jira/browse/YARN-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617379#comment-13617379 ] Robert Joseph Evans commented on YARN-515: -- Yes the issue is that there is a rebuild flag in the PBImpl that is never set to true, so it will never rebuild the proto. Node Manager not getting the master key --- Key: YARN-515 URL: https://issues.apache.org/jira/browse/YARN-515 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.0.4-alpha Reporter: Robert Joseph Evans Priority: Blocker On branch-2 the latest version I see the following on a secure cluster. {noformat} 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Security enabled - updating secret keys now 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as RM:PORT with total resource of me mory:12288, vCores:16 2013-03-28 19:21:06,244 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is started. 2013-03-28 19:21:06,245 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started. 2013-03-28 19:21:07,257 [Node Status Updater] ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught exception in status-updater java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:121) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:407) {noformat} The Null pointer exception just keeps repeating and all of the nodes end up being lost. It looks like it never gets the secret key when it registers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-515) Node Manager not getting the master key
[ https://issues.apache.org/jira/browse/YARN-515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated YARN-515: - Attachment: YARN-515.txt This should fix the issue. We forgot to tell the wrapper to rebuild after setting some values. There is a unit test included that shows the problem. Node Manager not getting the master key --- Key: YARN-515 URL: https://issues.apache.org/jira/browse/YARN-515 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.0.4-alpha Reporter: Robert Joseph Evans Priority: Blocker Attachments: YARN-515.txt On branch-2 the latest version I see the following on a secure cluster. {noformat} 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Security enabled - updating secret keys now 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as RM:PORT with total resource of me mory:12288, vCores:16 2013-03-28 19:21:06,244 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is started. 2013-03-28 19:21:06,245 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started. 2013-03-28 19:21:07,257 [Node Status Updater] ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught exception in status-updater java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:121) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:407) {noformat} The Null pointer exception just keeps repeating and all of the nodes end up being lost. It looks like it never gets the secret key when it registers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
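For context, the failure mode being fixed here is the common record-wrapper pattern where a setter updates the builder but never flips the flag that forces getProto() to rebuild. The sketch below uses invented names (SketchResponsePBImpl, setKey) and a plain String in place of the protobuf message, so it only illustrates the bug class, not the actual RegisterNodeManagerResponsePBImpl patch.
{code}
// Illustration only: invented names, and a String stands in for the protobuf message.
public final class SketchResponsePBImpl {
  private String protoSnapshot = "";          // stand-in for the built proto
  private final StringBuilder builder = new StringBuilder();
  private boolean viaProto = true;            // true = protoSnapshot is current

  public void setKey(String key) {
    builder.setLength(0);
    builder.append(key);
    viaProto = false;                         // the flip that was effectively missing
  }

  public String getProto() {
    if (!viaProto) {                          // rebuild only when a setter has run
      protoSnapshot = builder.toString();
      viaProto = true;
    }
    return protoSnapshot;                     // without the flip, this stays empty forever
  }
}
{code}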
[jira] [Assigned] (YARN-515) Node Manager not getting the master key
[ https://issues.apache.org/jira/browse/YARN-515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans reassigned YARN-515: Assignee: Robert Joseph Evans Node Manager not getting the master key --- Key: YARN-515 URL: https://issues.apache.org/jira/browse/YARN-515 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.0.4-alpha Reporter: Robert Joseph Evans Assignee: Robert Joseph Evans Priority: Blocker Attachments: YARN-515.txt On branch-2 the latest version I see the following on a secure cluster. {noformat} 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Security enabled - updating secret keys now 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as RM:PORT with total resource of me mory:12288, vCores:16 2013-03-28 19:21:06,244 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is started. 2013-03-28 19:21:06,245 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started. 2013-03-28 19:21:07,257 [Node Status Updater] ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught exception in status-updater java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:121) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:407) {noformat} The Null pointer exception just keeps repeating and all of the nodes end up being lost. It looks like it never gets the secret key when it registers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616325#comment-13616325 ] Robert Joseph Evans commented on YARN-112: -- I agree that scale exposes races, but still the underlying problem is that we want to create a new unique directory. This seems very simple.
{code}
File uniqueDir = null;
do {
  uniqueDir = new File(baseDir, String.valueOf(rand.nextLong()));
} while (!uniqueDir.mkdir());
{code}
I don't see why we are going through all of this complexity, simply because a FileContext API is broken. Playing games to make the race less likely is fine. But ultimately we still have to handle the race. Race in localization can cause containers to fail - Key: YARN-112 URL: https://issues.apache.org/jira/browse/YARN-112 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.3 Reporter: Jason Lowe Assignee: Omkar Vinit Joshi Attachments: yarn-112-20130325.1.patch, yarn-112-20130325.patch, yarn-112-20130326.patch, yarn-112.20131503.patch On one of our 0.23 clusters, I saw a case of two containers, corresponding to two map tasks of a MR job, that were launched almost simultaneously on the same node. It appears they both tried to localize job.jar and job.xml at the same time. One of the containers failed when it couldn't rename the temporary job.jar directory to its final name because the target directory wasn't empty. Shortly afterwards the second container failed because job.xml could not be found, presumably because the first container removed it when it cleaned up. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616327#comment-13616327 ] Robert Joseph Evans commented on YARN-112: -- Oh and the latest patch using a unique number will not always work, because the same code is used from different processes on the same box. We would have to have a way to guarantee uniqueness between the different processes. CurrentTimeMillis helps but still could result in a race. Race in localization can cause containers to fail - Key: YARN-112 URL: https://issues.apache.org/jira/browse/YARN-112 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.3 Reporter: Jason Lowe Assignee: Omkar Vinit Joshi Attachments: yarn-112-20130325.1.patch, yarn-112-20130325.patch, yarn-112-20130326.patch, yarn-112.20131503.patch On one of our 0.23 clusters, I saw a case of two containers, corresponding to two map tasks of a MR job, that were launched almost simultaneously on the same node. It appears they both tried to localize job.jar and job.xml at the same time. One of the containers failed when it couldn't rename the temporary job.jar directory to its final name because the target directory wasn't empty. Shortly afterwards the second container failed because job.xml could not be found, presumably because the first container removed it when it cleaned up. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
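One hedged sketch of handling cross-process collisions, assuming the mkdir-retry approach suggested a couple of comments above: because File.mkdir() is atomic at the filesystem level, two processes that pick the same name simply race on mkdir() and the loser retries with a new name. The class and method names are illustrative, not existing NodeManager code.
{code}
import java.io.File;
import java.io.IOException;
import java.security.SecureRandom;

public final class UniqueDirs {
  private static final SecureRandom RAND = new SecureRandom();

  public static File createUniqueDir(File baseDir) throws IOException {
    for (int attempt = 0; attempt < 100; attempt++) {
      File candidate = new File(baseDir, Long.toHexString(RAND.nextLong()));
      if (candidate.mkdir()) {
        return candidate;        // only one process can win this name
      }
      // collision with another process (or thread): draw a new name and retry
    }
    throw new IOException("Could not create a unique directory under " + baseDir);
  }
}
{code}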
[jira] [Created] (YARN-515) Node Manager not getting the master key
Robert Joseph Evans created YARN-515: Summary: Node Manager not getting the master key Key: YARN-515 URL: https://issues.apache.org/jira/browse/YARN-515 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.0.4-alpha Reporter: Robert Joseph Evans Priority: Blocker On branch-2 the latest version I see the following on a secure cluster. {noformat} 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Security enabled - updating secret keys now 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as RM:PORT with total resource of me mory:12288, vCores:16 2013-03-28 19:21:06,244 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is started. 2013-03-28 19:21:06,245 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started. 2013-03-28 19:21:07,257 [Node Status Updater] ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught exception in status-updater java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:121) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:407) {noformat} The Null pointer exception just keeps repeating and all of the nodes end up being lost. It looks like it never gets the secret key when it registers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-515) Node Manager not getting the master key
[ https://issues.apache.org/jira/browse/YARN-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616628#comment-13616628 ] Robert Joseph Evans commented on YARN-515: -- OK It actually looks like the NM is trying to get the Master Key, before it ever has set it, which is causing the NPE. Node Manager not getting the master key --- Key: YARN-515 URL: https://issues.apache.org/jira/browse/YARN-515 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.0.4-alpha Reporter: Robert Joseph Evans Priority: Blocker On branch-2 the latest version I see the following on a secure cluster. {noformat} 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Security enabled - updating secret keys now 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as RM:PORT with total resource of me mory:12288, vCores:16 2013-03-28 19:21:06,244 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is started. 2013-03-28 19:21:06,245 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started. 2013-03-28 19:21:07,257 [Node Status Updater] ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught exception in status-updater java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:121) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:407) {noformat} The Null pointer exception just keeps repeating and all of the nodes end up being lost. It looks like it never gets the secret key when it registers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
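Independent of the root cause, the NPE itself comes from reading a key that was never set. A minimal, hypothetical guard is sketched below (invented names, not the real BaseContainerTokenSecretManager); it would at least turn the silent NPE loop into a clear error message.
{code}
public final class SketchKeyHolder {
  private volatile byte[] currentMasterKey;   // set from the registration response

  public void setMasterKey(byte[] key) {
    currentMasterKey = key;
  }

  public byte[] getCurrentKey() {
    byte[] key = currentMasterKey;
    if (key == null) {
      // fail loudly instead of letting callers hit a bare NullPointerException
      throw new IllegalStateException(
          "Master key requested before the ResourceManager provided one");
    }
    return key;
  }
}
{code}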
[jira] [Commented] (YARN-515) Node Manager not getting the master key
[ https://issues.apache.org/jira/browse/YARN-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616698#comment-13616698 ] Robert Joseph Evans commented on YARN-515: -- This is really odd. I put in logging in the ResourceTrackerService and in the NodeStatusUpdaterImpl. The RM sets the secret key in the RegisterNodeManagerResponse, but the NM only sees a null come out for it. Because of that the heartbeat always fails with the NPE trying to read something that was never set. Node Manager not getting the master key --- Key: YARN-515 URL: https://issues.apache.org/jira/browse/YARN-515 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.0.4-alpha Reporter: Robert Joseph Evans Priority: Blocker On branch-2 the latest version I see the following on a secure cluster. {noformat} 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Security enabled - updating secret keys now 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as RM:PORT with total resource of me mory:12288, vCores:16 2013-03-28 19:21:06,244 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is started. 2013-03-28 19:21:06,245 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started. 2013-03-28 19:21:07,257 [Node Status Updater] ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught exception in status-updater java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:121) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:407) {noformat} The Null pointer exception just keeps repeating and all of the nodes end up being lost. It looks like it never gets the secret key when it registers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-378) ApplicationMaster retry times should be set by Client
[ https://issues.apache.org/jira/browse/YARN-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13614036#comment-13614036 ] Robert Joseph Evans commented on YARN-378: -- Hitesh and Vinod, It is not a big deal. I realized that both were going in, and I am glad that this is ready and has gone in. It is a great feature. It just would have been nice to either commit them at the same time, or give a heads up on the mailing list that you were going to break the build for a little while. ApplicationMaster retry times should be set by Client - Key: YARN-378 URL: https://issues.apache.org/jira/browse/YARN-378 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Environment: suse Reporter: xieguiming Assignee: Zhijie Shen Labels: usability Fix For: 2.0.5-beta Attachments: YARN-378_10.patch, YARN-378_11.patch, YARN-378_1.patch, YARN-378_2.patch, YARN-378_3.patch, YARN-378_4.patch, YARN-378_5.patch, YARN-378_6.patch, YARN-378_6.patch, YARN-378_7.patch, YARN-378_8.patch, YARN-378_9.patch, YARN_378-final-commit.patch, YARN-378_MAPREDUCE-5062.2.patch, YARN-378_MAPREDUCE-5062.patch We should support that different client or user have different ApplicationMaster retry times. It also say that yarn.resourcemanager.am.max-retries should be set by client. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13614402#comment-13614402 ] Robert Joseph Evans commented on YARN-112: -- I am not really sure that we fixed the underlying issue. {code}files.rename(dst_work, destDirPath, Rename.OVERWRITE);{code} threw an exception because there was something else in that directory already, but files.mkdir(destDirPath, cachePerms, false) is supposed to throw a FileAlreadyExistsException if the directory already exists. http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileContext.html#mkdir%28org.apache.hadoop.fs.Path,%20org.apache.hadoop.fs.permission.FsPermission,%20boolean%29 files.rename should never get into this situation if files.mkdir threw the exception when it was supposed to. I tested this and
{code}
FileContext lfc = FileContext.getLocalFSFileContext(new Configuration());
Path p = new Path("/tmp/bobby.12345");
FsPermission cachePerms = new FsPermission((short) 0755);
lfc.mkdir(p, cachePerms, false);
lfc.mkdir(p, cachePerms, false);
{code}
never throws an exception. We first need to address the bug in FileContext, and then we can look at how we can make FSDownload deal with mkdir throwing an exception, or whatever the fix ends up being. I filed HADOOP-9438 for this. If the fix ends up being that we do not support throwing the exception in FileContext, then your current solution looks OK. I also have a hard time believing that we are getting random collisions on a long value that should be fairly uniformly distributed. We need to guard against it either way and I suppose it is possible, but if I remember correctly we were seeing a significant number of these errors and my gut tells me that there is either something very wrong with Random, or there is something else also going on here. Race in localization can cause containers to fail - Key: YARN-112 URL: https://issues.apache.org/jira/browse/YARN-112 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.3 Reporter: Jason Lowe Assignee: omkar vinit joshi Attachments: yarn-112-20130325.1.patch, yarn-112-20130325.patch, yarn-112.20131503.patch On one of our 0.23 clusters, I saw a case of two containers, corresponding to two map tasks of a MR job, that were launched almost simultaneously on the same node. It appears they both tried to localize job.jar and job.xml at the same time. One of the containers failed when it couldn't rename the temporary job.jar directory to its final name because the target directory wasn't empty. Shortly afterwards the second container failed because job.xml could not be found, presumably because the first container removed it when it cleaned up. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
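Assuming HADOOP-9438 eventually makes FileContext.mkdir throw FileAlreadyExistsException as documented, one possible shape for the FSDownload side is sketched below. This is not the committed fix; the class and method names are invented, and it only shows the retry-on-collision idea argued for in the comments above.
{code}
import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.fs.FileAlreadyExistsException;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public final class DownloadDirs {
  public static Path createDestDir(FileContext lfc, Path baseDir,
      FsPermission cachePerms, Random rand) throws IOException {
    while (true) {
      Path candidate = new Path(baseDir, Long.toHexString(rand.nextLong()));
      try {
        lfc.mkdir(candidate, cachePerms, false);
        return candidate;                       // nobody else grabbed this name
      } catch (FileAlreadyExistsException collision) {
        // another localizer won the race for this name; pick a new one and retry
      }
    }
  }
}
{code}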
[jira] [Reopened] (YARN-378) ApplicationMaster retry times should be set by Client
[ https://issues.apache.org/jira/browse/YARN-378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans reopened YARN-378: -- Looks like something was missed {noformat} [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.5.1:compile (default-compile) on project hadoop-mapreduce-client-app: Compilation failure: Compilation failure: [ERROR] /home/evans/src/commit/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java:[227,52] cannot find symbol [ERROR] symbol : variable RM_AM_MAX_RETRIES [ERROR] location: class org.apache.hadoop.yarn.conf.YarnConfiguration [ERROR] /home/evans/src/commit/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java:[228,25] cannot find symbol [ERROR] symbol : variable DEFAULT_RM_AM_MAX_RETRIES [ERROR] location: class org.apache.hadoop.yarn.conf.YarnConfiguration {noformat} Please fix this ASAP. ApplicationMaster retry times should be set by Client - Key: YARN-378 URL: https://issues.apache.org/jira/browse/YARN-378 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Environment: suse Reporter: xieguiming Assignee: Zhijie Shen Labels: usability Fix For: 2.0.5-beta Attachments: YARN-378_10.patch, YARN-378_11.patch, YARN-378_1.patch, YARN-378_2.patch, YARN-378_3.patch, YARN-378_4.patch, YARN-378_5.patch, YARN-378_6.patch, YARN-378_6.patch, YARN-378_7.patch, YARN-378_8.patch, YARN-378_9.patch, YARN_378-final-commit.patch, YARN-378_MAPREDUCE-5062.2.patch, YARN-378_MAPREDUCE-5062.patch We should support that different client or user have different ApplicationMaster retry times. It also say that yarn.resourcemanager.am.max-retries should be set by client. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-109) .tmp file is not deleted for localized archives
[ https://issues.apache.org/jira/browse/YARN-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13608919#comment-13608919 ] Robert Joseph Evans commented on YARN-109: -- Findbugs is complaining that you are ignoring the return value of the delete call. It should not be a problem so either use the return value to log a warning when it fails or update the findbugs filter to filter out the error. The -1 for the test timeouts is caused by a bug in the script used to detect these, so you can either ignore it, or add a timeout to any @Test that appears in the patch file, including the ones you didn't add :(. In the test, please uncomment the lines to clean up after the test. Are they causing a problem for the test to pass? Or was it just for debugging? Also I personally would prefer to have a few small jar/tar/zip files checked into the repository instead of generating them on the fly for the test. It will speed up the test and have fewer dependencies on the system being set up with the exact commands, i.e. bash for windows support. Although if you don't feel like changing it I am fine with that too, most of those commands are used by FSDownload already so it is not that critical. .tmp file is not deleted for localized archives --- Key: YARN-109 URL: https://issues.apache.org/jira/browse/YARN-109 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.3, 2.0.0-alpha Reporter: Jason Lowe Assignee: Mayank Bansal Attachments: YARN-109-trunk-1.patch, YARN-109-trunk-2.patch, YARN-109-trunk-3.patch, YARN-109-trunk.patch When archives are localized they are initially created as a .tmp file and unpacked from that file. However the .tmp file is not deleted afterwards. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
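The first option mentioned above (use the return value of delete and log when it fails) is roughly this; the class name is illustrative, not the actual patch code.
{code}
import java.io.File;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public final class TmpCleanup {
  private static final Log LOG = LogFactory.getLog(TmpCleanup.class);

  public static void deleteTmp(File tmpFile) {
    // checking the return value satisfies findbugs and leaves a trace on failure
    if (!tmpFile.delete() && tmpFile.exists()) {
      LOG.warn("Failed to delete temporary file " + tmpFile);
    }
  }
}
{code}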
[jira] [Commented] (YARN-109) .tmp file is not deleted for localized archives
[ https://issues.apache.org/jira/browse/YARN-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13609509#comment-13609509 ] Robert Joseph Evans commented on YARN-109: -- That is fine with me. My concern was mostly with Windows support. tar, zip, jar, etc. should be there, but bash may not be. So if you want to file a new JIRA that is fine, if not you can just wait for windows support to be merged in and see if it breaks. .tmp file is not deleted for localized archives --- Key: YARN-109 URL: https://issues.apache.org/jira/browse/YARN-109 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.3, 2.0.0-alpha Reporter: Jason Lowe Assignee: Mayank Bansal Attachments: YARN-109-trunk-1.patch, YARN-109-trunk-2.patch, YARN-109-trunk-3.patch, YARN-109-trunk-4.patch, YARN-109-trunk.patch When archives are localized they are initially created as a .tmp file and unpacked from that file. However the .tmp file is not deleted afterwards. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-378) ApplicationMaster retry times should be set by Client
[ https://issues.apache.org/jira/browse/YARN-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13602333#comment-13602333 ] Robert Joseph Evans commented on YARN-378: -- Using the environment variables works for other applications too. That is the only way to get some pieces of critical information that are needed for registration with the RM. On Windows there are limits http://msdn.microsoft.com/en-us/library/windows/desktop/ms682653%28v=vs.85%29.aspx But they should not cause too much of an issue on Windows Server 2008 and above. I would prefer for us to only return the information to the AM one way, either through thrift or through the environment variable, just so there is less confusion, but I am not adamant about it. ApplicationMaster retry times should be set by Client - Key: YARN-378 URL: https://issues.apache.org/jira/browse/YARN-378 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Environment: suse Reporter: xieguiming Assignee: Zhijie Shen Labels: usability Attachments: YARN-378_1.patch, YARN-378_2.patch, YARN-378_3.patch, YARN-378_4.patch, YARN-378_5.patch, YARN-378_6.patch, YARN-378_6.patch, YARN-378_7.patch We should support that different client or user have different ApplicationMaster retry times. It also say that yarn.resourcemanager.am.max-retries should be set by client. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-378) ApplicationMaster retry times should be set by Client
[ https://issues.apache.org/jira/browse/YARN-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13602341#comment-13602341 ] Robert Joseph Evans commented on YARN-378: -- Looking at the code too I am fine with renaming retries to attempts. But we need to mark this JIRA as an incompatible change or put in a deprecated config mapping. We are early enough in YARN that deprecating it seems like a waste. ApplicationMaster retry times should be set by Client - Key: YARN-378 URL: https://issues.apache.org/jira/browse/YARN-378 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Environment: suse Reporter: xieguiming Assignee: Zhijie Shen Labels: usability Attachments: YARN-378_1.patch, YARN-378_2.patch, YARN-378_3.patch, YARN-378_4.patch, YARN-378_5.patch, YARN-378_6.patch, YARN-378_6.patch, YARN-378_7.patch We should support that different client or user have different ApplicationMaster retry times. It also say that yarn.resourcemanager.am.max-retries should be set by client. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
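If the rename went the deprecation route instead, Hadoop's Configuration already supports mapping an old key to a new one. The new key name below is only illustrative of a retries-to-attempts rename; whatever name is finally agreed on would go in its place.
{code}
import org.apache.hadoop.conf.Configuration;

public final class DeprecatedKeys {
  public static void register() {
    // reads of the old key would transparently resolve to the new one
    Configuration.addDeprecation("yarn.resourcemanager.am.max-retries",
        new String[] { "yarn.resourcemanager.am.max-attempts" });
  }
}
{code}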
[jira] [Commented] (YARN-226) Log aggregation should not assume an AppMaster will have containerId 1
[ https://issues.apache.org/jira/browse/YARN-226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13602661#comment-13602661 ] Robert Joseph Evans commented on YARN-226: -- Big means the amount of memory/CPU relative to the minimum allocation size. For example, you ask for a 4 GB container with a min allocation size of 500MB. Log aggregation should not assume an AppMaster will have containerId 1 -- Key: YARN-226 URL: https://issues.apache.org/jira/browse/YARN-226 Project: Hadoop YARN Issue Type: Sub-task Reporter: Siddharth Seth In case of reservations, etc - AppMasters may not get container id 1. We likely need additional info in the CLC / tokens indicating whether a container is an AM or not. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-378) ApplicationMaster retry times should be set by Client
[ https://issues.apache.org/jira/browse/YARN-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13601237#comment-13601237 ] Robert Joseph Evans commented on YARN-378: -- The patch looks good to me. The only problem I have is with how we are informing the AM of the maximum number of retries that it has. This should work, but it is going to require a lot of changes to the MR AM to use it. Right now the number is used in the init of MRAppMaster, but we will not get that information until start() is called and we register with the RM. I would much rather see a new environment variable added that can hold this information, because it makes MAPREDUCE-5062 much simpler. But I am OK with the way it currently is. ApplicationMaster retry times should be set by Client - Key: YARN-378 URL: https://issues.apache.org/jira/browse/YARN-378 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Environment: suse Reporter: xieguiming Assignee: Zhijie Shen Labels: usability Attachments: YARN-378_1.patch, YARN-378_2.patch, YARN-378_3.patch, YARN-378_4.patch, YARN-378_5.patch, YARN-378_6.patch, YARN-378_6.patch We should support that different client or user have different ApplicationMaster retry times. It also say that yarn.resourcemanager.am.max-retries should be set by client. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
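The environment-variable route preferred above could look roughly like this. The variable name MAX_APP_ATTEMPTS is invented for illustration; the point is that the value is already present in the AM's process environment during init(), before any registration call.
{code}
import java.util.Map;

public final class MaxAttemptsEnv {
  public static final String MAX_APP_ATTEMPTS = "MAX_APP_ATTEMPTS";  // illustrative name

  // Launcher side: add the value to the environment handed to the AM container.
  public static void export(Map<String, String> containerEnv, int maxAttempts) {
    containerEnv.put(MAX_APP_ATTEMPTS, Integer.toString(maxAttempts));
  }

  // AM side: readable in init(), long before registration with the RM.
  public static int read(int defaultValue) {
    String v = System.getenv(MAX_APP_ATTEMPTS);
    return v == null ? defaultValue : Integer.parseInt(v);
  }
}
{code}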
[jira] [Commented] (YARN-237) Refreshing the RM page forgets how many rows I had in my Datatables
[ https://issues.apache.org/jira/browse/YARN-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13599236#comment-13599236 ] Robert Joseph Evans commented on YARN-237: -- Sorry to keep adding more things in here, but JQuery.java is a generic part of YARN. It, in theory, can be used by others not just Map/Reduce and YARN. Encoding in a special case for the tasks table is not acceptable. You should be able to get the same functionality by switching to the DATATABLES_SELECTOR for those tables. We also need to address the find bugs issues. You are dereferencing type to create the ID of the tasks and type could be null, although in practice it should never be. Also there is no need to call toString() on type when using + with another string. This may fix the find bugs issues too, although it would not be super clean. Refreshing the RM page forgets how many rows I had in my Datatables --- Key: YARN-237 URL: https://issues.apache.org/jira/browse/YARN-237 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.0.2-alpha, 0.23.4, 3.0.0 Reporter: Ravi Prakash Assignee: jian he Labels: usability Attachments: YARN-237.patch, YARN-237.v2.patch, YARN-237.v3.patch If I choose a 100 rows, and then refresh the page, DataTables goes back to showing me 20 rows. This user preference should be stored in a cookie. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-378) ApplicationMaster retry times should be set by Client
[ https://issues.apache.org/jira/browse/YARN-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13599240#comment-13599240 ] Robert Joseph Evans commented on YARN-378: -- I am perfectly fine with that. It seems like more overhead, but I am fine either way. ApplicationMaster retry times should be set by Client - Key: YARN-378 URL: https://issues.apache.org/jira/browse/YARN-378 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Environment: suse Reporter: xieguiming Assignee: Zhijie Shen Labels: usability Attachments: YARN-378_1.patch, YARN-378_2.patch, YARN-378_3.patch, YARN-378_4.patch We should support that different client or user have different ApplicationMaster retry times. It also say that yarn.resourcemanager.am.max-retries should be set by client. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-378) ApplicationMaster retry times should be set by Client
[ https://issues.apache.org/jira/browse/YARN-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13597428#comment-13597428 ] Robert Joseph Evans commented on YARN-378: -- From a quick look it seems OK. It would be nice for isLastAMRetry to remain private and have a getter. That way it prevents unintended writes to it. I also don't really like having the AM guess how many retries there will be. I thought it was ugly when I added that code, and now that the logic is more complex I really know why. Could you please file a JIRA so the RM can inform the AM how many AM retries it has, or if you have time just add it in as part of this JIRA. That way the AM will never have to adjust its logic again. Also could we make the code a little more robust. In both the AM and the RM, instead of checking for just -1 could you check for anything that is <= 0. If anyone sets the retries to be that small it should use the default. I am not sure what having a max retries of -2 means and what it would do to an application. ApplicationMaster retry times should be set by Client - Key: YARN-378 URL: https://issues.apache.org/jira/browse/YARN-378 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Environment: suse Reporter: xieguiming Assignee: Zhijie Shen Labels: usability Attachments: YARN-378_1.patch We should support that different client or user have different ApplicationMaster retry times. It also say that yarn.resourcemanager.am.max-retries should be set by client. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
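The "anything <= 0 falls back to the default" check asked for above is a one-liner on the RM side. The sketch below uses invented names and also caps the client's request at the cluster maximum, which is one reasonable policy but not necessarily the one that lands.
{code}
public final class AttemptsPolicy {
  public static int effectiveMaxAttempts(int requestedByClient, int clusterMax) {
    if (requestedByClient <= 0) {
      return clusterMax;                            // unset or nonsense: use the default
    }
    return Math.min(requestedByClient, clusterMax); // never exceed the cluster cap
  }
}
{code}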
[jira] [Commented] (YARN-237) Refreshing the RM page forgets how many rows I had in my Datatables
[ https://issues.apache.org/jira/browse/YARN-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13597508#comment-13597508 ] Robert Joseph Evans commented on YARN-237: -- I have a few more comments. It is great that you fixed the issues, but now we have a leak in the browser. You have tied the table ID to the localStorage key, and then for a couple of tables you have included the jobID in the table ID. This means that new entries will be placed in the localStorage for every job page I visit and those entries will never be deleted. I see two ways to fix this. We can either change it over to be sessionStorage instead of localStorage, because it goes away after the session ends. Or we can remove the jobID from the table names. If we remove the jobID, corresponding tables on different pages will share a single state. If we use sessionStorage the data will only be saved for a given browser session. If I close the browser and reopen it the state will be lost. I tend to think the first one is preferable, but that is just me. Also could you please update the code format to meet our guidelines. There are a few places where it does not meet the guidelines. Refreshing the RM page forgets how many rows I had in my Datatables --- Key: YARN-237 URL: https://issues.apache.org/jira/browse/YARN-237 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.0.2-alpha, 0.23.4, 3.0.0 Reporter: Ravi Prakash Assignee: jian he Labels: usability Attachments: YARN-237.patch, YARN-237.v2.patch If I choose a 100 rows, and then refresh the page, DataTables goes back to showing me 20 rows. This user preference should be stored in a cookie. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-456) allow OS scheduling priority of NM to be different than the containers it launches for Windows
[ https://issues.apache.org/jira/browse/YARN-456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated YARN-456: - Summary: allow OS scheduling priority of NM to be different than the containers it launches for Windows (was: Add similar support for Windows) allow OS scheduling priority of NM to be different than the containers it launches for Windows -- Key: YARN-456 URL: https://issues.apache.org/jira/browse/YARN-456 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Bikas Saha Assignee: Bikas Saha -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-378) ApplicationMaster retry times should be set by Client
[ https://issues.apache.org/jira/browse/YARN-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13593736#comment-13593736 ] Robert Joseph Evans commented on YARN-378: -- I don't really want the client config to be called yarn.resourcemanager.am.max-retries. That is a YARN resource manager config, and is intended to be used by the RM, not by the map reduce client. I would much rather have a mapreduce.am.max-retries that the MR client reads and uses to populate the ApplicationSubmissionContext. ApplicationMaster retry times should be set by Client - Key: YARN-378 URL: https://issues.apache.org/jira/browse/YARN-378 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Environment: suse Reporter: xieguiming Assignee: Zhijie Shen Labels: usability We should support that different client or user have different ApplicationMaster retry times. It also say that yarn.resourcemanager.am.max-retries should be set by client. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-378) ApplicationMaster retry times should be set by Client
[ https://issues.apache.org/jira/browse/YARN-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13593863#comment-13593863 ] Robert Joseph Evans commented on YARN-378: -- But the config *is* specific to mapreduce. Every other application client will have to provide their own way of putting that value into the container launch context. It could be through a hadoop config or it could be through something else entirely. I am in the process of porting Storm to run on top of YARN. I don't see us ever using a Hadoop Configuration in the client except the default one to be able to access HDFS. Storm has its own configuration object and for better integration with Storm I would set up a Storm conf for that, although in reality I would probably just never set it because I never want it to go down entirely, and that is how I would get the maximum number of retries allowed by the cluster. I can see other applications that already exist and are being ported to run on YARN, like OpenMPI, to want to set that config in a way that is consistent with their current configuration and not in a Hadoop specific way. ApplicationMaster retry times should be set by Client - Key: YARN-378 URL: https://issues.apache.org/jira/browse/YARN-378 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Environment: suse Reporter: xieguiming Assignee: Zhijie Shen Labels: usability We should support that different client or user have different ApplicationMaster retry times. It also say that yarn.resourcemanager.am.max-retries should be set by client. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-237) Refreshing the RM page forgets how many rows I had in my Datatables
[ https://issues.apache.org/jira/browse/YARN-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13590842#comment-13590842 ] Robert Joseph Evans commented on YARN-237: -- localStorage is not per page it is per domain, so that means if two pages in the same domain have tables named the same they will share a config in local storage. So for example if I run a map/reduce job and I sort the map tasks by elapsed time, the reduce tasks will also be sorted by elapsed time when I go to their page. The good news is that if I sort the reduces by an ID that the maps don't know about the maps page just ignores it, but it resets the sorting for the reducers not too. But this produces even stranger behavior in the counters page. Because the counters use a selector, multiple tables on the same page now all share a saved state. So if I sort the counters by a column and then reload all of the counters are now sorted by that column. I am not positive what the best way is to fix these. We want to provide a way for each data table to have a unique storage key across all tables in the domain, even with the selector. We don't want to use the page path or anything like that because that will create a new group of settings per page, and that would result in filling up their localStorage, unless of course we used the sessionStorage instead. But using sessionStorage would mean that each time we opened up a new session we would have to re-do the settings. sessionStorage also does not fix the issue with counters and the selector where we have multiple tables all sharing a single ID and single storage. Refreshing the RM page forgets how many rows I had in my Datatables --- Key: YARN-237 URL: https://issues.apache.org/jira/browse/YARN-237 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.0.2-alpha, 0.23.4, 3.0.0 Reporter: Ravi Prakash Assignee: jian he Labels: usability Attachments: YARN-237.patch If I choose a 100 rows, and then refresh the page, DataTables goes back to showing me 20 rows. This user preference should be stored in a cookie. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-237) Refreshing the RM page forgets how many rows I had in my Datatables
[ https://issues.apache.org/jira/browse/YARN-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13590999#comment-13590999 ] Robert Joseph Evans commented on YARN-237: -- The code that Jian wrote is working on the RM as well as the AM, I tested it. The patch changes code that is common to both of them. The issues I mentioned are not theoretical. The reason it works on the AM is because it is not using a cookie, instead it is using an HTML5 concept for local storage. If we want to restrict these to just be for the RM that does seem to fix the issue. Refreshing the RM page forgets how many rows I had in my Datatables --- Key: YARN-237 URL: https://issues.apache.org/jira/browse/YARN-237 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.0.2-alpha, 0.23.4, 3.0.0 Reporter: Ravi Prakash Assignee: jian he Labels: usability Attachments: YARN-237.patch If I choose a 100 rows, and then refresh the page, DataTables goes back to showing me 20 rows. This user preference should be stored in a cookie. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-237) Refreshing the RM page forgets how many rows I had in my Datatables
[ https://issues.apache.org/jira/browse/YARN-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589976#comment-13589976 ] Robert Joseph Evans commented on YARN-237: -- The change looks more or less OK to me. I am not thrilled about how we modify the data table's init string by looking for the first '{', but I think it is OK. I just have a few concerns, and most if it deals with my lack of knowledge about jQuery and localStorage. I know that localStorage is not supported on all browsers. I also know that localStorage can throw a QUOTA_EXCEEDED exception. What happens when we run into these situations? Will the page stop working or will jQuery degrade gracefully and simply not allow us to save the data. What about if the data stored in the key is not what we expect. Will jQuery make the page unusable. We currently have tables with the same name on different pages. If they are not kept in sync there could be some issues with the data that is saved. Which brings up another point I am also a bit concerned about the key we are using as part of the localStorage. The key is the id of the data table. I would prefer it if we could some how make it obvious that these values are for a data table, and not some other apps storage. Refreshing the RM page forgets how many rows I had in my Datatables --- Key: YARN-237 URL: https://issues.apache.org/jira/browse/YARN-237 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.0.2-alpha, 0.23.4, 3.0.0 Reporter: Ravi Prakash Assignee: jian he Labels: usability Attachments: YARN-237.patch If I choose a 100 rows, and then refresh the page, DataTables goes back to showing me 20 rows. This user preference should be stored in a cookie. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-426) Failure to download a public resource on a node prevents further downloads of the resource from that node
[ https://issues.apache.org/jira/browse/YARN-426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13588424#comment-13588424 ] Robert Joseph Evans commented on YARN-426: -- The patch looks good to me. +1 I'll check it in. Failure to download a public resource on a node prevents further downloads of the resource from that node - Key: YARN-426 URL: https://issues.apache.org/jira/browse/YARN-426 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.0.3-alpha, 0.23.6 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-426.patch If the NM encounters an error while downloading a public resource, it fails to empty the list of request events corresponding to the resource request in {{attempts}}. If the same public resource is subsequently requested on that node, {{PublicLocalizer.addResource}} will skip the download since it will mistakenly believe a download of that resource is already in progress. At that point any container that requests the public resource will just hang in the {{LOCALIZING}} state. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-371) Consolidate resource requests in AM-RM heartbeat
[ https://issues.apache.org/jira/browse/YARN-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13570329#comment-13570329 ] Robert Joseph Evans commented on YARN-371: -- Tom just like Arun said the memory usage changes based off of the size of the cluster vs. the size of the request. The current approach is on the order of the size of the cluster where as the proposed approach is on the order of the number of desired containers. If I have a 100 node cluster and I am requesting 10 map tasks the size will be O(100 nodes + X racks + 1) possibly * 2 if reducers are included in it. What is more it is probably exactly the same size of request for 1 or even 1000 tasks. Where as the proposed approach would grow without bound as the number of tasks also increased. However, I also agree with Sandy that the current state compression is lossy and as such restricts what is possible in the scheduler. I would like to understand better what the size differences would be for various requests, both in memory and also over the wire. It seems conceivable to me that if the size difference is not too big, especially over the wire, we could allow the scheduler itself to decide on its in memory representation. This would allow for the Capacity Scheduler to keep its current layout and allow for others to experiment with more advanced scheduling options. Different groups could decide which scheduler best fits their needs and workload. If the size is significantly larger I would like to see hard numbers about how much better/worse it makes specific use cases. I am also very concerned about adding too much complexity to the scheduler. We have run into issues where the RM will get very far behind in scheduling because it is trying to do a lot already and eventually OOM as its event queue grows too large. I also don't want to change the scheduler protocol too much without first understanding how that new protocol would impact other potential scheduling features. There are a number of other computing patterns that could benefit from specific scheduler support. Things like gang scheduling where you need all of the containers at once or none of them can make any progress, or where you want all of the containers to be physically close to one another because they are very I/O intensive, but you don't really care where exactly they are. Or even something like HBase where you essentially want one process on every single node with no duplicates. Do the proposed changes make these uses case trivially simple, or do they require a lot of support on the AM to implement them? Consolidate resource requests in AM-RM heartbeat Key: YARN-371 URL: https://issues.apache.org/jira/browse/YARN-371 Project: Hadoop YARN Issue Type: Improvement Components: api, resourcemanager, scheduler Affects Versions: 2.0.2-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Each AMRM heartbeat consists of a list of resource requests. Currently, each resource request consists of a container count, a resource vector, and a location, which may be a node, a rack, or *. When an application wishes to request a task run in multiple localtions, it must issue a request for each location. This means that for a node-local task, it must issue three requests, one at the node-level, one at the rack-level, and one with * (any). These requests are not linked with each other, so when a container is allocated for one of them, the RM has no way of knowing which others to get rid of. 
When a node-local container is allocated, this is handled by decrementing the number of requests on that node's rack and in *. But when the scheduler allocates a task with a node-local request on its rack, the request on the node is left there. This can cause delay-scheduling to try to assign a container on a node that nobody cares about anymore. Additionally, unless I am missing something, the current model does not allow requests for containers only on a specific node or specific rack. While this is not a use case for MapReduce currently, it is conceivable that it might be something useful to support in the future, for example to schedule long-running services that persist state in a particular location, or for applications that generally care less about latency than data-locality. Lastly, the ability to understand which requests are for the same task will possibly allow future schedulers to make more intelligent scheduling decisions, as well as permit a more exact understanding of request load. I would propose the tweak of allowing a single ResourceRequest to encapsulate all the location information for a task. So instead of just a single location, a
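To make the current request model concrete, here is a minimal sketch of the three requests an AM issues for one node-local task. It uses the newInstance factory methods from later Hadoop 2.x releases (the 2.0.x code discussed above built these records differently, so treat the exact calls as illustrative); the host and rack names are made up.

{code:java}
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public class CurrentRequestModel {
  public static void main(String[] args) {
    Priority mapPriority = Priority.newInstance(20);
    Resource oneGb = Resource.newInstance(1024, 1);

    // One node-local task on host n1 in rack /r1 requires three separate
    // requests today: node, rack, and * (any). Nothing ties them together.
    List<ResourceRequest> asks = Arrays.asList(
        ResourceRequest.newInstance(mapPriority, "n1.example.com", oneGb, 1),
        ResourceRequest.newInstance(mapPriority, "/r1", oneGb, 1),
        ResourceRequest.newInstance(mapPriority, ResourceRequest.ANY, oneGb, 1));

    // Because the RM cannot tell these entries describe the same task,
    // satisfying one of them does not reliably clean up the other two.
    asks.forEach(System.out::println);
  }
}
{code}

Note how the number of entries stays roughly the same whether the AM wants 1 or 1,000 such tasks (only the container counts change), which is the O(cluster size) behavior described in the comment above.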
[jira] [Commented] (YARN-371) Resource-centric compression in AM-RM protocol limits scheduling
[ https://issues.apache.org/jira/browse/YARN-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13570619#comment-13570619 ] Robert Joseph Evans commented on YARN-371: -- I didn't really expect them to be trivial :). So I think that there may be some value in having a different protocol, but we need some hard numbers to be able to really make an informed decision. I would like to see the size of a request in the following table (both the in-memory size on the RM and the size sent over the wire):
||nodes (down) / tasks (across)||1,000||10,000||100,000||500,000||
||100|?|?|?|?|
||1,000|?|?|?|?|
||4,000|?|?|?|?|
||10,000|?|?|?|?|
It would also be great to see in practice how bad the scheduling problem is when the wrong node is sent. Resource-centric compression in AM-RM protocol limits scheduling Key: YARN-371 URL: https://issues.apache.org/jira/browse/YARN-371 Project: Hadoop YARN Issue Type: Improvement Components: api, resourcemanager, scheduler Affects Versions: 2.0.2-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Each AMRM heartbeat consists of a list of resource requests. Currently, each resource request consists of a container count, a resource vector, and a location, which may be a node, a rack, or *. When an application wishes to request that a task run in multiple locations, it must issue a request for each location. This means that for a node-local task, it must issue three requests, one at the node level, one at the rack level, and one with * (any). These requests are not linked with each other, so when a container is allocated for one of them, the RM has no way of knowing which others to get rid of. When a node-local container is allocated, this is handled by decrementing the number of requests on that node's rack and in *. But when the scheduler allocates a task with a node-local request on its rack, the request on the node is left there. This can cause delay-scheduling to try to assign a container on a node that nobody cares about anymore. Additionally, unless I am missing something, the current model does not allow requests for containers only on a specific node or specific rack. While this is not a use case for MapReduce currently, it is conceivable that it might be something useful to support in the future, for example to schedule long-running services that persist state in a particular location, or for applications that generally care less about latency than data-locality. Lastly, the ability to understand which requests are for the same task will possibly allow future schedulers to make more intelligent scheduling decisions, as well as permit a more exact understanding of request load. I would propose the tweak of allowing a single ResourceRequest to encapsulate all the location information for a task. So instead of just a single location, a ResourceRequest would contain an array of locations, including nodes that it would be happy with, racks that it would be happy with, and possibly *. Side effects of this change would be a reduction in the amount of data that needs to be transferred in a heartbeat, as well as in the RM's memory footprint, because what used to be different requests for the same task are now able to share some common data. While this change breaks compatibility, if it is going to happen, it makes sense to do it now, before YARN becomes beta. -- This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
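For contrast, a purely hypothetical sketch of the consolidated form proposed in the description above. None of these names exist in the YARN API; they only illustrate the idea of one request carrying every acceptable location for a task, so the RM can retire all of that task's locality options at once.

{code:java}
import java.util.Arrays;
import java.util.List;

// Hypothetical type, not part of YARN: one request per task, carrying every
// location that task would accept.
class ConsolidatedResourceRequest {
  final int priority;
  final int memoryMb;
  final List<String> acceptableNodes;  // e.g. ["n1.example.com"]
  final List<String> acceptableRacks;  // e.g. ["/r1"]
  final boolean acceptAnyNode;         // replaces the separate "*" request

  ConsolidatedResourceRequest(int priority, int memoryMb,
      List<String> acceptableNodes, List<String> acceptableRacks,
      boolean acceptAnyNode) {
    this.priority = priority;
    this.memoryMb = memoryMb;
    this.acceptableNodes = acceptableNodes;
    this.acceptableRacks = acceptableRacks;
    this.acceptAnyNode = acceptAnyNode;
  }
}

class ProposedRequestModel {
  public static void main(String[] args) {
    // One object per task: memory grows with the number of outstanding tasks
    // rather than with the number of nodes and racks in the cluster, which is
    // exactly the trade-off the table above is meant to quantify.
    ConsolidatedResourceRequest task = new ConsolidatedResourceRequest(
        20, 1024, Arrays.asList("n1.example.com"), Arrays.asList("/r1"), true);
    System.out.println("locations for one task: "
        + task.acceptableNodes + " " + task.acceptableRacks);
  }
}
{code}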
[jira] [Commented] (YARN-225) Proxy Link in RM UI throws NPE in Secure mode
[ https://issues.apache.org/jira/browse/YARN-225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13540470#comment-13540470 ] Robert Joseph Evans commented on YARN-225: -- That does look to be the correct patch, assuming that the stack trace was against 2.0.2 or before. Either way it is a fix that needs to go in, because I misread the HttpServletRequest Javadocs and missed "or null if the request has no cookies". The fix needs to go into branch-0.23 as well. I am +1 for the fix and will check it in. Proxy Link in RM UI throws NPE in Secure mode Key: YARN-225 URL: https://issues.apache.org/jira/browse/YARN-225 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.2-alpha, 2.0.1-alpha, 2.0.3-alpha Reporter: Devaraj K Assignee: Devaraj K Priority: Critical Attachments: YARN-225.patch
{code:xml}
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:241)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:975)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
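The Javadoc detail called out above is the whole bug: HttpServletRequest.getCookies() returns null, not an empty array, when the request carries no cookies. A minimal sketch of the kind of null-safe lookup involved (illustrative only, not the actual WebAppProxyServlet patch):

{code:java}
import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServletRequest;

final class CookieUtil {
  private CookieUtil() {}

  /** Find a cookie by name, tolerating requests that have no cookies at all. */
  static Cookie findCookie(HttpServletRequest req, String name) {
    Cookie[] cookies = req.getCookies(); // may be null, not just empty
    if (cookies == null) {
      return null;
    }
    for (Cookie c : cookies) {
      if (name.equals(c.getName())) {
        return c;
      }
    }
    return null;
  }
}
{code}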
[jira] [Updated] (YARN-293) Node Manager leaks LocalizerRunner object for every Container
[ https://issues.apache.org/jira/browse/YARN-293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated YARN-293: - Priority: Critical (was: Major) Target Version/s: 0.23.6 Affects Version/s: 0.23.3 It looks like some of the wiring is in place for this. We just need to send an ABORT_LOCALIZATION event when the RM tells the NM the app is done. Node Manager leaks LocalizerRunner object for every Container -- Key: YARN-293 URL: https://issues.apache.org/jira/browse/YARN-293 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.0.2-alpha, 0.23.3, 2.0.1-alpha Reporter: Devaraj K Priority: Critical Node Manager creates a new LocalizerRunner object for every container and puts it in the ResourceLocalizationService.LocalizerTracker.privLocalizers map but never removes it from the map. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-293) Node Manager leaks LocalizerRunner object for every Container
[ https://issues.apache.org/jira/browse/YARN-293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13540006#comment-13540006 ] Robert Joseph Evans commented on YARN-293: -- Sorry, looking at it more closely, it is actually per container ID, so we need to send an event when the container is cleaned up. Node Manager leaks LocalizerRunner object for every Container -- Key: YARN-293 URL: https://issues.apache.org/jira/browse/YARN-293 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.0.2-alpha, 0.23.3, 2.0.1-alpha Reporter: Devaraj K Priority: Critical Node Manager creates a new LocalizerRunner object for every container and puts it in the ResourceLocalizationService.LocalizerTracker.privLocalizers map but never removes it from the map. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-293) Node Manager leaks LocalizerRunner object for every Container
[ https://issues.apache.org/jira/browse/YARN-293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated YARN-293: - Attachment: YARN-293-trunk.txt This turned out to be a much smaller change than I originally thought. I just added the cleanup to a handler that was already being called for all containers to delete the container's resources. Node Manager leaks LocalizerRunner object for every Container -- Key: YARN-293 URL: https://issues.apache.org/jira/browse/YARN-293 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.0.2-alpha, 0.23.3, 2.0.1-alpha Reporter: Devaraj K Assignee: Robert Joseph Evans Priority: Critical Attachments: YARN-293-trunk.txt Node Manager creates a new LocalizerRunner object for every container and puts it in the ResourceLocalizationService.LocalizerTracker.privLocalizers map but never removes it from the map. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
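A rough sketch of the shape of that cleanup. Only LocalizerRunner and the privLocalizers map come from the report; the class, method, and key names below are stand-ins for the real NodeManager code, which keys the map by container ID.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins for the real NodeManager types.
class LocalizerRunner { /* one per container's private localization */ }

class LocalizerTrackerSketch {
  // Mirrors ResourceLocalizationService.LocalizerTracker.privLocalizers:
  // one entry per container ID, so entries must be dropped on cleanup.
  private final Map<String, LocalizerRunner> privLocalizers =
      new ConcurrentHashMap<>();

  void onLocalizationStart(String containerId) {
    privLocalizers.put(containerId, new LocalizerRunner());
  }

  // Hooked into the handler that already runs for every container when its
  // resources are deleted; without the remove(), every runner leaks.
  void onContainerResourcesCleanedUp(String containerId) {
    LocalizerRunner runner = privLocalizers.remove(containerId);
    if (runner != null) {
      // the real code would also stop/interrupt the runner here
    }
  }
}
{code}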
[jira] [Commented] (YARN-2) Enhance CS to schedule accounting for both memory and cpu cores
[ https://issues.apache.org/jira/browse/YARN-2?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13540081#comment-13540081 ] Robert Joseph Evans commented on YARN-2: I chatted with Arun offline a bit about this, and he pointed out to me that the APIs are marked as Evolving; I should read the patch more closely next time. So I am OK with putting it in with the API as it is. I still think that having a float for the API is preferable, but until we actually start using it in practice we will not know what the real issues are. Enhance CS to schedule accounting for both memory and cpu cores --- Key: YARN-2 URL: https://issues.apache.org/jira/browse/YARN-2 Project: Hadoop YARN Issue Type: New Feature Components: capacityscheduler, scheduler Reporter: Arun C Murthy Assignee: Arun C Murthy Fix For: 2.0.3-alpha Attachments: MAPREDUCE-4327.patch, MAPREDUCE-4327.patch, MAPREDUCE-4327.patch, MAPREDUCE-4327-v2.patch, MAPREDUCE-4327-v3.patch, MAPREDUCE-4327-v4.patch, MAPREDUCE-4327-v5.patch, YARN-2-help.patch, YARN-2.patch, YARN-2.patch, YARN-2.patch, YARN-2.patch, YARN-2.patch, YARN-2.patch With YARN being a general-purpose system, it would be useful for several applications (MPI et al) to specify not just memory but also CPU (cores) for their resource requirements. Thus, it would be useful for the CapacityScheduler to account for both. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-276) Capacity Scheduler can hang when submitting many jobs concurrently
[ https://issues.apache.org/jira/browse/YARN-276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13537168#comment-13537168 ] Robert Joseph Evans commented on YARN-276: -- I am not an expert on the scheduler code, so I have not done an in-depth review of the patch. My biggest concern with this is that there is no visibility in the UI/web services about why an app may not have been scheduled. It would be great if you could update CapacitySchedulerLeafQueueInfo.java and the web page that uses it, CapacitySchedulerPage.java. Capacity Scheduler can hang when submitting many jobs concurrently -- Key: YARN-276 URL: https://issues.apache.org/jira/browse/YARN-276 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 3.0.0, 2.0.1-alpha Reporter: nemon lou Attachments: YARN-276.patch, YARN-276.patch, YARN-276.patch, YARN-276.patch, YARN-276.patch Original Estimate: 24h Remaining Estimate: 24h In Hadoop 2.0.1, when I submit many jobs concurrently, the Capacity Scheduler can hang with most resources taken up by AMs and not enough resources left for tasks. And then all applications hang there. The cause is that yarn.scheduler.capacity.maximum-am-resource-percent is not checked directly. Instead, this property is only used to compute maxActiveApplications, and maxActiveApplications is computed from minimumAllocation (not from what the AMs actually use). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
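To see how that limit can overshoot, a back-of-the-envelope sketch of the computation described in the report (simplified; the real CapacityScheduler applies additional per-queue factors, and the cluster size and 2 GB AM footprint below are assumed examples):

{code:java}
public class MaxActiveAppsSketch {
  public static void main(String[] args) {
    int clusterMemoryMb = 100 * 8 * 1024; // assume 100 nodes x 8 GB
    int minAllocationMb = 1024;           // yarn.scheduler.minimum-allocation-mb
    double maxAmPercent = 0.1;            // maximum-am-resource-percent

    // Limit derived from the minimum allocation, not from what AMs really use.
    int maxActiveApplications = (int) Math.ceil(
        (clusterMemoryMb / (double) minAllocationMb) * maxAmPercent);

    // If each AM actually needs 2 GB instead of the 1 GB minimum, the
    // admitted AMs hold twice the share the 10% setting intended, and the
    // remaining capacity may be too small for their tasks to run.
    long memoryHeldByAms = (long) maxActiveApplications * 2048;
    System.out.println("maxActiveApplications = " + maxActiveApplications);
    System.out.println("memory held by AMs    = " + memoryHeldByAms
        + " MB of " + clusterMemoryMb + " MB");
  }
}
{code}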
[jira] [Commented] (YARN-204) test coverage for org.apache.hadoop.tools
[ https://issues.apache.org/jira/browse/YARN-204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13504710#comment-13504710 ] Robert Joseph Evans commented on YARN-204: -- +1 the new changes look better. I'll check this in. test coverage for org.apache.hadoop.tools - Key: YARN-204 URL: https://issues.apache.org/jira/browse/YARN-204 Project: Hadoop YARN Issue Type: Bug Components: applications Reporter: Aleksey Gorshkov Assignee: Aleksey Gorshkov Attachments: YARN-204-branch-0.23-a.patch, YARN-204-branch-0.23-b.patch, YARN-204-branch-0.23.patch, YARN-204-branch-2-a.patch, YARN-204-branch-2-b.patch, YARN-204-branch-2.patch, YARN-204-trunk-a.patch, YARN-204-trunk-b.patch, YARN-204-trunk.patch Added some tests for org.apache.hadoop.tools -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-237) Refreshing the RM page forgets how many rows I had in my Datatables
[ https://issues.apache.org/jira/browse/YARN-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13503890#comment-13503890 ] Robert Joseph Evans commented on YARN-237: -- You have to be careful with cookies because the web app proxy strips out cookies before sending the data to the application. Refreshing the RM page forgets how many rows I had in my Datatables --- Key: YARN-237 URL: https://issues.apache.org/jira/browse/YARN-237 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.0.2-alpha, 0.23.4, 3.0.0 Reporter: Ravi Prakash If I choose 100 rows and then refresh the page, DataTables goes back to showing me 20 rows. This user preference should be stored in a cookie. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
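A sketch of the kind of header forwarding that produces the behavior described above: the proxy copies the client's headers onto the outbound request but drops Cookie headers, so cookie-based UI state does not survive for pages served through it. This is illustrative only, not the actual WebAppProxyServlet code.

{code:java}
import java.net.HttpURLConnection;
import java.util.Enumeration;
import javax.servlet.http.HttpServletRequest;

final class ProxyHeaderCopy {
  private ProxyHeaderCopy() {}

  // Copy the client's headers to the proxied request, skipping cookies.
  static void copyHeaders(HttpServletRequest in, HttpURLConnection out) {
    Enumeration<String> names = in.getHeaderNames();
    while (names.hasMoreElements()) {
      String name = names.nextElement();
      if ("Cookie".equalsIgnoreCase(name)) {
        continue; // stripped before the request reaches the application
      }
      Enumeration<String> values = in.getHeaders(name);
      while (values.hasMoreElements()) {
        out.addRequestProperty(name, values.nextElement());
      }
    }
  }
}
{code}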