[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15226204#comment-15226204 ] Robert Joseph Evans commented on YARN-4757: --- I am not suggesting there is a DNS based solution. I am not a DNS expert and was hopeful there could at least be a DNS based mitigation possible, but that hope has now faded. I wanted to bring it up for discussion as part of the design so that we go into this with our eyes wide open, and so that, at a minimum, documenting it with examples of how to "fix" it becomes part of the final product. That did not happen for the initial registry service, but probably should have. > [Umbrella] Simplified discovery of services via DNS mechanisms > -- > > Key: YARN-4757 > URL: https://issues.apache.org/jira/browse/YARN-4757 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Vinod Kumar Vavilapalli >Assignee: Jonathan Maron > Attachments: YARN-4757- Simplified discovery of services via DNS > mechanisms.pdf > > > [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track > all related efforts.] > In addition to completing the present story of service-registry (YARN-913), > we also need to simplify the access to the registry entries. The existing > read mechanisms of the YARN Service Registry are currently limited to a > registry specific (java) API and a REST interface. In practice, this makes it > very difficult for wiring up existing clients and services. For e.g, dynamic > configuration of dependent endpoints of a service is not easy to implement > using the present registry-read mechanisms, *without* code-changes to > existing services. > A good solution to this is to expose the registry information through a more > generic and widely used discovery mechanism: DNS. Service Discovery via DNS > uses the well-known DNS interfaces to browse the network for services. > YARN-913 in fact talked about such a DNS based mechanism but left it as a > future task. (Task) Having the registry information exposed via DNS > simplifies the life of services. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15226196#comment-15226196 ] Robert Joseph Evans commented on YARN-4757: --- I think that is through the naming convention, and the DNS configuration on the desktop in a foreign country. I imagine that if I were doing this I would set it up so that the Hadoop DNS server would handle a set of sub-domains in my company's internal DNS setup. Then when my desktop is set up, or when my laptop connects to the VPN, the DNS server that it talks to would be configured to include one that also knows about the Hadoop setup. But that is just my guess. > [Umbrella] Simplified discovery of services via DNS mechanisms > -- > > Key: YARN-4757 > URL: https://issues.apache.org/jira/browse/YARN-4757 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Vinod Kumar Vavilapalli >Assignee: Jonathan Maron > Attachments: YARN-4757- Simplified discovery of services via DNS > mechanisms.pdf > > > [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track > all related efforts.] > In addition to completing the present story of service-registry (YARN-913), > we also need to simplify the access to the registry entries. The existing > read mechanisms of the YARN Service Registry are currently limited to a > registry specific (java) API and a REST interface. In practice, this makes it > very difficult for wiring up existing clients and services. For e.g, dynamic > configuration of dependent endpoints of a service is not easy to implement > using the present registry-read mechanisms, *without* code-changes to > existing services. > A good solution to this is to expose the registry information through a more > generic and widely used discovery mechanism: DNS. Service Discovery via DNS > uses the well-known DNS interfaces to browse the network for services. > YARN-913 in fact talked about such a DNS based mechanism but left it as a > future task. (Task) Having the registry information exposed via DNS > simplifies the life of services. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15214258#comment-15214258 ] Robert Joseph Evans commented on YARN-4757: --- There are lots of ways to "fix" these issues on a case by case basis. I mostly want to be sure that any documentation around YARN and service discovery is very clear that there are inherent races that can happen on shared infrastructure. YARN/Slider cannot fix them for end users and any client talking to a secure application/server should validate that the server is the correct and expected server. Concrete examples of how to do this would be great. This is not a new issue. It has existed since the registry service was first implemented. We are simply making it much easier for a user to integrate off the shelf components that are coming from a more traditional infrastructure/deployment where this is not necessarily a concern. > [Umbrella] Simplified discovery of services via DNS mechanisms > -- > > Key: YARN-4757 > URL: https://issues.apache.org/jira/browse/YARN-4757 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Vinod Kumar Vavilapalli >Assignee: Jonathan Maron > Attachments: YARN-4757- Simplified discovery of services via DNS > mechanisms.pdf > > > [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track > all related efforts.] > In addition to completing the present story of service-registry (YARN-913), > we also need to simplify the access to the registry entries. The existing > read mechanisms of the YARN Service Registry are currently limited to a > registry specific (java) API and a REST interface. In practice, this makes it > very difficult for wiring up existing clients and services. For e.g, dynamic > configuration of dependent endpoints of a service is not easy to implement > using the present registry-read mechanisms, *without* code-changes to > existing services. > A good solution to this is to expose the registry information through a more > generic and widely used discovery mechanism: DNS. Service Discovery via DNS > uses the well-known DNS interfaces to browse the network for services. > YARN-913 in fact talked about such a DNS based mechanism but left it as a > future task. (Task) Having the registry information exposed via DNS > simplifies the life of services. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
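The server-validation point above is easy to illustrate with plain Java: a minimal sketch, assuming an HTTPS endpoint and a truststore that already contains the service's certificate (the URL below is purely illustrative). With TLS in place, a stray process that ends up behind the service's old name or port cannot present the expected certificate, so the handshake fails instead of silently returning wrong data.

```java
import java.net.URL;
import javax.net.ssl.HttpsURLConnection;

public class CheckServerIdentity {
    public static void main(String[] args) throws Exception {
        // Hypothetical service name; replace with the DNS name the registry exposes.
        URL url = new URL("https://api.myservice.example.com/");
        HttpsURLConnection conn = (HttpsURLConnection) url.openConnection();
        conn.connect();
        // The default hostname verifier has already checked the certificate against
        // the host name; getPeerPrincipal() throws if the peer was not verified.
        System.out.println("Talking to: " + conn.getPeerPrincipal());
        conn.disconnect();
    }
}
```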
[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210513#comment-15210513 ] Robert Joseph Evans commented on YARN-4757: --- My concern about mutual authentication is mostly around documentation and asking if there is anything we can do to mitigate possible issues/attacks. Instead of talking about exact attacks let's talk about a few accidents that could happen, and can happen today, but are less likely because when I update my client to use the Registry API I make different assumptions about things. Let's say I am running a web service on YARN, and I want my customers to be able to get to me through existing tools. So I set this all up and I have them go to http://api.bobby.yarncluster.myCompany.com:/ (or something else that matches the naming convention you had, I don't remember exactly and it is not relevant). First of all I have no way to guarantee that the port is open on any node, so it is a pain, but I try to launch several web servers, finally get a few to come up, the others fail and get relaunched on other nodes, and eventually they are all up and running, and the AM puts all of them into the registry service. Things are going well. Customers are using my service and everyone is happy. But then I do a rolling upgrade and I kill one container and launch a new one on another box. In the meantime some other container on the box I was running on grabs that port and brings up an internal web UI for it. Now many of my customers trying to hit my web service get this other process instead and see 404 errors, etc. Because DNS is eventually consistent, and there is a lot of caching happening, there is a race. If the client does not authenticate the server, like with https, then someone malicious could exploit this to do all kinds of things. I am simply saying that many people trust DNS a lot more than they should in their protocols, more so when they feel that they have DNSSEC turned on internally and they are going to an internal address that they can "trust". Exposing YARN through DNS did not make it any less secure, it just made it a lot simpler for someone to deploy something that is insecure. > [Umbrella] Simplified discovery of services via DNS mechanisms > -- > > Key: YARN-4757 > URL: https://issues.apache.org/jira/browse/YARN-4757 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Vinod Kumar Vavilapalli >Assignee: Jonathan Maron > Attachments: YARN-4757- Simplified discovery of services via DNS > mechanisms.pdf > > > [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track > all related efforts.] > In addition to completing the present story of service-registry (YARN-913), > we also need to simplify the access to the registry entries. The existing > read mechanisms of the YARN Service Registry are currently limited to a > registry specific (java) API and a REST interface. In practice, this makes it > very difficult for wiring up existing clients and services. For e.g, dynamic > configuration of dependent endpoints of a service is not easy to implement > using the present registry-read mechanisms, *without* code-changes to > existing services. > A good solution to this is to expose the registry information through a more > generic and widely used discovery mechanism: DNS. Service Discovery via DNS > uses the well-known DNS interfaces to browse the network for services. > YARN-913 in fact talked about such a DNS based mechanism but left it as a > future task.
(Task) Having the registry information exposed via DNS > simplifies the life of services. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
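One narrow, hedged mitigation for the caching part of the race described above: JVM-based clients can at least keep their own positive DNS cache short, so they re-resolve quickly after a container moves. This does not remove the race or the need to authenticate the server; the property values below are only an example.

```java
import java.security.Security;

public class ShortDnsCache {
    public static void main(String[] args) {
        // Cache successful lookups for only a few seconds instead of the JVM default,
        // so a client notices sooner when a service lands on a different node.
        Security.setProperty("networkaddress.cache.ttl", "5");
        Security.setProperty("networkaddress.cache.negative.ttl", "3");
        // ... application code that resolves and calls the service goes here ...
    }
}
```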
[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210465#comment-15210465 ] Robert Joseph Evans commented on YARN-4757: --- [~jmaron], Thanks for the answers. As for the SRV records and which IP address is returned, it might be good to make it more clear in the document what you are proposing. Security makes sense, it is almost exactly the same as what we do for storm (I really wish ZK had delegation tokens though). My concern was not about whether we have the ability to return multiple addresses. My concern was mostly about how many we can return. Typically the addresses returned for google.com, etc. actually point to a very large load balancer and not individual web servers, so the number of entries returned is on the order of the number of data centers someone has, or, even more likely, it is at an even higher level and is around the number of geographic regions. At Yahoo we run very large HBase clusters. I'm not sure how well tools would handle getting back 2000 IP addresses for a record. I mostly want to understand what, if any, theoretical limits there are to this technology and what, if any, practical limits there are. > [Umbrella] Simplified discovery of services via DNS mechanisms > -- > > Key: YARN-4757 > URL: https://issues.apache.org/jira/browse/YARN-4757 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Vinod Kumar Vavilapalli >Assignee: Jonathan Maron > Attachments: YARN-4757- Simplified discovery of services via DNS > mechanisms.pdf > > > [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track > all related efforts.] > In addition to completing the present story of service-registry (YARN-913), > we also need to simplify the access to the registry entries. The existing > read mechanisms of the YARN Service Registry are currently limited to a > registry specific (java) API and a REST interface. In practice, this makes it > very difficult for wiring up existing clients and services. For e.g, dynamic > configuration of dependent endpoints of a service is not easy to implement > using the present registry-read mechanisms, *without* code-changes to > existing services. > A good solution to this is to expose the registry information through a more > generic and widely used discovery mechanism: DNS. Service Discovery via DNS > uses the well-known DNS interfaces to browse the network for services. > YARN-913 in fact talked about such a DNS based mechanism but left it as a > future task. (Task) Having the registry information exposed via DNS > simplifies the life of services. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
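The practical-limits question above is easy to probe from the client side. A minimal sketch, using a hypothetical record name, that simply asks the resolver for every address behind a name; how many actually come back depends on the DNS server, UDP response size limits, and any intermediate caches, which is exactly the behavior worth measuring for records backed by thousands of containers.

```java
import java.net.InetAddress;

public class CountAddresses {
    public static void main(String[] args) throws Exception {
        // Hypothetical name; in practice this would be a registry-exposed service name.
        InetAddress[] addrs = InetAddress.getAllByName("hbase.services.example.com");
        System.out.println("Resolver returned " + addrs.length + " addresses");
        for (InetAddress a : addrs) {
            System.out.println("  " + a.getHostAddress());
        }
    }
}
```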
[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209067#comment-15209067 ] Robert Joseph Evans commented on YARN-4757: --- I also did a quick pass through the document and I wanted to clarify a few things. So in some places in the document, like with names that map to containers and names that map to components, it says something like "If Available", indicating that if an IP address is not assigned to the individual container no mapping will be made. Am I interpreting that correctly? Are there situations where you would just return the IP address of the node the container is running on? Or am I just mistaken in my interpretation and there are different situations where we could launch a container that would have no IP address available? However, for the per application records there is no such conditional. Does that mean that we will return records for any service API no matter how the IP addresses are assigned, or is there no way for the IP address to not be available? Also, I am not super familiar with the Slider registry, so perhaps you could clarify a few things there too. How is authentication with ZooKeeper handled? Is it always SASL+Kerberos? I ask just because the doc mentions that the RM has to set up the base user directory with permissions. Would any secure Slider app that wants to use the registry then be required to ship a keytab with their application? Also, I am not super familiar with the existing registry API; from the example in the doc it shows a few different types of services that an Application Master can register, both Host/Port and URI. Would we be exposing SRV records for both of these combinations? If so, how would they be named? I am also curious about limits to various DNS fields, both in the protocol and in practice with common implementations. I am not an expert on DNS, so if I say something silly, after you stop laughing please let me know. The document talks a lot about doing character remapping and having to have unique application names, but it does not talk about limits on the lengths of those names (I have seen that some DNS servers don't support more than 254 character names). What about limits on the number of IP addresses that can be returned for a given name? I could not find anything specific, but I have to assume that in practice most systems don't support a huge number of these, and large clusters on YARN can easily launch hundreds or even thousands of containers for a given service. In addition to Allen's concerns, the document does not seem to address/call out my initial concerns about requiring mutual authentication, or handling of port availability in scheduling. > [Umbrella] Simplified discovery of services via DNS mechanisms > -- > > Key: YARN-4757 > URL: https://issues.apache.org/jira/browse/YARN-4757 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Vinod Kumar Vavilapalli >Assignee: Jonathan Maron > Attachments: YARN-4757- Simplified discovery of services via DNS > mechanisms.pdf > > > [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track > all related efforts.] > In addition to completing the present story of service-registry (YARN-913), > we also need to simplify the access to the registry entries. The existing > read mechanisms of the YARN Service Registry are currently limited to a > registry specific (java) API and a REST interface. In practice, this makes it > very difficult for wiring up existing clients and services.
For e.g, dynamic > configuration of dependent endpoints of a service is not easy to implement > using the present registry-read mechanisms, *without* code-changes to > existing services. > A good solution to this is to expose the registry information through a more > generic and widely used discovery mechanism: DNS. Service Discovery via DNS > uses the well-known DNS interfaces to browse the network for services. > YARN-913 in fact talked about such a DNS based mechanism but left it as a > future task. (Task) Having the registry information exposed via DNS > simplifies the life of services. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
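On the name-length question above, the protocol-level limits are fixed by DNS itself (63 octets per label, 255 octets for a full name, per RFC 1035), whatever individual servers tolerate on top of that. A small hedged sketch of the kind of check the record-generation code would need; the example name is made up.

```java
public class DnsNameCheck {
    // Standard DNS limits: 63 octets per label, 255 octets per name
    // (commonly quoted as 253 printable characters once length bytes are counted).
    static void validate(String name) {
        if (name.length() > 253) {
            throw new IllegalArgumentException("name too long: " + name);
        }
        for (String label : name.split("\\.")) {
            if (label.isEmpty() || label.length() > 63) {
                throw new IllegalArgumentException("bad label '" + label + "' in " + name);
            }
        }
    }

    public static void main(String[] args) {
        validate("component.application.user.yarncluster.example.com"); // illustrative only
    }
}
```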
[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15193769#comment-15193769 ] Robert Joseph Evans commented on YARN-4757: --- [~aw], I am not an expert on DNS, so it is good to hear that you have thought through this and done your homework. I read up a little on SRV records and it looks like a good fit. It still does not change the need for two-way authentication and making sure that we can restrict who registers for a service, but because SRV records are not a drop-in replacement for A/CNAME records it should not be as big of an issue. Clients are likely going to need to make changes to support SRV records, and from what I can tell Java does not come with built-in support; not the end of the world, but also likely non-trivial, especially when it looks like the industry has not decided on how it wants to support HTTP. (Although I could be wrong on all of that, because like I said I am not an expert here.) I just want to be sure that you are thinking things through, and it looks like you are, so I am happy. > [Umbrella] Simplified discovery of services via DNS mechanisms > -- > > Key: YARN-4757 > URL: https://issues.apache.org/jira/browse/YARN-4757 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Vinod Kumar Vavilapalli >Assignee: Jonathan Maron > > [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track > all related efforts.] > In addition to completing the present story of service-registry (YARN-913), > we also need to simplify the access to the registry entries. The existing > read mechanisms of the YARN Service Registry are currently limited to a > registry specific (java) API and a REST interface. In practice, this makes it > very difficult for wiring up existing clients and services. For e.g, dynamic > configuration of dependent endpoints of a service is not easy to implement > using the present registry-read mechanisms, *without* code-changes to > existing services. > A good solution to this is to expose the registry information through a more > generic and widely used discovery mechanism: DNS. Service Discovery via DNS > uses the well-known DNS interfaces to browse the network for services. > YARN-913 in fact talked about such a DNS based mechanism but left it as a > future task. (Task) Having the registry information exposed via DNS > simplifies the life of services. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
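On the point above about SRV support in Java: while there is no high-level JDK API, clients can query SRV records through JNDI's DNS provider, so the client-side change is real but modest. A minimal sketch, assuming a hypothetical record name; the `_http._tcp` prefix follows the usual SRV naming convention.

```java
import java.util.Hashtable;
import javax.naming.directory.Attribute;
import javax.naming.directory.Attributes;
import javax.naming.directory.InitialDirContext;

public class SrvLookup {
    public static void main(String[] args) throws Exception {
        Hashtable<String, String> env = new Hashtable<>();
        env.put("java.naming.factory.initial", "com.sun.jndi.dns.DnsContextFactory");
        env.put("java.naming.provider.url", "dns:"); // use the platform's resolver

        InitialDirContext ctx = new InitialDirContext(env);
        // Hypothetical service name following the _service._proto convention.
        Attributes attrs = ctx.getAttributes("_http._tcp.myservice.example.com",
                                             new String[] { "SRV" });
        Attribute srv = attrs.get("SRV");
        for (int i = 0; srv != null && i < srv.size(); i++) {
            // Each value is "priority weight port target".
            System.out.println(srv.get(i));
        }
        ctx.close();
    }
}
```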
[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15193577#comment-15193577 ] Robert Joseph Evans commented on YARN-4757: --- I am +1 on the idea of using DNS for long lived service discovery, but we need to be very very careful about security. If we are not, all of the problems possible with https://en.wikipedia.org/wiki/DNS_spoofing would likely be possible with this too. We need to be positive that we can restrict the names allowed so there are no conflicts with other servers on the network/internet. Additionally, if we make this super simple, which is the entire goal here, then we are covering up some potentially really serious issues with client code that a normal server running off YARN would not expect to have. It really comes down to this: any service running on YARN that wants to be secure needs two-way authentication, where the client authenticates the server and the server authenticates the client. There are timing attacks and other things that can happen when a process crashes and lets go of a port. Internal web services feel especially vulnerable because unless you enable SSL they will be insecure, something that many groups avoid on internal services because of the extra overhead of doing encryption. Do you plan on handling ephemeral ports in some way? As far as I know there is no standard for including port(s) in a DNS entry. If we do come up with something that is non-standard, doesn't that still necessitate client side changes, when avoiding them was an expressed goal of this JIRA? If we don't handle ephemeral ports, are we going to add in Mesos-like scheduling of ports? > [Umbrella] Simplified discovery of services via DNS mechanisms > -- > > Key: YARN-4757 > URL: https://issues.apache.org/jira/browse/YARN-4757 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Vinod Kumar Vavilapalli >Assignee: Jonathan Maron > > [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track > all related efforts.] > In addition to completing the present story of service-registry (YARN-913), > we also need to simplify the access to the registry entries. The existing > read mechanisms of the YARN Service Registry are currently limited to a > registry specific (java) API and a REST interface. In practice, this makes it > very difficult for wiring up existing clients and services. For e.g, dynamic > configuration of dependent endpoints of a service is not easy to implement > using the present registry-read mechanisms, *without* code-changes to > existing services. > A good solution to this is to expose the registry information through a more > generic and widely used discovery mechanism: DNS. Service Discovery via DNS > uses the well-known DNS interfaces to browse the network for services. > YARN-913 in fact talked about such a DNS based mechanism but left it as a > future task. (Task) Having the registry information exposed via DNS > simplifies the life of services. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
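To make the two-way authentication requirement above concrete, here is a minimal sketch of building an SSLContext that can drive mutual TLS in plain Java; the keystore paths and passwords are placeholders, and a real deployment would more likely get its keys from the cluster's certificate infrastructure.

```java
import java.io.FileInputStream;
import java.security.KeyStore;
import javax.net.ssl.KeyManagerFactory;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

public class MutualTls {
    public static SSLContext build() throws Exception {
        // Our own identity (placeholder path/password): presented to the peer.
        KeyStore own = KeyStore.getInstance("JKS");
        own.load(new FileInputStream("service.jks"), "changeit".toCharArray());
        KeyManagerFactory kmf =
            KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
        kmf.init(own, "changeit".toCharArray());

        // Certificates we trust (placeholder path/password): used to verify the peer.
        KeyStore trusted = KeyStore.getInstance("JKS");
        trusted.load(new FileInputStream("truststore.jks"), "changeit".toCharArray());
        TrustManagerFactory tmf =
            TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(trusted);

        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(kmf.getKeyManagers(), tmf.getTrustManagers(), null);
        // A server socket created from ctx with setNeedClientAuth(true) authenticates
        // clients; a client socket from ctx authenticates the server. That is the
        // "client authenticates server and server authenticates client" requirement.
        return ctx;
    }
}
```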
[jira] [Commented] (YARN-3605) _ as method name may not be supported much longer
[ https://issues.apache.org/jira/browse/YARN-3605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548375#comment-14548375 ] Robert Joseph Evans commented on YARN-3605: --- This is not a newbie issue. The code that has the _ method in it is generated code, and the code that generates it is far from simple. This is also technically a backwards incompatible change, because other YARN applications could be using it. _ as method name may not be supported much longer - Key: YARN-3605 URL: https://issues.apache.org/jira/browse/YARN-3605 Project: Hadoop YARN Issue Type: Bug Reporter: Robert Joseph Evans I was trying to run the precommit test on my mac under JDK8, and I got the following error related to javadocs. (use of '_' as an identifier might not be supported in releases after Java SE 8) It looks like we need to at least change the method name to not be '_' any more, or possibly replace the HTML generation with something more standard. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3605) _ as method name may not be supported much longer
[ https://issues.apache.org/jira/browse/YARN-3605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated YARN-3605: -- Labels: (was: newbie) _ as method name may not be supported much longer - Key: YARN-3605 URL: https://issues.apache.org/jira/browse/YARN-3605 Project: Hadoop YARN Issue Type: Bug Reporter: Robert Joseph Evans I was trying to run the precommit test on my mac under JDK8, and I got the following error related to javadocs. (use of '_' as an identifier might not be supported in releases after Java SE 8) It looks like we need to at least change the method name to not be '_' any more, or possibly replace the HTML generation with something more standard. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-644) Basic null check is not performed on passed in arguments before using them in ContainerManagerImpl.startContainer
[ https://issues.apache.org/jira/browse/YARN-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated YARN-644: - Labels: (was: BB2015-05-RFC) Basic null check is not performed on passed in arguments before using them in ContainerManagerImpl.startContainer - Key: YARN-644 URL: https://issues.apache.org/jira/browse/YARN-644 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.7.0 Reporter: Omkar Vinit Joshi Assignee: Varun Saxena Priority: Minor Attachments: YARN-644.001.patch, YARN-644.002.patch, YARN-644.03.patch, YARN-644.04.patch, YARN-644.05.patch I see that validation/ null check is not performed on passed in parameters. Ex. tokenId.getContainerID().getApplicationAttemptId() inside ContainerManagerImpl.authorizeRequest() I guess we should add these checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3605) _ as method name may not be supported much longer
Robert Joseph Evans created YARN-3605: - Summary: _ as method name may not be supported much longer Key: YARN-3605 URL: https://issues.apache.org/jira/browse/YARN-3605 Project: Hadoop YARN Issue Type: Bug Reporter: Robert Joseph Evans I was trying to run the precommit test on my mac under JDK8, and I got the following error related to javadocs. (use of '_' as an identifier might not be supported in releases after Java SE 8) It looks like we need to at least change the method name to not be '_' any more, or possibly replace the HTML generation with something more standard. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-644) Basic null check is not performed on passed in arguments before using them in ContainerManagerImpl.startContainer
[ https://issues.apache.org/jira/browse/YARN-644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534803#comment-14534803 ] Robert Joseph Evans commented on YARN-644: -- Thanks [~varun_saxena], I agree with [~gtCarrera9] +1. I'll check this in. Basic null check is not performed on passed in arguments before using them in ContainerManagerImpl.startContainer - Key: YARN-644 URL: https://issues.apache.org/jira/browse/YARN-644 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.7.0 Reporter: Omkar Vinit Joshi Assignee: Varun Saxena Priority: Minor Attachments: YARN-644.001.patch, YARN-644.002.patch, YARN-644.03.patch, YARN-644.04.patch, YARN-644.05.patch I see that validation/ null check is not performed on passed in parameters. Ex. tokenId.getContainerID().getApplicationAttemptId() inside ContainerManagerImpl.authorizeRequest() I guess we should add these checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3148) allow CORS related headers to passthrough in WebAppProxyServlet
[ https://issues.apache.org/jira/browse/YARN-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534957#comment-14534957 ] Robert Joseph Evans commented on YARN-3148: --- The changes look fine to me. Not sure why the patch could not apply. The queue is full right now, so I will try to run things manually. allow CORS related headers to passthrough in WebAppProxyServlet --- Key: YARN-3148 URL: https://issues.apache.org/jira/browse/YARN-3148 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.7.0 Reporter: Prakash Ramachandran Assignee: Varun Saxena Labels: BB2015-05-RFC Attachments: YARN-3148.001.patch, YARN-3148.02.patch, YARN-3148.03.patch currently the WebAppProxyServlet filters the request headers as defined by passThroughHeaders. Tez UI is building a webapp which using rest api to fetch data from the am via the rm tracking url. for this purpose it would be nice to have additional headers allowed especially the ones related to CORS. A few of them that would help are * Origin * Access-Control-Request-Method * Access-Control-Request-Headers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
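For readers unfamiliar with the proxy, the change under review amounts to growing the set of request headers the servlet forwards. A hedged sketch of the idea only; the real field name and header list in WebAppProxyServlet may differ from this.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class PassThroughHeaders {
    // Illustrative allow-list: the usual pass-through headers plus the CORS
    // headers the Tez UI use case needs. Real code should compare header names
    // case-insensitively, since HTTP header names are not case sensitive.
    static final Set<String> PASS_THROUGH = new HashSet<>(Arrays.asList(
        "User-Agent", "Accept", "Accept-Encoding", "Accept-Language",
        "Origin", "Access-Control-Request-Method", "Access-Control-Request-Headers"));

    static boolean shouldForward(String headerName) {
        return PASS_THROUGH.contains(headerName);
    }
}
```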
[jira] [Updated] (YARN-3148) allow CORS related headers to passthrough in WebAppProxyServlet
[ https://issues.apache.org/jira/browse/YARN-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated YARN-3148: -- Labels: (was: BB2015-05-RFC) allow CORS related headers to passthrough in WebAppProxyServlet --- Key: YARN-3148 URL: https://issues.apache.org/jira/browse/YARN-3148 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.7.0 Reporter: Prakash Ramachandran Assignee: Varun Saxena Attachments: YARN-3148.001.patch, YARN-3148.02.patch, YARN-3148.03.patch currently the WebAppProxyServlet filters the request headers as defined by passThroughHeaders. Tez UI is building a webapp which using rest api to fetch data from the am via the rm tracking url. for this purpose it would be nice to have additional headers allowed especially the ones related to CORS. A few of them that would help are * Origin * Access-Control-Request-Method * Access-Control-Request-Headers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3148) allow CORS related headers to passthrough in WebAppProxyServlet
[ https://issues.apache.org/jira/browse/YARN-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated YARN-3148: -- Labels: BB2015-05-RFC (was: ) allow CORS related headers to passthrough in WebAppProxyServlet --- Key: YARN-3148 URL: https://issues.apache.org/jira/browse/YARN-3148 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.7.0 Reporter: Prakash Ramachandran Assignee: Varun Saxena Labels: BB2015-05-RFC Attachments: YARN-3148.001.patch, YARN-3148.02.patch, YARN-3148.03.patch currently the WebAppProxyServlet filters the request headers as defined by passThroughHeaders. Tez UI is building a webapp which using rest api to fetch data from the am via the rm tracking url. for this purpose it would be nice to have additional headers allowed especially the ones related to CORS. A few of them that would help are * Origin * Access-Control-Request-Method * Access-Control-Request-Headers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2261) YARN should have a way to run post-application cleanup
[ https://issues.apache.org/jira/browse/YARN-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058745#comment-14058745 ] Robert Joseph Evans commented on YARN-2261: --- +1 either approach seems fine to me. Vinod's requires an opt in, which is nice from a backwards compatibility standpoint. Also do we want to count the cleanup container as a running application? We definitely need to count its resources against any queue it is a part of, but for a queue that is configured to run mostly large applications, it could have other applications back up behind the cleanup containers. YARN should have a way to run post-application cleanup -- Key: YARN-2261 URL: https://issues.apache.org/jira/browse/YARN-2261 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli See MAPREDUCE-5956 for context. Specific options are at https://issues.apache.org/jira/browse/MAPREDUCE-5956?focusedCommentId=14054562page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14054562. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2261) YARN should have a way to run post-application cleanup
[ https://issues.apache.org/jira/browse/YARN-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14059060#comment-14059060 ] Robert Joseph Evans commented on YARN-2261: --- Yes and that is not necessarily a good thing. Especially if cleanup can take a relatively long period of time. YARN should have a way to run post-application cleanup -- Key: YARN-2261 URL: https://issues.apache.org/jira/browse/YARN-2261 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli See MAPREDUCE-5956 for context. Specific options are at https://issues.apache.org/jira/browse/MAPREDUCE-5956?focusedCommentId=14054562page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14054562. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM
[ https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14055588#comment-14055588 ] Robert Joseph Evans commented on YARN-611: -- Why is the reset policy created on a per app ATTEMPT basis? Shouldn't it be on a per application basis? Wouldn't having more than one WindowsSlideAMRetryCountResetPolicy per application be a waste, as they will either be running in parallel racing with each other, or there will be extra overhead to stop and start them for each application attempt? Inside WindowsSlideAMRetryCountResetPolicy you create a new Timer. Timer instances create a new thread, and I am not sure we really need a new thread for potentially each application just so the thread can wake up every few seconds to reset a counter. Inside WindowsSlideAMRetryCountResetPolicy.amRetryCountReset we call rmApp.getCurrentAppAttempt() in a loop. Why don't we cache it? I also don't really like how the code handles locking. To me it always feels bad to hold a lock while calling into a class that may call back into you, especially from a different thread. The WindowsSlideAMRetryCountResetPolicy calls into getAppAttemptId, shouldCountTowardsMaxAttemptRetry, mayBeLastAttempt, and setMaybeLastAttemptFlag of RmAppAttemptImpl. RmAppAttemptImpl calls into start, stop, and recover for the resetPolicy. Right now I don't think there are any potential deadlocks because RmAppAttemptImpl never holds a lock while interacting directly with resetPolicy, but if it ever does then it could deadlock. I'm not sure of a good way to fix this, except perhaps through comments in the ResetPolicy interface specifying that start/stop/recover will never be called while holding a lock for RMAppAttempt or RMApp. Add an AM retry count reset window to YARN RM - Key: YARN-611 URL: https://issues.apache.org/jira/browse/YARN-611 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.3-alpha Reporter: Chris Riccomini Assignee: Xuan Gong Attachments: YARN-611.1.patch, YARN-611.2.patch, YARN-611.3.patch YARN currently has the following config: yarn.resourcemanager.am.max-retries This config defaults to 2, and defines how many times to retry a failed AM before failing the whole YARN job. YARN counts an AM as failed if the node that it was running on dies (the NM will timeout, which counts as a failure for the AM), or if the AM dies. This configuration is insufficient for long running (or infinitely running) YARN jobs, since the machine (or NM) that the AM is running on will eventually need to be restarted (or the machine/NM will fail). In such an event, the AM has not done anything wrong, but this is counted as a failure by the RM. Since the retry count for the AM is never reset, eventually, at some point, the number of machine/NM failures will result in the AM failure count going above the configured value for yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the job as failed, and shut it down. This behavior is not ideal. I propose that we add a second configuration: yarn.resourcemanager.am.retry-count-window-ms This configuration would define a window of time that would define when an AM is well behaved, and it's safe to reset its failure count back to zero. Every time an AM fails the RmAppImpl would check the last time that the AM failed. If the last failure was less than retry-count-window-ms ago, and the new failure count is >= max-retries, then the job should fail. If the AM has never failed, the retry count is < max-retries, or if the last failure was OUTSIDE the retry-count-window-ms, then the job should be restarted. Additionally, if the last failure was outside the retry-count-window-ms, then the failure count should be set back to 0. This would give developers a way to have well-behaved AMs run forever, while still failing mis-behaving AMs after a short period of time. I think the work to be done here is to change the RmAppImpl to actually look at app.attempts, and see if there have been more than max-retries failures in the last retry-count-window-ms milliseconds. If there have, then the job should fail; if not, then the job should go forward. Additionally, we might also need to add an endTime in either RMAppAttemptImpl or RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the failure. Thoughts? -- This message was sent by Atlassian JIRA (v6.2#6252)
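For reference while reading the review comments above, a hedged sketch of the sliding-window behavior the issue describes; the class and field names here are illustrative and are not the actual RMAppImpl/reset-policy code under review.

```java
public final class AmRetryWindow {
    private final long windowMs;     // yarn.resourcemanager.am.retry-count-window-ms
    private final int maxRetries;    // yarn.resourcemanager.am.max-retries
    private int failureCount;
    private long lastFailureTime;

    public AmRetryWindow(long windowMs, int maxRetries) {
        this.windowMs = windowMs;
        this.maxRetries = maxRetries;
    }

    /** Called on each AM failure; returns true if the whole application should fail. */
    public boolean onAmFailure(long now) {
        if (failureCount > 0 && now - lastFailureTime > windowMs) {
            failureCount = 0;   // last failure is outside the window: forgive old failures
        }
        failureCount++;
        lastFailureTime = now;
        // Fail only when failures pile up inside the window; otherwise restart the AM.
        return failureCount >= maxRetries;
    }
}
```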
[jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM
[ https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054168#comment-14054168 ] Robert Joseph Evans commented on YARN-611: -- Why are you using Java serialization for the retry policy? There are too many problems with Java serialization, especially if we are persisting it into a DB, like the state store. Please switch to using something like protocol buffers that will allow for forward/backward compatible modifications going forward. In the javadocs for RMApp.setRetryCount it would be good to explain what the retry count actually is and does. In the constructor for RMAppAttemptImpl there is special logic to call setup only for WindowsSlideAMRetryCountResetPolicy. This completely loses the abstraction that the AMResetCountPolicy interface should be providing. Please update the interface so that you don't need special case code for a single implementation. In RMAppAttemptImpl you mark setMaybeLastAttemptFlag as Private; this really should have been done in the parent interface. In the parent interface you also add in mayBeLastAttempt(); this too should be marked as Private, and both of them should have comments noting that they are for testing. Add an AM retry count reset window to YARN RM - Key: YARN-611 URL: https://issues.apache.org/jira/browse/YARN-611 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.3-alpha Reporter: Chris Riccomini Assignee: Xuan Gong Attachments: YARN-611.1.patch YARN currently has the following config: yarn.resourcemanager.am.max-retries This config defaults to 2, and defines how many times to retry a failed AM before failing the whole YARN job. YARN counts an AM as failed if the node that it was running on dies (the NM will timeout, which counts as a failure for the AM), or if the AM dies. This configuration is insufficient for long running (or infinitely running) YARN jobs, since the machine (or NM) that the AM is running on will eventually need to be restarted (or the machine/NM will fail). In such an event, the AM has not done anything wrong, but this is counted as a failure by the RM. Since the retry count for the AM is never reset, eventually, at some point, the number of machine/NM failures will result in the AM failure count going above the configured value for yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the job as failed, and shut it down. This behavior is not ideal. I propose that we add a second configuration: yarn.resourcemanager.am.retry-count-window-ms This configuration would define a window of time that would define when an AM is well behaved, and it's safe to reset its failure count back to zero. Every time an AM fails the RmAppImpl would check the last time that the AM failed. If the last failure was less than retry-count-window-ms ago, and the new failure count is >= max-retries, then the job should fail. If the AM has never failed, the retry count is < max-retries, or if the last failure was OUTSIDE the retry-count-window-ms, then the job should be restarted. Additionally, if the last failure was outside the retry-count-window-ms, then the failure count should be set back to 0. This would give developers a way to have well-behaved AMs run forever, while still failing mis-behaving AMs after a short period of time. I think the work to be done here is to change the RmAppImpl to actually look at app.attempts, and see if there have been more than max-retries failures in the last retry-count-window-ms milliseconds. If there have, then the job should fail; if not, then the job should go forward. Additionally, we might also need to add an endTime in either RMAppAttemptImpl or RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the failure. Thoughts? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2140) Add support for network IO isolation/scheduling for containers
[ https://issues.apache.org/jira/browse/YARN-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14030643#comment-14030643 ] Robert Joseph Evans commented on YARN-2140: --- We are working on similar things for storm. I am very interested in your design, because for any streaming system to truly have a chance on YARN, soft guarantees on network I/O are critical. There are several big problems with network I/O even if the user can effectively estimate what they will need. The first is that the resource is not limited to a single node in the cluster. The network has a topology and a bottleneck can show up at any point in that topology. So you may think you are fine because each node in a rack is not scheduled to be using the full bandwidth that the network card(s) can support, but you can easily have saturated the top of rack switch without knowing it. To solve this problem you effectively have to know the topology of the application itself, so that you can schedule the node to node network connections within that application. If users don't know how much network they are going to use at a high level, they will never have any idea at a low level. But then you also have the big problem of batch being very bursty in its network usage. The only way to solve this is going to require network hardware support for prioritizing packets. But I'll wait for your design before writing too much more. Add support for network IO isolation/scheduling for containers -- Key: YARN-2140 URL: https://issues.apache.org/jira/browse/YARN-2140 Project: Hadoop YARN Issue Type: New Feature Reporter: Wei Yan Assignee: Wei Yan -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data
[ https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13872300#comment-13872300 ] Robert Joseph Evans commented on YARN-1530: --- I agree that we need to think about load and plan for something that can handle at least 20x the current load, but preferably 100x. However, I am not that sure that the load will be a huge problem, at least for current MR clusters. We have seen very large jobs as well, but a job with a 700 MB history file does not finish instantly. I took a look at a 3500 node cluster we have that is under fairly heavy load, and looking at the done directory for yesterday, I saw what amounted to about 1.7MB/sec of data on average. Gigabit Ethernet should be able to handle 15 to 20 times this (assuming that we read as much data as we write, and that the storage may require some replication). I am fine with the proposed solution by [~lohit] so long as the history service always provides a RESTful interface and the AM can decide if it wants to use it, or go through a different higher load channel. Otherwise non-Java based AMs would not necessarily be able to write to the history service. I am also a bit nervous about using the history service for recovery or as a backend for the current MR APIs if we have a pub/sub system as a link between the applications and the history service. I don't think it is a show stopper, it just opens the door for a number of corner cases that will have to be dealt with. For example, if an MR AM crashes badly and the client goes to the history service to get the counters/etc, when does the history service know that all of the events for the MR AM have been processed so it can return those counters, or perhaps other data? I am not totally sure what data may be a show stopper for this, but the lag means all applications have to be sure that they don't use the history service for split brain problems or things like that. [Umbrella] Store, manage and serve per-framework application-timeline data -- Key: YARN-1530 URL: https://issues.apache.org/jira/browse/YARN-1530 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Attachments: application timeline design-20140108.pdf This is a sibling JIRA for YARN-321. Today, each application/framework has to do store, and serve per-framework data all by itself as YARN doesn't have a common solution. This JIRA attempts to solve the storage, management and serving of per-framework data from various applications, both running and finished. The aim is to change YARN to collect and store data in a generic manner with plugin points for frameworks to do their own thing w.r.t interpretation and serving. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
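A rough sanity check of the 15 to 20x figure above, under one possible reading of the stated assumptions (history is read about as often as it is written, and writes are replicated three ways), with 1 GbE taken as roughly 125 MB/s:

\[
\frac{125\ \text{MB/s (1 GbE)}}{1.7\ \text{MB/s} \times (3\ \text{replicated writes} + 1\ \text{read})} \approx 18\times
\]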
[jira] [Commented] (YARN-321) Generic application history service
[ https://issues.apache.org/jira/browse/YARN-321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13855707#comment-13855707 ] Robert Joseph Evans commented on YARN-321: -- The way it currently works is based off of group permissions on a directory (this is from memory from a while ago, so I could be off on a few things). In HDFS, when you create a file, the group of the file is the group of the directory the file is a part of, similar to the setgid bit on a directory in Linux. When an MR job completes it will copy its history log file, along with a few other files, to a drop-box-like location called "intermediate done" and atomically rename it from a temp name to the final name. The directory is world writable, but only readable by a special group that the history server is a part of and general users are not. The history server then wakes up periodically and will scan that directory for new files; when it sees new files it will move them to a final location that is owned by the headless history server user. If a query comes in for a job that the history server is not aware of, it will also scan the intermediate done directory before failing. Reading history data is done through RPC to the history server, or through the web interface, including RESTful APIs. There is no supported way for an app to read history data directly through the file system. I hope this helps. Generic application history service --- Key: YARN-321 URL: https://issues.apache.org/jira/browse/YARN-321 Project: Hadoop YARN Issue Type: Improvement Reporter: Luke Lu Assignee: Vinod Kumar Vavilapalli Attachments: AHS Diagram.pdf, ApplicationHistoryServiceHighLevel.pdf, Generic Application History - Design-20131219.pdf, HistoryStorageDemo.java The mapreduce job history server currently needs to be deployed as a trusted server in sync with the mapreduce runtime. Every new application would need a similar application history server. Having to deploy O(T*V) (where T is number of type of application, V is number of version of application) trusted servers is clearly not scalable. Job history storage handling itself is pretty generic: move the logs and history data into a particular directory for later serving. Job history data is already stored as json (or binary avro). I propose that we create only one trusted application history server, which can have a generic UI (display json as a tree of strings) as well. Specific application/version can deploy untrusted webapps (a la AMs) to query the application history server and interpret the json for its specific UI and/or analytics. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
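The temp-name-then-rename hand-off described above is the key to the scheme, since the history server must never pick up a half-written file. A minimal sketch with the Hadoop FileSystem API; the directory layout and file name are illustrative, not the exact paths the MR history code uses.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HistoryDropBox {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Illustrative paths only; the real intermediate-done layout differs.
        Path tmp  = new Path("/mr-history/intermediate-done/job_0001.jhist.tmp");
        Path done = new Path("/mr-history/intermediate-done/job_0001.jhist");

        // Write the history data under a temporary name first ...
        try (FSDataOutputStream out = fs.create(tmp)) {
            out.writeBytes("...history events...");
        }
        // ... then atomically rename it so the history server only ever sees
        // complete files when it scans the directory.
        fs.rename(tmp, done);
    }
}
```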
[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application
[ https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805618#comment-13805618 ] Robert Joseph Evans commented on YARN-941: -- That sounds like a great default. I would like to also have a way for an AM to say I can handle updating tokens without being shot, but that may be something that shows up in a follow on JIRA. RM Should have a way to update the tokens it has for a running application -- Key: YARN-941 URL: https://issues.apache.org/jira/browse/YARN-941 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans When an application is submitted to the RM it includes with it a set of tokens that the RM will renew on behalf of the application, that will be passed to the AM when the application is launched, and will be used when launching the application to access HDFS to download files on behalf of the application. For long lived applications/services these tokens can expire, and then the tokens that the AM has will be invalid, and the tokens that the RM had will also not work to launch a new AM. We need to provide an API that will allow the RM to replace the current tokens for this application with a new set. To avoid any real race issues, I think this API should be something that the AM calls, so that the client can connect to the AM with a new set of tokens it got using kerberos, then the AM can inform the RM of the new set of tokens and quickly update its tokens internally to use these new ones. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-321) Generic application history service
[ https://issues.apache.org/jira/browse/YARN-321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13790555#comment-13790555 ] Robert Joseph Evans commented on YARN-321: -- I like the diagrams, but I want to understand if the generic application history service is intended to replace the job history server, or to just augment it? I would prefer it if we could replace the current server. Perhaps not in the first release, but eventually. To make that work we would have to provide a way for MR specific code to come up and run inside the service, exposing both the current restful web service, an application specific UI, and the RPC server that we currently run. Generic application history service --- Key: YARN-321 URL: https://issues.apache.org/jira/browse/YARN-321 Project: Hadoop YARN Issue Type: Improvement Reporter: Luke Lu Assignee: Vinod Kumar Vavilapalli Attachments: AHS Diagram.pdf, ApplicationHistoryServiceHighLevel.pdf, HistoryStorageDemo.java The mapreduce job history server currently needs to be deployed as a trusted server in sync with the mapreduce runtime. Every new application would need a similar application history server. Having to deploy O(T*V) (where T is number of type of application, V is number of version of application) trusted servers is clearly not scalable. Job history storage handling itself is pretty generic: move the logs and history data into a particular directory for later serving. Job history data is already stored as json (or binary avro). I propose that we create only one trusted application history server, which can have a generic UI (display json as a tree of strings) as well. Specific application/version can deploy untrusted webapps (a la AMs) to query the application history server and interpret the json for its specific UI and/or analytics. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated YARN-913: - Attachment: RegistrationServiceDetails.txt Uploading a file that shows some examples of the registration service APIs. Any feedback on them is appreciated. Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api Affects Versions: 3.0.0 Reporter: Steve Loughran Assignee: Robert Joseph Evans Attachments: RegistrationServiceDetails.txt In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Assigned] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans reassigned YARN-913: Assignee: Robert Joseph Evans Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api Affects Versions: 3.0.0 Reporter: Steve Loughran Assignee: Robert Joseph Evans In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13786299#comment-13786299 ] Robert Joseph Evans commented on YARN-913: -- Yes it does have plenty of races. I'll try to get some detailed designs up shortly, but at a high level the general idea is to have a RESTful web service. For the most common use case there just need to be two interfaces. - Register/Monitor a Service - Query for Services Part of the reason we need the service registry is to securely verify that a client is talking to the real service, and no one has grabbed the service's port after it registered. To do that I want to have the concept of a verified service. For that we would need an admin interface for adding, updating, and removing verified services. The registry would provide a number of pluggable ways for services to authenticate. Part of adding a verified service would include indicating which authentication models the service can use to register and which users are allowed to register that service. The registry could also act like a trusted Certificate Authority. Another part of adding in a verified service would include indicating how clients could verify they are talking to the true service. This could include just publishing an application id so the client can go to the RM and get a delegation token. Another option would be having the service generate a public/private key pair. When the service registers it would get the private key and the public key would be available through the discovery interface. The plan is to also have the registry monitor the service similar to ZK. The service would heartbeat in to the registry periodically (could be on the order of minutes depending on the service); after a certain period of time of inactivity the service would be removed from the registry. Perhaps we should add in an explicit unregister as well. I want to make sure that the data model is generic enough that we could support something like a web service on the grid where each server can register itself and all of them would show up in the registry, so a service could have one or more servers that are a part of it, and each server could have some separate metadata about it. I also want to have a plug-in interface for discovery, so we could potentially make the registry look like a DNS server or an SSL Certificate Authority, which would make compatibility with existing applications and clients a lot simpler. Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api Affects Versions: 3.0.0 Reporter: Steve Loughran Assignee: Robert Joseph Evans In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.1#6144)
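A hedged sketch of the two common-case interfaces described above, just to pin down the shape of the calls; the method names, parameters, and the idea of folding the heartbeat into re-registration are illustrative, not a committed API.

```java
import java.util.List;

public interface ServiceRegistry {
    /** Register (or re-register) a server under a service; doubles as the heartbeat. */
    void register(String serviceName, String serverId, String endpoint, long ttlSeconds);

    /** Query for the live servers currently registered under a service name. */
    List<String> lookup(String serviceName);

    /** Optional explicit removal instead of waiting for the heartbeat TTL to expire. */
    void unregister(String serviceName, String serverId);
}
```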
[jira] [Commented] (YARN-624) Support gang scheduling in the AM RM protocol
[ https://issues.apache.org/jira/browse/YARN-624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13785556#comment-13785556 ] Robert Joseph Evans commented on YARN-624: -- [~curino] Sorry about the late reply. I have not really tested this much with storm on YARN. In most of our experiments the amount of time it takes to get nodes is negligible. But we have not really done anything serious with it, and adding new nodes right now is a manual operation. Support gang scheduling in the AM RM protocol - Key: YARN-624 URL: https://issues.apache.org/jira/browse/YARN-624 Project: Hadoop YARN Issue Type: Sub-task Components: api, scheduler Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Per discussion on YARN-392 and elsewhere, gang scheduling, in which a scheduler runs a set of tasks when they can all be run at the same time, would be a useful feature for YARN schedulers to support. Currently, AMs can approximate this by holding on to containers until they get all the ones they need. However, this lends itself to deadlocks when different AMs are waiting on the same containers. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-896) Roll up for long lived YARN
[ https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13754787#comment-13754787 ] Robert Joseph Evans commented on YARN-896: -- I agree that providing a good way to handle stdout and stderr is important. I don't know if I want the NM to be doing this for us though, but that is an implementation detail that we can talk about on the follow-up JIRA. Chris, feel free to file a JIRA for rolling of stdout and stderr and we can look into what it will take to support that properly. Roll up for long lived YARN --- Key: YARN-896 URL: https://issues.apache.org/jira/browse/YARN-896 Project: Hadoop YARN Issue Type: New Feature Reporter: Robert Joseph Evans YARN is intended to be general purpose, but it is missing some features to be able to truly support long lived applications and long lived containers. This ticket is intended to # discuss what is needed to support long lived processes # track the resulting JIRA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-896) Roll up for long lived YARN
[ https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13743819#comment-13743819 ] Robert Joseph Evans commented on YARN-896: -- [~criccomini], That is a great point. To do this we need the application to somehow inform YARN that it is a long lived application. We could do this either through some sort of metadata that is submitted with the application to YARN, possibly through the service registry, or even perhaps just setting the progress to a special value like -1. I think I would prefer the first one, because then YARN could use that metadata later on for other things. After that the UI change should not be too difficult. If you want to file a JIRA for it, either as a sub task or just link it in, that would be great. Roll up for long lived YARN --- Key: YARN-896 URL: https://issues.apache.org/jira/browse/YARN-896 Project: Hadoop YARN Issue Type: New Feature Reporter: Robert Joseph Evans YARN is intended to be general purpose, but it is missing some features to be able to truly support long lived applications and long lived containers. This ticket is intended to # discuss what is needed to support long lived processes # track the resulting JIRA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-810) Support CGroup ceiling enforcement on CPU
[ https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13741170#comment-13741170 ] Robert Joseph Evans commented on YARN-810: -- Sorry I am a bit late to this discussion. I don't like the config to be global. I think it needs to be on a per container basis. {quote}There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if it's available.{quote} The question is are there ever two different applications running on the same cluster where it is desirable for one, and not for the other. I believe that is true. I argued this in YARN-102 where you want to measure how long an application will take to run under a specific CPU resource request. If I allow it to go over I will never know how long it would take worst case, and so I will never know if my config is correct unless I can artificially limit it. But in production I don't want to run worst case every time, and I don't want a special test cluster to see what the worst case is. Support CGroup ceiling enforcement on CPU - Key: YARN-810 URL: https://issues.apache.org/jira/browse/YARN-810 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.1.0-beta, 2.0.5-alpha Reporter: Chris Riccomini Assignee: Sandy Ryza Problem statement: YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. Containers are then allowed to request vcores between the minimum and maximum defined in the yarn-site.xml. In the case where a single-threaded container requests 1 vcore, with a pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of the core it's using, provided that no other container is also using it. This happens, even though the only guarantee that YARN/CGroups is making is that the container will get at least 1/4th of the core. If a second container then comes along, the second container can take resources from the first, provided that the first container is still getting at least its fair share (1/4th). There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if it's available. Here's an RFC that describes the problem in more detail: http://lwn.net/Articles/336127/ Solution: As it happens, when CFS is used in combination with CGroups, you can enforce a ceiling using two files in cgroups: {noformat} cpu.cfs_quota_us cpu.cfs_period_us {noformat} The usage of these two files is documented in more detail here: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html Testing: I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, it behaves as described above (it is a soft cap, and allows containers to use more than they asked for). I then tested CFS CPU quotas manually with YARN. 
First, you can see that CFS is in use in the CGroup, based on the file names: {noformat} [criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/ total 0 -r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares -r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat -rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release -rw-r--r-- 1 app app 0 Jun 13 16:46 tasks [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_period_us 10 [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us -1 {noformat} Oddly, it appears that the cfs_period_us is set to .1s, not 1s. We can place processes in hard limits. I have process 4370 running YARN container container_1371141151815_0003_01_03 on a host. By default, it's running at ~300% cpu usage. {noformat} CPU 4370 criccomi 20 0 1157m 551m 14m S 240.3 0.8 87:10.91 ... {noformat} When I set the CFS quota: {noformat} echo 1000 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us CPU 4370 criccomi 20 0 1157m 563m 14m S 1.0 0.8 90:08.39
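To make the ceiling mechanism concrete, here is a small sketch of how a quota could be derived from a container's vcore request and written into its cgroup. The 1:4 pcore:vcore ratio, the 100ms period, and the paths are illustrative assumptions taken from the description above, not what the NodeManager actually does today:
{code}
import java.io.FileWriter;
import java.io.IOException;

public class CfsCeilingSketch {
  // Illustrative assumptions: a 100ms CFS period and the 1:4 pcore:vcore
  // ratio used in the description above.
  private static final long CFS_PERIOD_US = 100000L;
  private static final int VCORES_PER_PCORE = 4;

  /** Quota in microseconds that caps a container at its vcore share of one period. */
  static long quotaForVcores(int vcores) {
    return (CFS_PERIOD_US * vcores) / VCORES_PER_PCORE;
  }

  /** Write the ceiling into the container's cgroup directory (path is illustrative). */
  static void applyCeiling(String containerCgroupDir, int vcores) throws IOException {
    try (FileWriter w = new FileWriter(containerCgroupDir + "/cpu.cfs_quota_us")) {
      w.write(Long.toString(quotaForVcores(vcores)));
    }
  }

  public static void main(String[] args) throws IOException {
    // A 1-vcore container would be capped at 25ms of CPU time per 100ms period.
    applyCeiling("/cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03", 1);
  }
}
{code}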
[jira] [Commented] (YARN-1024) Define a virtual core unambigiously
[ https://issues.apache.org/jira/browse/YARN-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13741518#comment-13741518 ] Robert Joseph Evans commented on YARN-1024: --- I am fine with that too. Define a virtual core unambigiously --- Key: YARN-1024 URL: https://issues.apache.org/jira/browse/YARN-1024 Project: Hadoop YARN Issue Type: Improvement Reporter: Arun C Murthy Assignee: Arun C Murthy We need to clearly define the meaning of a virtual core unambiguously so that it's easy to migrate applications between clusters. For e.g. here is Amazon EC2 definition of ECU: http://aws.amazon.com/ec2/faqs/#What_is_an_EC2_Compute_Unit_and_why_did_you_introduce_it Essentially we need to clearly define a YARN Virtual Core (YVC). Equivalently, we can use ECU itself: *One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.* -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1024) Define a virtual core unambigiously
[ https://issues.apache.org/jira/browse/YARN-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13739853#comment-13739853 ] Robert Joseph Evans commented on YARN-1024: --- {quote}Sorry for the longwindedness.{quote} From what people have told me you still have a long way to go before you approach me for longwindedness :). My initial gut reaction is that only having two numbers to express the request seems too simplified, but the more I think about it the more I am OK with it, although I think I would change the numbers to be total YCUs requested and minimum YCUs per core. This gives the user better visibility into how the scheduler is treating these numbers so they can better reason about them. The total YCUs is the value used for scheduling. The minimum YCUs per core is compared to the maxComputeUnitsPerCore, as was suggested, to reject a request as not possible, or in the case of a heterogeneous environment to restrict the hosts that this container can run on. Although I am OK with the original proposal too. I would also like us to have a flag that would either limit the container to the requested CPU and let it have no more even when more is available, or would let it expand to use whatever CPU was free, but would be guaranteed to get at least the YCUs requested. This is likely something that would have to be done on a separate JIRA though. Without this I don't see a way to really get simplicity, predictability, or consistency. 1 MB of RAM is fairly simple to understand. It can be measured without too much of a problem just by running the process. Most users do a simple search for the correct value: run with the default; if it does not work, increase the amount and run again. 1 YCU is very complex to measure for an application. If I cannot restrict a container to never use more than what was requested I cannot consistently predict how long it will take to run later. Without this I don't know how to answer the question I know will come up: what should I set these values to? Define a virtual core unambigiously --- Key: YARN-1024 URL: https://issues.apache.org/jira/browse/YARN-1024 Project: Hadoop YARN Issue Type: Improvement Reporter: Arun C Murthy Assignee: Arun C Murthy We need to clearly define the meaning of a virtual core unambiguously so that it's easy to migrate applications between clusters. For e.g. here is Amazon EC2 definition of ECU: http://aws.amazon.com/ec2/faqs/#What_is_an_EC2_Compute_Unit_and_why_did_you_introduce_it Essentially we need to clearly define a YARN Virtual Core (YVC). Equivalently, we can use ECU itself: *One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.* -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
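To make the shape of that request concrete, here is a tiny hypothetical value object carrying the two numbers plus the hard/soft flag discussed in the comment above. None of these names exist in YARN; this only illustrates the proposal:
{code}
/** Hypothetical request object for the scheme described above - not a real YARN class. */
public class CpuRequest {
  private final int totalYcus;      // the value the scheduler actually schedules against
  private final int minYcusPerCore; // checked against a host's maxComputeUnitsPerCore
  private final boolean hardLimit;  // true: cap at totalYcus; false: may expand into idle CPU

  public CpuRequest(int totalYcus, int minYcusPerCore, boolean hardLimit) {
    this.totalYcus = totalYcus;
    this.minYcusPerCore = minYcusPerCore;
    this.hardLimit = hardLimit;
  }

  /** Reject requests that no host (or no host in a heterogeneous cluster) could satisfy. */
  public boolean isSatisfiableOn(int hostYcusPerCore) {
    return minYcusPerCore <= hostYcusPerCore;
  }

  public int getTotalYcus() { return totalYcus; }
  public int getMinYcusPerCore() { return minYcusPerCore; }
  public boolean isHardLimit() { return hardLimit; }
}
{code}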
[jira] [Commented] (YARN-624) Support gang scheduling in the AM RM protocol
[ https://issues.apache.org/jira/browse/YARN-624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13736894#comment-13736894 ] Robert Joseph Evans commented on YARN-624: -- From my perspective it does not really solve the problem for me. It comes close but is not perfect. I am interested in gang scheduling to support [storm on yarn|https://github.com/yahoo/storm-yarn/] The biggest issue I have with this design is knowing the size before the application is launched. The ultimate goal with storm is to have a system where multiple separate, but related, storm topologies are processing data using the same application. We would configure the queues so that if storm sees a spike in demand it can steal containers from batch processing to grow a topology and when the spike goes away it would release those containers back. If the number of containers changes dynamically, by both submitting new topologies and growing/shrinking existing ones, it is impossible to tell YARN what I need at the beginning. Gang scheduling is interesting for me because there is a specific number of containers that each topology is configured to need when that topology is launched. Without all of those containers there is no reason to launch a single part of the topology. I can see this happening with a modification to your approach where the all or nothing happens when the AM submits a request, and not when the AM is submitted. I also have a hard time seeing how this would work well with other advanced features like preemption. For preemption to work well with gang scheduling it needs to take into account that if it shoots a container in a gang of containers it is likely going to get back a lot more resources than just one container. If it is aware of this then it can still shoot the container, but avoid shooting other containers needlessly because it knows what it is going to get back. Support gang scheduling in the AM RM protocol - Key: YARN-624 URL: https://issues.apache.org/jira/browse/YARN-624 Project: Hadoop YARN Issue Type: Sub-task Components: api, scheduler Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Per discussion on YARN-392 and elsewhere, gang scheduling, in which a scheduler runs a set of tasks when they can all be run at the same time, would be a useful feature for YARN schedulers to support. Currently, AMs can approximate this by holding on to containers until they get all the ones they need. However, this lends itself to deadlocks when different AMs are waiting on the same containers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
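A rough sketch of what an all-or-nothing request made by a running AM (rather than one fixed at application submission) might carry. Every name here is hypothetical; the point is only that the gang size and membership travel with the request, so a preemption policy can reason about how much it frees by shooting one member:
{code}
import java.util.List;

/** Hypothetical gang request issued by a running AM - not an existing YARN API. */
public class GangRequest {
  private final String gangId;               // lets the scheduler tie the containers together
  private final int containers;              // grant all of these at once, or none of them
  private final int memoryMbPerContainer;
  private final List<String> preferredRacks; // "keep them close" hint, not a hard constraint

  public GangRequest(String gangId, int containers, int memoryMbPerContainer,
      List<String> preferredRacks) {
    this.gangId = gangId;
    this.containers = containers;
    this.memoryMbPerContainer = memoryMbPerContainer;
    this.preferredRacks = preferredRacks;
  }

  /**
   * What a gang-aware preemption policy gains: if one member is shot the whole
   * gang's resources come back, so it should not shoot more members than needed.
   */
  public int memoryMbFreedIfOnePreempted() {
    return containers * memoryMbPerContainer;
  }
}
{code}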
[jira] [Commented] (YARN-1024) Define a virtual core unambigiously
[ https://issues.apache.org/jira/browse/YARN-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13736919#comment-13736919 ] Robert Joseph Evans commented on YARN-1024: --- Perhaps I am missing something here. The goals Arun has asked for are simplicity, predictability, and consistency. Simplicity I totally agree with, but I do not totally agree with always having predictability and consistency after simplicity, and I do not agree that they are always required. These two come with a trade-off with utilization, and this is something that Sandy brought up, although not directly. For HBase, guaranteed resources, in terms of both parallelism and raw CPU speed, are important because it is using those to provide a service where predictability and consistency are needed. If the HBase AM cannot truly express to YARN what it needs because of simplicity, HBase on YARN will not be used, because it will not behave the way users need/expect it to. Similarly, if HBase is allowed to steal resources from others, you can easily request too few resources on an underutilized cluster, and when the cluster is under load it falls apart. This is similar for me with my desire for Storm on YARN. I am happy to use a complex API to express my needs if it means that I get what I need. On the other hand, if I am doing MR batch processing most of the time (but not all of it) I am doing single threaded processing and I really just want it to fill in the gaps and use as much unused CPU as it can. Yes, some MR jobs have strict SLAs but most do not, and it is best if we can provide a scheduler that can balance both. I also don't agree that because YARN lacks the ability to schedule everything that impacts performance, including network and disk IO, we should skip doing CPU correctly. Some applications are truly CPU bound and they will benefit. For other resources we can add them to YARN as they are needed until we do meet the goal of predictability and consistency. Define a virtual core unambigiously --- Key: YARN-1024 URL: https://issues.apache.org/jira/browse/YARN-1024 Project: Hadoop YARN Issue Type: Improvement Reporter: Arun C Murthy Assignee: Arun C Murthy We need to clearly define the meaning of a virtual core unambiguously so that it's easy to migrate applications between clusters. For e.g. here is Amazon EC2 definition of ECU: http://aws.amazon.com/ec2/faqs/#What_is_an_EC2_Compute_Unit_and_why_did_you_introduce_it Essentially we need to clearly define a YARN Virtual Core (YVC). Equivalently, we can use ECU itself: *One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.* -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-896) Roll up for long lived YARN
[ https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734983#comment-13734983 ] Robert Joseph Evans commented on YARN-896: -- Sorry I have not responded sooner. I have been out on vacation and had a high severity issue that has consumed a lot of my time. [~lmccay] and [~thw] There are many different services that long lived processes need to communicate with. Many of these services use tokens and others may not. Each of these tokens or other credentials are specific to the services being accessed. In some cases like with HBase we probably can take advantage of the existing renewal feature in the RM. With other tokens or credentials it may be different, and may require AM specific support for them. I am not really that concerned with solving the renewal problem for all possible credentials here, although if we can solve this for a lot of common tokens at the same time that would be great. What I care most about is being sure that a long lived YARN application does not necessarily have to stop and restart because an HDFS token cannot be renewed any longer. If there are changes going into the HDFS security model that would make YARN-941 unnecessary that is great. I have not had much time to follow the security discussion so thank you for pointing this out. But it is also a question of time frames. YARN-941 and YARN-1041 would allow for secure, robust, long lived applications on YARN, and do not appear to be that difficult to accomplish. Do you know the time frame for the security rework? Roll up for long lived YARN --- Key: YARN-896 URL: https://issues.apache.org/jira/browse/YARN-896 Project: Hadoop YARN Issue Type: New Feature Reporter: Robert Joseph Evans YARN is intended to be general purpose, but it is missing some features to be able to truly support long lived applications and long lived containers. This ticket is intended to # discuss what is needed to support long lived processes # track the resulting JIRA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-896) Roll up for long lived YARN
[ https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713692#comment-13713692 ] Robert Joseph Evans commented on YARN-896: -- [~thw] I am not totally sure what you mean by app specific tokens. Are these tokens that the app is going to use to connect to other services like HBase, or is it something else? [~eric14] and [~enis] Rolling upgrades is a very interesting use case. We can definitely add in a ticket to support this type of thing. I agree that it needs to be thought through some, and is going to require help from both the AM and YARN to do it properly. Roll up for long lived YARN --- Key: YARN-896 URL: https://issues.apache.org/jira/browse/YARN-896 Project: Hadoop YARN Issue Type: New Feature Reporter: Robert Joseph Evans YARN is intended to be general purpose, but it is missing some features to be able to truly support long lived applications and long lived containers. This ticket is intended to # discuss what is needed to support long lived processes # track the resulting JIRA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (YARN-941) RM Should have a way to update the tokens it has for a running application
Robert Joseph Evans created YARN-941: Summary: RM Should have a way to update the tokens it has for a running application Key: YARN-941 URL: https://issues.apache.org/jira/browse/YARN-941 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans When an application is submitted to the RM it includes with it a set of tokens that the RM will renew on behalf of the application, that will be passed to the AM when the application is launched, and will be used when launching the application to access HDFS to download files on behalf of the application. For long lived applications/services these tokens can expire, and then the tokens that the AM has will be invalid, and the tokens that the RM had will also not work to launch a new AM. We need to provide an API that will allow the RM to replace the current tokens for this application with a new set. To avoid any real race issues, I think this API should be something that the AM calls, so that the client can connect to the AM with a new set of tokens it got using kerberos, then the AM can inform the RM of the new set of tokens and quickly update its tokens internally to use these new ones. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
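A sketch of the shape such an AM-driven call could take (the protocol and method names here are hypothetical; only Credentials is an existing Hadoop class). The flow described above would be: the client fetches fresh tokens using its kerberos credentials, hands them to the AM, and the AM pushes them to the RM:
{code}
import org.apache.hadoop.security.Credentials;

/** Hypothetical AM-to-RM interface for replacing an application's tokens - not an existing protocol. */
public interface ApplicationTokenUpdateProtocol {
  /**
   * Replace the tokens the RM currently holds (and renews) for this application
   * with the new set the AM just received from a kerberos-authenticated client.
   */
  void updateApplicationTokens(String applicationId, Credentials newTokens);
}
{code}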
[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application
[ https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713961#comment-13713961 ] Robert Joseph Evans commented on YARN-941: -- I am punting on how/if we get the new HDFS token to NMs to be used for log aggregation. We need to think a bit more about how logs should be handled for long lived services before we spend a lot of time trying to make log aggregation work. RM Should have a way to update the tokens it has for a running application -- Key: YARN-941 URL: https://issues.apache.org/jira/browse/YARN-941 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans When an application is submitted to the RM it includes with it a set of tokens that the RM will renew on behalf of the application, that will be passed to the AM when the application is launched, and will be used when launching the application to access HDFS to download files on behalf of the application. For long lived applications/services these tokens can expire, and then the tokens that the AM has will be invalid, and the tokens that the RM had will also not work to launch a new AM. We need to provide an API that will allow the RM to replace the current tokens for this application with a new set. To avoid any real race issues, I think this API should be something that the AM calls, so that the client can connect to the AM with a new set of tokens it got using kerberos, then the AM can inform the RM of the new set of tokens and quickly update its tokens internally to use these new ones. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-896) Roll up for long lived YARN
[ https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713992#comment-13713992 ] Robert Joseph Evans commented on YARN-896: -- I filed one new JIRA for updating tokens in the RM YARN-941. I started to file a JIRA for the AM to be informed of the location of its already running containers, but as I was writing it I realized that it will not give us enough information to be able to reattach to the containers. The only thing it will give us is enough info to be able to go shoot the containers. Simply because there is no metadata about what port the container may be listening on or anything like that. It seems to me that we would be better off keeping a log, similar to the MR job history log, that has in it all the data the AM needs to look for running containers. If others see a different need for this API, I am still happy to file a JIRA for it. I have not filed a JIRA for anti-affinity yet either. I seem to remember another JIRA for something like this already, but I have not found it yet. I figure we can add in a long lived process flag for the scheduler when we run across a use case for it. The other parts discussed here, either already have a JIRA associated with the same functionality, or I think need a bit more discussion about exactly what we want to do. Namely log aggregation/processing and Hadoop package management/rolling upgrades of live applications. If I missed something please let me know. Roll up for long lived YARN --- Key: YARN-896 URL: https://issues.apache.org/jira/browse/YARN-896 Project: Hadoop YARN Issue Type: New Feature Reporter: Robert Joseph Evans YARN is intended to be general purpose, but it is missing some features to be able to truly support long lived applications and long lived containers. This ticket is intended to # discuss what is needed to support long lived processes # track the resulting JIRA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-896) Roll up for long lived YARN
[ https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13704623#comment-13704623 ] Robert Joseph Evans commented on YARN-896: -- Chris, Yes I missed the app master retry issue. Those two with the discussion on them seem to cover what we are looking for. Roll up for long lived YARN --- Key: YARN-896 URL: https://issues.apache.org/jira/browse/YARN-896 Project: Hadoop YARN Issue Type: New Feature Reporter: Robert Joseph Evans YARN is intended to be general purpose, but it is missing some features to be able to truly support long lived applications and long lived containers. This ticket is intended to # discuss what is needed to support long lived processes # track the resulting JIRA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-896) Roll up for long lived YARN
[ https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13703622#comment-13703622 ] Robert Joseph Evans commented on YARN-896: -- No comments in the past few days. I would like to hear from more people involved, even if it is just to say that it looks like we have everything covered here. Then we can start filing JIRAs and getting some work done. Roll up for long lived YARN --- Key: YARN-896 URL: https://issues.apache.org/jira/browse/YARN-896 Project: Hadoop YARN Issue Type: New Feature Reporter: Robert Joseph Evans YARN is intended to be general purpose, but it is missing some features to be able to truly support long lived applications and long lived containers. This ticket is intended to # discuss what is needed to support long lived processes # track the resulting JIRA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-896) Roll up for long lived YARN
[ https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698500#comment-13698500 ] Robert Joseph Evans commented on YARN-896: -- During the most recent Hadoop Summit there was a developer meetup where we discussed some of these issues. This is to summarize what was discussed at that meeting and to add in a few things that have also been discussed on mailing lists and other places. HDFS delegation tokens have a maximum lifetime. Currently tokens submitted to the RM when the app master is launched will be renewed by the RM until the application finishes and the logs from the application have finished aggregating. The only token currently used by the YARN framework is the HDFS delegation token. This is used to read files from HDFS as part of the distributed cache and to write the aggregated logs out to HDFS. In order to support relaunching an app master after the maximum lifetime of the HDFS delegation token has passed, we either need to allow for tokens that do not expire or provide an API to allow the RM to replace the old token with a new one. Because removing the maximum lifetime of a token reduces the security of the cluster as a whole I think it would be better to provide an API to replace the token with a new one. If we want to continue supporting log aggregation we also need to provide a way for the Node Managers to get the new token too. It is assumed that each app master will also provide an API to get the new token so it can start using it. Log aggregation is another issue, although not required for long lived applications to work. Logs are aggregated into HDFS when the application finishes. This is not really that useful for applications that are never intended to exit. Ideally the processing of logs by the node manager should be pluggable so that clusters and applications can select how and when logs are processed/displayed to the end user. Because many of these systems roll their logs to avoid filling up disks we will probably need a protocol of some sort for the container to communicate with the Node Manager when logs are ready to be processed. Another issue is to allow containers to outlive the app master that launched them and also to allow containers to outlive the node manager that launched them. This is especially critical for the stability of applications during rolling upgrades to YARN. Roll up for long lived YARN --- Key: YARN-896 URL: https://issues.apache.org/jira/browse/YARN-896 Project: Hadoop YARN Issue Type: New Feature Reporter: Robert Joseph Evans YARN is intended to be general purpose, but it is missing some features to be able to truly support long lived applications and long lived containers. This ticket is intended to # discuss what is needed to support long lived processes # track the resulting JIRA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-896) Roll up for long lived YARN
[ https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698505#comment-13698505 ] Robert Joseph Evans commented on YARN-896: -- Another issue that has been discussed in the past is the impact that long lived processes can have on resource scheduling. It is possible for a long lived process to grab lots of resources and then never release them even though it is using more resources than it would be allowed to have when the cluster is full. Recent preemption changes should be able to prevent this from happening between different queues/pools, but we may need to think if we need more control about this within a queue. Roll up for long lived YARN --- Key: YARN-896 URL: https://issues.apache.org/jira/browse/YARN-896 Project: Hadoop YARN Issue Type: New Feature Reporter: Robert Joseph Evans YARN is intended to be general purpose, but it is missing some features to be able to truly support long lived applications and long lived containers. This ticket is intended to # discuss what is needed to support long lived processes # track the resulting JIRA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-624) Support gang scheduling in the AM RM protocol
[ https://issues.apache.org/jira/browse/YARN-624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13662101#comment-13662101 ] Robert Joseph Evans commented on YARN-624: -- I would love to have it right now for storm too. If you want me to sign up as a use case I am happy to. Support gang scheduling in the AM RM protocol - Key: YARN-624 URL: https://issues.apache.org/jira/browse/YARN-624 Project: Hadoop YARN Issue Type: Sub-task Components: api, scheduler Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Per discussion on YARN-392 and elsewhere, gang scheduling, in which a scheduler runs a set of tasks when they can all be run at the same time, would be a useful feature for YARN schedulers to support. Currently, AMs can approximate this by holding on to containers until they get all the ones they need. However, this lends itself to deadlocks when different AMs are waiting on the same containers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-690) RM exits on token cancel/renew problems
[ https://issues.apache.org/jira/browse/YARN-690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13662105#comment-13662105 ] Robert Joseph Evans commented on YARN-690: -- Vinod, Yes creating and resolving a JIRA in 2 hours is not ideal, but this is a Blocker that consisted of only a handful of lines of change, and the bylaws explicitly state that a waiting period is not needed for this vote because committers can retroactively -1 and pull the change out. I agree that waiting to let others look at the code is good and if it were not a Blocker I would have waited. RM exits on token cancel/renew problems --- Key: YARN-690 URL: https://issues.apache.org/jira/browse/YARN-690 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0, 0.23.7, 2.0.5-beta Reporter: Daryn Sharp Assignee: Daryn Sharp Priority: Blocker Fix For: 3.0.0, 2.0.5-beta, 0.23.8 Attachments: YARN-690.patch, YARN-690.patch The DelegationTokenRenewer thread is critical to the RM. When a non-IOException occurs, the thread calls System.exit to prevent the RM from running w/o the thread. It should be exiting only on non-RuntimeExceptions. The problem is especially bad in 23 because the yarn protobuf layer converts IOExceptions into UndeclaredThrowableExceptions (RuntimeException) which causes the renewer to abort the process. An UnknownHostException takes down the RM... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-624) Support gang scheduling in the AM RM protocol
[ https://issues.apache.org/jira/browse/YARN-624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13662352#comment-13662352 ] Robert Joseph Evans commented on YARN-624: -- Storm is a real-time stream processing system. We are working on porting this to run on YARN. Storm will process one or more streams of data using a logical DAG of processing nodes called a topology. This topology runs in spawned processes. If there are not enough processes to run a topology there is no point in launching any of the processes. Hence the need for gang scheduling. It is a very simple gang scheduling use case currently. When a new topology is submitted we want to request enough resources to run that topology. If a node goes down, we are going to request enough resources to replace it, so we can get up and running again ASAP. When a topology is killed we want to release those resources. Long term we would like to make sure that the different containers are close to each other from a network topology perspective. We don't care which node or rack the containers are on, but we do care that they are all on the same node/rack as the other containers. Support gang scheduling in the AM RM protocol - Key: YARN-624 URL: https://issues.apache.org/jira/browse/YARN-624 Project: Hadoop YARN Issue Type: Sub-task Components: api, scheduler Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Per discussion on YARN-392 and elsewhere, gang scheduling, in which a scheduler runs a set of tasks when they can all be run at the same time, would be a useful feature for YARN schedulers to support. Currently, AMs can approximate this by holding on to containers until they get all the ones they need. However, this lends itself to deadlocks when different AMs are waiting on the same containers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-690) RM exits on token cancel/renew problems
[ https://issues.apache.org/jira/browse/YARN-690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13660001#comment-13660001 ] Robert Joseph Evans commented on YARN-690: -- I don't think this does what you want. Now an IOException will cause the same issue. I think you need to handle runtime and IOException separately. RM exits on token cancel/renew problems --- Key: YARN-690 URL: https://issues.apache.org/jira/browse/YARN-690 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0, 0.23.7, 2.0.5-beta Reporter: Daryn Sharp Assignee: Daryn Sharp Priority: Blocker Attachments: YARN-690.patch The DelegationTokenRenewer thread is critical to the RM. When a non-IOException occurs, the thread calls System.exit to prevent the RM from running w/o the thread. It should be exiting only on non-RuntimeExceptions. The problem is especially bad in 23 because the yarn protobuf layer converts IOExceptions into UndeclaredThrowableExceptions (RuntimeException) which causes the renewer to abort the process. An UnknownHostException takes down the RM... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
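In sketch form, the distinction being asked for (this is not the actual DelegationTokenRenewer code): an IOException from a renew/cancel should be handled as a per-token failure, and only a genuinely unexpected RuntimeException should be allowed to reach the fatal path:
{code}
import java.io.IOException;

/** Sketch of the requested control flow - not the actual DelegationTokenRenewer code. */
public class RenewLoopSketch {
  interface RenewAction { void renew() throws IOException; }

  void renewOne(RenewAction action) {
    try {
      action.renew();
    } catch (IOException e) {
      // Expected failure mode (e.g. UnknownHostException): log it and keep the
      // renewer thread - and the RM - alive.
      System.err.println("Token renew failed, will retry later: " + e);
    } catch (RuntimeException e) {
      // Genuinely unexpected programming error: let it propagate so the fatal
      // handling applies only to these, not to remote/IO problems.
      throw e;
    }
  }
}
{code}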
[jira] [Commented] (YARN-690) RM exits on token cancel/renew problems
[ https://issues.apache.org/jira/browse/YARN-690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13660035#comment-13660035 ] Robert Joseph Evans commented on YARN-690: -- The change looks fine to me now. +1 RM exits on token cancel/renew problems --- Key: YARN-690 URL: https://issues.apache.org/jira/browse/YARN-690 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0, 0.23.7, 2.0.5-beta Reporter: Daryn Sharp Assignee: Daryn Sharp Priority: Blocker Attachments: YARN-690.patch, YARN-690.patch The DelegationTokenRenewer thread is critical to the RM. When a non-IOException occurs, the thread calls System.exit to prevent the RM from running w/o the thread. It should be exiting only on non-RuntimeExceptions. The problem is especially bad in 23 because the yarn protobuf layer converts IOExceptions into UndeclaredThrowableExceptions (RuntimeException) which causes the renewer to abort the process. An UnknownHostException takes down the RM... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-528) Make IDs read only
[ https://issues.apache.org/jira/browse/YARN-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13645991#comment-13645991 ] Robert Joseph Evans commented on YARN-528: -- The approach seems OK to me, but I would rather have the impl be an even thinner wrapper. {code} private ApplicationIdProto proto = null; private ApplicationIdProto.Builder builder = null; ApplicationIdPBImpl(ApplicationIdProto proto) { this.proto = proto; } public ApplicationIdPBImpl() { this.builder = ApplicationIdProto.newBuilder(); } public ApplicationIdProto getProto() { assert (proto != null); return proto; } @Override public int getId() { assert (proto != null); return proto.getId(); } @Override protected void setId(int id) { assert (builder != null); builder.setId((id)); } @Override public long getClusterTimestamp() { assert(proto != null); return proto.getClusterTimestamp(); } @Override protected void setClusterTimestamp(long clusterTimestamp) { assert(builder != null); builder.setClusterTimestamp((clusterTimestamp)); } @Override protected void build() { assert(builder != null); proto = builder.build(); builder = null; } {code} Make IDs read only -- Key: YARN-528 URL: https://issues.apache.org/jira/browse/YARN-528 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans Assignee: Robert Joseph Evans Attachments: y528_AppIdPart_01_Refactor.txt, y528_AppIdPart_02_AppIdChanges.txt, y528_AppIdPart_03_fixUsage.txt, y528_ApplicationIdComplete_WIP.txt, YARN-528.txt, YARN-528.txt I really would like to rip out most if not all of the abstraction layer that sits in-between Protocol Buffers, the RPC, and the actual user code. We have no plans to support any other serialization type, and the abstraction layer just, makes it more difficult to change protocols, makes changing them more error prone, and slows down the objects themselves. Completely doing that is a lot of work. This JIRA is a first step towards that. It makes the various ID objects immutable. If this patch is wel received I will try to go through other objects/classes of objects and update them in a similar way. This is probably the last time we will be able to make a change like this before 2.0 stabilizes and YARN APIs will not be able to be changed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-528) Make IDs read only
[ https://issues.apache.org/jira/browse/YARN-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13644784#comment-13644784 ] Robert Joseph Evans commented on YARN-528: -- Thanks for doing this Sid. I started pulling on the string and there was just too much involved, so I had to stop. Make IDs read only -- Key: YARN-528 URL: https://issues.apache.org/jira/browse/YARN-528 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans Assignee: Robert Joseph Evans Attachments: y528_AppIdPart_01_Refactor.txt, y528_AppIdPart_02_AppIdChanges.txt, y528_AppIdPart_03_fixUsage.txt, y528_ApplicationIdComplete_WIP.txt, YARN-528.txt, YARN-528.txt I really would like to rip out most if not all of the abstraction layer that sits in-between Protocol Buffers, the RPC, and the actual user code. We have no plans to support any other serialization type, and the abstraction layer just, makes it more difficult to change protocols, makes changing them more error prone, and slows down the objects themselves. Completely doing that is a lot of work. This JIRA is a first step towards that. It makes the various ID objects immutable. If this patch is wel received I will try to go through other objects/classes of objects and update them in a similar way. This is probably the last time we will be able to make a change like this before 2.0 stabilizes and YARN APIs will not be able to be changed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-528) Make IDs read only
[ https://issues.apache.org/jira/browse/YARN-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13621023#comment-13621023 ] Robert Joseph Evans commented on YARN-528: -- OK, I understand now. I will try to find some time to play around with getting the AM ID to not have a wrapper at all. Make IDs read only -- Key: YARN-528 URL: https://issues.apache.org/jira/browse/YARN-528 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans Assignee: Robert Joseph Evans Attachments: YARN-528.txt, YARN-528.txt I really would like to rip out most if not all of the abstraction layer that sits in-between Protocol Buffers, the RPC, and the actual user code. We have no plans to support any other serialization type, and the abstraction layer just, makes it more difficult to change protocols, makes changing them more error prone, and slows down the objects themselves. Completely doing that is a lot of work. This JIRA is a first step towards that. It makes the various ID objects immutable. If this patch is wel received I will try to go through other objects/classes of objects and update them in a similar way. This is probably the last time we will be able to make a change like this before 2.0 stabilizes and YARN APIs will not be able to be changed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (YARN-528) Make IDs read only
Robert Joseph Evans created YARN-528: Summary: Make IDs read only Key: YARN-528 URL: https://issues.apache.org/jira/browse/YARN-528 Project: Hadoop YARN Issue Type: Improvement Reporter: Robert Joseph Evans I really would like to rip out most if not all of the abstraction layer that sits in-between Protocol Buffers, the RPC, and the actual user code. We have no plans to support any other serialization type, and the abstraction layer just, makes it more difficult to change protocols, makes changing them more error prone, and slows down the objects themselves. Completely doing that is a lot of work. This JIRA is a first step towards that. It makes the various ID objects immutable. If this patch is wel received I will try to go through other objects/classes of objects and update them in a similar way. This is probably the last time we will be able to make a change like this before 2.0 stabilizes and YARN APIs will not be able to be changed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-528) Make IDs read only
[ https://issues.apache.org/jira/browse/YARN-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated YARN-528: - Attachment: YARN-528.txt This patch contains changes to both Map/Reduce IDs as well as YARN APIs. I don't really want to split them up right now, but I am happy to file a separate JIRA for tracking purposes if the community decides this is a direction we want to go in. Make IDs read only -- Key: YARN-528 URL: https://issues.apache.org/jira/browse/YARN-528 Project: Hadoop YARN Issue Type: Improvement Reporter: Robert Joseph Evans Attachments: YARN-528.txt I really would like to rip out most if not all of the abstraction layer that sits in-between Protocol Buffers, the RPC, and the actual user code. We have no plans to support any other serialization type, and the abstraction layer just, makes it more difficult to change protocols, makes changing them more error prone, and slows down the objects themselves. Completely doing that is a lot of work. This JIRA is a first step towards that. It makes the various ID objects immutable. If this patch is wel received I will try to go through other objects/classes of objects and update them in a similar way. This is probably the last time we will be able to make a change like this before 2.0 stabilizes and YARN APIs will not be able to be changed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (YARN-528) Make IDs read only
[ https://issues.apache.org/jira/browse/YARN-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans reassigned YARN-528: Assignee: Robert Joseph Evans Make IDs read only -- Key: YARN-528 URL: https://issues.apache.org/jira/browse/YARN-528 Project: Hadoop YARN Issue Type: Improvement Reporter: Robert Joseph Evans Assignee: Robert Joseph Evans Attachments: YARN-528.txt I really would like to rip out most if not all of the abstraction layer that sits in-between Protocol Buffers, the RPC, and the actual user code. We have no plans to support any other serialization type, and the abstraction layer just, makes it more difficult to change protocols, makes changing them more error prone, and slows down the objects themselves. Completely doing that is a lot of work. This JIRA is a first step towards that. It makes the various ID objects immutable. If this patch is wel received I will try to go through other objects/classes of objects and update them in a similar way. This is probably the last time we will be able to make a change like this before 2.0 stabilizes and YARN APIs will not be able to be changed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-528) Make IDs read only
[ https://issues.apache.org/jira/browse/YARN-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13619911#comment-13619911 ] Robert Joseph Evans commented on YARN-528: -- The build failed, because it needs to be upmerged, again :( Make IDs read only -- Key: YARN-528 URL: https://issues.apache.org/jira/browse/YARN-528 Project: Hadoop YARN Issue Type: Improvement Reporter: Robert Joseph Evans Assignee: Robert Joseph Evans Attachments: YARN-528.txt I really would like to rip out most if not all of the abstraction layer that sits in-between Protocol Buffers, the RPC, and the actual user code. We have no plans to support any other serialization type, and the abstraction layer just, makes it more difficult to change protocols, makes changing them more error prone, and slows down the objects themselves. Completely doing that is a lot of work. This JIRA is a first step towards that. It makes the various ID objects immutable. If this patch is wel received I will try to go through other objects/classes of objects and update them in a similar way. This is probably the last time we will be able to make a change like this before 2.0 stabilizes and YARN APIs will not be able to be changed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-528) Make IDs read only
[ https://issues.apache.org/jira/browse/YARN-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated YARN-528: - Attachment: YARN-528.txt Upmerged Make IDs read only -- Key: YARN-528 URL: https://issues.apache.org/jira/browse/YARN-528 Project: Hadoop YARN Issue Type: Improvement Reporter: Robert Joseph Evans Assignee: Robert Joseph Evans Attachments: YARN-528.txt, YARN-528.txt I really would like to rip out most if not all of the abstraction layer that sits in-between Protocol Buffers, the RPC, and the actual user code. We have no plans to support any other serialization type, and the abstraction layer just, makes it more difficult to change protocols, makes changing them more error prone, and slows down the objects themselves. Completely doing that is a lot of work. This JIRA is a first step towards that. It makes the various ID objects immutable. If this patch is wel received I will try to go through other objects/classes of objects and update them in a similar way. This is probably the last time we will be able to make a change like this before 2.0 stabilizes and YARN APIs will not be able to be changed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-528) Make IDs read only
[ https://issues.apache.org/jira/browse/YARN-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13620326#comment-13620326 ] Robert Joseph Evans commented on YARN-528: -- I am fine with splitting the MR changes from the YARN change. Like I said, I put this out here more as a question of how we want to go about implementing these changes, and the test was more of a prototype example. I personally lean more towards using the *Proto classes directly. Why have something else wrapping it if we don't need it, even if it is a small and simple layer. The only reason I did not go that route here is because of toString(). With the IDs we rely on having ID.toString() turn into something very specific that can be parsed and turned back into an instance of the object. If I had the time I would trace down all places where we call toString on them and replace it with something else. I may just scale back the scope of the patch to look at ApplicationID to begin with and try to see if I can accomplish this. bq. Wrapping the object which came over the wire - with a goal of creating fewer objects. I really don't understand how this is supposed to work. How do we create fewer objects by wrapping them in more objects? I can see us doing something like deduping the objects that come over the wire, but I don't see how wrapping works here. Make IDs read only -- Key: YARN-528 URL: https://issues.apache.org/jira/browse/YARN-528 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans Assignee: Robert Joseph Evans Attachments: YARN-528.txt, YARN-528.txt I really would like to rip out most if not all of the abstraction layer that sits in-between Protocol Buffers, the RPC, and the actual user code. We have no plans to support any other serialization type, and the abstraction layer just, makes it more difficult to change protocols, makes changing them more error prone, and slows down the objects themselves. Completely doing that is a lot of work. This JIRA is a first step towards that. It makes the various ID objects immutable. If this patch is wel received I will try to go through other objects/classes of objects and update them in a similar way. This is probably the last time we will be able to make a change like this before 2.0 stabilizes and YARN APIs will not be able to be changed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
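For context, the round trip the comment above is worried about looks roughly like the sketch below. The helper class is hypothetical; it only illustrates why a raw generated *Proto class, whose toString() is the protobuf text format, would break callers that parse the current "application_<clusterTimestamp>_<id>" form:
{code}
/** Hypothetical helper illustrating the toString()/parse contract discussed above. */
public final class AppIdStrings {
  /** e.g. "application_1371141151815_0003" */
  static String toStringForm(long clusterTimestamp, int id) {
    return String.format("application_%d_%04d", clusterTimestamp, id);
  }

  /** Parse back into (clusterTimestamp, id); fails on anything else. */
  static long[] parse(String s) {
    String[] parts = s.split("_");
    if (parts.length != 3 || !"application".equals(parts[0])) {
      throw new IllegalArgumentException("Not an application id: " + s);
    }
    return new long[] { Long.parseLong(parts[1]), Long.parseLong(parts[2]) };
  }
}
{code}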
[jira] [Commented] (YARN-515) Node Manager not getting the master key
[ https://issues.apache.org/jira/browse/YARN-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13618784#comment-13618784 ] Robert Joseph Evans commented on YARN-515: -- Having people always test the patches in secure mode I think is a bit too high of a barrier for some. I personally hate having to get it all set up to be able to test a patch. Registration responses in general were broken. The NM would never get a reboot signal either. It was always the default enum value of everything is fine. I am just glad that we caught it. Node Manager not getting the master key --- Key: YARN-515 URL: https://issues.apache.org/jira/browse/YARN-515 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.0.4-alpha Reporter: Robert Joseph Evans Assignee: Robert Joseph Evans Priority: Blocker Fix For: 2.0.5-beta Attachments: YARN-515.txt On branch-2 the latest version I see the following on a secure cluster. {noformat} 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Security enabled - updating secret keys now 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as RM:PORT with total resource of me mory:12288, vCores:16 2013-03-28 19:21:06,244 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is started. 2013-03-28 19:21:06,245 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started. 2013-03-28 19:21:07,257 [Node Status Updater] ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught exception in status-updater java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:121) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:407) {noformat} The Null pointer exception just keeps repeating and all of the nodes end up being lost. It looks like it never gets the secret key when it registers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617367#comment-13617367 ] Robert Joseph Evans commented on YARN-112: -- Vinod, I just glanced at the latest patch, I did not read it in detail, so if you say it covers that case I trust you. Race in localization can cause containers to fail - Key: YARN-112 URL: https://issues.apache.org/jira/browse/YARN-112 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.3 Reporter: Jason Lowe Assignee: Omkar Vinit Joshi Attachments: yarn-112-20130325.1.patch, yarn-112-20130325.patch, yarn-112-20130326.patch, yarn-112.20131503.patch On one of our 0.23 clusters, I saw a case of two containers, corresponding to two map tasks of a MR job, that were launched almost simultaneously on the same node. It appears they both tried to localize job.jar and job.xml at the same time. One of the containers failed when it couldn't rename the temporary job.jar directory to its final name because the target directory wasn't empty. Shortly afterwards the second container failed because job.xml could not be found, presumably because the first container removed it when it cleaned up. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-515) Node Manager not getting the master key
[ https://issues.apache.org/jira/browse/YARN-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617373#comment-13617373 ] Robert Joseph Evans commented on YARN-515: -- This issue appears to be caused by a bug in RegisterNodeManagerResponsePBImpl. I think specifically it was caused by YARN-440. I have a unit test that can reproduce it. Sid reviewed YARN-440 and he is a really smart guy. I looked at it thinking that it must be the cause of the issue and I didn't see anything in there that was off. I just think all this extra code to try and wrap the protocol buffers is just a bad idea. It makes it difficult to change a .proto file, and it just slows things down. But it is a lot of work to change it so I am done with my rant now, I'll go find a fix for the issue. Node Manager not getting the master key --- Key: YARN-515 URL: https://issues.apache.org/jira/browse/YARN-515 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.0.4-alpha Reporter: Robert Joseph Evans Priority: Blocker On branch-2 the latest version I see the following on a secure cluster. {noformat} 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Security enabled - updating secret keys now 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as RM:PORT with total resource of memory:12288, vCores:16 2013-03-28 19:21:06,244 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is started. 2013-03-28 19:21:06,245 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started. 2013-03-28 19:21:07,257 [Node Status Updater] ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught exception in status-updater java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:121) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:407) {noformat} The Null pointer exception just keeps repeating and all of the nodes end up being lost. It looks like it never gets the secret key when it registers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-515) Node Manager not getting the master key
[ https://issues.apache.org/jira/browse/YARN-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617379#comment-13617379 ] Robert Joseph Evans commented on YARN-515: -- Yes the issue is that there is a rebuild flag in the PBImpl that is never set to true, so it will never rebuild the proto. Node Manager not getting the master key --- Key: YARN-515 URL: https://issues.apache.org/jira/browse/YARN-515 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.0.4-alpha Reporter: Robert Joseph Evans Priority: Blocker On branch-2 the latest version I see the following on a secure cluster. {noformat} 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Security enabled - updating secret keys now 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as RM:PORT with total resource of me mory:12288, vCores:16 2013-03-28 19:21:06,244 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is started. 2013-03-28 19:21:06,245 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started. 2013-03-28 19:21:07,257 [Node Status Updater] ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught exception in status-updater java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:121) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:407) {noformat} The Null pointer exception just keeps repeating and all of the nodes end up being lost. It looks like it never gets the secret key when it registers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-515) Node Manager not getting the master key
[ https://issues.apache.org/jira/browse/YARN-515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated YARN-515: - Attachment: YARN-515.txt This should fix the issue. We forgot to tell the wrapper to rebuild after setting some values. There is a unit test included that shows the problem. Node Manager not getting the master key --- Key: YARN-515 URL: https://issues.apache.org/jira/browse/YARN-515 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.0.4-alpha Reporter: Robert Joseph Evans Priority: Blocker Attachments: YARN-515.txt On branch-2 the latest version I see the following on a secure cluster. {noformat} 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Security enabled - updating secret keys now 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as RM:PORT with total resource of me mory:12288, vCores:16 2013-03-28 19:21:06,244 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is started. 2013-03-28 19:21:06,245 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started. 2013-03-28 19:21:07,257 [Node Status Updater] ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught exception in status-updater java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:121) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:407) {noformat} The Null pointer exception just keeps repeating and all of the nodes end up being lost. It looks like it never gets the secret key when it registers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
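For context, the failure mode being fixed here is the common record-wrapper pattern where a setter updates the builder but never flips the flag that forces getProto() to rebuild. The sketch below uses invented names (SketchResponsePBImpl, setKey) and a plain String in place of the protobuf message, so it only illustrates the bug class, not the actual RegisterNodeManagerResponsePBImpl patch.
{code}
// Illustration only: invented names, and a String stands in for the protobuf message.
public final class SketchResponsePBImpl {
  private String protoSnapshot = "";          // stand-in for the built proto
  private final StringBuilder builder = new StringBuilder();
  private boolean viaProto = true;            // true = protoSnapshot is current

  public void setKey(String key) {
    builder.setLength(0);
    builder.append(key);
    viaProto = false;                         // the flip that was effectively missing
  }

  public String getProto() {
    if (!viaProto) {                          // rebuild only when a setter has run
      protoSnapshot = builder.toString();
      viaProto = true;
    }
    return protoSnapshot;                     // without the flip, this stays empty forever
  }
}
{code}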
[jira] [Assigned] (YARN-515) Node Manager not getting the master key
[ https://issues.apache.org/jira/browse/YARN-515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans reassigned YARN-515: Assignee: Robert Joseph Evans Node Manager not getting the master key --- Key: YARN-515 URL: https://issues.apache.org/jira/browse/YARN-515 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.0.4-alpha Reporter: Robert Joseph Evans Assignee: Robert Joseph Evans Priority: Blocker Attachments: YARN-515.txt On branch-2 the latest version I see the following on a secure cluster. {noformat} 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Security enabled - updating secret keys now 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as RM:PORT with total resource of me mory:12288, vCores:16 2013-03-28 19:21:06,244 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is started. 2013-03-28 19:21:06,245 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started. 2013-03-28 19:21:07,257 [Node Status Updater] ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught exception in status-updater java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:121) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:407) {noformat} The Null pointer exception just keeps repeating and all of the nodes end up being lost. It looks like it never gets the secret key when it registers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616325#comment-13616325 ] Robert Joseph Evans commented on YARN-112: -- I agree that scale exposes races, but still the underlying problem is that we want to create a new unique directory. This seems very simple.
{code}
File uniqueDir = null;
do {
  uniqueDir = new File(baseDir, String.valueOf(rand.nextLong()));
} while (!uniqueDir.mkdir());
{code}
I don't see why we are going through all of this complexity, simply because a FileContext API is broken. Playing games to make the race less likely is fine. But ultimately we still have to handle the race. Race in localization can cause containers to fail - Key: YARN-112 URL: https://issues.apache.org/jira/browse/YARN-112 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.3 Reporter: Jason Lowe Assignee: Omkar Vinit Joshi Attachments: yarn-112-20130325.1.patch, yarn-112-20130325.patch, yarn-112-20130326.patch, yarn-112.20131503.patch On one of our 0.23 clusters, I saw a case of two containers, corresponding to two map tasks of a MR job, that were launched almost simultaneously on the same node. It appears they both tried to localize job.jar and job.xml at the same time. One of the containers failed when it couldn't rename the temporary job.jar directory to its final name because the target directory wasn't empty. Shortly afterwards the second container failed because job.xml could not be found, presumably because the first container removed it when it cleaned up. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616327#comment-13616327 ] Robert Joseph Evans commented on YARN-112: -- Oh and the latest patch using a unique number will not always work, because the same code is used from different processes on the same box. We would have to have a way to guarantee uniqueness between the different processes. CurrentTimeMillis helps but still could result in a race. Race in localization can cause containers to fail - Key: YARN-112 URL: https://issues.apache.org/jira/browse/YARN-112 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.3 Reporter: Jason Lowe Assignee: Omkar Vinit Joshi Attachments: yarn-112-20130325.1.patch, yarn-112-20130325.patch, yarn-112-20130326.patch, yarn-112.20131503.patch On one of our 0.23 clusters, I saw a case of two containers, corresponding to two map tasks of a MR job, that were launched almost simultaneously on the same node. It appears they both tried to localize job.jar and job.xml at the same time. One of the containers failed when it couldn't rename the temporary job.jar directory to its final name because the target directory wasn't empty. Shortly afterwards the second container failed because job.xml could not be found, presumably because the first container removed it when it cleaned up. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
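One hedged sketch of handling cross-process collisions, assuming the mkdir-retry approach suggested a couple of comments above: because File.mkdir() is atomic at the filesystem level, two processes that pick the same name simply race on mkdir() and the loser retries with a new name. The class and method names are illustrative, not existing NodeManager code.
{code}
import java.io.File;
import java.io.IOException;
import java.security.SecureRandom;

public final class UniqueDirs {
  private static final SecureRandom RAND = new SecureRandom();

  public static File createUniqueDir(File baseDir) throws IOException {
    for (int attempt = 0; attempt < 100; attempt++) {
      File candidate = new File(baseDir, Long.toHexString(RAND.nextLong()));
      if (candidate.mkdir()) {
        return candidate;        // only one process can win this name
      }
      // collision with another process (or thread): draw a new name and retry
    }
    throw new IOException("Could not create a unique directory under " + baseDir);
  }
}
{code}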
[jira] [Created] (YARN-515) Node Manager not getting the master key
Robert Joseph Evans created YARN-515: Summary: Node Manager not getting the master key Key: YARN-515 URL: https://issues.apache.org/jira/browse/YARN-515 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.0.4-alpha Reporter: Robert Joseph Evans Priority: Blocker On branch-2 the latest version I see the following on a secure cluster. {noformat} 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Security enabled - updating secret keys now 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as RM:PORT with total resource of me mory:12288, vCores:16 2013-03-28 19:21:06,244 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is started. 2013-03-28 19:21:06,245 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started. 2013-03-28 19:21:07,257 [Node Status Updater] ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught exception in status-updater java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:121) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:407) {noformat} The Null pointer exception just keeps repeating and all of the nodes end up being lost. It looks like it never gets the secret key when it registers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-515) Node Manager not getting the master key
[ https://issues.apache.org/jira/browse/YARN-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616628#comment-13616628 ] Robert Joseph Evans commented on YARN-515: -- OK It actually looks like the NM is trying to get the Master Key, before it ever has set it, which is causing the NPE. Node Manager not getting the master key --- Key: YARN-515 URL: https://issues.apache.org/jira/browse/YARN-515 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.0.4-alpha Reporter: Robert Joseph Evans Priority: Blocker On branch-2 the latest version I see the following on a secure cluster. {noformat} 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Security enabled - updating secret keys now 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as RM:PORT with total resource of me mory:12288, vCores:16 2013-03-28 19:21:06,244 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is started. 2013-03-28 19:21:06,245 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started. 2013-03-28 19:21:07,257 [Node Status Updater] ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught exception in status-updater java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:121) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:407) {noformat} The Null pointer exception just keeps repeating and all of the nodes end up being lost. It looks like it never gets the secret key when it registers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
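Independent of the root cause, the NPE itself comes from reading a key that was never set. A minimal, hypothetical guard is sketched below (invented names, not the real BaseContainerTokenSecretManager); it would at least turn the silent NPE loop into a clear error message.
{code}
public final class SketchKeyHolder {
  private volatile byte[] currentMasterKey;   // set from the registration response

  public void setMasterKey(byte[] key) {
    currentMasterKey = key;
  }

  public byte[] getCurrentKey() {
    byte[] key = currentMasterKey;
    if (key == null) {
      // fail loudly instead of letting callers hit a bare NullPointerException
      throw new IllegalStateException(
          "Master key requested before the ResourceManager provided one");
    }
    return key;
  }
}
{code}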
[jira] [Commented] (YARN-515) Node Manager not getting the master key
[ https://issues.apache.org/jira/browse/YARN-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616698#comment-13616698 ] Robert Joseph Evans commented on YARN-515: -- This is really odd. I put in logging in the ResourceTrackerService and in the NodeStatusUpdaterImpl. The RM sets the secret key in the RegisterNodeManagerResponse, but the NM only sees a null come out for it. Because of that the heartbeat always fails with the NPE trying to read something that was never set. Node Manager not getting the master key --- Key: YARN-515 URL: https://issues.apache.org/jira/browse/YARN-515 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.0.4-alpha Reporter: Robert Joseph Evans Priority: Blocker On branch-2 the latest version I see the following on a secure cluster. {noformat} 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Security enabled - updating secret keys now 2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as RM:PORT with total resource of me mory:12288, vCores:16 2013-03-28 19:21:06,244 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is started. 2013-03-28 19:21:06,245 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started. 2013-03-28 19:21:07,257 [Node Status Updater] ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught exception in status-updater java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:121) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:407) {noformat} The Null pointer exception just keeps repeating and all of the nodes end up being lost. It looks like it never gets the secret key when it registers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-378) ApplicationMaster retry times should be set by Client
[ https://issues.apache.org/jira/browse/YARN-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13614036#comment-13614036 ] Robert Joseph Evans commented on YARN-378: -- Hitesh and Vinod, It is not a big deal. I realized that both were going in, and I am glad that this is ready and has gone in. It is a great feature. It just would have been nice to either commit them at the same time, or give a heads up on the mailing list that you were going to break the build for a little while. ApplicationMaster retry times should be set by Client - Key: YARN-378 URL: https://issues.apache.org/jira/browse/YARN-378 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Environment: suse Reporter: xieguiming Assignee: Zhijie Shen Labels: usability Fix For: 2.0.5-beta Attachments: YARN-378_10.patch, YARN-378_11.patch, YARN-378_1.patch, YARN-378_2.patch, YARN-378_3.patch, YARN-378_4.patch, YARN-378_5.patch, YARN-378_6.patch, YARN-378_6.patch, YARN-378_7.patch, YARN-378_8.patch, YARN-378_9.patch, YARN_378-final-commit.patch, YARN-378_MAPREDUCE-5062.2.patch, YARN-378_MAPREDUCE-5062.patch We should support that different client or user have different ApplicationMaster retry times. It also say that yarn.resourcemanager.am.max-retries should be set by client. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13614402#comment-13614402 ] Robert Joseph Evans commented on YARN-112: -- I am not really sure that we fixed the underlying issue. {code}files.rename(dst_work, destDirPath, Rename.OVERWRITE);{code} threw an exception because there was something else in that directory already, but files.mkdir(destDirPath, cachePerms, false) is supposed to throw a FileAlreadyExistsException if the directory already exists. http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileContext.html#mkdir%28org.apache.hadoop.fs.Path,%20org.apache.hadoop.fs.permission.FsPermission,%20boolean%29 files.rename should never get into this situation if files.mkdir threw the exception when it was supposed to. I tested this and
{code}
FileContext lfc = FileContext.getLocalFSFileContext(new Configuration());
Path p = new Path("/tmp/bobby.12345");
FsPermission cachePerms = new FsPermission((short) 0755);
lfc.mkdir(p, cachePerms, false);
lfc.mkdir(p, cachePerms, false);
{code}
never throws an exception. We first need to address the bug in FileContext, and then we can look at how we can make FSDownload deal with mkdir throwing an exception, or whatever the fix ends up being. I filed HADOOP-9438 for this. If the fix ends up being that we do not support throwing the exception in FileContext, then your current solution looks OK. I also have a hard time believing that we are getting random collisions on a long value that should be fairly uniformly distributed. We need to guard against it either way and I suppose it is possible, but if I remember correctly we were seeing a significant number of these errors and my gut tells me that there is either something very wrong with Random, or there is something else also going on here. Race in localization can cause containers to fail - Key: YARN-112 URL: https://issues.apache.org/jira/browse/YARN-112 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.3 Reporter: Jason Lowe Assignee: omkar vinit joshi Attachments: yarn-112-20130325.1.patch, yarn-112-20130325.patch, yarn-112.20131503.patch On one of our 0.23 clusters, I saw a case of two containers, corresponding to two map tasks of a MR job, that were launched almost simultaneously on the same node. It appears they both tried to localize job.jar and job.xml at the same time. One of the containers failed when it couldn't rename the temporary job.jar directory to its final name because the target directory wasn't empty. Shortly afterwards the second container failed because job.xml could not be found, presumably because the first container removed it when it cleaned up. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
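Assuming HADOOP-9438 eventually makes FileContext.mkdir throw FileAlreadyExistsException as documented, one possible shape for the FSDownload side is sketched below. This is not the committed fix; the class and method names are invented, and it only shows the retry-on-collision idea argued for in the comments above.
{code}
import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.fs.FileAlreadyExistsException;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public final class DownloadDirs {
  public static Path createDestDir(FileContext lfc, Path baseDir,
      FsPermission cachePerms, Random rand) throws IOException {
    while (true) {
      Path candidate = new Path(baseDir, Long.toHexString(rand.nextLong()));
      try {
        lfc.mkdir(candidate, cachePerms, false);
        return candidate;                       // nobody else grabbed this name
      } catch (FileAlreadyExistsException collision) {
        // another localizer won the race for this name; pick a new one and retry
      }
    }
  }
}
{code}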
[jira] [Reopened] (YARN-378) ApplicationMaster retry times should be set by Client
[ https://issues.apache.org/jira/browse/YARN-378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans reopened YARN-378: -- Looks like something was missed {noformat} [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.5.1:compile (default-compile) on project hadoop-mapreduce-client-app: Compilation failure: Compilation failure: [ERROR] /home/evans/src/commit/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java:[227,52] cannot find symbol [ERROR] symbol : variable RM_AM_MAX_RETRIES [ERROR] location: class org.apache.hadoop.yarn.conf.YarnConfiguration [ERROR] /home/evans/src/commit/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java:[228,25] cannot find symbol [ERROR] symbol : variable DEFAULT_RM_AM_MAX_RETRIES [ERROR] location: class org.apache.hadoop.yarn.conf.YarnConfiguration {noformat} Please fix this ASAP. ApplicationMaster retry times should be set by Client - Key: YARN-378 URL: https://issues.apache.org/jira/browse/YARN-378 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Environment: suse Reporter: xieguiming Assignee: Zhijie Shen Labels: usability Fix For: 2.0.5-beta Attachments: YARN-378_10.patch, YARN-378_11.patch, YARN-378_1.patch, YARN-378_2.patch, YARN-378_3.patch, YARN-378_4.patch, YARN-378_5.patch, YARN-378_6.patch, YARN-378_6.patch, YARN-378_7.patch, YARN-378_8.patch, YARN-378_9.patch, YARN_378-final-commit.patch, YARN-378_MAPREDUCE-5062.2.patch, YARN-378_MAPREDUCE-5062.patch We should support that different client or user have different ApplicationMaster retry times. It also say that yarn.resourcemanager.am.max-retries should be set by client. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-109) .tmp file is not deleted for localized archives
[ https://issues.apache.org/jira/browse/YARN-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13608919#comment-13608919 ] Robert Joseph Evans commented on YARN-109: -- Findbugs is complaining that you are ignoring the return value of the delete call. It should not be a problem so either use the return value to log a warning when it fails or update the findbugs filter to filter out the error. The -1 for the test timeouts is caused by a bug in the script used to detect these, so you can either ignore it, or add a timeout to any @Test that appears in the patch file, including the ones you didn't add :(. In the test, please uncomment the lines to clean up after the test. Are they causing a problem for the test to pass? Or was it just for debugging? Also I personally would prefer to have a few small jar/tar/zip files checked into the repository instead of generating them on the fly for the test. It will speed up the test and have fewer dependencies on the system being set up with the exact commands, i.e. bash for windows support. Although if you don't feel like changing it I am fine with that too, most of those commands are used by FSDownload already so it is not that critical. .tmp file is not deleted for localized archives --- Key: YARN-109 URL: https://issues.apache.org/jira/browse/YARN-109 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.3, 2.0.0-alpha Reporter: Jason Lowe Assignee: Mayank Bansal Attachments: YARN-109-trunk-1.patch, YARN-109-trunk-2.patch, YARN-109-trunk-3.patch, YARN-109-trunk.patch When archives are localized they are initially created as a .tmp file and unpacked from that file. However the .tmp file is not deleted afterwards. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
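The first option mentioned above (use the return value of delete and log when it fails) is roughly this; the class name is illustrative, not the actual patch code.
{code}
import java.io.File;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public final class TmpCleanup {
  private static final Log LOG = LogFactory.getLog(TmpCleanup.class);

  public static void deleteTmp(File tmpFile) {
    // checking the return value satisfies findbugs and leaves a trace on failure
    if (!tmpFile.delete() && tmpFile.exists()) {
      LOG.warn("Failed to delete temporary file " + tmpFile);
    }
  }
}
{code}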
[jira] [Commented] (YARN-109) .tmp file is not deleted for localized archives
[ https://issues.apache.org/jira/browse/YARN-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13609509#comment-13609509 ] Robert Joseph Evans commented on YARN-109: -- That is fine with me. My concern was mostly with Windows support. tar, zip, jar, etc. should be there, but bash may not be. So if you want to file a new JIRA that is fine, if not you can just wait for windows support to be merged in and see if it breaks. .tmp file is not deleted for localized archives --- Key: YARN-109 URL: https://issues.apache.org/jira/browse/YARN-109 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.3, 2.0.0-alpha Reporter: Jason Lowe Assignee: Mayank Bansal Attachments: YARN-109-trunk-1.patch, YARN-109-trunk-2.patch, YARN-109-trunk-3.patch, YARN-109-trunk-4.patch, YARN-109-trunk.patch When archives are localized they are initially created as a .tmp file and unpacked from that file. However the .tmp file is not deleted afterwards. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-378) ApplicationMaster retry times should be set by Client
[ https://issues.apache.org/jira/browse/YARN-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13602333#comment-13602333 ] Robert Joseph Evans commented on YARN-378: -- Using the environment variables works for other applications too. That is the only way to get some pieces of critical information that are needed for registration with the RM. On Windows there are limits http://msdn.microsoft.com/en-us/library/windows/desktop/ms682653%28v=vs.85%29.aspx But they should not cause too much of an issue on Windows Server 2008 and above. I would prefer for us to only return the information to the AM one way, either through thrift or through the environment variable, just so there is less confusion, but I am not adamant about it. ApplicationMaster retry times should be set by Client - Key: YARN-378 URL: https://issues.apache.org/jira/browse/YARN-378 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Environment: suse Reporter: xieguiming Assignee: Zhijie Shen Labels: usability Attachments: YARN-378_1.patch, YARN-378_2.patch, YARN-378_3.patch, YARN-378_4.patch, YARN-378_5.patch, YARN-378_6.patch, YARN-378_6.patch, YARN-378_7.patch We should support that different client or user have different ApplicationMaster retry times. It also say that yarn.resourcemanager.am.max-retries should be set by client. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-378) ApplicationMaster retry times should be set by Client
[ https://issues.apache.org/jira/browse/YARN-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13602341#comment-13602341 ] Robert Joseph Evans commented on YARN-378: -- Looking at the code too I am fine with renaming retries to attempts. But we need to mark this JIRA as an incompatible change or put in a deprecated config mapping. We are early enough in YARN that deprecating it seems like a waste. ApplicationMaster retry times should be set by Client - Key: YARN-378 URL: https://issues.apache.org/jira/browse/YARN-378 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Environment: suse Reporter: xieguiming Assignee: Zhijie Shen Labels: usability Attachments: YARN-378_1.patch, YARN-378_2.patch, YARN-378_3.patch, YARN-378_4.patch, YARN-378_5.patch, YARN-378_6.patch, YARN-378_6.patch, YARN-378_7.patch We should support that different client or user have different ApplicationMaster retry times. It also say that yarn.resourcemanager.am.max-retries should be set by client. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
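If the rename went the deprecation route instead, Hadoop's Configuration already supports mapping an old key to a new one. The new key name below is only illustrative of a retries-to-attempts rename; whatever name is finally agreed on would go in its place.
{code}
import org.apache.hadoop.conf.Configuration;

public final class DeprecatedKeys {
  public static void register() {
    // reads of the old key would transparently resolve to the new one
    Configuration.addDeprecation("yarn.resourcemanager.am.max-retries",
        new String[] { "yarn.resourcemanager.am.max-attempts" });
  }
}
{code}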
[jira] [Commented] (YARN-226) Log aggregation should not assume an AppMaster will have containerId 1
[ https://issues.apache.org/jira/browse/YARN-226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13602661#comment-13602661 ] Robert Joseph Evans commented on YARN-226: -- Big means the amount of memory/CPU relative to the minimum allocation size. For example, you ask for a 4 GB container with a min allocation size of 500MB. Log aggregation should not assume an AppMaster will have containerId 1 -- Key: YARN-226 URL: https://issues.apache.org/jira/browse/YARN-226 Project: Hadoop YARN Issue Type: Sub-task Reporter: Siddharth Seth In case of reservations, etc - AppMasters may not get container id 1. We likely need additional info in the CLC / tokens indicating whether a container is an AM or not. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-378) ApplicationMaster retry times should be set by Client
[ https://issues.apache.org/jira/browse/YARN-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13601237#comment-13601237 ] Robert Joseph Evans commented on YARN-378: -- The patch looks good to me. The only problem I have is with how we are informing the AM of the maximum number of retries that it has. This should work, but it is going to require a lot of changes to the MR AM to use it. Right now the number is used in the init of MRAppMaster, but we will not get that information until start() is called and we register with the RM. I would much rather see a new environment variable added that can hold this information, because it makes MAPREDUCE-5062 much simpler. But I am OK with the way it currently is. ApplicationMaster retry times should be set by Client - Key: YARN-378 URL: https://issues.apache.org/jira/browse/YARN-378 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Environment: suse Reporter: xieguiming Assignee: Zhijie Shen Labels: usability Attachments: YARN-378_1.patch, YARN-378_2.patch, YARN-378_3.patch, YARN-378_4.patch, YARN-378_5.patch, YARN-378_6.patch, YARN-378_6.patch We should support that different client or user have different ApplicationMaster retry times. It also say that yarn.resourcemanager.am.max-retries should be set by client. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
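The environment-variable route preferred above could look roughly like this. The variable name MAX_APP_ATTEMPTS is invented for illustration; the point is that the value is already present in the AM's process environment during init(), before any registration call.
{code}
import java.util.Map;

public final class MaxAttemptsEnv {
  public static final String MAX_APP_ATTEMPTS = "MAX_APP_ATTEMPTS";  // illustrative name

  // Launcher side: add the value to the environment handed to the AM container.
  public static void export(Map<String, String> containerEnv, int maxAttempts) {
    containerEnv.put(MAX_APP_ATTEMPTS, Integer.toString(maxAttempts));
  }

  // AM side: readable in init(), long before registration with the RM.
  public static int read(int defaultValue) {
    String v = System.getenv(MAX_APP_ATTEMPTS);
    return v == null ? defaultValue : Integer.parseInt(v);
  }
}
{code}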
[jira] [Commented] (YARN-237) Refreshing the RM page forgets how many rows I had in my Datatables
[ https://issues.apache.org/jira/browse/YARN-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13599236#comment-13599236 ] Robert Joseph Evans commented on YARN-237: -- Sorry to keep adding more things in here, but JQuery.java is a generic part of YARN. It, in theory, can be used by others not just Map/Reduce and YARN. Encoding in a special case for the tasks table is not acceptable. You should be able to get the same functionality by switching to the DATATABLES_SELECTOR for those tables. We also need to address the find bugs issues. You are dereferencing type to create the ID of the tasks and type could be null, although in practice it should never be. Also there is no need to call toString() on type when using + with another string. This may fix the find bugs issues too, although it would not be super clean. Refreshing the RM page forgets how many rows I had in my Datatables --- Key: YARN-237 URL: https://issues.apache.org/jira/browse/YARN-237 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.0.2-alpha, 0.23.4, 3.0.0 Reporter: Ravi Prakash Assignee: jian he Labels: usability Attachments: YARN-237.patch, YARN-237.v2.patch, YARN-237.v3.patch If I choose a 100 rows, and then refresh the page, DataTables goes back to showing me 20 rows. This user preference should be stored in a cookie. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-378) ApplicationMaster retry times should be set by Client
[ https://issues.apache.org/jira/browse/YARN-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13599240#comment-13599240 ] Robert Joseph Evans commented on YARN-378: -- I am perfectly fine with that. It seems like more overhead, but I am fine either way. ApplicationMaster retry times should be set by Client - Key: YARN-378 URL: https://issues.apache.org/jira/browse/YARN-378 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Environment: suse Reporter: xieguiming Assignee: Zhijie Shen Labels: usability Attachments: YARN-378_1.patch, YARN-378_2.patch, YARN-378_3.patch, YARN-378_4.patch We should support that different client or user have different ApplicationMaster retry times. It also say that yarn.resourcemanager.am.max-retries should be set by client. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-378) ApplicationMaster retry times should be set by Client
[ https://issues.apache.org/jira/browse/YARN-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13597428#comment-13597428 ] Robert Joseph Evans commented on YARN-378: -- From a quick look it seems OK. It would be nice for isLastAMRetry to remain private and have a getter. That way it prevents unintended writes to it. I also don't really like having the AM guess how many retries there will be. I thought it was ugly when I added that code, and now that the logic is more complex I really know why. Could you please file a JIRA so the RM can inform the AM how many AM retries it has, or if you have time just add it in as part of this JIRA. That way the AM will never have to adjust its logic again. Also could we make the code a little more robust. In both the AM and the RM, instead of checking for just -1 could you check for anything that is <= 0. If anyone sets the retries to be that small it should use the default. I am not sure what having a max retries of -2 means and what it would do to an application. ApplicationMaster retry times should be set by Client - Key: YARN-378 URL: https://issues.apache.org/jira/browse/YARN-378 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Environment: suse Reporter: xieguiming Assignee: Zhijie Shen Labels: usability Attachments: YARN-378_1.patch We should support that different client or user have different ApplicationMaster retry times. It also say that yarn.resourcemanager.am.max-retries should be set by client. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
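The "anything <= 0 falls back to the default" check asked for above is a one-liner on the RM side. The sketch below uses invented names and also caps the client's request at the cluster maximum, which is one reasonable policy but not necessarily the one that lands.
{code}
public final class AttemptsPolicy {
  public static int effectiveMaxAttempts(int requestedByClient, int clusterMax) {
    if (requestedByClient <= 0) {
      return clusterMax;                            // unset or nonsense: use the default
    }
    return Math.min(requestedByClient, clusterMax); // never exceed the cluster cap
  }
}
{code}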
[jira] [Commented] (YARN-237) Refreshing the RM page forgets how many rows I had in my Datatables
[ https://issues.apache.org/jira/browse/YARN-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13597508#comment-13597508 ] Robert Joseph Evans commented on YARN-237: -- I have a few more comments. It is great that you fixed the issues, but now we have a leak in the browser. You have tied the table ID to the localStorage key, and then for a couple of tables you have included the jobID in the table ID. This means that new entries will be placed in the localStorage for every job page I visit and those entries will never be deleted. I see two ways to fix this. We can either change it over to be sessionStorage instead of localStorage, because it goes away after the session ends. Or we can remove the jobID from the table names. If we remove the jobID, corresponding tables on different pages will share a single state. If we use sessionStorage the data will only be saved for a given browser session. If I close the browser and reopen it the state will be lost. I tend to think the first one is preferable, but that is just me. Also could you please update the code format to meet our guidelines. There are a few places where it does not meet the guidelines. Refreshing the RM page forgets how many rows I had in my Datatables --- Key: YARN-237 URL: https://issues.apache.org/jira/browse/YARN-237 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.0.2-alpha, 0.23.4, 3.0.0 Reporter: Ravi Prakash Assignee: jian he Labels: usability Attachments: YARN-237.patch, YARN-237.v2.patch If I choose a 100 rows, and then refresh the page, DataTables goes back to showing me 20 rows. This user preference should be stored in a cookie. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-456) allow OS scheduling priority of NM to be different than the containers it launches for Windows
[ https://issues.apache.org/jira/browse/YARN-456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated YARN-456: - Summary: allow OS scheduling priority of NM to be different than the containers it launches for Windows (was: Add similar support for Windows) allow OS scheduling priority of NM to be different than the containers it launches for Windows -- Key: YARN-456 URL: https://issues.apache.org/jira/browse/YARN-456 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Bikas Saha Assignee: Bikas Saha -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-378) ApplicationMaster retry times should be set by Client
[ https://issues.apache.org/jira/browse/YARN-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13593736#comment-13593736 ] Robert Joseph Evans commented on YARN-378: -- I don't really want the client config to be called yarn.resourcemanager.am.max-retries. That is a YARN resource manager config, and is intended to be used by the RM, not by the map reduce client. I would much rather have a mapreduce.am.max-retries that the MR client reads and uses to populate the ApplicationSubmissionContext. ApplicationMaster retry times should be set by Client - Key: YARN-378 URL: https://issues.apache.org/jira/browse/YARN-378 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Environment: suse Reporter: xieguiming Assignee: Zhijie Shen Labels: usability We should support that different client or user have different ApplicationMaster retry times. It also say that yarn.resourcemanager.am.max-retries should be set by client. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-378) ApplicationMaster retry times should be set by Client
[ https://issues.apache.org/jira/browse/YARN-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13593863#comment-13593863 ] Robert Joseph Evans commented on YARN-378: -- But the config *is* specific to mapreduce. Every other application client will have to provide their own way of putting that value into the container launch context. It could be through a hadoop config or it could be through something else entirely. I am in the process of porting Storm to run on top of YARN. I don't see us ever using a Hadoop Configuration in the client except the default one to be able to access HDFS. Storm has its own configuration object and for better integration with Storm I would set up a Storm conf for that, although in reality I would probably just never set it because I never want it to go down entirely, and that is how I would get the maximum number of retries allowed by the cluster. I can see other applications that already exist and are being ported to run on YARN, like OpenMPI, to want to set that config in a way that is consistent with their current configuration and not in a Hadoop specific way. ApplicationMaster retry times should be set by Client - Key: YARN-378 URL: https://issues.apache.org/jira/browse/YARN-378 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Environment: suse Reporter: xieguiming Assignee: Zhijie Shen Labels: usability We should support that different client or user have different ApplicationMaster retry times. It also say that yarn.resourcemanager.am.max-retries should be set by client. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-237) Refreshing the RM page forgets how many rows I had in my Datatables
[ https://issues.apache.org/jira/browse/YARN-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13590842#comment-13590842 ] Robert Joseph Evans commented on YARN-237: -- localStorage is not per page it is per domain, so that means if two pages in the same domain have tables named the same they will share a config in local storage. So for example if I run a map/reduce job and I sort the map tasks by elapsed time, the reduce tasks will also be sorted by elapsed time when I go to their page. The good news is that if I sort the reduces by an ID that the maps don't know about the maps page just ignores it, but it resets the sorting for the reducers not too. But this produces even stranger behavior in the counters page. Because the counters use a selector, multiple tables on the same page now all share a saved state. So if I sort the counters by a column and then reload all of the counters are now sorted by that column. I am not positive what the best way is to fix these. We want to provide a way for each data table to have a unique storage key across all tables in the domain, even with the selector. We don't want to use the page path or anything like that because that will create a new group of settings per page, and that would result in filling up their localStorage, unless of course we used the sessionStorage instead. But using sessionStorage would mean that each time we opened up a new session we would have to re-do the settings. sessionStorage also does not fix the issue with counters and the selector where we have multiple tables all sharing a single ID and single storage. Refreshing the RM page forgets how many rows I had in my Datatables --- Key: YARN-237 URL: https://issues.apache.org/jira/browse/YARN-237 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.0.2-alpha, 0.23.4, 3.0.0 Reporter: Ravi Prakash Assignee: jian he Labels: usability Attachments: YARN-237.patch If I choose a 100 rows, and then refresh the page, DataTables goes back to showing me 20 rows. This user preference should be stored in a cookie. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-237) Refreshing the RM page forgets how many rows I had in my Datatables
[ https://issues.apache.org/jira/browse/YARN-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13590999#comment-13590999 ] Robert Joseph Evans commented on YARN-237: -- The code that Jian wrote is working on the RM as well as the AM, I tested it. The patch changes code that is common to both of them. The issues I mentioned are not theoretical. The reason it works on the AM is because it is not using a cookie, instead it is using an HTML5 concept for local storage. If we want to restrict these to just be for the RM that does seem to fix the issue. Refreshing the RM page forgets how many rows I had in my Datatables --- Key: YARN-237 URL: https://issues.apache.org/jira/browse/YARN-237 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.0.2-alpha, 0.23.4, 3.0.0 Reporter: Ravi Prakash Assignee: jian he Labels: usability Attachments: YARN-237.patch If I choose a 100 rows, and then refresh the page, DataTables goes back to showing me 20 rows. This user preference should be stored in a cookie. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-237) Refreshing the RM page forgets how many rows I had in my Datatables
[ https://issues.apache.org/jira/browse/YARN-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589976#comment-13589976 ] Robert Joseph Evans commented on YARN-237: -- The change looks more or less OK to me. I am not thrilled about how we modify the data table's init string by looking for the first '{', but I think it is OK. I just have a few concerns, and most if it deals with my lack of knowledge about jQuery and localStorage. I know that localStorage is not supported on all browsers. I also know that localStorage can throw a QUOTA_EXCEEDED exception. What happens when we run into these situations? Will the page stop working or will jQuery degrade gracefully and simply not allow us to save the data. What about if the data stored in the key is not what we expect. Will jQuery make the page unusable. We currently have tables with the same name on different pages. If they are not kept in sync there could be some issues with the data that is saved. Which brings up another point I am also a bit concerned about the key we are using as part of the localStorage. The key is the id of the data table. I would prefer it if we could some how make it obvious that these values are for a data table, and not some other apps storage. Refreshing the RM page forgets how many rows I had in my Datatables --- Key: YARN-237 URL: https://issues.apache.org/jira/browse/YARN-237 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.0.2-alpha, 0.23.4, 3.0.0 Reporter: Ravi Prakash Assignee: jian he Labels: usability Attachments: YARN-237.patch If I choose a 100 rows, and then refresh the page, DataTables goes back to showing me 20 rows. This user preference should be stored in a cookie. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-426) Failure to download a public resource on a node prevents further downloads of the resource from that node
[ https://issues.apache.org/jira/browse/YARN-426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13588424#comment-13588424 ] Robert Joseph Evans commented on YARN-426: -- The patch looks good to me. +1 I'll check it in. Failure to download a public resource on a node prevents further downloads of the resource from that node - Key: YARN-426 URL: https://issues.apache.org/jira/browse/YARN-426 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.0.3-alpha, 0.23.6 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-426.patch If the NM encounters an error while downloading a public resource, it fails to empty the list of request events corresponding to the resource request in {{attempts}}. If the same public resource is subsequently requested on that node, {{PublicLocalizer.addResource}} will skip the download since it will mistakenly believe a download of that resource is already in progress. At that point any container that requests the public resource will just hang in the {{LOCALIZING}} state. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-371) Consolidate resource requests in AM-RM heartbeat
[ https://issues.apache.org/jira/browse/YARN-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13570329#comment-13570329 ] Robert Joseph Evans commented on YARN-371: -- Tom just like Arun said the memory usage changes based off of the size of the cluster vs. the size of the request. The current approach is on the order of the size of the cluster where as the proposed approach is on the order of the number of desired containers. If I have a 100 node cluster and I am requesting 10 map tasks the size will be O(100 nodes + X racks + 1) possibly * 2 if reducers are included in it. What is more it is probably exactly the same size of request for 1 or even 1000 tasks. Where as the proposed approach would grow without bound as the number of tasks also increased. However, I also agree with Sandy that the current state compression is lossy and as such restricts what is possible in the scheduler. I would like to understand better what the size differences would be for various requests, both in memory and also over the wire. It seems conceivable to me that if the size difference is not too big, especially over the wire, we could allow the scheduler itself to decide on its in memory representation. This would allow for the Capacity Scheduler to keep its current layout and allow for others to experiment with more advanced scheduling options. Different groups could decide which scheduler best fits their needs and workload. If the size is significantly larger I would like to see hard numbers about how much better/worse it makes specific use cases. I am also very concerned about adding too much complexity to the scheduler. We have run into issues where the RM will get very far behind in scheduling because it is trying to do a lot already and eventually OOM as its event queue grows too large. I also don't want to change the scheduler protocol too much without first understanding how that new protocol would impact other potential scheduling features. There are a number of other computing patterns that could benefit from specific scheduler support. Things like gang scheduling where you need all of the containers at once or none of them can make any progress, or where you want all of the containers to be physically close to one another because they are very I/O intensive, but you don't really care where exactly they are. Or even something like HBase where you essentially want one process on every single node with no duplicates. Do the proposed changes make these uses case trivially simple, or do they require a lot of support on the AM to implement them? Consolidate resource requests in AM-RM heartbeat Key: YARN-371 URL: https://issues.apache.org/jira/browse/YARN-371 Project: Hadoop YARN Issue Type: Improvement Components: api, resourcemanager, scheduler Affects Versions: 2.0.2-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Each AMRM heartbeat consists of a list of resource requests. Currently, each resource request consists of a container count, a resource vector, and a location, which may be a node, a rack, or *. When an application wishes to request a task run in multiple localtions, it must issue a request for each location. This means that for a node-local task, it must issue three requests, one at the node-level, one at the rack-level, and one with * (any). These requests are not linked with each other, so when a container is allocated for one of them, the RM has no way of knowing which others to get rid of. 
When a node-local container is allocated, this is handled by decrementing the number of requests on that node's rack and in *. But when the scheduler allocates a task with a node-local request on its rack, the request on the node is left there. This can cause delay-scheduling to try to assign a container on a node that nobody cares about anymore. Additionally, unless I am missing something, the current model does not allow requests for containers only on a specific node or specific rack. While this is not a use case for MapReduce currently, it is conceivable that it might be something useful to support in the future, for example to schedule long-running services that persist state in a particular location, or for applications that generally care less about latency than data-locality. Lastly, the ability to understand which requests are for the same task will possibly allow future schedulers to make more intelligent scheduling decisions, as well as permit a more exact understanding of request load. I would propose the tweak of allowing a single ResourceRequest to encapsulate all the location information for a task. So instead of just a single location, a
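To make the current request model concrete, here is a minimal sketch of the three requests an AM issues for one node-local task. It uses the newInstance factory methods from later Hadoop 2.x releases (the 2.0.x code discussed above built these records differently, so treat the exact calls as illustrative); the host and rack names are made up.

{code:java}
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public class CurrentRequestModel {
  public static void main(String[] args) {
    Priority mapPriority = Priority.newInstance(20);
    Resource oneGb = Resource.newInstance(1024, 1);

    // One node-local task on host n1 in rack /r1 requires three separate
    // requests today: node, rack, and * (any). Nothing ties them together.
    List<ResourceRequest> asks = Arrays.asList(
        ResourceRequest.newInstance(mapPriority, "n1.example.com", oneGb, 1),
        ResourceRequest.newInstance(mapPriority, "/r1", oneGb, 1),
        ResourceRequest.newInstance(mapPriority, ResourceRequest.ANY, oneGb, 1));

    // Because the RM cannot tell these entries describe the same task,
    // satisfying one of them does not reliably clean up the other two.
    asks.forEach(System.out::println);
  }
}
{code}

Note how the number of entries stays roughly the same whether the AM wants 1 or 1,000 such tasks (only the container counts change), which is the O(cluster size) behavior described in the comment above.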
[jira] [Commented] (YARN-371) Resource-centric compression in AM-RM protocol limits scheduling
[ https://issues.apache.org/jira/browse/YARN-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13570619#comment-13570619 ] Robert Joseph Evans commented on YARN-371: -- I didn't really expect them to be trivial :). So I think that there may be some value in having a different protocol, but we need some hard numbers to be able to really make an informed decision. I would like to see the size of a request in the following table (both the in-memory size on the RM and the size sent over the wire):
||nodes (down) / tasks (across)||1,000||10,000||100,000||500,000||
||100|?|?|?|?|
||1,000|?|?|?|?|
||4,000|?|?|?|?|
||10,000|?|?|?|?|
It would also be great to see in practice how bad the scheduling problem is when the wrong node is sent. Resource-centric compression in AM-RM protocol limits scheduling Key: YARN-371 URL: https://issues.apache.org/jira/browse/YARN-371 Project: Hadoop YARN Issue Type: Improvement Components: api, resourcemanager, scheduler Affects Versions: 2.0.2-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Each AMRM heartbeat consists of a list of resource requests. Currently, each resource request consists of a container count, a resource vector, and a location, which may be a node, a rack, or *. When an application wishes to request that a task run in multiple locations, it must issue a request for each location. This means that for a node-local task, it must issue three requests, one at the node level, one at the rack level, and one with * (any). These requests are not linked with each other, so when a container is allocated for one of them, the RM has no way of knowing which others to get rid of. When a node-local container is allocated, this is handled by decrementing the number of requests on that node's rack and in *. But when the scheduler allocates a task with a node-local request on its rack, the request on the node is left there. This can cause delay-scheduling to try to assign a container on a node that nobody cares about anymore. Additionally, unless I am missing something, the current model does not allow requests for containers only on a specific node or specific rack. While this is not a use case for MapReduce currently, it is conceivable that it might be something useful to support in the future, for example to schedule long-running services that persist state in a particular location, or for applications that generally care less about latency than data-locality. Lastly, the ability to understand which requests are for the same task will possibly allow future schedulers to make more intelligent scheduling decisions, as well as permit a more exact understanding of request load. I would propose the tweak of allowing a single ResourceRequest to encapsulate all the location information for a task. So instead of just a single location, a ResourceRequest would contain an array of locations, including nodes that it would be happy with, racks that it would be happy with, and possibly *. Side effects of this change would be a reduction in the amount of data that needs to be transferred in a heartbeat, as well as in the RM's memory footprint, because what used to be different requests for the same task are now able to share some common data. While this change breaks compatibility, if it is going to happen, it makes sense to do it now, before YARN becomes beta. -- This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
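For contrast, a purely hypothetical sketch of the consolidated form proposed in the description above. None of these names exist in the YARN API; they only illustrate the idea of one request carrying every acceptable location for a task, so the RM can retire all of that task's locality options at once.

{code:java}
import java.util.Arrays;
import java.util.List;

// Hypothetical type, not part of YARN: one request per task, carrying every
// location that task would accept.
class ConsolidatedResourceRequest {
  final int priority;
  final int memoryMb;
  final List<String> acceptableNodes;  // e.g. ["n1.example.com"]
  final List<String> acceptableRacks;  // e.g. ["/r1"]
  final boolean acceptAnyNode;         // replaces the separate "*" request

  ConsolidatedResourceRequest(int priority, int memoryMb,
      List<String> acceptableNodes, List<String> acceptableRacks,
      boolean acceptAnyNode) {
    this.priority = priority;
    this.memoryMb = memoryMb;
    this.acceptableNodes = acceptableNodes;
    this.acceptableRacks = acceptableRacks;
    this.acceptAnyNode = acceptAnyNode;
  }
}

class ProposedRequestModel {
  public static void main(String[] args) {
    // One object per task: memory grows with the number of outstanding tasks
    // rather than with the number of nodes and racks in the cluster, which is
    // exactly the trade-off the table above is meant to quantify.
    ConsolidatedResourceRequest task = new ConsolidatedResourceRequest(
        20, 1024, Arrays.asList("n1.example.com"), Arrays.asList("/r1"), true);
    System.out.println("locations for one task: "
        + task.acceptableNodes + " " + task.acceptableRacks);
  }
}
{code}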
[jira] [Commented] (YARN-225) Proxy Link in RM UI throws NPE in Secure mode
[ https://issues.apache.org/jira/browse/YARN-225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13540470#comment-13540470 ] Robert Joseph Evans commented on YARN-225: -- That does look to be the correct patch, assuming that the stack trace was against 2.0.2 or before. Either way it is a fix that needs to go in, because I misread the HttpServletRequest Javadocs and missed "or null if the request has no cookies". The fix needs to go into branch-0.23 as well. I am +1 for the fix and will check it in. Proxy Link in RM UI throws NPE in Secure mode Key: YARN-225 URL: https://issues.apache.org/jira/browse/YARN-225 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.2-alpha, 2.0.1-alpha, 2.0.3-alpha Reporter: Devaraj K Assignee: Devaraj K Priority: Critical Attachments: YARN-225.patch
{code:xml}
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:241)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:975)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
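The Javadoc detail called out above is the whole bug: HttpServletRequest.getCookies() returns null, not an empty array, when the request carries no cookies. A minimal sketch of the kind of null-safe lookup involved (illustrative only, not the actual WebAppProxyServlet patch):

{code:java}
import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServletRequest;

final class CookieUtil {
  private CookieUtil() {}

  /** Find a cookie by name, tolerating requests that have no cookies at all. */
  static Cookie findCookie(HttpServletRequest req, String name) {
    Cookie[] cookies = req.getCookies(); // may be null, not just empty
    if (cookies == null) {
      return null;
    }
    for (Cookie c : cookies) {
      if (name.equals(c.getName())) {
        return c;
      }
    }
    return null;
  }
}
{code}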
[jira] [Updated] (YARN-293) Node Manager leaks LocalizerRunner object for every Container
[ https://issues.apache.org/jira/browse/YARN-293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated YARN-293: - Priority: Critical (was: Major) Target Version/s: 0.23.6 Affects Version/s: 0.23.3 It looks like some of the wiring is in place for this. We just need to send an ABORT_LOCALIZATION event when the RM tells the NM the app is done. Node Manager leaks LocalizerRunner object for every Container -- Key: YARN-293 URL: https://issues.apache.org/jira/browse/YARN-293 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.0.2-alpha, 0.23.3, 2.0.1-alpha Reporter: Devaraj K Priority: Critical Node Manager creates a new LocalizerRunner object for every container and puts it in the ResourceLocalizationService.LocalizerTracker.privLocalizers map but never removes it from the map. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-293) Node Manager leaks LocalizerRunner object for every Container
[ https://issues.apache.org/jira/browse/YARN-293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13540006#comment-13540006 ] Robert Joseph Evans commented on YARN-293: -- Sorry, looking at it more closely, it is actually per container ID, so we need to send an event when the container is cleaned up. Node Manager leaks LocalizerRunner object for every Container -- Key: YARN-293 URL: https://issues.apache.org/jira/browse/YARN-293 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.0.2-alpha, 0.23.3, 2.0.1-alpha Reporter: Devaraj K Priority: Critical Node Manager creates a new LocalizerRunner object for every container and puts it in the ResourceLocalizationService.LocalizerTracker.privLocalizers map but never removes it from the map. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-293) Node Manager leaks LocalizerRunner object for every Container
[ https://issues.apache.org/jira/browse/YARN-293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated YARN-293: - Attachment: YARN-293-trunk.txt This turned out to be a much smaller change than I originally thought. I just added the cleanup to a handler that was already being called for all containers to delete the container's resources. Node Manager leaks LocalizerRunner object for every Container -- Key: YARN-293 URL: https://issues.apache.org/jira/browse/YARN-293 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.0.2-alpha, 0.23.3, 2.0.1-alpha Reporter: Devaraj K Assignee: Robert Joseph Evans Priority: Critical Attachments: YARN-293-trunk.txt Node Manager creates a new LocalizerRunner object for every container and puts it in the ResourceLocalizationService.LocalizerTracker.privLocalizers map but never removes it from the map. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
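A rough sketch of the shape of that cleanup. Only LocalizerRunner and the privLocalizers map come from the report; the class, method, and key names below are stand-ins for the real NodeManager code, which keys the map by container ID.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins for the real NodeManager types.
class LocalizerRunner { /* one per container's private localization */ }

class LocalizerTrackerSketch {
  // Mirrors ResourceLocalizationService.LocalizerTracker.privLocalizers:
  // one entry per container ID, so entries must be dropped on cleanup.
  private final Map<String, LocalizerRunner> privLocalizers =
      new ConcurrentHashMap<>();

  void onLocalizationStart(String containerId) {
    privLocalizers.put(containerId, new LocalizerRunner());
  }

  // Hooked into the handler that already runs for every container when its
  // resources are deleted; without the remove(), every runner leaks.
  void onContainerResourcesCleanedUp(String containerId) {
    LocalizerRunner runner = privLocalizers.remove(containerId);
    if (runner != null) {
      // the real code would also stop/interrupt the runner here
    }
  }
}
{code}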
[jira] [Commented] (YARN-2) Enhance CS to schedule accounting for both memory and cpu cores
[ https://issues.apache.org/jira/browse/YARN-2?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13540081#comment-13540081 ] Robert Joseph Evans commented on YARN-2: I chatted with Arun offline a bit about this, and he pointed out to me that the APIs are marked as Evolving; I should read the patch more closely next time. So I am OK with putting it in with the API as it is. I still think that having a float for the API is preferable, but until we actually start using it in practice we will not know what the real issues are. Enhance CS to schedule accounting for both memory and cpu cores --- Key: YARN-2 URL: https://issues.apache.org/jira/browse/YARN-2 Project: Hadoop YARN Issue Type: New Feature Components: capacityscheduler, scheduler Reporter: Arun C Murthy Assignee: Arun C Murthy Fix For: 2.0.3-alpha Attachments: MAPREDUCE-4327.patch, MAPREDUCE-4327.patch, MAPREDUCE-4327.patch, MAPREDUCE-4327-v2.patch, MAPREDUCE-4327-v3.patch, MAPREDUCE-4327-v4.patch, MAPREDUCE-4327-v5.patch, YARN-2-help.patch, YARN-2.patch, YARN-2.patch, YARN-2.patch, YARN-2.patch, YARN-2.patch, YARN-2.patch With YARN being a general-purpose system, it would be useful for several applications (MPI et al) to specify not just memory but also CPU (cores) for their resource requirements. Thus, it would be useful for the CapacityScheduler to account for both. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-276) Capacity Scheduler can hang when submitting many jobs concurrently
[ https://issues.apache.org/jira/browse/YARN-276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13537168#comment-13537168 ] Robert Joseph Evans commented on YARN-276: -- I am not an expert on the scheduler code, so I have not done an in-depth review of the patch. My biggest concern with this is that there is no visibility in the UI/web services about why an app may not have been scheduled. It would be great if you could update CapacitySchedulerLeafQueueInfo.java and the web page that uses it, CapacitySchedulerPage.java. Capacity Scheduler can hang when submitting many jobs concurrently -- Key: YARN-276 URL: https://issues.apache.org/jira/browse/YARN-276 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 3.0.0, 2.0.1-alpha Reporter: nemon lou Attachments: YARN-276.patch, YARN-276.patch, YARN-276.patch, YARN-276.patch, YARN-276.patch Original Estimate: 24h Remaining Estimate: 24h In Hadoop 2.0.1, when I submit many jobs concurrently, the Capacity Scheduler can hang with most resources taken up by AMs and not enough resources left for tasks. And then all applications hang there. The cause is that yarn.scheduler.capacity.maximum-am-resource-percent is not checked directly. Instead, this property is only used to compute maxActiveApplications, and maxActiveApplications is computed from minimumAllocation (not from what the AMs actually use). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
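To see how that limit can overshoot, a back-of-the-envelope sketch of the computation described in the report (simplified; the real CapacityScheduler applies additional per-queue factors, and the cluster size and 2 GB AM footprint below are assumed examples):

{code:java}
public class MaxActiveAppsSketch {
  public static void main(String[] args) {
    int clusterMemoryMb = 100 * 8 * 1024; // assume 100 nodes x 8 GB
    int minAllocationMb = 1024;           // yarn.scheduler.minimum-allocation-mb
    double maxAmPercent = 0.1;            // maximum-am-resource-percent

    // Limit derived from the minimum allocation, not from what AMs really use.
    int maxActiveApplications = (int) Math.ceil(
        (clusterMemoryMb / (double) minAllocationMb) * maxAmPercent);

    // If each AM actually needs 2 GB instead of the 1 GB minimum, the
    // admitted AMs hold twice the share the 10% setting intended, and the
    // remaining capacity may be too small for their tasks to run.
    long memoryHeldByAms = (long) maxActiveApplications * 2048;
    System.out.println("maxActiveApplications = " + maxActiveApplications);
    System.out.println("memory held by AMs    = " + memoryHeldByAms
        + " MB of " + clusterMemoryMb + " MB");
  }
}
{code}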
[jira] [Commented] (YARN-204) test coverage for org.apache.hadoop.tools
[ https://issues.apache.org/jira/browse/YARN-204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13504710#comment-13504710 ] Robert Joseph Evans commented on YARN-204: -- +1 the new changes look better. I'll check this in. test coverage for org.apache.hadoop.tools - Key: YARN-204 URL: https://issues.apache.org/jira/browse/YARN-204 Project: Hadoop YARN Issue Type: Bug Components: applications Reporter: Aleksey Gorshkov Assignee: Aleksey Gorshkov Attachments: YARN-204-branch-0.23-a.patch, YARN-204-branch-0.23-b.patch, YARN-204-branch-0.23.patch, YARN-204-branch-2-a.patch, YARN-204-branch-2-b.patch, YARN-204-branch-2.patch, YARN-204-trunk-a.patch, YARN-204-trunk-b.patch, YARN-204-trunk.patch Added some tests for org.apache.hadoop.tools -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-237) Refreshing the RM page forgets how many rows I had in my Datatables
[ https://issues.apache.org/jira/browse/YARN-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13503890#comment-13503890 ] Robert Joseph Evans commented on YARN-237: -- You have to be careful with cookies because the web app proxy strips out cookies before sending the data to the application. Refreshing the RM page forgets how many rows I had in my Datatables --- Key: YARN-237 URL: https://issues.apache.org/jira/browse/YARN-237 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.0.2-alpha, 0.23.4, 3.0.0 Reporter: Ravi Prakash If I choose 100 rows and then refresh the page, DataTables goes back to showing me 20 rows. This user preference should be stored in a cookie. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
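A sketch of the kind of header forwarding that produces the behavior described above: the proxy copies the client's headers onto the outbound request but drops Cookie headers, so cookie-based UI state does not survive for pages served through it. This is illustrative only, not the actual WebAppProxyServlet code.

{code:java}
import java.net.HttpURLConnection;
import java.util.Enumeration;
import javax.servlet.http.HttpServletRequest;

final class ProxyHeaderCopy {
  private ProxyHeaderCopy() {}

  // Copy the client's headers to the proxied request, skipping cookies.
  static void copyHeaders(HttpServletRequest in, HttpURLConnection out) {
    Enumeration<String> names = in.getHeaderNames();
    while (names.hasMoreElements()) {
      String name = names.nextElement();
      if ("Cookie".equalsIgnoreCase(name)) {
        continue; // stripped before the request reaches the application
      }
      Enumeration<String> values = in.getHeaders(name);
      while (values.hasMoreElements()) {
        out.addRequestProperty(name, values.nextElement());
      }
    }
  }
}
{code}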