[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142086#comment-14142086 ]
Steve Loughran commented on YARN-913:
-------------------------------------

bq. I have some concern around 'naked' zookeeper.* config option

This is something that I do think needs changing in ZK; being driven by JVM properties can work for standalone JVM servers, but not for clients. The client here sets the properties just before they are needed (e.g. the SASL auth details), and I was thinking of making the set-connect operation class-synchronized. But... Curator does some session restarting, and if those JVM-wide settings have changed in the meantime, there may be problems. Summary: we need to fix the ZK client and then have Curator configure it, so the rest of us don't have to care.

bq. if a user kills the ZK used for app registry through some action, what happens to the RM and other user's bits that are running

# The RM isn't depending on the ZK cluster for information; it just sets up the paths for a user and purges the container- and app-lifespan parts on their completion. I've made both the setup and teardown operations async; the {{RMRegistryOperationsService}} class gets the RM event and schedules the work on its executor. If ZK is offline, these operations will block until the quorum is back, but they should not delay RM operations. They could block the clients and the AM starting up.
# Curator supports different {{EnsembleProvider}} classes, which supply the data the client needs to reconnect to ZK. The code is currently only hooked up to one -the {{FixedEnsembleProvider}}, which uses a classic static ZK quorum. There's an alternative, the {{ExhibitorProvider}}, which hooks up to [Netflix Exhibitor|https://github.com/Netflix/exhibitor/wiki] and can do things like [Rolling Ensemble Change|https://github.com/Netflix/exhibitor/wiki/Rolling-Ensemble-Change]. This is designed for cloud deployments where a ZK server failure results in a new host coming up with a new hostname/address ... Exhibitor handles the details of rebinding.
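To illustrate the JVM-wide property problem discussed above, here is a minimal sketch (this is illustrative code, not the actual YARN-913 patch; the class and method names are made up, though {{zookeeper.sasl.clientconfig}} is a real ZooKeeper client property). It serializes connect attempts around the property mutation, which is roughly the class-synchronized approach described, and shows why Curator's internal session restarts remain a problem: they happen outside any such lock.

```java
import java.util.concurrent.Callable;

// Illustrative sketch of the race described in the comment: ZooKeeper client
// settings such as zookeeper.sasl.clientconfig are JVM-wide system
// properties, so a client must set them immediately before connecting and
// serialize connect attempts so two threads' settings don't interleave.
public final class ZkClientProps {
    private static final Object CONNECT_LOCK = new Object();

    private ZkClientProps() {}

    /**
     * Set the JVM-wide SASL login-context property and run the supplied
     * connect action under a single process-wide lock, restoring the old
     * value afterwards. Note: any reconnect Curator performs later is NOT
     * covered by this lock -- the residual problem noted above.
     */
    public static <T> T withSaslContext(String jaasContext, Callable<T> connect)
            throws Exception {
        synchronized (CONNECT_LOCK) {
            String old = System.getProperty("zookeeper.sasl.clientconfig");
            System.setProperty("zookeeper.sasl.clientconfig", jaasContext);
            try {
                return connect.call();
            } finally {
                // Restore the previous value so later connects see their own settings.
                if (old == null) {
                    System.clearProperty("zookeeper.sasl.clientconfig");
                } else {
                    System.setProperty("zookeeper.sasl.clientconfig", old);
                }
            }
        }
    }
}
```

The restore-in-finally keeps the workaround from leaking one caller's settings into the next, but it cannot help a reconnect that fires after the lock is released, which is why the comment argues the fix belongs in the ZK client itself.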
I haven't added explicit support for that (straightforward) or got a test setup (harder). If you want to play with it, though ...

bq. Why doesn't the hostname component allow for FQDNs?

Do you mean in the endpoint fields? It should ... let me clarify that in the example.

bq. Are we prepared for more backlash when another component requires working DNS?

The reason the initial patches here weren't building is a helper method that builds up an endpoint address from an {{InetSocketAddress}}: it called {{getHostString()}} to get the host/FQDN without doing any DNS work. I had to switch to {{getHostName()}}, which can try to do rDNS, and so rely on DNS working.

bq. Is ZK the right thing to use here?

# ZK gives us availability; I do plan to add a REST API later on, one that works long-haul. That's why there is deliberately no support for ephemeral nodes ... the {{RegistryOperations}} interface is designed to be implementable by a REST client, for which there won't be any sessions to tie ephemeral nodes to.
# By deliberately publishing nothing but endpoints to services, we're trying to keep the content in the store down, with the bulk data being served up by other means. In Slider, we publish dynamically generated config files from the AM REST API; all the registry entry does is list the API + URL for that service.
# I do like your idea about just sticking stuff into HDFS, S3, etc.; that's a way to share content too, including config data. It'll fit into the general category of URL-formatted endpoint. Maybe I should add it as an explicit address type, "filesystem"?
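The {{getHostString()}} vs. {{getHostName()}} distinction mentioned above can be shown with a small, self-contained JDK example (this demonstrates the standard {{java.net.InetSocketAddress}} API, not the registry patch itself):

```java
import java.net.InetSocketAddress;

// getHostString() returns the literal hostname the address was created with
// and never triggers a DNS lookup; getHostName() may perform a reverse-DNS
// lookup when the address wraps a raw InetAddress.
public class HostStringDemo {
    public static String noDnsHost(String host, int port) {
        // createUnresolved() stores the string without any DNS resolution.
        InetSocketAddress addr = InetSocketAddress.createUnresolved(host, port);
        return addr.getHostString();   // never does a lookup
    }

    public static String maybeDnsHost(String host, int port) {
        InetSocketAddress addr = InetSocketAddress.createUnresolved(host, port);
        // For an unresolved address this just returns the original string,
        // but for an address built from an IP it can trigger rDNS.
        return addr.getHostName();
    }
}
```

{{getHostString()}} is the DNS-free option, but it only exists from Java 7 onwards, which is the kind of build constraint that forces a fallback to {{getHostName()}} and with it a dependency on working DNS.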
> Add a way to register long-lived services in a YARN cluster
> -----------------------------------------------------------
>
>                 Key: YARN-913
>                 URL: https://issues.apache.org/jira/browse/YARN-913
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: api, resourcemanager
>    Affects Versions: 2.5.0, 2.4.1
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>         Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, YARN-913-007.patch, YARN-913-008.patch, yarnregistry.pdf, yarnregistry.tla
>
> In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere.
> Applications need to be able to find the service instance they are to bond to -and not any others in the cluster.
> Some kind of service registry -in the RM, in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)