[ 
https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142086#comment-14142086
 ] 

Steve Loughran commented on YARN-913:
-------------------------------------

bq. I have some concern around 'naked' zookeeper.* config option

This is something that I do think needs changing in ZK; being driven by JVM 
system properties can work for standalone JVM servers, but not for clients. The 
client here sets the properties just before they are needed (e.g. the SASL auth 
details), and I was thinking of making the set-connect operation 
class-synchronized. But... Curator does some session restarting, and if those 
JVM-wide settings are changed in the meantime, there may be problems. Summary: 
we need to fix the ZK client and then have Curator configure it, so the rest of 
us don't have to care.
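For reference, this is roughly the kind of JVM-wide wiring a client has to do 
today before opening a connection; a minimal sketch only, with a made-up JAAS 
file path, using the stock ZK client property names rather than anything this 
patch introduces:

{code:java}
// Illustrative sketch: the stock ZK client reads its SASL settings from
// JVM-wide system properties, so they must be set before the client
// (or Curator) connects -- and they affect every ZK client in the JVM.
public class ZKSaslPropsSketch {
  public static void main(String[] args) {
    // path to a JAAS config file (made up for this example)
    System.setProperty("java.security.auth.login.config",
        "/etc/hadoop/conf/zk-client-jaas.conf");
    // enable SASL on the client
    System.setProperty("zookeeper.sasl.client", "true");
    // name of the JAAS login context entry; "Client" is the ZK default
    System.setProperty("zookeeper.sasl.clientconfig", "Client");
  }
}
{code}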

bq.  if a user kills the ZK used for app registry through some action, what 
happens to the RM and other user's bits that are running

# The RM isn't depending on the ZK cluster for information; it just sets up the 
paths for a user, and purges the container- & app-lifespan parts on their 
completion. I've made both the setup and teardown operations async; the 
{{RMRegistryOperationsService}} class gets the RM event and schedules the work 
on its executor. If ZK is offline these will block until the quorum is 
back, but they should not delay RM operations. It could block clients and 
AMs starting up.

# Curator supports different {{EnsembleProviders}}: classes which provide the 
data needed for the client to reconnect to ZK. The code is currently only 
hooked up to one -the {{FixedEnsembleProvider}}, which uses a classic static ZK 
quorum. There's an alternative, the {{ExhibitorProvider}}, which hooks up to 
[Netflix Exhibitor|https://github.com/Netflix/exhibitor/wiki] and can do 
things like a [Rolling Ensemble 
Change|https://github.com/Netflix/exhibitor/wiki/Rolling-Ensemble-Change]. 
This is designed for cloud deployments where a ZK server failure results in a 
new host coming up with a new hostname/address; Exhibitor handles the details 
of rebinding.

I haven't added explicit support for that (straightforward) or got a test setup 
(harder). If you want to play with it though ...
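This is roughly how an ensemble provider gets wired into Curator; a sketch 
only, with a made-up quorum string, and an Exhibitor-backed provider would slot 
into the same {{ensembleProvider()}} call:

{code:java}
import org.apache.curator.ensemble.fixed.FixedEnsembleProvider;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class EnsembleProviderSketch {
  public static void main(String[] args) {
    // classic static quorum: the provider simply hands this connection
    // string back to Curator whenever it needs to (re)connect
    FixedEnsembleProvider ensemble =
        new FixedEnsembleProvider("zk1:2181,zk2:2181,zk3:2181");

    CuratorFramework curator = CuratorFrameworkFactory.builder()
        .ensembleProvider(ensemble)
        .retryPolicy(new ExponentialBackoffRetry(1000, 3))
        .build();
    curator.start();
    // ... registry operations would go here ...
    curator.close();
  }
}
{code}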


bq. Why doesn't the hostname component allow for FQDNs?

Do you mean in the endpoint fields? It should ... let me clarify that in the 
example.

bq. Are we prepared for more backlash when another component requires working 
DNS?

The reason the initial patches here weren't building is that the helper method 
which builds up an endpoint address from an {{InetSocketAddress}} was calling 
{{getHostString()}} to get the host/FQDN without doing any DNS work. I had to 
switch to {{getHostName()}}, which can trigger a reverse DNS lookup, and so 
relies on DNS working.
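
To make the distinction concrete (a sketch, not code from the patch):

{code:java}
import java.net.InetSocketAddress;

public class HostLookupSketch {
  public static void main(String[] args) {
    InetSocketAddress addr = new InetSocketAddress("192.168.1.20", 8020);

    // getHostString() (Java 7+) returns whatever the address was created
    // with, without ever touching DNS
    String noDns = addr.getHostString();

    // getHostName() may trigger a reverse DNS lookup to map the literal
    // address back to a hostname
    String maybeRdns = addr.getHostName();

    System.out.println(noDns + " / " + maybeRdns);
  }
}
{code}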

bq. Is ZK the right thing to use here?


# ZK gives us availability; I do plan to add a REST API later on, one that 
works long-haul. It's why there is deliberately no support for ephemeral nodes 
... the {{RegistryOperations}} interface is designed to be implementable by a 
REST client, for which there won't be any sessions to tie ephemeral nodes to. 

# By deliberately publishing nothing but endpoints to services, we're trying to 
keep the content in the store down, with the bulk data being served up by other 
means. In Slider, we are publishing dynamically generated config files from the 
AM REST API; all the registry entry does is list the API + URL for that 
service. 

# I do like your idea about just sticking stuff into HDFS, S3, etc.; that's a 
way to share content too, including config data. It fits into the general 
category of URL-formatted endpoints; maybe I should add it as an explicit 
address type, "filesystem"? (See the sketch below.)
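
For illustration, the kind of entry this implies; every type, API name and 
address below is made up for the sketch and is not the patch's actual schema:

{code:java}
// Hypothetical types for illustration only; the real record/endpoint
// classes in the patch may differ in name and shape.
public class EndpointSketch {
  static class Endpoint {
    final String api;          // identifier of the API being exposed
    final String addressType;  // "uri" today; perhaps "filesystem" later
    final String address;      // where to reach it

    Endpoint(String api, String addressType, String address) {
      this.api = api;
      this.addressType = addressType;
      this.address = address;
    }
  }

  public static void main(String[] args) {
    // the registry entry only lists the API + URL; the bulk data
    // (generated config files etc.) is served from the AM itself
    Endpoint configPublisher = new Endpoint(
        "example.publisher.configurations",
        "uri",
        "http://am-host:1027/ws/v1/publisher");

    // a possible "filesystem" address type for content parked in HDFS/S3
    Endpoint sharedConfig = new Endpoint(
        "example.shared.config",
        "filesystem",
        "hdfs://nn1:8020/services/myapp/conf");

    System.out.println(configPublisher.api + " -> " + configPublisher.address);
    System.out.println(sharedConfig.api + " -> " + sharedConfig.address);
  }
}
{code}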



> Add a way to register long-lived services in a YARN cluster
> -----------------------------------------------------------
>
>                 Key: YARN-913
>                 URL: https://issues.apache.org/jira/browse/YARN-913
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: api, resourcemanager
>    Affects Versions: 2.5.0, 2.4.1
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>         Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 
> 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, 
> YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, 
> YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, 
> YARN-913-007.patch, YARN-913-008.patch, yarnregistry.pdf, yarnregistry.tla
>
>
> In a YARN cluster you can't predict where services will come up -or on what 
> ports. The services need to work those things out as they come up and then 
> publish them somewhere.
> Applications need to be able to find the service instance they are to bond to 
> -and not any others in the cluster.
> Some kind of service registry -in the RM, in ZK, could do this. If the RM 
> held the write access to the ZK nodes, it would be more secure than having 
> apps register with ZK themselves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
