Eric Yang created YARN-8569:
-------------------------------

             Summary: Create an interface to provide cluster information to application
                 Key: YARN-8569
                 URL: https://issues.apache.org/jira/browse/YARN-8569
             Project: Hadoop YARN
          Issue Type: Sub-task
            Reporter: Eric Yang


Some programs need to know the hostnames of their containers before the 
application can run.  For example, distributed TensorFlow requires a 
launch_command that looks like:

{code}
# On ps0.example.com:
$ python trainer.py \
     --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
     --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
     --job_name=ps --task_index=0
# On ps1.example.com:
$ python trainer.py \
     --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
     --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
     --job_name=ps --task_index=1
# On worker0.example.com:
$ python trainer.py \
     --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
     --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
     --job_name=worker --task_index=0
# On worker1.example.com:
$ python trainer.py \
     --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
     --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
     --job_name=worker --task_index=1
{code}

This is cumbersome to orchestrate via Distributed Shell or the YARN services 
launch_command.  In addition, the dynamic parameters do not work with the YARN 
flex command.  This is a classic pain point for application developers who try 
to automate passing system environment settings as parameters to the end-user 
application.
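
To make the shape of the problem concrete, here is a minimal sketch of the per-container arguments that something has to compute for the example above.  The helper name build_trainer_args and its port default are illustrative assumptions, not part of YARN or TensorFlow.

```python
def build_trainer_args(ps_hosts, worker_hosts, job_name, task_index, port=2222):
    """Return the argv for one trainer.py process (hypothetical helper)."""
    if job_name not in ("ps", "worker"):
        raise ValueError(f"unknown job_name: {job_name!r}")

    # The task index must point at an existing host of the chosen role.
    role_hosts = ps_hosts if job_name == "ps" else worker_hosts
    if not 0 <= task_index < len(role_hosts):
        raise IndexError(f"task_index {task_index} out of range for {job_name}")

    def join(hosts):
        # Every container receives the full host:port list for both roles.
        return ",".join(f"{host}:{port}" for host in hosts)

    return [
        "python",
        "trainer.py",
        f"--ps_hosts={join(ps_hosts)}",
        f"--worker_hosts={join(worker_hosts)}",
        f"--job_name={job_name}",
        f"--task_index={task_index}",
    ]
```

Every container needs the complete host lists, and only job_name and task_index differ, so any change in the number of containers changes the arguments of all of them.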

It would be great if the YARN Docker integration could provide a simple option 
to expose the hostnames of the YARN service via a mounted file.  The file 
content would be updated whenever the flex command is performed.  This would 
allow application developers to consume system environment settings via a 
standard interface.  It is like /proc/devices for Linux, but for Hadoop.  This 
may involve updating a file in the distributed cache and allowing the file to 
be mounted via container-executor.
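
As a rough sketch of how an application might consume such a file: the JSON layout (a mapping from component name to a list of hostnames), the function names, and the file path are all assumptions made for illustration, since this issue does not define a format.

```python
import json
import socket


def load_cluster_hosts(path):
    """Read a hypothetical cluster file, e.g. {"ps": [...], "worker": [...]}."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def locate_self(cluster, hostname=None):
    """Return (job_name, task_index) for the given or local hostname."""
    hostname = hostname or socket.getfqdn()
    for job_name, hosts in cluster.items():
        if hostname in hosts:
            # The position in the list serves as the task index.
            return job_name, hosts.index(hostname)
    raise LookupError(f"{hostname} not found in cluster file")
```

With something like this, each container could derive its own job name and task index from its hostname at startup, and re-read the file after a flex operation instead of relying on parameters baked into launch_command.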


