[jira] [Commented] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN
[ https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550182#comment-16550182 ]

Saisai Shao commented on SPARK-24723:
-------------------------------------

Hi [~mengxr], I don't think YARN has a feature to configure password-less SSH on all containers. YARN itself doesn't rely on SSH, and in our deployment (Ambari) we don't use password-less SSH.
{quote}And does container by default run sshd? If not, which process is responsible for starting/terminating the daemon?
{quote}
If the container is not dockerized, it shares the system's sshd, and it is the system's responsibility to start/terminate this daemon. If the container is dockerized, I think the docker container should be responsible for starting sshd (IIUC).

Maybe we should check whether sshd is started before starting the MPI job. If sshd is not started, we simply cannot run the MPI job, no matter who is responsible for the sshd daemon.

[~leftnoteasy] might have some thoughts, since he is the originator of mpich2-yarn.

> Discuss necessary info and access in barrier mode + YARN
> --------------------------------------------------------
>
>                 Key: SPARK-24723
>                 URL: https://issues.apache.org/jira/browse/SPARK-24723
>             Project: Spark
>          Issue Type: Story
>          Components: ML, Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Xiangrui Meng
>            Assignee: Saisai Shao
>            Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to
> provide users sufficient info and access so they can set up a hybrid
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + YARN. There were some
> past attempts from the Hadoop community. So we should find someone with good
> knowledge to lead the discussion here.
>
> Requirements:
> * understand how to set up YARN to run MPI job as a YARN application
> * figure out how to do it with Spark w/ Barrier

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
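
The suggestion in the comment above, to check whether sshd is started before launching the MPI job, could be sketched roughly as follows. This is only an illustration: SshdCheck and isListening are hypothetical names, not part of Spark, YARN, or mpich2-yarn, and port 22 is an assumed default. A successful TCP connect only shows that something is listening on the port. It does not prove that the listener is sshd or that key-based login will work.

```scala
import java.io.IOException
import java.net.{InetSocketAddress, Socket}

// Hypothetical helper for illustration; not a Spark or YARN API.
object SshdCheck {
  // Returns true if a TCP connection to host:port succeeds within timeoutMs.
  // Port 22 is assumed as the default; a cluster may run sshd elsewhere.
  def isListening(host: String, port: Int = 22, timeoutMs: Int = 2000): Boolean = {
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress(host, port), timeoutMs)
      true
    } catch {
      // Connection refused, timeout, and unknown host are all IOExceptions.
      case _: IOException => false
    } finally {
      socket.close()
    }
  }
}
```

The task that launches mpirun could call SshdCheck.isListening(host) for every host it intends to put in the machine file and fail early, with a clear error, if any host is unreachable.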
[jira] [Commented] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN
[ https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550048#comment-16550048 ]

Xiangrui Meng commented on SPARK-24723:
---------------------------------------

[~jerryshao] Does YARN have the feature that will by default configure passwordless SSH on all containers (or per application)? If Spark generates the key files in barrier mode on YARN, it might break this feature provided by YARN.

And does container by default run sshd? If not, which process is responsible for starting/terminating the daemon?
[jira] [Commented] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN
[ https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534565#comment-16534565 ]

Saisai Shao commented on SPARK-24723:
-------------------------------------

[~mengxr] [~jiangxb1987]

There's one solution to handle the password-less SSH problem for all cluster managers programmatically. It is borrowed from the MPI on YARN framework [https://github.com/alibaba/mpich2-yarn].

In this MPI on YARN framework, before launching the MPI job, the application master (master) generates an SSH private key and public key and then propagates the public key to all the containers (workers). During container start, each container writes the public key to its local authorized_keys file, so after that, the MPI job started from the master node can ssh to all the containers in a password-less manner. After the MPI job is finished, all the containers delete this public key from the authorized_keys file to revert the environment.

In our case, we could do this in a similar way: before launching the MPI job, the 0-th task could also generate an SSH private key and public key, and then propagate the public key to all the barrier tasks (maybe through BarrierTaskContext). The other tasks could receive the public key from the 0-th task and write it to the authorized_keys file (maybe by BarrierTaskContext). After this, password-less ssh is set up, and mpirun from the 0-th task could be started without a password. After the MPI job is finished, all the barrier tasks could delete this public key from the authorized_keys file to revert the environment.

The example code is like below:

    rdd.barrier().mapPartitions { (iter, context) =>
      // Write iter to disk.
      ???
      // Wait until all tasks finished writing.
      context.barrier()
      // The 0-th task launches an MPI job.
      if (context.partitionId() == 0) {
        // generate and propagate ssh keys.
        // Wait for keys to set up in other tasks.
        val hosts = context.getTaskInfos().map(_.host)
        // Set up MPI machine file using host infos.
        ???
        // Launch the MPI job by calling mpirun.
        ???
      } else {
        // get and setup public key
        // notify 0-th task that public key is setup.
      }
      // Wait until the MPI job finished.
      context.barrier()
      // Delete SSH key and revert the environment.
      // Collect output and return.
      ???
    }

What is your opinion about this solution?
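
The two authorized_keys steps proposed in the comment above (write the public key before the MPI job, delete it afterwards) might look roughly like the sketch below. It is an assumption-laden illustration, not the mpich2-yarn implementation and not a Spark API: AuthorizedKeys, install, and remove are hypothetical names, the file path is supplied by the caller, and file permissions, locking against concurrent writers, and key generation are deliberately left out.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path}

// Hypothetical helper for illustration; not part of Spark or mpich2-yarn.
object AuthorizedKeys {
  // Read the file as trimmed, non-empty lines. A missing file reads as empty.
  private def readLines(file: Path): List[String] =
    if (Files.exists(file))
      new String(Files.readAllBytes(file), UTF_8)
        .split("\n").map(_.trim).filter(_.nonEmpty).toList
    else Nil

  // Rewrite the whole file, one entry per line. Blank lines are not preserved.
  private def writeLines(file: Path, lines: List[String]): Unit =
    Files.write(file, lines.map(_ + "\n").mkString.getBytes(UTF_8))

  // Add the public key unless an identical entry is already present.
  def install(file: Path, publicKey: String): Unit = {
    val key = publicKey.trim
    val existing = readLines(file)
    if (!existing.contains(key)) writeLines(file, existing :+ key)
  }

  // Remove every entry equal to the public key and keep all other entries.
  def remove(file: Path, publicKey: String): Unit =
    if (Files.exists(file)) writeLines(file, readLines(file).filterNot(_ == publicKey.trim))
}
```

In the barrier sketch, each non-zero task would call install after receiving the key from the 0-th task, and every task would call remove after the final context.barrier(). Wrapping the MPI phase in try/finally would keep a failed job from leaving the key behind.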
[jira] [Commented] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN
[ https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534529#comment-16534529 ]

Saisai Shao commented on SPARK-24723:
-------------------------------------

After discussing with Xiangrui offline, resource reservation is not the key focus here. The main problem is how to provide the necessary information for barrier tasks to start an MPI job in a password-less manner.
[jira] [Commented] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN
[ https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533205#comment-16533205 ]

Saisai Shao commented on SPARK-24723:
-------------------------------------

Hi [~mengxr], I would like to know the goal of this ticket. The goal of the barrier scheduler is to offer gang semantics at the task scheduling level, whereas gang semantics at the YARN level is more about the resource level.

I discussed with [~leftnoteasy] the feasibility of supporting gang semantics on YARN. YARN has a Reservation System which supports gang-like semantics (reserving the requested resources), but it is not designed for gang scheduling. Here are some thoughts about supporting it on YARN [https://docs.google.com/document/d/1OA-iVwuHB8wlzwwlrEHOK6Q2SlKy3-5QEB5AmwMVEUU/edit?usp=sharing]. I'm not sure if it aligns with your goal for this ticket.