[jira] [Commented] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN

2018-07-19 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550182#comment-16550182
 ] 

Saisai Shao commented on SPARK-24723:
-

Hi [~mengxr], I don't think YARN has such feature to configure password-less 
SSH on all containers. YARN itself doesn't rely on SSH, and in our deployment 
(Ambari), we don't have use password-less ssh.
{quote}And does container by default run sshd? If not, which process is 
responsible for starting/terminating the daemon?
{quote}
If the container is is not dockerized, so it will share with system's sshd, it 
is system's responsibility to start/terminate this daemon.

If the container is dockerized, I think the docker container should be 
responsible for starting sshd (IIUC).

Maybe we should check if sshd is started before starting MPI job, if sshd is 
not started, simply we cannot run MPI job no matter who is responsible for sshd 
daemon.

[~leftnoteasy] might have some thoughts, since he is the originator of 
mpich2-yarn.

 

> Discuss necessary info and access in barrier mode + YARN
> 
>
> Key: SPARK-24723
> URL: https://issues.apache.org/jira/browse/SPARK-24723
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Saisai Shao
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + YARN. There were some 
> past attempts from the Hadoop community. So we should find someone with good 
> knowledge to lead the discussion here.
>  
> Requirements:
>  * understand how to set up YARN to run MPI job as a YARN application
>  * figure out how to do it with Spark w/ Barrier



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN

2018-07-19 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550048#comment-16550048
 ] 

Xiangrui Meng commented on SPARK-24723:
---

[~jerryshao] Does YARN have the feature that will by default configure 
passwordless SSH on all containers (or per application)? If Spark generates the 
key files in barrier mode on YARN, it might break this feature provided by 
YARN. And does container by default run sshd? If not, which process is 
responsible for starting/terminating the daemon?

> Discuss necessary info and access in barrier mode + YARN
> 
>
> Key: SPARK-24723
> URL: https://issues.apache.org/jira/browse/SPARK-24723
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + YARN. There were some 
> past attempts from the Hadoop community. So we should find someone with good 
> knowledge to lead the discussion here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN

2018-07-06 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534565#comment-16534565
 ] 

Saisai Shao commented on SPARK-24723:
-

[~mengxr] [~jiangxb1987]

There's one solution to handle password-less SSH problem for all cluster 
manager in a programming way. This is referred from MPI on YARN framework 
[https://github.com/alibaba/mpich2-yarn]

In this MPI on YARN framework, before launching MPI job, application master 
(master) will generate ssh private key and public key and then propagate the 
public key to all the containers (worker), during container start, it will 
write public key to local authorized_keys file, so after that, MPI job started 
from master node can ssh with all the containers in password-less manner. After 
MPI job is finished, all the containers would delete this public key from 
authorized_keys file to revert the environment.

In our case, we could do this in a similar way, before launching MPI job, 0-th 
task could also generate ssh private key and public key, and then propagate the 
public keys to all the barrier task (maybe through BarrierTaskContext). For 
other tasks, they could receive public key from 0-th task and write public key 
to authorized_keys file (maybe by BarrierTaskContext). After this, 
password-less ssh is set up, mpirun from 0-th task could be started without 
password. After MPI job is finished, all the barrier tasks could delete this 
public key from authorized_keys file to revert the environment.

The example code is like below:

 
rdd.barrier().mapPartitions { (iter, context) =>
  // Write iter to disk.???
  // Wait until all tasks finished writing.
  context.barrier()
  // The 0-th task launches an MPI job.
  if (context.partitionId() == 0) {
// generate and propagate ssh keys.
// Wait for keys to set up in other tasks.

val hosts = context.getTaskInfos().map(_.host)  
// Set up MPI machine file using host infos.  ???
// Launch the MPI job by calling mpirun.  ???
  } else {
// get and setup public key
// notify 0-th task that pubic key is setup.
  }  
  // Wait until the MPI job finished.
  context.barrier()

  // Delete SSH key and revert the environment.  
  // Collect output and return.???  
}
 

What is your opinion about this solution?

> Discuss necessary info and access in barrier mode + YARN
> 
>
> Key: SPARK-24723
> URL: https://issues.apache.org/jira/browse/SPARK-24723
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + YARN. There were some 
> past attempts from the Hadoop community. So we should find someone with good 
> knowledge to lead the discussion here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN

2018-07-06 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534529#comment-16534529
 ] 

Saisai Shao commented on SPARK-24723:
-

After discussed with Xiangrui offline, resource reservation is not the key 
focus here. Here the main problem is how to provide necessary information for 
barrier tasks to start MPI job in a password-less manner.

> Discuss necessary info and access in barrier mode + YARN
> 
>
> Key: SPARK-24723
> URL: https://issues.apache.org/jira/browse/SPARK-24723
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + YARN. There were some 
> past attempts from the Hadoop community. So we should find someone with good 
> knowledge to lead the discussion here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN

2018-07-04 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533205#comment-16533205
 ] 

Saisai Shao commented on SPARK-24723:
-

Hi [~mengxr], I would like to know the goal of this ticket? The goal of barrier 
scheduler is to offer gang semantics in the task scheduling level, whereas the 
gang semantics in the YARN level is more regarding to resource level.

I discussed with [~leftnoteasy] about the feasibility of supporting gang 
semantics on YARN. YARN has Reservation System which support gang like 
semantics (reserve requested resources), but it is not designed for gang. Here 
is some thoughts about supporting it on YARN 
[https://docs.google.com/document/d/1OA-iVwuHB8wlzwwlrEHOK6Q2SlKy3-5QEB5AmwMVEUU/edit?usp=sharing],
 I'm not sure if it aligns your goal of this ticket.

> Discuss necessary info and access in barrier mode + YARN
> 
>
> Key: SPARK-24723
> URL: https://issues.apache.org/jira/browse/SPARK-24723
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + YARN. There were some 
> past attempts from the Hadoop community. So we should find someone with good 
> knowledge to lead the discussion here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org