[jira] [Updated] (SPARK-27941) Serverless Spark in the Cloud

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27941:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Serverless Spark in the Cloud
> -----------------------------
>
> Key: SPARK-27941
> URL: https://issues.apache.org/jira/browse/SPARK-27941
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, Deploy, Scheduler, Security, Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Shuheng Dai
>Priority: Major
>
> Public cloud providers have started offering serverless container services. 
> For example, AWS offers Fargate [https://aws.amazon.com/fargate/].
> This opens up the possibility to run Spark workloads in a serverless manner 
> and remove the need to provision, maintain and manage a cluster. POC: 
> [https://github.com/mu5358271/spark-on-fargate]
> While it might not make sense for Spark to favor any particular cloud 
> provider or to support a large number of cloud providers natively, it would 
> make sense to make some of the internal Spark components more pluggable and 
> cloud-friendly so that it is easier for various cloud providers to integrate. 
> For example, 
>  * authentication: IO and network encryption require authentication via 
> securely sharing a secret, and the implementation of this is currently tied 
> to the cluster manager: YARN uses the Hadoop UGI, Kubernetes uses a shared 
> file mounted on all pods. These could be decoupled so that it is possible to 
> swap in an implementation backed by a public cloud. In the POC, this is done 
> by passing around an AWS KMS-encrypted secret and decrypting it at each 
> executor, which delegates authentication and authorization to the cloud (see 
> the first sketch after this list).
>  * deployment & scheduler: adding a new cluster manager and scheduler backend 
> requires changing a number of places in the Spark core package and rebuilding 
> the entire project. Having a pluggable scheduler per 
> https://issues.apache.org/jira/browse/SPARK-19700 would make it easier to add 
> scheduler backends backed by different cloud providers (see the second sketch 
> after this list).
>  * client-cluster communication: I am not very familiar with the network part 
> of the code base, so I might be wrong on this. My understanding is that the 
> code base assumes that the client and the cluster are on the same network and 
> that the nodes communicate with each other via hostname/IP. Security best 
> practice advises running the executors in a private, protected network, which 
> may be separate from the client machine's network. Since the cluster is 
> serverless, the client needs to first launch the driver into the private 
> network, and the driver in turn starts the executors, potentially doubling 
> job initialization time. This could be solved by giving up complete 
> serverlessness and keeping a persistent host in the private network, or (I do 
> not have a POC, so I am not sure this actually works) by implementing 
> client-cluster communication via message queues in the cloud to get around 
> the network separation.
>  * shuffle storage and retrieval: external shuffle on YARN relies on the 
> existence of a persistent cluster that continues to serve shuffle files 
> beyond the lifecycle of the executors. This assumption no longer holds in a 
> serverless cluster with only transient containers. Pluggable remote shuffle 
> storage per https://issues.apache.org/jira/browse/SPARK-25299 would make it 
> easier to introduce new cloud-backed shuffle implementations.
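
A minimal sketch of the authentication bullet above, assuming the AWS SDK for 
Java v2 is on the classpath; the KMS key id and the launcher/executor hook 
points are illustrative, not an existing Spark API:

{code:scala}
// Sketch only: the launcher encrypts the randomly generated Spark auth
// secret under a KMS key, and each executor decrypts it at startup. Only a
// principal whose IAM role allows kms:Decrypt on that key can recover the
// secret, which is how authentication/authorization is delegated to the cloud.
import java.util.Base64

import software.amazon.awssdk.core.SdkBytes
import software.amazon.awssdk.services.kms.KmsClient
import software.amazon.awssdk.services.kms.model.{DecryptRequest, EncryptRequest}

object KmsAuthSecret {
  private val kms = KmsClient.create() // region/credentials from the environment

  // Driver/launcher side: encrypt the secret before shipping it around.
  def encrypt(keyId: String, secret: Array[Byte]): String = {
    val resp = kms.encrypt(EncryptRequest.builder()
      .keyId(keyId) // hypothetical key id or alias owned by the operator
      .plaintext(SdkBytes.fromByteArray(secret))
      .build())
    Base64.getEncoder.encodeToString(resp.ciphertextBlob().asByteArray())
  }

  // Executor side: decrypt at startup; fails unless IAM allows kms:Decrypt.
  def decrypt(ciphertextB64: String): Array[Byte] = {
    val resp = kms.decrypt(DecryptRequest.builder()
      .ciphertextBlob(SdkBytes.fromByteArray(Base64.getDecoder.decode(ciphertextB64)))
      .build())
    resp.plaintext().asByteArray()
  }
}
{code}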
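
A sketch of the pluggable-scheduler point: Spark already discovers cluster 
managers through the ExternalClusterManager trait (today private[spark]) via 
java.util.ServiceLoader, and SPARK-19700 is essentially about turning that 
into a supported public extension point. The fargate:// URL scheme and the 
backend below are hypothetical:

{code:scala}
// Sketch against the existing (private[spark]) trait; a real plug-in would
// need SPARK-19700 to expose it publicly.
package org.apache.spark.scheduler.cloud

import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{ExternalClusterManager, SchedulerBackend, TaskScheduler, TaskSchedulerImpl}
import org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend

class FargateClusterManager extends ExternalClusterManager {

  // Claim master URLs of the form "fargate://<ecs-cluster-name>".
  override def canCreate(masterURL: String): Boolean =
    masterURL.startsWith("fargate://")

  override def createTaskScheduler(sc: SparkContext, masterURL: String): TaskScheduler =
    new TaskSchedulerImpl(sc)

  override def createSchedulerBackend(
      sc: SparkContext,
      masterURL: String,
      scheduler: TaskScheduler): SchedulerBackend =
    new FargateSchedulerBackend(scheduler.asInstanceOf[TaskSchedulerImpl], sc)

  override def initialize(scheduler: TaskScheduler, backend: SchedulerBackend): Unit =
    scheduler.asInstanceOf[TaskSchedulerImpl].initialize(backend)
}

// Hypothetical backend: a real one would call the ECS/Fargate APIs from
// doRequestTotalExecutors to launch executor containers on demand.
class FargateSchedulerBackend(scheduler: TaskSchedulerImpl, sc: SparkContext)
  extends CoarseGrainedSchedulerBackend(scheduler, sc.env.rpcEnv)

// Registered via a META-INF/services/org.apache.spark.scheduler.ExternalClusterManager file.
{code}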



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27941) Serverless Spark in the Cloud

2019-06-04 Thread Shuheng Dai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuheng Dai updated SPARK-27941:

Description: 
Public cloud providers have started offering serverless container services. For 
example, AWS offers Fargate [https://aws.amazon.com/fargate/].

This opens up the possibility to run Spark workloads in a serverless manner and 
remove the need to provision, maintain and manage a cluster. POC: 
[https://github.com/mu5358271/spark-on-fargate]

While it might not make sense for Spark to favor any particular cloud provider 
or to support a large number of cloud providers natively, it would make sense 
to make some of the internal Spark components more pluggable and cloud-friendly 
so that it is easier for various cloud providers to integrate. For example, 
 * authentication: IO and network encryption require authentication via 
securely sharing a secret, and the implementation of this is currently tied to 
the cluster manager: YARN uses the Hadoop UGI, Kubernetes uses a shared file 
mounted on all pods. These could be decoupled so that it is possible to swap in 
an implementation backed by a public cloud. In the POC, this is done by passing 
around an AWS KMS-encrypted secret and decrypting it at each executor, which 
delegates authentication and authorization to the cloud.
 * deployment & scheduler: adding a new cluster manager and scheduler backend 
requires changing a number of places in the Spark core package and rebuilding 
the entire project. Having a pluggable scheduler per 
https://issues.apache.org/jira/browse/SPARK-19700 would make it easier to add 
scheduler backends backed by different cloud providers.
 * client-cluster communication: I am not very familiar with the network part 
of the code base, so I might be wrong on this. My understanding is that the 
code base assumes that the client and the cluster are on the same network and 
that the nodes communicate with each other via hostname/IP. Security best 
practice advises running the executors in a private, protected network, which 
may be separate from the client machine's network. Since the cluster is 
serverless, the client needs to first launch the driver into the private 
network, and the driver in turn starts the executors, potentially doubling job 
initialization time. This could be solved by giving up complete serverlessness 
and keeping a persistent host in the private network, or (I do not have a POC, 
so I am not sure this actually works) by implementing client-cluster 
communication via message queues in the cloud to get around the network 
separation (a sketch of this idea follows the list).
 * shuffle storage and retrieval: external shuffle on YARN relies on the 
existence of a persistent cluster that continues to serve shuffle files beyond 
the lifecycle of the executors. This assumption no longer holds in a serverless 
cluster with only transient containers. Pluggable remote shuffle storage per 
https://issues.apache.org/jira/browse/SPARK-25299 would make it easier to 
introduce new cloud-backed shuffle implementations (see the second sketch after 
this list).
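
The message-queue idea in the client-cluster communication bullet is 
speculative (there is no POC), so the sketch below only illustrates its shape, 
assuming the AWS SDK for Java v2 and hypothetical queue URLs: the client drops 
a launch request onto a queue that a bootstrap process inside the private 
network consumes, and replies flow back over a second queue, so neither side 
needs a direct route to the other.

{code:scala}
// Illustration only; no POC exists for queue-based client-cluster control.
import scala.collection.JavaConverters._

import software.amazon.awssdk.services.sqs.SqsClient
import software.amazon.awssdk.services.sqs.model.{ReceiveMessageRequest, SendMessageRequest}

object QueueChannel {
  private val sqs = SqsClient.create()

  // Client side: ask for a driver to be launched in the private network.
  def submit(requestQueueUrl: String, jobSpecJson: String): Unit =
    sqs.sendMessage(SendMessageRequest.builder()
      .queueUrl(requestQueueUrl)
      .messageBody(jobSpecJson)
      .build())

  // Either side: long-poll for the next control message, if any.
  def poll(queueUrl: String): Option[String] = {
    val resp = sqs.receiveMessage(ReceiveMessageRequest.builder()
      .queueUrl(queueUrl)
      .waitTimeSeconds(20) // long polling keeps the request volume low
      .maxNumberOfMessages(1)
      .build())
    resp.messages().asScala.headOption.map(_.body())
  }
}
{code}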
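
For the shuffle bullet, a conceptual sketch of cloud-backed shuffle storage: 
map outputs are keyed by (app, shuffle, map, reduce) in object storage, so 
reducers fetch blocks from S3 rather than from the possibly already-gone 
executor that wrote them. The bucket name is hypothetical, and this 
deliberately ignores the actual writer/reader interfaces proposed in 
SPARK-25299.

{code:scala}
// Shows why object storage removes the need for a persistent shuffle
// service; it is not the SPARK-25299 API itself.
import software.amazon.awssdk.core.sync.RequestBody
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.{GetObjectRequest, PutObjectRequest}

object RemoteShuffleStore {
  private val s3 = S3Client.create()
  private val bucket = "spark-shuffle-poc-bucket" // hypothetical

  private def key(appId: String, shuffleId: Int, mapId: Long, reduceId: Int): String =
    s"$appId/shuffle/$shuffleId/$mapId/$reduceId"

  // Map side: persist one partition's bytes beyond the container's lifetime.
  def writeBlock(appId: String, shuffleId: Int, mapId: Long, reduceId: Int,
                 data: Array[Byte]): Unit =
    s3.putObject(
      PutObjectRequest.builder().bucket(bucket)
        .key(key(appId, shuffleId, mapId, reduceId)).build(),
      RequestBody.fromBytes(data))

  // Reduce side: fetch the block no matter which executor produced it.
  def readBlock(appId: String, shuffleId: Int, mapId: Long, reduceId: Int): Array[Byte] =
    s3.getObjectAsBytes(
      GetObjectRequest.builder().bucket(bucket)
        .key(key(appId, shuffleId, mapId, reduceId)).build()
    ).asByteArray()
}
{code}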

  was:
Public cloud providers have started offering serverless container services. For 
example, AWS offers Fargate [https://aws.amazon.com/fargate/]

This opens up the possibility to run Spark workloads in a serverless manner and 
remove the need to provision and maintain a cluster. POC: 
[https://github.com/mu5358271/spark-on-fargate]

While it might not make sense for Spark to favor any particular cloud provider 
or to support a large number of cloud providers natively. It would make sense 
to make some of the internal Spark components more pluggable and cloud friendly 
so that it is easier for various cloud providers to integrate. For example, 
 * authentication: IO and network encryption requires authentication via 
securely sharing a secret, and the implementation of this is currently tied to 
the cluster manager: yarn uses hadoop ugi, kubernetes uses a shared file 
mounted on all pods. These can be decoupled so it is possible to swap in 
implementation using public cloud. In the POC, this is implemented by passing 
around AWS KMS encrypted secret and decrypting the secret at each executor, 
which delegate authentication and authorization to the cloud.
 * deployment & scheduler: adding a new cluster manager and scheduler backend 
requires changing a number of places in the Spark core package and rebuilding 
the entire project. Having a pluggable scheduler per 
https://issues.apache.org/jira/browse/SPARK-19700 would make it easier to add 
different scheduler backends backed by different cloud providers.
 * client-cluster communication: I am not very familiar with the network part 
of the code base so I might be wrong on this. My understanding is that the code 
base assumes that the client and the cluster are on the same network and the 
nodes communicate with each other via hostname/ip. 
 * shuffle storage and retrieval: 



[jira] [Updated] (SPARK-27941) Serverless Spark in the Cloud

2019-06-04 Thread Shuheng Dai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuheng Dai updated SPARK-27941:

Description: 
Public cloud providers have started offering serverless container services. For 
example, AWS offers Fargate [https://aws.amazon.com/fargate/]

This opens up the possibility to run Spark workloads in a serverless manner and 
remove the need to provision and maintain a cluster. POC: 
[https://github.com/mu5358271/spark-on-fargate]

While it might not make sense for Spark to favor any particular cloud provider 
or to support a large number of cloud providers natively. It would make sense 
to make some of the internal Spark components more pluggable and cloud friendly 
so that it is easier for various cloud providers to integrate. For example, 
 * authentication: IO and network encryption requires authentication via 
securely sharing a secret, and the implementation of this is currently tied to 
the cluster manager: yarn uses hadoop ugi, kubernetes uses a shared file 
mounted on all pods. These can be decoupled so it is possible to swap in 
implementation using public cloud. In the POC, this is implemented by passing 
around AWS KMS encrypted secret and decrypting the secret at each executor, 
which delegate authentication and authorization to the cloud.
 * deployment & scheduler: adding a new cluster manager and scheduler backend 
requires changing a number of places in the Spark core package and rebuilding 
the entire project. Having a pluggable scheduler per 
https://issues.apache.org/jira/browse/SPARK-19700 would make it easier to add 
different scheduler backends backed by different cloud providers.
 * client-cluster communication: I am not very familiar with the network part 
of the code base so I might be wrong on this. My understanding is that the code 
base assumes that the client and the cluster are on the same network and the 
nodes communicate with each other via hostname/ip. 
 * shuffle storage and retrieval: 

  was:
Public cloud providers have started offering serverless container services. For 
example, AWS offers Fargate [https://aws.amazon.com/fargate/]

This opens up the possibility to run Spark workloads in a serverless manner and 
remove the need to provision and maintain a cluster. POC: 
[https://github.com/mu5358271/spark-on-fargate]

While it might not make sense for Spark to favor any particular cloud provider 
or to support a large number of cloud providers natively. It would make sense 
to make some of the internal Spark components more pluggable and cloud friendly 
so that it is easier for various cloud providers to integrate. For example, 
 * authentication: IO and network encryption requires authentication via 
securely sharing a secret, and the implementation of this is currently tied to 
the cluster manager: yarn uses hadoop ugi, kubernetes uses a shared file 
mounted on all pods. These can be decoupled so it is possible to swap in 
implementation using public cloud. In the POC, this is implemented by passing 
around AWS KMS encrypted secret and decrypting the secret at each executor, 
which delegate authentication and authorization to the cloud.
 * deployment & scheduler: adding a new cluster manager and scheduler backend 
requires changing a number of places in the Spark core package, and rebuilding 
the entire project. 
 * driver-executor communication: 
 * shuffle storage and retrieval: 


> Serverless Spark in the Cloud
> -----------------------------
>
> Key: SPARK-27941
> URL: https://issues.apache.org/jira/browse/SPARK-27941
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, Deploy, Scheduler, Security, Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Shuheng Dai
>Priority: Major
>
> Public cloud providers have started offering serverless container services. 
> For example, AWS offers Fargate [https://aws.amazon.com/fargate/]
> This opens up the possibility to run Spark workloads in a serverless manner 
> and remove the need to provision and maintain a cluster. POC: 
> [https://github.com/mu5358271/spark-on-fargate]
> While it might not make sense for Spark to favor any particular cloud 
> provider or to support a large number of cloud providers natively. It would 
> make sense to make some of the internal Spark components more pluggable and 
> cloud friendly so that it is easier for various cloud providers to integrate. 
> For example, 
>  * authentication: IO and network encryption requires authentication via 
> securely sharing a secret, and the implementation of this is currently tied 
> to the cluster manager: yarn uses hadoop ugi, kubernetes uses a shared file 
> mounted on all pods. These can be decoupled so it is possible to swap in 
> implementation using public cloud. In the POC, this is implemented by passing 
> around AWS KMS encrypted secret and decrypting the secret at each executor, 
> which delegate authentication and authorization to the cloud.

[jira] [Updated] (SPARK-27941) Serverless Spark in the Cloud

2019-06-04 Thread Shuheng Dai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuheng Dai updated SPARK-27941:

Description: 
Public cloud providers have started offering serverless container services. For 
example, AWS offers Fargate [https://aws.amazon.com/fargate/]

This opens up the possibility to run Spark workloads in a serverless manner and 
remove the need to provision and maintain a cluster. POC: 
[https://github.com/mu5358271/spark-on-fargate]

While it might not make sense for Spark to favor any particular cloud provider 
or to support a large number of cloud providers natively. It would make sense 
to make some of the internal Spark components more pluggable and cloud friendly 
so that it is easier for various cloud providers to integrate. For example, 
 * authentication: IO and network encryption requires authentication via 
securely sharing a secret, and the implementation of this is currently tied to 
the cluster manager: yarn uses hadoop ugi, kubernetes uses a shared file 
mounted on all pods. These can be decoupled so it is possible to swap in 
implementation using public cloud. In the POC, this is implemented by passing 
around AWS KMS encrypted secret and decrypting the secret at each executor, 
which delegate authentication and authorization to the cloud.
 * deployment & scheduler: adding a new cluster manager and scheduler backend 
requires changing a number of places in the Spark core package, and rebuilding 
the entire project. 
 * driver-executor communication: 
 * shuffle storage and retrieval: 

  was:
Public cloud providers have started offering serverless container services. For 
example, AWS offers Fargate [https://aws.amazon.com/fargate/]

This opens up the possibility to run Spark workloads in a serverless manner and 
remove the need to provision and maintain a cluster.

While it might not make sense for Spark to favor any particular cloud provider 
or to support a large number of cloud providers natively. It would make sense 
to make some of the internal Spark components more pluggable and cloud friendly 
so that it is easier for various cloud providers to integrate. For example, 
 * authentication: IO and network encryption requires authentication via 
securely sharing a secret, and the implementation of this is currently tied to 
the cluster manager: yarn uses hadoop ugi, kubernetes uses a shared file 
mounted on all pods. These can be decoupled so it is possible to swap in 
implementation using public cloud. In the POC, this is implemented by passing 
around AWS KMS encrypted secret and delegate authentication and authorization 
to the cloud.
 * deployment & scheduler: adding a new cluster manager requires change a 
number of places in the Spark core package, and rebuilding the project. 
 * driver-executor communication: 
 * shuffle storage and retrieval: 


> Serverless Spark in the Cloud
> -----------------------------
>
> Key: SPARK-27941
> URL: https://issues.apache.org/jira/browse/SPARK-27941
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, Deploy, Scheduler, Security, Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Shuheng Dai
>Priority: Major
>
> Public cloud providers have started offering serverless container services. 
> For example, AWS offers Fargate [https://aws.amazon.com/fargate/]
> This opens up the possibility to run Spark workloads in a serverless manner 
> and remove the need to provision and maintain a cluster. POC: 
> [https://github.com/mu5358271/spark-on-fargate]
> While it might not make sense for Spark to favor any particular cloud 
> provider or to support a large number of cloud providers natively. It would 
> make sense to make some of the internal Spark components more pluggable and 
> cloud friendly so that it is easier for various cloud providers to integrate. 
> For example, 
>  * authentication: IO and network encryption requires authentication via 
> securely sharing a secret, and the implementation of this is currently tied 
> to the cluster manager: yarn uses hadoop ugi, kubernetes uses a shared file 
> mounted on all pods. These can be decoupled so it is possible to swap in 
> implementation using public cloud. In the POC, this is implemented by passing 
> around AWS KMS encrypted secret and decrypting the secret at each executor, 
> which delegate authentication and authorization to the cloud.
>  * deployment & scheduler: adding a new cluster manager and scheduler backend 
> requires changing a number of places in the Spark core package, and 
> rebuilding the entire project. 
>  * driver-executor communication: 
>  * shuffle storage and retrieval: 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org

[jira] [Updated] (SPARK-27941) Serverless Spark in the Cloud

2019-06-04 Thread Shuheng Dai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuheng Dai updated SPARK-27941:

Description: 
Public cloud providers have started offering serverless container services. For 
example, AWS offers Fargate [https://aws.amazon.com/fargate/]

This opens up the possibility to run Spark workloads in a serverless manner and 
remove the need to provision and maintain a cluster.

While it might not make sense for Spark to favor any particular cloud provider 
or to support a large number of cloud providers natively. It would make sense 
to make some of the internal Spark components more pluggable and cloud friendly 
so that it is easier for various cloud providers to integrate. For example, 
 * authentication: IO and network encryption requires authentication via 
securely sharing a secret, and the implementation of this is currently tied to 
the cluster manager: yarn uses hadoop ugi, kubernetes uses a shared file 
mounted on all pods. These can be decoupled so it is possible to swap in 
implementation using public cloud. In the POC, this is implemented by passing 
around AWS KMS encrypted secret and delegate authentication and authorization 
to the cloud.
 * deployment & scheduler: adding a new cluster manager requires change a 
number of places in the Spark core package, and rebuilding the project. 
 * driver-executor communication: 
 * shuffle storage and retrieval: 

  was:
Public cloud providers have started offering serverless container services. For 
example, AWS offers Fargate [https://aws.amazon.com/fargate/]

This opens up the possibility to run Spark workloads in a serverless manner. 

Pluggable authentication

Pluggable scheduler

Pluggable shuffle storage and retrieval

Pluggable driver


> Serverless Spark in the Cloud
> -----------------------------
>
> Key: SPARK-27941
> URL: https://issues.apache.org/jira/browse/SPARK-27941
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, Deploy, Scheduler, Security, Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Shuheng Dai
>Priority: Major
>
> Public cloud providers have started offering serverless container services. 
> For example, AWS offers Fargate [https://aws.amazon.com/fargate/]
> This opens up the possibility to run Spark workloads in a serverless manner 
> and remove the need to provision and maintain a cluster.
> While it might not make sense for Spark to favor any particular cloud 
> provider or to support a large number of cloud providers natively. It would 
> make sense to make some of the internal Spark components more pluggable and 
> cloud friendly so that it is easier for various cloud providers to integrate. 
> For example, 
>  * authentication: IO and network encryption requires authentication via 
> securely sharing a secret, and the implementation of this is currently tied 
> to the cluster manager: yarn uses hadoop ugi, kubernetes uses a shared file 
> mounted on all pods. These can be decoupled so it is possible to swap in 
> implementation using public cloud. In the POC, this is implemented by passing 
> around AWS KMS encrypted secret and delegate authentication and authorization 
> to the cloud.
>  * deployment & scheduler: adding a new cluster manager requires change a 
> number of places in the Spark core package, and rebuilding the project. 
>  * driver-executor communication: 
>  * shuffle storage and retrieval: 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27941) Serverless Spark in the Cloud

2019-06-04 Thread Shuheng Dai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuheng Dai updated SPARK-27941:

Description: 
Public cloud providers have started offering serverless container services. For 
example, AWS offers Fargate [https://aws.amazon.com/fargate/]

This opens up the possibility to run Spark workloads in a serverless manner. 

Pluggable authentication

Pluggable scheduler

Pluggable shuffle storage and retrieval

Pluggable driver

  was:
Public cloud providers have started offering serverless container services. For 
example, AWS offers Fargate [https://aws.amazon.com/fargate/]

This opens up the possibility to run Spark workloads in a serverless manner. 

Pluggable authentication

Pluggable scheduler

Pluggable shuffle storage and retrival

Pluggable driver


> Serverless Spark in the Cloud
> -----------------------------
>
> Key: SPARK-27941
> URL: https://issues.apache.org/jira/browse/SPARK-27941
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, Deploy, Scheduler, Security, Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Shuheng Dai
>Priority: Major
>
> Public cloud providers have started offering serverless container services. 
> For example, AWS offers Fargate [https://aws.amazon.com/fargate/]
> This opens up the possibility to run Spark workloads in a serverless manner. 
> Pluggable authentication
> Pluggable scheduler
> Pluggable shuffle storage and retrieval
> Pluggable driver



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27941) Serverless Spark in the Cloud

2019-06-04 Thread Shuheng Dai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuheng Dai updated SPARK-27941:

Description: 
Public cloud providers have started offering serverless container services. For 
example, AWS offers Fargate [https://aws.amazon.com/fargate/]

This opens up the possibility to run Spark workloads in a serverless manner. 

Pluggable authentication

Pluggable scheduler

Pluggable shuffle storage and retrival

Pluggable driver

  was:
Public cloud providers have started offering serverless container services. For 
example, AWS offers Fargate [https://aws.amazon.com/fargate/]

This opens up the possibility to run Spark workloads in a serverless manner. 

 


> Serverless Spark in the Cloud
> -----------------------------
>
> Key: SPARK-27941
> URL: https://issues.apache.org/jira/browse/SPARK-27941
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, Deploy, Scheduler, Security, Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Shuheng Dai
>Priority: Major
>
> Public cloud providers have started offering serverless container services. 
> For example, AWS offers Fargate [https://aws.amazon.com/fargate/]
> This opens up the possibility to run Spark workloads in a serverless manner. 
> Pluggable authentication
> Pluggable scheduler
> Pluggable shuffle storage and retrival
> Pluggable driver



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org