Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Holden Karau
That's awesome. I'm just starting to get context around Volcano, but maybe
we can schedule an initial meeting for all of us interested in pursuing
this to get on the same page.
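For anyone who wants to experiment before such a meeting: the spark-on-k8s-operator exposes a batch-scheduler hook, so a SparkApplication can be handed to Volcano roughly as below. This is a hedged sketch, not a tested manifest; the name, image tag, queue, and resource values are illustrative, and the exact fields should be checked against the operator release you actually run.

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi-volcano            # illustrative name
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: gcr.io/spark-operator/spark:v3.1.1   # illustrative image tag
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar
  sparkVersion: "3.1.1"
  batchScheduler: volcano           # delegate pod-group scheduling to Volcano
  batchSchedulerOptions:
    queue: default                  # Volcano queue to submit into (assumed name)
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark
  executor:
    instances: 2
    cores: 1
    memory: 512m
```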

On Wed, Jun 23, 2021 at 6:54 PM Klaus Ma  wrote:

> Hi team,
>
> I'm the kube-batch/Volcano founder, and I'm excited to hear that the Spark
> community also has such requirements :)
>
> Volcano provides several features for batch workloads, e.g. fair-share,
> queues, reservation, preemption/reclaim and so on.
> It has been used in several production environments with Spark; if necessary,
> I can give an overall introduction to Volcano's features and those use
> cases :)
>
> -- Klaus
>
> On Wed, Jun 23, 2021 at 11:26 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>>
>>
>> Please allow me to be diverse and express a different point of view on
>> this roadmap.
>>
>>
>> I believe that, from a technical point of view, spending time, effort and
>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>> may say so, I doubt whether such an approach, and the so-called
>> democratization of Spark on whatever platform, should really be a great focus.
>>
>> Having worked on Google Dataproc (a fully managed and highly scalable
>> service for running Apache Spark, Hadoop and, more recently, other
>> artefacts) for the past two years, and on Spark on Kubernetes on-premise, I
>> have come to the conclusion that Spark is not a beast that one can fully
>> commoditize, much like one can with Zookeeper, Kafka etc. There is always a
>> struggle to make some niche areas of Spark, like Spark Structured Streaming
>> (SSS), work seamlessly and effortlessly on these commercial
>> whatever-as-a-Service platforms.
>>
>>
>> Moreover, Spark (and I stand corrected) has a lot of resiliency and
>> redundancy built in from the ground up. It is truly an enterprise-class
>> product (requiring enterprise-class support) that will be difficult to
>> commoditize on Kubernetes while expecting the same performance. After all,
>> Kubernetes is aimed at efficient resource sharing and potential cost
>> savings for the mass market. In short, I can see that commercial
>> enterprises will work on these platforms, but maybe the great talents on
>> the dev team should focus on things like the perceived limitation of SSS in
>> dealing with chains of aggregation (if I am correct, this is not yet
>> supported on streaming datasets).
>>
>>
>> These are my opinions and they are not facts, just opinions so to speak :)
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>>
>>> I think these approaches are good, but there are limitations (e.g. dynamic
>>> scaling) without us making changes inside of the Spark Kube scheduler.
>>>
>>> Certainly, whichever scheduler extensions we add support for, we should
>>> collaborate with the people developing those extensions insofar as they are
>>> interested. The first place I checked was #sig-scheduling, which is
>>> fairly quiet on the Kubernetes slack, but if there are more places to look
>>> for folks interested in batch scheduling on Kubernetes we should definitely
>>> give it a shot :)
>>>
>>> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi,

 Regarding your point and I quote

 "..  I know that one of the Spark on Kube operators
 supports volcano/kube-batch so I was thinking that might be a place I would
 start exploring..."

 There seems to be ongoing work on, say, Volcano as part of the Cloud Native
 Computing Foundation (CNCF), for example through
 https://github.com/volcano-sh/volcano

>>> 

 There may be value-add in collaborating with such groups through CNCF
 in order to have a collective approach to such work. There also seems to be
 some work on the integration of Spark with Volcano for batch scheduling.
 



 What is not very clear is the degree of progress of these projects. You
 may be kind enough to elaborate on the KPIs for each of these projects and
 where you think your contributions are going to be.


 HTH,


 Mich



Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Mich Talebzadeh
Hi Holden,

Thank you for your points. I guess, coming from a corporate world, I had
overlooked how an open-source project like Spark leverages resources
and interest :)

As Klaus kindly volunteered, it would be good to hear scheduling ideas for
Spark on Kubernetes, and since I am sure you have some inroads/ideas
on this subject as well, truly I guess love will be in the air for
Kubernetes :)

HTH







On Thu, 24 Jun 2021 at 16:59, Holden Karau  wrote:

> Hi Mich,
>
> I certainly think making Spark on Kubernetes run well is going to be a
> challenge. However I think, and I could be wrong about this as well, that
> in terms of cluster managers Kubernetes is likely to be our future. Talking
> with people, I don't hear about new standalone, YARN or Mesos deployments
> of Spark, but I do hear about people trying to migrate to Kubernetes.
>
> To be clear, I certainly agree that we need more work on Structured
> Streaming, but it's important to remember that the Spark developers are not
> all fully interchangeable; we work on the things that we're interested in
> pursuing, so even if Structured Streaming needs more love, if I'm not super
> interested in it I'm less likely to work on it. That
> being said, I am certainly spinning up a bit more in the Spark SQL area,
> especially around our data sources/connectors, because I can see the need
> there too.
>
> On Wed, Jun 23, 2021 at 8:26 AM Mich Talebzadeh 
> wrote:

Issue with Running Spark in Jupyter Notebook

2021-06-24 Thread Hsu, Philip
Hi there,

My name is Philip, a master’s student at Imperial College London. I’m trying to 
use Spark to complete my coursework assignment. I ran the following code:

from pyspark import SparkContext
sc = SparkContext.getOrCreate()

and got the following error message:

Py4JJavaError: An error occurred while calling 
None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NoClassDefFoundError: Could not initialize class 
org.sparkproject.jetty.http.MimeTypes
at 
org.sparkproject.jetty.server.handler.gzip.GzipHandler.<init>(GzipHandler.java:190)
at org.apache.spark.ui.ServerInfo.addHandler(JettyUtils.scala:485)
at org.apache.spark.ui.WebUI.$anonfun$bind$3(WebUI.scala:147)
at org.apache.spark.ui.WebUI.$anonfun$bind$3$adapted(WebUI.scala:147)
at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.ui.WebUI.bind(WebUI.scala:147)
at org.apache.spark.SparkContext.$anonfun$new$11(SparkContext.scala:486)
at 
org.apache.spark.SparkContext.$anonfun$new$11$adapted(SparkContext.scala:486)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:486)
at 
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at 
java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)
at 
java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at 
java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at 
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:829)

Meanwhile, my MacBook's terminal is showing the following error messages:


WARNING: An illegal reflective access operation has occurred

pyspark_mongodb_nb  | WARNING: Illegal reflective access by 
org.apache.spark.unsafe.Platform 
(file:/usr/local/spark-3.1.2-bin-hadoop3.2/jars/spark-unsafe_2.12-3.1.2.jar) to 
constructor java.nio.DirectByteBuffer(long,int)

pyspark_mongodb_nb  | WARNING: Please consider reporting this to the 
maintainers of org.apache.spark.unsafe.Platform

pyspark_mongodb_nb  | WARNING: Use --illegal-access=warn to enable warnings of 
further illegal reflective access operations

pyspark_mongodb_nb  | WARNING: All illegal access operations will be denied in 
a future release

pyspark_mongodb_nb  | 21/06/24 06:57:17 WARN NativeCodeLoader: Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable

pyspark_mongodb_nb  | Using Spark's default log4j profile: 
org/apache/spark/log4j-defaults.properties

pyspark_mongodb_nb  | Setting default log level to "WARN".

pyspark_mongodb_nb  | To adjust logging level use sc.setLogLevel(newLevel). For 
SparkR, use setLogLevel(newLevel).

pyspark_mongodb_nb  | 21/06/24 06:57:20 WARN MacAddressUtil: Failed to find a 
usable hardware address from the network interfaces; using random bytes: 
bd:af:a7:b4:a2:46:2a:28

I’m wondering if you could help me resolve the issues I have with my laptop. I 
have a 2020 MacBook Pro with a M1 chip. Thank you so much in advance.

Best,

Philip Hsu


Re: Issue with Running Spark in Jupyter Notebook

2021-06-24 Thread Artemis User
Looks like you didn't set up your environment properly.  I assume you 
are running this from a standalone Python program instead of from the 
pyspark shell.  I would first run your code from the pyspark shell, then 
follow the Spark Python installation guide to set up your Python 
environment properly.  Please note these are extra steps in addition to 
the Spark installation.


-- ND
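To make the "check your environment first" advice concrete, here is a small diagnostic sketch in plain Python. It is not Spark API and not an official procedure; the checks and messages are illustrative assumptions about what commonly goes wrong (a missing JDK on an M1 Mac, or a stale SPARK_HOME).

```python
import os
import shutil


def check_spark_env():
    """Return a list of likely environment problems before creating a SparkContext."""
    problems = []

    # Spark 3.1.x needs a JDK (8 or 11); on an M1 Mac a working arm64 JDK is
    # easy to miss, so verify that java is actually resolvable.
    if "JAVA_HOME" not in os.environ and shutil.which("java") is None:
        problems.append("no java found: set JAVA_HOME or install a JDK (8/11)")

    # If SPARK_HOME is set, it should point at a real Spark distribution.
    spark_home = os.environ.get("SPARK_HOME")
    if spark_home and not os.path.isdir(os.path.join(spark_home, "jars")):
        problems.append(
            f"SPARK_HOME={spark_home} does not look like a Spark install (no jars/)"
        )

    return problems


if __name__ == "__main__":
    for p in check_spark_env():
        print("WARNING:", p)
```

If this reports nothing and the pyspark shell still fails, the problem is more likely the JDK build itself than the Python side.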

On 6/24/21 3:08 AM, Hsu, Philip wrote:







Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Lalwani, Jayesh
You can always chain aggregations by chaining multiple Structured Streaming 
jobs. It’s not a showstopper.

Getting Spark on Kubernetes is important for organizations that want to pursue 
a multi-cloud strategy

From: Mich Talebzadeh 
Date: Wednesday, June 23, 2021 at 11:27 AM
To: "user @spark" 
Cc: dev 
Subject: RE: [EXTERNAL] Spark on Kubernetes scheduler variety










On Fri, 18 Jun 2021 at 00:44, Holden Karau wrote:
Hi Folks,

I'm continuing my adventures to make Spark on containers party and I
was wondering if folks have experience with the different batch

[ANNOUNCE] Apache Spark 3.0.3 released

2021-06-24 Thread Yi Wu
We are happy to announce the availability of Spark 3.0.3!

Spark 3.0.3 is a maintenance release containing stability fixes. This
release is based on the branch-3.0 maintenance branch of Spark. We strongly
recommend that all 3.0 users upgrade to this stable release.

To download Spark 3.0.3, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-0-3.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Yi


Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread John Zhuge
Thanks Klaus! I am interested in more details.

On Wed, Jun 23, 2021 at 6:54 PM Klaus Ma  wrote:


Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Mich Talebzadeh
Thanks Klaus. That will be great.

It would also be helpful if you could elaborate on the need for this feature
in light of the limitations of current batch workloads.

Regards,

Mich







On Thu, 24 Jun 2021 at 02:53, Klaus Ma  wrote:
