Re: Spark on Kubernetes scheduler variety

2021-07-08 Thread Mich Talebzadeh
Splendid.

Please invite me to the next meeting

mich.talebza...@gmail.com

Timezone London, UK  *GMT+1*

Thanks,


   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 8 Jul 2021 at 19:04, Holden Karau  wrote:

> Hi Y'all,
>
> We had an initial meeting which went well, and got some more context around
> Volcano and its near-term roadmap. We talked about the impact of scheduler
> deadlocking and some ways that we could potentially improve integration
> from the Spark and Volcano sides respectively. I'm going to start
> creating some sub-issues under
> https://issues.apache.org/jira/browse/SPARK-36057
>
> If anyone is interested in being in the next meeting, please reach out and
> I'll send an e-mail around to try and schedule a recurring sync that works
> for folks.
>
> Cheers,
>
> Holden
>
> On Thu, Jun 24, 2021 at 8:56 AM Holden Karau  wrote:
>
>> That's awesome, I'm just starting to get context around Volcano but maybe
>> we can schedule an initial meeting for all of us interested in pursuing
>> this to get on the same page.
>>
>> On Wed, Jun 23, 2021 at 6:54 PM Klaus Ma  wrote:
>>
>>> Hi team,
>>>
>>> I'm kube-batch/Volcano founder, and I'm excited to hear that the spark
>>> community also has such requirements :)
>>>
>>> Volcano provides several features for batch workload, e.g. fair-share,
>>> queue, reservation, preemption/reclaim and so on.
>>> It has been used in several product environments with Spark; if
>>> necessary, I can give an overall introduction about Volcano's features and
>>> those use cases :)
>>>
>>> -- Klaus
>>>
>>> On Wed, Jun 23, 2021 at 11:26 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>


 Please allow me to be diverse and express a different point of view on
 this roadmap.


 I believe from a technical point of view spending time and effort plus
 talent on batch scheduling on Kubernetes could be rewarding. However, if I
 may say I doubt whether such an approach and the so-called democratization
 of Spark on whatever platform really should be of great focus.

 Having worked on Google Dataproc (a fully
 managed and highly scalable service for running Apache Spark, Hadoop and
 more recently other artefacts) for the past two years, and Spark on
 Kubernetes on-premise, I have come to the conclusion that Spark is not a
 beast that one can fully commoditize, much as one can with
 Zookeeper, Kafka etc. There is always a struggle to make some niche areas
 of Spark like Spark Structured Streaming (SSS) work seamlessly and
 effortlessly on these commercial platforms with whatever as a Service.


 Moreover, Spark (and I stand corrected) from the ground up has already
 a lot of resiliency and redundancy built in. It is truly an enterprise
 class product (requires enterprise class support) that will be difficult to
 commoditize with Kubernetes and expect the same performance. After all,
 Kubernetes is aimed at efficient resource sharing and potential cost saving
 for the mass market. In short I can see commercial enterprises will work on
 these platforms, but maybe the great talents on the dev team should focus on
 stuff like the perceived limitation of SSS in dealing with chained
 aggregations (if I am correct, this is not yet supported on streaming datasets).


 These are my opinions and they are not facts, just opinions so to speak
 :)






 On Fri, 18 Jun 2021 at 23:18, Holden Karau 
 wrote:

> I think these approaches are good, but there are limitations (eg
> dynamic scaling) without us making changes inside of the Spark Kube
> scheduler.
>
> Certainly whichever scheduler extensions we add support for we should
> collaborate with the people developing those extensions insofar as they are
> interested. The first place that I checked was #sig-scheduling, which is
> fairly quiet on the Kubernetes Slack, but if there are more places to look
> for folks interested in batch scheduling 

Re: Spark on Kubernetes scheduler variety

2021-07-08 Thread Holden Karau
Hi Y'all,

We had an initial meeting which went well, and got some more context around
Volcano and its near-term roadmap. We talked about the impact of scheduler
deadlocking and some ways that we could potentially improve integration
from the Spark and Volcano sides respectively. I'm going to start
creating some sub-issues under
https://issues.apache.org/jira/browse/SPARK-36057

If anyone is interested in being in the next meeting, please reach out and
I'll send an e-mail around to try and schedule a recurring sync that works
for folks.

Cheers,

Holden

On Thu, Jun 24, 2021 at 8:56 AM Holden Karau  wrote:

> That's awesome, I'm just starting to get context around Volcano but maybe
> we can schedule an initial meeting for all of us interested in pursuing
> this to get on the same page.
>
> On Wed, Jun 23, 2021 at 6:54 PM Klaus Ma  wrote:
>
>> Hi team,
>>
>> I'm kube-batch/Volcano founder, and I'm excited to hear that the spark
>> community also has such requirements :)
>>
>> Volcano provides several features for batch workload, e.g. fair-share,
>> queue, reservation, preemption/reclaim and so on.
>> It has been used in several product environments with Spark; if
>> necessary, I can give an overall introduction about Volcano's features and
>> those use cases :)
>>
>> -- Klaus
>>
>> On Wed, Jun 23, 2021 at 11:26 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>>
>>>
>>> Please allow me to be diverse and express a different point of view on
>>> this roadmap.
>>>
>>>
>>> I believe from a technical point of view spending time and effort plus
>>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>>> may say I doubt whether such an approach and the so-called democratization
>>> of Spark on whatever platform really should be of great focus.
>>>
>>> Having worked on Google Dataproc (a fully
>>> managed and highly scalable service for running Apache Spark, Hadoop and
>>> more recently other artefacts) for the past two years, and Spark on
>>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>>> beast that one can fully commoditize, much as one can with
>>> Zookeeper, Kafka etc. There is always a struggle to make some niche areas
>>> of Spark like Spark Structured Streaming (SSS) work seamlessly and
>>> effortlessly on these commercial platforms with whatever as a Service.
>>>
>>>
>>> Moreover, Spark (and I stand corrected) from the ground up has already a
>>> lot of resiliency and redundancy built in. It is truly an enterprise class
>>> product (requires enterprise class support) that will be difficult to
>>> commoditize with Kubernetes and expect the same performance. After all,
>>> Kubernetes is aimed at efficient resource sharing and potential cost saving
>>> for the mass market. In short I can see commercial enterprises will work on
>>> these platforms, but maybe the great talents on the dev team should focus on
>>> stuff like the perceived limitation of SSS in dealing with chained
>>> aggregations (if I am correct, this is not yet supported on streaming datasets).
>>>
>>>
>>> These are my opinions and they are not facts, just opinions so to speak
>>> :)
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>>>
 I think these approaches are good, but there are limitations (eg
 dynamic scaling) without us making changes inside of the Spark Kube
 scheduler.

 Certainly whichever scheduler extensions we add support for we should
 collaborate with the people developing those extensions insofar as they are
 interested. The first place that I checked was #sig-scheduling, which is
 fairly quiet on the Kubernetes Slack, but if there are more places to look
 for folks interested in batch scheduling on Kubernetes we should definitely
 give it a shot :)

 On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Hi,
>
> Regarding your point and I quote
>
> "..  I know that one of the Spark on Kube operators
> supports volcano/kube-batch so I was thinking that might be a place I 
> would
> start exploring..."
>
> There seems to be ongoing work on say Volcano as part of  Cloud
> Native Computing Foundation  (CNCF). For example
> through https://github.com/volcano-sh/volcano
>
 
>
> There may be value-add in collaborating with such groups 

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Mich Talebzadeh
Hi Holden,

Thank you for your points. I guess, coming from the corporate world, I had
overlooked how an open source project like Spark leverages resources
and interest :).

As @KlausMa kindly volunteered, it would be good to hear scheduling ideas for
Spark on Kubernetes. As I am sure you have some inroads/ideas
on this subject as well, I guess love would truly be in the air for
Kubernetes.

HTH



   view my Linkedin profile








On Thu, 24 Jun 2021 at 16:59, Holden Karau  wrote:

> Hi Mich,
>
> I certainly think making Spark on Kubernetes run well is going to be a
> challenge. However, I think (and I could be wrong about this as well) that
> in terms of cluster managers Kubernetes is likely to be our future. Talking
> with people, I don't hear about new standalone, YARN or Mesos deployments of
> Spark, but I do hear about people trying to migrate to Kubernetes.
>
> To be clear I certainly agree that we need more work on structured
> streaming, but it's important to remember that the Spark developers are not
> all fully interchangeable. We work on the things that we're interested in
> pursuing, so even if structured streaming needs more love, if I'm not super
> interested in structured streaming I'm less likely to work on it. That
> being said I am certainly spinning up a bit more in the Spark SQL area
> especially around our data source/connectors because I can see the need
> there too.
>
> On Wed, Jun 23, 2021 at 8:26 AM Mich Talebzadeh 
> wrote:
>
>>
>>
>> Please allow me to be diverse and express a different point of view on
>> this roadmap.
>>
>>
>> I believe from a technical point of view spending time and effort plus
>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>> may say I doubt whether such an approach and the so-called democratization
>> of Spark on whatever platform really should be of great focus.
>>
>> Having worked on Google Dataproc (a fully
>> managed and highly scalable service for running Apache Spark, Hadoop and
>> more recently other artefacts) for the past two years, and Spark on
>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>> beast that one can fully commoditize, much as one can with
>> Zookeeper, Kafka etc. There is always a struggle to make some niche areas
>> of Spark like Spark Structured Streaming (SSS) work seamlessly and
>> effortlessly on these commercial platforms with whatever as a Service.
>>
>>
>> Moreover, Spark (and I stand corrected) from the ground up has already a
>> lot of resiliency and redundancy built in. It is truly an enterprise class
>> product (requires enterprise class support) that will be difficult to
>> commoditize with Kubernetes and expect the same performance. After all,
>> Kubernetes is aimed at efficient resource sharing and potential cost saving
>> for the mass market. In short I can see commercial enterprises will work on
>> these platforms, but maybe the great talents on the dev team should focus on
>> stuff like the perceived limitation of SSS in dealing with chained
>> aggregations (if I am correct, this is not yet supported on streaming datasets).
>>
>>
>> These are my opinions and they are not facts, just opinions so to speak :)
>>
>>
>>
>>
>>
>>
>> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>>
>>> I think these approaches are good, but there are limitations (eg dynamic
>>> scaling) without us making changes inside of the Spark Kube scheduler.
>>>
>>> Certainly whichever scheduler extensions we add support for we should
>>> collaborate with the people developing those extensions insofar as they are
>>> interested. The first place that I checked was #sig-scheduling, which is
>>> fairly quiet on the Kubernetes Slack, but if there are more places to look
>>> for folks interested in batch scheduling on Kubernetes we should definitely
>>> give it a shot :)
>>>
>>> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi,

 Regarding your point and I quote

 "..  I know that one of the Spark on 

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Holden Karau
That's awesome, I'm just starting to get context around Volcano but maybe
we can schedule an initial meeting for all of us interested in pursuing
this to get on the same page.

On Wed, Jun 23, 2021 at 6:54 PM Klaus Ma  wrote:

> Hi team,
>
> I'm kube-batch/Volcano founder, and I'm excited to hear that the spark
> community also has such requirements :)
>
> Volcano provides several features for batch workload, e.g. fair-share,
> queue, reservation, preemption/reclaim and so on.
> It has been used in several product environments with Spark; if necessary,
> I can give an overall introduction about Volcano's features and those use
> cases :)
>
> -- Klaus
>
> On Wed, Jun 23, 2021 at 11:26 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>>
>>
>> Please allow me to be diverse and express a different point of view on
>> this roadmap.
>>
>>
>> I believe from a technical point of view spending time and effort plus
>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>> may say I doubt whether such an approach and the so-called democratization
>> of Spark on whatever platform really should be of great focus.
>>
>> Having worked on Google Dataproc (a fully
>> managed and highly scalable service for running Apache Spark, Hadoop and
>> more recently other artefacts) for the past two years, and Spark on
>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>> beast that one can fully commoditize, much as one can with
>> Zookeeper, Kafka etc. There is always a struggle to make some niche areas
>> of Spark like Spark Structured Streaming (SSS) work seamlessly and
>> effortlessly on these commercial platforms with whatever as a Service.
>>
>>
>> Moreover, Spark (and I stand corrected) from the ground up has already a
>> lot of resiliency and redundancy built in. It is truly an enterprise class
>> product (requires enterprise class support) that will be difficult to
>> commoditize with Kubernetes and expect the same performance. After all,
>> Kubernetes is aimed at efficient resource sharing and potential cost saving
>> for the mass market. In short I can see commercial enterprises will work on
>> these platforms, but maybe the great talents on the dev team should focus on
>> stuff like the perceived limitation of SSS in dealing with chained
>> aggregations (if I am correct, this is not yet supported on streaming datasets).
>>
>>
>> These are my opinions and they are not facts, just opinions so to speak :)
>>
>>
>>
>>
>>
>>
>> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>>
>>> I think these approaches are good, but there are limitations (eg dynamic
>>> scaling) without us making changes inside of the Spark Kube scheduler.
>>>
>>> Certainly whichever scheduler extensions we add support for we should
>>> collaborate with the people developing those extensions insofar as they are
>>> interested. The first place that I checked was #sig-scheduling, which is
>>> fairly quiet on the Kubernetes Slack, but if there are more places to look
>>> for folks interested in batch scheduling on Kubernetes we should definitely
>>> give it a shot :)
>>>
>>> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi,

 Regarding your point and I quote

 "..  I know that one of the Spark on Kube operators
 supports volcano/kube-batch so I was thinking that might be a place I would
 start exploring..."

 There seems to be ongoing work on say Volcano as part of  Cloud Native
 Computing Foundation  (CNCF). For example through
 https://github.com/volcano-sh/volcano

>>> 

 There may be value-add in collaborating with such groups through CNCF
 in order to have a collective approach to such work. There also seems to be
 some work on Integration of Spark with Volcano for Batch Scheduling.
 



 What is not very clear is the degree of progress of these projects. You
 may be kind enough to elaborate on KPI for each of these projects and where
 you think your contributions are going to be.


 HTH,


 Mich



Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Lalwani, Jayesh
You can always chain aggregations by chaining multiple Structured Streaming 
jobs. It’s not a showstopper.
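
As a rough illustration of the chaining idea (a toy model in plain Python, not Spark API code): each function below stands in for one Structured Streaming job that performs a single aggregation, and the list passed between them stands in for an intermediate sink such as a Kafka topic or a file directory. The function names, the event tuples and the one-minute bucketing are all assumptions made up for this sketch.

```python
from collections import defaultdict


def job_one(events):
    """Hypothetical first job: count events per (minute, user)."""
    counts = defaultdict(int)
    for ts_seconds, user in events:
        counts[(ts_seconds // 60, user)] += 1
    # Emit one row per group, as the job would write to its sink.
    return [
        {"minute": minute, "user": user, "events": n}
        for (minute, user), n in sorted(counts.items())
    ]


def job_two(rows):
    """Hypothetical second job: aggregate the first job's output again,
    keeping the highest per-user event count seen in each minute."""
    busiest = {}
    for row in rows:
        minute = row["minute"]
        busiest[minute] = max(busiest.get(minute, 0), row["events"])
    return busiest


# (timestamp in seconds, user) pairs; made-up sample data.
events = [(0, "a"), (10, "a"), (20, "b"), (65, "b"), (70, "b"), (80, "b"), (90, "a")]

intermediate = job_one(events)    # stands in for the intermediate sink
result = job_two(intermediate)    # second aggregation over the first one's output
print(result)  # {0: 2, 1: 3}
```

In a real deployment the two stages would be separate streaming queries, with the first writing to a durable sink that the second reads as its source; the price is extra latency and storage for the intermediate data.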

Getting Spark on Kubernetes is important for organizations that want to pursue
a multi-cloud strategy.

From: Mich Talebzadeh 
Date: Wednesday, June 23, 2021 at 11:27 AM
To: "user @spark" 
Cc: dev 
Subject: RE: [EXTERNAL] Spark on Kubernetes scheduler variety






Please allow me to be diverse and express a different point of view on this 
roadmap.

I believe from a technical point of view spending time and effort plus talent 
on batch scheduling on Kubernetes could be rewarding. However, if I may say I 
doubt whether such an approach and the so-called democratization of Spark on 
whatever platform really should be of great focus.

Having worked on Google Dataproc<https://cloud.google.com/dataproc> (a fully
managed and highly scalable service for running Apache Spark, Hadoop and more
recently other artefacts) for the past two years, and Spark on Kubernetes
on-premise, I have come to the conclusion that Spark is not a beast that
one can fully commoditize, much as one can with Zookeeper, Kafka etc.
There is always a struggle to make some niche areas of Spark like Spark 
Structured Streaming (SSS) work seamlessly and effortlessly on these commercial 
platforms with whatever as a Service.

Moreover, Spark (and I stand corrected) from the ground up has already a lot of 
resiliency and redundancy built in. It is truly an enterprise class product 
(requires enterprise class support) that will be difficult to commoditize with 
Kubernetes and expect the same performance. After all, Kubernetes is aimed at 
efficient resource sharing and potential cost saving for the mass market. In 
short I can see commercial enterprises will work on these platforms, but maybe
the great talents on the dev team should focus on stuff like the perceived
limitation of SSS in dealing with chained aggregations (if I am correct, this is
not yet supported on streaming datasets).

These are my opinions and they are not facts, just opinions so to speak :)

view my Linkedin profile<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>







On Fri, 18 Jun 2021 at 23:18, Holden Karau <hol...@pigscanfly.ca> wrote:
I think these approaches are good, but there are limitations (eg dynamic 
scaling) without us making changes inside of the Spark Kube scheduler.

Certainly whichever scheduler extensions we add support for we should 
collaborate with the people developing those extensions insofar as they are 
interested. The first place that I checked was #sig-scheduling, which is fairly
quiet on the Kubernetes Slack, but if there are more places to look for folks
interested in batch scheduling on Kubernetes we should definitely give it a 
shot :)

On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
Hi,

Regarding your point and I quote

"..  I know that one of the Spark on Kube operators supports volcano/kube-batch 
so I was thinking that might be a place I would start exploring..."

There seems to be ongoing work on say Volcano as part of  Cloud Native 
Computing Foundation<https://cncf.io/> (CNCF). For example through 
https://github.com/volcano-sh/volcano

There may be value-add in collaborating with such groups through CNCF in order 
to have a collective approach to such work. There also seems to be some work on 
Integration of Spark with Volcano for Batch 
Scheduling.<https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/volcano-integration.md>



What is not very clear is the degree of progress of these projects. You may be 
kind enough to elaborate on KPI for each of these projects and where you think 
your contributions are going to be.



HTH,



Mich






On Fri, 18 Jun 2021 at 00:44, Holden Karau <hol...@pigscanfly.ca> wrote:
Hi Folks,

I'm continuing my adventures to make Spark on containers party a

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread John Zhuge
Thanks Klaus! I am interested in more details.

On Wed, Jun 23, 2021 at 6:54 PM Klaus Ma  wrote:

> Hi team,
>
> I'm kube-batch/Volcano founder, and I'm excited to hear that the spark
> community also has such requirements :)
>
> Volcano provides several features for batch workload, e.g. fair-share,
> queue, reservation, preemption/reclaim and so on.
> It has been used in several product environments with Spark; if necessary,
> I can give an overall introduction about Volcano's features and those use
> cases :)
>
> -- Klaus
>
> On Wed, Jun 23, 2021 at 11:26 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>>
>>
>> Please allow me to be diverse and express a different point of view on
>> this roadmap.
>>
>>
>> I believe from a technical point of view spending time and effort plus
>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>> may say I doubt whether such an approach and the so-called democratization
>> of Spark on whatever platform really should be of great focus.
>>
>> Having worked on Google Dataproc (a fully
>> managed and highly scalable service for running Apache Spark, Hadoop and
>> more recently other artefacts) for the past two years, and Spark on
>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>> beast that one can fully commoditize, much as one can with
>> Zookeeper, Kafka etc. There is always a struggle to make some niche areas
>> of Spark like Spark Structured Streaming (SSS) work seamlessly and
>> effortlessly on these commercial platforms with whatever as a Service.
>>
>>
>> Moreover, Spark (and I stand corrected) from the ground up has already a
>> lot of resiliency and redundancy built in. It is truly an enterprise class
>> product (requires enterprise class support) that will be difficult to
>> commoditize with Kubernetes and expect the same performance. After all,
>> Kubernetes is aimed at efficient resource sharing and potential cost saving
>> for the mass market. In short I can see commercial enterprises will work on
>> these platforms, but maybe the great talents on the dev team should focus on
>> stuff like the perceived limitation of SSS in dealing with chained
>> aggregations (if I am correct, this is not yet supported on streaming datasets).
>>
>>
>> These are my opinions and they are not facts, just opinions so to speak :)
>>
>>
>>
>>
>>
>>
>> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>>
>>> I think these approaches are good, but there are limitations (eg dynamic
>>> scaling) without us making changes inside of the Spark Kube scheduler.
>>>
>>> Certainly whichever scheduler extensions we add support for we should
>>> collaborate with the people developing those extensions insofar as they are
>>> interested. The first place that I checked was #sig-scheduling, which is
>>> fairly quiet on the Kubernetes Slack, but if there are more places to look
>>> for folks interested in batch scheduling on Kubernetes we should definitely
>>> give it a shot :)
>>>
>>> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi,

 Regarding your point and I quote

 "..  I know that one of the Spark on Kube operators
 supports volcano/kube-batch so I was thinking that might be a place I would
 start exploring..."

 There seems to be ongoing work on say Volcano as part of  Cloud Native
 Computing Foundation  (CNCF). For example through
 https://github.com/volcano-sh/volcano

>>> 

 There may be value-add in collaborating with such groups through CNCF
 in order to have a collective approach to such work. There also seems to be
 some work on Integration of Spark with Volcano for Batch Scheduling.
 



 What is not very clear is the degree of progress of these projects. You
 may be kind enough to elaborate on KPI for each of these projects and where
 you think your contributions are going to be.


 HTH,


 Mich



Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Mich Talebzadeh
Thanks Klaus. That will be great.

It will also be helpful if you elaborate on the need for this feature in
line with the limitations of the current batch workload.

Regards,

Mich



   view my Linkedin profile








On Thu, 24 Jun 2021 at 02:53, Klaus Ma  wrote:

> Hi team,
>
> I'm kube-batch/Volcano founder, and I'm excited to hear that the spark
> community also has such requirements :)
>
> Volcano provides several features for batch workload, e.g. fair-share,
> queue, reservation, preemption/reclaim and so on.
> It has been used in several product environments with Spark; if necessary,
> I can give an overall introduction about Volcano's features and those use
> cases :)
>
> -- Klaus
>
> On Wed, Jun 23, 2021 at 11:26 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>>
>>
>> Please allow me to be diverse and express a different point of view on
>> this roadmap.
>>
>>
>> I believe from a technical point of view spending time and effort plus
>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>> may say I doubt whether such an approach and the so-called democratization
>> of Spark on whatever platform really should be of great focus.
>>
>> Having worked on Google Dataproc (a fully
>> managed and highly scalable service for running Apache Spark, Hadoop and
>> more recently other artefacts) for the past two years, and Spark on
>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>> beast that one can fully commoditize, much as one can with
>> Zookeeper, Kafka etc. There is always a struggle to make some niche areas
>> of Spark like Spark Structured Streaming (SSS) work seamlessly and
>> effortlessly on these commercial platforms with whatever as a Service.
>>
>>
>> Moreover, Spark (and I stand corrected) from the ground up has already a
>> lot of resiliency and redundancy built in. It is truly an enterprise class
>> product (requires enterprise class support) that will be difficult to
>> commoditize with Kubernetes and expect the same performance. After all,
>> Kubernetes is aimed at efficient resource sharing and potential cost saving
>> for the mass market. In short I can see commercial enterprises will work on
>> these platforms, but maybe the great talents on the dev team should focus on
>> stuff like the perceived limitation of SSS in dealing with chained
>> aggregations (if I am correct, this is not yet supported on streaming datasets).
>>
>>
>> These are my opinions and they are not facts, just opinions so to speak :)
>>
>>
>>
>>
>>
>>
>> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>>
>>> I think these approaches are good, but there are limitations (e.g. dynamic
>>> scaling) without us making changes inside of the Spark Kube scheduler.
>>>
>>> Certainly whichever scheduler extensions we add support for we should
>>> collaborate with the people developing those extensions insofar as they are
>>> interested. The first place I checked was #sig-scheduling, which is
>>> fairly quiet on the Kubernetes Slack, but if there are more places to look
>>> for folks interested in batch scheduling on Kubernetes we should definitely
>>> give it a shot :)
>>>
>>> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Regarding your point and I quote
>>>>
>>>> "..  I know that one of the Spark on Kube operators
>>>> supports volcano/kube-batch so I was thinking that might be a place I would
>>>> start exploring..."
>>>>
>>>> There seems to be ongoing work on, say, Volcano as part of the Cloud Native
>>>> Computing Foundation (CNCF). For example through
>>>> https://github.com/volcano-sh/volcano
>>>>
>>>> There may be value-add in collaborating with such groups through CNCF
>>>> in order to have a collective approach to such work. There also seems to be
>>>> some work on Integration of Spark with Volcano for Batch Scheduling.
>>>>
>>>> What is not very

Re: Spark on Kubernetes scheduler variety

2021-06-23 Thread Klaus Ma
Hi team,

I'm the kube-batch/Volcano founder, and I'm excited to hear that the Spark
community also has such requirements :)

Volcano provides several features for batch workloads, e.g. fair-share,
queue, reservation, preemption/reclaim and so on.
It has been used in several production environments with Spark; if necessary,
I can give an overview of Volcano's features and those use
cases :)

-- Klaus

On Wed, Jun 23, 2021 at 11:26 PM Mich Talebzadeh 
wrote:

>
>
> Please allow me to be diverse and express a different point of view on
> this roadmap.
>
>
> I believe from a technical point of view spending time and effort plus
> talent on batch scheduling on Kubernetes could be rewarding. However, if I
> may say, I doubt whether such an approach and the so-called democratization
> of Spark on whatever platform really should be of great focus.
>
> Having worked on Google Dataproc (a fully managed and highly scalable
> service for running Apache Spark, Hadoop and, more recently, other
> artefacts) for the past two years, and on Spark on Kubernetes on-premise,
> I have come to the conclusion that Spark is not a beast that one can fully
> commoditize in the way one can with Zookeeper, Kafka etc. There is always a
> struggle to make some niche areas of Spark, like Spark Structured Streaming
> (SSS), work seamlessly and effortlessly on these commercial platforms with
> whatever as a Service.
>
>
> Moreover, Spark (and I stand corrected) already has a lot of resiliency and
> redundancy built in from the ground up. It is truly an enterprise-class
> product (requiring enterprise-class support) that will be difficult to
> commoditize with Kubernetes while expecting the same performance. After all,
> Kubernetes is aimed at efficient resource sharing and potential cost saving
> for the mass market. In short, I can see that commercial enterprises will
> work on these platforms, but maybe the great talent on the dev team should
> focus on stuff like the perceived limitation of SSS in dealing with chains
> of aggregation (if I am correct, this is not yet supported on streaming
> datasets).
>
>
> These are my opinions and they are not facts, just opinions so to speak :)
>
>
> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>
>> I think these approaches are good, but there are limitations (e.g. dynamic
>> scaling) without us making changes inside of the Spark Kube scheduler.
>>
>> Certainly whichever scheduler extensions we add support for we should
>> collaborate with the people developing those extensions insofar as they are
>> interested. The first place I checked was #sig-scheduling, which is
>> fairly quiet on the Kubernetes Slack, but if there are more places to look
>> for folks interested in batch scheduling on Kubernetes we should definitely
>> give it a shot :)
>>
>> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Regarding your point and I quote
>>>
>>> "..  I know that one of the Spark on Kube operators
>>> supports volcano/kube-batch so I was thinking that might be a place I would
>>> start exploring..."
>>>
>>> There seems to be ongoing work on, say, Volcano as part of the Cloud Native
>>> Computing Foundation (CNCF). For example through
>>> https://github.com/volcano-sh/volcano
>>>
>>> There may be value-add in collaborating with such groups through CNCF in
>>> order to have a collective approach to such work. There also seems to be
>>> some work on Integration of Spark with Volcano for Batch Scheduling.
>>>
>>> What is not very clear is the degree of progress of these projects.
>>> Could you kindly elaborate on the KPIs for each of these projects and where
>>> you think your contributions are going to be?
>>>
>>>
>>> HTH,
>>>
>>>
>>> Mich
>>>
>>>
>>> On Fri, 18 Jun 2021 at 00:44, Holden Karau  wrote:
>>>
>>>> Hi Folks,
>>>>
>>>> I'm continuing

Re: Spark on Kubernetes scheduler variety

2021-06-23 Thread Mich Talebzadeh
Please allow me to be diverse and express a different point of view on
this roadmap.


I believe from a technical point of view spending time and effort plus
talent on batch scheduling on Kubernetes could be rewarding. However, if I
may say, I doubt whether such an approach and the so-called democratization
of Spark on whatever platform really should be of great focus.

Having worked on Google Dataproc (a fully managed and highly scalable
service for running Apache Spark, Hadoop and, more recently, other
artefacts) for the past two years, and on Spark on Kubernetes on-premise,
I have come to the conclusion that Spark is not a beast that one can fully
commoditize in the way one can with Zookeeper, Kafka etc. There is always a
struggle to make some niche areas of Spark, like Spark Structured Streaming
(SSS), work seamlessly and effortlessly on these commercial platforms with
whatever as a Service.


Moreover, Spark (and I stand corrected) already has a lot of resiliency and
redundancy built in from the ground up. It is truly an enterprise-class
product (requiring enterprise-class support) that will be difficult to
commoditize with Kubernetes while expecting the same performance. After all,
Kubernetes is aimed at efficient resource sharing and potential cost saving
for the mass market. In short, I can see that commercial enterprises will
work on these platforms, but maybe the great talent on the dev team should
focus on stuff like the perceived limitation of SSS in dealing with chains
of aggregation (if I am correct, this is not yet supported on streaming
datasets).


These are my opinions and they are not facts, just opinions so to speak :)


On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:

> I think these approaches are good, but there are limitations (e.g. dynamic
> scaling) without us making changes inside of the Spark Kube scheduler.
>
> Certainly whichever scheduler extensions we add support for we should
> collaborate with the people developing those extensions insofar as they are
> interested. The first place I checked was #sig-scheduling, which is
> fairly quiet on the Kubernetes Slack, but if there are more places to look
> for folks interested in batch scheduling on Kubernetes we should definitely
> give it a shot :)
>
> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> Regarding your point and I quote
>>
>> "..  I know that one of the Spark on Kube operators
>> supports volcano/kube-batch so I was thinking that might be a place I would
>> start exploring..."
>>
>> There seems to be ongoing work on, say, Volcano as part of the Cloud Native
>> Computing Foundation (CNCF). For example through
>> https://github.com/volcano-sh/volcano
>>
>> There may be value-add in collaborating with such groups through CNCF in
>> order to have a collective approach to such work. There also seems to be
>> some work on Integration of Spark with Volcano for Batch Scheduling.
>>
>> What is not very clear is the degree of progress of these projects.
>> Could you kindly elaborate on the KPIs for each of these projects and where
>> you think your contributions are going to be?
>>
>>
>> HTH,
>>
>>
>> Mich
>>
>>
>> On Fri, 18 Jun 2021 at 00:44, Holden Karau  wrote:
>>
>>> Hi Folks,
>>>
>>> I'm continuing my adventures to make Spark on containers party and I
>>> was wondering if folks have experience with the different batch
>>> scheduler options that they prefer? I was thinking that, so that we can
>>> better support dynamic allocation, it might make sense for us to
>>> support using different schedulers, and I wanted to see if there are
>>> any that the community is more interested in?
>>>
>>> I know that one of the Spark on Kube operators supports
>>> volcano/kube-batch so I was thinking that might be a place I start
>>> exploring but also want to be open to other schedulers that folks
>>> might be interested in.
>>>
>>> Cheers,
>>>
>>> Holden :)