Re: A scene with unstable Spark performance

2022-05-17 Thread Sungwoo Park
The problem you describe is the motivation for developing Spark on MR3.
From the blog article (https://www.datamonad.com/post/2021-08-18-spark-mr3/):

*The main motivation for developing Spark on MR3 is to allow multiple Spark
applications to share compute resources such as Yarn containers or
Kubernetes Pods.*

The problem is due to an architectural limitation of Spark, and I guess
fixing the problem would require a heavy rewrite of Spark core. When we
developed Spark on MR3, we were not aware of any attempt being made
elsewhere (in academia and industry) to address this limitation.

A potential workaround might be to implement a custom Spark application
that manages the submission of two groups of Spark jobs and controls their
execution (similarly to Spark Thrift Server). Not sure if this approach
would fix your problem, though.
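
As a rough illustration of the workaround above (not part of Spark or MR3), a
driver-side dispatcher could cap how many jobs from each group run at once,
for example with one bounded thread pool per group. The sketch below is plain
Python; `GroupDispatcher`, the group names, and the slot counts are all
hypothetical, and each submitted callable stands in for code that would
trigger a Spark job. It limits concurrent jobs per group rather than executor
cores, so it only approximates a resource reservation.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict


class GroupDispatcher:
    """Runs jobs from each group on that group's own bounded thread pool."""

    def __init__(self, slots_per_group: Dict[str, int]) -> None:
        # One pool per group; max_workers caps concurrent jobs in that group.
        self._pools = {
            group: ThreadPoolExecutor(max_workers=slots, thread_name_prefix=group)
            for group, slots in slots_per_group.items()
        }

    def submit(self, group: str, job: Callable[[], object]) -> Future:
        # In a real driver, `job` would run a Spark action on a shared session
        # (an assumption; here it is just any Python callable).
        return self._pools[group].submit(job)

    def shutdown(self) -> None:
        for pool in self._pools.values():
            pool.shutdown(wait=True)


# Hypothetical limits: at most 4 fast jobs and 2 slow jobs in flight.
dispatcher = GroupDispatcher({"fast": 4, "slow": 2})
fast_results = [dispatcher.submit("fast", lambda i=i: i * i) for i in range(5)]
slow_result = dispatcher.submit("slow", lambda: sum(range(1000)))
squares = [f.result() for f in fast_results]
total = slow_result.result()
dispatcher.shutdown()
```

Because slow jobs can never occupy the fast group's pool, a burst of slow
submissions queues behind its own limit instead of delaying fast ones at the
dispatch level.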

If you are interested, see the webpage of Spark on MR3:
https://mr3docs.datamonad.com/docs/spark/

We have released Spark 3.0.1 on MR3, and Spark 3.2.1 on MR3 is under
development. For Spark 3.0.1 on MR3, no change is made to Spark and MR3 is
used as an add-on. The main application of MR3 is Hive on MR3, but Spark on
MR3 is equally ready for production.

Thank you,

--- Sungwoo



Re: A scene with unstable Spark performance

2022-05-17 Thread Bowen Song
Hi,

Spark dynamic resource allocation cannot solve my problem, because the
resources of the production environment are limited. Under that constraint,
what I want is to reserve resources so that tasks from the different job
groups can be scheduled in time.

Thank you,
Bowen Song


From: Qian SUN 
Sent: Wednesday, May 18, 2022 9:32
To: Bowen Song 
Cc: user.spark 
Subject: Re: A scene with unstable Spark performance

Hi. I think you need Spark dynamic resource allocation. Please refer to
https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation.
And if you use Spark SQL, AQE may help:
https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution

Bowen Song <bowen.s...@kyligence.io> wrote on Tue, May 17, 2022 at 22:33:

Hi all,

I find that Spark performance is unstable in the following scenario. We
divided our jobs into two groups by completion time: jobs in one group
finish in less than 10s, and jobs in the other take from 10s to 300s. The
difference arises because the slower jobs scan more files and therefore run
more tasks. When both groups were submitted to Spark together, resource
competition from the slower jobs made the originally fast jobs take longer
to return results, which showed up as unstable Spark performance. The
problem I want to solve is: can we reserve certain resources for each of the
two groups, so that the fast jobs are scheduled in time and the slow jobs
are not starved because all resources have been allocated to the fast jobs?

In this context, I need to group Spark jobs so that tasks from each group
are scheduled on that group's reserved resources. At the beginning of each
scheduling round, a group's own tasks are scheduled first; only when the
group has no tasks left to schedule can its resources be allocated to other
groups, to avoid leaving resources idle.

For the sake of resource utilization and to avoid the overhead of managing
multiple clusters, I hope the jobs can share one Spark cluster rather than
each group getting a private cluster.

I've read the code of the Spark Fair Scheduler, and the implementation
doesn't seem to meet the need to reserve resources for different groups of
jobs.

Is there a workaround that solves this problem with the Spark Fair
Scheduler? If not, would you consider adding a mechanism like capacity
scheduling?

Thank you,
Bowen Song


--
Best!
Qian SUN


Re: A scene with unstable Spark performance

2022-05-17 Thread Qian SUN
Hi. I think you need Spark dynamic resource allocation. Please refer to
https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation.
And if you use Spark SQL, AQE may help:
https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution
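
For reference, the two features mentioned above are switched on through
Spark configuration properties. The sketch below only assembles
`spark-submit` arguments in plain Python; the property names are standard
Spark settings, while the executor counts and the `app.py` file name are
illustrative assumptions, not recommendations. Shuffle tracking is included
as one way to meet dynamic allocation's requirement when no external shuffle
service is available.

```python
from typing import Dict, List

# Illustrative values only; tune min/max executors for the actual cluster.
settings: Dict[str, str] = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "20",
    "spark.sql.adaptive.enabled": "true",  # AQE for Spark SQL
}


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Turn a property map into repeated `--conf key=value` arguments."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_submit_args(settings)
# `app.py` is a placeholder for the application being submitted.
print(" ".join(["spark-submit"] + submit_args + ["app.py"]))
```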

Bowen Song wrote on Tue, May 17, 2022 at 22:33:

> Hi all,
>
> I find that Spark performance is unstable in the following scenario. We
> divided our jobs into two groups by completion time: jobs in one group
> finish in less than 10s, and jobs in the other take from 10s to 300s. The
> difference arises because the slower jobs scan more files and therefore
> run more tasks. When both groups were submitted to Spark together,
> resource competition from the slower jobs made the originally fast jobs
> take longer to return results, which showed up as unstable Spark
> performance. The problem I want to solve is: can we reserve certain
> resources for each of the two groups, so that the fast jobs are scheduled
> in time and the slow jobs are not starved because all resources have been
> allocated to the fast jobs?
>
> In this context, I need to group Spark jobs so that tasks from each group
> are scheduled on that group's reserved resources. At the beginning of each
> scheduling round, a group's own tasks are scheduled first; only when the
> group has no tasks left to schedule can its resources be allocated to
> other groups, to avoid leaving resources idle.
>
> For the sake of resource utilization and to avoid the overhead of managing
> multiple clusters, I hope the jobs can share one Spark cluster rather than
> each group getting a private cluster.
>
> I've read the code of the Spark Fair Scheduler, and the implementation
> doesn't seem to meet the need to reserve resources for different groups of
> jobs.
>
> Is there a workaround that solves this problem with the Spark Fair
> Scheduler? If not, would you consider adding a mechanism like capacity
> scheduling?
>
> Thank you,
> Bowen Song
>


-- 
Best!
Qian SUN


Re: [Spark SQL]: Does Spark SQL support WAITFOR?

2022-05-17 Thread Sean Owen
I don't think that is standard SQL. What are you trying to do, and why not
do it outside SQL?
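
One way to read "outside SQL" is to put the delay in the client code that
issues the statement rather than in the statement itself. The sketch below is
plain Python with no Spark dependency; `delayed`, `run_with_timeout`, and
`fake_query` are hypothetical names, and `fake_query` stands in for whatever
call actually runs the SQL in the tool under test.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError
from typing import Callable, Optional


def delayed(query: Callable[[], object], delay_seconds: float) -> Callable[[], object]:
    """Wrap a query so it sleeps before running, simulating a slow statement."""
    def run() -> object:
        time.sleep(delay_seconds)
        return query()
    return run


def run_with_timeout(query: Callable[[], object], timeout_seconds: float) -> Optional[object]:
    """Return the query result, or None if it does not finish in time."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(query).result(timeout=timeout_seconds)
    except TimeoutError:
        return None
    finally:
        # Do not block on the still-sleeping worker thread.
        pool.shutdown(wait=False)


def fake_query() -> str:
    # Placeholder for the real call that executes a SQL statement.
    return "row"


fast = run_with_timeout(delayed(fake_query, 0.01), timeout_seconds=1.0)
slow = run_with_timeout(delayed(fake_query, 0.5), timeout_seconds=0.05)
```

Here `fast` completes within its limit, while `slow` exceeds it and is
reported as a timeout, which is the behavior a timeout test would want to
provoke.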

On Tue, May 17, 2022 at 6:03 PM K. N. Ramachandran 
wrote:

> Gentle ping. Any info here would be great.
>
> Regards,
> Ram
>
> On Sun, May 15, 2022 at 5:16 PM K. N. Ramachandran 
> wrote:
>
>> Hello Spark Users Group,
>>
>> I've just recently started working on tools that use Apache Spark.
>> When I try WAITFOR in the spark-sql command line, I just get:
>>
>> Error: Error running query:
>> org.apache.spark.sql.catalyst.parser.ParseException:
>>
>> mismatched input 'WAITFOR' expecting (.. list of allowed commands..)
>>
>>
>> 1) Why is WAITFOR not allowed? Is there another way to get a process to
>> sleep for a desired period of time? I'm trying to test a timeout issue and
>> need to simulate a sleep behavior.
>>
>>
>> 2) Is there documentation that outlines why WAITFOR is not supported? I
>> did not find any good matches searching online.
>>
>> Thanks,
>> Ram
>>
>
>
> --
> K.N.Ramachandran
> Ph: 814-441-4279
>


Re: [Spark SQL]: Does Spark SQL support WAITFOR?

2022-05-17 Thread K. N. Ramachandran
Gentle ping. Any info here would be great.

Regards,
Ram

On Sun, May 15, 2022 at 5:16 PM K. N. Ramachandran 
wrote:

> Hello Spark Users Group,
>
> I've just recently started working on tools that use Apache Spark.
> When I try WAITFOR in the spark-sql command line, I just get:
>
> Error: Error running query:
> org.apache.spark.sql.catalyst.parser.ParseException:
>
> mismatched input 'WAITFOR' expecting (.. list of allowed commands..)
>
>
> 1) Why is WAITFOR not allowed? Is there another way to get a process to
> sleep for a desired period of time? I'm trying to test a timeout issue and
> need to simulate a sleep behavior.
>
>
> 2) Is there documentation that outlines why WAITFOR is not supported? I
> did not find any good matches searching online.
>
> Thanks,
> Ram
>


-- 
K.N.Ramachandran
Ph: 814-441-4279


Re: Reverse proxy for Spark UI on Kubernetes

2022-05-17 Thread bo yang
Yes, it should be possible. Any interest in working on this together? We
need more hands to add more features here :)

On Tue, May 17, 2022 at 2:06 PM Holden Karau  wrote:

> Could we make it do the same sort of history server fallback approach?
>
> On Tue, May 17, 2022 at 10:41 PM bo yang  wrote:
>
>> It is like Web Application Proxy in YARN (
>> https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html),
>> to provide easy access for Spark UI when the Spark application is running.
>>
>> When running Spark on Kubernetes with S3, there is no YARN. The reverse
>> proxy here is to behave like that Web Application Proxy. It will
>> simplify settings to access Spark UI on Kubernetes.
>>
>>
>> On Mon, May 16, 2022 at 11:46 PM wilson  wrote:
>>
>>> what's the advantage of using reverse proxy for spark UI?
>>>
>>> Thanks
>>>
>>> On Tue, May 17, 2022 at 1:47 PM bo yang  wrote:
>>>
>>>> Hi Spark Folks,
>>>>
>>>> I built a web reverse proxy to access Spark UI on Kubernetes (working
>>>> together with
>>>> https://github.com/GoogleCloudPlatform/spark-on-k8s-operator). Want to
>>>> share here in case other people have similar need.
>>>>
>>>> The reverse proxy code is here:
>>>> https://github.com/datapunchorg/spark-ui-reverse-proxy
>>>>
>>>> Let me know if anyone wants to use or would like to contribute.
>>>>
>>>> Thanks,
>>>> Bo
>>>>
>>>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: Reverse proxy for Spark UI on Kubernetes

2022-05-17 Thread Holden Karau
Could we make it do the same sort of history server fallback approach?

On Tue, May 17, 2022 at 10:41 PM bo yang  wrote:

> It is like Web Application Proxy in YARN (
> https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html),
> to provide easy access for Spark UI when the Spark application is running.
>
> When running Spark on Kubernetes with S3, there is no YARN. The reverse
> proxy here is to behave like that Web Application Proxy. It will
> simplify settings to access Spark UI on Kubernetes.
>
>
> On Mon, May 16, 2022 at 11:46 PM wilson  wrote:
>
>> what's the advantage of using reverse proxy for spark UI?
>>
>> Thanks
>>
>> On Tue, May 17, 2022 at 1:47 PM bo yang  wrote:
>>
>>> Hi Spark Folks,
>>>
>>> I built a web reverse proxy to access Spark UI on Kubernetes (working
>>> together with
>>> https://github.com/GoogleCloudPlatform/spark-on-k8s-operator). Want to
>>> share here in case other people have similar need.
>>>
>>> The reverse proxy code is here:
>>> https://github.com/datapunchorg/spark-ui-reverse-proxy
>>>
>>> Let me know if anyone wants to use or would like to contribute.
>>>
>>> Thanks,
>>> Bo
>>>
>>> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Reverse proxy for Spark UI on Kubernetes

2022-05-17 Thread bo yang
It is like the Web Application Proxy in YARN (
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html):
it provides easy access to the Spark UI while the Spark application is
running.

When running Spark on Kubernetes with S3, there is no YARN. The reverse
proxy here is meant to behave like that Web Application Proxy, and it
simplifies the setup needed to access the Spark UI on Kubernetes.
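
To make the idea concrete, here is a generic sketch of the kind of path
mapping a UI reverse proxy performs: one public entry point, with the
application name in the path selecting the in-cluster Spark UI to forward to.
This is not taken from spark-ui-reverse-proxy; the `/sparkui/` prefix, the
`upstream_url` function, and the `{app}-ui-svc:4040` service naming are
assumptions made for illustration only.

```python
from typing import Optional


def upstream_url(
    path: str,
    prefix: str = "/sparkui/",  # assumed public path prefix
    service_template: str = "http://{app}-ui-svc:4040",  # assumed service naming
) -> Optional[str]:
    """Map an incoming proxy path to the Spark UI URL of one application."""
    if not path.startswith(prefix):
        return None
    # The first path segment after the prefix names the application.
    app, _, rest = path[len(prefix):].partition("/")
    if not app:
        return None
    return service_template.format(app=app) + "/" + rest


jobs_page = upstream_url("/sparkui/my-app/jobs/")
root_page = upstream_url("/sparkui/my-app")
unrelated = upstream_url("/healthz")
```

With a mapping like this, users need only the proxy's address and the
application name, instead of a port-forward or a separate ingress per
application.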


On Mon, May 16, 2022 at 11:46 PM wilson  wrote:

> what's the advantage of using reverse proxy for spark UI?
>
> Thanks
>
> On Tue, May 17, 2022 at 1:47 PM bo yang  wrote:
>
>> Hi Spark Folks,
>>
>> I built a web reverse proxy to access Spark UI on Kubernetes (working
>> together with
>> https://github.com/GoogleCloudPlatform/spark-on-k8s-operator). Want to
>> share here in case other people have similar need.
>>
>> The reverse proxy code is here:
>> https://github.com/datapunchorg/spark-ui-reverse-proxy
>>
>> Let me know if anyone wants to use or would like to contribute.
>>
>> Thanks,
>> Bo
>>
>>


Re: Reverse proxy for Spark UI on Kubernetes

2022-05-17 Thread bo yang
Thanks Holden :)

On Mon, May 16, 2022 at 11:12 PM Holden Karau  wrote:

> Oh that’s rad 
>
> On Tue, May 17, 2022 at 7:47 AM bo yang  wrote:
>
>> Hi Spark Folks,
>>
>> I built a web reverse proxy to access Spark UI on Kubernetes (working
>> together with
>> https://github.com/GoogleCloudPlatform/spark-on-k8s-operator). Want to
>> share here in case other people have similar need.
>>
>> The reverse proxy code is here:
>> https://github.com/datapunchorg/spark-ui-reverse-proxy
>>
>> Let me know if anyone wants to use or would like to contribute.
>>
>> Thanks,
>> Bo
>>
>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


A scene with unstable Spark performance

2022-05-17 Thread Bowen Song
Hi all,

I find that Spark performance is unstable in the following scenario. We
divided our jobs into two groups by completion time: jobs in one group
finish in less than 10s, and jobs in the other take from 10s to 300s. The
difference arises because the slower jobs scan more files and therefore run
more tasks. When both groups were submitted to Spark together, resource
competition from the slower jobs made the originally fast jobs take longer
to return results, which showed up as unstable Spark performance. The
problem I want to solve is: can we reserve certain resources for each of the
two groups, so that the fast jobs are scheduled in time and the slow jobs
are not starved because all resources have been allocated to the fast jobs?

In this context, I need to group Spark jobs so that tasks from each group
are scheduled on that group's reserved resources. At the beginning of each
scheduling round, a group's own tasks are scheduled first; only when the
group has no tasks left to schedule can its resources be allocated to other
groups, to avoid leaving resources idle.

For the sake of resource utilization and to avoid the overhead of managing
multiple clusters, I hope the jobs can share one Spark cluster rather than
each group getting a private cluster.

I've read the code of the Spark Fair Scheduler, and the implementation
doesn't seem to meet the need to reserve resources for different groups of
jobs.

Is there a workaround that solves this problem with the Spark Fair
Scheduler? If not, would you consider adding a mechanism like capacity
scheduling?

Thank you,
Bowen Song
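
The policy requested above (each group uses its reserved slots first, and
idle reserved slots are lent to other groups) can be written down as a small
allocation function. This is a plain-Python model of one scheduling round,
not Spark scheduler code; `allocate_slots`, the group names, and the slot
counts are hypothetical.

```python
from typing import Dict


def allocate_slots(reserved: Dict[str, int], pending: Dict[str, int]) -> Dict[str, int]:
    """Return how many task slots each group gets in one scheduling round."""
    # Step 1: each group first uses its own reservation, up to its demand.
    granted = {g: min(reserved[g], pending.get(g, 0)) for g in reserved}
    # Step 2: slots a group left unused are lent to groups with unmet demand.
    spare = sum(reserved[g] - granted[g] for g in reserved)
    for g in reserved:
        extra = min(spare, pending.get(g, 0) - granted[g])
        granted[g] += extra
        spare -= extra
    return granted


reserved = {"fast": 4, "slow": 6}
# Few fast tasks: the slow group borrows the 3 idle fast slots.
mostly_slow = allocate_slots(reserved, {"fast": 1, "slow": 20})
# Both groups busy: each is held to its reservation.
both_busy = allocate_slots(reserved, {"fast": 10, "slow": 20})
# Few slow tasks: the fast group borrows the 4 idle slow slots.
mostly_fast = allocate_slots(reserved, {"fast": 10, "slow": 2})
```

The model has no preemption: a slot lent to a long-running slow task is only
returned when that task finishes, so a fast job arriving mid-round may still
wait for its reserved slots to free up.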


Re: Reverse proxy for Spark UI on Kubernetes

2022-05-17 Thread wilson
what's the advantage of using reverse proxy for spark UI?

Thanks

On Tue, May 17, 2022 at 1:47 PM bo yang  wrote:

> Hi Spark Folks,
>
> I built a web reverse proxy to access Spark UI on Kubernetes (working
> together with https://github.com/GoogleCloudPlatform/spark-on-k8s-operator).
> Want to share here in case other people have similar need.
>
> The reverse proxy code is here:
> https://github.com/datapunchorg/spark-ui-reverse-proxy
>
> Let me know if anyone wants to use or would like to contribute.
>
> Thanks,
> Bo
>
>


Re: Reverse proxy for Spark UI on Kubernetes

2022-05-17 Thread Holden Karau
Oh that’s rad 

On Tue, May 17, 2022 at 7:47 AM bo yang  wrote:

> Hi Spark Folks,
>
> I built a web reverse proxy to access Spark UI on Kubernetes (working
> together with https://github.com/GoogleCloudPlatform/spark-on-k8s-operator).
> Want to share here in case other people have similar need.
>
> The reverse proxy code is here:
> https://github.com/datapunchorg/spark-ui-reverse-proxy
>
> Let me know if anyone wants to use or would like to contribute.
>
> Thanks,
> Bo
>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau