Comparison of Trino, Spark, and Hive-MR3

2023-05-31 Thread Sungwoo Park
Hi everyone,

We published an article on the performance and correctness of Trino, Spark,
and Hive-MR3, and thought that it could be of interest to Spark users.

https://www.datamonad.com/post/2023-05-31-trino-spark-hive-performance-1.7/

Omitted in the article is the comparison of Spark 3.2.1 vs 3.4.0. On the
same 10TB TPC-DS benchmark:

With Spark 3.2.1, it takes 27104 seconds to complete all 99 queries.
With Spark 3.4.0, it takes 19669 seconds to complete all 99 queries.

In both cases, all the queries return correct results.

Thanks,

--- Sungwoo


Re: Help with Shuffle Read performance

2022-09-30 Thread Sungwoo Park
Hi Leszek,

When running YARN on Kubernetes and then Spark on YARN, is there a lot of
overhead in maintaining YARN on Kubernetes? I thought people usually want
to move from YARN to Kubernetes because of the overhead of maintaining
Hadoop.

Thanks,

--- Sungwoo


On Fri, Sep 30, 2022 at 1:37 PM Leszek Reimus 
wrote:

> Hi Everyone,
>
> To add my 2 cents here:
>
> The advantage of containers, to me, is that they leave the host system
> pristine and clean, allowing standardized devops deployment of hardware
> for any purpose. Way back, when using bare metal / Ansible, reusing
> hardware always involved a full reformat of the base system. This alone is
> worth the ~1-2% performance tax of cgroup containers.
>
> The advantage of Kubernetes is more on the deployment side of things:
> unified deployment scripts that can be written by devs. The same
> deployment YAML (or Helm chart) can be used in the local dev env / QA /
> integration env and finally prod (with some tweaks).
>
> Depending on the networking CNI and storage backend, Kubernetes can get
> very close to bare-metal performance. In the end it is always a
> trade-off: you gain some, you pay with extra overhead.
>
> I'm running YARN on Kubernetes and mostly run Spark on top of YARN (some
> legacy MapReduce jobs too, though). I find it much more manageable to
> allocate larger memory/CPU chunks to YARN pods and then run an
> auto-scaler to scale out YARN if needed, than to manage individual
> memory/CPU requirements in a Spark-on-Kubernetes deployment.
>
> As far as I have tested, Spark on Kubernetes is immature where
> reliability is concerned (or maybe our homegrown k8s does not do
> fencing/STONITH well yet). When a node dies / goes down, I find that
> executors do not get rescheduled to other nodes - the driver just gets
> stuck waiting for the executors to come back. This does not happen on a
> YARN / Standalone deployment (even when run on the same k8s cluster).
>
> Sincerely,
>
> Leszek Reimus
>
>
>
>
> On Thu, Sep 29, 2022 at 7:06 PM Gourav Sengupta 
> wrote:
>
>> Hi,
>>
>> Don't containers ultimately run on systems, and isn't the only advantage
>> of containers that you can utilise system resources better by
>> micro-managing the jobs running in them? Some say that containers have
>> their own binaries which isolate the environment, but that is a lie,
>> because in a Kubernetes environment that is running your Spark jobs you
>> will have the same environment for all your pods.
>>
>> And as you can see, there are several other configuration issues (disk
>> mounting, security, etc.) to handle as overhead as well.
>>
>> And the entire goal of all those added configurations is that someone in
>> your devops team feels using containers makes things more interesting,
>> without any real added advantage for large-volume jobs.
>>
>> But I may be wrong, and perhaps we need data rather than personal
>> attacks like the ones made by the other person in the thread.
>>
>> In case anyone does not know, EMR does run on containers as well, and
>> with EMR running on EC2 nodes you can put all your binaries in containers
>> and use them to run your jobs.
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Thu, Sep 29, 2022 at 7:46 PM Vladimir Prus 
>> wrote:
>>
>>> Igor,
>>>
>>> what exact instance types do you use? Unless you use local instance
>>> storage and have actually configured your Kubernetes and Spark to use
>>> instance storage, your 30x30 exchange can run into EBS IOPS limits. You
>>> can investigate that by going to an instance, then to its volume, and
>>> checking the monitoring charts.
>>>
>>> Another thought is that you're essentially giving 4GB per core. That
>>> sounds pretty low, in my experience.
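
For reference, a minimal sketch of pointing Spark on Kubernetes at local
instance storage instead of EBS-backed volumes, in line with the advice
above. The volume name, mount path, host path, and the memory/core values
(following the 4GB-per-core remark) are assumptions for illustration, not a
drop-in fix:

  import org.apache.spark.sql.SparkSession

  // Sketch: mount a hostPath volume whose name starts with "spark-local-dir-"
  // so shuffle spill and block data land on the node's local NVMe disk.
  // The host path below is assumed to exist on every node.
  val spark = SparkSession.builder()
    .appName("shuffle-on-local-disk")
    .config("spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path",
            "/mnt/nvme/spark-local")
    .config("spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path",
            "/mnt/nvme/spark-local")
    // More memory per core than 4 GB; the exact values are examples only.
    .config("spark.executor.cores", "16")
    .config("spark.executor.memory", "96g")
    .getOrCreate()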
>>>
>>>
>>>
>>> On Thu, Sep 29, 2022 at 9:13 PM Igor Calabria 
>>> wrote:
>>>
 Hi Everyone,

I'm running Spark 3.2 on Kubernetes and have a job with a decently
sized shuffle of almost 4TB. The relevant cluster config is as follows:

- 30 executors; 16 physical cores, configured with 32 cores for Spark
- 128 GB RAM
- shuffle.partitions is 18k, which gives me tasks of around 150~180MB

The job runs fine but I'm bothered by how underutilized the cluster
gets during the reduce phase. During the map phase (reading data from S3
and writing the shuffle data), CPU usage, disk throughput, and network
usage are as expected, but during the reduce phase they get really low. It
seems the main bottleneck is reading shuffle data from other nodes: task
statistics report values ranging from 25s to several minutes (the task
sizes are really close, they aren't skewed). I've tried increasing
"spark.reducer.maxSizeInFlight" and
"spark.shuffle.io.numConnectionsPerPeer", and it did improve performance a
little, but not enough to saturate the cluster resources.
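
As a point of reference, a minimal sketch of setting the reduce-side fetch
parameters mentioned above, plus two related caps; all values here are
assumptions to experiment with, not tuned recommendations:

  import org.apache.spark.sql.SparkSession

  // Sketch: reduce-side fetch tuning for a large shuffle.
  val spark = SparkSession.builder()
    .appName("shuffle-read-tuning")
    .config("spark.reducer.maxSizeInFlight", "96m")         // default is 48m
    .config("spark.shuffle.io.numConnectionsPerPeer", "4")  // default is 1
    // Caps on in-flight fetch requests/blocks (unlimited by default); lowering
    // them can help when many reducers hit the same map outputs at once.
    .config("spark.reducer.maxReqsInFlight", "64")
    .config("spark.reducer.maxBlocksInFlightPerAddress", "128")
    .getOrCreate()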

Did I miss some more tuning parameters that could help?
One obvious thing would be to scale the machines up vertically and use
fewer nodes to minimize traffic, but 30 nodes doesn't

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Sungwoo Park
Yes, we can get reduce tasks started when there are enough resources in the
cluster. As you point out, reduce tasks cannot produce their output while
map tasks are still running, but they can prefetch the output of map tasks.
In our prototype implementation of pipelined execution, everything works as
intended, but for typical Spark jobs (like SparkSQL jobs), we don't see
noticeable performance improvement because Spark tasks are mostly
short-running tasks. My question was whether there would be some category of
Spark jobs that would benefit from pipelined execution.

Thanks,

--- Sungwoo

On Thu, Sep 8, 2022 at 7:51 AM Sean Owen  wrote:

> Wait, how do you start reduce tasks before maps are finished? Is the idea
> that some reduce tasks don't depend on all the maps, or at least that you
> can get started?
> You can already execute unrelated DAGs in parallel of course.
>
> On Wed, Sep 7, 2022 at 5:49 PM Sungwoo Park  wrote:
>
>> You are right -- Spark can't do this with its current architecture. My
>> question was: if there was a new implementation supporting pipelined
>> execution, what kind of Spark jobs would benefit (a lot) from it?
>>
>> Thanks,
>>
>> --- Sungwoo
>>
>> On Thu, Sep 8, 2022 at 1:47 AM Russell Jurney 
>> wrote:
>>
>>> I don't think Spark can do this with its current architecture. It has to
>>> wait for the step to be done, speculative execution isn't possible. Others
>>> probably know more about why that is.
>>>
>>> Thanks,
>>> Russell Jurney @rjurney <http://twitter.com/rjurney>
>>> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
>>> <http://facebook.com/jurney> datasyndrome.com
>>>
>>>
>>> On Wed, Sep 7, 2022 at 7:42 AM Sungwoo Park  wrote:
>>>
>>>> Hello Spark users,
>>>>
>>>> I have a question on the architecture of Spark (which could lead to a
>>>> research problem). In its current implementation, Spark finishes executing
>>>> all the tasks in a stage before proceeding to child stages. For example,
>>>> given a two-stage map-reduce DAG, Spark finishes executing all the map
>>>> tasks before scheduling reduce tasks.
>>>>
>>>> We can think of another 'pipelined execution' strategy in which tasks
>>>> in child stages can be scheduled and executed concurrently with tasks in
>>>> parent stages. For example, for the two-stage map-reduce DAG, while map
>>>> tasks are being executed, we could schedule and execute reduce tasks in
>>>> advance if the cluster has enough resources. These reduce tasks can also
>>>> pre-fetch the output of map tasks.
>>>>
>>>> Has anyone seen Spark jobs for which this 'pipelined execution'
>>>> strategy would be desirable while the current implementation is not quite
>>>> adequate? Since Spark tasks usually run for a short period of time, I guess
>>>> the new strategy would not have a major performance improvement. However,
>>>> there might be some category of Spark jobs for which this new strategy
>>>> would be clearly a better choice.
>>>>
>>>> Thanks,
>>>>
>>>> --- Sungwoo
>>>>
>>>>


Re: Pipelined execution in Spark (???)

2022-09-07 Thread Sungwoo Park
You are right -- Spark can't do this with its current architecture. My
question was: if there was a new implementation supporting pipelined
execution, what kind of Spark jobs would benefit (a lot) from it?

Thanks,

--- Sungwoo

On Thu, Sep 8, 2022 at 1:47 AM Russell Jurney 
wrote:

> I don't think Spark can do this with its current architecture. It has to
> wait for the step to be done, speculative execution isn't possible. Others
> probably know more about why that is.
>
> Thanks,
> Russell Jurney @rjurney <http://twitter.com/rjurney>
> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
> <http://facebook.com/jurney> datasyndrome.com
>
>
> On Wed, Sep 7, 2022 at 7:42 AM Sungwoo Park  wrote:
>
>> Hello Spark users,
>>
>> I have a question on the architecture of Spark (which could lead to a
>> research problem). In its current implementation, Spark finishes executing
>> all the tasks in a stage before proceeding to child stages. For example,
>> given a two-stage map-reduce DAG, Spark finishes executing all the map
>> tasks before scheduling reduce tasks.
>>
>> We can think of another 'pipelined execution' strategy in which tasks in
>> child stages can be scheduled and executed concurrently with tasks in
>> parent stages. For example, for the two-stage map-reduce DAG, while map
>> tasks are being executed, we could schedule and execute reduce tasks in
>> advance if the cluster has enough resources. These reduce tasks can also
>> pre-fetch the output of map tasks.
>>
>> Has anyone seen Spark jobs for which this 'pipelined execution' strategy
>> would be desirable while the current implementation is not quite adequate?
>> Since Spark tasks usually run for a short period of time, I guess the new
>> strategy would not have a major performance improvement. However, there
>> might be some category of Spark jobs for which this new strategy would be
>> clearly a better choice.
>>
>> Thanks,
>>
>> --- Sungwoo
>>
>>


Pipelined execution in Spark (???)

2022-09-07 Thread Sungwoo Park
Hello Spark users,

I have a question on the architecture of Spark (which could lead to a
research problem). In its current implementation, Spark finishes executing
all the tasks in a stage before proceeding to child stages. For example,
given a two-stage map-reduce DAG, Spark finishes executing all the map
tasks before scheduling reduce tasks.

We can think of another 'pipelined execution' strategy in which tasks in
child stages can be scheduled and executed concurrently with tasks in
parent stages. For example, for the two-stage map-reduce DAG, while map
tasks are being executed, we could schedule and execute reduce tasks in
advance if the cluster has enough resources. These reduce tasks can also
pre-fetch the output of map tasks.
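
As a toy illustration of the idea (not Spark internals; the thread pool,
the in-memory queue standing in for shuffle blocks, and all the numbers are
made up), a reduce-side fetcher can start prefetching as soon as individual
map outputs appear, even though its final result is only available after
the whole map stage has finished:

  import java.util.concurrent.{ConcurrentLinkedQueue, Executors}
  import scala.concurrent.{Await, ExecutionContext, Future}
  import scala.concurrent.duration.Duration

  object PipelinedShuffleSketch extends App {
    val pool = Executors.newFixedThreadPool(8)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)

    val numMapTasks = 4
    // Stand-in for shuffle blocks published by finished map tasks.
    val mapOutputs = new ConcurrentLinkedQueue[Integer]()

    // "Map stage": each task publishes its output as soon as it finishes.
    val mapStage = Future.traverse((1 to numMapTasks).toList) { i =>
      Future { Thread.sleep(200); mapOutputs.add(i); i }
    }

    // "Reduce task", scheduled concurrently with the map stage: it prefetches
    // whatever map output already exists, but can only emit its final result
    // once all map outputs have arrived.
    val reduceTask = Future {
      var fetched = List.empty[Int]
      while (fetched.size < numMapTasks) {
        Option(mapOutputs.poll()).foreach(b => fetched = b.intValue :: fetched)
        Thread.sleep(10) // poll again soon; a real fetcher would block or stream
      }
      fetched.sum // the actual reduce work
    }

    Await.result(mapStage, Duration.Inf)
    println(s"reduce result = ${Await.result(reduceTask, Duration.Inf)}")
    pool.shutdown()
  }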

Has anyone seen Spark jobs for which this 'pipelined execution' strategy
would be desirable while the current implementation is not quite adequate?
Since Spark tasks usually run for a short period of time, I guess the new
strategy would not have a major performance improvement. However, there
might be some category of Spark jobs for which this new strategy would be
clearly a better choice.

Thanks,

--- Sungwoo


Re: [Spark][Core] Resource Allocation

2022-07-15 Thread Sungwoo Park
For 1), this is a recurring question on this mailing list, and the answer
is: no, Spark does not support coordination between multiple Spark
applications. Spark relies on an external resource manager, such as YARN
or Kubernetes, to allocate resources to multiple Spark applications. For
example, to achieve a fair allocation of resources on YARN, one should
configure the YARN Fair Scheduler.
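
For the YARN route, a minimal sketch of the Spark side, assuming the
cluster's fair-scheduler.xml already defines a queue (the queue name and
the executor bounds below are assumptions; the Fair Scheduler itself is
configured on the YARN side):

  import org.apache.spark.sql.SparkSession

  // Sketch: each Spark application is submitted to a YARN queue managed by
  // the Fair Scheduler, with dynamic allocation so idle executors are
  // returned to the cluster.
  val spark = SparkSession.builder()
    .appName("app-1")
    .config("spark.yarn.queue", "analytics")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")  // external shuffle service on YARN
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "50")
    .getOrCreate()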

Databricks seems to have their own solution to this problem (with the
multi-cluster optimization option). For Apache Spark, there is an extension
called Spark-MR3 which can manage resources among multiple Spark
applications. If you are interested, see the blog article:
https://www.datamonad.com/post/2021-08-18-spark-mr3/
From the blog:

*The main motivation for developing Spark on MR3 is to allow multiple Spark
applications to share compute resources such as Yarn containers or
Kubernetes Pods.*

We have released Spark 3.0.3 on MR3, and Spark 3.2.1 on MR3 will be
released sometime soon.
If you are further interested, see the webpage of Spark on MR3:
https://mr3docs.datamonad.com/docs/spark/

--- Sungwoo

On Wed, Jul 13, 2022 at 4:55 AM Amin Borjian 
wrote:

> I have some problems, and I am trying to find out whether there is no
> solution for them (due to the current implementation) or whether there is
> a way that I was not aware of.
>
>
>
> 1)
>
>
>
> Currently, we can enable and configure dynamic resource allocation based
> on below documentation.
>
>
> https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
>
>
>
> Based on the documentation, it is possible to start with an initial number
> of executors and, if some tasks are pending, request more executors. Also,
> if some executors are idle and we don't have more tasks, those executors
> will be killed (so they can be used by others). My question is about the
> case where we have 2 SparkContexts (separate applications). In such cases,
> I expect the dynamic method to work as fairly as possible and distribute
> resources equally. But what I observe is that if SparkContext 1 uses all
> of the executors because it has running tasks, it will not release them
> until it has no more tasks to run and the executors become idle. While
> Spark could avoid executing the new tasks of SparkContext 1 (because it is
> not logical to kill the running tasks) and instead make executors free for
> SparkContext 2, it does not do so, and I could not find any configuration
> for this. Have I understood correctly? And is there no way to achieve a
> fair dynamic allocation between contexts?
>
>
>
> 2)
>
>
>
> In dynamic or even static resource allocation, Spark must run a set of
> executors on the resources in the cluster (workers). The data that exists
> on the cluster has as little skew as possible and is distributed
> throughout the cluster. For this reason, it is better for executors to be
> distributed as much as possible across the cluster in order to benefit
> from data locality. But what I observe is that Spark sometimes runs 2 or
> more executors on the same worker even if there are idle workers. Is this
> intentional, with other reasons making it an improvement, or would
> spreading executors out be better but simply not supported by Spark at the
> moment?
>


Re: A scene with unstable Spark performance

2022-05-17 Thread Sungwoo Park
The problem you describe is the motivation for developing Spark on MR3.
From the blog article (https://www.datamonad.com/post/2021-08-18-spark-mr3/):

*The main motivation for developing Spark on MR3 is to allow multiple Spark
applications to share compute resources such as Yarn containers or
Kubernetes Pods.*

The problem is due to an architectural limitation of Spark, and I guess
fixing the problem would require a heavy rewrite of Spark core. When we
developed Spark on MR3, we were not aware of any attempt being made
elsewhere (in academia and industry) to address this limitation.

A potential workaround might be to implement a custom Spark application
that manages the submission of two groups of Spark jobs and controls their
execution (similarly to Spark Thrift Server). Not sure if this approach
would fix your problem, though.
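
One shape such a controlling application could take, sketched here under
the assumption of a single SparkContext with FAIR scheduling (the pool
names, table names, and workloads are placeholders):

  import org.apache.spark.sql.SparkSession

  // Sketch: one driver owns a single SparkContext; two groups of jobs run
  // in separate threads and share executors via FAIR scheduler pools.
  val spark = SparkSession.builder()
    .appName("controlling-driver")
    .config("spark.scheduler.mode", "FAIR")
    .getOrCreate()

  def runGroup(pool: String, table: String): Thread = {
    val t = new Thread(() => {
      // Jobs submitted from this thread go to the given scheduler pool.
      spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool)
      spark.table(table).count() // placeholder workload
    })
    t.start()
    t
  }

  val group1 = runGroup("etl", "db.events")
  val group2 = runGroup("adhoc", "db.clicks")
  group1.join(); group2.join()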

If you are interested, see the webpage of Spark on MR3:
https://mr3docs.datamonad.com/docs/spark/

We have released Spark 3.0.1 on MR3, and Spark 3.2.1 on MR3 is under
development. For Spark 3.0.1 on MR3, no change is made to Spark; MR3 is
used as an add-on. The main application of MR3 is Hive on MR3, but Spark on
MR3 is equally ready for production.

Thank you,

--- Sungwoo



Re:

2022-04-02 Thread Sungwoo Park
MR3 is a new execution engine, so there are quite a few differences on the
backend side. Some of the differences are:

1. Easier to install and run (e.g., no need to upgrade Hadoop)
2. Faster (because Hive on MR3 supports LLAP mode and runs as fast as
Hive-LLAP)
3. More efficient - unlike Tez, a single master can handle multiple DAGs
and a worker can execute many tasks at once (like Spark executors and
Hive-LLAP daemons)
4. Provides native support for Kubernetes

The following page contains some more details on the difference from
Hive-LLAP.

https://mr3docs.datamonad.com/docs/k8s/features/comparison-llap/

Thanks,

-- SW


On Sat, Apr 2, 2022 at 9:58 PM Bitfox  wrote:

> Nice reading. Can you give a comparison of Hive on MR3 and Hive on Tez?
>
> Thanks
>
> On Sat, Apr 2, 2022 at 7:17 PM Sungwoo Park  wrote:
>
>> Hi Spark users,
>>
>> We have published an article where we evaluate the performance of Spark
>> 2.3.8 and Spark 3.2.1 (along with Hive 3). If interested, please see:
>>
>> https://www.datamonad.com/post/2022-04-01-spark-hive-performance-1.4/
>>
>> --- SW
>>
>


[no subject]

2022-04-02 Thread Sungwoo Park
Hi Spark users,

We have published an article where we evaluate the performance of Spark
2.3.8 and Spark 3.2.1 (along with Hive 3). If interested, please see:

https://www.datamonad.com/post/2022-04-01-spark-hive-performance-1.4/

--- SW


[Announce] Spark on MR3

2021-08-19 Thread Sungwoo Park
Hi Spark users,

We would like to announce the release of Spark on MR3, which is Apache
Spark using MR3 as the execution backend. MR3 is a general purpose
execution engine for Hadoop and Kubernetes, and Hive on MR3 has been its
main application. Spark on MR3 is a new application of MR3.

The main motivation for developing Spark on MR3 is to allow multiple Spark
applications to share compute resources such as Yarn containers or
Kubernetes Pods. It can be particularly useful in cloud environments where
Spark applications are created and destroyed frequently. We wrote a blog
article introducing Spark on MR3:

https://www.datamonad.com/post/2021-08-18-spark-mr3/

Currently we have released Spark 3.0.3 on MR3 1.3. For more details on
Spark on MR3, you can check out the user guide:

https://mr3docs.datamonad.com/docs/spark/

Thanks,

--- Sungwoo