I have not used standalone mode for a good while. Standard Dataproc uses YARN as the resource manager. Vanilla Dataproc is Google's answer to Hadoop on the cloud: move your analytics workload from on-premise to the cloud with little effort, keeping the same look and feel. Google then introduced dynamic allocation of resources to cater for those apps that could not easily be migrated to Kubernetes (GKE).

The doc states that without dynamic allocation, Spark only asks for containers at the beginning of the job; with dynamic allocation, it will remove containers, or ask for new ones, as necessary. This is still using YARN. See here:
<https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling#background_autoscaling_with_apache_hadoop_and_apache_spark>
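FYI, the dynamic allocation the doc refers to is driven by the standard Spark properties rather than anything Dataproc specific. A minimal sketch in Scala on a YARN-backed cluster (the app name, executor counts and timeout below are illustrative, not recommendations):

import org.apache.spark.sql.SparkSession

// Minimal sketch: dynamic allocation on a YARN-backed cluster.
// Executor counts and timeout are illustrative only.
val spark = SparkSession.builder()
  .appName("dynamic-allocation-sketch")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "20")
  // Idle executors are handed back to YARN after this timeout.
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  // On YARN, dynamic allocation classically requires the external
  // shuffle service so shuffle files outlive removed executors.
  .config("spark.shuffle.service.enabled", "true")
  .getOrCreate()

For what it is worth, Spark 3.x also added shuffle tracking (spark.dynamicAllocation.shuffleTracking.enabled) as an alternative to the external shuffle service, which may be part of why dynamic allocation gets renewed attention in the 3.x docs.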
This approach was not necessarily very successful, as adding executors dynamically for larger workloads could freeze the Spark application itself.

Reading the doc, it says the startup time for Serverless is 60 seconds, compared to 90 seconds for Dataproc on Compute Engine (the one where you set up your own Spark cluster on Dataproc tin boxes).

Dataproc Serverless for Spark autoscaling
<https://cloud.google.com/dataproc-serverless/docs/concepts/autoscaling>
makes a reference to "Dataproc Serverless autoscaling is the default behavior, and uses Spark dynamic resource allocation
<https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation>
to determine whether, how, and when to scale your workload".

So the key point is not standalone mode, but the general reference to: "Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload. This means that your application may give resources back to the cluster if they are no longer used and request them again later when there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster."

Isn't this the standard Spark resource allocation? So why has it suddenly been elevated as of Spark 3.2? Someone may give a more qualified answer here :)

view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Mon, 21 Nov 2022 at 17:32, Stephen Boesch <java...@gmail.com> wrote:

> Out of curiosity: are there functional limitations in Spark Standalone
> that are of concern? YARN is more configurable for running non-Spark
> workloads and for running multiple Spark jobs in parallel. But for a
> single Spark job it seems standalone launches more quickly and does not
> miss any features. Are there specific limitations you are aware of / have
> run into?
>
> stephen b
>
> On Mon, 21 Nov 2022 at 09:01, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I have not tested this myself, but Google have brought up *Dataproc
>> Serverless for Spark*. In a nutshell, Dataproc Serverless lets you run
>> Spark batch workloads without requiring you to provision and manage
>> your own cluster. Specify workload parameters, and then submit the
>> workload to the Dataproc Serverless service. The service will run the
>> workload on a managed compute infrastructure, autoscaling resources as
>> needed. Dataproc Serverless charges apply only to the time when the
>> workload is executing. Google Dataproc is similar to Amazon EMR.
>>
>> So in short you don't need to provision your own Dataproc cluster etc.
>> One thing I noticed from the release doc
>> <https://cloud.google.com/dataproc-serverless/docs/overview> is that
>> the resource management is *Spark based* as opposed to standard
>> Dataproc, which is YARN based. It is available for Spark 3.2. My
>> assumption is that by Spark based it means that Spark is running in
>> standalone mode. Has there been much improvement in release 3.2 for
>> standalone mode?
>> Thanks