I have not used standalone mode for a good while. Standard Dataproc uses YARN as the resource manager. Vanilla Dataproc is Google's answer to Hadoop on the cloud: move your analytics workload from on-premise to the cloud with little effort, keeping the same look and feel. Google then introduced dynamic allocation of resources to cater for those apps that could not easily be migrated to Kubernetes (GKE).

The doc states that without dynamic allocation, Spark only asks for containers at the beginning of the job; with dynamic allocation, it will remove containers, or ask for new ones, as necessary. This is still using YARN. See here:
<https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling#background_autoscaling_with_apache_hadoop_and_apache_spark>
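FYI, the dynamic allocation the doc refers to is driven by the standard Spark properties rather than anything Dataproc specific. A minimal sketch in Scala on a YARN-backed cluster (the app name, executor counts and timeout below are illustrative, not recommendations):

import org.apache.spark.sql.SparkSession

// Minimal sketch: dynamic allocation on a YARN-backed cluster.
// Executor counts and timeout are illustrative only.
val spark = SparkSession.builder()
  .appName("dynamic-allocation-sketch")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "20")
  // Idle executors are handed back to YARN after this timeout.
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  // On YARN, dynamic allocation classically requires the external
  // shuffle service so shuffle files outlive removed executors.
  .config("spark.shuffle.service.enabled", "true")
  .getOrCreate()

For what it is worth, Spark 3.x also added shuffle tracking (spark.dynamicAllocation.shuffleTracking.enabled) as an alternative to the external shuffle service, which may be part of why dynamic allocation gets renewed attention in the 3.x docs.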
This approach was not necessarily very successful, as adding executors dynamically for larger workloads could freeze the Spark application itself.

Reading the doc, it says the startup time for Serverless is 60 seconds, compared to 90 seconds for Dataproc on Compute Engine (the one where you set up your own Spark cluster on Dataproc tin boxes).

Dataproc Serverless for Spark autoscaling
<https://cloud.google.com/dataproc-serverless/docs/concepts/autoscaling>
makes a reference to "Dataproc Serverless autoscaling is the default behavior, and uses Spark dynamic resource allocation
<https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation>
to determine whether, how, and when to scale your workload".

So the key point is not standalone mode, but the general reference to: "Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload. This means that your application may give resources back to the cluster if they are no longer used and request them again later when there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster."

Isn't this the standard Spark resource allocation? So why has it suddenly been elevated as of Spark 3.2? Someone may give a more qualified answer here :)

view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Mon, 21 Nov 2022 at 17:32, Stephen Boesch <java...@gmail.com> wrote:

> Out of curiosity: are there functional limitations in Spark Standalone
> that are of concern? YARN is more configurable for running non-Spark
> workloads and for running multiple Spark jobs in parallel. But for a
> single Spark job it seems standalone launches more quickly and does not
> miss any features. Are there specific limitations you are aware of / have
> run into?
>
> stephen b
>
> On Mon, 21 Nov 2022 at 09:01, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I have not tested this myself, but Google have brought up *Dataproc
>> Serverless for Spark*. In a nutshell, Dataproc Serverless lets you run
>> Spark batch workloads without requiring you to provision and manage
>> your own cluster. Specify workload parameters, and then submit the
>> workload to the Dataproc Serverless service. The service will run the
>> workload on a managed compute infrastructure, autoscaling resources as
>> needed. Dataproc Serverless charges apply only to the time when the
>> workload is executing. Google Dataproc is similar to Amazon EMR.
>>
>> So in short you don't need to provision your own Dataproc cluster etc.
>> One thing I noticed from the release doc
>> <https://cloud.google.com/dataproc-serverless/docs/overview> is that
>> the resource management is *Spark based* as opposed to standard
>> Dataproc, which is YARN based. It is available for Spark 3.2. My
>> assumption is that by Spark based it means that Spark is running in
>> standalone mode. Has there been much improvement in release 3.2 for
>> standalone mode?
>> Thanks