Thanks Mich. Very insightful.

AK

On Monday, 14 February 2022, 11:18:19 GMT, Mich Talebzadeh 
<mich.talebza...@gmail.com> wrote:
 
Good question. However, we ought to look at what options we have, so to speak. 
Let us consider Spark on Dataproc, Spark on Kubernetes, and Dataflow.



Spark on Dataproc is proven and in use at many organizations; I have deployed 
it extensively. It is infrastructure as a service, provided with Spark, Hadoop 
and other artefacts included. You have to manage cluster creation, automate 
cluster creation and tear-down, submit jobs, etc. However, it is another stack 
that needs to be managed. It now has an autoscaling policy as well (which 
enables cluster worker VM autoscaling).
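
As a rough illustration of the lifecycle being described (the cluster name, 
region, machine types and policy id below are placeholders, not from the 
original thread):

# create a Dataproc cluster with an autoscaling policy attached
gcloud dataproc clusters create my-spark-cluster \
    --region=europe-west2 \
    --autoscaling-policy=my-autoscaling-policy \
    --master-machine-type=n1-standard-4 \
    --worker-machine-type=n1-standard-4 \
    --num-workers=2

# submit a PySpark job to it
gcloud dataproc jobs submit pyspark gs://my-bucket/my_job.py \
    --cluster=my-spark-cluster --region=europe-west2

# tear the cluster down when done
gcloud dataproc clusters delete my-spark-cluster --region=europe-west2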


Spark on GKE is something newer. It is worth adding that the Spark DEV team 
are working hard to improve the performance of Spark on Kubernetes, for 
example through support for a customized Kubernetes scheduler. As I explained 
in the first thread, Spark on Kubernetes relies on containerisation. 
Containers make applications more portable. Moreover, they simplify the 
packaging of dependencies, especially with PySpark, and enable repeatable and 
reliable build workflows, which is cost effective. They also reduce the 
overall devops load and allow one to iterate on the code faster. From a purely 
cost perspective it would be cheaper with Docker, as you can share resources 
with your other services. You can create a Spark docker image with different 
versions of Spark, Scala, Java, OS etc. (a sketch follows below). That docker 
file is portable. It can be used on-prem, on AWS, GCP etc. in container 
registries, and devops and data science people can share it as well: built 
once, used by many. Kubernetes with Autopilot helps scale the nodes of the 
Kubernetes cluster depending on the load. That is what I am currently looking 
into.
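
As a minimal sketch of building such an image, the Spark distribution ships a 
docker-image-tool.sh helper (the registry and tag here are placeholders):

# from the root of a Spark distribution: build a PySpark-enabled image
./bin/docker-image-tool.sh -r gcr.io/my-project -t 3.1.1-py \
    -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build

# push it to the container registry so others can reuse it
./bin/docker-image-tool.sh -r gcr.io/my-project -t 3.1.1-py push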

With regard to Dataflow, which I believe is similar to AWS Glue, it is a 
managed service for executing data processing patterns. Patterns, or 
pipelines, are built with the Apache Beam SDK, which is an open source 
programming model that supports Java, Python and Go. It enables batch and 
streaming pipelines. You create your pipelines with an Apache Beam program and 
then run them on the Dataflow service. The Apache Spark Runner can be used to 
execute Beam pipelines using Spark. When you run a job on Dataflow, it spins 
up a cluster of virtual machines, distributes the tasks in the job to the VMs, 
and dynamically scales the cluster based on how the job is performing. As I 
understand it, both iterative processing and notebooks, plus machine learning 
with Spark ML, are not currently supported by Dataflow.
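
To make that concrete, launching Beam's stock wordcount example on the 
Dataflow runner looks roughly like this (project, region and bucket are 
placeholders):

# run Beam's bundled wordcount example on Dataflow
python -m apache_beam.examples.wordcount \
    --input gs://dataflow-samples/shakespeare/kinglear.txt \
    --output gs://my-bucket/wordcount/output \
    --runner DataflowRunner \
    --project my-project \
    --region europe-west2 \
    --temp_location gs://my-bucket/tmp/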

So we have three choices here. If you are migrating from an on-prem 
Hadoop/Spark/YARN set-up, you may go for Dataproc, which will provide the same 
look and feel. If you want to use microservices and containers in your 
event-driven architecture, you can adopt docker images that run on Kubernetes 
clusters, including multi-cloud Kubernetes clusters. Dataflow is probably best 
suited for green-field projects: less operational overhead, and a unified 
approach for batch and streaming pipelines.

So, as ever, your mileage varies. If you want to migrate from your existing 
Hadoop/Spark cluster to GCP, or take advantage of your existing workforce, 
choose Dataproc or GKE. In many cases, a big consideration is that one already 
has a codebase written against a particular framework, and one just wants to 
deploy it on GCP. So even if, say, the Beam programming model/Dataflow is 
superior to Hadoop, someone with a lot of Hadoop code might still choose 
Dataproc or GKE for the time being, rather than rewriting their code on Beam 
to run on Dataflow.

 HTH




view my Linkedin profile

https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.

 


On Mon, 14 Feb 2022 at 05:46, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

Hi,

Maybe this is useful in case someone is testing SPARK in containers for 
developing SPARK.

From a production-scale work point of view: if I am in AWS, I will just use 
GLUE if I want to use containers for SPARK, without massively increasing my 
costs for operations unnecessarily. Also, in case I am not wrong, GCP already 
has SPARK running in serverless mode.

Personally, I would never create the overhead of additional costs and issues 
for my clients by deploying SPARK when those solutions are already available 
from the cloud vendors. In fact, that is one of the precise reasons why people 
use cloud: to reduce operational costs.

Sorry, just trying to understand what the scope of this work is.

Regards,
Gourav Sengupta
On Fri, Feb 11, 2022 at 8:35 PM Mich Talebzadeh <mich.talebza...@gmail.com> 
wrote:

The equivalent of Google GKE autopilot in AWS is AWS Fargate




I have not used AWS Fargate, so I can only mention Google's GKE Autopilot.




This is developed from the concepts of containerization and microservices. In 
the standard mode of creating a GKE cluster, users can customize their 
configurations based on requirements: GKE manages the control plane, and users 
manually provision and manage their node infrastructure. So you choose your 
hardware type and memory/CPU where your Spark containers will be running, and 
they will be shown as VM hosts in your account. In GKE Autopilot mode, GKE 
manages the nodes and pre-configures the cluster with add-ons for 
auto-scaling, auto-upgrades, maintenance, Day-2 operations and security 
hardening. So there is a lot there. You don't choose your nodes and their 
sizes; you are effectively paying for the pods you use.
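
As a hedged sketch of the difference (cluster names, region and machine type 
are placeholders):

# standard GKE: you pick machine types and node counts yourself
gcloud container clusters create my-spark-cluster \
    --region=europe-west2 \
    --machine-type=e2-standard-4 \
    --num-nodes=3

# Autopilot: GKE chooses and manages the nodes; you pay per pod
gcloud container clusters create-auto my-autopilot-cluster \
    --region=europe-west2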




Within spark-submit, you still need to specify the number of executors, plus 
driver and executor memory and cores. The theory is that the k8s cluster will 
deploy suitable nodes and will create enough pods on those nodes. With a 
standard k8s cluster you choose your nodes, and you should ensure that one 
core on each node is reserved for the OS itself. Otherwise, if you allocate 
all cores to Spark with --conf spark.executor.cores, you will receive this 
error:




kubectl describe pods -n spark

...

Events:

  Type     Reason             Age                 From                Message

  ----     ------             ----                ----                -------

  Warning  FailedScheduling   9s (x17 over 15m)   default-scheduler   0/3 nodes 
are available: 3 Insufficient cpu.
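
For reference, the kind of spark-submit invocation being discussed is sketched 
below (the master URL, image name and application path are placeholders, not 
the actual job):

spark-submit \
    --master k8s://https://<k8s-apiserver-host>:443 \
    --deploy-mode cluster \
    --name randomdatabigquery \
    --conf spark.kubernetes.namespace=spark \
    --conf spark.kubernetes.container.image=gcr.io/my-project/spark-py:3.1.1 \
    --conf spark.executor.instances=6 \
    --conf spark.driver.memory=1g \
    --conf spark.executor.memory=1g \
    --conf spark.executor.cores=3 \
    local:///opt/spark/work-dir/my_job.py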

So with standard k8s you have a choice in selecting your core sizes. With 
Autopilot, node selection is left to Autopilot to deploy suitable nodes, and 
this will be trial and error at the start (to get the configuration right). 
You may be lucky if the history of executions is kept current and the same job 
can be repeated. However, in my experience, getting the driver pod into 
"running state" is expensive time-wise, and without an executor in running 
state there is no chance of the Spark job doing anything.
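
The pod listings that follow were captured with something like the following 
(namespace assumed from the describe command above):

kubectl get pods -n spark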



NAME                                         READY   STATUS    RESTARTS   AGE

randomdatabigquery-cebab77eea6de971-exec-1   0/1     Pending   0          31s

randomdatabigquery-cebab77eea6de971-exec-2   0/1     Pending   0          31s

randomdatabigquery-cebab77eea6de971-exec-3   0/1     Pending   0          31s

randomdatabigquery-cebab77eea6de971-exec-4   0/1     Pending   0          31s

randomdatabigquery-cebab77eea6de971-exec-5   0/1     Pending   0          31s

randomdatabigquery-cebab77eea6de971-exec-6   0/1     Pending   0          31s



sparkbq-37405a7eea6b9468-driver              1/1     Running   0          3m4s




NAME                                         READY   STATUS              RESTARTS   AGE

randomdatabigquery-cebab77eea6de971-exec-6   0/1     ContainerCreating   0          112s

sparkbq-37405a7eea6b9468-driver              1/1     Running             0          4m25s

NAME                                         READY   STATUS    RESTARTS   AGE

randomdatabigquery-cebab77eea6de971-exec-6   1/1     Running   0          114s



sparkbq-37405a7eea6b9468-driver              1/1     Running   0          4m27s

Basically, I told Spark to have 6 executors but could only bring one executor 
into running state, after the driver pod had been spinning for 4 minutes.

22/02/11 20:16:18 INFO SparkKubernetesClientFactory: Auto-configuring K8S 
client using current context from users K8S config file

22/02/11 20:16:19 INFO Utils: Using initial executors = 6, max of 
spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors 
and spark.executor.instances

22/02/11 20:16:19 INFO ExecutorPodsAllocator: Going to request 3 executors from 
Kubernetes for ResourceProfile Id: 0, target: 6 running: 0.

22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, 
skipping shutdown script

22/02/11 20:16:20 INFO Utils: Successfully started service 
'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.

22/02/11 20:16:20 INFO NettyBlockTransferService: Server created on 
sparkbq-37405a7eea6b9468-driver-svc.spark.svc:7079

22/02/11 20:16:20 INFO BlockManager: Using 
org.apache.spark.storage.RandomBlockReplicationPolicy for block replication 
policy

22/02/11 20:16:20 INFO BlockManagerMaster: Registering BlockManager 
BlockManagerId(driver, sparkbq-37405a7eea6b9468-driver-svc.spark.svc, 7079, 
None)

22/02/11 20:16:20 INFO BlockManagerMasterEndpoint: Registering block manager 
sparkbq-37405a7eea6b9468-driver-svc.spark.svc:7079 with 366.3 MiB RAM, 
BlockManagerId(driver, sparkbq-37405a7eea6b9468-driver-svc.spark.svc, 7079, 
None)

22/02/11 20:16:20 INFO BlockManagerMaster: Registered BlockManager 
BlockManagerId(driver, sparkbq-37405a7eea6b9468-driver-svc.spark.svc, 7079, 
None)

22/02/11 20:16:20 INFO BlockManager: Initialized BlockManager: 
BlockManagerId(driver, sparkbq-37405a7eea6b9468-driver-svc.spark.svc, 7079, 
None)

22/02/11 20:16:20 INFO Utils: Using initial executors = 6, max of 
spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors 
and spark.executor.instances

22/02/11 20:16:20 WARN ExecutorAllocationManager: Dynamic allocation without a 
shuffle service is an experimental feature.

22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, 
skipping shutdown script

22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, 
skipping shutdown script

22/02/11 20:16:20 INFO ExecutorPodsAllocator: Going to request 3 executors from 
Kubernetes for ResourceProfile Id: 0, target: 6 running: 3.

22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, 
skipping shutdown script

22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, 
skipping shutdown script

22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, 
skipping shutdown script

22/02/11 20:16:49 INFO KubernetesClusterSchedulerBackend: SchedulerBackend is 
ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 
30000000000(ns)

22/02/11 20:16:49 INFO SharedState: Setting hive.metastore.warehouse.dir 
('null') to the value of spark.sql.warehouse.dir 
('file:/opt/spark/work-dir/spark-warehouse').



22/02/11 20:16:49 INFO SharedState: Warehouse path is 
'file:/opt/spark/work-dir/spark-warehouse'.

OK, there is a lot to digest here, and I would appreciate feedback from other 
members who have experimented with GKE Autopilot or AWS Fargate, or who are 
familiar with k8s.
Thanks



view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.

 


  
