Hi,

I would still not build any custom solution, and if in GCP I would use
serverless Dataproc. I think that it is always better to be hands-on with
AWS Glue before commenting on it.

Regards,
Gourav Sengupta

On Mon, Feb 14, 2022 at 11:18 AM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Good question. However, we ought to look at what options we have so to
> speak.
>
> Let us consider Spark on Dataproc, Spark on Kubernetes and Spark on
> Dataflow.
>
>
> Spark on Dataproc <https://cloud.google.com/dataproc> is proven and in use
> at many organizations; I have deployed it extensively. It is infrastructure
> as a service that provides Spark, Hadoop and other artefacts. You have to
> manage cluster creation and tear-down, automate those steps, submit jobs
> and so on, so it is another stack that needs to be managed. It now supports
> autoscaling policies
> <https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling>
> (automatic scaling of cluster worker VMs) as well.
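>
> As a rough sketch (the cluster name, region, bucket and policy file below
> are placeholders, not anything from this thread), creating an autoscaling
> Dataproc cluster and submitting a PySpark job would look something like:
>
> gcloud dataproc autoscaling-policies import my-policy \
>     --source=autoscaling-policy.yaml --region=europe-west2
> gcloud dataproc clusters create my-cluster --region=europe-west2 \
>     --autoscaling-policy=my-policy
> gcloud dataproc jobs submit pyspark gs://my-bucket/job.py \
>     --cluster=my-cluster --region=europe-west2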
>
> Spark on GKE
> <https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview>
> is something newer. It is worth adding that the Spark dev team are working
> hard to improve the performance of Spark on Kubernetes, for example through
> Support for Customized Kubernetes Scheduler
> <https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg>.
> As I explained in the first thread, Spark on Kubernetes relies on
> containerisation. Containers make applications more portable. Moreover,
> they simplify the packaging of dependencies, especially with PySpark, and
> enable repeatable and reliable build workflows, which is cost-effective.
> They also reduce the overall devops load and allow one to iterate on the
> code faster. From a purely cost perspective it would be cheaper with Docker
> *as you can share resources* with your other services. You can create a
> Spark Docker image with different versions of Spark, Scala, Java, the OS
> etc. That image is portable: it can be used on-prem, on AWS, on GCP etc.
> via container registries, and devops and data science people can share it
> as well. Build once, use many times. Kubernetes with Autopilot
> <https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview#:~:text=Autopilot%20is%20a%20new%20mode,and%20yield%20higher%20workload%20availability.>
> helps scale the nodes of the Kubernetes cluster depending on the load. *That
> is what I am currently looking into*.
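>
> As a sketch, assuming a Spark distribution unpacked locally and a registry
> you control (the registry and tag below are placeholders), the
> docker-image-tool.sh script that ships with Spark builds and pushes such an
> image, including the PySpark bindings:
>
> cd $SPARK_HOME
> # build the JVM image and a PySpark image, then push both to the registry
> ./bin/docker-image-tool.sh -r eu.gcr.io/my-project -t 3.1.2 \
>     -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
> ./bin/docker-image-tool.sh -r eu.gcr.io/my-project -t 3.1.2 push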
>
> With regard to Dataflow <https://cloud.google.com/dataflow/docs>, which I
> believe is similar to AWS Glue
> <https://aws.amazon.com/glue/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc>,
> it is a managed service for executing data processing patterns. Patterns,
> or pipelines, are built with the Apache Beam SDK
> <https://beam.apache.org/documentation/runners/spark/>, an open source
> programming model that supports Java, Python and Go and enables both batch
> and streaming pipelines. You create your pipelines with an Apache Beam
> program and then run them on the Dataflow service. The Apache Spark Runner
> <https://beam.apache.org/documentation/runners/spark/#:~:text=The%20Apache%20Spark%20Runner%20can,Beam%20pipelines%20using%20Apache%20Spark.&text=The%20Spark%20Runner%20executes%20Beam,same%20security%20features%20Spark%20provides.>
> can also be used to execute Beam pipelines on Spark. When you run a job on
> Dataflow, it spins up a cluster of virtual machines, distributes the tasks
> in the job to the VMs, and dynamically scales the cluster based on how the
> job is performing. As I understand it, iterative processing, notebooks and
> machine learning with Spark ML are not currently supported by Dataflow.
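>
> For illustration (the project, region and bucket below are placeholders),
> the stock Beam wordcount example runs on Dataflow from the Python SDK
> roughly like this:
>
> pip install 'apache-beam[gcp]'
> python -m apache_beam.examples.wordcount \
>     --runner=DataflowRunner \
>     --project=my-project \
>     --region=europe-west2 \
>     --input=gs://dataflow-samples/shakespeare/kinglear.txt \
>     --output=gs://my-bucket/wordcount/output \
>     --temp_location=gs://my-bucket/tmp/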
>
> So we have three choices here. If you are migrating from an on-prem
> Hadoop/Spark/YARN set-up, you may go for Dataproc, which will provide the
> same look and feel. If you want to use microservices and containers in your
> event-driven architecture, you can adopt Docker images that run on
> Kubernetes clusters, including multi-cloud Kubernetes clusters. Dataflow is
> probably best suited for green-field projects: less operational overhead
> and a unified approach for batch and streaming pipelines.
>
> *So as ever your mileage varies*. If you want to migrate from your
> existing Hadoop/Spark cluster to GCP, or take advantage of your existing
> workforce, choose Dataproc or GKE. In many cases, a big consideration is
> that one already has a codebase written against a particular framework, and
> one just wants to deploy it on GCP, so even if, say, the Beam programming
> model/Dataflow is superior to Hadoop, someone with a lot of Hadoop code
> might still choose Dataproc or GKE for the time being, rather than
> rewriting their code on Beam to run on Dataflow.
>
>  HTH
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 14 Feb 2022 at 05:46, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
>> Hi,
>> maybe this is useful in case someone is testing Spark in containers while
>> developing Spark.
>>
>> *From a production scale work point of view:*
>> But if I am in AWS, I will just use Glue if I want to use containers for
>> Spark, without unnecessarily increasing my operational costs.
>>
>> Also, in case I am not wrong, GCP already has Spark running in serverless
>> mode. Personally I would never create the overhead of additional costs and
>> issues for my clients by deploying Spark when those solutions are already
>> available from the cloud vendors. In fact, that is one of the precise
>> reasons why people use the cloud - to reduce operational costs.
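>>
>> For reference, that serverless mode is Dataproc Serverless for Spark; a
>> minimal batch submission (the bucket and region below are placeholders) is
>> roughly:
>>
>> gcloud dataproc batches submit pyspark gs://my-bucket/job.py \
>>     --region=europe-west2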
>>
>> Sorry, just trying to understand what is the scope of this work.
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Fri, Feb 11, 2022 at 8:35 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> The equivalent of Google GKE Autopilot
>>> <https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview>
>>> in AWS is AWS Fargate <https://aws.amazon.com/fargate/>.
>>>
>>>
>>> I have not used AWS Fargate, so I can only mention Google's GKE
>>> Autopilot.
>>>
>>>
>>> This is developed from the concepts of containerisation and
>>> microservices. In the standard mode of creating a GKE cluster, users can
>>> customise the configuration to their requirements: GKE manages the
>>> control plane while users manually provision and manage their node
>>> infrastructure. So you choose the machine type and memory/CPU where your
>>> Spark containers will run, and the nodes show up as VM hosts in your
>>> account. In GKE Autopilot mode, GKE manages the nodes and pre-configures
>>> the cluster with add-ons for auto-scaling, auto-upgrades, maintenance,
>>> Day 2 operations and security hardening, so there is a lot there. You
>>> don't choose your nodes and their sizes; you are effectively paying for
>>> the pods you use.
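>>>
>>> As a sketch (the cluster name and region below are placeholders), an
>>> Autopilot cluster is created with a single command and there are no node
>>> pools to size:
>>>
>>> gcloud container clusters create-auto spark-autopilot --region=europe-west2
>>> gcloud container clusters get-credentials spark-autopilot --region=europe-west2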
>>>
>>>
>>> With spark-submit you still need to specify the number of executors,
>>> plus the memory and cores for the driver and for each executor. The
>>> theory is that the k8s cluster will deploy suitable nodes and create
>>> enough pods on those nodes. With the standard k8s cluster you choose your
>>> nodes and you ensure that one core on each node is reserved for the OS
>>> itself. Otherwise, if you allocate all cores to Spark with --conf
>>> spark.executor.cores, you will receive this error:
>>>
>>>
>>> kubectl describe pods -n spark
>>>
>>> ...
>>>
>>> Events:
>>>
>>>   Type     Reason             Age                 From
>>> Message
>>>
>>>   ----     ------             ----                ----
>>> -------
>>>
>>>   Warning  FailedScheduling   9s (x17 over 15m)   default-scheduler
>>>  0/3 nodes are available: 3 Insufficient cpu.
>>>
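>>> For context, a minimal spark-submit against a k8s cluster looks roughly
>>> like the sketch below; the API server address, image, namespace, service
>>> account and application file are placeholders, and the resource settings
>>> are the ones that have to fit onto whatever nodes the cluster offers:
>>>
>>> ./bin/spark-submit \
>>>     --master k8s://https://<k8s-apiserver-host>:443 \
>>>     --deploy-mode cluster \
>>>     --name randomdatabigquery \
>>>     --conf spark.kubernetes.namespace=spark \
>>>     --conf spark.kubernetes.container.image=eu.gcr.io/my-project/spark-py:3.1.2 \
>>>     --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa \
>>>     --conf spark.executor.instances=6 \
>>>     --conf spark.executor.cores=3 \
>>>     --conf spark.executor.memory=4g \
>>>     --conf spark.driver.memory=4g \
>>>     local:///opt/spark/work-dir/RandomDataBigQuery.py
>>>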
>>> So with the standard k8s you have a choice of selecting your core sizes.
>>> With Autopilot the node selection is left to Autopilot, which deploys
>>> suitable nodes, and this will be trial and error at the start (to get the
>>> configuration right). You may be lucky if the history of executions is
>>> kept current and the same job can be repeated. However, in my experience,
>>> getting the driver pod into the "running" state is expensive time-wise,
>>> and without an executor in the running state there is no chance of the
>>> Spark job doing anything:
>>>
>>>
>>> NAME                                         READY   STATUS    RESTARTS   AGE
>>> randomdatabigquery-cebab77eea6de971-exec-1   0/1     Pending   0          31s
>>> randomdatabigquery-cebab77eea6de971-exec-2   0/1     Pending   0          31s
>>> randomdatabigquery-cebab77eea6de971-exec-3   0/1     Pending   0          31s
>>> randomdatabigquery-cebab77eea6de971-exec-4   0/1     Pending   0          31s
>>> randomdatabigquery-cebab77eea6de971-exec-5   0/1     Pending   0          31s
>>> randomdatabigquery-cebab77eea6de971-exec-6   0/1     Pending   0          31s
>>> sparkbq-37405a7eea6b9468-driver              1/1     Running   0          3m4s
>>>
>>> NAME                                         READY   STATUS              RESTARTS   AGE
>>> randomdatabigquery-cebab77eea6de971-exec-6   0/1     ContainerCreating   0          112s
>>> sparkbq-37405a7eea6b9468-driver              1/1     Running             0          4m25s
>>>
>>> NAME                                         READY   STATUS    RESTARTS   AGE
>>> randomdatabigquery-cebab77eea6de971-exec-6   1/1     Running   0          114s
>>> sparkbq-37405a7eea6b9468-driver              1/1     Running   0          4m27s
>>>
>>> Basically I told Spark to use 6 executors, but only one executor reached
>>> the running state after the driver pod had been spinning for 4 minutes.
>>>
>>> 22/02/11 20:16:18 INFO SparkKubernetesClientFactory: Auto-configuring
>>> K8S client using current context from users K8S config file
>>>
>>> 22/02/11 20:16:19 INFO Utils: Using initial executors = 6, max of
>>> spark.dynamicAllocation.initialExecutors,
>>> spark.dynamicAllocation.minExecutors and spark.executor.instances
>>>
>>> 22/02/11 20:16:19 INFO ExecutorPodsAllocator: Going to request 3
>>> executors from Kubernetes for ResourceProfile Id: 0, target: 6 running: 0.
>>>
>>> 22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not
>>> enabled, skipping shutdown script
>>>
>>> 22/02/11 20:16:20 INFO Utils: Successfully started service
>>> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.
>>>
>>> 22/02/11 20:16:20 INFO NettyBlockTransferService: Server created on
>>> sparkbq-37405a7eea6b9468-driver-svc.spark.svc:7079
>>>
>>> 22/02/11 20:16:20 INFO BlockManager: Using
>>> org.apache.spark.storage.RandomBlockReplicationPolicy for block replication
>>> policy
>>>
>>> 22/02/11 20:16:20 INFO BlockManagerMaster: Registering BlockManager
>>> BlockManagerId(driver, sparkbq-37405a7eea6b9468-driver-svc.spark.svc, 7079,
>>> None)
>>>
>>> 22/02/11 20:16:20 INFO BlockManagerMasterEndpoint: Registering block
>>> manager sparkbq-37405a7eea6b9468-driver-svc.spark.svc:7079 with 366.3 MiB
>>> RAM, BlockManagerId(driver,
>>> sparkbq-37405a7eea6b9468-driver-svc.spark.svc, 7079, None)
>>>
>>> 22/02/11 20:16:20 INFO BlockManagerMaster: Registered BlockManager
>>> BlockManagerId(driver, sparkbq-37405a7eea6b9468-driver-svc.spark.svc, 7079,
>>> None)
>>>
>>> 22/02/11 20:16:20 INFO BlockManager: Initialized BlockManager:
>>> BlockManagerId(driver, sparkbq-37405a7eea6b9468-driver-svc.spark.svc, 7079,
>>> None)
>>>
>>> 22/02/11 20:16:20 INFO Utils: Using initial executors = 6, max of
>>> spark.dynamicAllocation.initialExecutors,
>>> spark.dynamicAllocation.minExecutors and spark.executor.instances
>>>
>>> 22/02/11 20:16:20 WARN ExecutorAllocationManager: Dynamic allocation
>>> without a shuffle service is an experimental feature.
>>>
>>> 22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not
>>> enabled, skipping shutdown script
>>>
>>> 22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not
>>> enabled, skipping shutdown script
>>>
>>> 22/02/11 20:16:20 INFO ExecutorPodsAllocator: Going to request 3
>>> executors from Kubernetes for ResourceProfile Id: 0, target: 6 running: 3.
>>>
>>> 22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not
>>> enabled, skipping shutdown script
>>>
>>> 22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not
>>> enabled, skipping shutdown script
>>>
>>> 22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not
>>> enabled, skipping shutdown script
>>>
>>> 22/02/11 20:16:49 INFO KubernetesClusterSchedulerBackend:
>>> SchedulerBackend is ready for scheduling beginning after waiting
>>> maxRegisteredResourcesWaitingTime: 30000000000(ns)
>>>
>>> 22/02/11 20:16:49 INFO SharedState: Setting hive.metastore.warehouse.dir
>>> ('null') to the value of spark.sql.warehouse.dir
>>> ('file:/opt/spark/work-dir/spark-warehouse').
>>>
>>> 22/02/11 20:16:49 INFO SharedState: Warehouse path is
>>> 'file:/opt/spark/work-dir/spark-warehouse'.
>>>
>>> OK, there is a lot to digest here, and I would appreciate feedback from
>>> other members who have experimented with GKE Autopilot or AWS Fargate, or
>>> who are familiar with k8s.
>>>
>>> Thanks
>>>
>>>
>>>    view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>
