For Kubernetes, yes -- we use AWS EKS. We manage the Spark cluster ourselves, and it is deployed via a Helm chart.
On Wed, Feb 2, 2022 at 8:32 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Is this part of cloud service offerings like Google GKE?
>
> On Wed, 2 Feb 2022 at 13:17, Han Lin <h...@twistbioscience.com> wrote:
>
>> Hello,
>>
>> Before I ask my questions, let me say what I am trying to do and briefly
>> describe the setup I have so far. I am building an API service that
>> serves an ML model which uses Spark ML. I have Spark deployed in
>> Kubernetes in standalone mode (the default Spark cluster manager) with
>> two worker nodes, each worker running as a pod. The API app itself is
>> written in Python using FastAPI, and it uses PySpark to start a Spark
>> application in the Spark cluster. The Spark driver is created on app
>> startup and lives in the same pod as the app, so the Spark application
>> is expected to run for as long as the API pod is alive, which is
>> indefinitely. I understand that this is not the typical pattern, since
>> a Spark application is usually expected to complete at some point, as a
>> kind of workload that eventually returns a result.
>>
>> When load on the API increases, I would like the Spark cluster to scale
>> up automatically by increasing the number of worker replicas, and have
>> the Spark application scale up by using those new workers in the
>> cluster. For controlling horizontal pod scaling in Kubernetes we use
>> KEDA <https://keda.sh/>, and in this case I will probably use certain
>> Prometheus metrics exposed by the Spark master as the scaling trigger.
>> This leads to my questions.
>>
>> 1. I initially thought the number of pending jobs (jobs waiting for
>> available executors) in the Spark cluster would be a good metric to use
>> as a scaling trigger. But after some digging, I found that "pending" is
>> not one of the statuses of jobs; Spark doesn't seem to have an obvious
>> concept of a job queue. Are there any metrics that would give some
>> indication of the number of jobs in the cluster waiting to be run?
>> 2. Are there other Spark metrics that would make sense to use as a
>> scaling trigger?
>> 3. Does this setup of perpetually running Spark applications even make
>> much sense in the first place?
>>
>> Thanks very much in advance for any advice!
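
For anyone following along, a minimal sketch of the long-lived driver pattern described above might look like the following. This is not the poster's actual code; the master URL, app name, and endpoint are illustrative placeholders. The idea is simply that the FastAPI app creates one SparkSession against the standalone master at startup and keeps it for the life of the pod.

    from fastapi import FastAPI
    from pyspark.sql import SparkSession

    app = FastAPI()

    @app.on_event("startup")
    def start_spark():
        # Create the long-lived SparkSession once, when the API pod starts.
        # The master URL and app name are placeholders for illustration.
        app.state.spark = (
            SparkSession.builder
            .master("spark://spark-master:7077")
            .appName("ml-api")
            .getOrCreate()
        )

    @app.on_event("shutdown")
    def stop_spark():
        # Stop the driver when the API pod is torn down.
        app.state.spark.stop()

    @app.get("/healthz")
    def healthz():
        # The Spark application lives for as long as this API process does.
        return {"spark_running": app.state.spark is not None}

Whether the session is created in a startup hook or lazily on the first request is a design choice; the key point of the pattern is that the driver outlives any single request.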