There might be 3 options:

1. Just as you expect: only ONE application, ONE RDD, with region-aware containers and executors allocated and distributed automatically. The ResourceProfile work (https://issues.apache.org/jira/browse/SPARK-27495) may meet this requirement, treating region as a type of resource just like GPU. But you would have to wait for the full feature to be finished, and I can imagine the troubleshooting challenges.

2. Label YARN nodes with a region tag, group them into queues, and submit the jobs for different regions into dedicated queues (with the --queue argument when submitting; see the sketch after this list).

3. Build separate Spark clusters with an independent YARN ResourceManager per region, e.g. a UK cluster, a US-east cluster and a US-west cluster. It looks dirty, but it is easy to deploy and manage, and you can schedule jobs around each region's busy and idle hours to get better performance at lower cost.
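For option 2, a minimal sketch of pinning one regional job to its queue. The queue name "us-west" is an assumption and must match a queue you have defined in your YARN scheduler config (the region node labels are set up separately on the YARN side):

    # Equivalent to: spark-submit --queue us-west ...
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("sales-us-west")
             .config("spark.yarn.queue", "us-west")  # hypothetical region queue
             .getOrCreate())

    # Each regional job reads only its own partition, keeping I/O region-local.
    df = spark.read.parquet("s3://mybucket/sales_fact.parquet/us-west")
    print(df.count())

You would submit one such job per region, each into its own queue.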
Just my 2 cents.

Cheers,
-z

________________________________________
From: Stone Zhong <stone.zh...@gmail.com>
Sent: Wednesday, April 15, 2020 4:31
To: user@spark.apache.org
Subject: Cross Region Apache Spark Setup

Hi,

I am trying to set up a cross-region Apache Spark cluster. All my data are stored in Amazon S3 and well partitioned by region. For example, I have parquet files at:

    s3://mybucket/sales_fact.parquet/us-west
    s3://mybucket/sales_fact.parquet/us-east
    s3://mybucket/sales_fact.parquet/uk

And my cluster has nodes in the us-west, us-east and uk regions -- basically I have nodes in every region that I support. When I have code like:

    df = spark.read.parquet("s3://mybucket/sales_fact.parquet/*")
    print(df.count())                                   #1
    print(df.select("product_id").distinct().count())   #2

For #1, I expect only us-west nodes to read the data partition in us-west, etc., and Spark to add the 3 regional counts and return me a total count. I do not expect large cross-region data transfer in this case.

For #2, I likewise expect only us-west nodes to read the data partition in us-west, etc. Each region does the distinct() locally first, then the 3 "product_id" lists are merged and distinct() is applied again. I am OK with the cross-region data transfer necessary for merging the distinct product_ids.

Can anyone please share the best practice? Is it possible to configure Apache Spark to work in such a way? Any idea and help is appreciated!

Thanks,
Stone
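Whichever of the three options is used, the #2 query can be expressed with an explicit per-region distinct-then-merge. A minimal sketch, assuming the S3 layout above and that S3 credentials are already configured (the region list and paths are taken from the original message):

    # Per-region distinct first, then merge and distinct again --
    # the shape of computation the original #2 expects.
    from functools import reduce
    from pyspark.sql import DataFrame, SparkSession

    spark = SparkSession.builder.appName("cross-region-distinct").getOrCreate()

    regions = ["us-west", "us-east", "uk"]  # from the partition layout above
    per_region = [
        spark.read.parquet("s3://mybucket/sales_fact.parquet/" + r)
             .select("product_id")
             .distinct()                    # local, per-region distinct
        for r in regions
    ]

    # Union the 3 already-deduplicated lists and deduplicate once more; only
    # this merge step needs cross-region data transfer.
    merged = reduce(DataFrame.union, per_region).distinct()
    print(merged.count())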