Thank you, Wei. I will look into #1. With option 2, it seems the complexity gets pushed to the application: the application needs to write multiple queries and merge the final results itself.
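For reference, this is roughly what I mean by "multiple queries and merge" -- a minimal PySpark sketch using the paths from my earlier mail (whether the scheme is s3:// or s3a:// depends on the cluster's Hadoop/S3 setup, and getting each regional read to run only on that region's nodes is exactly the part that isn't solved here):

    # Minimal sketch of the per-region "multiple queries and merge" approach.
    # Paths are the ones from the original thread; s3:// vs s3a:// depends on setup.
    from functools import reduce
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cross-region-merge").getOrCreate()

    region_paths = [
        "s3://mybucket/sales_fact.parquet/us-west",
        "s3://mybucket/sales_fact.parquet/us-east",
        "s3://mybucket/sales_fact.parquet/uk",
    ]

    # One query per region, then merge the partial results in the application.
    region_dfs = [spark.read.parquet(p) for p in region_paths]

    # #1: total count is the sum of the per-region counts.
    total_count = sum(df.count() for df in region_dfs)

    # #2: distinct product_ids per region first, then union and distinct again.
    distinct_ids = reduce(
        lambda a, b: a.union(b),
        [df.select("product_id").distinct() for df in region_dfs],
    ).distinct()

    print(total_count)
    print(distinct_ids.count())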
Regards,
Stone

On Mon, Apr 20, 2020 at 7:39 AM ZHANG Wei <wezh...@outlook.com> wrote:

> There might be 3 options:
>
> 1. Just as you expect: only ONE application, ONE RDD, with region-aware
> containers and executors automatically allocated and distributed. The
> ResourceProfile work (https://issues.apache.org/jira/browse/SPARK-27495) may
> meet the requirement, treating Region as a type of resource just like GPU.
> But you have to wait for the full feature to be finished, and I can imagine
> the troubleshooting challenges.
> 2. Label YARN nodes with a region tag, group them into queues, and submit
> the jobs for different regions into dedicated queues (with the --queue
> argument when submitting).
> 3. Build separate Spark clusters with independent YARN Resource Managers
> per region, e.g. a UK cluster, a US-east cluster, and a US-west cluster.
> It looks dirty, but it is easy to deploy and manage, and you can schedule
> jobs around each region's busy and idle hours to get more performance at
> lower cost.
>
> Just my 2 cents
>
> ---
> Cheers,
> -z
>
> ________________________________________
> From: Stone Zhong <stone.zh...@gmail.com>
> Sent: Wednesday, April 15, 2020 4:31
> To: user@spark.apache.org
> Subject: Cross Region Apache Spark Setup
>
> Hi,
>
> I am trying to set up a cross-region Apache Spark cluster. All my data is
> stored in Amazon S3 and well partitioned by region.
>
> For example, I have parquet files at
> S3://mybucket/sales_fact.parquet/us-west
> S3://mybucket/sales_fact.parquet/us-east
> S3://mybucket/sales_fact.parquet/uk
>
> And my cluster has nodes in the us-west, us-east, and uk regions -- basically
> I have nodes in every region that I support.
>
> When I have code like:
>
> df = spark.read.parquet("S3://mybucket/sales_fact.parquet/*")
> print(df.count()) #1
> print(df.select("product_id").distinct().count()) #2
>
> For #1, I expect only the us-west nodes to read the data partition in
> us-west, and so on for the other regions, and Spark to add the 3 regional
> counts and return me a total count. I do not expect large cross-region data
> transfer in this case.
> For #2, I likewise expect only the us-west nodes to read the data partition
> in us-west, and so on. Each region does the distinct() locally first, then
> the 3 "product_id" lists are merged and a distinct() is applied again. I am
> OK with the necessary cross-region data transfer for merging the distinct
> product_ids.
>
> Can anyone please share the best practice? Is it possible to configure
> Apache Spark to work in such a way?
>
> Any idea and help is appreciated!
>
> Thanks,
> Stone
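To make sure I understand option 2: each region would get its own copy of a small job submitted to that region's dedicated queue, something like the hypothetical sketch below (the queue names, script name "region_job.py", and the output location for partial results are placeholders I made up, not anything from this thread):

    # Hypothetical per-region job for option 2. Each copy is submitted to the
    # YARN queue whose nodes carry that region's label, e.g.:
    #   spark-submit --queue us-west region_job.py us-west
    #   spark-submit --queue us-east region_job.py us-east
    #   spark-submit --queue uk      region_job.py uk
    import sys
    from pyspark.sql import SparkSession

    region = sys.argv[1]  # e.g. "us-west"

    spark = SparkSession.builder.appName(f"sales-{region}").getOrCreate()

    # Read only this region's partition so no cross-region data is pulled.
    df = spark.read.parquet(f"s3://mybucket/sales_fact.parquet/{region}")

    # Write partial results somewhere a final merge job can pick them up
    # (this output path is an assumption for illustration only).
    df.select("product_id").distinct().write.mode("overwrite").parquet(
        f"s3://mybucket/partials/product_ids/{region}"
    )
    print(region, df.count())

A final small job would then read the partial outputs, union them, and apply distinct() again, which is the extra application-side merging I was referring to above.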