I can think of three options:

1. Just as you expect: only ONE application and ONE RDD, with region-aware 
containers and executors automatically allocated and distributed. The 
ResourceProfile feature (https://issues.apache.org/jira/browse/SPARK-27495) may 
meet this requirement by treating region as a type of resource, just like GPU. 
But you would have to wait for the full feature to be finished, and I can 
imagine the troubleshooting challenges.
2. Label YARN nodes with a region tag, group them into queues, and submit the 
jobs for different regions into dedicated queues (with the --queue argument 
when submitting); see the sketch after this list.
3. Build separate Spark clusters with an independent YARN ResourceManager per 
region, e.g. a UK cluster, a US-east cluster and a US-west cluster. It looks 
dirty but is easy to deploy and manage, and you can schedule jobs around each 
region's busy and idle hours for better performance and lower cost.

Just my 2 cents

---
Cheers,
-z

________________________________________
From: Stone Zhong <stone.zh...@gmail.com>
Sent: Wednesday, April 15, 2020 4:31
To: user@spark.apache.org
Subject: Cross Region Apache Spark Setup

Hi,

I am trying to set up a cross-region Apache Spark cluster. All my data is 
stored in Amazon S3 and well partitioned by region.

For example, I have parquet files at
    s3://mybucket/sales_fact.parquet/us-west
    s3://mybucket/sales_fact.parquet/us-east
    s3://mybucket/sales_fact.parquet/uk

And my cluster has nodes in the us-west, us-east and uk regions -- basically I 
have nodes in every region that I support.

When I have code like:

df = spark.read.parquet("s3://mybucket/sales_fact.parquet/*")
print(df.count()) #1
print(df.select("product_id").distinct().count()) #2

For #1, I expect only us-west nodes to read the data partition in us-west, 
etc., and Spark to add up the 3 regional counts and return me the total count. 
I do not expect large cross-region data transfer in this case.
For #2, I expect only us-west nodes to read the data partition in us-west, etc. 
Each region does the distinct() locally first, then the 3 "product_id" lists 
are merged and a distinct() is done again. I am OK with the necessary 
cross-region data transfer for merging the distinct product_ids.
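
To illustrate, here is a rough sketch of the two-phase distinct I have in mind, 
with the per-region reads spelled out by hand (I would like Spark to arrange 
this automatically instead of me writing it this way):

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

regions = ["us-west", "us-east", "uk"]
# Phase 1: each region computes its local distinct product_id list.
per_region = [
    spark.read.parquet("s3://mybucket/sales_fact.parquet/" + r)
         .select("product_id").distinct()
    for r in regions
]
# Phase 2: merge the 3 lists and de-duplicate again (this is the only
# step that needs cross-region data transfer).
merged = reduce(lambda a, b: a.union(b), per_region).distinct()
print(merged.count())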

Can anyone please share the best practice? Is it possible to configure Apache 
Spark to work in such a way?

Any idea and help is appreciated!

Thanks,
Stone
