Hi, I am trying to set up a cross-region Apache Spark cluster. All my data is stored in Amazon S3 and is well partitioned by region.
For example, I have Parquet files at:

```
s3://mybucket/sales_fact.parquet/us-west
s3://mybucket/sales_fact.parquet/us-east
s3://mybucket/sales_fact.parquet/uk
```

My cluster has nodes in the us-west, us-east, and uk regions -- basically, I have nodes in every region that I support. Now suppose I have code like:

```python
df = spark.read.parquet("s3://mybucket/sales_fact.parquet/*")
print(df.count())                                   # 1
print(df.select("product_id").distinct().count())   # 2
```

For #1, I expect the us-west nodes to read only the us-west partition (and likewise for the other regions), and Spark to add up the 3 regional counts and return the total. *I do not expect large cross-region data transfer in this case.*

For #2, I again expect each region's nodes to read only their local partition, run the `distinct()` locally first, and then merge the 3 "product_id" lists and run `distinct()` again. I am OK with the cross-region data transfer necessary for merging the distinct product_ids.

Can anyone please share the best practice? Is it possible to configure Apache Spark to work this way?

Any ideas and help are appreciated!

Thanks,
Stone
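To make my expectation concrete, here is a rough manual sketch of the behavior I'm hoping to get (the region list and paths just mirror my layout above, and reading each region's prefix separately is only my illustration -- ideally a single `spark.read.parquet(".../*")` would behave like this on its own):

```python
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

regions = ["us-west", "us-east", "uk"]

# Read each region's prefix separately, so that ideally only the
# executors located in that region touch that region's data.
per_region = [
    spark.read.parquet(f"s3://mybucket/sales_fact.parquet/{r}") for r in regions
]

# 1: sum the per-region counts instead of counting the unioned data,
# so only 3 small numbers cross region boundaries.
total_count = sum(df.count() for df in per_region)
print(total_count)

# 2: de-duplicate product_id within each region first, then union the
# (much smaller) per-region distinct sets and de-duplicate again globally.
distinct_per_region = [df.select("product_id").distinct() for df in per_region]
all_distinct = reduce(lambda a, b: a.union(b), distinct_per_region).distinct()
print(all_distinct.count())
```

Is there a configuration (locality preferences, per-region executors, etc.) that gets Spark to plan the simple version this way, or is something like the manual split above the recommended approach?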