Hi,

I am trying to set up a cross-region Apache Spark cluster. All my data is
stored in Amazon S3 and is partitioned by region.

For example, I have Parquet files at
    s3://mybucket/sales_fact.parquet/us-west
    s3://mybucket/sales_fact.parquet/us-east
    s3://mybucket/sales_fact.parquet/uk

And my cluster has nodes in the us-west, us-east and uk regions -- basically
I have nodes in every region that I support.

When I have code like:

df = spark.read.parquet("s3://mybucket/sales_fact.parquet/*")
print(df.count()) #1
print(df.select("product_id").distinct().count()) #2

For #1, I expect only the us-west nodes to read the data partition in
us-west (and likewise for the other regions), and Spark to add the 3
regional counts and return the total. *I do not expect large cross-region
data transfer in this case.*
For #2, I expect the same region-local reads. Each region does the
distinct() locally first, then the 3 regional "product_id" lists are merged
and de-duplicated once more. I am OK with the cross-region data transfer
needed to merge the distinct product_ids.
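To make the expectation concrete, here is a rough PySpark sketch of the plan
I have in mind, using the per-region paths above (the explicit per-path reads
and union-based merge are only an illustration of the intent, not code I want
to write by hand for every query):

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cross-region-sketch").getOrCreate()

# One path per region, same layout as above.
paths = ["s3://mybucket/sales_fact.parquet/" + r
         for r in ["us-west", "us-east", "uk"]]

# Expectation for #1: each region counts its own partition,
# then the three per-region counts are added together.
print(sum(spark.read.parquet(p).count() for p in paths))

# Expectation for #2: each region de-duplicates product_id locally,
# then only the (much smaller) distinct lists cross regions to be
# merged and de-duplicated once more.
per_region = [spark.read.parquet(p).select("product_id").distinct()
              for p in paths]
merged = reduce(lambda a, b: a.union(b), per_region)
print(merged.distinct().count())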

Can anyone please share the best practice? Is it possible to configure
Apache Spark to work in such a way?

Any ideas and help are appreciated!

Thanks,
Stone
