Thank you Wei.

I will look into #1. With option 2, it seems it would push the complexity to
the application -- the application would need to write multiple queries and
merge the final result.
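
For reference, a rough sketch of what that per-region query-and-merge might
look like on the application side (the paths are the ones from my earlier
mail, the region list is hard-coded just for illustration, and an existing
SparkSession named spark is assumed):

from functools import reduce

regions = ["us-west", "us-east", "uk"]  # hard-coded here for illustration

# one DataFrame per regional partition
dfs = [spark.read.parquet("s3://mybucket/sales_fact.parquet/" + r)
       for r in regions]

# #1: count each region separately, then add up the partial counts
print(sum(df.count() for df in dfs))

# #2: distinct product_ids per region, then union the lists and
# de-duplicate again
per_region_ids = [df.select("product_id").distinct() for df in dfs]
print(reduce(lambda a, b: a.union(b), per_region_ids).distinct().count())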

Regards,
Stone

On Mon, Apr 20, 2020 at 7:39 AM ZHANG Wei <wezh...@outlook.com> wrote:

> There might be 3 options:
>
> 1. Just as you expect, only ONE application and ONE RDD, with region-aware
> containers and executors automatically allocated and distributed. The
> ResourceProfile work (https://issues.apache.org/jira/browse/SPARK-27495) may
> meet the requirement, treating region as a type of resource just like GPU,
> but you would have to wait for the full feature to be finished. And I can
> imagine the troubleshooting challenges.
> 2. Label the Yarn nodes with a region tag, group them into queues, and
> submit the jobs for different regions into dedicated queues (with the
> --queue argument when submitting); see the sketch after this list.
> 3. Build separate Spark clusters with an independent Yarn Resource Manager
> per region, such as a UK cluster, a US-east cluster and a US-west cluster.
> It looks dirty, but it is easy to deploy and manage, and you can schedule
> jobs around each region's busy and idle hours to get better performance and
> lower cost.
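>
> For option 2, a rough sketch of one per-region submission (the queue, label
> and script names here are made up for illustration, and it assumes the Yarn
> nodes have already been labeled and the capacity-scheduler queues mapped to
> those labels):
>
> spark-submit \
>   --master yarn \
>   --queue us_west_queue \
>   --conf spark.yarn.am.nodeLabelExpression=us-west \
>   --conf spark.yarn.executor.nodeLabelExpression=us-west \
>   sales_report.py s3://mybucket/sales_fact.parquet/us-west
>
> The same job would be submitted once per region, and the partial results
> merged afterwards.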
>
> Just my 2 cents
>
> ---
> Cheers,
> -z
>
> ________________________________________
> From: Stone Zhong <stone.zh...@gmail.com>
> Sent: Wednesday, April 15, 2020 4:31
> To: user@spark.apache.org
> Subject: Cross Region Apache Spark Setup
>
> Hi,
>
> I am trying to set up a cross-region Apache Spark cluster. All my data is
> stored in Amazon S3 and well partitioned by region.
>
> For example, I have parquet files at
>     S3://mybucket/sales_fact.parquet/us-west
>     S3://mybucket/sales_fact.parquet/us-east
>     S3://mybucket/sales_fact.parquet/uk
>
> And my cluster has nodes in the us-west, us-east and uk regions -- basically
> I have nodes in every region that I support.
>
> When I have code like:
>
> df = spark.read.parquet("S3://mybucket/sales_fact.parquet/*")
> print(df.count()) #1
> print(df.select("product_id").distinct().count()) #2
>
> For #1, I expect only us-west nodes to read the data partition in us-west,
> and so on for the other regions, and Spark to add the 3 regional counts and
> return the total count. I do not expect large cross-region data transfer in
> this case.
> For #2, I expect only us-west nodes to read the data partition in us-west,
> and so on. Each region does the distinct() locally first, then the 3
> "product_id" lists are merged and distinct() is applied again. I am OK with
> the necessary cross-region data transfer for merging the distinct
> product_ids.
>
> Can anyone please share the best practice? Is it possible to configure
> Apache Spark to work in such a way?
>
> Any ideas and help are appreciated!
>
> Thanks,
> Stone
>
