Re: implement a distribution without shuffle like RDD.coalesce for DataSource V2 write

2023-06-18 Thread Mich Talebzadeh
OK the number of partitions n or more to the point the "optimum" no of partitions depends on the size of your batch data DF among other things and the degree of parallelism at the end point where you will be writing to sink. If you require high parallelism because your tasks are fine grained, then

Re: implement a distribution without shuffle like RDD.coalesce for DataSource V2 write

2023-06-18 Thread Mich Talebzadeh
Is this the point you are trying to implement? I have state data source which enables the state in SS --> Structured Streaming to be rewritten, which enables repartitioning, schema evolution, etc via batch query. The writer requires hash partitioning against group key, with the "desired number of

implement a distribution without shuffle like RDD.coalesce for DataSource V2 write

2023-06-18 Thread Pengfei Li
Hi All, I'm developing a DataSource on Spark 3.2 to write data to our system, and using DataSource V2 API. I want to implement the interface RequiresDistributionAndOrdering