Re: Data correctness issue with Repartition + FetchFailure

2022-03-16 Thread Wenchen Fan
It would be great if you could help with it! Basically, we need to propagate the column-level deterministic information and sort the inputs if the partition key lineage has a nondeterministic part. On Wed, Mar 16, 2022 at 5:28 AM Jason Xu wrote: > Hi Wenchen, thanks for the insight. Agree, the previous

Re: Data correctness issue with Repartition + FetchFailure

2022-03-15 Thread Jason Xu
Hi Wenchen, thanks for the insight. Agree, the previous fix for repartition works for deterministic data. With non-deterministic data, I didn't find an API to pass DeterministicLevel to the underlying RDD. Do you plan to continue working on the integration with SQL operators? If not, I'm available to take a

Re: Data correctness issue with Repartition + FetchFailure

2022-03-14 Thread Wenchen Fan
We fixed the repartition correctness bug before, by sorting the data before doing round-robin partitioning. But the issue is that we need to propagate the isDeterministic property through SQL operators. On Tue, Mar 15, 2022 at 1:50 AM Jason Xu wrote: > Hi Reynold, do you suggest removing
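To illustrate the idea of sorting before round-robin partitioning, here is a minimal sketch in plain Python. It is a toy model, not Spark's implementation: `round_robin` and `sorted_round_robin` are hypothetical helpers, and rows are plain integers. It assumes both attempts produce the same set of rows and differ only in arrival order, which is the deterministic-data case discussed in this thread.

```python
from typing import List, Sequence


def round_robin(rows: Sequence[int], num_partitions: int) -> List[List[int]]:
    """Assign the row at position i to partition i % num_partitions."""
    partitions: List[List[int]] = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions


def sorted_round_robin(rows: Sequence[int], num_partitions: int) -> List[List[int]]:
    """Sort first, so the assignment depends on row content, not arrival order."""
    return round_robin(sorted(rows), num_partitions)


# Same rows, different arrival order (for example, after a task retry).
first_attempt = [5, 3, 8, 1, 9, 2]
retry_attempt = [2, 9, 1, 8, 3, 5]

# Without sorting, the two attempts place rows in different partitions.
unsorted_differs = round_robin(first_attempt, 3) != round_robin(retry_attempt, 3)

# With sorting, both attempts produce identical partitions.
sorted_matches = sorted_round_robin(first_attempt, 3) == sorted_round_robin(retry_attempt, 3)
```

In this model, sorting helps only when a retry reproduces the same rows. If the rows themselves change between attempts, sorting no longer yields the same assignment, which is presumably why determinism information would need to be propagated through the operators.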

Re: Data correctness issue with Repartition + FetchFailure

2022-03-14 Thread Jason Xu
Hi Reynold, do you suggest removing RoundRobinPartitioning from the repartition(numPartitions: Int) API implementation? If that's the direction we're considering, should we suggest that users avoid the repartition(numPartitions: Int) API until we have a new implementation? On Sat, Mar 12, 2022 at

Re: Data correctness issue with Repartition + FetchFailure

2022-03-12 Thread Reynold Xin
This is why RoundRobinPartitioning shouldn't be used ... On Sat, Mar 12, 2022 at 12:08 PM, Jason Xu < jasonxu.sp...@gmail.com > wrote: > Hi Spark community, > I reported a data correctness issue in https://issues.apache.org/jira/browse/SPARK-38388 (

Data correctness issue with Repartition + FetchFailure

2022-03-12 Thread Jason Xu
Hi Spark community, I reported a data correctness issue in https://issues.apache.org/jira/browse/SPARK-38388. In short, non-deterministic data + Repartition + FetchFailure could result in incorrect data. This is an issue we run into in production pipelines. I have an example to reproduce the bug
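As a rough illustration of how this combination could produce incorrect data, the following plain-Python toy model mixes the output of two attempts of the same map task. It is an assumption-laden sketch, not the reproduction from the JIRA ticket: `round_robin` is a hypothetical helper, and the scenario assumes one reducer finished reading before the fetch failure while the other reads from the retried attempt.

```python
from typing import List, Sequence


def round_robin(rows: Sequence[str], num_partitions: int) -> List[List[str]]:
    """Assign the row at position i to partition i % num_partitions."""
    partitions: List[List[str]] = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions


# Attempt 1 of the map task sees the rows in one order.
attempt_1 = round_robin(["a", "b", "c", "d"], 2)

# After a fetch failure, the task is re-run and sees the same rows
# in a different order (the non-deterministic part).
attempt_2 = round_robin(["b", "a", "d", "c"], 2)

# Assumed scenario: reducer 0 already consumed attempt 1,
# while reducer 1 reads from the retried attempt 2.
reducer_0 = attempt_1[0]
reducer_1 = attempt_2[1]

# Rows "a" and "c" appear twice; rows "b" and "d" are lost.
final_rows = sorted(reducer_0 + reducer_1)
```

The total row count stays the same in this model, so a simple count check would not catch the problem.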