Re: Data correctness issue with Repartition + FetchFailure
It's great if you can help with it! Basically, we need to propagate the column-level deterministic information and sort the inputs if the partition key lineage has a nondeterministic part.

On Wed, Mar 16, 2022 at 5:28 AM Jason Xu wrote:

> Hi Wenchen, thanks for the insight. Agree, the previous fix for
> repartition works for deterministic data. With non-deterministic data, I
> didn't find an API to pass DeterministicLevel to underlying rdd.
> Do you plan to continue work on integration with SQL operators? If not,
> I'm available to take a stab.
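A rough illustration of the rule described in this message, written in plain Python rather than Spark code: each expression carries a deterministic flag, a column counts as deterministic only if everything in its lineage is, and a sort is requested when a partition key depends on a nondeterministic expression. `Expr`, `is_deterministic`, `column_determinism`, and `needs_sort` are hypothetical names made up for this sketch; they are not Spark APIs and say nothing about how the real change would be implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expr:
    """Hypothetical expression node; not a Spark class."""
    name: str
    deterministic: bool = True
    children: tuple = ()


def is_deterministic(expr):
    # An expression is deterministic only if it and its whole lineage are.
    return expr.deterministic and all(is_deterministic(c) for c in expr.children)


def column_determinism(projection):
    # Map each output column name to the determinism of the expression behind it.
    return {column: is_deterministic(expr) for column, expr in projection.items()}


def needs_sort(partition_keys, projection):
    # Request a sort when any partition key has a nondeterministic part in its lineage.
    determinism = column_determinism(projection)
    return any(not determinism[key] for key in partition_keys)


rand = Expr("rand", deterministic=False)
projection = {
    "id": Expr("id"),
    # bucket = floor(rand * 10), so its lineage contains a nondeterministic node.
    "bucket": Expr("floor", children=(Expr("multiply", children=(rand, Expr("lit_10"))),)),
}

print(column_determinism(projection))
print(needs_sort(["bucket"], projection), needs_sort(["id"], projection))
```

The point of tracking this per column instead of per plan is that a single nondeterministic column does not force extra work when the partition keys never touch it.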
Re: Data correctness issue with Repartition + FetchFailure
Hi Wenchen, thanks for the insight. Agree, the previous fix for repartition works for deterministic data. With non-deterministic data, I didn't find an API to pass DeterministicLevel to underlying rdd.

Do you plan to continue work on integration with SQL operators? If not, I'm available to take a stab.

On Mon, Mar 14, 2022 at 7:00 PM Wenchen Fan wrote:

> We fixed the repartition correctness bug before, by sorting the data
> before doing round-robin partitioning. But the issue is that we need to
> propagate the isDeterministic property through SQL operators.
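To make the limitation mentioned in this message concrete, here is a toy Python simulation (not Spark code) of sort-then-round-robin applied to rows whose second column is recomputed with different values on retry. The `round_robin` helper, the row values, and the choice to sort by the value column first are all assumptions made for illustration.

```python
def round_robin(rows, num_partitions):
    """Assign rows to partitions purely by their position in the input."""
    partitions = [[] for _ in range(num_partitions)]
    for position, row in enumerate(rows):
        partitions[position % num_partitions].append(row)
    return partitions


def by_value(row):
    # Assumed sort key: the nondeterministic value first, then the key.
    return (row[1], row[0])


# Same keys in both attempts, but the second column is recomputed on retry.
first_rows = [("k1", 5), ("k2", 9), ("k3", 1), ("k4", 7)]
retry_rows = [("k1", 8), ("k2", 2), ("k3", 6), ("k4", 3)]

first_attempt = round_robin(sorted(first_rows, key=by_value), 2)
retry_attempt = round_robin(sorted(retry_rows, key=by_value), 2)

# Reducer 0 read from the first attempt, reducer 1 reads from the retry.
seen_by_reducers = first_attempt[0] + retry_attempt[1]
keys_seen = sorted(key for key, _ in seen_by_reducers)
print(keys_seen)  # ['k1', 'k3', 'k4', 'k4']: k4 is duplicated and k2 is lost
```

Sorting only stabilizes the placement when both attempts produce the same rows; once the values themselves change, the sorted order changes with them.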
Re: Data correctness issue with Repartition + FetchFailure
We fixed the repartition correctness bug before, by sorting the data before doing round-robin partitioning. But the issue is that we need to propagate the isDeterministic property through SQL operators.

On Tue, Mar 15, 2022 at 1:50 AM Jason Xu wrote:

> Hi Reynold, do you suggest removing RoundRobinPartitioning in
> repartition(numPartitions: Int) API implementation? If that's the direction
> we're considering, before we have a new implementation, should we suggest
> users avoid using the repartition(numPartitions: Int) API?
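The idea behind sorting before round-robin partitioning can be sketched with a small Python simulation (not Spark code). It assumes the two attempts of a task produce the same set of rows in a different order; `round_robin` and `sorted_round_robin` are hypothetical helpers written for this example.

```python
def round_robin(rows, num_partitions):
    """Assign rows to partitions purely by their position in the input."""
    partitions = [[] for _ in range(num_partitions)]
    for position, row in enumerate(rows):
        partitions[position % num_partitions].append(row)
    return partitions


def sorted_round_robin(rows, num_partitions):
    # Sorting first makes each row's position independent of arrival order.
    return round_robin(sorted(rows), num_partitions)


# Same rows, different arrival order between the first attempt and the retry.
first_attempt = sorted_round_robin(["a", "b", "c", "d"], 2)
retry_attempt = sorted_round_robin(["b", "a", "d", "c"], 2)

# Reducer 0 read from the first attempt, reducer 1 reads from the retry.
seen_by_reducers = first_attempt[0] + retry_attempt[1]
print(first_attempt == retry_attempt)  # True
print(sorted(seen_by_reducers))        # ['a', 'b', 'c', 'd']
```

Because both attempts yield identical partitions, it no longer matters which attempt each reducer fetched from.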
Re: Data correctness issue with Repartition + FetchFailure
Hi Reynold, do you suggest removing RoundRobinPartitioning in repartition(numPartitions: Int) API implementation? If that's the direction we're considering, before we have a new implementation, should we suggest users avoid using the repartition(numPartitions: Int) API?

On Sat, Mar 12, 2022 at 1:47 PM Reynold Xin wrote:

> This is why RoundRobinPartitioning shouldn't be used ...
Re: Data correctness issue with Repartition + FetchFailure
This is why RoundRobinPartitioning shouldn't be used ...

On Sat, Mar 12, 2022 at 12:08 PM, Jason Xu <jasonxu.sp...@gmail.com> wrote:

> Hi Spark community,
>
> I reported a data correctness issue in
> https://issues.apache.org/jira/browse/SPARK-38388. In short,
> non-deterministic data + Repartition + FetchFailure could result in
> incorrect data, this is an issue we run into in production pipelines, I
> have an example to reproduce the bug in the ticket.
>
> I report here to bring more attention, could you help confirm it's a bug
> and worth effort to further investigate and fix, thank you in advance for
> help!
>
> Thanks,
> Jason Xu
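For readers unfamiliar with the failure mode in the original report, the following toy Python simulation (not Spark code, and not the reproduction from the ticket) shows how position-based partitioning combined with a retried task can duplicate some rows and drop others. The `round_robin` helper and the sample rows are assumptions made for illustration.

```python
def round_robin(rows, num_partitions):
    """Assign rows to partitions purely by their position in the input."""
    partitions = [[] for _ in range(num_partitions)]
    for position, row in enumerate(rows):
        partitions[position % num_partitions].append(row)
    return partitions


# Two attempts of the same map task see the same rows in a different order.
first_attempt = round_robin(["a", "b", "c", "d"], 2)
retry_attempt = round_robin(["b", "a", "d", "c"], 2)

# Reducer 0 fetched its partition before the fetch failure;
# reducer 1 fetches its partition from the retried attempt.
seen_by_reducers = first_attempt[0] + retry_attempt[1]
print(sorted(seen_by_reducers))  # ['a', 'a', 'c', 'c']: b and d are lost
```

The row count stays at four, which is part of why this kind of corruption is easy to miss.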