Re: Data correctness issue with Repartition + FetchFailure

2022-03-16 Thread Wenchen Fan
It's great if you can help with it! Basically, we need to propagate the
column-level determinism information and sort the inputs if the partition
key lineage has a nondeterministic part.
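
A minimal sketch of the idea above, in plain Python rather than Spark
internals. The Column class, lineage_is_deterministic and needs_sort are
hypothetical names made up for illustration; they are not Spark APIs. The
sketch assumes each column records whether its own expression is
deterministic and which columns it is derived from, and applies the rule
stated above: sort the inputs when any partition key has a nondeterministic
part in its lineage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical column descriptor, not a Spark class."""
    name: str
    deterministic: bool = True  # is this column's own expression deterministic?
    inputs: tuple = ()          # columns this one is derived from


def lineage_is_deterministic(col):
    """A column is deterministic only if it and all of its ancestors are."""
    return col.deterministic and all(
        lineage_is_deterministic(parent) for parent in col.inputs
    )


def needs_sort(partition_keys):
    """Apply the rule from the message: sort the inputs when any
    partition key has a nondeterministic part in its lineage."""
    return any(not lineage_is_deterministic(key) for key in partition_keys)


user_id = Column("user_id")
salt = Column("salt", deterministic=False)         # e.g. produced by rand()
bucket = Column("bucket", inputs=(user_id, salt))  # inherits nondeterminism
region = Column("region", inputs=(user_id,))       # fully deterministic

print(needs_sort([bucket]))  # partition key depends on salt
print(needs_sort([region]))  # partition key depends only on user_id
```

The point of tracking this per column rather than per plan is that a single
nondeterministic column elsewhere in the row would not force the extra sort
unless the partition key actually depends on it.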

On Wed, Mar 16, 2022 at 5:28 AM Jason Xu  wrote:

> Hi Wenchen, thanks for the insight. Agree, the previous fix for
> repartition works for deterministic data. With non-deterministic data, I
> didn't find an API to pass DeterministicLevel to underlying rdd.
> Do you plan to continue work on integration with SQL operators? If not,
> I'm available to take a stab.
>
> On Mon, Mar 14, 2022 at 7:00 PM Wenchen Fan  wrote:
>
>> We fixed the repartition correctness bug before, by sorting the data
>> before doing round-robin partitioning. But the issue is that we need to
>> propagate the isDeterministic property through SQL operators.
>>
>> On Tue, Mar 15, 2022 at 1:50 AM Jason Xu  wrote:
>>
>>> Hi Reynold, do you suggest removing RoundRobinPartitioning in
>>> repartition(numPartitions: Int) API implementation? If that's the direction
>>> we're considering, before we have a new implementation, should we suggest
>>> users avoid using the repartition(numPartitions: Int) API?
>>>
>>> On Sat, Mar 12, 2022 at 1:47 PM Reynold Xin  wrote:
>>>
>>>> This is why RoundRobinPartitioning shouldn't be used ...
>>>>
>>>>
>>>> On Sat, Mar 12, 2022 at 12:08 PM, Jason Xu
>>>> wrote:
>>>>
>>>>> Hi Spark community,
>>>>>
>>>>> I reported a data correctness issue in
>>>>> https://issues.apache.org/jira/browse/SPARK-38388. In short,
>>>>> non-deterministic data + Repartition + FetchFailure could result in
>>>>> incorrect data, this is an issue we run into in production pipelines, I
>>>>> have an example to reproduce the bug in the ticket.
>>>>>
>>>>> I report here to bring more attention, could you help confirm it's a
>>>>> bug and worth effort to further investigate and fix, thank you in advance
>>>>> for help!
>>>>>
>>>>> Thanks,
>>>>> Jason Xu
>>>>>




Re: Data correctness issue with Repartition + FetchFailure

2022-03-15 Thread Jason Xu
Hi Wenchen, thanks for the insight. Agree, the previous fix for repartition
works for deterministic data. With non-deterministic data, I didn't find an
API to pass DeterministicLevel to underlying rdd.
Do you plan to continue work on integration with SQL operators? If not, I'm
available to take a stab.

On Mon, Mar 14, 2022 at 7:00 PM Wenchen Fan  wrote:

> We fixed the repartition correctness bug before, by sorting the data
> before doing round-robin partitioning. But the issue is that we need to
> propagate the isDeterministic property through SQL operators.
>
> On Tue, Mar 15, 2022 at 1:50 AM Jason Xu  wrote:
>
>> Hi Reynold, do you suggest removing RoundRobinPartitioning in
>> repartition(numPartitions: Int) API implementation? If that's the direction
>> we're considering, before we have a new implementation, should we suggest
>> users avoid using the repartition(numPartitions: Int) API?
>>
>> On Sat, Mar 12, 2022 at 1:47 PM Reynold Xin  wrote:
>>
>>> This is why RoundRobinPartitioning shouldn't be used ...
>>>
>>>
>>> On Sat, Mar 12, 2022 at 12:08 PM, Jason Xu 
>>> wrote:
>>>
>>>> Hi Spark community,
>>>>
>>>> I reported a data correctness issue in
>>>> https://issues.apache.org/jira/browse/SPARK-38388. In short,
>>>> non-deterministic data + Repartition + FetchFailure could result in
>>>> incorrect data, this is an issue we run into in production pipelines, I
>>>> have an example to reproduce the bug in the ticket.
>>>>
>>>> I report here to bring more attention, could you help confirm it's a
>>>> bug and worth effort to further investigate and fix, thank you in advance
>>>> for help!
>>>>
>>>> Thanks,
>>>> Jason Xu
>>>>
>>>
>>>


Re: Data correctness issue with Repartition + FetchFailure

2022-03-14 Thread Wenchen Fan
We fixed the repartition correctness bug before, by sorting the data before
doing round-robin partitioning. But the issue is that we need to propagate
the isDeterministic property through SQL operators.
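
To illustrate the point, here is a small plain-Python simulation (not Spark
code; round_robin, sorted_round_robin and nondeterministic_attempt are
made-up helper names). It assumes round-robin assigns rows by arrival
position. Sorting first makes the assignment depend only on the rows
themselves, which is enough when a re-run produces the same rows in a
different order, but not when the rows themselves change between attempts.

```python
import random


def round_robin(rows, num_partitions):
    """Assign each row to a partition by its arrival position."""
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts


def sorted_round_robin(rows, num_partitions):
    """Sort first, so position (and thus partition) depends only on the rows."""
    return round_robin(sorted(rows), num_partitions)


# Deterministic data: a re-run yields the same rows in a different order.
rows = ["a", "b", "c", "d", "e"]
shuffled = rows[:]
random.Random(7).shuffle(shuffled)
stable = sorted_round_robin(rows, 2) == sorted_round_robin(shuffled, 2)


def nondeterministic_attempt(seed):
    """Each attempt attaches a fresh random value to every key."""
    rng = random.Random(seed)
    return [(key, rng.random()) for key in rows]


# Non-deterministic data: the rows differ between attempts, so the
# partition contents differ even after sorting.
unstable = (
    sorted_round_robin(nondeterministic_attempt(1), 2)
    != sorted_round_robin(nondeterministic_attempt(2), 2)
)

print(stable, unstable)
```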

On Tue, Mar 15, 2022 at 1:50 AM Jason Xu  wrote:

> Hi Reynold, do you suggest removing RoundRobinPartitioning in
> repartition(numPartitions: Int) API implementation? If that's the direction
> we're considering, before we have a new implementation, should we suggest
> users avoid using the repartition(numPartitions: Int) API?
>
> On Sat, Mar 12, 2022 at 1:47 PM Reynold Xin  wrote:
>
>> This is why RoundRobinPartitioning shouldn't be used ...
>>
>>
>> On Sat, Mar 12, 2022 at 12:08 PM, Jason Xu 
>> wrote:
>>
>>> Hi Spark community,
>>>
>>> I reported a data correctness issue in
>>> https://issues.apache.org/jira/browse/SPARK-38388. In short,
>>> non-deterministic data + Repartition + FetchFailure could result in
>>> incorrect data, this is an issue we run into in production pipelines, I
>>> have an example to reproduce the bug in the ticket.
>>>
>>> I report here to bring more attention, could you help confirm it's a bug
>>> and worth effort to further investigate and fix, thank you in advance for
>>> help!
>>>
>>> Thanks,
>>> Jason Xu
>>>
>>
>>


Re: Data correctness issue with Repartition + FetchFailure

2022-03-14 Thread Jason Xu
Hi Reynold, do you suggest removing RoundRobinPartitioning in
repartition(numPartitions: Int) API implementation? If that's the direction
we're considering, before we have a new implementation, should we suggest
users avoid using the repartition(numPartitions: Int) API?

On Sat, Mar 12, 2022 at 1:47 PM Reynold Xin  wrote:

> This is why RoundRobinPartitioning shouldn't be used ...
>
>
> On Sat, Mar 12, 2022 at 12:08 PM, Jason Xu 
> wrote:
>
>> Hi Spark community,
>>
>> I reported a data correctness issue in
>> https://issues.apache.org/jira/browse/SPARK-38388. In short,
>> non-deterministic data + Repartition + FetchFailure could result in
>> incorrect data, this is an issue we run into in production pipelines, I
>> have an example to reproduce the bug in the ticket.
>>
>> I report here to bring more attention, could you help confirm it's a bug
>> and worth effort to further investigate and fix, thank you in advance for
>> help!
>>
>> Thanks,
>> Jason Xu
>>
>
>


Re: Data correctness issue with Repartition + FetchFailure

2022-03-12 Thread Reynold Xin
This is why RoundRobinPartitioning shouldn't be used ...
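
A toy simulation of the failure mode, in plain Python and not taken from the
ticket (round_robin, hash_partition and read_with_retry are hypothetical
helpers). It assumes one map task whose re-run emits the same rows in a
different order, with reducer 0 reading the first attempt and reducer 1
reading the re-run after a fetch failure. Under round-robin the combined
result loses and duplicates rows; partitioning by row value does not depend
on order and stays correct in this scenario.

```python
def round_robin(rows, num_partitions):
    """Assign rows to partitions by arrival position."""
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts


def hash_partition(rows, num_partitions):
    """Assign rows to partitions by value, independent of arrival order."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[row % num_partitions].append(row)
    return parts


def read_with_retry(partitioner, attempt1, attempt2, num_partitions=2):
    """Reducer 0 reads the first map attempt; reducer 1 hits a fetch
    failure and reads the re-run attempt instead."""
    first = partitioner(attempt1, num_partitions)
    rerun = partitioner(attempt2, num_partitions)
    return first[0] + rerun[1]


attempt1 = [1, 2, 3, 4]
attempt2 = [2, 1, 4, 3]  # same rows, different order on the re-run

rr = read_with_retry(round_robin, attempt1, attempt2)
hp = read_with_retry(hash_partition, attempt1, attempt2)

print(sorted(rr))  # rows 2 and 4 are lost, rows 1 and 3 are duplicated
print(sorted(hp))  # every row appears exactly once
```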

On Sat, Mar 12, 2022 at 12:08 PM, Jason Xu < jasonxu.sp...@gmail.com > wrote:

> 
> Hi Spark community,
> 
> I reported a data correctness issue in
> https://issues.apache.org/jira/browse/SPARK-38388. In short,
> non-deterministic data + Repartition + FetchFailure could result in
> incorrect data, this is an issue we run into in production pipelines, I
> have an example to reproduce the bug in the ticket.
> 
> I report here to bring more attention, could you help confirm it's a bug
> and worth effort to further investigate and fix, thank you in advance for
> help!
> 
> Thanks,
> Jason Xu
>
