[DISCUSS] Porting back SPARK-45178 to 3.5/3.4 version lines

2023-09-20 Thread Jungtaek Lim
Hi devs,

I'd like to get some input on how to handle a possible correctness issue
we found. The JIRA ticket is SPARK-45178, where I described the issue and
the solution I proposed.

Context:
A source may behave incorrectly, leading to correctness issues, if it does
not support Trigger.AvailableNow and users set the trigger to
Trigger.AvailableNow. This is due to an incompatibility between the fallback
implementation of Trigger.AvailableNow and the source implementation. As a
solution, we want to fall back to single-batch execution instead for such
cases.
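
For illustration only, here is a minimal sketch of the user-facing side of
this (my own example, not taken from the PR; the "rate" source is just a
placeholder, and whether a given source supports the trigger depends on it
implementing SupportsTriggerAvailableNow):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object AvailableNowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("available-now-sketch")
      .master("local[*]")
      .getOrCreate()

    // Placeholder source purely for illustration.
    val stream = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()

    // The trigger discussed in SPARK-45178. If the source does not implement
    // SupportsTriggerAvailableNow, the proposed change runs a single batch
    // instead of using the wrapper implementation.
    val query = stream.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/available-now-checkpoint")
      .trigger(Trigger.AvailableNow())
      .start()

    query.awaitTermination()
  }
}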

The proposal has been approved and merged into the master branch (I assume
that is uncontroversial since it is a major release), but since this
introduces a behavioral change, I'd like to hear opinions on whether we want
to introduce a behavioral change in bugfix versions to address a possible
correctness issue, or leave those version lines as they are.

Looking forward to hearing your thoughts on this.

Thanks in advance!
Jungtaek Lim (HeartSaVioR)


Re: Plans for built-in v2 data sources in Spark 4

2023-09-20 Thread Dongjoon Hyun
Instead of that, I believe you are looking for
`spark.sql.sources.useV1SourceList` if the question is about "Concretely,
is the plan for Spark 4 to continue defaulting to the built-in v1 data
sources?".

Here is the code.

https://github.com/apache/spark/blob/324a07b534ac8c2e83a50ac5ea4c5d93fd57b790/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3148-L3155
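
As a rough sketch of how that config can be inspected and overridden at
runtime (my own example; the default value shown is from memory and the
exact semantics are my reading of the config description, so please verify
against SQLConf):

import org.apache.spark.sql.SparkSession

object UseV1SourceListSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("use-v1-source-list-sketch")
      .master("local[*]")
      .getOrCreate()

    // Built-in sources listed here stay on the V1 code path; the default is
    // something like "avro,csv,json,kafka,orc,parquet,text".
    println(spark.conf.get("spark.sql.sources.useV1SourceList"))

    // Dropping a source from the list (or setting an empty list) routes it
    // through the DataSource V2 code path where one is implemented.
    spark.conf.set(
      "spark.sql.sources.useV1SourceList",
      "avro,csv,json,kafka,orc,text")  // parquet removed -> parquet uses V2
  }
}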

Dongjoon.



On Wed, Sep 20, 2023 at 5:47 AM Will Raschkowski 
wrote:

> Thank you for linking that, Dongjoon!
>
>
>
> I found SPARK-44518  in
> that list which wants to turn Spark’s Hive integration into a data source.
> IIUC, that’s very related but I’m curious if I’m thinking about this
> correctly:
>
>
>
> Big gaps between built-in v1 and v2 data sources are support for bucketing
> and partitioning. And the reason v1 data sources support those is because
> the v1 paths are kind of interleaved with Spark’s Hive integration. I
> understand separating that Hive integration or making it more data
> source-ish would put us closer to supporting bucketing and partitioning in
> v2 and then defaulting to v2.
>
>
>
> *From: *Dongjoon Hyun 
> *Date: *Friday, 15 September 2023 at 05:36
> *To: *Will Raschkowski 
> *Cc: *dev@spark.apache.org 
> *Subject: *Re: Plans for built-in v2 data sources in Spark 4
>
>
>
>
> Hi, Will.
>
> According to the following JIRA, as of now, there is no plan or ongoing
> discussion to switch the default.
>
> https://issues.apache.org/jira/browse/SPARK-44111 (Prepare Apache Spark 4.0.0)
>
> Thanks,
> Dongjoon.
>
>
>
>
>
> On Wed, Sep 13, 2023 at 9:02 AM Will Raschkowski
>  wrote:
>
> Hey everyone,
>
>
>
> I was wondering what the plans are for Spark's built-in v2 file data
> sources in Spark 4.
>
>
>
> Concretely, is the plan for Spark 4 to continue defaulting to the built-in
> v1 data sources? And if yes, what are the blockers for defaulting to v2? I
> see, just as an example, that writing Hive partitions is not supported in
> v2. Are there other blockers or outstanding discussions?
>
>
>
> Regards,
>
> Will
>
>
>
>


Re: Plans for built-in v2 data sources in Spark 4

2023-09-20 Thread Will Raschkowski
Thank you for linking that, Dongjoon!

I found SPARK-44518 in that list, which proposes turning Spark's Hive
integration into a data source. To think out loud: the big gaps between the
built-in v1 and v2 data sources are support for bucketing and partitioning.
And the reason the v1 data sources support those is that they're somewhat
interleaved with Spark's Hive integration. Separating that Hive integration,
or making it more data-source-like, would put us closer to supporting
bucketing and partitioning in v2 and then defaulting to v2. (Just my
understanding – curious if I'm thinking about this correctly.)
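
To make that concrete, here's a rough sketch (my own example, not from
SPARK-44518) of the kind of bucketed, partitioned write that, as I
understand it, still goes through the V1 / Hive-integrated path today:

import org.apache.spark.sql.SparkSession

object BucketedWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bucketed-write-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "2023-09-20", "a"), (2, "2023-09-21", "b"))
      .toDF("id", "dt", "payload")

    // partitionBy + bucketBy + saveAsTable is the combination that I believe
    // is still handled by the V1 write path today.
    df.write
      .partitionBy("dt")
      .bucketBy(4, "id")
      .sortBy("id")
      .format("parquet")
      .saveAsTable("bucketed_events")
  }
}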

Anyway, thank you for the pointer.

From: Dongjoon Hyun 
Date: Friday, 15 September 2023 at 05:36
To: Will Raschkowski 
Cc: dev@spark.apache.org 
Subject: Re: Plans for built-in v2 data sources in Spark 4

Hi, Will.

According to the following JIRA, as of now, there is no plan or ongoing
discussion to switch the default.

https://issues.apache.org/jira/browse/SPARK-44111 (Prepare Apache Spark 4.0.0)

Thanks,
Dongjoon.


On Wed, Sep 13, 2023 at 9:02 AM Will Raschkowski 
 wrote:
Hey everyone,

I was wondering what the plans are for Spark's built-in v2 file data sources in 
Spark 4.

Concretely, is the plan for Spark 4 to continue defaulting to the built-in v1
data sources? And if yes, what are the blockers for defaulting to v2? I see,
just as an example, that writing Hive partitions is not supported in v2. Are
there other blockers or outstanding discussions?

Regards,
Will


