Re: Data Contracts

2023-06-12 Thread Deepak Sharma
Spark can also be used with tools like Great Expectations to implement
data contracts.
I am not sure, though, whether Spark alone can do data contracts.
I was reading a blog post on data mesh and how to glue it together with data
contracts; that’s where I came across the mention of Spark and Great
Expectations.
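
For illustration, the two example constraints from Phillip's mail (an
integer between 0 and 100; the date in column X before the date in column Y)
can be sketched as plain row-level predicates. This is only a minimal sketch
in plain Python, not Spark or Great Expectations API; the names Constraint,
violations, score, start_date and end_date are all made up for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


@dataclass(frozen=True)
class Constraint:
    """A named row-level rule (hypothetical helper, not a Spark class)."""
    name: str
    check: Callable[[Row], bool]


# The two example rules from the thread, on made-up column names.
CONSTRAINTS = [
    Constraint("score_between_0_and_100", lambda r: 0 <= r["score"] <= 100),
    Constraint("start_before_end", lambda r: r["start_date"] < r["end_date"]),
]


def violations(rows: List[Row], constraints: List[Constraint]) -> List[Tuple[int, str]]:
    """Return (row_index, constraint_name) for every rule a row breaks."""
    return [
        (i, c.name)
        for i, row in enumerate(rows)
        for c in constraints
        if not c.check(row)
    ]


rows = [
    {"score": 42, "start_date": date(2023, 6, 1), "end_date": date(2023, 6, 12)},
    {"score": 101, "start_date": date(2023, 6, 1), "end_date": date(2023, 6, 12)},
    {"score": 7, "start_date": date(2023, 6, 12), "end_date": date(2023, 6, 1)},
]

bad = violations(rows, CONSTRAINTS)
```

A writer could refuse the batch when `bad` is non-empty, or route the
offending rows to a quarantine location; which of the two is appropriate is
a policy decision the sketch does not make.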

HTH

-Deepak

On Tue, 13 Jun 2023 at 12:48 AM, Elliot West  wrote:

> Hi Phillip,
>
> While not as fine-grained as your example, there do exist schema systems
> such as that in Avro that can evaluate compatible and incompatible
> changes to the schema, from the perspective of the reader, writer, or both.
> This provides some potential degree of enforcement, and means to
> communicate a contract. Interestingly I believe this approach has been
> applied to both JsonSchema and protobuf as part of the Confluent Schema
> registry.
>
> Elliot.
>
> On Mon, 12 Jun 2023 at 12:43, Phillip Henry 
> wrote:
>
>> Hi, folks.
>>
>> There currently seems to be a buzz around "data contracts". From what I
>> can tell, these mainly advocate a cultural solution. But instead, could big
>> data tools be used to enforce these contracts?
>>
>> My questions really are: are there any plans to implement data
>> constraints in Spark (eg, an integer must be between 0 and 100; the date in
>> column X must be before that in column Y)? And if not, is there an appetite
>> for them?
>>
>> Maybe we could associate constraints with schema metadata that are
>> enforced in the implementation of a FileFormatDataWriter?
>>
>> Just throwing it out there and wondering what other people think. It's an
>> area that interests me as it seems that over half my problems at the day
>> job are because of dodgy data.
>>
>> Regards,
>>
>> Phillip
>>
>>


Re: [VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)

2023-06-12 Thread Jungtaek Lim
I concur with Holden and Mridul. Let's build a plan before we set the
tentative deadline. I understand that setting a tentative deadline would
definitely help in pushing back features that "never end", but we should
at least list the candidate features and discuss their priority. It is
still possible that, based on that discussion, we will want to treat some
features as hard blockers for the release.

On Tue, Jun 13, 2023 at 10:58 AM Mridul Muralidharan 
wrote:

>
> I agree with Holden, we should have some understanding of what we are
> targeting for 4.0, given it is a major ver bump - and work from there on
> the release date.
>
> Regards,
> Mridul
>
> On Mon, Jun 12, 2023 at 8:53 PM Jia Fan  wrote:
>
>> By the way, like Holden said, what are the big features for 4.0.0? I think
>> a major version change always brings some differences.
>>
>> On Tue, Jun 13, 2023 at 08:25, Jia Fan wrote:
>>
>>> +1
>>>
>>> 
>>>
>>> Jia Fan
>>>
>>>
>>>
>>> On Jun 13, 2023, at 03:51, Chao Sun wrote:
>>>
>>> +1
>>>
>>> On Mon, Jun 12, 2023 at 12:50 PM kazuyuki tanimura
>>>  wrote:
>>>
>>>> +1 (non-binding)
>>>>
>>>> Thank you!
>>>> Kazu
>>>>
>>>>
>>>> On Jun 12, 2023, at 11:32 AM, Holden Karau wrote:
>>>>
>>>> -0
>>>>
>>>> I'd like to see more of a doc around what we're planning on for a 4.0
>>>> before we pick a target release date etc. (feels like cart before the
>>>> horse).
>>>>
>>>> But it's a weak preference.
>>>>
>>>> On Mon, Jun 12, 2023 at 11:24 AM Xiao Li wrote:
>>>>
> Thanks for starting the vote.
>
> I do have a concern about the target release date of Spark 4.0.
>
> On Mon, Jun 12, 2023 at 11:09, L. C. Hsieh wrote:
>
>> +1
>>
>> On Mon, Jun 12, 2023 at 11:06 AM huaxin gao 
>> wrote:
>> >
>> > +1
>> >
>> > On Mon, Jun 12, 2023 at 11:05 AM Dongjoon Hyun 
>> wrote:
>> >>
>> >> +1
>> >>
>> >> Dongjoon
>> >>
>> >> On 2023/06/12 18:00:38 Dongjoon Hyun wrote:
>> >> > Please vote on the release plan for Apache Spark 4.0.0.
>> >> >
>> >> > The vote is open until June 16th 1AM (PST) and passes if a
>> majority +1 PMC
>> >> > votes are cast, with a minimum of 3 +1 votes.
>> >> >
>> >> > [ ] +1 Have a release plan for Apache Spark 4.0.0 (June 2024)
>> >> > [ ] -1 Do not have a plan for Apache Spark 4.0.0 because ...
>> >> >
>> >> > ===
>> >> > Apache Spark 4.0.0 Release Plan
>> >> > ===
>> >> >
>> >> > 1. After creating `branch-3.5`, set "4.0.0-SNAPSHOT" in master
>> branch.
>> >> >
>> >> > 2. Creating `branch-4.0` on April 1st, 2024.
>> >> >
>> >> > 3. Apache Spark 4.0.0 RC1 on May 1st, 2024.
>> >> >
>> >> > 4. Apache Spark 4.0.0 Release in June, 2024.
>> >> >
>> >>
>> >>
>> -
>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>

>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau



>>>


Re: [VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)

2023-06-12 Thread Mridul Muralidharan
I agree with Holden, we should have some understanding of what we are
targeting for 4.0, given it is a major ver bump - and work from there on
the release date.

Regards,
Mridul

On Mon, Jun 12, 2023 at 8:53 PM Jia Fan  wrote:

> By the way, like Holden said, what are the big features for 4.0.0? I think
> a major version change always brings some differences.
>
> On Tue, Jun 13, 2023 at 08:25, Jia Fan wrote:
>
>> +1
>>
>> 
>>
>> Jia Fan
>>
>>
>>
>> On Jun 13, 2023, at 03:51, Chao Sun wrote:
>>
>> +1
>>
>> On Mon, Jun 12, 2023 at 12:50 PM kazuyuki tanimura
>>  wrote:
>>
>>> +1 (non-binding)
>>>
>>> Thank you!
>>> Kazu
>>>
>>>
>>> On Jun 12, 2023, at 11:32 AM, Holden Karau  wrote:
>>>
>>> -0
>>>
>>> I'd like to see more of a doc around what we're planning on for a 4.0
>>> before we pick a target release date etc. (feels like cart before the
>>> horse).
>>>
>>> But it's a weak preference.
>>>
>>> On Mon, Jun 12, 2023 at 11:24 AM Xiao Li  wrote:
>>>
>>>> Thanks for starting the vote.
>>>>
>>>> I do have a concern about the target release date of Spark 4.0.
>>>>
>>>> On Mon, Jun 12, 2023 at 11:09, L. C. Hsieh wrote:
>>>>
> +1
>
> On Mon, Jun 12, 2023 at 11:06 AM huaxin gao 
> wrote:
> >
> > +1
> >
> > On Mon, Jun 12, 2023 at 11:05 AM Dongjoon Hyun 
> wrote:
> >>
> >> +1
> >>
> >> Dongjoon
> >>
> >> On 2023/06/12 18:00:38 Dongjoon Hyun wrote:
> >> > Please vote on the release plan for Apache Spark 4.0.0.
> >> >
> >> > The vote is open until June 16th 1AM (PST) and passes if a
> majority +1 PMC
> >> > votes are cast, with a minimum of 3 +1 votes.
> >> >
> >> > [ ] +1 Have a release plan for Apache Spark 4.0.0 (June 2024)
> >> > [ ] -1 Do not have a plan for Apache Spark 4.0.0 because ...
> >> >
> >> > ===
> >> > Apache Spark 4.0.0 Release Plan
> >> > ===
> >> >
> >> > 1. After creating `branch-3.5`, set "4.0.0-SNAPSHOT" in master
> branch.
> >> >
> >> > 2. Creating `branch-4.0` on April 1st, 2024.
> >> >
> >> > 3. Apache Spark 4.0.0 RC1 on May 1st, 2024.
> >> >
> >> > 4. Apache Spark 4.0.0 Release in June, 2024.
> >> >
> >>
> >>
> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>>
>>>
>>


Re: [VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)

2023-06-12 Thread Jia Fan
By the way, like Holden said, what are the big features for 4.0.0? I think
a major version change always brings some differences.

On Tue, Jun 13, 2023 at 08:25, Jia Fan wrote:

> +1
>
> 
>
> Jia Fan
>
>
>
> On Jun 13, 2023, at 03:51, Chao Sun wrote:
>
> +1
>
> On Mon, Jun 12, 2023 at 12:50 PM kazuyuki tanimura
>  wrote:
>
>> +1 (non-binding)
>>
>> Thank you!
>> Kazu
>>
>>
>> On Jun 12, 2023, at 11:32 AM, Holden Karau  wrote:
>>
>> -0
>>
>> I'd like to see more of a doc around what we're planning on for a 4.0
>> before we pick a target release date etc. (feels like cart before the
>> horse).
>>
>> But it's a weak preference.
>>
>> On Mon, Jun 12, 2023 at 11:24 AM Xiao Li  wrote:
>>
>>> Thanks for starting the vote.
>>>
>>> I do have a concern about the target release date of Spark 4.0.
>>>
>>> On Mon, Jun 12, 2023 at 11:09, L. C. Hsieh wrote:
>>>
>>>> +1
>>>>
>>>> On Mon, Jun 12, 2023 at 11:06 AM huaxin gao wrote:
>>>> >
>>>> > +1
>>>> >
>>>> > On Mon, Jun 12, 2023 at 11:05 AM Dongjoon Hyun wrote:
>>>> >>
>>>> >> +1
>>>> >>
>>>> >> Dongjoon
>>>> >>
>>>> >> On 2023/06/12 18:00:38 Dongjoon Hyun wrote:
>>>> >> > Please vote on the release plan for Apache Spark 4.0.0.
>>>> >> >
>>>> >> > The vote is open until June 16th 1AM (PST) and passes if a majority +1 PMC
>>>> >> > votes are cast, with a minimum of 3 +1 votes.
>>>> >> >
>>>> >> > [ ] +1 Have a release plan for Apache Spark 4.0.0 (June 2024)
>>>> >> > [ ] -1 Do not have a plan for Apache Spark 4.0.0 because ...
>>>> >> >
>>>> >> > ===
>>>> >> > Apache Spark 4.0.0 Release Plan
>>>> >> > ===
>>>> >> >
>>>> >> > 1. After creating `branch-3.5`, set "4.0.0-SNAPSHOT" in master branch.
>>>> >> >
>>>> >> > 2. Creating `branch-4.0` on April 1st, 2024.
>>>> >> >
>>>> >> > 3. Apache Spark 4.0.0 RC1 on May 1st, 2024.
>>>> >> >
>>>> >> > 4. Apache Spark 4.0.0 Release in June, 2024.
>>>> >> >
>>>> >>
>>>> >> -
>>>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>> >>
>>>>
>>>> -
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>>
>>
>


Re: [VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)

2023-06-12 Thread Jia Fan
+1



Jia Fan



> On Jun 13, 2023, at 03:51, Chao Sun wrote:
> 
> +1
> 
> On Mon, Jun 12, 2023 at 12:50 PM kazuyuki tanimura 
>  wrote:
>> +1 (non-binding)
>> 
>> Thank you!
>> Kazu
>> 
>> 
>>> On Jun 12, 2023, at 11:32 AM, Holden Karau wrote:
>>> 
>>> -0
>>> 
>>> I'd like to see more of a doc around what we're planning on for a 4.0 
>>> before we pick a target release date etc. (feels like cart before the 
>>> horse).
>>> 
>>> But it's a weak preference.
>>> 
>>> On Mon, Jun 12, 2023 at 11:24 AM Xiao Li wrote:
>>>> Thanks for starting the vote.
>>>>
>>>> I do have a concern about the target release date of Spark 4.0.
>>>>
>>>> On Mon, Jun 12, 2023 at 11:09, L. C. Hsieh wrote:
> +1
> 
> On Mon, Jun 12, 2023 at 11:06 AM huaxin gao wrote:
> >
> > +1
> >
> > On Mon, Jun 12, 2023 at 11:05 AM Dongjoon Hyun wrote:
> >>
> >> +1
> >>
> >> Dongjoon
> >>
> >> On 2023/06/12 18:00:38 Dongjoon Hyun wrote:
> >> > Please vote on the release plan for Apache Spark 4.0.0.
> >> >
> >> > The vote is open until June 16th 1AM (PST) and passes if a majority 
> >> > +1 PMC
> >> > votes are cast, with a minimum of 3 +1 votes.
> >> >
> >> > [ ] +1 Have a release plan for Apache Spark 4.0.0 (June 2024)
> >> > [ ] -1 Do not have a plan for Apache Spark 4.0.0 because ...
> >> >
> >> > ===
> >> > Apache Spark 4.0.0 Release Plan
> >> > ===
> >> >
> >> > 1. After creating `branch-3.5`, set "4.0.0-SNAPSHOT" in master 
> >> > branch.
> >> >
> >> > 2. Creating `branch-4.0` on April 1st, 2024.
> >> >
> >> > 3. Apache Spark 4.0.0 RC1 on May 1st, 2024.
> >> >
> >> > 4. Apache Spark 4.0.0 Release in June, 2024.
> >> >
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
> >> 
> >>
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
> 
> 
>>> 
>>> 
>>> -- 
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.): 
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> 



Re: Spark on Kube (virtual) coffee/tea/pop times

2023-06-12 Thread Mich Talebzadeh
Hi all,

Has there been any progress on the following items summarised by Holden?

   - Inter-Pod security, istio + mTLS
   - Sidecar management
   - Docker Images
   - Add links to more related images
      - Helm links
   - Data Locality concerns
   - Upgrading Spark Versions
   - Performance issues


I may have missed some context.

Thanks

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 13 Feb 2023 at 08:24, Holden Karau  wrote:

> Some general issues we found common ground around:
>
> Inter-Pod security, istio + mTLS
> Sidecar management
> Docker Images
> Add links to more related images
> - Helm links
> Data Locality concerns
> Upgrading  Spark Versions
> Performance issues
>
> Thanks to everyone who was able to make the informal coffee chat
>
> I'll try and schedule another one at a more European friendly time so that
> we can all get to chat as well.
>
> On Fri, Feb 10, 2023 at 1:08 PM Mich Talebzadeh 
> wrote:
>
>> Great looking forward to it
>>
>> Mich
>>
>> On Fri, 10 Feb 2023 at 18:58, Holden Karau  wrote:
>>
>>> Ok so the first iteration of this is booked:
>>>
>>>
>>> Spark on Kube Coffee Chats
>>> Sunday, Feb 12 · 6–7 PM pacific time
>>> Google Meet joining info
>>> Video call link: https://meet.google.com/wge-tzzd-uyj
>>>
>>> Assuming that all goes well I’ll send out another Doodle poll after this
>>> one for the folks who could not make this one.
>>>
>>> Looking forward to catching up with y’all :) No prep work necessary but
>>> if anyone wants to write down a brief like two sentence blurb about their
>>> goals for Spark on Kube was thinking we might go around the virtual room
>>> sharing that as our kicking off point for this coffee meeting :)
>>>
>>>
>>> On Wed, Feb 8, 2023 at 12:27 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> That sounds like a good plan Holden!
>>>>
>>>> Let us go for it
>>>>
>>>> On Wed, 8 Feb 2023 at 20:12, Holden Karau wrote:
>>>>
> My thought here was that it's more focused on getting to understand
> each other's goals / priorities and less solving any specific problem.
>
> For example, I know that some folks running on EKS have different
> priorities than folks running on-prem.
>
> We might (later on) make a roadmap doc if that seems necessary, but
> I'm hoping that just an understanding of folks priorities and challenges
> will make it easier for us to all collaborate.
>
> On Wed, Feb 8, 2023 at 11:47 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi all,
>>
>> Is this going to be a brainstorming meeting or there will be a prior
>> agenda to work around it?
>>
>> thanks
>>
>> On Wed, 8 Feb 2023 at 18:33, Mich Talebzadeh <
>> mich.talebza...@gmail.com> 

Re: [VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)

2023-06-12 Thread Chao Sun
+1

On Mon, Jun 12, 2023 at 12:50 PM kazuyuki tanimura
 wrote:

> +1 (non-binding)
>
> Thank you!
> Kazu
>
>
> On Jun 12, 2023, at 11:32 AM, Holden Karau  wrote:
>
> -0
>
> I'd like to see more of a doc around what we're planning on for a 4.0
> before we pick a target release date etc. (feels like cart before the
> horse).
>
> But it's a weak preference.
>
> On Mon, Jun 12, 2023 at 11:24 AM Xiao Li  wrote:
>
>> Thanks for starting the vote.
>>
>> I do have a concern about the target release date of Spark 4.0.
>>
>> On Mon, Jun 12, 2023 at 11:09, L. C. Hsieh wrote:
>>
>>> +1
>>>
>>> On Mon, Jun 12, 2023 at 11:06 AM huaxin gao 
>>> wrote:
>>> >
>>> > +1
>>> >
>>> > On Mon, Jun 12, 2023 at 11:05 AM Dongjoon Hyun 
>>> wrote:
>>> >>
>>> >> +1
>>> >>
>>> >> Dongjoon
>>> >>
>>> >> On 2023/06/12 18:00:38 Dongjoon Hyun wrote:
>>> >> > Please vote on the release plan for Apache Spark 4.0.0.
>>> >> >
>>> >> > The vote is open until June 16th 1AM (PST) and passes if a majority
>>> +1 PMC
>>> >> > votes are cast, with a minimum of 3 +1 votes.
>>> >> >
>>> >> > [ ] +1 Have a release plan for Apache Spark 4.0.0 (June 2024)
>>> >> > [ ] -1 Do not have a plan for Apache Spark 4.0.0 because ...
>>> >> >
>>> >> > ===
>>> >> > Apache Spark 4.0.0 Release Plan
>>> >> > ===
>>> >> >
>>> >> > 1. After creating `branch-3.5`, set "4.0.0-SNAPSHOT" in master
>>> branch.
>>> >> >
>>> >> > 2. Creating `branch-4.0` on April 1st, 2024.
>>> >> >
>>> >> > 3. Apache Spark 4.0.0 RC1 on May 1st, 2024.
>>> >> >
>>> >> > 4. Apache Spark 4.0.0 Release in June, 2024.
>>> >> >
>>> >>
>>> >> -
>>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >>
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
>


Re: [VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)

2023-06-12 Thread kazuyuki tanimura
+1 (non-binding)

Thank you!
Kazu


> On Jun 12, 2023, at 11:32 AM, Holden Karau  wrote:
> 
> -0
> 
> I'd like to see more of a doc around what we're planning on for a 4.0 before 
> we pick a target release date etc. (feels like cart before the horse).
> 
> But it's a weak preference.
> 
> On Mon, Jun 12, 2023 at 11:24 AM Xiao Li wrote:
>> Thanks for starting the vote. 
>> 
>> I do have a concern about the target release date of Spark 4.0. 
>> 
>> On Mon, Jun 12, 2023 at 11:09, L. C. Hsieh wrote:
>>> +1
>>> 
>>> On Mon, Jun 12, 2023 at 11:06 AM huaxin gao wrote:
>>> >
>>> > +1
>>> >
>>> > On Mon, Jun 12, 2023 at 11:05 AM Dongjoon Hyun wrote:
>>> >>
>>> >> +1
>>> >>
>>> >> Dongjoon
>>> >>
>>> >> On 2023/06/12 18:00:38 Dongjoon Hyun wrote:
>>> >> > Please vote on the release plan for Apache Spark 4.0.0.
>>> >> >
>>> >> > The vote is open until June 16th 1AM (PST) and passes if a majority +1 
>>> >> > PMC
>>> >> > votes are cast, with a minimum of 3 +1 votes.
>>> >> >
>>> >> > [ ] +1 Have a release plan for Apache Spark 4.0.0 (June 2024)
>>> >> > [ ] -1 Do not have a plan for Apache Spark 4.0.0 because ...
>>> >> >
>>> >> > ===
>>> >> > Apache Spark 4.0.0 Release Plan
>>> >> > ===
>>> >> >
>>> >> > 1. After creating `branch-3.5`, set "4.0.0-SNAPSHOT" in master branch.
>>> >> >
>>> >> > 2. Creating `branch-4.0` on April 1st, 2024.
>>> >> >
>>> >> > 3. Apache Spark 4.0.0 RC1 on May 1st, 2024.
>>> >> >
>>> >> > 4. Apache Spark 4.0.0 Release in June, 2024.
>>> >> >
>>> >>
>>> >> -
>>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
>>> >> 
>>> >>
>>> 
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
>>> 
>>> 
> 
> 
> -- 
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 
>  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau



Re: Gauging interest in: ScalaFix + Scala Steward for Spark 4.0

2023-06-12 Thread Dongjoon Hyun
Holden, I agree with you to a large extent: this is a chicken-and-egg
situation.
The Spark v4.0 release is a really big one, isn't it?

(1) First, do you think the proposed items are 'BLOCKER'-level Apache Spark 4.0
JIRA items? May I ask why you think so? If I understand more, I can help with
them. For now, as I wrote in "Apache Spark 4.0 Timeframe," there are some
main issues where multiple PMC members called out the necessity of a Spark 4.0.0
release as the only feasible path forward. At that level, although I support
the proposed items, they look like nice-to-have items to me because we can
try them in Apache Spark 3.5.0 too if they are ready and mature. To be clear,
I'm open to the PRs and interested in what we are going to have
with them, Holden.

https://lists.apache.org/thread/3x6dh17bmy20n3frtt3crgxjydnxh2o0
(Apache Spark 3.5.0 Expectations?)

https://lists.apache.org/thread/xhkgj60j361gdpywoxxz7qspp2w80ry6
(Apache Spark 4.0 Timeframe?)


(2) Second, back to the vote, the reason why I proposed the vote on the '4.0.0
Plan' is that planning itself has not been considered by default in the
community.

https://lists.apache.org/thread/k7gr65wt0fwtldc7hp7bd0vkg1k93rrb
(ASF policy violation and Scala version issues)

> I will start a vote for Apache Spark 4.0.0 timeframe next week after
> receiving more feedback.
> Since 4.0.0 is not limited to the Scala issues, we will vote on the
> timeline only.


I'm one of the people who feel a responsibility to provide a way to escape this
deadlock situation in the community. It's too easy to say 'No in 3.x era' or to
say 'No 4.0 until my patch is in the promised land'. Without a consensus that we
will have a 4.0 release, we are only daydreaming about a non-existent Spark 4.0
without any effort and without knowing why we are blocked. I believe 'the vote
on the plan' unleashes the community release train and re-ignites all the
discussion about the previously prohibited items.

https://lists.apache.org/thread/r0zn6rd8y25yn2dg59ktw3ttrwxzqrfb
([VOTE] Release Plan for Apache Spark 4.0.0 (June 2024))


Holden, could you look at it this way too?

Thanks,
Dongjoon.


On 2023/06/12 18:57:32 Holden Karau wrote:
> Yup I think building consensus on what goes in 4.X is something we’ll need
> to do.
> 
> On Mon, Jun 12, 2023 at 11:56 AM Dongjoon Hyun 
> wrote:
> 
> > Thank you for sharing those. I'm also interested in taking advantage of
> > it. Also, I hope `spark-upgrade` can help us in line with Spark 4.0.
> >
> > However, we don't need to discuss any of this if we don't build a
> > consensus on both Spark 4.0 or next Scala version.
> >
> > We don't have a vehicle at all to reach there yet.
> >
> > In the community, I saw a bottleneck; "No in 3.x era" and "No for 4.0 yet
> > because XXX"
> >
> > Dongjoon.
> >
> > --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Data Contracts

2023-06-12 Thread Elliot West
Hi Phillip,

While not as fine-grained as your example, there do exist schema systems
such as that in Avro that can evaluate compatible and incompatible
changes to the schema, from the perspective of the reader, writer, or both.
This provides some potential degree of enforcement, and means to
communicate a contract. Interestingly I believe this approach has been
applied to both JsonSchema and protobuf as part of the Confluent Schema
registry.
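
As a rough illustration of that kind of check, the sketch below answers "can
a reader with schema R decode data written with schema W?" using a simplified
subset of Avro's schema-resolution rules (reader-side defaults for missing
fields, numeric and string/bytes promotion). It is plain Python, not the Avro
or Schema Registry API; the dict-based schema representation and the names
PROMOTIONS, can_read, v1 and v2 are assumptions made for the example.

```python
# Simplified sketch of Avro-style schema resolution.
# Schemas are plain dicts {field: {"type": ..., optional "default": ...}},
# an assumption made for brevity; real Avro schemas are richer.

# Writer type -> reader types it may be promoted to (per the Avro spec).
PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
    "string": {"bytes"},
    "bytes": {"string"},
}


def can_read(reader: dict, writer: dict) -> list:
    """Return a list of problems; an empty list means the reader is compatible."""
    problems = []
    for name, spec in reader.items():
        if name not in writer:
            # A field the writer never wrote needs a reader-side default.
            if "default" not in spec:
                problems.append(f"{name}: missing in writer and has no default")
            continue
        w_type, r_type = writer[name]["type"], spec["type"]
        if w_type != r_type and r_type not in PROMOTIONS.get(w_type, set()):
            problems.append(f"{name}: cannot read {w_type} as {r_type}")
    # Fields that only the writer has are simply ignored by the reader.
    return problems


v1 = {"id": {"type": "int"}, "name": {"type": "string"}}
v2 = {
    "id": {"type": "long"},
    "name": {"type": "string"},
    "email": {"type": "string", "default": ""},
}

backward = can_read(reader=v2, writer=v1)  # new reader, old data
forward = can_read(reader=v1, writer=v2)   # old reader, new data
```

Here the change from v1 to v2 is fine for new readers of old data but not for
old readers of new data, because a long cannot be narrowed to an int; that
asymmetry is the reader/writer perspective mentioned above.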

Elliot.

On Mon, 12 Jun 2023 at 12:43, Phillip Henry  wrote:

> Hi, folks.
>
> There currently seems to be a buzz around "data contracts". From what I
> can tell, these mainly advocate a cultural solution. But instead, could big
> data tools be used to enforce these contracts?
>
> My questions really are: are there any plans to implement data constraints
> in Spark (eg, an integer must be between 0 and 100; the date in column X
> must be before that in column Y)? And if not, is there an appetite for them?
>
> Maybe we could associate constraints with schema metadata that are
> enforced in the implementation of a FileFormatDataWriter?
>
> Just throwing it out there and wondering what other people think. It's an
> area that interests me as it seems that over half my problems at the day
> job are because of dodgy data.
>
> Regards,
>
> Phillip
>
>


Re: Gauging interest in: ScalaFix + Scala Steward for Spark 4.0

2023-06-12 Thread Holden Karau
Yup I think building consensus on what goes in 4.X is something we’ll need
to do.

On Mon, Jun 12, 2023 at 11:56 AM Dongjoon Hyun 
wrote:

> Thank you for sharing those. I'm also interested in taking advantage of
> it. Also, I hope `spark-upgrade` can help us in line with Spark 4.0.
>
> However, we don't need to discuss any of this if we don't build a
> consensus on both Spark 4.0 or next Scala version.
>
> We don't have a vehicle at all to reach there yet.
>
> In the community, I saw a bottleneck; "No in 3.x era" and "No for 4.0 yet
> because XXX"
>
> Dongjoon.
>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Gauging interest in: ScalaFix + Scala Steward for Spark 4.0

2023-06-12 Thread Dongjoon Hyun
Thank you for sharing those. I'm also interested in taking advantage of it.
Also, I hope `spark-upgrade` can help us in line with Spark 4.0.

However, we don't need to discuss any of this if we don't build a consensus
on both Spark 4.0 and the next Scala version.

We don't have a vehicle at all to get there yet.

In the community, I saw a bottleneck: "No in 3.x era" and "No for 4.0 yet
because XXX".

Dongjoon.


Gauging interest in: ScalaFix + Scala Steward for Spark 4.0

2023-06-12 Thread Holden Karau
A few folks and I have been working on a spark-upgrade project
(focused on getting folks onto current versions of Spark). Since it looks
like we're starting the discussion around Spark 4, I was thinking now could
be a good time for us to consider whether we want to try to integrate
auto-upgrade rules like some other Scala projects.

Context:
- https://github.com/scala-steward-org/scala-steward
- https://scalacenter.github.io/scalafix/
- https://github.com/holdenk/spark-upgrade

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)

2023-06-12 Thread Holden Karau
-0

I'd like to see more of a doc around what we're planning on for a 4.0
before we pick a target release date etc. (feels like cart before the
horse).

But it's a weak preference.

On Mon, Jun 12, 2023 at 11:24 AM Xiao Li  wrote:

> Thanks for starting the vote.
>
> I do have a concern about the target release date of Spark 4.0.
>
> On Mon, Jun 12, 2023 at 11:09, L. C. Hsieh wrote:
>
>> +1
>>
>> On Mon, Jun 12, 2023 at 11:06 AM huaxin gao 
>> wrote:
>> >
>> > +1
>> >
>> > On Mon, Jun 12, 2023 at 11:05 AM Dongjoon Hyun 
>> wrote:
>> >>
>> >> +1
>> >>
>> >> Dongjoon
>> >>
>> >> On 2023/06/12 18:00:38 Dongjoon Hyun wrote:
>> >> > Please vote on the release plan for Apache Spark 4.0.0.
>> >> >
>> >> > The vote is open until June 16th 1AM (PST) and passes if a majority
>> +1 PMC
>> >> > votes are cast, with a minimum of 3 +1 votes.
>> >> >
>> >> > [ ] +1 Have a release plan for Apache Spark 4.0.0 (June 2024)
>> >> > [ ] -1 Do not have a plan for Apache Spark 4.0.0 because ...
>> >> >
>> >> > ===
>> >> > Apache Spark 4.0.0 Release Plan
>> >> > ===
>> >> >
>> >> > 1. After creating `branch-3.5`, set "4.0.0-SNAPSHOT" in master
>> branch.
>> >> >
>> >> > 2. Creating `branch-4.0` on April 1st, 2024.
>> >> >
>> >> > 3. Apache Spark 4.0.0 RC1 on May 1st, 2024.
>> >> >
>> >> > 4. Apache Spark 4.0.0 Release in June, 2024.
>> >> >
>> >>
>> >> -
>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)

2023-06-12 Thread Xiao Li
Thanks for starting the vote.

I do have a concern about the target release date of Spark 4.0.

On Mon, Jun 12, 2023 at 11:09, L. C. Hsieh wrote:

> +1
>
> On Mon, Jun 12, 2023 at 11:06 AM huaxin gao 
> wrote:
> >
> > +1
> >
> > On Mon, Jun 12, 2023 at 11:05 AM Dongjoon Hyun 
> wrote:
> >>
> >> +1
> >>
> >> Dongjoon
> >>
> >> On 2023/06/12 18:00:38 Dongjoon Hyun wrote:
> >> > Please vote on the release plan for Apache Spark 4.0.0.
> >> >
> >> > The vote is open until June 16th 1AM (PST) and passes if a majority
> +1 PMC
> >> > votes are cast, with a minimum of 3 +1 votes.
> >> >
> >> > [ ] +1 Have a release plan for Apache Spark 4.0.0 (June 2024)
> >> > [ ] -1 Do not have a plan for Apache Spark 4.0.0 because ...
> >> >
> >> > ===
> >> > Apache Spark 4.0.0 Release Plan
> >> > ===
> >> >
> >> > 1. After creating `branch-3.5`, set "4.0.0-SNAPSHOT" in master branch.
> >> >
> >> > 2. Creating `branch-4.0` on April 1st, 2024.
> >> >
> >> > 3. Apache Spark 4.0.0 RC1 on May 1st, 2024.
> >> >
> >> > 4. Apache Spark 4.0.0 Release in June, 2024.
> >> >
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)

2023-06-12 Thread L. C. Hsieh
+1

On Mon, Jun 12, 2023 at 11:06 AM huaxin gao  wrote:
>
> +1
>
> On Mon, Jun 12, 2023 at 11:05 AM Dongjoon Hyun  wrote:
>>
>> +1
>>
>> Dongjoon
>>
>> On 2023/06/12 18:00:38 Dongjoon Hyun wrote:
>> > Please vote on the release plan for Apache Spark 4.0.0.
>> >
>> > The vote is open until June 16th 1AM (PST) and passes if a majority +1 PMC
>> > votes are cast, with a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Have a release plan for Apache Spark 4.0.0 (June 2024)
>> > [ ] -1 Do not have a plan for Apache Spark 4.0.0 because ...
>> >
>> > ===
>> > Apache Spark 4.0.0 Release Plan
>> > ===
>> >
>> > 1. After creating `branch-3.5`, set "4.0.0-SNAPSHOT" in master branch.
>> >
>> > 2. Creating `branch-4.0` on April 1st, 2024.
>> >
>> > 3. Apache Spark 4.0.0 RC1 on May 1st, 2024.
>> >
>> > 4. Apache Spark 4.0.0 Release in June, 2024.
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)

2023-06-12 Thread huaxin gao
+1

On Mon, Jun 12, 2023 at 11:05 AM Dongjoon Hyun  wrote:

> +1
>
> Dongjoon
>
> On 2023/06/12 18:00:38 Dongjoon Hyun wrote:
> > Please vote on the release plan for Apache Spark 4.0.0.
> >
> > The vote is open until June 16th 1AM (PST) and passes if a majority +1
> PMC
> > votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Have a release plan for Apache Spark 4.0.0 (June 2024)
> > [ ] -1 Do not have a plan for Apache Spark 4.0.0 because ...
> >
> > ===
> > Apache Spark 4.0.0 Release Plan
> > ===
> >
> > 1. After creating `branch-3.5`, set "4.0.0-SNAPSHOT" in master branch.
> >
> > 2. Creating `branch-4.0` on April 1st, 2024.
> >
> > 3. Apache Spark 4.0.0 RC1 on May 1st, 2024.
> >
> > 4. Apache Spark 4.0.0 Release in June, 2024.
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)

2023-06-12 Thread Dongjoon Hyun
+1

Dongjoon

On 2023/06/12 18:00:38 Dongjoon Hyun wrote:
> Please vote on the release plan for Apache Spark 4.0.0.
> 
> The vote is open until June 16th 1AM (PST) and passes if a majority +1 PMC
> votes are cast, with a minimum of 3 +1 votes.
> 
> [ ] +1 Have a release plan for Apache Spark 4.0.0 (June 2024)
> [ ] -1 Do not have a plan for Apache Spark 4.0.0 because ...
> 
> ===
> Apache Spark 4.0.0 Release Plan
> ===
> 
> 1. After creating `branch-3.5`, set "4.0.0-SNAPSHOT" in master branch.
> 
> 2. Creating `branch-4.0` on April 1st, 2024.
> 
> 3. Apache Spark 4.0.0 RC1 on May 1st, 2024.
> 
> 4. Apache Spark 4.0.0 Release in June, 2024.
> 




[VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)

2023-06-12 Thread Dongjoon Hyun
Please vote on the release plan for Apache Spark 4.0.0.

The vote is open until June 16th 1AM (PST) and passes if a majority +1 PMC
votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Have a release plan for Apache Spark 4.0.0 (June 2024)
[ ] -1 Do not have a plan for Apache Spark 4.0.0 because ...

===
Apache Spark 4.0.0 Release Plan
===

1. After creating `branch-3.5`, set "4.0.0-SNAPSHOT" in master branch.

2. Creating `branch-4.0` on April 1st, 2024.

3. Apache Spark 4.0.0 RC1 on May 1st, 2024.

4. Apache Spark 4.0.0 Release in June, 2024.


Re: Data Contracts

2023-06-12 Thread Ryan Blue
Hey Phillip,

You're right that we can improve tooling to help with data contracts, but I
think that a contract still needs to be an agreement between people.
Constraints help ensure that a data producer adheres to the contract, and
they give feedback as soon as possible when assumptions are violated. The
problem with treating constraints as the only contract is that they are too
easy to change. For example, if I change a required column to a nullable
column, that's a perfectly valid transition, but only if I've communicated
that change to downstream consumers.

Ryan

On Mon, Jun 12, 2023 at 4:43 AM Phillip Henry wrote:

> Hi, folks.
>
> There currently seems to be a buzz around "data contracts". From what I
> can tell, these mainly advocate a cultural solution. But instead, could big
> data tools be used to enforce these contracts?
>
> My questions really are: are there any plans to implement data constraints
> in Spark (eg, an integer must be between 0 and 100; the date in column X
> must be before that in column Y)? And if not, is there an appetite for them?
>
> Maybe we could associate constraints with schema metadata that are
> enforced in the implementation of a FileFormatDataWriter?
>
> Just throwing it out there and wondering what other people think. It's an
> area that interests me as it seems that over half my problems at the day
> job are because of dodgy data.
>
> Regards,
>
> Phillip
>
>

-- 
Ryan Blue
Tabular
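
Editor's sketch: the required-to-nullable example above could be surfaced by
tooling that compares two schema versions and lists the columns whose
nullability was relaxed, so the change can be communicated to downstream
consumers before it ships. This is plain Python under stated assumptions, not
a Spark or Iceberg API; `Field`, `relaxed_columns`, and the column names are
hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Field:
    """A column in one schema version (hypothetical type for illustration)."""
    name: str
    required: bool


def relaxed_columns(old: List[Field], new: List[Field]) -> List[str]:
    """Names of columns that were required in `old` but are nullable in `new`."""
    new_by_name: Dict[str, Field] = {f.name: f for f in new}
    return [
        f.name
        for f in old
        if f.required and f.name in new_by_name and not new_by_name[f.name].required
    ]


# Assumed example schemas: `email` goes from required to nullable.
v1 = [Field("id", required=True), Field("email", required=True)]
v2 = [Field("id", required=True), Field("email", required=False)]
changed = relaxed_columns(v1, v2)
```

A check like this does not decide whether the change is allowed; it only
makes the change visible, which is the part that still needs an agreement
between people.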


Data Contracts

2023-06-12 Thread Phillip Henry
Hi, folks.

There currently seems to be a buzz around "data contracts". From what I can
tell, these mainly advocate a cultural solution. But instead, could big
data tools be used to enforce these contracts?

My questions really are: are there any plans to implement data constraints
in Spark (eg, an integer must be between 0 and 100; the date in column X
must be before that in column Y)? And if not, is there an appetite for them?

Maybe we could associate constraints with schema metadata that are enforced
in the implementation of a FileFormatDataWriter?

Just throwing it out there and wondering what other people think. It's an
area that interests me as it seems that over half my problems at the day
job are because of dodgy data.

Regards,

Phillip
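
Editor's sketch: the two example constraints in this message (an integer
between 0 and 100; the date in column X before the date in column Y) can be
written as named row-level predicates that are evaluated before a write. The
code below is plain Python and does not use any Spark API; `Constraint`,
`check_rows`, and the column names `score`, `x`, and `y` are hypothetical
names chosen for illustration, not something the thread or Spark defines.

```python
from dataclasses import dataclass
from datetime import date
from typing import Any, Callable, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


@dataclass(frozen=True)
class Constraint:
    """A named row-level predicate (hypothetical helper, not a Spark class)."""
    name: str
    predicate: Callable[[Row], bool]


# The two examples from the message, with assumed column names.
CONSTRAINTS = [
    Constraint("score_between_0_and_100", lambda r: 0 <= r["score"] <= 100),
    Constraint("x_before_y", lambda r: r["x"] < r["y"]),
]


def check_rows(rows: Iterable[Row], constraints: List[Constraint]) -> List[Tuple[int, str]]:
    """Return (row_index, constraint_name) for every violated constraint."""
    violations = []
    for i, row in enumerate(rows):
        for c in constraints:
            if not c.predicate(row):
                violations.append((i, c.name))
    return violations


# Assumed sample data: the first row is valid, the second violates both rules.
rows = [
    {"score": 42, "x": date(2023, 1, 1), "y": date(2023, 6, 12)},
    {"score": 101, "x": date(2023, 6, 12), "y": date(2023, 1, 1)},
]
violations = check_rows(rows, CONSTRAINTS)
```

Keeping each constraint as a named predicate means a violation report can say
which rule failed on which row, rather than only that a write was rejected.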


Re: Apache Spark 3.4.1 Release?

2023-06-12 Thread beliefer
Dongjoon. Thank you.
There is an issue that should be fixed.
https://issues.apache.org/jira/browse/SPARK-44018

At 2023-06-12 13:22:30, "Dongjoon Hyun" wrote:

Thank you all.

I'll check and prepare `branch-3.4` for the target date, June 20th.

Dongjoon.

On Fri, Jun 9, 2023 at 10:47 PM yangjie01  wrote:


+1

 

Thank you Dongjoon ~

 

From: Ruifeng Zheng
Date: Saturday, June 10, 2023 09:39
To: Xiao Li
Cc: Wenchen Fan, Xinrong Meng, dev

Subject: Re: Apache Spark 3.4.1 Release?

 

+1

 

Thank you Dongjoon!

 

 

On Fri, Jun 9, 2023 at 11:54 PM Xiao Li  wrote:

+1

 

On Fri, Jun 9, 2023 at 08:30 Wenchen Fan  wrote:

+1

 

On Fri, Jun 9, 2023 at 8:52 PM Xinrong Meng  wrote:

+1. Thank you Dongjoon!

 

Thanks,

 

Xinrong Meng

 

On Fri, Jun 9, 2023 at 5:22 AM, Mridul Muralidharan wrote:

 

+1, thanks Dongjoon !

 

Regards,

Mridul 

 

On Thu, Jun 8, 2023 at 7:16 PM Jia Fan  wrote:

+1

Jia Fan

On Jun 9, 2023, at 08:00, Yuming Wang wrote:

 

+1.

 

On Fri, Jun 9, 2023 at 7:14 AM Chao Sun  wrote:

+1 too

On Thu, Jun 8, 2023 at 2:34 PM kazuyuki tanimura wrote:
>
> +1 (non-binding), Thank you Dongjoon
>
> Kazu
>


 

--

Re: ASF policy violation and Scala version issues

2023-06-12 Thread Dongjoon Hyun
Let me add my answers about a few Scala questions, Jungtaek.

> Are we concerned that a library does not release a new version
> which bumps the Scala version, which the Scala version is
> announced in less than a week?

No, our concern is the newly introduced limitation
in the Apache Spark Scala environment.



> Shall we respect the efforts of all maintainers of open source projects
> we use as dependencies, regardless whether they are ASF projects or
> individuals?

We not only respect all of those efforts; Yang Jie and I have also been
participating in those individual projects to help both them and us.
I believe we have collaborated there as best we can.


> Bumping a bugfix version is not always safe,
> especially for Scala where they use semver as one level down
> their minor version is almost another's major version
> (similar amount of pain on upgrading).

I agree with you in two ways.

1. Before we added the Ammonite dependency, the Apache Spark community
itself was one of the major Scala users who participated in new version
testing, and we gave active feedback to the Scala community. In addition, we
decided by ourselves whether or not to consume a new version. Now, the
Apache Spark community has lost that ability because the build fails at the
dependency downloading step. We are waiting because we don't have an
alternative. That's a big difference: being able to or not.

2. Again, I must reiterate that this is one of the reasons why I reported
the issue: "There is a company claiming something non-Apache like 'Apache
Spark 3.4.0 minus SPARK-40436' with the name 'Apache Spark 3.4.0.'"


Dongjoon.