Re: [DISCUSS] Release Spark 3.5.1?

2024-02-03 Thread Dongjoon Hyun
+1

On Sat, Feb 3, 2024 at 9:18 PM yangjie01 
wrote:

> +1
>
> On 2024/2/4 13:13, "Kent Yao" <y...@apache.org> wrote:
>
>
> +1
>
>
> On Sat, Feb 3, 2024 at 21:14, Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
> >
> > Hi dev,
> >
> > looks like there are a huge number of commits being pushed to branch-3.5
> after 3.5.0 was released, 200+ commits.
> >
> > $ git log --oneline v3.5.0..HEAD | wc -l
> > 202
> >
> > Also, there are 180 JIRA tickets containing 3.5.1 as fixed version, and
> 10 resolved issues are either marked as blocker (even correctness issues)
> or critical, which justifies the release.
> > https://issues.apache.org/jira/projects/SPARK/versions/12353495
> >
> > What do you think about releasing 3.5.1 with the current head of
> branch-3.5? I'm happy to volunteer as the release manager.
> >
> > Thanks,
> > Jungtaek Lim (HeartSaVioR)
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


Re: [DISCUSS] Release Spark 3.5.1?

2024-02-03 Thread yangjie01
+1

On 2024/2/4 13:13, "Kent Yao" <y...@apache.org> wrote:


+1


On Sat, Feb 3, 2024 at 21:14, Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>
> Hi dev,
>
> looks like there are a huge number of commits being pushed to branch-3.5 
> after 3.5.0 was released, 200+ commits.
>
> $ git log --oneline v3.5.0..HEAD | wc -l
> 202
>
> Also, there are 180 JIRA tickets containing 3.5.1 as fixed version, and 10 
> resolved issues are either marked as blocker (even correctness issues) or 
> critical, which justifies the release.
> https://issues.apache.org/jira/projects/SPARK/versions/12353495
>
> What do you think about releasing 3.5.1 with the current head of branch-3.5? 
> I'm happy to volunteer as the release manager.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)





Re: [DISCUSS] Release Spark 3.5.1?

2024-02-03 Thread Kent Yao
+1

On Sat, Feb 3, 2024 at 21:14, Jungtaek Lim wrote:
>
> Hi dev,
>
> looks like there are a huge number of commits being pushed to branch-3.5 
> after 3.5.0 was released, 200+ commits.
>
> $ git log --oneline v3.5.0..HEAD | wc -l
> 202
>
> Also, there are 180 JIRA tickets containing 3.5.1 as fixed version, and 10 
> resolved issues are either marked as blocker (even correctness issues) or 
> critical, which justifies the release.
> https://issues.apache.org/jira/projects/SPARK/versions/12353495
>
> What do you think about releasing 3.5.1 with the current head of branch-3.5? 
> I'm happy to volunteer as the release manager.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)




Re: Enhanced Console Sink for Structured Streaming

2024-02-03 Thread Neil Ramaswamy
Re: verbosity: yes, it will be more verbose. A config I was planning to
implement was a default-on console sink option, verboseMode, that you can
set to be off if you just want sink data. I don't think that introduces
additional complexity, as the last point suggests. (And also, nobody should
be using this for "high data throughput" scenarios or
"performance-sensitive applications". It's a development sink.)
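
To make the verboseMode idea a bit more concrete, here is a minimal sketch in
plain Python (not Spark code). The function render_batch and all of its
parameters are hypothetical names made up for illustration; only the option
name verboseMode comes from this thread, and nothing here reflects an
existing console sink API.

```python
def render_batch(batch_id, sink_rows, event_time=None, state_rows=None, verbose=True):
    """Return console text for one batch.

    Hypothetical helper: the metadata sections (event time, state rows)
    are rendered only when `verbose` is True, mimicking a default-on
    verboseMode option that can be switched off.
    """
    lines = [f"WRITES TO SINK (Batch = {batch_id})"]
    lines += [f"  {row}" for row in sink_rows]
    if verbose:
        lines.append("EVENT TIME")
        lines += [f"  {name} -> {value}" for name, value in (event_time or {}).items()]
        lines.append("STATE ROWS")
        lines += [f"  {key} -> {value}" for key, value in (state_rows or {}).items()]
    return "\n".join(lines)


# With verbose=False only the sink data is printed, as the sink does today.
print(render_batch(
    2,
    ["{10 seconds, 20 seconds} | 2"],
    event_time={"watermark": "21 seconds", "numDroppedRows": 0},
    state_rows={"{30 seconds, 40 seconds}": "{1}"},
    verbose=False,
))
```

The point of the sketch is only that a single flag is enough to fall back to
data-only output; how the real option would be wired into the sink is left
open.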

I don't think that exposing these details increases the learning curve:
these details are *essential* for understanding how Structured Streaming
works. I'd actually argue that it makes the learning curve shallower: by
showing the few variables that affect the behavior of their pipelines,
they'll have the conceptual understanding to answer essential questions
like "why aren't my results showing up?" or "why is my state size always
increasing?"

Also: for stateless pipelines, none of this event-time and state detail
applies. We would just render sink data—no behavior change from today. That
seems gentle enough to me: start with stateless pipelines and see
the output rows, but when you advance to stateful pipelines, you need to
deal with the two complexities (event-time and state) of stateful streaming.
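
For reference, the arithmetic behind the example in the original proposal
(10-second tumbling window, 15-second watermark) can be sketched in plain
Python. This is a toy model, not Spark's implementation: the class
ToyWindowedCount and its fields are hypothetical, and it applies the new
watermark within the same batch, which matches the narrative of the proposal
but simplifies when the engine actually applies an updated watermark.

```python
from dataclasses import dataclass, field

WINDOW_SECONDS = 10   # tumbling window length from the proposal's example
WATERMARK_DELAY = 15  # watermark delay from the proposal's example


@dataclass
class ToyWindowedCount:
    """Toy model of a windowed count with a watermark (hypothetical, not Spark)."""
    watermark: int = 0
    max_event_time: int = 0
    num_dropped_rows: int = 0
    state: dict = field(default_factory=dict)  # (start, end) -> count

    def process_batch(self, records):
        """Process (key, event_time) records; return windows emitted to the sink."""
        for _key, event_time in records:
            start = (event_time // WINDOW_SECONDS) * WINDOW_SECONDS
            window = (start, start + WINDOW_SECONDS)
            if window[1] <= self.watermark:
                # The window has already closed, so the record is too late.
                self.num_dropped_rows += 1
                continue
            self.state[window] = self.state.get(window, 0) + 1
            self.max_event_time = max(self.max_event_time, event_time)

        # Simplifying assumption: the watermark advances within this batch.
        self.watermark = max(self.watermark, self.max_event_time - WATERMARK_DELAY)

        # A window closes once its end is at or before the watermark.
        closed = sorted(w for w in self.state if w[1] <= self.watermark)
        return [(w, self.state.pop(w)) for w in closed]


query = ToyWindowedCount()
query.process_batch([("foo", 17), ("bar", 15)])  # batch 1: nothing closes yet
emitted = query.process_batch([("baz", 36)])     # batch 2
print(emitted, query.watermark, query.state)
# prints: [((10, 20), 2)] 21 {(30, 40): 1}
```

The record at 36 seconds moves the watermark to 36 - 15 = 21 seconds, which
closes the window ending at 20 seconds and leaves one row of state for the
window from 30 to 40 seconds.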

On Sat, Feb 3, 2024 at 3:08 AM Mich Talebzadeh 
wrote:

> Hi,
>
> As I understood, the proposal you mentioned suggests adding event-time
> and state store metadata to the console sink to better highlight the
> semantics of the Structured Streaming engine. While I agree this
> enhancement can provide valuable insights into the engine's behavior
> especially for newcomers, there are potential challenges that we need to be
> aware of:
>
> - Including additional metadata in the console sink output can increase
> the volume of information printed. This might result in a more verbose
> console output, making it harder to distinguish the actual data from the
> metadata, especially in scenarios with high data throughput.
> - Added verbosity: the proposed additional metadata may make the console
> output more verbose, potentially affecting its readability, especially for
> users who are primarily interested in the processed data and not the
> internal engine details.
> - Users unfamiliar with the internal workings of Structured Streaming
> might misinterpret the metadata as part of the actual data, leading to
> confusion.
> - The act of printing additional metadata to the console may introduce
> some overhead, especially in scenarios where high-frequency updates occur.
> While this overhead might be minimal, it is worth considering it in
> performance-sensitive applications.
> - While the proposal aims to make it easier for beginners to understand
> concepts like watermarks, operator state, and output rows, it could
> potentially increase the learning curve due to the introduction of
> additional terminology and information.
> - Users might benefit from the ability to selectively enable or disable
> the display of certain metadata elements to tailor the console output to
> their specific needs. However, this introduces additional complexity.
>
> As usual with these things, your mileage may vary. Whilst the proposed
> enhancements offer valuable insights into the behavior of Structured
> Streaming, we ought to think about the potential downsides, particularly in
> terms of increased verbosity, complexity, and the impact on user experience.
>
> HTH
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 3 Feb 2024 at 01:32, Neil Ramaswamy
>  wrote:
>
>> Hi all,
>>
>> I'd like to propose the idea of enhancing Structured Streaming's console
>> sink to print event-time metrics and state store data, in addition to the
>> sink's rows.
>>
>> I've noticed beginners often struggle to understand how watermarks,
>> operator state, and output rows are all intertwined. By printing all of
>> this information in the same place, I think that this sink will make it
>> easier for users to see—and our docs to explain—how these concepts work
>> together.
>>
>> For example, our docs could walk the users through a query with a
>> 10-second tumbling window aggregation (e.g. with a .count()) and a 15
>> second watermark. After processing something like (foo, 17) and (bar, 15),
>> writing another record (baz, 36) to the source would cause the following to
>> print for batch 2:
>>
>> +----------------------------------+
>> |    WRITES TO SINK (Batch = 2)    |
>>

[DISCUSS] Release Spark 3.5.1?

2024-02-03 Thread Jungtaek Lim
Hi dev,

looks like there are a huge number of commits being pushed to branch-3.5
after 3.5.0 was released, 200+ commits.

$ git log --oneline v3.5.0..HEAD | wc -l
202

Also, there are 180 JIRA tickets containing 3.5.1 as fixed version, and 10
resolved issues are either marked as blocker (even correctness issues) or
critical, which justifies the release.
https://issues.apache.org/jira/projects/SPARK/versions/12353495

What do you think about releasing 3.5.1 with the current head of
branch-3.5? I'm happy to volunteer as the release manager.

Thanks,
Jungtaek Lim (HeartSaVioR)


Re: Enhanced Console Sink for Structured Streaming

2024-02-03 Thread Mich Talebzadeh
Hi,

As I understood, the proposal you mentioned suggests adding event-time and
state store metadata to the console sink to better highlight the semantics
of the Structured Streaming engine. While I agree this enhancement can
provide valuable insights into the engine's behavior especially for
newcomers, there are potential challenges that we need to be aware of:

- Including additional metadata in the console sink output can increase the
volume of information printed. This might result in a more verbose console
output, making it harder to distinguish the actual data from the metadata,
especially in scenarios with high data throughput.
- Added verbosity: the proposed additional metadata may make the console
output more verbose, potentially affecting its readability, especially for
users who are primarily interested in the processed data and not the
internal engine details.
- Users unfamiliar with the internal workings of Structured Streaming might
misinterpret the metadata as part of the actual data, leading to confusion.
- The act of printing additional metadata to the console may introduce some
overhead, especially in scenarios where high-frequency updates occur. While
this overhead might be minimal, it is worth considering it in
performance-sensitive applications.
- While the proposal aims to make it easier for beginners to understand
concepts like watermarks, operator state, and output rows, it could
potentially increase the learning curve due to the introduction of
additional terminology and information.
- Users might benefit from the ability to selectively enable or disable the
display of certain metadata elements to tailor the console output to their
specific needs. However, this introduces additional complexity.

As usual with these things, your mileage may vary. Whilst the proposed
enhancements offer valuable insights into the behavior of Structured
Streaming, we ought to think about the potential downsides, particularly in
terms of increased verbosity, complexity, and the impact on user experience.

HTH
Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 3 Feb 2024 at 01:32, Neil Ramaswamy
 wrote:

> Hi all,
>
> I'd like to propose the idea of enhancing Structured Streaming's console
> sink to print event-time metrics and state store data, in addition to the
> sink's rows.
>
> I've noticed beginners often struggle to understand how watermarks,
> operator state, and output rows are all intertwined. By printing all of
> this information in the same place, I think that this sink will make it
> easier for users to see—and our docs to explain—how these concepts work
> together.
>
> For example, our docs could walk the users through a query with a
> 10-second tumbling window aggregation (e.g. with a .count()) and a 15
> second watermark. After processing something like (foo, 17) and (bar, 15),
> writing another record (baz, 36) to the source would cause the following to
> print for batch 2:
>
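> +----------------------------------+
> |    WRITES TO SINK (Batch = 2)    |
> +--------------------------+-------+
> |          window          | count |
> +--------------------------+-------+
> | {10 seconds, 20 seconds} |   2   |
> +--------------------------+-------+
> |            EVENT TIME            |
> +----------------------------------+
> | watermark -> 21 seconds          |
> | numDroppedRows -> 0              |
> +----------------------------------+
> |            STATE ROWS            |
> +--------------------------+-------+
> |           key            | value |
> +--------------------------+-------+
> | {30 seconds, 40 seconds} |  {1}  |
> +--------------------------+-------+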
>
> From this (especially with expository help), it would be more apparent
> that the record at 36 seconds did three things: it advanced the watermark
> to 36-15 = 21 seconds, caused the [10, 20] window to close, and was put
> into the state for [30, 40].
>
> One valid concern is that this sink would now be printing *metadata*, not
> just data: will users think that Structured Streaming writes metadata to
> sinks? Perhaps. But I think that we can clarify that in the documentation
> of the console sink.
>
> Finally, the specific behavior for handling queries with multiple stateful
> operations, joins, and (F)MGWS can be handled in a subsequent design
> discussion if the general idea 

Community over Code EU 2024 Travel Assistance Applications now open!

2024-02-03 Thread Gavin McDonald
Hello to all users, contributors and Committers!

The Travel Assistance Committee (TAC) are pleased to announce that
travel assistance applications for Community over Code EU 2024 are now
open!

We will be supporting Community over Code EU, Bratislava, Slovakia,
June 3rd - 5th, 2024.

TAC exists to help those that would like to attend Community over Code
events, but are unable to do so for financial reasons. For more info
on this year's applications and qualifying criteria, please visit the
TAC website at < https://tac.apache.org/ >. Applications are already
open on https://tac-apply.apache.org/, so don't delay!

The Apache Travel Assistance Committee will only be accepting
applications from those people that are able to attend the full event.

Important: Applications close on Friday, March 1st, 2024.

Applicants have until the closing date above to submit their
applications (which should contain as much supporting material as
required to efficiently and accurately process their request). This
will enable TAC to announce successful applications shortly
afterwards.

As usual, TAC expects to deal with a range of applications from a
diverse range of backgrounds; therefore, we encourage (as always)
anyone thinking about sending in an application to do so ASAP.

For those who will need a visa to enter the country, we advise you to apply
now so that you have enough time in case of interview delays. Do not
wait until you know whether you have been accepted.

We look forward to greeting many of you in Bratislava, Slovakia in June,
2024!

Kind Regards,

Gavin

(On behalf of the Travel Assistance Committee)


[no subject]

2024-02-03 Thread Gavin McDonald
Hello to all users, contributors and Committers!

The Travel Assistance Committee (TAC) are pleased to announce that
travel assistance applications for Community over Code EU 2024 are now
open!

We will be supporting Community over Code EU, Bratislava, Slovakia,
June 3rd - 5th, 2024.

TAC exists to help those that would like to attend Community over Code
events, but are unable to do so for financial reasons. For more info
on this year's applications and qualifying criteria, please visit the
TAC website at < https://tac.apache.org/ >. Applications are already
open on https://tac-apply.apache.org/, so don't delay!

The Apache Travel Assistance Committee will only be accepting
applications from those people that are able to attend the full event.

Important: Applications close on Friday, March 1st, 2024.

Applicants have until the closing date above to submit their
applications (which should contain as much supporting material as
required to efficiently and accurately process their request). This
will enable TAC to announce successful applications shortly
afterwards.

As usual, TAC expects to deal with a range of applications from a
diverse range of backgrounds; therefore, we encourage (as always)
anyone thinking about sending in an application to do so ASAP.

For those who will need a visa to enter the country, we advise you to apply
now so that you have enough time in case of interview delays. Do not
wait until you know whether you have been accepted.

We look forward to greeting many of you in Bratislava, Slovakia in June,
2024!

Kind Regards,

Gavin

(On behalf of the Travel Assistance Committee)