Re: Enhanced Console Sink for Structured Streaming

2024-02-05 Thread Mich Talebzadeh
I don't think adding this to the streaming flow (at the micro-batch level)
will be that useful.

However, this can be added to Spark UI as an enhancement to the Streaming
Query Statistics page.

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile

 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 6 Feb 2024 at 03:49, Raghu Angadi 
wrote:

> Agree, the default behavior does not need to change.
>
> Neil, how about separating it into two sections:
>
>- Actual rows in the sink (same as current output)
>- Followed by metadata
>
>


Re: Enhanced Console Sink for Structured Streaming

2024-02-05 Thread Raghu Angadi
Agree, the default behavior does not need to change.

Neil, how about separating it into two sections:

   - Actual rows in the sink (same as current output)
   - Followed by metadata


Re: Enhanced Console Sink for Structured Streaming

2024-02-05 Thread Jungtaek Lim
Maybe we could keep the default as it is, and require explicitly turning on
verboseMode to enable the auxiliary information. I don't believe anyone
parses the output of the console sink (which is what would make this a
breaking change), but changing the default behavior should still be treated
conservatively. We can highlight the mode in the guide doc, which should be
enough to publicize the improvement.

Other than that, the proposal looks good to me. Adding some more details
may be appropriate - e.g. what if there are multiple stateful operators,
what if there are 100 state rows in the state store, etc.? One sketched idea
is to employ multiple verbosity levels: list all state store rows at full
verbosity, and otherwise maybe just the number of state store rows. This is
just one example of such a detail.
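The tiered-verbosity idea sketched above could look something like this (purely illustratively, outside of Spark - the level numbers and output format are invented, not an existing or proposed API):

```python
# Purely illustrative sketch (not Spark code): how a console sink might
# render state store rows under tiered verbosity levels.
# The level numbers and output format are invented for illustration.

def render_state(rows, verbosity):
    """Render state store info for one microbatch at a given verbosity."""
    if verbosity == 0:
        return []                            # default: sink rows only
    if verbosity == 1:
        return [f"state rows: {len(rows)}"]  # summary: row count only
    return [f"state: {r}" for r in rows]     # full: dump every state row

rows = [("keyA", 3), ("keyB", 7)]
print(render_state(rows, 1))  # ['state rows: 2']
print(render_state(rows, 2))  # ["state: ('keyA', 3)", "state: ('keyB', 7)"]
```

The point of the tiers is that a pipeline with a handful of state rows can dump them all, while one with thousands degrades gracefully to a count.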

On Sun, Feb 4, 2024 at 3:22 AM Neil Ramaswamy
 wrote:

> Re: verbosity: yes, it will be more verbose. A config I was planning to
> implement was a default-on console sink option, verboseMode, which you can
> set to off if you just want sink data. I don't think that introduces
> additional complexity, as the last point suggests. (And also, nobody should
> be using this for "high data throughput" scenarios or
> "performance-sensitive applications". It's a development sink.)
>
> I don't think that exposing these details increases the learning curve:
> these details are *essential* for understanding how Structured Streaming
> works. I'd actually argue that it makes the learning curve shallower: by
> showing the few variables that affect the behavior of their pipelines,
> they'll have the conceptual understanding to answer essential questions
> like "why aren't my results showing up?" or "why is my state size always
> increasing?"
>
> Also: for stateless pipelines, none of this event-time and state detail
> applies. We would just render sink data—no behavior change from today. That
> seems gentle enough to me: start with stateless pipelines and see
> the output rows, but when you advance to stateful pipelines, you need to
> deal with the two complexities (event-time and state) of stateful streaming.
>
> On Sat, Feb 3, 2024 at 3:08 AM Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> As I understand it, the proposal suggests adding event-time
>> and state store metadata to the console sink to better highlight the
>> semantics of the Structured Streaming engine. While I agree this
>> enhancement can provide valuable insight into the engine's behavior,
>> especially for newcomers, there are potential challenges we need to be
>> aware of:
>>
>> - Including additional metadata in the console sink output increases the
>> volume of information printed. The more verbose output can make it harder
>> to distinguish the actual data from the metadata, and can hurt
>> readability, especially for users who are primarily interested in the
>> processed data rather than the internal engine details.
>> - Users unfamiliar with the internal workings of Structured Streaming
>> might misinterpret the metadata as part of the actual data, leading to
>> confusion.
>> - The act of printing additional metadata to the console may introduce
>> some overhead, especially in scenarios where high-frequency updates occur.
>> While this overhead might be minimal, it is worth considering it in
>> performance-sensitive applications.
>> - While the proposal aims to make it easier for beginners to understand
>> concepts like watermarks, operator state, and output rows, it could
>> potentially increase the learning curve due to the introduction of
>> additional terminology and information.
>> - Users might benefit from the ability to selectively enable or disable
>> the display of certain metadata elements to tailor the console output to
>> their specific needs. However, this introduces additional complexity.
>>
>> As usual with these things, your mileage may vary. Whilst the proposed
>> enhancements offer valuable insights into the behavior of Structured
>> Streaming, we ought to think about the potential downsides, particularly in
>> terms of increased verbosity, complexity, and the impact on user experience.
>> HTH
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>> On Sat, 3 Feb 2024 at 01:32, Neil 

Re: Re: [DISCUSS] Release Spark 3.5.1?

2024-02-05 Thread Jungtaek Lim
Thanks all for the positive feedback! Will figure out time to go through
the RC process. Stay tuned!

On Mon, Feb 5, 2024 at 7:46 AM Gengliang Wang  wrote:

> +1
>
> On Sun, Feb 4, 2024 at 1:57 PM Hussein Awala  wrote:
>
>> +1
>>
>> On Sun, Feb 4, 2024 at 10:13 PM John Zhuge  wrote:
>>
>>> +1
>>>
>>> John Zhuge
>>>
>>>
>>> On Sun, Feb 4, 2024 at 11:23 AM Santosh Pingale
>>>  wrote:
>>>
 +1

 On Sun, Feb 4, 2024, 8:18 PM Xiao Li 
 wrote:

> +1
>
> On Sun, Feb 4, 2024 at 6:07 AM beliefer  wrote:
>
>> +1
>>
>>
>>
>> On 2024-02-04 15:26:13, "Dongjoon Hyun" wrote:
>>
>> +1
>>
>> On Sat, Feb 3, 2024 at 9:18 PM yangjie01 
>> wrote:
>>
>>> +1
>>>
>>> On 2024/2/4 13:13, "Kent Yao" <y...@apache.org> wrote:
>>>
>>>
>>> +1
>>>
>>>
>>> Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote on Sat, Feb 3, 2024 at 21:14:
>>> >
>>> > Hi dev,
>>> >
>>> > looks like there are a huge number of commits being pushed to
>>> branch-3.5 after 3.5.0 was released, 200+ commits.
>>> >
>>> > $ git log --oneline v3.5.0..HEAD | wc -l
>>> > 202
>>> >
>>> > Also, there are 180 JIRA tickets containing 3.5.1 as fixed
>>> version, and 10 resolved issues are either marked as blocker (even
>>> correctness issues) or critical, which justifies the release.
>>> > https://issues.apache.org/jira/projects/SPARK/versions/12353495
>>> >
>>> > What do you think about releasing 3.5.1 with the current head of
>>> branch-3.5? I'm happy to volunteer as the release manager.
>>> >
>>> > Thanks,
>>> > Jungtaek Lim (HeartSaVioR)
>>>
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>
> --
>
>


Re: [Spark-Core] Improving Reliability of spark when Executors OOM

2024-02-05 Thread kalyan
Hey,
Running out of disk space is also a reliability concern, but it might need a
different strategy to handle it.
As suggested by Mridul, I am working on making things more configurable in
another (new) module… with that, we can plug in new rules for each type of
error.

Regards
Kalyan.

On Mon, 5 Feb 2024 at 1:10 PM, Jay Han  wrote:

> Hi,
> what about also supporting handling of the disk space problem, "device space
> isn't enough"? I think it's similar to an OOM exception.
>
> kalyan wrote on Sat, Jan 27, 2024 at 13:00:
>
>> Hi all,
>>
>
>> Sorry for the delay in getting the first draft of (my first) SPIP out.
>>
>> https://docs.google.com/document/d/1hxEPUirf3eYwNfMOmUHpuI5dIt_HJErCdo7_yr9htQc/edit?pli=1
>>
>> Let me know what you think.
>>
>> Regards
>> kalyan.
>>
>> On Sat, Jan 20, 2024 at 8:19 AM Ashish Singh  wrote:
>>
>>> Hey all,
>>>
>>> Thanks for this discussion, the timing of this couldn't be better!
>>>
>>> At Pinterest, we recently started to look into reducing OOM failures
>>> while also reducing memory consumption of Spark applications. We considered
>>> the following options.
>>> 1. Changing core count on executor to change memory available per task
>>> in the executor.
>>> 2. Changing resource profile based on task failures and gc metrics to
>>> grow or shrink executor memory size. We do this at application level based
>>> on the app's past runs today.
>>> 3. K8s vertical pod autoscaler
>>> 
>>>
>>> Internally, we are mostly getting aligned on option 2. We would love to
>>> make this happen and are looking forward to the SPIP.
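Option 2 above (growing or shrinking executor memory based on past task failures and GC metrics) could be sketched, purely illustratively and outside of Spark, as a simple decision rule; every threshold and factor here is invented for illustration:

```python
# Illustrative sketch only (not Spark code): a grow/shrink rule for
# executor memory driven by OOM failures and GC time, in the spirit of
# adjusting the next run's profile from the previous run's metrics.

def next_memory_mb(current_mb, oom_failures, gc_fraction,
                   grow_factor=1.5, shrink_factor=0.8,
                   gc_high=0.20, gc_low=0.05, max_mb=32768):
    """Pick the next run's executor memory from the last run's metrics."""
    if oom_failures > 0 or gc_fraction > gc_high:
        return min(int(current_mb * grow_factor), max_mb)  # grow under pressure
    if gc_fraction < gc_low:
        return int(current_mb * shrink_factor)  # shrink when comfortable
    return current_mb                           # otherwise keep as-is

print(next_memory_mb(4096, oom_failures=1, gc_fraction=0.0))   # 6144 (grow)
print(next_memory_mb(4096, oom_failures=0, gc_fraction=0.01))  # 3276 (shrink)
```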
>>>
>>>
>>> On Wed, Jan 17, 2024 at 9:34 AM Mridul Muralidharan 
>>> wrote:
>>>

 Hi,

   We are internally exploring adding support for dynamically changing
 the resource profile of a stage based on runtime characteristics.
 This includes failures due to OOM and the like, slowness due to
 excessive GC, resource wastage due to excessive overprovisioning, etc.
 Essentially, it handles scale-up and scale-down of resources.
 Instead of baking these into the scheduler directly (which is already
 complex), we are modeling it as a plugin - so that the 'business logic' of
 how to handle task events and mutate state is pluggable.

 The main limitation I see with mutating only the cores is the limit it
 places on what kinds of problems can be solved with it - mutating
 resource profiles is a much more natural way to handle this
 (spark.task.cpus predates RP).
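The pluggable "business logic" described above might take a shape like the following; this is a hypothetical sketch, and every interface and field name is invented for illustration:

```python
# Hypothetical sketch: the scheduler feeds task events to rules, and a
# rule may propose a resource-profile change. Names are invented here,
# not an actual Spark plugin API.

class ReliabilityRule:
    def on_task_event(self, event):
        """Return a dict of proposed resource changes, or None."""
        raise NotImplementedError

class GrowOnOOM(ReliabilityRule):
    """Example rule: grow executor memory when a task fails with OOM."""
    def __init__(self, factor=1.5):
        self.factor = factor

    def on_task_event(self, event):
        if event.get("failure_reason") == "OOM":
            return {"executor_memory_mb": int(event["memory_mb"] * self.factor)}
        return None  # not our kind of failure; let other rules decide

rule = GrowOnOOM()
print(rule.on_task_event({"failure_reason": "OOM", "memory_mb": 4096}))
# {'executor_memory_mb': 6144}
```

Keeping each rule behind an interface like this is what lets new error types (OOM, disk, GC pressure) be handled without touching the scheduler itself.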

 Regards,
 Mridul

 On Wed, Jan 17, 2024 at 9:18 AM Tom Graves 
 wrote:

> It is interesting. I think there are definitely some discussion points
> around this. Reliability vs. performance is always a trade-off, and it's
> great that it doesn't fail, but if it no longer meets someone's SLA that
> could be just as bad, especially if it's hard to figure out why. I think
> if something like this kicks in, it needs to be very obvious to the user
> so they can see that it occurred. Do you have something in place in the UI
> that indicates this? The nice thing is also that you aren't wasting memory
> by increasing it for all tasks when maybe you only need it for one or two.
> The downside is you are only finding out after failure.
>
> I do also worry a little bit that in your blog post, the error you
> pointed out isn't a Java OOM but an off-heap memory issue (overhead + heap
> usage). You don't really address heap memory vs. off-heap memory in that
> article; the only thing I see mentioned is spark.executor.memory, which is
> heap memory. Obviously adjusting to only run one task is going to give
> that task more overall memory, but the reason it's running out in the
> first place could be different. If it was on-heap memory, for instance,
> with more tasks I would expect to see more GC and not executor OOM. If you
> are getting executor OOM you are likely using more off-heap memory/stack
> space, etc. than you allocated. Ultimately it would be nice to know why
> that is happening and see if we can address it to not fail in the first
> place. That could be extremely difficult though, especially if using
> software outside Spark that is using that memory.
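The heap vs. off-heap distinction above comes down to simple arithmetic: the container memory an executor requests is the heap (spark.executor.memory) plus off-heap overhead (spark.executor.memoryOverhead, which defaults to max(10% of heap, 384 MiB) on YARN/K8s), and a container-level OOM kill usually means the overhead portion was undersized, not the heap. A small sketch of that arithmetic:

```python
# Sketch of executor container-memory arithmetic. The 10% factor and
# 384 MiB floor reflect Spark's documented memoryOverhead defaults;
# the function itself is illustrative, not a Spark API.

def container_memory_mb(heap_mb, overhead_mb=None,
                        overhead_factor=0.10, min_overhead_mb=384):
    """Total memory requested from YARN/K8s for one executor."""
    if overhead_mb is None:
        # default: max(10% of heap, 384 MiB), as in spark.executor.memoryOverhead
        overhead_mb = max(int(heap_mb * overhead_factor), min_overhead_mb)
    return heap_mb + overhead_mb

print(container_memory_mb(8192))  # 8192 + 819 = 9011
print(container_memory_mb(1024))  # 1024 + 384 = 1408 (the 384 MiB floor applies)
```

So growing spark.executor.memory alone also grows the overhead allowance proportionally, which is one reason "give the task more memory" can mask an off-heap problem without explaining it.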
>
> As Holden said,  we need to make sure this would play nice with the
> resource profiles, or potentially if we can use the resource profile
> functionality.  Theoretically you could extend this to try to get new
> executor if using dynamic allocation for instance.
>
> I agree doing a SPIP would be a good place to start to have more
> discussions.
>
> Tom
>
> On Wednesday, January 17, 2024 at 12:47:51 AM CST, kalyan <
> justfors...@gmail.com> wrote:
>
>
> Hello All,
>
> At Uber, we had recently done some work on improving the reliability
> of Spark applications in scenarios of