Re: Enhanced Console Sink for Structured Streaming

2024-03-12 Thread Neil Ramaswamy
For advanced users, it's certainly an option to look at the streaming query
progress and use the state store reader to look at your state. However, the
goal of this Enhanced Console Sink is to improve the experience for
*new* users, i.e. it should work mostly out of the box.
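
(For reference, that advanced path might look roughly like the sketch below. It
is illustrative only: the checkpoint path is made up, and the state data source
reader ("statestore" format) ships only in newer Spark releases, so treat the
format and option names as approximate.)

# Hypothetical PySpark sketch: inspect progress and state of a running query.
# "query" is an active StreamingQuery; the checkpoint path is invented.
print(query.lastProgress)               # most recent progress, as a dict

state_df = (
    spark.read
    .format("statestore")               # state data source reader (newer Spark)
    .load("/tmp/checkpoints/my-query")  # the query's checkpointLocation
)
state_df.show(truncate=False)           # typically key / value / partition_id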

Let's move discussion to the JIRA (SPARK-47362), since most participants
here are in alignment.

On Tue, Mar 12, 2024 at 7:25 AM Mich Talebzadeh 
wrote:

> OK I have just been working on a Databricks engineering question raised
> by a user:
>
> Monitoring structured streaming in an external sink
>
> In practice there is an option to use *StreamingQueryListener* (from
> *pyspark.sql.streaming import DataStreamWriter, StreamingQueryListener*)
> to get the metrics out for each batch.
>
> For example, onQueryProgress receives microbatch_data like the following:
> {
>   "id" : "941e4cb6-f4ee-41f8-b662-af6dda61dc66",
>   "runId" : "691d5eb2-140e-48c0-949a-7efbe0fa0967",
>   "name" : null,
>   "timestamp" : "2024-03-10T09:21:27.233Z",
>   "batchId" : 21,
>   "numInputRows" : 1,
>   "inputRowsPerSecond" : 100.0,
>   "processedRowsPerSecond" : 5.347593582887701,
>   "durationMs" : {
>     "addBatch" : 37,
>     "commitOffsets" : 41,
>     "getBatch" : 0,
>     "latestOffset" : 0,
>     "queryPlanning" : 5,
>     "triggerExecution" : 187,
>     "walCommit" : 104
>   },
>   "stateOperators" : [ ],
>   "sources" : [ {
>     "description" : "RateStreamV2[rowsPerSecond=1, rampUpTimeSeconds=0, numPartitions=default",
>   etc
>
> Will that help?
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions" (Wernher von Braun).
>
>
> On Fri, 9 Feb 2024 at 22:39, Neil Ramaswamy 
> wrote:
>
>> Thanks for the comments, Anish and Jerry. To summarize so far, we are in
>> agreement that:
>>
>> 1. Enhanced console sink is a good tool for new users to understand
>> Structured Streaming semantics
>> 2. It should be opt-in via an option (unlike my original proposal)
>> 3. Out of the 2 modes of verbosity I proposed, we're fine with the first
>> mode for now (print sink data with event-time metadata and state data for
>> stateful queries, with duration-rendered timestamps, with just the
>> KeyWithIndexToValue state store for joins, and with a state table for every
>> stateful operator, if there are multiple).
>>
>> I think the last pending suggestion (from Raghu, Anish, and Jerry) is how
>> to structure the output so that it's clear what is data and what is
>> metadata. Here's my proposal:
>>
>> --------------------------------------
>> BATCH: 1
>> --------------------------------------
>>
>> +------------------------------------+
>> |        ROWS WRITTEN TO SINK        |
>> +--------------------------+---------+
>> |          window          |  count  |
>> +--------------------------+---------+
>> | {10 seconds, 20 seconds} |    2    |
>> +--------------------------+---------+
>>
>> +------------------------------------+
>> |        EVENT TIME METADATA         |
>> +------------------------------------+
>> | watermark -> 21 seconds            |
>> | numDroppedRows -> 0                |
>> +------------------------------------+
>>
>> +------------------------------------+
>> |        ROWS IN STATE STORE         |
>> +--------------------------+---------+
>> |            key           |  value  |
>> +--------------------------+---------+
>> | {30 seconds, 40 seconds} |   {1}   |
>> +--------------------------+---------+
>>
>> If there are no more major concerns, I think we can discuss smaller
>> details in the JIRA ticket or PR itself. I don't think a SPIP is needed for
>> a flag-gated benign change like this, but please let me know if you
>> disagree.
>>
>> Best,
>> Neil
>>
>> On Thu, Feb 8, 2024 at 5:37 PM Jerry Peng 
>> wrote:
>>
>>> I am generally a +1 on this as we can use this information in our docs
>>> to demonstrate certain concepts to potential users.
>>>
>>> I am in agreement with other reviewers that we should keep the existing
>>> default behavior of the console sink.  This new style of output should be
>>> enabled behind a flag.
>>>
>>> As for the output of this "new mode" in the console sink, can we be more
>>> explicit about what is the actual output and what is the metadata?  It is
>>> not clear from the logged output.
>>>
>>> On Tue, Feb 6, 2024 at 11:08 

Re: Enhanced Console Sink for Structured Streaming

2024-03-12 Thread Mich Talebzadeh
OK I have just been working on a Databricks engineering question raised by
a user:

Monitoring structured streaming in an external sink

In practice there is an option to use *StreamingQueryListener* (from
*pyspark.sql.streaming import DataStreamWriter, StreamingQueryListener*)
to get the metrics out for each batch.
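
A minimal sketch of such a listener (the class name and the print calls are
illustrative; the Python listener API is only available in recent Spark
releases):

from pyspark.sql.streaming import StreamingQueryListener

class ProgressPrinter(StreamingQueryListener):
    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        # event.progress carries the per-micro-batch metrics shown below
        print("microbatch_data received")
        print(event.progress.prettyJson)

    def onQueryIdle(self, event):
        pass

    def onQueryTerminated(self, event):
        pass

spark.streams.addListener(ProgressPrinter())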

For example, onQueryProgress receives microbatch_data like the following:
{
  "id" : "941e4cb6-f4ee-41f8-b662-af6dda61dc66",
  "runId" : "691d5eb2-140e-48c0-949a-7efbe0fa0967",
  "name" : null,
  "timestamp" : "2024-03-10T09:21:27.233Z",
  "batchId" : 21,
  "numInputRows" : 1,
  "inputRowsPerSecond" : 100.0,
  "processedRowsPerSecond" : 5.347593582887701,
  "durationMs" : {
    "addBatch" : 37,
    "commitOffsets" : 41,
    "getBatch" : 0,
    "latestOffset" : 0,
    "queryPlanning" : 5,
    "triggerExecution" : 187,
    "walCommit" : 104
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "RateStreamV2[rowsPerSecond=1, rampUpTimeSeconds=0, numPartitions=default",
  etc

Will that help?

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions" (Wernher von Braun).


On Fri, 9 Feb 2024 at 22:39, Neil Ramaswamy 
wrote:

> Thanks for the comments, Anish and Jerry. To summarize so far, we are in
> agreement that:
>
> 1. Enhanced console sink is a good tool for new users to understand
> Structured Streaming semantics
> 2. It should be opt-in via an option (unlike my original proposal)
> 3. Out of the 2 modes of verbosity I proposed, we're fine with the first
> mode for now (print sink data with event-time metadata and state data for
> stateful queries, with duration-rendered timestamps, with just the
> KeyWithIndexToValue state store for joins, and with a state table for every
> stateful operator, if there are multiple).
>
> I think the last pending suggestion (from Raghu, Anish, and Jerry) is how
> to structure the output so that it's clear what is data and what is
> metadata. Here's my proposal:
>
> --------------------------------------
> BATCH: 1
> --------------------------------------
>
> +------------------------------------+
> |        ROWS WRITTEN TO SINK        |
> +--------------------------+---------+
> |          window          |  count  |
> +--------------------------+---------+
> | {10 seconds, 20 seconds} |    2    |
> +--------------------------+---------+
>
> +------------------------------------+
> |        EVENT TIME METADATA         |
> +------------------------------------+
> | watermark -> 21 seconds            |
> | numDroppedRows -> 0                |
> +------------------------------------+
>
> +------------------------------------+
> |        ROWS IN STATE STORE         |
> +--------------------------+---------+
> |            key           |  value  |
> +--------------------------+---------+
> | {30 seconds, 40 seconds} |   {1}   |
> +--------------------------+---------+
>
> If there are no more major concerns, I think we can discuss smaller
> details in the JIRA ticket or PR itself. I don't think a SPIP is needed for
> a flag-gated benign change like this, but please let me know if you
> disagree.
>
> Best,
> Neil
>
> On Thu, Feb 8, 2024 at 5:37 PM Jerry Peng 
> wrote:
>
>> I am generally a +1 on this as we can use this information in our docs to
>> demonstrate certain concepts to potential users.
>>
>> I am in agreement with other reviewers that we should keep the existing
>> default behavior of the console sink.  This new style of output should be
>> enabled behind a flag.
>>
>> As for the output of this "new mode" in the console sink, can we be more
>> explicit about what is the actual output and what is the metadata?  It is
>> not clear from the logged output.
>>
>> On Tue, Feb 6, 2024 at 11:08 AM Neil Ramaswamy
>>  wrote:
>>
>>> Jungtaek and Raghu, thanks for the input. I'm happy with the verbose
>>> mode being off by default.
>>>
>>> I think it's reasonable to have 1 or 2 levels of verbosity:
>>>
>>>1. The first verbose mode could target new users, and take a highly
>>>opinionated view on what's important to understand streaming semantics.
>>>This would include printing the sink rows, watermark, number of dropped
>>>rows (if any), and state data. For state data, we should print for all
>>>state stores (for multiple stateful operators), but for joins, I think
>>>rendering just the KeyWithIndexToValueStore(s) is 

Re: Enhanced Console Sink for Structured Streaming

2024-02-09 Thread Neil Ramaswamy
Thanks for the comments, Anish and Jerry. To summarize so far, we are in
agreement that:

1. Enhanced console sink is a good tool for new users to understand
Structured Streaming semantics
2. It should be opt-in via an option (unlike my original proposal)
3. Out of the 2 modes of verbosity I proposed, we're fine with the first
mode for now (print sink data with event-time metadata and state data for
stateful queries, with duration-rendered timestamps, with just the
KeyWithIndexToValue state store for joins, and with a state table for every
stateful operator, if there are multiple).

I think the last pending suggestion (from Raghu, Anish, and Jerry) is how
to structure the output so that it's clear what is data and what is
metadata. Here's my proposal:

--------------------------------------
BATCH: 1
--------------------------------------

+------------------------------------+
|        ROWS WRITTEN TO SINK        |
+--------------------------+---------+
|          window          |  count  |
+--------------------------+---------+
| {10 seconds, 20 seconds} |    2    |
+--------------------------+---------+

+------------------------------------+
|        EVENT TIME METADATA         |
+------------------------------------+
| watermark -> 21 seconds            |
| numDroppedRows -> 0                |
+------------------------------------+

+------------------------------------+
|        ROWS IN STATE STORE         |
+--------------------------+---------+
|            key           |  value  |
+--------------------------+---------+
| {30 seconds, 40 seconds} |   {1}   |
+--------------------------+---------+

If there are no more major concerns, I think we can discuss smaller details
in the JIRA ticket or PR itself. I don't think a SPIP is needed for a
flag-gated benign change like this, but please let me know if you disagree.

Best,
Neil

On Thu, Feb 8, 2024 at 5:37 PM Jerry Peng 
wrote:

> I am generally a +1 on this as we can use this information in our docs to
> demonstrate certain concepts to potential users.
>
> I am in agreement with other reviewers that we should keep the existing
> default behavior of the console sink.  This new style of output should be
> enabled behind a flag.
>
> As for the output of this "new mode" in the console sink, can we be more
> explicit about what is the actual output and what is the metadata?  It is
> not clear from the logged output.
>
> On Tue, Feb 6, 2024 at 11:08 AM Neil Ramaswamy
>  wrote:
>
>> Jungtaek and Raghu, thanks for the input. I'm happy with the verbose mode
>> being off by default.
>>
>> I think it's reasonable to have 1 or 2 levels of verbosity:
>>
>>1. The first verbose mode could target new users, and take a highly
>>opinionated view on what's important to understand streaming semantics.
>>This would include printing the sink rows, watermark, number of dropped
>>rows (if any), and state data. For state data, we should print for all
>>state stores (for multiple stateful operators), but for joins, I think
>>rendering just the KeyWithIndexToValueStore(s) is reasonable. Timestamps
>>would render as durations (see original message) to make small examples
>>easy to understand.
>>2. The second verbose mode could target more advanced users trying to
>>create a reproduction. In addition to the first verbose mode, it would 
>> also
>>print the other join state store, the number of evicted rows due to the
>>watermark, and print timestamps as extended ISO 8601 strings (same as
>>today).
>>
>> Rather than implementing both, I would prefer to implement the first
>> level, and evaluate later if the second would be useful.
>>
>> Mich, can you elaborate on why you don't think it's useful? To reiterate,
>> this proposal is to bring to light certain metrics/values that are
>> essential for understanding SS micro-batching semantics. It's to help users
>> go from 0 to 1, not 1 to 100. (And the Spark UI can't be the place for
>> rendering sink data or state store values—there should be no sensitive user
>> data there.)
>>
>> On Mon, Feb 5, 2024 at 11:32 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> I don't think adding this to the streaming flow (at micro level) will be
>>> that useful
>>>
>>> However, this can be added to Spark UI as an enhancement to
>>> the Streaming Query Statistics page.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Dad | Technologist | Solutions Architect | Engineer
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. 

Re: Enhanced Console Sink for Structured Streaming

2024-02-08 Thread Jerry Peng
I am generally a +1 on this as we can use this information in our docs to
demonstrate certain concepts to potential users.

I am in agreement with other reviewers that we should keep the existing
default behavior of the console sink.  This new style of output should be
enabled behind a flag.

As for the output of this "new mode" in the console sink, can we be more
explicit about what is the actual output and what is the metadata?  It is
not clear from the logged output.

On Tue, Feb 6, 2024 at 11:08 AM Neil Ramaswamy
 wrote:

> Jungtaek and Raghu, thanks for the input. I'm happy with the verbose mode
> being off by default.
>
> I think it's reasonable to have 1 or 2 levels of verbosity:
>
>1. The first verbose mode could target new users, and take a highly
>opinionated view on what's important to understand streaming semantics.
>This would include printing the sink rows, watermark, number of dropped
>rows (if any), and state data. For state data, we should print for all
>state stores (for multiple stateful operators), but for joins, I think
>rendering just the KeyWithIndexToValueStore(s) is reasonable. Timestamps
>would render as durations (see original message) to make small examples
>easy to understand.
>2. The second verbose mode could target more advanced users trying to
>create a reproduction. In addition to the first verbose mode, it would also
>print the other join state store, the number of evicted rows due to the
>watermark, and print timestamps as extended ISO 8601 strings (same as
>today).
>
> Rather than implementing both, I would prefer to implement the first
> level, and evaluate later if the second would be useful.
>
> Mich, can you elaborate on why you don't think it's useful? To reiterate,
> this proposal is to bring to light certain metrics/values that are
> essential for understanding SS micro-batching semantics. It's to help users
> go from 0 to 1, not 1 to 100. (And the Spark UI can't be the place for
> rendering sink data or state store values—there should be no sensitive user
> data there.)
>
> On Mon, Feb 5, 2024 at 11:32 PM Mich Talebzadeh 
> wrote:
>
>> I don't think adding this to the streaming flow (at micro level) will be
>> that useful
>>
>> However, this can be added to Spark UI as an enhancement to the Streaming
>> Query Statistics page.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 6 Feb 2024 at 03:49, Raghu Angadi 
>> wrote:
>>
>>> Agree, the default behavior does not need to change.
>>>
>>> Neil, how about separating it into two sections:
>>>
>>>- Actual rows in the sink (same as current output)
>>>- Followed by the metadata
>>>
>>>


Re: Enhanced Console Sink for Structured Streaming

2024-02-08 Thread Anish Shrigondekar
Hi Neil,

Thanks for putting this together. +1 to the proposal of enhancing the
console sink further. I think it will help new users understand some of the
streaming/micro-batch semantics a bit better in Spark.

Agree with not having verbose mode enabled by default. I think step 1
described above sounds most useful/relevant to the large majority of users.
Might be useful to have a clean separation of the output rows (what the
sink produces) vs query metadata (watermark etc) vs state store data
maintained by the query for stateful queries, so that it's easier to
understand/consume.

Thanks,
Anish

On Tue, Feb 6, 2024 at 11:08 AM Neil Ramaswamy
 wrote:

> Jungtaek and Raghu, thanks for the input. I'm happy with the verbose mode
> being off by default.
>
> I think it's reasonable to have 1 or 2 levels of verbosity:
>
>1. The first verbose mode could target new users, and take a highly
>opinionated view on what's important to understand streaming semantics.
>This would include printing the sink rows, watermark, number of dropped
>rows (if any), and state data. For state data, we should print for all
>state stores (for multiple stateful operators), but for joins, I think
>rendering just the KeyWithIndexToValueStore(s) is reasonable. Timestamps
>would render as durations (see original message) to make small examples
>easy to understand.
>2. The second verbose mode could target more advanced users trying to
>create a reproduction. In addition to the first verbose mode, it would also
>print the other join state store, the number of evicted rows due to the
>watermark, and print timestamps as extended ISO 8601 strings (same as
>today).
>
> Rather than implementing both, I would prefer to implement the first
> level, and evaluate later if the second would be useful.
>
> Mich, can you elaborate on why you don't think it's useful? To reiterate,
> this proposal is to bring to light certain metrics/values that are
> essential for understanding SS micro-batching semantics. It's to help users
> go from 0 to 1, not 1 to 100. (And the Spark UI can't be the place for
> rendering sink data or state store values—there should be no sensitive user
> data there.)
>
> On Mon, Feb 5, 2024 at 11:32 PM Mich Talebzadeh 
> wrote:
>
>> I don't think adding this to the streaming flow (at micro level) will be
>> that useful
>>
>> However, this can be added to Spark UI as an enhancement to the Streaming
>> Query Statistics page.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 6 Feb 2024 at 03:49, Raghu Angadi 
>> wrote:
>>
>>> Agree, the default behavior does not need to change.
>>>
>>> Neil, how about separating it into two sections:
>>>
>>>- Actual rows in the sink (same as current output)
>>>- Followed by the metadata
>>>
>>>


Re: Enhanced Console Sink for Structured Streaming

2024-02-06 Thread Neil Ramaswamy
Jungtaek and Raghu, thanks for the input. I'm happy with the verbose mode
being off by default.

I think it's reasonable to have 1 or 2 levels of verbosity:

   1. The first verbose mode could target new users, and take a highly
   opinionated view on what's important to understand streaming semantics.
   This would include printing the sink rows, watermark, number of dropped
   rows (if any), and state data. For state data, we should print for all
   state stores (for multiple stateful operators), but for joins, I think
   rendering just the KeyWithIndexToValueStore(s) is reasonable. Timestamps
   would render as durations (see original message) to make small examples
   easy to understand.
   2. The second verbose mode could target more advanced users trying to
   create a reproduction. In addition to the first verbose mode, it would also
   print the other join state store, the number of evicted rows due to the
   watermark, and print timestamps as extended ISO 8601 strings (same as
   today).

Rather than implementing both, I would prefer to implement the first level,
and evaluate later if the second would be useful.

Mich, can you elaborate on why you don't think it's useful? To reiterate,
this proposal is to bring to light certain metrics/values that are
essential for understanding SS micro-batching semantics. It's to help users
go from 0 to 1, not 1 to 100. (And the Spark UI can't be the place for
rendering sink data or state store values—there should be no sensitive user
data there.)

On Mon, Feb 5, 2024 at 11:32 PM Mich Talebzadeh 
wrote:

> I don't think adding this to the streaming flow (at micro level) will be
> that useful
>
> However, this can be added to Spark UI as an enhancement to the Streaming
> Query Statistics page.
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 6 Feb 2024 at 03:49, Raghu Angadi 
> wrote:
>
>> Agree, the default behavior does not need to change.
>>
>> Neil, how about separating it into two sections:
>>
>>- Actual rows in the sink (same as current output)
>>- Followed by the metadata
>>
>>


Re: Enhanced Console Sink for Structured Streaming

2024-02-05 Thread Mich Talebzadeh
I don't think adding this to the streaming flow (at micro level) will be
that useful

However, this can be added to Spark UI as an enhancement to the Streaming
Query Statistics page.

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 6 Feb 2024 at 03:49, Raghu Angadi 
wrote:

> Agree, the default behavior does not need to change.
>
> Neil, how about separating it into two sections:
>
>- Actual rows in the sink (same as current output)
>- Followed by the metadata
>
>


Re: Enhanced Console Sink for Structured Streaming

2024-02-05 Thread Raghu Angadi
Agree, the default behavior does not need to change.

Neil, how about separating it into two sections:

   - Actual rows in the sink (same as current output)
   - Followed by the metadata


Re: Enhanced Console Sink for Structured Streaming

2024-02-05 Thread Jungtaek Lim
Maybe we could keep the default as it is, and explicitly turn on
verboseMode to enable the auxiliary information. I'm not a believer that anyone
will parse the output of the console sink (in which case this would be a
breaking change), but changing the default behavior should be taken
conservatively.
We can highlight the mode on the guide doc, which would be good enough to
publicize the improvement.

Other than that, the proposal looks good to me. Adding some more details
may be appropriate - e.g. what if there are multiple stateful operators,
what if there are 100 state rows in the state store, etc. One sketched idea
is to employ multiple verbosity levels: list all state store rows at full
verbosity, and otherwise maybe print just the number of state store rows. This
is just one example of such a detail.

On Sun, Feb 4, 2024 at 3:22 AM Neil Ramaswamy
 wrote:

> Re: verbosity: yes, it will be more verbose. A config I was planning to
> implement was a default-on console sink option, verboseMode, that you can
> set to be off if you just want sink data. I don't think that introduces
> additional complexity, as the last point suggests. (And also, nobody should
> be using this for "high data throughput" scenarios or
> "performance-sensitive applications". It's a development sink.)
>
> I don't think that exposing these details increases the learning curve:
> these details are *essential *for understanding how Structured Streaming
> works. I'd actually argue that it makes the learning curve shallower: by
> showing the few variables that affect the behavior of their pipelines,
> they'll have the conceptual understanding to answer essential questions
> like "why aren't my results showing up?" or "why is my state size always
> increasing?"
>
> Also: for stateless pipelines, none of this event-time and state detail
> applies. We would just render sink data—no behavior change from today. That
> seems gentle enough to me: start with stateless pipelines and see
> the output rows, but when you advance to stateful pipelines, you need to
> deal with the two complexities (event-time and state) of stateful streaming.
>
> On Sat, Feb 3, 2024 at 3:08 AM Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> As I understood, the proposal you mentioned suggests adding event-time
>> and state store metadata to the console sink to better highlight the
>> semantics of the Structured Streaming engine. While I agree this
>> enhancement can provide valuable insights into the engine's behavior
>> especially for newcomers, there are potential challenges that we need to be
>> aware of:
>>
>> - Including additional metadata in the console sink output can increase
>> the volume of information printed. This might result in a more verbose
>> console output, making it harder to distinguish the actual data from the
>> metadata, especially in scenarios with high data throughput.
>> - Added verbosity: the proposed additional metadata may make the console
>> output more verbose, potentially affecting its readability, especially for
>> users who are primarily interested in the processed data and not the
>> internal engine details.
>> - Users unfamiliar with the internal workings of Structured Streaming
>> might misinterpret the metadata as part of the actual data, leading to
>> confusion.
>> - The act of printing additional metadata to the console may introduce
>> some overhead, especially in scenarios where high-frequency updates occur.
>> While this overhead might be minimal, it is worth considering it in
>> performance-sensitive applications.
>> - While the proposal aims to make it easier for beginners to understand
>> concepts like watermarks, operator state, and output rows, it could
>> potentially increase the learning curve due to the introduction of
>> additional terminology and information.
>> - Users might benefit from the ability to selectively enable or disable
>> the display of certain metadata elements to tailor the console output to
>> their specific needs. However, this introduces additional complexity.
>>
>> As usual with these things, your mileage varies. Whilst the proposed
>> enhancements offer valuable insights into the behavior of Structured
>> Streaming, we ought to think about the potential downsides, particularly in
>> terms of increased verbosity, complexity, and the impact on user experience
>>
>> HTH
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 3 Feb 2024 at 01:32, Neil 

Re: Enhanced Console Sink for Structured Streaming

2024-02-03 Thread Neil Ramaswamy
Re: verbosity: yes, it will be more verbose. A config I was planning to
implement was a default-on console sink option, verboseMode, that you can
set to be off if you just want sink data. I don't think that introduces
additional complexity, as the last point suggests. (And also, nobody should
be using this for "high data throughput" scenarios or
"performance-sensitive applications". It's a development sink.)

I don't think that exposing these details increases the learning curve:
these details are *essential *for understanding how Structured Streaming
works. I'd actually argue that it makes the learning curve shallower: by
showing the few variables that affect the behavior of their pipelines,
they'll have the conceptual understanding to answer essential questions
like "why aren't my results showing up?" or "why is my state size always
increasing?"

Also: for stateless pipelines, none of this event-time and state detail
applies. We would just render sink data—no behavior change from today. That
seems gentle enough to me: start with stateless pipelines and see
the output rows, but when you advance to stateful pipelines, you need to
deal with the two complexities (event-time and state) of stateful streaming.

On Sat, Feb 3, 2024 at 3:08 AM Mich Talebzadeh 
wrote:

> Hi,
>
> As I understood, the proposal you mentioned suggests adding event-time
> and state store metadata to the console sink to better highlight the
> semantics of the Structured Streaming engine. While I agree this
> enhancement can provide valuable insights into the engine's behavior
> especially for newcomers, there are potential challenges that we need to be
> aware of:
>
> - Including additional metadata in the console sink output can increase
> the volume of information printed. This might result in a more verbose
> console output, making it harder to distinguish the actual data from the
> metadata, especially in scenarios with high data throughput.
> - Added verbosity: the proposed additional metadata may make the console
> output more verbose, potentially affecting its readability, especially for
> users who are primarily interested in the processed data and not the
> internal engine details.
> - Users unfamiliar with the internal workings of Structured Streaming
> might misinterpret the metadata as part of the actual data, leading to
> confusion.
> - The act of printing additional metadata to the console may introduce
> some overhead, especially in scenarios where high-frequency updates occur.
> While this overhead might be minimal, it is worth considering it in
> performance-sensitive applications.
> - While the proposal aims to make it easier for beginners to understand
> concepts like watermarks, operator state, and output rows, it could
> potentially increase the learning curve due to the introduction of
> additional terminology and information.
> - Users might benefit from the ability to selectively enable or disable
> the display of certain metadata elements to tailor the console output to
> their specific needs. However, this introduces additional complexity.
>
> As usual with these things, your mileage varies. Whilst the proposed
> enhancements offer valuable insights into the behavior of Structured
> Streaming, we ought to think about the potential downsides, particularly in
> terms of increased verbosity, complexity, and the impact on user experience
>
> HTH
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 3 Feb 2024 at 01:32, Neil Ramaswamy
>  wrote:
>
>> Hi all,
>>
>> I'd like to propose the idea of enhancing Structured Streaming's console
>> sink to print event-time metrics and state store data, in addition to the
>> sink's rows.
>>
>> I've noticed beginners often struggle to understand how watermarks,
>> operator state, and output rows are all intertwined. By printing all of
>> this information in the same place, I think that this sink will make it
>> easier for users to see—and our docs to explain—how these concepts work
>> together.
>>
>> For example, our docs could walk the users through a query with a
>> 10-second tumbling window aggregation (e.g. with a .count()) and a 15
>> second watermark. After processing something like (foo, 17) and (bar, 15),
>> writing another record (baz, 36) to the source would cause the following to
>> print for batch 2:
>>
>> ++
>>
>> |  WRITES TO SINK (Batch = 2)|
>>
>> 

Re: Enhanced Console Sink for Structured Streaming

2024-02-03 Thread Mich Talebzadeh
Hi,

As I understood, the proposal you mentioned suggests adding event-time and
state store metadata to the console sink to better highlight the semantics
of the Structured Streaming engine. While I agree this enhancement can
provide valuable insights into the engine's behavior especially for
newcomers, there are potential challenges that we need to be aware of:

- Including additional metadata in the console sink output can increase the
volume of information printed. This might result in a more verbose console
output, making it harder to distinguish the actual data from the metadata,
especially in scenarios with high data throughput.
- Added verbosity: the proposed additional metadata may make the console
output more verbose, potentially affecting its readability, especially for
users who are primarily interested in the processed data and not the
internal engine details.
- Users unfamiliar with the internal workings of Structured Streaming might
misinterpret the metadata as part of the actual data, leading to confusion.
- The act of printing additional metadata to the console may introduce some
overhead, especially in scenarios where high-frequency updates occur. While
this overhead might be minimal, it is worth considering it in
performance-sensitive applications.
- While the proposal aims to make it easier for beginners to understand
concepts like watermarks, operator state, and output rows, it could
potentially increase the learning curve due to the introduction of
additional terminology and information.
- Users might benefit from the ability to selectively enable or disable the
display of certain metadata elements to tailor the console output to their
specific needs. However, this introduces additional complexity.

As usual with these things, your mileage varies. Whilst the proposed
enhancements offer valuable insights into the behavior of Structured
Streaming, we ought to think about the potential downsides, particularly in
terms of increased verbosity, complexity, and the impact on user experience.

HTH
Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 3 Feb 2024 at 01:32, Neil Ramaswamy
 wrote:

> Hi all,
>
> I'd like to propose the idea of enhancing Structured Streaming's console
> sink to print event-time metrics and state store data, in addition to the
> sink's rows.
>
> I've noticed beginners often struggle to understand how watermarks,
> operator state, and output rows are all intertwined. By printing all of
> this information in the same place, I think that this sink will make it
> easier for users to see—and our docs to explain—how these concepts work
> together.
>
> For example, our docs could walk the users through a query with a
> 10-second tumbling window aggregation (e.g. with a .count()) and a 15
> second watermark. After processing something like (foo, 17) and (bar, 15),
> writing another record (baz, 36) to the source would cause the following to
> print for batch 2:
>
> +------------------------------------+
> |     WRITES TO SINK (Batch = 2)     |
> +--------------------------+---------+
> |          window          |  count  |
> +--------------------------+---------+
> | {10 seconds, 20 seconds} |    2    |
> +--------------------------+---------+
> |             EVENT TIME             |
> +------------------------------------+
> | watermark -> 21 seconds            |
> | numDroppedRows -> 0                |
> +------------------------------------+
> |             STATE ROWS             |
> +--------------------------+---------+
> |            key           |  value  |
> +--------------------------+---------+
> | {30 seconds, 40 seconds} |   {1}   |
> +--------------------------+---------+
>
> From this (especially with expository help), it would be more apparent
> that the record at 36 seconds did three things: it advanced the watermark
> to 36-15 = 21 seconds, caused the [10, 20] window to close, and was put
> into the state for [30, 40].
>
> One valid concern is that this sink would now be printing *metadata*, not
> just data: will users think that Structured Streaming writes metadata to
> sinks? Perhaps. But I think that we can clarify that in the documentation
> of the console sink.
>
> Finally, the specific behavior for handling queries with multiple stateful
> operations, joins, and (F)MGWS can be handled in a subsequent design
> discussion if the general idea 

Enhanced Console Sink for Structured Streaming

2024-02-02 Thread Neil Ramaswamy
Hi all,

I'd like to propose the idea of enhancing Structured Streaming's console
sink to print event-time metrics and state store data, in addition to the
sink's rows.

I've noticed beginners often struggle to understand how watermarks,
operator state, and output rows are all intertwined. By printing all of
this information in the same place, I think that this sink will make it
easier for users to see—and our docs to explain—how these concepts work
together.

For example, our docs could walk users through a query with a 10-second
tumbling window aggregation (e.g. with a .count()) and a 15-second watermark
(a rough sketch of such a query follows the table below). After processing
something like (foo, 17) and (bar, 15), writing another record (baz, 36) to
the source would cause the following to print for batch 2:

+------------------------------------+
|     WRITES TO SINK (Batch = 2)     |
+--------------------------+---------+
|          window          |  count  |
+--------------------------+---------+
| {10 seconds, 20 seconds} |    2    |
+--------------------------+---------+
|             EVENT TIME             |
+------------------------------------+
| watermark -> 21 seconds            |
| numDroppedRows -> 0                |
+------------------------------------+
|             STATE ROWS             |
+--------------------------+---------+
|            key           |  value  |
+--------------------------+---------+
| {30 seconds, 40 seconds} |   {1}   |
+--------------------------+---------+
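
For concreteness, a rough sketch of the kind of query assumed in this
walkthrough (names are illustrative; "events" is any streaming DataFrame with
an eventTime column):

from pyspark.sql.functions import window

# 10-second tumbling window count with a 15-second watermark.
windowed_counts = (
    events
    .withWatermark("eventTime", "15 seconds")
    .groupBy(window("eventTime", "10 seconds"))
    .count()
)

query = (
    windowed_counts.writeStream
    .format("console")
    .outputMode("append")
    .start()
)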

From this (especially with expository help), it would be more apparent that
the record at 36 seconds did three things: it advanced the watermark to
36-15 = 21 seconds, caused the [10, 20] window to close, and was put into
the state for [30, 40].

One valid concern is that this sink would now be printing *metadata*, not
just data: will users think that Structured Streaming writes metadata to
sinks? Perhaps. But I think that we can clarify that in the documentation
of the console sink.

Finally, the specific behavior for handling queries with multiple stateful
operations, joins, and (F)MGWS can be handled in a subsequent design
discussion if the general idea is appreciated.

*TLDR: I propose adding event-time and state store metadata to the console
sink to better highlight the semantics of the Structured Streaming engine.*

Neil