Re: Watermark on late data only

2023-10-10 Thread Raghu Angadi
I would like some way to expose watermarks to the user. They do affect the processing of records, so they are relevant to users. `current_watermark()` is a good option. The implementation might be engine-specific, but it is a very relevant concept for authors of streaming pipelines.
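The `current_watermark()` function proposed above does not exist yet; as a concept sketch only, the watermark a streaming engine tracks (and could expose) is roughly the maximum event time seen so far minus a configured delay, as in Spark's `withWatermark`. A minimal toy model, not Spark's implementation:

```python
from datetime import datetime, timedelta

class WatermarkTracker:
    """Toy model of a streaming watermark: the maximum event time seen
    so far, minus a configured delay threshold."""

    def __init__(self, delay: timedelta):
        self.delay = delay
        self.max_event_time = datetime.min

    def observe(self, event_time: datetime) -> None:
        # The maximum observed event time only ever advances.
        self.max_event_time = max(self.max_event_time, event_time)

    def current_watermark(self) -> datetime:
        # The value a hypothetical current_watermark() could expose.
        return self.max_event_time - self.delay

    def is_late(self, event_time: datetime) -> bool:
        # A record is late when its event time is behind the watermark.
        return event_time < self.current_watermark()

wm = WatermarkTracker(delay=timedelta(minutes=10))
wm.observe(datetime(2023, 10, 10, 12, 0))
# current_watermark() is now 11:50; an 11:45 record is late, 11:55 is not.
```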

Re: Watermark on late data only

2023-10-10 Thread Jungtaek Lim
slight correction/clarification: We now take the "previous" watermark to determine whether a record is late, because such records are still valid inputs for non-first stateful operators; dropping records based on the current watermark would drop valid records emitted by previous (upstream) stateful operators. Please look back
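The distinction above can be sketched in a toy micro-batch model (an illustration of the reasoning, not Spark's implementation): when the watermark advances within a batch, an upstream stateful operator may legitimately emit finalized results whose event time is below the new watermark, so late-record filtering for downstream operators must compare against the previous watermark.

```python
def filter_late(records, watermark):
    """Keep only records at or after the given watermark (toy model)."""
    return [r for r in records if r["event_time"] >= watermark]

# The watermark advanced from 100 to 120 in this micro-batch.
previous_wm, current_wm = 100, 120

# An upstream aggregation emits a finalized result with event time 110:
# a valid output, even though it is below the new watermark of 120.
emitted = [{"event_time": 110}, {"event_time": 95}, {"event_time": 130}]

kept = filter_late(emitted, previous_wm)
# Filtering against current_wm instead would also drop the valid
# event_time=110 result produced by the upstream operator.
```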

Re: Watermark on late data only

2023-10-10 Thread Jungtaek Lim
We wouldn't like to expose the internal mechanism to the public. As you are a very detail-oriented engineer tracking major changes, you might have noticed that we "changed" the definition of a late record while fixing late records. Previously, a late record was defined as a record having an event time

Re: Watermark on late data only

2023-10-10 Thread Bartosz Konieczny
Thank you for the clarification, Jungtaek! Indeed, it doesn't sound like a highly demanded feature from end users; I haven't seen it a lot on StackOverflow or the mailing lists. I was just curious about the reasons. Using arbitrary stateful processing could indeed be a workaround! But IMHO

Delimited identifiers.

2023-10-10 Thread Virgil Artimon Palanciuc
Apologies if this has been discussed before; I searched but couldn’t find it. What is the rationale behind picking backticks for identifier delimiters in Spark? In the [SQL 92 spec](https://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt), the delimited identifier is unambiguously defined
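For reference, SQL-92 delimits identifiers with double quotes, and SQLite follows that rule, so the difference is easy to demonstrate from Python's standard library (Spark SQL instead uses backticks, the convention shared with MySQL and Hive):

```python
import sqlite3

# SQL-92 style: an identifier containing a space, delimited with
# double quotes, as SQLite supports.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE t ("my col" INTEGER)')
conn.execute('INSERT INTO t ("my col") VALUES (42)')
row = conn.execute('SELECT "my col" FROM t').fetchone()
# Spark SQL would write the same identifier as `my col` instead,
# using the backtick convention this thread asks about.
```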