Re: Watermark on late data only

2023-10-09 Thread Jungtaek Lim
Technically speaking, "late data" is data that can no longer be processed
because the engine has already discarded the state associated with it.

That said, the only reason watermarks exist in streaming is to support
stateful operators. From the engine's point of view, there is no concept
of "late data" in a stateless query. There, users have to apply a "filter"
themselves, without relying on the value of the watermark. I guess someone
might see a benefit in having the engine track the event-time trend
automatically and want to define late data based on the watermark even in a
stateless query, but personally I haven't heard such a request so far.

As a workaround you can use flatMapGroupsWithState, which exposes the
watermark value to you, but I agree it is too heavyweight just for this. If
we see consistent demand, we could probably look into it and maybe introduce
a new SQL function for it (one that works only on streaming, which is
probably a major blocker to introducing it).
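
To make the semantics under discussion concrete, here is a minimal
plain-Scala sketch of what "dropping late data by watermark" could mean for a
stateless pipeline. It deliberately uses no Spark APIs and is not how Spark
implements (or would implement) this. LateDataFilter, processBatch and the
Click fields are hypothetical names introduced only for illustration. It
assumes a simplified rule: the watermark is the maximum event time seen in
earlier batches minus the delay, and it only applies to later batches.

```scala
import java.sql.Timestamp

// Hypothetical record type modelled on the Click used in the quoted example.
final case class Click(id: Int, clickTime: Timestamp)

// Plain-Scala model of a watermark-based late-data filter (not a Spark API).
final class LateDataFilter(delayMs: Long) {
  // Highest event time observed in the batches processed so far.
  private var maxEventTimeMs: Option[Long] = None

  // Watermark = max observed event time minus the allowed delay.
  def watermarkMs: Option[Long] = maxEventTimeMs.map(_ - delayMs)

  // Filters one micro-batch against the watermark derived from earlier
  // batches, then advances the max event time using this batch's rows.
  def processBatch(batch: Seq[Click]): Seq[Click] = {
    val threshold = watermarkMs
    val kept = batch.filter(c => threshold.forall(c.clickTime.getTime >= _))
    val seen = maxEventTimeMs.toList ++ batch.map(_.clickTime.getTime)
    if (seen.nonEmpty) maxEventTimeMs = Some(seen.max)
    kept
  }
}

object WatermarkFilterDemo {
  private def ts(s: String): Timestamp = Timestamp.valueOf(s)

  val filter = new LateDataFilter(delayMs = 10 * 60 * 1000L)

  // First batch: no watermark exists yet, so every row is kept.
  val batch1: Seq[Click] = filter.processBatch(Seq(
    Click(1, ts("2023-06-10 10:10:00")),
    Click(2, ts("2023-06-10 10:12:00")),
    Click(3, ts("2023-06-10 10:14:00"))
  ))

  // Second batch: the watermark is now 10:04:00 (10:14:00 minus 10 minutes),
  // so Click 10 at 10:00:10 is older than the watermark and is dropped.
  val batch2: Seq[Click] = filter.processBatch(Seq(
    Click(4, ts("2023-06-10 11:00:40")),
    Click(5, ts("2023-06-10 11:00:30")),
    Click(6, ts("2023-06-10 11:00:10")),
    Click(10, ts("2023-06-10 10:00:10"))
  ))

  def main(args: Array[String]): Unit = {
    println(batch1.map(_.id))
    println(batch2.map(_.id))
  }
}
```

In a real stateless Spark query there is no such automatic tracking today,
which is why the filter threshold has to come from the user.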

On Mon, Oct 9, 2023 at 11:03 AM Bartosz Konieczny wrote:

> Hi,
>
> I've recently been analyzing the watermark propagation added in 3.5.0
> and had to go back to the basics of watermarks. One question remains
> unanswered in my head.
>
> Why are watermarks reserved for stateful queries? Can't they be applied
> just to filter out late data?
>
> Is the reason only historical, since the initial design doc mentions
> aggregation queries exclusively? Or are there technical limitations that
> explain why jobs like the one below don't drop late data automatically?
>
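> import java.sql.Timestamp
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.sql.execution.streaming.MemoryStream
> import sparkSession.implicits._
>
> // Click is assumed to be: case class Click(id: Int, clickTime: Timestamp)
> // MemoryStream requires an implicit SQLContext in scope.
> implicit val sqlContext: SQLContext = sparkSession.sqlContext
> val clicksStream = MemoryStream[Click]
> val clicksWithWatermark = clicksStream.toDF
>   .withWatermark("clickTime", "10 minutes")
> val query = clicksWithWatermark.writeStream
>   .format("console")
>   .option("truncate", false)
>   .start()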
>
> clicksStream.addData(Seq(
>   Click(1, Timestamp.valueOf("2023-06-10 10:10:00")),
>   Click(2, Timestamp.valueOf("2023-06-10 10:12:00")),
>   Click(3, Timestamp.valueOf("2023-06-10 10:14:00"))
> ))
>
>
> query.processAllAvailable()
>
> clicksStream.addData(Seq(
>   Click(4, Timestamp.valueOf("2023-06-10 11:00:40")),
>   Click(5, Timestamp.valueOf("2023-06-10 11:00:30")),
>   Click(6, Timestamp.valueOf("2023-06-10 11:00:10")),
>   Click(10, Timestamp.valueOf("2023-06-10 10:00:10"))
> ))
> query.processAllAvailable()
>
> One quick implementation could be adding a new physical plan rule to
> IncrementalExecution for the EventTimeWatermark node. That's a first
> thought, maybe too simplistic and hiding some pitfalls?
>
> Best,
> Bartosz.
> --
> freelance data engineer
> https://www.waitingforcode.com
> https://github.com/bartosz25/
> https://twitter.com/waitingforcode
>
>


Re: Welcome to Our New Apache Spark Committer and PMCs

2023-10-09 Thread Xinrong Meng
Congratulations!

On Mon, Oct 9, 2023 at 5:06 AM Kent Yao wrote:

> Congrats!
>
> Kent
>
>
> On Saturday, October 7, 2023, John Zhuge wrote:
>
>> Congratulations!
>>
>> On Fri, Oct 6, 2023 at 6:41 PM Yi Wu wrote:
>>
>>> Congrats!
>>>
>>> On Sat, Oct 7, 2023 at 9:24 AM XiDuo You wrote:
>>>
>>>> Congratulations!
>>>>
>>>> On Fri, Oct 6, 2023 at 00:26, Prashant Sharma wrote:
>>>> >
>>>> > Congratulations
>>>> >
>>>> > On Wed, 4 Oct, 2023, 8:52 pm huaxin gao wrote:
>>>> >>
>>>> >> Congratulations!
>>>> >>
>>>> >> On Wed, Oct 4, 2023 at 7:39 AM Chao Sun wrote:
>>>> >>>
>>>> >>> Congratulations!
>>>> >>>
>>>> >>> On Wed, Oct 4, 2023 at 5:11 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>> >>>>
>>>> >>>> Congrats!
>>>> >>>>
>>>> >>>> On Wed, Oct 4, 2023 at 5:04 PM, yangjie01 wrote:
>>>> >
>>>> > Congratulations!
>>>> >
>>>> > Jie Yang
>>>> >
>>>> > From: Dongjoon Hyun
>>>> > Date: Wednesday, October 4, 2023, 13:04
>>>> > To: Hyukjin Kwon
>>>> > Cc: Hussein Awala, Rui Wang <amaliu...@apache.org>, Gengliang Wang, Xiao Li <gatorsm...@gmail.com>, "dev@spark.apache.org"
>>>> > Subject: Re: Welcome to Our New Apache Spark Committer and PMCs
>>>> >
>>>> > Congratulations!
>>>> >
>>>> > Dongjoon.
>>>> >
>>>> > On Tue, Oct 3, 2023 at 5:25 PM Hyukjin Kwon wrote:
>>>> >
>>>> > Woohoo!
>>>> >
>>>> > On Tue, 3 Oct 2023 at 22:47, Hussein Awala wrote:
>>>> >
>>>> > Congrats to all of you!
>>>> >
>>>> > On Tue 3 Oct 2023 at 08:15, Rui Wang wrote:
>>>> >
>>>> > Congratulations! Well deserved!
>>>> >
>>>> > -Rui
>>>> >
>>>> > On Mon, Oct 2, 2023 at 10:32 PM Gengliang Wang wrote:
>>>> >
>>>> > Congratulations to all! Well deserved!
>>>> >
>>>> > On Mon, Oct 2, 2023 at 10:16 PM Xiao Li wrote:
>>>> >
>>>> > Hi all,
>>>> >
>>>> > The Spark PMC is delighted to announce that we have voted to add
>>>> > one new committer and two new PMC members. These individuals have
>>>> > consistently contributed to the project and have clearly demonstrated their
>>>> > expertise.
>>>> >
>>>> > New Committer:
>>>> > - Jiaan Geng (focusing on Spark Connect and Spark SQL)
>>>> >
>>>> > New PMCs:
>>>> > - Yuanjian Li
>>>> > - Yikun Jiang
>>>> >
>>>> > Please join us in extending a warm welcome to them in their new
>>>> > roles!
>>>> >
>>>> > Sincerely,
>>>> > The Spark PMC




Re: Welcome to Our New Apache Spark Committer and PMCs

2023-10-09 Thread Kent Yao
Congrats!

Kent

