Re: Choosing the right Flink state for a business requirement

张锴 Thu, 21 Jan 2021 02:25:00 -0800

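Hello, I tried the approach you described above and ran into a problem. I did not use min/max; instead, in the process method I used context.window.getStart and
context.window.getEnd as the start and end times, assuming they would also give me the minimum and maximum values. However, the longest stay in the output is just over 4 minutes, whereas the offline job I ran shows stays of several hours, so the results look wrong.
Here is part of my code: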


val ds = dataStream
  .filter(_.liveType == 1)
  .keyBy(1, 2)
  .window(EventTimeSessionWindows.withGap(Time.minutes(1)))
  .process(new myProcessWindow()).uid("process-id")

class myProcessWindow() extends
    ProcessWindowFunction[CloudLiveLogOnLine, CloudliveWatcher, Tuple, TimeWindow] {

  override def process(key: Tuple, context: Context,
      elements: Iterable[CloudLiveLogOnLine],
      out: Collector[CloudliveWatcher]): Unit = {
    var startTime = context.window.getStart // meant as the time the first element entered the window
    var endTime = context.window.getEnd // meant as the time the last element entered the window

    val currentDate = DateUtil.currentDate
    val created_time = currentDate
    val modified_time = currentDate
    ...

    val join_time: String =
      DateUtil.convertTimeStamp2DateStr(startTime, DateUtil.SECOND_DATE_FORMAT)
    val leave_time: String =
      DateUtil.convertTimeStamp2DateStr(endTime, DateUtil.SECOND_DATE_FORMAT)
    val duration = (endTime - startTime) / 1000 // length of stay in seconds
    val duration_time = DateUtil.secondsToFormat(duration) // length of stay as hours/minutes/seconds
    out.collect(CloudliveWatcher(id, partnerId, courseId, customerId,
      courseNumber, nickName, ip, device_type, net_opretor, net_type, area,
      join_time, leave_time, created_time, modified_time,
      liveType, plat_form, duration, duration_time,
      network_operator, role, useragent, uid, eventTime))

    CloudliveWatcher(id, partnerId, courseId, customerId,
      courseNumber, nickName, ip, device_type, net_opretor, net_type, area,
      join_time, leave_time, created_time, modified_time,
      liveType, plat_form, duration, duration_time,
      network_operator, role, useragent, uid, eventTime)
  }
}


Is it appropriate to write it this way? If I should use min/max instead, how would I work it into the logic above?
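
For comparison, here is a minimal sketch in plain Scala (no Flink dependency) of how the two ways of measuring a stay differ. It assumes eventTime is an epoch-millisecond Long, and it models an event-time session window as commonly understood: each element contributes the interval [ts, ts + gap), so after merging getStart equals the smallest timestamp while getEnd equals the largest timestamp plus the gap. Session, sessionize and the span helpers are hypothetical names introduced only for illustration.

```scala
// Pure-Scala sketch: group timestamps into sessions and compare duration definitions.
object SessionDurationSketch {

  // One merged session: smallest and largest event timestamp plus element count.
  final case class Session(minTs: Long, maxTs: Long, count: Int) {
    // Modelled after context.window.getStart / getEnd for a session window (assumption):
    // merged windows span from the smallest start to the largest end (maxTs + gap).
    def windowStart: Long = minTs
    def windowEnd(gapMs: Long): Long = maxTs + gapMs
    def windowSpanMs(gapMs: Long): Long = windowEnd(gapMs) - windowStart
    // The min/max variant: distance between first and last event only.
    def minMaxSpanMs: Long = maxTs - minTs
  }

  // Two neighbouring timestamps share a session when they are at most gapMs apart
  // (touching intervals are treated as mergeable in this model).
  def sessionize(timestamps: Seq[Long], gapMs: Long): List[Session] =
    timestamps.sorted.foldLeft(List.empty[Session]) {
      case (current :: done, ts) if ts - current.maxTs <= gapMs =>
        current.copy(maxTs = ts, count = current.count + 1) :: done
      case (acc, ts) =>
        Session(ts, ts, 1) :: acc
    }.reverse

  def main(args: Array[String]): Unit = {
    val minute = 60000L
    // Pings from the original requirement: minutes 1, 2, 3, then 5, 6.
    val pings = Seq(1, 2, 3, 5, 6).map(_ * minute)
    sessionize(pings, gapMs = minute).foreach { s =>
      println(s"pings=${s.count} windowSpan=${s.windowSpanMs(minute) / 1000}s " +
        s"minMaxSpan=${s.minMaxSpanMs / 1000}s")
    }
  }
}
```

With the pings from the original requirement and a 1-minute gap, this yields two sessions: the window-based span is 180 s and 120 s, while max minus min is 120 s and 60 s. Inside process, the same two values could be taken as elements.map(_.eventTime).min and elements.map(_.eventTime).max, assuming the record exposes such a field. Whether the extra gap should count depends on how a stay is defined.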




赵一旦 <hinobl...@gmail.com> wrote on Mon, 28 Dec 2020 at 7:12 PM:

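> Key by live room ID and user ID, use a session window with a 1-minute gap, and count the elements per key+window, i.e. sum(1).
>
> Alternatively, since your pings probably do not land exactly on whole minutes such as 1 min or 2 min, you could compute the min/max per key+window and subtract them when emitting the result.
>
> A session window uses the gap between two consecutive elements to decide whether to put them in the same window.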
>
>
> 张锴 <zk357794...@gmail.com> wrote on Mon, 28 Dec 2020 at 5:35 PM:
>
> > Could you describe your reasoning for using a session window?
> >
> > Akisaya <akikevins...@gmail.com> wrote on Mon, 28 Dec 2020 at 5:00 PM:
> >
> > > A session window should work for this.
> > >
> > > https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html#session-windows
> > >
> > > news_...@163.com <news_...@163.com> wrote on Mon, 28 Dec 2020 at 2:15 PM:
> > >
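> > > > This only works if every record enters Kafka in order, but that cannot be guaranteed in practice: an earlier ping record may reach the Kafka queue later than a newer one.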
> > > >
> > > >
> > > >
> > > > news_...@163.com
> > > >
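> > > > From: 张锴
> > > > Sent: 2020-12-28 13:35
> > > > To: user-zh
> > > > Subject: Choosing the right Flink state for a business requirement
> > > > Could the experts here help me work out how to implement the following requirement?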
> > > >
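> > > > Requirement:
> > > >
> > > > Our company runs a live-streaming business, and we need to compute users' online duration in real time. The source is Kafka; a live-stream ping record is produced every minute, and each record contains an eventTime field. For example, if user A produces ping records continuously in minutes 1, 2 and 3, his stay is 3 minutes; if he comes back in minutes 5 and 6, that stay is 2 minutes. Any run of consecutive ping records is merged into a single record, and the same applies to every user in every live room.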
> > > >
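> > > > My idea:
> > > > My current plan is to key by live room ID and user ID and then use process with state: take the minute of each record's eventTime and subtract the minute of the previous record. If the difference equals 1, the records are consecutive and I continue. If it is not 1, the run is broken, so I emit the record and clear the state for the current key.
> > > >
> > > > I would welcome any suggestions on this idea, or a better way to handle it.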
> > > >
> > > > Flink version: 1.10.1
> > > >
> > >
> >
>
