Hi,

Paul is right. Which data is stored in state for a window, and how much, depends on the type of the function that is applied to the window:
- ReduceFunction: only the reduced value is stored
- AggregateFunction: only the accumulator value is stored
- WindowFunction or ProcessWindowFunction: all original records are stored

So in Anna's jobs, each window only stores a single value. Hence, the state size is independent of the size of the window (unless the reduced value collects the values of all input records, e.g., in a list or a set).

Best, Fabian

2018-08-02 10:29 GMT+02:00 vino yang <yanghua1...@gmail.com>:
> Hi Paul,
>
> Yes, I was talking about the normal case: Flink must store the data in
> the window as state so that it can recover from failures.
> In some scenarios your understanding is also correct; Flink uses window
> panes to optimize window calculations.
> So, if your scenario falls into the optimized mode, ignore this.
>
> Thanks, vino.
>
> 2018-08-02 16:11 GMT+08:00 Paul Lam <paullin3...@gmail.com>:
>>
>>> On 31 Jul 2018, at 15:47, vino yang <yanghua1...@gmail.com> wrote:
>>>
>>> Hi anna,
>>>
>>>> 1. The srcstream is a very high volume stream and the window size is
>>>> 2 weeks and 4 weeks. Is the window size a problem? In this case, I
>>>> think it is not a problem because I am using reduce which stores only
>>>> 1 value per window. Is that right?
>>>
>>> The window size is determined by your business needs. However, if the
>>> window size is too large, the state of the job will be large, which
>>> will result in longer failure recovery. You need to be aware of this.
>>> One value per window is just the value computed by the window. It
>>> caches all data for the period of time before the window is triggered.
>>>
>>>> 2. I have 2 output operations, one with a 2-week window and the other
>>>> with a 4-week window. Are they executed in parallel or in sequence?
>>>
>>> These two windows are calculated in parallel.
>>>
>>>> 3. When I have multiple output operations like in this case, should I
>>>> break it into 2 different jobs?
>>>
>>> Both modes are OK.
>>> When there is only one job, the two windows share the source stream,
>>> but this results in a larger job state and slower recovery. When split
>>> into two jobs, Kafka is consumed twice, but the two windows are
>>> independent of each other.
>>>
>>>> 4. Can I run multiple jobs on the same cluster?
>>>
>>> For a standalone cluster or a YARN Flink session, there is no problem.
>>> For Flink-on-YARN single-job mode, a cluster usually runs only one
>>> job; this is the recommended mode.
>>>
>>> Thanks, vino.
>>>
>>> 2018-07-31 15:11 GMT+08:00 anna stax <annasta...@gmail.com>:
>>>> Hi all,
>>>>
>>>> I am not sure when I should go for multiple jobs or have one job with
>>>> all the sources and sinks. Following is my code:
>>>>
>>>> val env = StreamExecutionEnvironment.getExecutionEnvironment
>>>> .......
>>>> // create a Kafka source
>>>> val srcstream = env.addSource(consumer)
>>>>
>>>> srcstream
>>>>   .keyBy(0)
>>>>   .window(ProcessingTimeSessionWindows.withGap(Time.days(14)))
>>>>   .reduce ...
>>>>   .map ...
>>>>   .addSink ...
>>>>
>>>> srcstream
>>>>   .keyBy(0)
>>>>   .window(ProcessingTimeSessionWindows.withGap(Time.days(28)))
>>>>   .reduce ...
>>>>   .map ...
>>>>   .addSink ...
>>>>
>>>> env.execute("Job1")
>>>>
>>>> My questions:
>>>>
>>>> 1. The srcstream is a very high volume stream and the window size is
>>>> 2 weeks and 4 weeks. Is the window size a problem? In this case, I
>>>> think it is not a problem because I am using reduce which stores only
>>>> 1 value per window. Is that right?
>>>>
>>>> 2. I have 2 output operations, one with a 2-week window and the other
>>>> with a 4-week window. Are they executed in parallel or in sequence?
>>>>
>>>> 3. When I have multiple output operations like in this case, should I
>>>> break it into 2 different jobs?
>>>>
>>>> 4. Can I run multiple jobs on the same cluster?
>>>>
>>>> Thanks
>>
>> Hi yang,
>>
>> I'm a bit confused about the window data cache that you mentioned.
>>> It caches all data for the period of time before the window is
>>> triggered.
>>
>> In my understanding, window functions process elements incrementally
>> unless the low-level ProcessWindowFunction API is used, so caching data
>> should not be required in most scenarios. Would you mind giving more
>> details of the window caching design? And please correct me if I'm
>> wrong. Thanks a lot.
>>
>> Best regards,
>> Paul Lam
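Fabian's point at the top of the thread can be sketched in plain Scala (no Flink dependencies; the class and method names below are illustrative, not Flink's API): an incremental reduce keeps a single value per window no matter how many records arrive, while a ProcessWindowFunction-style buffer must retain every record until the window fires.

```scala
// Minimal sketch, assuming a max-reduction over Long values.
// ReduceState mimics the state of a ReduceFunction window (one value);
// BufferState mimics the state of a ProcessWindowFunction window (all records).
object WindowStateSketch {

  // Incremental aggregation: state is a single value, regardless of input size.
  final case class ReduceState(var value: Option[Long] = None) {
    def add(x: Long): Unit = value = Some(value.fold(x)(v => v max x))
  }

  // Full-window buffering: state grows with every incoming record.
  final case class BufferState(var records: Vector[Long] = Vector.empty) {
    def add(x: Long): Unit = records :+= x
    def fire(): Long = records.max // computed only when the window triggers
  }

  def main(args: Array[String]): Unit = {
    val reduced  = ReduceState()
    val buffered = BufferState()
    (1L to 1000L).foreach { x => reduced.add(x); buffered.add(x) }

    // Both yield the same window result when fired...
    assert(reduced.value.contains(1000L))
    assert(buffered.fire() == 1000L)
    // ...but the buffered variant held 1000 records as state, the reduced one held 1.
    println(s"reduce state: 1 value; buffer state: ${buffered.records.size} records")
  }
}
```

This is why the state size of anna's two-week and four-week session windows stays small: the reduce is applied on arrival, so only the running value is checkpointed.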