Parquet files in streaming mode
Hello,

We're trying to use a Parquet file sink to output files to S3. When running in streaming mode, it seems that Parquet files are flushed and rolled at each checkpoint. The result is a crazy high number of very small Parquet files, which completely defeats the purpose of that format.

Is there a way to build larger output Parquet files? Or does that come only at the price of a very large checkpointing interval?

Thanks for your insights.

Mathieu
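To make the scale of the problem concrete: with a bulk format like Parquet, each parallel subtask rolls a new file at every checkpoint, so the file count grows with checkpoint frequency times parallelism. A minimal back-of-envelope sketch (the interval and parallelism values below are illustrative assumptions, not from the original message):

```java
// Rough estimate of files produced per day by a bulk-format sink that
// rolls one file per checkpoint per parallel subtask.
public class SmallFileEstimate {
    static long filesPerDay(long checkpointIntervalSec, int parallelism) {
        long checkpointsPerDay = 86_400 / checkpointIntervalSec;
        return checkpointsPerDay * parallelism;
    }

    public static void main(String[] args) {
        // 10 s checkpoint interval, parallelism 8 -> 69120 files per day
        System.out.println(filesPerDay(10, 8));
        // 10 min checkpoint interval, parallelism 8 -> 1152 files per day
        System.out.println(filesPerDay(600, 8));
    }
}
```

Even the longer interval still produces over a thousand files a day, which illustrates why tuning the checkpoint interval alone only goes so far.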
hook a callback on checkpointing failure
Hey there,

We have some instabilities around checkpointing that we don't quite understand. In general, as soon as a checkpoint fails, our cluster does not recover back to a proper state. To better understand the mechanism, we'd like to be notified as soon as this happens, so we can jump on our console and try to understand the problem. In my mind, we'd simply send a Slack notification to some ops people as soon as a checkpoint fails.

Is there a way to register a callback in the checkpointing system and get called as soon as one fails?

[FWIW, our config: Flink 1.12 on YARN/EMR, checkpointing on S3, RocksDB state backend]

Thanks.

Mathieu
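One way to get such a notification without hooking into the job itself is to poll Flink's REST API: `GET /jobs/<jobid>/checkpoints` returns a `counts` object that includes a `failed` counter. A minimal sketch of the extraction step (the JSON body, the polling loop, and the Slack call are placeholders; a real watcher would fetch the body with an HTTP client such as `java.net.http.HttpClient`):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: watch the "failed" counter from Flink's
// GET /jobs/<jobid>/checkpoints REST response and alert when it is non-zero.
public class CheckpointWatcher {
    private static final Pattern FAILED =
        Pattern.compile("\"failed\"\\s*:\\s*(\\d+)");

    // Extract counts.failed from the REST response body; -1 if absent.
    static int failedCount(String json) {
        Matcher m = FAILED.matcher(json);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        // Placeholder response body; fetch this periodically in practice.
        String body = "{\"counts\":{\"completed\":42,\"failed\":3}}";
        int failed = failedCount(body);
        if (failed > 0) {
            // Placeholder for a Slack webhook POST.
            System.out.println("ALERT: " + failed + " failed checkpoints");
        }
    }
}
```

Polling from outside the job also keeps the alerting path alive when the job itself is in a bad state, which matters given the described recovery problems.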
Re: proper way to manage watermarks with messages combining multiple timestamps
Hi,

I can't change the way devices send their data; we are constrained in the number of messages sent per day per device.

To illustrate my question: at 9:08 a message is emitted. It packs together several measures:
- measure m1, taken at 8:52
- measure m2, taken at 9:07

m1 must go in the 8:00-9:00 aggregation, m2 in the 9:00-10:00 aggregation. What's the proper way to set the watermarks in such a case?

Thanks for your insights!

Mathieu

On Sat, Apr 17, 2021 at 07:05, Lasse Nedergaard <lassenedergaardfl...@gmail.com> wrote:
> Hi
>
> One thing to remember is that Flink's watermark is global; this means it's
> shared between all keys (in your case, IoT devices). So the first requirement
> you have is to ensure the timestamps are aligned, or almost aligned, between
> your IoT devices; if not, Flink's watermark is hard to use.
>
> Med venlig hilsen / Best regards
> Lasse Nedergaard
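The 9:08 example above can be worked through in plain Java, with no Flink dependency, to show why a bounded-out-of-orderness watermark keeps m1 from being treated as late (the 15-minute bound is an assumption taken from the ~15 min span described in the thread):

```java
import java.time.Duration;
import java.time.Instant;

// Plain-Java illustration of the arithmetic behind hourly tumbling windows
// and a bounded-out-of-orderness watermark; not Flink API.
public class WatermarkSketch {
    static final Duration BOUND = Duration.ofMinutes(15); // assumed max lag

    // Start of the hourly tumbling window an event time falls into.
    static Instant windowStart(Instant eventTime) {
        long hourMs = Duration.ofHours(1).toMillis();
        long t = eventTime.toEpochMilli();
        return Instant.ofEpochMilli(t - (t % hourMs));
    }

    // Watermark after seeing maxEventTime, with bounded out-of-orderness.
    static Instant watermark(Instant maxEventTime) {
        return maxEventTime.minus(BOUND);
    }

    public static void main(String[] args) {
        Instant m1 = Instant.parse("2021-04-17T08:52:00Z");
        Instant m2 = Instant.parse("2021-04-17T09:07:00Z");
        System.out.println(windowStart(m1)); // 2021-04-17T08:00:00Z
        System.out.println(windowStart(m2)); // 2021-04-17T09:00:00Z
        // After the 9:08 message, the watermark is 9:07 - 15 min = 8:52,
        // so the 8:00-9:00 window has not yet fired and m1 is not late.
        System.out.println(watermark(m2)); // 2021-04-17T08:52:00Z
    }
}
```

The key point: as long as the out-of-orderness bound covers the packing span, the older measures in a message still arrive before the watermark passes their window's end.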
proper way to manage watermarks with messages combining multiple timestamps
Hello,

I'm totally new to Flink, and I'd like to make sure I understand things properly around watermarks.

We're processing messages from IoT devices. Those messages have a timestamp, and we have a first phase of processing based on this timestamp. So far so good.

These messages actually "pack" together several measures taken at different times, typically going from ~15 min back in the past from the message timestamp to a few seconds back. So at some point in the processing, I'll flatMap the message stream into a stream of measures, and I'll first need to re-assign the event time. I guess I can do it using a TimestampAssigner, correct?

The flatMapped stream will now mix together a large range of event times (a span of up to 15 min). What should I do regarding the watermark? Should I regenerate one, and how?

My measures will go through windowed aggregations. Should I use the allowedLateness parameter to manage that properly? (Note: I'm OK with windows firing several times with updated content, if that matters. Our downstream usage is made for that.)

Thanks a lot for your insights and pointers :-)

Mathieu
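A minimal sketch of the approach described above, as a Flink 1.12-style fragment. `Measure`, its `timestampMillis` and `deviceId` fields, the `measures` stream, and `MeasureAggregate` are hypothetical placeholders, and the 15-minute bound is assumed from the ~15 min span mentioned in the question:

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Applied after the flatMap that unpacks messages into measures.
DataStream<Measure> withTimestamps = measures.assignTimestampsAndWatermarks(
    WatermarkStrategy
        // tolerate events up to 15 min older than the newest one seen
        .<Measure>forBoundedOutOfOrderness(Duration.ofMinutes(15))
        // re-assign event time from each unpacked measure
        .withTimestampAssigner((measure, recordTs) -> measure.timestampMillis));

withTimestamps
    .keyBy(m -> m.deviceId)
    .window(TumblingEventTimeWindows.of(Time.hours(1)))
    // let windows fire again for stragglers instead of dropping them
    .allowedLateness(Time.minutes(15))
    .aggregate(new MeasureAggregate());
```

This is a sketch of one plausible setup, not a definitive answer: the bounded-out-of-orderness strategy covers the packing span, while allowedLateness matches the stated tolerance for windows firing several times with updated content.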