Parquet files in streaming mode

2021-12-27 Thread Mathieu D
Hello,

We're trying to use a Parquet file sink to output files to S3.

When running in streaming mode, it seems that Parquet files are flushed and
rolled at each checkpoint. The result is a huge number of very small Parquet
files, which completely defeats the purpose of that format.


Is there a way to build larger output Parquet files? Or is that only possible
at the price of a very large checkpointing interval?

Thanks for your insights.

Mathieu
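
For reference, a minimal sketch of the kind of sink described above, using the StreamingFileSink API of the Flink 1.12 era: bulk-encoded formats such as Parquet can only roll part files on checkpoint, so the checkpoint interval effectively bounds the file size. The Measure POJO, the bucket path, and the 10-minute interval are illustration-only assumptions, not anything from the thread.

```java
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.OnCheckpointRollingPolicy;

public class ParquetSinkSketch {

    // Hypothetical record type; replace with your own POJO or Avro type.
    public static class Measure {
        public String deviceId = "device-1";
        public long timestampMillis;
        public double value;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Bulk formats roll part files only on checkpoint, so the checkpoint
        // interval is the upper bound on how long a single file can fill up.
        env.enableCheckpointing(10 * 60 * 1000L); // e.g. 10 minutes

        DataStream<Measure> measures = env.fromElements(new Measure()); // placeholder source

        StreamingFileSink<Measure> sink = StreamingFileSink
                .forBulkFormat(
                        new Path("s3://my-bucket/measures"),  // hypothetical bucket
                        ParquetAvroWriters.forReflectRecord(Measure.class))
                // The only rolling policy supported for bulk-encoded formats.
                .withRollingPolicy(OnCheckpointRollingPolicy.build())
                .build();

        measures.addSink(sink);
        env.execute("parquet-sink-sketch");
    }
}
```

The usual alternatives to a longer checkpoint interval are compacting the small files in a downstream job, or moving to a newer Flink release where the unified FileSink has built-in small-file compaction.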


hook a callback on checkpointing failure

2021-10-14 Thread Mathieu D
Hey there,

We have some instabilities around checkpointing that we don't quite
understand. In general, as soon as a checkpoint fails, our cluster does not
recover to a proper state.
But to better understand the mechanism, we'd like to be notified as soon as
this happens, so we can jump on our console and try to diagnose the problem.

So, in my mind, we'd simply send a Slack notification to some ops people as
soon as a checkpoint fails.

Is there a way to register a callback in the checkpointing system, so that we
get called as soon as one fails?

[FWIW, our config: Flink 1.12 on YARN/EMR, checkpointing to S3, RocksDB
backend]

Thanks.
Mathieu
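
Not an answer from the list, just a sketch of one possible hook, under assumptions: a user function can implement CheckpointListener, which in recent Flink versions also exposes a best-effort notifyCheckpointAborted() callback invoked per subtask. The sendSlackAlert() helper below is hypothetical; watching the job's checkpoint metrics (for example numberOfFailedCheckpoints) through a metric reporter or the REST API is the more conventional alerting route.

```java
// CheckpointListener lives in org.apache.flink.api.common.state in recent Flink
// versions; older releases have it under org.apache.flink.runtime.state.
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.CheckpointListener;

// Pass-through operator whose only purpose is to receive checkpoint notifications.
public class CheckpointAlertingMap<T> extends RichMapFunction<T, T> implements CheckpointListener {

    @Override
    public T map(T value) {
        return value;
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId) {
        // Nothing to do on success.
    }

    @Override
    public void notifyCheckpointAborted(long checkpointId) {
        // Best-effort notification, delivered to each subtask when a checkpoint is aborted.
        sendSlackAlert("Checkpoint " + checkpointId + " aborted on subtask "
                + getRuntimeContext().getIndexOfThisSubtask());
    }

    // Hypothetical helper: post to a Slack webhook, page ops, or just log loudly.
    private void sendSlackAlert(String message) {
        System.err.println(message);
    }
}
```

Usage would simply be adding `.map(new CheckpointAlertingMap<>())` somewhere in the topology; since the callback fires on every subtask, the alerting side should de-duplicate.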


Re: proper way to manage watermarks with messages combining multiple timestamps

2021-04-18 Thread Mathieu D
Hi,

I can't change the way devices send their data. We are constrained in the
number of messages sent per day per device.

To illustrate my question:
- at 9:08 a message is emitted, packing together several measures:
  - measure m1, taken at 8:52
  - measure m2, taken at 9:07

m1 must go into the 8:00-9:00 aggregation,
m2 into the 9:00-10:00 aggregation.

What's the proper way to set the watermarks in such a case?

Thanks for your insights!

Mathieu

On Sat, Apr 17, 2021 at 07:05, Lasse Nedergaard <lassenedergaardfl...@gmail.com>
wrote:

> Hi
>
> One thing to remember is that Flink's watermark is global, which means it's
> shared between all keys (in your case IoT devices). So the first requirement
> you have is to ensure the timestamps are aligned, or almost aligned, between
> your IoT devices; otherwise Flink's watermark is hard to use.
>
> Med venlig hilsen / Best regards
> Lasse Nedergaard
>


proper way to manage watermarks with messages combining multiple timestamps

2021-04-16 Thread Mathieu D
Hello,

I'm totally new to Flink, and I'd like to make sure I understand watermarks
properly.

We're processing messages from IoT devices.
Those messages have a timestamp, and we have a first phase of processing
based on this timestamp. So far so good.

These messages actually "pack" together several measures taken at different
times, typically ranging from about 15 minutes before the message timestamp to
a few seconds before it.

So at some point in the processing, I'll flatMap the message stream into a
stream of measures, and I'll first need to reassign the event time. I guess
I can do that using a TimestampAssigner, correct?

The flatMapped stream will now mix together a large range of event times
(a span of about 15 minutes). What should I do regarding the watermark?
Should I regenerate it, and how?

My measures will go through windowed aggregations. Should I use the
allowedLateness parameter to manage that properly?
(Note: I'm OK with windows firing several times with updated content, if
that matters; our downstream usage is designed for that.)

Thanks a lot for your insights and pointers :-)

Mathieu
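
Not an authoritative answer, but an end-to-end sketch of the approach described above, under assumed types (DeviceMessage, Measure) and a placeholder source: flatMap the packed messages into individual measures, reassign timestamps and regenerate the watermark on the flatMapped stream with a ~15-minute bound, then run event-time windowed aggregations with allowedLateness as the knob for late updates.

```java
import java.time.Duration;
import java.util.List;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class PackedMeasuresPipeline {

    // Hypothetical message as sent by a device.
    public static class DeviceMessage {
        public String deviceId;
        public long messageTimestampMillis;
        public List<Measure> measures;
    }

    // Hypothetical individual measure carried inside a message.
    public static class Measure {
        public String deviceId;
        public long takenAtMillis;   // up to ~15 minutes before the message timestamp
        public double value;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<DeviceMessage> messages = env.fromElements(new DeviceMessage()); // placeholder source

        DataStream<Measure> measures = messages
                // 1) Unpack each message into its individual measures.
                .flatMap((DeviceMessage msg, Collector<Measure> out) -> {
                    if (msg.measures != null) {
                        msg.measures.forEach(out::collect);
                    }
                })
                .returns(Measure.class)
                // 2) Reassign event time per measure and regenerate the watermark,
                //    allowing for the ~15-minute spread inside a single message.
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy
                                .<Measure>forBoundedOutOfOrderness(Duration.ofMinutes(15))
                                .withTimestampAssigner((m, ts) -> m.takenAtMillis));

        // 3) Event-time windowed aggregation; allowedLateness lets an already-fired
        //    window emit an updated result if a straggler still shows up.
        measures
                .keyBy(m -> m.deviceId)
                .window(TumblingEventTimeWindows.of(Time.hours(1)))
                .allowedLateness(Time.minutes(10))
                .sum("value")
                .print();

        env.execute("packed-measures-sketch");
    }
}
```

In this sketch the TimestampAssigner handles the "reassign the event time" step, the bounded-out-of-orderness strategy regenerates the watermark after the flatMap, and allowedLateness only matters if measures can lag more than the 15-minute bound.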