Hi all - following up on this, I think I've identified my problem as being something else, but I'd appreciate it if anyone can offer advice.
After running my stream for some time, I see that garbage collection for the old generation starts to take a very long time:

[image: Screen Shot 2019-03-02 at 3.01.57 PM.png]

Here the *purple line* is young-generation time (ever increasing, but growing slowly), while the *blue line* is old-generation time. This in itself is not a problem, but as soon as the next checkpoint is triggered after this happens, you see the following:

[image: Screen Shot 2019-03-02 at 3.02.48 PM.png]

It looks like the checkpoint hits a cap, but this is only because the checkpoints start to time out and fail (the values shown are the alignment times per operator).

I do notice that my state grows quite a lot larger over time, but I don't have a good understanding of what would cause this behaviour in the JVM old-generation metric, which appears to be the leading indicator before a problem is noticed. Other metrics, such as network buffers, also show that at checkpoint time things start to go haywire, and the situation never recovers.

Thanks

On Thu, Feb 28, 2019 at 5:50 PM Padarn Wilson <pad...@gmail.com> wrote:

> Hi all,
>
> I'm trying to process many records, and I have an expensive operation I'm
> trying to optimize. Simplified, it is something like:
>
> Data: (key1, count, time)
>
> Source -> Map(x => (x, newKeyList(x.key1)))
>        -> FlatMap(x => x._2.map(y => (y, x._1.count, x._1.time)))
>        -> KeyBy(_.key1).TumblingWindow().apply(...)
>        -> Sink
>
> In the Map -> FlatMap, what is happening is that each key maps to a set
> of keys, and this set then becomes the new key. This effectively increases
> the size of the stream by 16x.
>
> What I am trying to figure out is how to set the parallelism of my
> operators. I see in some comments that people suggest your source, sink,
> and aggregation should have different parallelism, but I'm not clear on
> exactly why, or what this means for CPU utilization.
> (See, for example,
> https://stackoverflow.com/questions/48317438/unable-to-achieve-high-cpu-utilization-with-flink-and-gelly
> )
>
> Also, it isn't clear to me the best way to handle this increase in data
> within the stream itself.
>
> Thanks
>
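For anyone skimming the quoted job, the Map -> FlatMap fan-out can be sketched in plain Python. This is a hypothetical simulation, not Flink code; `new_key_list` and the 16-key expansion factor are assumptions based on the "16x" figure mentioned above:

```python
# Hypothetical sketch of the key fan-out in the quoted job: each record
# (key1, count, time) is expanded into one record per derived key.

def new_key_list(key1):
    # Assumption: each input key derives 16 new keys (the "16x" above).
    return [f"{key1}-{i}" for i in range(16)]

def expand(records):
    """Simulates Map(x => (x, newKeyList(x.key1))) followed by the FlatMap."""
    for key1, count, time in records:
        for new_key in new_key_list(key1):
            yield (new_key, count, time)

records = [("a", 3, 100), ("b", 5, 200)]
expanded = list(expand(records))
print(len(expanded))  # 32 records from 2 inputs, i.e. a 16x increase
```

The point of the sketch is just that record volume (and hence keyed state and network shuffle volume after the KeyBy) is multiplied before the window, which is where parallelism and memory pressure decisions start to matter.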