Hi everyone!!

On our pipeline we're experiencing OOM errors. It's been a while since
we've been trying to nail them down but without any luck.

Our pipeline:
1. Reads messages from PubSub of different types (about 50 different types).
2. Outputs KV elements being the key the type and the value the element
message itself (JSON)
3. Applies windowing (fixed windows of one our. With early and late firings
after one minute after processing the first element in pane). Two days
allowed lateness and discarding fired panes.
4. Buffers the elements (using stateful and timely processing). We buffer
the elements for 15 minutes or until it reaches a maximum size of 16Mb.
This step's objective is to avoid window's panes grow too big.
5. Writes the outputted elements to files in GCS using dynamic routing.
Being the route: type/window/buffer_index-paneInfo.

Our fist step was without buffering and just windows with early and late
firings on 15 minutes, but. guessing that OOMs were because of panes
growing too big we built that buffering step to trigger on size as well.

The full trace we're seeing can be found here: https://pastebin.com/0vfE6pUg

There's an annoying thing we've experienced and is that, the smaller the
buffer size, the more OOM errors we see which was a bit disappointing...

Can anyone please give us any hint?

Thanks in advance!

Reply via email to