Hi everyone!! On our pipeline we're experiencing OOM errors. It's been a while since we've been trying to nail them down but without any luck.
Our pipeline: 1. Reads messages from PubSub of different types (about 50 different types). 2. Outputs KV elements being the key the type and the value the element message itself (JSON) 3. Applies windowing (fixed windows of one our. With early and late firings after one minute after processing the first element in pane). Two days allowed lateness and discarding fired panes. 4. Buffers the elements (using stateful and timely processing). We buffer the elements for 15 minutes or until it reaches a maximum size of 16Mb. This step's objective is to avoid window's panes grow too big. 5. Writes the outputted elements to files in GCS using dynamic routing. Being the route: type/window/buffer_index-paneInfo. Our fist step was without buffering and just windows with early and late firings on 15 minutes, but. guessing that OOMs were because of panes growing too big we built that buffering step to trigger on size as well. The full trace we're seeing can be found here: https://pastebin.com/0vfE6pUg There's an annoying thing we've experienced and is that, the smaller the buffer size, the more OOM errors we see which was a bit disappointing... Can anyone please give us any hint? Thanks in advance!
