The code for doing dynamic writes tries to limit its memory usage, so it wouldn't be my first place to look (but I wouldn't rule it out). Are you using workers with a large number of cores or threads per worker?
In MAT or YourKit, the most useful tool is the Dominator Tree. Can you paste a screenshot of the dominator tree expanded to some reasonable depth? On Mon, Jan 29, 2018, 10:45 AM Carlos Alonso <[email protected]> wrote: > Hi Eugene! > > Thanks for your comments. Yes, we've downloaded a couple of dumps, but > TBH, couldn't understand anything (using the Eclipse MAT), I was wondering > if the trace and the fact that "the smaller the buffer, the more OOM > errors" could give any of you a hint as I think it may be on the writing > part... > > Do you know how the dynamic writes are distributed on workers? Based on > the full path? > > Regards > > On Mon, Jan 29, 2018 at 7:38 PM Eugene Kirpichov <[email protected]> > wrote: > >> Hi, I assume you're using the Dataflow runner. Have you tried using the >> OOM debugging flags at >> https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/options/DataflowPipelineDebugOptions.java#L169-L193 >> ? >> >> On Mon, Jan 29, 2018 at 2:05 AM Carlos Alonso <[email protected]> >> wrote: >> >>> Hi everyone!! >>> >>> On our pipeline we're experiencing OOM errors. It's been a while since >>> we've been trying to nail them down but without any luck. >>> >>> Our pipeline: >>> 1. Reads messages from PubSub of different types (about 50 different >>> types). >>> 2. Outputs KV elements being the key the type and the value the element >>> message itself (JSON) >>> 3. Applies windowing (fixed windows of one our. With early and late >>> firings after one minute after processing the first element in pane). Two >>> days allowed lateness and discarding fired panes. >>> 4. Buffers the elements (using stateful and timely processing). We >>> buffer the elements for 15 minutes or until it reaches a maximum size of >>> 16Mb. This step's objective is to avoid window's panes grow too big. >>> 5. Writes the outputted elements to files in GCS using dynamic routing. >>> Being the route: type/window/buffer_index-paneInfo. >>> >>> Our fist step was without buffering and just windows with early and late >>> firings on 15 minutes, but. guessing that OOMs were because of panes >>> growing too big we built that buffering step to trigger on size as well. >>> >>> The full trace we're seeing can be found here: >>> https://pastebin.com/0vfE6pUg >>> >>> There's an annoying thing we've experienced and is that, the smaller the >>> buffer size, the more OOM errors we see which was a bit disappointing... >>> >>> Can anyone please give us any hint? >>> >>> Thanks in advance! >>> >>
