The code for doing dynamic writes tries to limit its memory usage, so it
wouldn't be my first place to look (but I wouldn't rule it out). Are you
using workers with a large number of cores or threads per worker?

In MAT or YourKit, the most useful tool is the Dominator Tree. Can you
paste a screenshot of the dominator tree expanded to some reasonable depth?

On Mon, Jan 29, 2018, 10:45 AM Carlos Alonso <[email protected]> wrote:

> Hi Eugene!
>
> Thanks for your comments. Yes, we've downloaded a couple of dumps, but
> TBH, couldn't understand anything (using the Eclipse MAT), I was wondering
> if the trace and the fact that "the smaller the buffer, the more OOM
> errors" could give any of you a hint as I think it may be on the writing
> part...
>
> Do you know how the dynamic writes are distributed on workers? Based on
> the full path?
>
> Regards
>
> On Mon, Jan 29, 2018 at 7:38 PM Eugene Kirpichov <[email protected]>
> wrote:
>
>> Hi, I assume you're using the Dataflow runner. Have you tried using the
>> OOM debugging flags at
>> https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/options/DataflowPipelineDebugOptions.java#L169-L193
>>  ?
>>
>> On Mon, Jan 29, 2018 at 2:05 AM Carlos Alonso <[email protected]>
>> wrote:
>>
>>> Hi everyone!!
>>>
>>> On our pipeline we're experiencing OOM errors. It's been a while since
>>> we've been trying to nail them down but without any luck.
>>>
>>> Our pipeline:
>>> 1. Reads messages from PubSub of different types (about 50 different
>>> types).
>>> 2. Outputs KV elements being the key the type and the value the element
>>> message itself (JSON)
>>> 3. Applies windowing (fixed windows of one our. With early and late
>>> firings after one minute after processing the first element in pane). Two
>>> days allowed lateness and discarding fired panes.
>>> 4. Buffers the elements (using stateful and timely processing). We
>>> buffer the elements for 15 minutes or until it reaches a maximum size of
>>> 16Mb. This step's objective is to avoid window's panes grow too big.
>>> 5. Writes the outputted elements to files in GCS using dynamic routing.
>>> Being the route: type/window/buffer_index-paneInfo.
>>>
>>> Our fist step was without buffering and just windows with early and late
>>> firings on 15 minutes, but. guessing that OOMs were because of panes
>>> growing too big we built that buffering step to trigger on size as well.
>>>
>>> The full trace we're seeing can be found here:
>>> https://pastebin.com/0vfE6pUg
>>>
>>> There's an annoying thing we've experienced and is that, the smaller the
>>> buffer size, the more OOM errors we see which was a bit disappointing...
>>>
>>> Can anyone please give us any hint?
>>>
>>> Thanks in advance!
>>>
>>

Reply via email to