I forgot to mention that my jobs are all batch (at the moment). Do you think that this problem could be related to these two posts?

- http://www.evanjones.ca/java-bytebuffer-leak.html#comment-3240054880
- http://www.evanjones.ca/java-native-leak-bug.html

Kurt also told me to add "env.java.opts: -Dio.netty.recycler.maxCapacity.default=1".
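To double-check whether that option actually reaches the TaskManager JVMs, and whether it is really the direct (netty) buffers that grow rather than the heap, I'm thinking of logging something like the sketch below from user code running inside a TM (the same MXBeans can also be read remotely over JMX). This is only a rough illustration; the class and method names are mine, not from Flink:

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.List;

public class MemoryProbe {

    public static void dump() {
        // Heap vs. non-heap as seen by the JVM
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        System.out.println("heap     = " + mem.getHeapMemoryUsage());
        System.out.println("non-heap = " + mem.getNonHeapMemoryUsage());

        // NIO buffer pools: this is where direct ByteBuffers (e.g. netty's) show up
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            System.out.println(pool.getName() + ": count=" + pool.getCount()
                    + ", used=" + pool.getMemoryUsed()
                    + ", capacity=" + pool.getTotalCapacity());
        }

        // Check that the recycler option really reached this JVM
        System.out.println("io.netty.recycler.maxCapacity.default = "
                + System.getProperty("io.netty.recycler.maxCapacity.default"));
    }
}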
Best,
Flavio

On Tue, Jun 6, 2017 at 7:42 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

> Hi Stephan,
> I also think that the error is more related to netty. The only suspicious libraries I use are parquet and thrift.
> I'm not using off-heap memory.
> What do you mean by "crazy high number of concurrent network shuffles"? How can I count that?
> We're using Java 8.
>
> Thanks a lot,
> Flavio
>
> On 6 Jun 2017 7:13 pm, "Stephan Ewen" <se...@apache.org> wrote:
>
> Hi!
>
> I would actually be surprised if this is an issue in core Flink.
>
> - The MaxDirectMemory parameter is pretty meaningless: it really is a maximum and has no impact on how much is actually allocated.
> - In most cases reported so far, the leak was in a library used in the user code.
> - If you do not use off-heap memory in Flink, then there are few other culprits that can cause high virtual memory consumption:
>   - Netty, if you bumped the Netty version in a custom build
>   - Flink's Netty, if the job has a crazy high number of concurrent network shuffles (we are talking 1000s here)
>   - Some old Java versions have I/O memory leaks (I think some older Java 6 and Java 7 versions were affected)
>
> To diagnose that better:
>
> - Are these batch or streaming jobs?
> - If it is streaming, which state backend are you using?
>
> Stephan
>
> On Tue, Jun 6, 2017 at 12:00 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi Flavio,
>>
>> can you post all the memory configuration parameters of your workers?
>> Did you investigate whether the direct or the heap memory grew?
>>
>> Thanks,
>> Fabian
>>
>> 2017-05-29 20:53 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>
>>> Hi to all,
>>> I'm still trying to understand what's going on in our production Flink cluster. The facts are:
>>>
>>> 1. The Flink cluster runs on 5 VMware VMs managed by ESXi.
>>> 2. For one specific job we have, without limiting the direct memory to 5 GB, the TM gets killed by the OS almost immediately because the memory required by the TM at some point becomes huge, like > 100 GB (other jobs seem to be less affected by the problem).
>>> 3. Although the memory consumption is much better this way, the Flink TM memory continuously grows job after job (of this problematic type): we set the TM max heap to 14 GB and the memory required by the JVM can reach ~30 GB. How is that possible?
>>>
>>> My fear is that there's some annoying memory leak / bad memory allocation at the Flink network level, but I don't have any evidence of this (except the fact that the VM which doesn't have an HDFS datanode underneath the Flink TM is the one with the biggest TM virtual memory consumption).
>>>
>>> Thanks for the help,
>>> Flavio
>>>
>>> On 29 May 2017 15:37, "Nico Kruber" <n...@data-artisans.com> wrote:
>>>
>>>> FYI: taskmanager.sh sets this parameter but also states the following:
>>>>
>>>> # Long.MAX_VALUE in TB: This is an upper bound, much less direct memory will be used
>>>> TM_MAX_OFFHEAP_SIZE="8388607T"
>>>>
>>>> Nico
>>>>
>>>> On Monday, 29 May 2017 15:19:47 CEST Aljoscha Krettek wrote:
>>>> > Hi Flavio,
>>>> >
>>>> > Is this running on YARN or bare metal? Did you manage to find out where this insanely large parameter is coming from?
>>>> >
>>>> > Best,
>>>> > Aljoscha
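As an aside, one way to see where such a value ends up on a running TM is to ask the JVM itself for its launch arguments and the effective -XX:MaxDirectMemorySize. A minimal sketch, assuming a HotSpot JDK 7/8 where the com.sun.management extensions are available (the class name is made up):

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class DirectMemoryCheck {

    public static void main(String[] args) {
        // The flags that were actually passed to this JVM (e.g. by taskmanager.sh)
        System.out.println("JVM args: "
                + ManagementFactory.getRuntimeMXBean().getInputArguments());

        // The effective -XX:MaxDirectMemorySize; on HotSpot, "0" effectively means "same as the max heap"
        HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        System.out.println("MaxDirectMemorySize = "
                + hotspot.getVMOption("MaxDirectMemorySize").getValue());
    }
}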
>>>> >
>>>> > > On 25. May 2017, at 19:36, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>> > >
>>>> > > Hi to all,
>>>> > > I think we found the root cause of all the problems. Looking at dmesg, there was a "crazy" total-vm size associated with the OOM error, a LOT bigger than the TaskManager's available memory. In our case, the TM had a max heap of 14 GB while the dmesg error was reporting a required amount of memory in the order of 60 GB!
>>>> > >
>>>> > > [ 5331.992539] Out of memory: Kill process 24221 (java) score 937 or sacrifice child
>>>> > > [ 5331.992619] Killed process 24221 (java) total-vm:64800680kB, anon-rss:31387544kB, file-rss:6064kB, shmem-rss:0kB
>>>> > >
>>>> > > That shouldn't be possible with an ordinary JVM (and our TM was running without off-heap settings), so we looked at the parameters used to run the TM JVM, and indeed a really huge amount of memory is given to MaxDirectMemorySize. To my big surprise, Flink runs a TM with this parameter set to 8,388,607 TB... does that make any sense?? Is the importance of this parameter documented anywhere (and why it is used in non-off-heap mode as well)? Is it related to network buffers? It should also be documented that this parameter should be added to the TM heap when reserving memory for Flink (IMHO).
>>>> > >
>>>> > > I hope these painful sessions of Flink troubleshooting will be of added value sooner or later..
>>>> > >
>>>> > > Best,
>>>> > > Flavio
>>>> > >
>>>> > > On Thu, May 25, 2017 at 10:21 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>> > > I can confirm that after giving less memory to the Flink TM the job was able to run successfully. After almost 2 weeks of pain, we summarize here our experience with Flink in virtualized environments (such as VMware ESXi):
>>>> > >
>>>> > > - Disable the virtualization "feature" that transfers a VM from a (heavily loaded) physical machine to another one (to balance the resource consumption).
>>>> > > - Check dmesg when a TM dies without logging anything (usually it goes OOM and the OS kills it, and that's where you find the log of it).
>>>> > > - CentOS 7 on ESXi seems to start swapping VERY early (in my case I see the OS start swapping even with 12 out of 32 GB of free memory)!
>>>> > >
>>>> > > We're still investigating how this behavior could be fixed: the problem is that it's better not to disable swapping, because otherwise VMware could start ballooning (which is definitely worse...).
>>>> > >
>>>> > > I hope these tips could save someone else's day..
>>>> > >
>>>> > > Best,
>>>> > > Flavio
>>>> > >
>>>> > > On Wed, May 24, 2017 at 4:28 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>> > > Hi Greg, you were right! After typing dmesg I found "Out of memory: Kill process 13574 (java)". This is really strange because the JVM of the TM is very calm.
>>>> > > Moreover, there are 7 GB of memory available (out of 32), but somehow the OS decides to start swapping and, when it runs out of available swap memory, the OS decides to kill the Flink TM :(
>>>> > >
>>>> > > Any idea of what's going on here?
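To correlate TM deaths with swap pressure, a tiny probe along these lines could be run or scheduled next to the TM. A rough sketch, assuming a Linux host where /proc is readable; the class name and the selected fields are just an illustration:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SwapProbe {

    public static void main(String[] args) throws IOException {
        // Kernel swappiness: lower values make the kernel less eager to swap
        System.out.println("vm.swappiness = "
                + new String(Files.readAllBytes(Paths.get("/proc/sys/vm/swappiness"))).trim());

        // Free memory, page cache and swap as the kernel sees them
        for (String line : Files.readAllLines(Paths.get("/proc/meminfo"))) {
            if (line.startsWith("MemFree") || line.startsWith("Cached")
                    || line.startsWith("SwapTotal") || line.startsWith("SwapFree")) {
                System.out.println(line);
            }
        }
    }
}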
>>>> > >
>>>> > > On Wed, May 24, 2017 at 2:32 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>> > > Hi Greg,
>>>> > > I carefully monitored all TM memory with jstat -gcutil and there's no full GC, only young collections.
>>>> > > The initial situation on the dying TM is:
>>>> > >
>>>> > >   S0     S1      E      O      M     CCS    YGC    YGCT   FGC   FGCT     GCT
>>>> > >   0.00 100.00  33.57  88.74  98.42  97.17    159   2.508    1   0.255   2.763
>>>> > >   0.00 100.00  90.14  88.80  98.67  97.17    197   2.617    1   0.255   2.873
>>>> > >   0.00 100.00  27.00  88.82  98.75  97.17    234   2.730    1   0.255   2.986
>>>> > >
>>>> > > After about 10 hours of processing it is:
>>>> > >
>>>> > >   0.00 100.00  21.74  83.66  98.52  96.94   5519  33.011    1   0.255  33.267
>>>> > >   0.00 100.00  21.74  83.66  98.52  96.94   5519  33.011    1   0.255  33.267
>>>> > >   0.00 100.00  21.74  83.66  98.52  96.94   5519  33.011    1   0.255  33.267
>>>> > >
>>>> > > So I don't think that OOM is the explanation.
>>>> > >
>>>> > > However, the cluster is running on ESXi vSphere VMs and we have already experienced unexpected job crashes because of ESXi moving a heavily loaded VM to another (less loaded) physical machine.. I wouldn't be surprised if swapping is also handled somehow differently.. Looking at the Cloudera widgets, I see that the crash is usually preceded by an intense cpu_iowait period. I fear that Flink's unsafe memory access could be a problem in those scenarios. Am I wrong?
>>>> > >
>>>> > > Any insight or debugging technique is greatly appreciated.
>>>> > > Best,
>>>> > > Flavio
>>>> > >
>>>> > > On Wed, May 24, 2017 at 2:11 PM, Greg Hogan <c...@greghogan.com> wrote:
>>>> > > Hi Flavio,
>>>> > >
>>>> > > Flink handles interrupts, so the only silent killer I am aware of is Linux's OOM killer. Are you seeing such a message in dmesg?
>>>> > >
>>>> > > Greg
>>>> > >
>>>> > > On Wed, May 24, 2017 at 3:18 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>> > > Hi to all,
>>>> > > I'd like to know whether memory swapping could cause a taskmanager crash. In my cluster of virtual machines I'm seeing this strange behavior in my Flink cluster: sometimes, if memory gets swapped, the taskmanager on that machine dies unexpectedly without any log about the error.
>>>> > >
>>>> > > Is that possible or not?
>>>> > >
>>>> > > Best,
>>>> > > Flavio
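For completeness: the young/full collection counters that jstat -gcutil reports above can also be read from inside the TM (or scraped over JMX) via the standard GC MXBeans. A minimal sketch with a made-up class name; the reported collector names depend on which GC the JVM uses:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcProbe {

    public static void dump() {
        // One bean per collector, e.g. "PS Scavenge" (young) and "PS MarkSweep" (full) with the default JDK 8 collector
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
    }
}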