I forgot to mention that my jobs are all batch (at the moment). Do you think that this problem could be related to these two posts?

- http://www.evanjones.ca/java-bytebuffer-leak.html#comment-3240054880
- http://www.evanjones.ca/java-native-leak-bug.html

Kurt also told me to add "env.java.opts: -Dio.netty.recycler.maxCapacity.default=1".
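To double-check whether that option actually reaches the TaskManager JVMs, and whether it is really the direct (netty) buffers that grow rather than the heap, I'm thinking of logging something like the sketch below from user code running inside a TM (the same MXBeans can also be read remotely over JMX). This is only a rough illustration; the class and method names are mine, not from Flink:

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.List;

public class MemoryProbe {

    public static void dump() {
        // Heap vs. non-heap as seen by the JVM
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        System.out.println("heap     = " + mem.getHeapMemoryUsage());
        System.out.println("non-heap = " + mem.getNonHeapMemoryUsage());

        // NIO buffer pools: this is where direct ByteBuffers (e.g. netty's) show up
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            System.out.println(pool.getName() + ": count=" + pool.getCount()
                    + ", used=" + pool.getMemoryUsed()
                    + ", capacity=" + pool.getTotalCapacity());
        }

        // Check that the recycler option really reached this JVM
        System.out.println("io.netty.recycler.maxCapacity.default = "
                + System.getProperty("io.netty.recycler.maxCapacity.default"));
    }
}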
Best,
Flavio

On Tue, Jun 6, 2017 at 7:42 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

> Hi Stephan,
> I also think that the error is more related to netty. The only suspicious libraries I use are parquet and thrift.
> I'm not using off-heap memory.
> What do you mean by "crazy high number of concurrent network shuffles"? How can I count that?
> We're using Java 8.
>
> Thanks a lot,
> Flavio
>
> On 6 Jun 2017 7:13 pm, "Stephan Ewen" <se...@apache.org> wrote:
>
> Hi!
>
> I would actually be surprised if this is an issue in core Flink.
>
> - The MaxDirectMemory parameter is pretty meaningless: it really is a maximum and has no impact on how much is actually allocated.
> - In most cases reported so far, the leak was in a library used in the user code.
> - If you do not use off-heap memory in Flink, then there are few other culprits that can cause high virtual memory consumption:
>   - Netty, if you bumped the Netty version in a custom build
>   - Flink's Netty, if the job has a crazy high number of concurrent network shuffles (we are talking 1000s here)
>   - Some old Java versions have I/O memory leaks (I think some older Java 6 and Java 7 versions were affected)
>
> To diagnose that better:
>
> - Are these batch or streaming jobs?
> - If it is streaming, which state backend are you using?
>
> Stephan
>
> On Tue, Jun 6, 2017 at 12:00 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi Flavio,
>>
>> can you post all the memory configuration parameters of your workers?
>> Did you investigate whether the direct or the heap memory grew?
>>
>> Thanks,
>> Fabian
>>
>> 2017-05-29 20:53 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>
>>> Hi to all,
>>> I'm still trying to understand what's going on in our production Flink cluster. The facts are:
>>>
>>> 1. The Flink cluster runs on 5 VMware VMs managed by ESXi.
>>> 2. For one specific job we have, without limiting the direct memory to 5 GB, the TM gets killed by the OS almost immediately because the memory required by the TM at some point becomes huge, like > 100 GB (other jobs seem to be less affected by the problem).
>>> 3. Although the memory consumption is much better this way, the Flink TM memory continuously grows job after job (of this problematic type): we set the TM max heap to 14 GB and the memory required by the JVM can reach ~30 GB. How is that possible?
>>>
>>> My fear is that there's some annoying memory leak / bad memory allocation at the Flink network level, but I don't have any evidence of this (except the fact that the VM which doesn't have an HDFS datanode underneath the Flink TM is the one with the biggest TM virtual memory consumption).
>>>
>>> Thanks for the help,
>>> Flavio
>>>
>>> On 29 May 2017 15:37, "Nico Kruber" <n...@data-artisans.com> wrote:
>>>
>>>> FYI: taskmanager.sh sets this parameter but also states the following:
>>>>
>>>> # Long.MAX_VALUE in TB: This is an upper bound, much less direct memory will be used
>>>> TM_MAX_OFFHEAP_SIZE="8388607T"
>>>>
>>>> Nico
>>>>
>>>> On Monday, 29 May 2017 15:19:47 CEST Aljoscha Krettek wrote:
>>>> > Hi Flavio,
>>>> >
>>>> > Is this running on YARN or bare metal? Did you manage to find out where this insanely large parameter is coming from?
>>>> >
>>>> > Best,
>>>> > Aljoscha
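As an aside, one way to see where such a value ends up on a running TM is to ask the JVM itself for its launch arguments and the effective -XX:MaxDirectMemorySize. A minimal sketch, assuming a HotSpot JDK 7/8 where the com.sun.management extensions are available (the class name is made up):

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class DirectMemoryCheck {

    public static void main(String[] args) {
        // The flags that were actually passed to this JVM (e.g. by taskmanager.sh)
        System.out.println("JVM args: "
                + ManagementFactory.getRuntimeMXBean().getInputArguments());

        // The effective -XX:MaxDirectMemorySize; on HotSpot, "0" effectively means "same as the max heap"
        HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        System.out.println("MaxDirectMemorySize = "
                + hotspot.getVMOption("MaxDirectMemorySize").getValue());
    }
}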
>>>> >
>>>> > > On 25. May 2017, at 19:36, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>> > >
>>>> > > Hi to all,
>>>> > > I think we found the root cause of all the problems. Looking at dmesg, there was a "crazy" total-vm size associated with the OOM error, a LOT bigger than the TaskManager's available memory. In our case, the TM had a max heap of 14 GB while the dmesg error was reporting a required amount of memory in the order of 60 GB!
>>>> > >
>>>> > > [ 5331.992539] Out of memory: Kill process 24221 (java) score 937 or sacrifice child
>>>> > > [ 5331.992619] Killed process 24221 (java) total-vm:64800680kB, anon-rss:31387544kB, file-rss:6064kB, shmem-rss:0kB
>>>> > >
>>>> > > That shouldn't be possible with an ordinary JVM (and our TM was running without off-heap settings), so we looked at the parameters used to run the TM JVM, and indeed a really huge amount of memory is given to MaxDirectMemorySize. To my big surprise, Flink runs a TM with this parameter set to 8,388,607 TB... does that make any sense?? Is the importance of this parameter documented anywhere (and why it is used in non-off-heap mode as well)? Is it related to network buffers? It should also be documented that this parameter should be added to the TM heap when reserving memory for Flink (IMHO).
>>>> > >
>>>> > > I hope these painful sessions of Flink troubleshooting will be of added value sooner or later..
>>>> > >
>>>> > > Best,
>>>> > > Flavio
>>>> > >
>>>> > > On Thu, May 25, 2017 at 10:21 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>> > > I can confirm that after giving less memory to the Flink TM the job was able to run successfully. After almost 2 weeks of pain, we summarize here our experience with Flink in virtualized environments (such as VMware ESXi):
>>>> > >
>>>> > > - Disable the virtualization "feature" that transfers a VM from a (heavily loaded) physical machine to another one (to balance the resource consumption).
>>>> > > - Check dmesg when a TM dies without logging anything (usually it goes OOM and the OS kills it, and that's where you find the log of it).
>>>> > > - CentOS 7 on ESXi seems to start swapping VERY early (in my case I see the OS start swapping even with 12 out of 32 GB of free memory)!
>>>> > >
>>>> > > We're still investigating how this behavior could be fixed: the problem is that it's better not to disable swapping, because otherwise VMware could start ballooning (which is definitely worse...).
>>>> > >
>>>> > > I hope these tips could save someone else's day..
>>>> > >
>>>> > > Best,
>>>> > > Flavio
>>>> > >
>>>> > > On Wed, May 24, 2017 at 4:28 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>> > > Hi Greg, you were right! After typing dmesg I found "Out of memory: Kill process 13574 (java)". This is really strange because the JVM of the TM is very calm.
>>>> > > Moreover, there are 7 GB of memory available (out of 32), but somehow the OS decides to start swapping and, when it runs out of available swap memory, the OS decides to kill the Flink TM :(
>>>> > >
>>>> > > Any idea of what's going on here?
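To correlate TM deaths with swap pressure, a tiny probe along these lines could be run or scheduled next to the TM. A rough sketch, assuming a Linux host where /proc is readable; the class name and the selected fields are just an illustration:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SwapProbe {

    public static void main(String[] args) throws IOException {
        // Kernel swappiness: lower values make the kernel less eager to swap
        System.out.println("vm.swappiness = "
                + new String(Files.readAllBytes(Paths.get("/proc/sys/vm/swappiness"))).trim());

        // Free memory, page cache and swap as the kernel sees them
        for (String line : Files.readAllLines(Paths.get("/proc/meminfo"))) {
            if (line.startsWith("MemFree") || line.startsWith("Cached")
                    || line.startsWith("SwapTotal") || line.startsWith("SwapFree")) {
                System.out.println(line);
            }
        }
    }
}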
>>>> > >
>>>> > > On Wed, May 24, 2017 at 2:32 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>> > > Hi Greg,
>>>> > > I carefully monitored all TM memory with jstat -gcutil and there's no full GC, only young collections.
>>>> > > The initial situation on the dying TM is:
>>>> > >
>>>> > >   S0     S1      E      O      M     CCS    YGC    YGCT   FGC   FGCT     GCT
>>>> > >   0.00 100.00  33.57  88.74  98.42  97.17    159   2.508    1   0.255   2.763
>>>> > >   0.00 100.00  90.14  88.80  98.67  97.17    197   2.617    1   0.255   2.873
>>>> > >   0.00 100.00  27.00  88.82  98.75  97.17    234   2.730    1   0.255   2.986
>>>> > >
>>>> > > After about 10 hours of processing it is:
>>>> > >
>>>> > >   0.00 100.00  21.74  83.66  98.52  96.94   5519  33.011    1   0.255  33.267
>>>> > >   0.00 100.00  21.74  83.66  98.52  96.94   5519  33.011    1   0.255  33.267
>>>> > >   0.00 100.00  21.74  83.66  98.52  96.94   5519  33.011    1   0.255  33.267
>>>> > >
>>>> > > So I don't think that OOM is the explanation.
>>>> > >
>>>> > > However, the cluster is running on ESXi vSphere VMs and we have already experienced unexpected job crashes because of ESXi moving a heavily loaded VM to another (less loaded) physical machine.. I wouldn't be surprised if swapping is also handled somehow differently.. Looking at the Cloudera widgets, I see that the crash is usually preceded by an intense cpu_iowait period. I fear that Flink's unsafe memory access could be a problem in those scenarios. Am I wrong?
>>>> > >
>>>> > > Any insight or debugging technique is greatly appreciated.
>>>> > > Best,
>>>> > > Flavio
>>>> > >
>>>> > > On Wed, May 24, 2017 at 2:11 PM, Greg Hogan <c...@greghogan.com> wrote:
>>>> > > Hi Flavio,
>>>> > >
>>>> > > Flink handles interrupts, so the only silent killer I am aware of is Linux's OOM killer. Are you seeing such a message in dmesg?
>>>> > >
>>>> > > Greg
>>>> > >
>>>> > > On Wed, May 24, 2017 at 3:18 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>> > > Hi to all,
>>>> > > I'd like to know whether memory swapping could cause a taskmanager crash. In my cluster of virtual machines I'm seeing this strange behavior in my Flink cluster: sometimes, if memory gets swapped, the taskmanager on that machine dies unexpectedly without any log about the error.
>>>> > >
>>>> > > Is that possible or not?
>>>> > >
>>>> > > Best,
>>>> > > Flavio
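For completeness: the young/full collection counters that jstat -gcutil reports above can also be read from inside the TM (or scraped over JMX) via the standard GC MXBeans. A minimal sketch with a made-up class name; the reported collector names depend on which GC the JVM uses:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcProbe {

    public static void dump() {
        // One bean per collector, e.g. "PS Scavenge" (young) and "PS MarkSweep" (full) with the default JDK 8 collector
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
    }
}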