Re: Heap Problem with Checkpoints

2018-08-09 Thread Piotr Nowojski
Hi, thanks for getting back with more information. Apparently this is a known JDK bug that has been open since 2003 and is still not resolved: https://bugs.java.com/view_bug.do?bug_id=4872014 https://bugs.java.com/view_bug.do?bug_id=6664633

Re: Heap Problem with Checkpoints

2018-08-09 Thread Ayush Verma
Hello Piotr, I work with Fabian and have been investigating the memory leak associated with the issues mentioned in this thread. I took a heap dump of our master node and noticed more than 1 GB (and growing) worth of entries in the set /files/ in class *java.io.DeleteOnExitHook*. Almost all the
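
For readers unfamiliar with the mechanism: java.io.DeleteOnExitHook keeps a static set of paths, and File.deleteOnExit() adds to it. Entries are not removed before JVM shutdown, so a long-running process that registers many temporary files this way accumulates them on the heap. The following is a minimal sketch of the leaky pattern next to an explicit-delete alternative. The class name, method names, and the "blob-" prefix are hypothetical and not taken from Flink or Hadoop; which code path actually registered the files seen in the heap dump is not established by this snippet.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TempFileCleanup {

    // Leaky pattern (hypothetical illustration): each call registers the
    // file's path in DeleteOnExitHook's static set. The entry stays there
    // until the JVM exits, even if the file is deleted earlier.
    static File createLeaky() throws IOException {
        File f = File.createTempFile("blob-", ".tmp");
        f.deleteOnExit();
        return f;
    }

    // Alternative: delete the file explicitly once it is no longer needed.
    // Nothing is registered with the shutdown hook.
    // Returns true if the file existed while it was in use.
    static boolean createAndDeleteExplicitly() throws IOException {
        Path p = Files.createTempFile("blob-", ".tmp");
        try {
            Files.write(p, new byte[] {1, 2, 3});
            return Files.exists(p);
        } finally {
            Files.deleteIfExists(p);
        }
    }

    public static void main(String[] args) throws IOException {
        File leaky = createLeaky();
        // Deleting the file does not remove its path from the hook's set.
        System.out.println("leaky file deleted early: " + leaky.delete());
        System.out.println("explicit variant ran: " + createAndDeleteExplicitly());
    }
}
```

In a short-lived program the difference is invisible; it only matters for a process that stays up for days and creates temporary files continuously.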

Re: Heap Problem with Checkpoints

2018-06-20 Thread Fabian Wollert
To that last one: I'm accessing S3 from one EC2 instance which has an IAM role attached ... I'll get back to you when I have those stack traces printed ... I will have to build the project and package the custom version first, which might take some time, and some vacation is also up next ... Cheers --

Re: Heap Problem with Checkpoints

2018-06-20 Thread Piotr Nowojski
Btw, a side question: could it be that you are accessing two different Hadoop file systems (two different schemes), or even the same one from two different users (encoded in the file system URI), within the same Flink JobMaster? If so, the answer might be this possible resource leak in Flink: http

Re: Heap Problem with Checkpoints

2018-06-20 Thread Piotr Nowojski
Hi, I was looking into this more, and I have a couple of suspicions, but it's still hard to tell which is correct. Could you, for example, place a breakpoint (or add code there to print a stack trace) in org.apache.log4j.helpers.AppenderAttachableImpl#addAppender and check who is calling it? Since i
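
As a rough sketch of the "print a stack trace" variant of that suggestion: creating a Throwable inside the method of interest and printing it shows the full call chain on stderr without stopping the process. The class and method names below are hypothetical stand-ins; in practice the two marked lines would go into a patched copy of the method being investigated.

```java
public class CallerTrace {

    // Hypothetical stand-in for the method under investigation.
    // Returns the name of the method that called it.
    static String addAppenderStandIn() {
        // These two lines are the part that would be pasted into the real method.
        Throwable marker = new Throwable("addAppender called from:");
        marker.printStackTrace(); // writes the call chain to stderr

        StackTraceElement[] stack = marker.getStackTrace();
        // stack[0] is this method; stack[1] is its direct caller.
        return stack.length > 1 ? stack[1].getMethodName() : "<unknown>";
    }

    // Hypothetical caller, so the trace has something to show.
    static String suspectCaller() {
        return addAppenderStandIn();
    }

    public static void main(String[] args) {
        System.out.println("direct caller: " + suspectCaller());
    }
}
```

Thread.dumpStack() is a shorter alternative when only the printed trace is needed.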

Re: Heap Problem with Checkpoints

2018-06-19 Thread Piotr Nowojski
Hi, can you search the logs/stderr/stdout for log entries like: log.warn("Failed to locally delete blob " …) ? I see in the code that if file deletion fails for whatever reason, TransientBlobCleanupTask can loop indefinitely, trying to remove it over and over again. That might be ok,

Re: Heap Problem with Checkpoints

2018-06-11 Thread Piotr Nowojski
Hi, what kind of messages are those “logs about S3 operations”? Did you try googling them? Maybe it's a known S3 issue? Another approach is to use a heap analyser, with which you can backtrack the classes that are referencing those “memory leaks”, and again try to google any kno

Heap Problem with Checkpoints

2018-06-08 Thread Fabian Wollert
Hi, in this email thread here, I tried to set up S3 as a filesystem backend for checkpoints. Now everything is working (Flink 1.5.0), but the