[jira] [Commented] (FLINK-11205) Task Manager Metaspace Memory Leak
[ https://issues.apache.org/jira/browse/FLINK-11205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16888411#comment-16888411 ]

Joey Echeverria commented on FLINK-11205:
-----------------------------------------

I didn't get a chance to reproduce this using a sample job, but we found two causes for our jobs.

(1) Apache Commons Logging was caching LogFactory instances. We added the following code to our close() methods to release those LogFactories:

{code:java}
ClassLoader contextLoader = Thread.currentThread().getContextClassLoader();
LogFactory.release(contextLoader);
{code}

(2) We saw a leak that seemed to have an upper limit in ObjectStreamClass's caches. This one was trickier, as we had to use reflection to clear out the caches, again in the close() method:

{code:java}
import java.io.ObjectStreamClass;
import java.lang.ref.Reference;
import java.lang.reflect.Field;
import java.util.Iterator;
import java.util.Map;

public static void cleanUpLeakingObjects(ClassLoader contextLoader) {
    try {
        Class<?> caches = Class.forName("java.io.ObjectStreamClass$Caches");
        clearCache(caches, "localDescs", contextLoader);
        clearCache(caches, "reflectors", contextLoader);
    } catch (ReflectiveOperationException | SecurityException | ClassCastException ex) {
        // Clean-up failed
        logger.warn("Cleanup of ObjectStreamClass caches failed with exception {}: {}",
                ex.getClass().getSimpleName(), ex.getMessage());
        logger.debug("Stack trace follows.", ex);
    }
}

private static void clearCache(Class<?> caches, String mapName, ClassLoader contextLoader)
        throws ReflectiveOperationException, SecurityException, ClassCastException {
    Field field = caches.getDeclaredField(mapName);
    field.setAccessible(true);
    Map<?, ?> map = (Map<?, ?>) field.get(null); // originally TypeUtils.coerce(field.get(null))
    Iterator<?> keys = map.keySet().iterator();
    while (keys.hasNext()) {
        Object key = keys.next();
        if (key instanceof Reference) {
            Object clazz = ((Reference<?>) key).get();
            if (clazz instanceof Class) {
                // Evict the entry if the cached class was loaded by the job's
                // context classloader or by one of its ancestors.
                ClassLoader cl = ((Class<?>) clazz).getClassLoader();
                while (cl != null) {
                    if (cl == contextLoader) {
                        keys.remove();
                        break;
                    }
                    cl = cl.getParent();
                }
            }
        }
    }
}
{code}

> Task Manager Metaspace Memory Leak
> ----------------------------------
>
>                 Key: FLINK-11205
>                 URL: https://issues.apache.org/jira/browse/FLINK-11205
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.5.5, 1.6.2, 1.7.0
>            Reporter: Nawaid Shamim
>            Priority: Major
>         Attachments: Screenshot 2018-12-18 at 12.14.11.png, Screenshot 2018-12-18 at 15.47.55.png
>
> Job restarts cause the task manager to dynamically load duplicate classes. Metaspace is unbounded and grows with every restart. YARN aggressively kills such containers, but the effect is immediately seen on a different task manager, which results in a death spiral.
> The Task Manager uses dynamic classloading as described in [https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/debugging_classloading.html]:
> {quote}
> *YARN*
> YARN classloading differs between single job deployments and sessions:
> * When submitting a Flink job/application directly to YARN (via {{bin/flink run -m yarn-cluster ...}}), dedicated TaskManagers and JobManagers are started for that job. Those JVMs have both Flink framework classes and user code classes in the Java classpath. That means that there is _no dynamic classloading_ involved in that case.
> * When starting a YARN session, the JobManagers and TaskManagers are started with the Flink framework classes in the classpath. The classes from all jobs that are submitted against the session are loaded dynamically.
> {quote}
> The above is not entirely true, especially when you set {{-yD classloader.resolve-order=parent-first}}. We also observed the above behaviour when submitting a Flink job/application directly to YARN (via {{bin/flink run -m yarn-cluster ...}}).
> !Screenshot 2018-12-18 at 12.14.11.png!

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
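Both workarounds above hang off the operator's close() method. A framework-free sketch of that wiring follows; the CloseCleanups name and the idea of collecting cleanup actions in a list are illustrative, not Flink or Commons Logging API:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not Flink API): collect the per-job cleanup actions --
// LogFactory.release(...), the ObjectStreamClass cache eviction above, etc. --
// and run them all from close(), so that one failing cleanup cannot prevent
// the others from running.
public class CloseCleanups {
    private final List<Runnable> actions = new ArrayList<>();

    public void register(Runnable action) {
        actions.add(action);
    }

    // Call this from RichFunction#close(); failures are reported, not rethrown.
    public void runAll() {
        for (Runnable action : actions) {
            try {
                action.run();
            } catch (RuntimeException ex) {
                System.err.println("Cleanup failed: " + ex);
            }
        }
    }
}
```

Registering each cleanup separately keeps a failing reflective cache eviction from skipping the LogFactory release (or vice versa).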
[jira] [Commented] (FLINK-11205) Task Manager Metaspace Memory Leak
[ https://issues.apache.org/jira/browse/FLINK-11205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16887338#comment-16887338 ]

Joey Echeverria commented on FLINK-11205:
-----------------------------------------

I'm seeing this same issue when running Flink jobs that are in a restart loop. The restart policy is set with a very large number of max attempts, and the Flink job throws an exception on each record. I'll see if I can reproduce with one of the example jobs.
[jira] [Commented] (FLINK-11665) Flink fails to remove JobGraph from ZK even though it reports it did
[ https://issues.apache.org/jira/browse/FLINK-11665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773470#comment-16773470 ]

Joey Echeverria commented on FLINK-11665:
-----------------------------------------

That makes sense [~basharaj]. I think that's more likely than my previous theory.

> Flink fails to remove JobGraph from ZK even though it reports it did
> --------------------------------------------------------------------
>
>                 Key: FLINK-11665
>                 URL: https://issues.apache.org/jira/browse/FLINK-11665
>             Project: Flink
>          Issue Type: Bug
>          Components: JobManager
>    Affects Versions: 1.5.5
>            Reporter: Bashar Abdul Jawad
>            Priority: Major
>         Attachments: FLINK-11665.csv
>
> We have recently seen the following issue with Flink 1.5.5, given Flink job ID 1d24cad26843dcebdfca236d5e3ad82a:
> 1. A job is activated successfully and the job graph is added to ZK:
> {code:java}
> Added SubmittedJobGraph(1d24cad26843dcebdfca236d5e3ad82a, null) to ZooKeeper.
> {code}
> 2. The job is deactivated. Flink reports that the job graph has been successfully removed from ZK, and the blob is deleted from the blob server (in this case S3):
> {code:java}
> Removed job graph 1d24cad26843dcebdfca236d5e3ad82a from ZooKeeper.
> {code}
> 3. The JM is later restarted. Flink for some reason attempts to recover the job that it reported earlier it had removed from ZK, but since the blob has already been deleted, the JM goes into a crash loop. The only way to recover it manually is to remove the job graph entry from ZK:
> {code:java}
> Recovered SubmittedJobGraph(1d24cad26843dcebdfca236d5e3ad82a, null).
> {code}
> and
> {code:java}
> org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
> The specified key does not exist. (Service: Amazon S3; Status Code: 404;
> Error Code: NoSuchKey; Request ID: 1BCDFD83FC4546A2), S3 Extended Request ID:
> OzZtMbihzCm1LKy99s2+rgUMxyll/xYmL6ouMvU2eo30wuDbUmj/DAWoTCs9pNNCLft0FWqbhTo=
> (Path: s3://blam-state-staging/flink/default/blob/job_1d24cad26843dcebdfca236d5e3ad82a/blob_p-c51b25cc0b20351f6e32a628bb6e674ee48a273e-ccfa96b0fd795502897c73714185dde3)
> {code}
> My question is: under what circumstances would this happen? This seems to happen very infrequently, but since the consequence is severe (a JM crash loop), we'd like to understand how it could happen.
> This all seems a little similar to https://issues.apache.org/jira/browse/FLINK-9575, but that issue is reported fixed in Flink 1.5.2 and we are already on Flink 1.5.5.
[jira] [Commented] (FLINK-11665) Flink fails to remove JobGraph from ZK even though it reports it did
[ https://issues.apache.org/jira/browse/FLINK-11665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773282#comment-16773282 ]

Joey Echeverria commented on FLINK-11665:
-----------------------------------------

[~basharaj], I see one code path where jobs are released after they were recovered:

https://github.com/apache/flink/blob/release-1.5.5/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L657-L674

What I think is happening is that while the Dispatcher is trying to recover all jobs, it gets an exception. This triggers a release of all recovered job graphs, which deletes the lock from ZooKeeper and removes the job from the {{addedJobGraphs}} cache in {{ZooKeeperSubmittedJobGraphStore}}. When the job is then stopped, you get the error you noticed, where the ZK state is left behind because the job wasn't in the cache.

I don't see anywhere that the exception is logged, so I'm not sure that I can confirm that one of the jobs failed to recover, causing the release, but it seems most likely based on my reading of the source code.
[jira] [Commented] (FLINK-10213) Task managers cache a negative DNS lookup of the blob server indefinitely
[ https://issues.apache.org/jira/browse/FLINK-10213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766600#comment-16766600 ]

Joey Echeverria commented on FLINK-10213:
-----------------------------------------

Hi [~dawidwys],

Sorry for the delay in responding. What you wrote makes sense, though I'm still a little worried about adding an extra layer of caching of the DNS resolution in the InetSocketAddress. When the bug first hit, the behavior was that the affected TaskManager never recovered. So I'm worried we could run into another situation where the InetSocketAddress is resolved, but is for some reason stale.

The main benefit I can see to not re-doing the lookup each time is to avoid an extra DNS lookup. However, both the JVM and the kernel maintain a DNS cache based on the TTL of the DNS entry. I trust those caches because they respect the TTL of the entry, unlike the InetSocketAddress, which caches indefinitely.

If you feel strongly that we check the isResolved() status first, I'll disagree and commit to that solution. Let me know how you want me to proceed.

> Task managers cache a negative DNS lookup of the blob server indefinitely
> --------------------------------------------------------------------------
>
>                 Key: FLINK-10213
>                 URL: https://issues.apache.org/jira/browse/FLINK-10213
>             Project: Flink
>          Issue Type: Bug
>          Components: TaskManager
>    Affects Versions: 1.5.0
>            Reporter: Joey Echeverria
>            Assignee: Joey Echeverria
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.8.0
>
> When the task manager establishes a connection with the resource manager, it gets the hostname and port of the blob server and uses that to create an instance of an {{InetSocketAddress}}. Per the documentation of the constructor:
> {quote}An attempt will be made to resolve the hostname into an InetAddress. If that attempt fails, the address will be flagged as _unresolved_{quote}
> Flink never checks to see if the address was unresolved. Later, when executing a task that needs to download from the blob server, it will use that same {{InetSocketAddress}} instance to attempt to connect a {{Socket}}. This will result in an exception similar to:
> {noformat}
> java.io.IOException: Failed to fetch BLOB 97799b827ef073e04178a99f0f40b00e/p-6d8ec2ad31337110819c7c3641fdb18d3793a7fb-72bf00066308f4b4d2a9c5aea593b41f from jobmanager:6124 and store it under /tmp/blobStore-d135961a-03cb-4542-af6d-cf378ff83c12/incoming/temp-00018669
> 	at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:191) ~[flink-dist_2.11-1.5.0.jar:1.5.0]
> 	at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181) ~[flink-dist_2.11-1.5.0.jar:1.5.0]
> 	at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:206) ~[flink-dist_2.11-1.5.0.jar:1.5.0]
> 	at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120) ~[flink-dist_2.11-1.5.0.jar:1.5.0]
> 	at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:863) [flink-dist_2.11-1.5.0.jar:1.5.0]
> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:579) [flink-dist_2.11-1.5.0.jar:1.5.0]
> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
> Caused by: java.io.IOException: Could not connect to BlobServer at address flink-jobmanager:6124
> 	at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:124) ~[flink-dist_2.11-1.5.0.jar:1.5.0]
> 	at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:165) ~[flink-dist_2.11-1.5.0.jar:1.5.0]
> 	... 6 more
> Caused by: java.net.UnknownHostException: jobmanager
> 	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184) ~[?:1.8.0_171]
> 	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_171]
> 	at java.net.Socket.connect(Socket.java:589) ~[?:1.8.0_171]
> 	at java.net.Socket.connect(Socket.java:538) ~[?:1.8.0_171]
> 	at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:118) ~[flink-dist_2.11-1.5.0.jar:1.5.0]
> 	at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:165) ~[flink-dist_2.11-1.5.0.jar:1.5.0]
> 	... 6 more
> {noformat}
> Since the {{InetSocketAddress}} is re-used, you'll have repeated failures of any tasks that are executed on that task manager, and the only current workaround is to manually restart the task manager.
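The failure mode is visible with the JDK alone: an {{InetSocketAddress}} freezes the outcome of the lookup done in its constructor, so a failed lookup is baked into the instance forever. The sketch below contrasts the cached instance with re-creating the address per attempt; the freshAddress helper is illustrative (not Flink code), and .invalid is a reserved TLD guaranteed never to resolve:

```java
import java.net.InetSocketAddress;

public class BlobAddressSketch {
    // Buggy pattern: resolve once, when the blob server address is handed over,
    // then reuse the instance for every connection. If the lookup failed, the
    // instance stays unresolved even after DNS starts answering for the name,
    // and Socket.connect(...) throws UnknownHostException every time.
    static final InetSocketAddress CACHED =
            new InetSocketAddress("no-such-host.invalid", 6124);

    // Fixed pattern: build a fresh InetSocketAddress per connection attempt,
    // letting the JVM/OS DNS caches -- which honor record TTLs and
    // negative-caching settings -- decide when to re-resolve.
    static InetSocketAddress freshAddress(String host, int port) {
        return new InetSocketAddress(host, port);
    }
}
```

With the fixed pattern, a transiently unresolvable hostname only fails the attempts made while DNS is actually down, instead of poisoning the task manager until restart.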
[jira] [Commented] (FLINK-10213) Task managers cache a negative DNS lookup of the blob server indefinitely
[ https://issues.apache.org/jira/browse/FLINK-10213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16592613#comment-16592613 ]

Joey Echeverria commented on FLINK-10213:
-----------------------------------------

Yes, I think that'd be the easiest fix. Java and the OS will already cache the resolution based on the TTL of the DNS record and any settings for caching negative responses, so there's no reason for Flink to effectively add its own name cache.
[jira] [Created] (FLINK-10213) Task managers cache a negative DNS lookup of the blob server indefinitely
Joey Echeverria created FLINK-10213:
---------------------------------------

             Summary: Task managers cache a negative DNS lookup of the blob server indefinitely
                 Key: FLINK-10213
                 URL: https://issues.apache.org/jira/browse/FLINK-10213
             Project: Flink
          Issue Type: Bug
          Components: TaskManager
    Affects Versions: 1.5.0
            Reporter: Joey Echeverria
[jira] [Created] (FLINK-10135) The JobManager doesn't report the cluster-level metrics
Joey Echeverria created FLINK-10135:
---------------------------------------

             Summary: The JobManager doesn't report the cluster-level metrics
                 Key: FLINK-10135
                 URL: https://issues.apache.org/jira/browse/FLINK-10135
             Project: Flink
          Issue Type: Bug
          Components: JobManager
    Affects Versions: 1.5.0
            Reporter: Joey Echeverria

In [the documentation for metrics|https://ci.apache.org/projects/flink/flink-docs-release-1.5/monitoring/metrics.html#cluster] in the Flink 1.5.0 release, it says that the following metrics are reported by the JobManager:

{noformat}
numRegisteredTaskManagers
numRunningJobs
taskSlotsAvailable
taskSlotsTotal
{noformat}

In the JobManager REST endpoint ({{http://<host>:8081/jobmanager/metrics}}), those metrics don't appear.
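For checking which metrics a JobManager actually exposes, the metrics REST endpoint accepts a get parameter listing metric names. A small sketch that builds the query URL; the localhost:8081 host and port in the usage are the defaults for a local cluster and may differ in your deployment:

```java
// Sketch: build the query URL for the JobManager metrics REST endpoint.
// Flink's /jobmanager/metrics endpoint accepts an optional ?get= parameter
// with a comma-separated list of metric names; with no names, it returns
// the list of available metric ids.
public class JobManagerMetricsQuery {
    static String metricsUrl(String host, int port, String... metrics) {
        StringBuilder url = new StringBuilder()
                .append("http://").append(host).append(':').append(port)
                .append("/jobmanager/metrics");
        if (metrics.length > 0) {
            url.append("?get=").append(String.join(",", metrics));
        }
        return url.toString();
    }
}
```

Fetching metricsUrl("localhost", 8081, "numRunningJobs", "taskSlotsTotal") against an affected JobManager is a quick way to confirm which of the documented cluster metrics are missing.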
[jira] [Created] (FLINK-10078) Dispatcher should only block job submission during recovery
Joey Echeverria created FLINK-10078:
---------------------------------------

             Summary: Dispatcher should only block job submission during recovery
                 Key: FLINK-10078
                 URL: https://issues.apache.org/jira/browse/FLINK-10078
             Project: Flink
          Issue Type: Improvement
    Affects Versions: 1.5.0
            Reporter: Joey Echeverria

The Dispatcher currently doesn't confirm leadership until all jobs are recovered. This prevents any operations that require an active Dispatcher from working until after job recovery. This is primarily done to prevent race conditions between client retries and recovering jobs.

An alternative approach would be to only block job submission while recovery is happening.

Note: we also need to check that no other RPCs change the internal state in such a way that it interferes with the job recovery.
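One way to narrow the blocking is a recovery gate that only job submission waits on. The sketch below assumes submissions can simply be queued behind a future that completes when recovery finishes; the names are illustrative, not the Dispatcher's actual API:

```java
import java.util.concurrent.CompletableFuture;

// Illustrative sketch: confirm leadership immediately, but have submitJob
// queue behind a future that completes when job recovery finishes. Other
// operations (REST queries, metrics, ...) would bypass the gate entirely.
public class SubmissionGate {
    private final CompletableFuture<Void> recoveryDone = new CompletableFuture<>();

    // Called once all persisted jobs have been recovered.
    public void recoveryFinished() {
        recoveryDone.complete(null);
    }

    // Submissions made during recovery are deferred, not rejected, which
    // also avoids the client-retry race the current design guards against.
    public CompletableFuture<String> submitJob(String jobId) {
        return recoveryDone.thenApply(ignored -> "submitted " + jobId);
    }
}
```

A submission arriving before recoveryFinished() stays pending and completes as soon as the gate opens, so clients see a slow acknowledgment rather than an error.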
[jira] [Created] (FLINK-10077) Web UI should respond when there is no elected Dispatcher
Joey Echeverria created FLINK-10077:
---------------------------------------

             Summary: Web UI should respond when there is no elected Dispatcher
                 Key: FLINK-10077
                 URL: https://issues.apache.org/jira/browse/FLINK-10077
             Project: Flink
          Issue Type: Improvement
          Components: Webfrontend
    Affects Versions: 1.5.0
            Reporter: Joey Echeverria

The Dispatcher provides much of the relevant information that you see in the web UI, and currently the UI won't show at all until a Dispatcher leader is elected. An improved user experience would be to have the UI display whatever information it can and fail to display only the information that it cannot (e.g. the number of running jobs).