[jira] [Commented] (FLINK-11205) Task Manager Metaspace Memory Leak

2019-07-18 Thread Joey Echeverria (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16888411#comment-16888411
 ] 

Joey Echeverria commented on FLINK-11205:
-

I didn't get a chance to reproduce this using a sample job, but we found two 
causes for our jobs.

(1) Apache Commons Logging was caching LogFactory instances. We added the 
following code to our close() methods to release those LogFactories:

{code:java}
ClassLoader contextLoader = Thread.currentThread().getContextClassLoader();
LogFactory.release(contextLoader);
{code}

(2) We saw a leak that seemed to have an upper limit in ObjectStreamClass's 
caches. This one was trickier as we had to use reflection to clear out the 
caches, again in the close() method:

{code:java}
public static void cleanUpLeakingObjects(ClassLoader contextLoader) {
  try {
    Class caches = Class.forName("java.io.ObjectStreamClass$Caches");
    clearCache(caches, "localDescs", contextLoader);
    clearCache(caches, "reflectors", contextLoader);
  } catch (ReflectiveOperationException | SecurityException | ClassCastException ex) {
    // Clean-up failed; log and move on rather than failing the close().
    logger.warn("Cleanup of ObjectStreamClass caches failed with exception {}: {}",
        ex.getClass().getSimpleName(), ex.getMessage());
    logger.debug("Stack trace follows.", ex);
  }
}

private static void clearCache(Class caches, String mapName, ClassLoader contextLoader)
    throws ReflectiveOperationException, SecurityException, ClassCastException {
  Field field = caches.getDeclaredField(mapName);
  field.setAccessible(true);

  // Unchecked cast of the static cache field (a map keyed by weak references
  // to classes) to a plain Map.
  Map map = TypeUtils.coerce(field.get(null));
  Iterator keys = map.keySet().iterator();
  while (keys.hasNext()) {
    Object key = keys.next();
    if (key instanceof Reference) {
      Object clazz = ((Reference) key).get();
      if (clazz instanceof Class) {
        ClassLoader cl = ((Class) clazz).getClassLoader();
        // Drop the entry if the cached class was loaded by (a child of) the
        // class loader being released.
        while (cl != null) {
          if (cl == contextLoader) {
            keys.remove();
            break;
          }
          cl = cl.getParent();
        }
      }
    }
  }
}
{code}
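
To show where these calls live, here's a minimal sketch of a close() override 
that runs both cleanups against the user-code class loader (the RichMapFunction 
is just an illustrative host; any Rich function's close() would look the same, 
and cleanUpLeakingObjects is the helper from the snippet above):

{code:java}
import org.apache.commons.logging.LogFactory;
import org.apache.flink.api.common.functions.RichMapFunction;

public class CleanupExampleFunction extends RichMapFunction<String, String> {

  @Override
  public String map(String value) {
    return value;
  }

  @Override
  public void close() throws Exception {
    // Inside a task, the context class loader is the user-code class loader.
    ClassLoader contextLoader = Thread.currentThread().getContextClassLoader();

    // (1) Release the LogFactory cached for this class loader.
    LogFactory.release(contextLoader);

    // (2) Drop ObjectStreamClass cache entries for classes loaded by it.
    cleanUpLeakingObjects(contextLoader);

    super.close();
  }
}
{code}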

> Task Manager Metaspace Memory Leak 
> ---
>
> Key: FLINK-11205
> URL: https://issues.apache.org/jira/browse/FLINK-11205
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.5.5, 1.6.2, 1.7.0
>Reporter: Nawaid Shamim
>Priority: Major
> Attachments: Screenshot 2018-12-18 at 12.14.11.png, Screenshot 
> 2018-12-18 at 15.47.55.png
>
>
> Job restarts cause the task manager to dynamically load duplicate classes. 
> Metaspace is unbounded and grows with every restart. YARN aggressively kills 
> such containers, but the effect is immediately seen on a different task 
> manager, which results in a death spiral.
> The Task Manager uses dynamic classloading as described in 
> [https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/debugging_classloading.html]
> {quote}
> *YARN*
> YARN classloading differs between single job deployments and sessions:
>  * When submitting a Flink job/application directly to YARN (via {{bin/flink 
> run -m yarn-cluster ...}}), dedicated TaskManagers and JobManagers are 
> started for that job. Those JVMs have both Flink framework classes and user 
> code classes in the Java classpath. That means that there is _no dynamic 
> classloading_ involved in that case.
>  * When starting a YARN session, the JobManagers and TaskManagers are started 
> with the Flink framework classes in the classpath. The classes from all jobs 
> that are submitted against the session are loaded dynamically.
> {quote}
> The above is not entirely true, especially when you set {{-yD 
> classloader.resolve-order=parent-first}}. We also observed the above 
> behaviour when submitting a Flink job/application directly to YARN (via 
> {{bin/flink run -m yarn-cluster ...}}).
> !Screenshot 2018-12-18 at 12.14.11.png!





[jira] [Commented] (FLINK-11205) Task Manager Metaspace Memory Leak

2019-07-17 Thread Joey Echeverria (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16887338#comment-16887338
 ] 

Joey Echeverria commented on FLINK-11205:
-

I'm seeing this same issue when running Flink jobs that are in a restart loop. 
The restart policy is set with a very large number of max attempts and the 
Flink job throws an exception on each record. I'll see if I can reproduce with 
one of the example jobs.
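
For reference, the shape of the failing job looks roughly like this (the class 
name, source, and always-throwing map function are illustrative; the point is a 
huge restart budget combined with a task that fails on every record):

{code:java}
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartLoopJob {

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Very large maximum number of restart attempts with a short delay.
    env.setRestartStrategy(
        RestartStrategies.fixedDelayRestart(Integer.MAX_VALUE, Time.seconds(5)));

    env.fromElements("a", "b", "c")
        .map(new MapFunction<String, String>() {
          @Override
          public String map(String value) {
            // Every record fails, so the job restarts forever and the task
            // managers keep loading fresh copies of the user classes.
            throw new RuntimeException("poison pill: " + value);
          }
        })
        .print();

    env.execute("restart-loop");
  }
}
{code}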

> Task Manager Metaspace Memory Leak 
> ---
>
> Key: FLINK-11205
> URL: https://issues.apache.org/jira/browse/FLINK-11205
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.5.5, 1.6.2, 1.7.0
>Reporter: Nawaid Shamim
>Priority: Major
> Attachments: Screenshot 2018-12-18 at 12.14.11.png, Screenshot 
> 2018-12-18 at 15.47.55.png
>
>
> Job restarts cause the task manager to dynamically load duplicate classes. 
> Metaspace is unbounded and grows with every restart. YARN aggressively kills 
> such containers, but the effect is immediately seen on a different task 
> manager, which results in a death spiral.
> The Task Manager uses dynamic classloading as described in 
> [https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/debugging_classloading.html]
> {quote}
> *YARN*
> YARN classloading differs between single job deployments and sessions:
>  * When submitting a Flink job/application directly to YARN (via {{bin/flink 
> run -m yarn-cluster ...}}), dedicated TaskManagers and JobManagers are 
> started for that job. Those JVMs have both Flink framework classes and user 
> code classes in the Java classpath. That means that there is _no dynamic 
> classloading_ involved in that case.
>  * When starting a YARN session, the JobManagers and TaskManagers are started 
> with the Flink framework classes in the classpath. The classes from all jobs 
> that are submitted against the session are loaded dynamically.
> {quote}
> The above is not entirely true, especially when you set {{-yD 
> classloader.resolve-order=parent-first}}. We also observed the above 
> behaviour when submitting a Flink job/application directly to YARN (via 
> {{bin/flink run -m yarn-cluster ...}}).
> !Screenshot 2018-12-18 at 12.14.11.png!





[jira] [Commented] (FLINK-11665) Flink fails to remove JobGraph from ZK even though it reports it did

2019-02-20 Thread Joey Echeverria (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773470#comment-16773470
 ] 

Joey Echeverria commented on FLINK-11665:
-

That makes sense [~basharaj]. I think that's more likely than my previous 
theory.

> Flink fails to remove JobGraph from ZK even though it reports it did
> 
>
> Key: FLINK-11665
> URL: https://issues.apache.org/jira/browse/FLINK-11665
> Project: Flink
>  Issue Type: Bug
>  Components: JobManager
>Affects Versions: 1.5.5
>Reporter: Bashar Abdul Jawad
>Priority: Major
> Attachments: FLINK-11665.csv
>
>
> We recently have seen the following issue with Flink 1.5.5:
> Given Flink Job ID 1d24cad26843dcebdfca236d5e3ad82a: 
> 1- A job is activated successfully and the job graph added to ZK:
> {code:java}
> Added SubmittedJobGraph(1d24cad26843dcebdfca236d5e3ad82a, null) to ZooKeeper.
> {code}
> 2- Job is deactivated, Flink reports that the job graph has been successfully 
> removed from ZK and the blob is deleted from the blob server (in this case 
> S3):
> {code:java}
> Removed job graph 1d24cad26843dcebdfca236d5e3ad82a from ZooKeeper.
> {code}
> 3- The JM is later restarted. For some reason, Flink attempts to recover the 
> job that it earlier reported it had removed from ZK, but since the blob has 
> already been deleted, the JM goes into a crash loop. The only way to recover 
> is to manually remove the job graph entry from ZK:
> {code:java}
> Recovered SubmittedJobGraph(1d24cad26843dcebdfca236d5e3ad82a, null).  
> {code}
> and
> {code:java}
> org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>  The specified key does not exist. (Service: Amazon S3; Status Code: 404; 
> Error Code: NoSuchKey; Request ID: 1BCDFD83FC4546A2), S3 Extended Request ID: 
> OzZtMbihzCm1LKy99s2+rgUMxyll/xYmL6ouMvU2eo30wuDbUmj/DAWoTCs9pNNCLft0FWqbhTo= 
> (Path: 
> s3://blam-state-staging/flink/default/blob/job_1d24cad26843dcebdfca236d5e3ad82a/blob_p-c51b25cc0b20351f6e32a628bb6e674ee48a273e-ccfa96b0fd795502897c73714185dde3)
> {code}
> My question is: under what circumstances would this happen? This seems to 
> happen very infrequently, but since the consequence is severe (a JM crash 
> loop), we'd like to understand how it can happen.
> This all seems a little similar to 
> https://issues.apache.org/jira/browse/FLINK-9575, but that issue is reported 
> as fixed in Flink 1.5.2 and we are already on Flink 1.5.5.





[jira] [Commented] (FLINK-11665) Flink fails to remove JobGraph from ZK even though it reports it did

2019-02-20 Thread Joey Echeverria (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773282#comment-16773282
 ] 

Joey Echeverria commented on FLINK-11665:
-

[~basharaj], I see one code path where jobs are released after they were 
recovered: 
https://github.com/apache/flink/blob/release-1.5.5/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L657-L674

What I think is happening is that while the Dispatcher is trying to recover all 
jobs, it gets an exception. This triggers a release of all recovered job graphs, 
which deletes the lock from ZooKeeper and removes the job from the 
{{addedJobGraphs}} cache in {{ZooKeeperSubmittedJobGraphStore}}. When the job 
is then stopped, you get the error you noticed, where the ZK state is left 
behind because the job is no longer in the cache. I don't see anywhere that the 
exception is logged, so I can't confirm that one of the jobs failed to recover 
and caused the release, but it seems most likely based on my reading of the 
source code.
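
To illustrate the sequence, here's a much-simplified model of the interaction I 
mean (this is not the actual Flink code, just the relationship between the 
local {{addedJobGraphs}} cache and the ZK entry as I read it):

{code:java}
import java.util.HashSet;
import java.util.Set;

/** Simplified model of the cache/ZK interaction; not ZooKeeperSubmittedJobGraphStore itself. */
class SimplifiedJobGraphStore {

  private final Set<String> addedJobGraphs = new HashSet<>(); // local cache
  private final Set<String> zkEntries = new HashSet<>();      // stands in for ZooKeeper state

  void recoverJobGraph(String jobId) {
    // Recovery reads the existing ZK entry and adds the job to the local cache.
    zkEntries.add(jobId);
    addedJobGraphs.add(jobId);
  }

  void releaseJobGraph(String jobId) {
    // Triggered when recovering the set of jobs fails: the lock is given up
    // and the job is dropped from the local cache, but the ZK entry stays.
    addedJobGraphs.remove(jobId);
  }

  void removeJobGraph(String jobId) {
    // Removal only touches ZK when the job is still in the local cache, so a
    // release before the job is stopped leaves the ZK entry behind.
    if (addedJobGraphs.remove(jobId)) {
      zkEntries.remove(jobId);
    }
  }
}
{code}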

> Flink fails to remove JobGraph from ZK even though it reports it did
> 
>
> Key: FLINK-11665
> URL: https://issues.apache.org/jira/browse/FLINK-11665
> Project: Flink
>  Issue Type: Bug
>  Components: JobManager
>Affects Versions: 1.5.5
>Reporter: Bashar Abdul Jawad
>Priority: Major
>
> We recently have seen the following issue with Flink 1.5.5:
> Given Flink Job ID 1d24cad26843dcebdfca236d5e3ad82a: 
> 1- A job is activated successfully and the job graph added to ZK:
> {code:java}
> Added SubmittedJobGraph(1d24cad26843dcebdfca236d5e3ad82a, null) to ZooKeeper.
> {code}
> 2- Job is deactivated, Flink reports that the job graph has been successfully 
> removed from ZK and the blob is deleted from the blob server (in this case 
> S3):
> {code:java}
> Removed job graph 1d24cad26843dcebdfca236d5e3ad82a from ZooKeeper.
> {code}
> 3- The JM is later restarted. For some reason, Flink attempts to recover the 
> job that it earlier reported it had removed from ZK, but since the blob has 
> already been deleted, the JM goes into a crash loop. The only way to recover 
> is to manually remove the job graph entry from ZK:
> {code:java}
> Recovered SubmittedJobGraph(1d24cad26843dcebdfca236d5e3ad82a, null).  
> {code}
> and
> {code:java}
> org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>  The specified key does not exist. (Service: Amazon S3; Status Code: 404; 
> Error Code: NoSuchKey; Request ID: 1BCDFD83FC4546A2), S3 Extended Request ID: 
> OzZtMbihzCm1LKy99s2+rgUMxyll/xYmL6ouMvU2eo30wuDbUmj/DAWoTCs9pNNCLft0FWqbhTo= 
> (Path: 
> s3://blam-state-staging/flink/default/blob/job_1d24cad26843dcebdfca236d5e3ad82a/blob_p-c51b25cc0b20351f6e32a628bb6e674ee48a273e-ccfa96b0fd795502897c73714185dde3)
> {code}
> My question is: under what circumstances would this happen? This seems to 
> happen very infrequently, but since the consequence is severe (a JM crash 
> loop), we'd like to understand how it can happen.
> This all seems a little similar to 
> https://issues.apache.org/jira/browse/FLINK-9575, but that issue is reported 
> as fixed in Flink 1.5.2 and we are already on Flink 1.5.5.





[jira] [Commented] (FLINK-10213) Task managers cache a negative DNS lookup of the blob server indefinitely

2019-02-12 Thread Joey Echeverria (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766600#comment-16766600
 ] 

Joey Echeverria commented on FLINK-10213:
-

Hi [~dawidwys],

Sorry for the delay in responding. What you wrote makes sense, though I'm still a 
little worried about adding an extra layer of caching of the DNS resolution in 
the InetSocketAddress. When the bug first hit, the behavior was that the 
affected TaskManager never recovered. So I'm worried we could run into another 
situation where the InetSocketAddress is resolved, but is for some reason 
stale. The main benefit I can see to not re-doing the lookup each time is to 
avoid an extra DNS lookup. However, both the JVM and the kernel maintain a DNS 
cache based on the TTL of the DNS entry. I trust those caches because they 
respect the TTL of the entry, unlike the InetSocketAddress, which caches 
indefinitely.

If you feel strongly that we should check the isResolved() status first, I'll 
disagree and commit to that solution. Let me know how you want me to proceed.
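
To make the concern concrete, here's a small sketch of the difference in plain 
java.net code (nothing Flink-specific; host and port are illustrative). Creating 
a fresh InetSocketAddress per connection attempt defers to the JVM/OS resolver 
caches, whereas a stored instance freezes whatever the first lookup returned:

{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public final class BlobAddressExample {

  // Frozen: the hostname is resolved once (or flagged unresolved) in the
  // constructor, and that result is reused for the lifetime of the instance.
  static final InetSocketAddress CACHED = new InetSocketAddress("flink-jobmanager", 6124);

  // Fresh lookup per attempt: relies on the JVM and OS DNS caches, which
  // honor the record's TTL instead of caching a result indefinitely.
  static Socket connectFresh(String host, int port, int timeoutMillis) throws IOException {
    Socket socket = new Socket();
    socket.connect(new InetSocketAddress(host, port), timeoutMillis);
    return socket;
  }
}
{code}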

> Task managers cache a negative DNS lookup of the blob server indefinitely
> -
>
> Key: FLINK-10213
> URL: https://issues.apache.org/jira/browse/FLINK-10213
> Project: Flink
>  Issue Type: Bug
>  Components: TaskManager
>Affects Versions: 1.5.0
>Reporter: Joey Echeverria
>Assignee: Joey Echeverria
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.8.0
>
>
> When the task manager establishes a connection with the resource manager, it 
> gets the hostname and port of the blob server and uses that to create an 
> instance of an {{InetSocketAddress}}. Per the documentation of the 
> constructor:
> {quote}An attempt will be made to resolve the hostname into an InetAddress. 
> If that attempt fails, the address will be flagged as _unresolved_{quote}
> Flink never checks to see if the address was unresolved. Later when executing 
> a task that needs to download from the blob server, it will use that same 
> {{InetSocketAddress}} instance to attempt to connect a {{Socket}}. This will 
> result in an exception similar to:
> {noformat}
> java.io.IOException: Failed to fetch BLOB 
> 97799b827ef073e04178a99f0f40b00e/p-6d8ec2ad31337110819c7c3641fdb18d3793a7fb-72bf00066308f4b4d2a9c5aea593b41f
>  from jobmanager:6124 and store it under 
> /tmp/blobStore-d135961a-03cb-4542-af6d-cf378ff83c12/incoming/temp-00018669
>   at 
> org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:191)
>  ~[flink-dist_2.11-1.5.0.jar:1.5.0]
>   at 
> org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
>  ~[flink-dist_2.11-1.5.0.jar:1.5.0]
>   at 
> org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:206)
>  ~[flink-dist_2.11-1.5.0.jar:1.5.0]
>   at 
> org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
>  ~[flink-dist_2.11-1.5.0.jar:1.5.0]
>   at 
> org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:863)
>  [flink-dist_2.11-1.5.0.jar:1.5.0]
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:579) 
> [flink-dist_2.11-1.5.0.jar:1.5.0]
>   at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
> Caused by: java.io.IOException: Could not connect to BlobServer at address 
> flink-jobmanager:6124
>   at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:124) 
> ~[flink-dist_2.11-1.5.0.jar:1.5.0]
>   at 
> org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:165)
>  ~[flink-dist_2.11-1.5.0.jar:1.5.0]
>   ... 6 more
> Caused by: java.net.UnknownHostException: jobmanager
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184) 
> ~[?:1.8.0_171]
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) 
> ~[?:1.8.0_171]
>   at java.net.Socket.connect(Socket.java:589) ~[?:1.8.0_171]
>   at java.net.Socket.connect(Socket.java:538) ~[?:1.8.0_171]
>   at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:118) 
> ~[flink-dist_2.11-1.5.0.jar:1.5.0]
>   at 
> org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:165)
>  ~[flink-dist_2.11-1.5.0.jar:1.5.0]
>   ... 6 more
> {noformat}
> Since the {{InetSocketAddress}} is re-used, you'll have repeated failures of 
> any tasks that are executed on that task manager and the only current 
> workaround is to manually restart the task manager.





[jira] [Commented] (FLINK-10213) Task managers cache a negative DNS lookup of the blob server indefinitely

2018-08-25 Thread Joey Echeverria (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16592613#comment-16592613
 ] 

Joey Echeverria commented on FLINK-10213:
-

Yes, I think that’d be the easiest fix. Java and the OS will already cache the 
resolution based on the TTL of the DNS record and any settings for caching 
negative responses, so there’s no reason for Flink to effectively add its own 
name cache.

> Task managers cache a negative DNS lookup of the blob server indefinitely
> -
>
> Key: FLINK-10213
> URL: https://issues.apache.org/jira/browse/FLINK-10213
> Project: Flink
>  Issue Type: Bug
>  Components: TaskManager
>Affects Versions: 1.5.0
>Reporter: Joey Echeverria
>Priority: Major
>
> When the task manager establishes a connection with the resource manager, it 
> gets the hostname and port of the blob server and uses that to create an 
> instance of an {{InetSocketAddress}}. Per the documentation of the 
> constructor:
> {quote}An attempt will be made to resolve the hostname into an InetAddress. 
> If that attempt fails, the address will be flagged as _unresolved_{quote}
> Flink never checks to see if the address was unresolved. Later when executing 
> a task that needs to download from the blob server, it will use that same 
> {{InetSocketAddress}} instance to attempt to connect a {{Socket}}. This will 
> result in an exception similar to:
> {noformat}
> java.io.IOException: Failed to fetch BLOB 
> 97799b827ef073e04178a99f0f40b00e/p-6d8ec2ad31337110819c7c3641fdb18d3793a7fb-72bf00066308f4b4d2a9c5aea593b41f
>  from jobmanager:6124 and store it under 
> /tmp/blobStore-d135961a-03cb-4542-af6d-cf378ff83c12/incoming/temp-00018669
>   at 
> org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:191)
>  ~[flink-dist_2.11-1.5.0.jar:1.5.0]
>   at 
> org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
>  ~[flink-dist_2.11-1.5.0.jar:1.5.0]
>   at 
> org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:206)
>  ~[flink-dist_2.11-1.5.0.jar:1.5.0]
>   at 
> org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
>  ~[flink-dist_2.11-1.5.0.jar:1.5.0]
>   at 
> org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:863)
>  [flink-dist_2.11-1.5.0.jar:1.5.0]
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:579) 
> [flink-dist_2.11-1.5.0.jar:1.5.0]
>   at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
> Caused by: java.io.IOException: Could not connect to BlobServer at address 
> flink-jobmanager:6124
>   at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:124) 
> ~[flink-dist_2.11-1.5.0.jar:1.5.0]
>   at 
> org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:165)
>  ~[flink-dist_2.11-1.5.0.jar:1.5.0]
>   ... 6 more
> Caused by: java.net.UnknownHostException: jobmanager
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184) 
> ~[?:1.8.0_171]
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) 
> ~[?:1.8.0_171]
>   at java.net.Socket.connect(Socket.java:589) ~[?:1.8.0_171]
>   at java.net.Socket.connect(Socket.java:538) ~[?:1.8.0_171]
>   at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:118) 
> ~[flink-dist_2.11-1.5.0.jar:1.5.0]
>   at 
> org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:165)
>  ~[flink-dist_2.11-1.5.0.jar:1.5.0]
>   ... 6 more
> {noformat}
> Since the {{InetSocketAddress}} is re-used, you'll have repeated failures of 
> any tasks that are executed on that task manager and the only current 
> workaround is to manually restart the task manager.





[jira] [Created] (FLINK-10213) Task managers cache a negative DNS lookup of the blob server indefinitely

2018-08-24 Thread Joey Echeverria (JIRA)
Joey Echeverria created FLINK-10213:
---

 Summary: Task managers cache a negative DNS lookup of the blob 
server indefinitely
 Key: FLINK-10213
 URL: https://issues.apache.org/jira/browse/FLINK-10213
 Project: Flink
  Issue Type: Bug
  Components: TaskManager
Affects Versions: 1.5.0
Reporter: Joey Echeverria


When the task manager establishes a connection with the resource manager, it 
gets the hostname and port of the blob server and uses that to create an 
instance of an {{InetSocketAddress}}. Per the documentation of the constructor:
{quote}An attempt will be made to resolve the hostname into an InetAddress. If 
that attempt fails, the address will be flagged as _unresolved_{quote}
Flink never checks to see if the address was unresolved. Later when executing a 
task that needs to download from the blob server, it will use that same 
{{InetSocketAddress}} instance to attempt to connect a {{Socket}}. This will 
result in an exception similar to:
{noformat}
java.io.IOException: Failed to fetch BLOB 
97799b827ef073e04178a99f0f40b00e/p-6d8ec2ad31337110819c7c3641fdb18d3793a7fb-72bf00066308f4b4d2a9c5aea593b41f
 from jobmanager:6124 and store it under 
/tmp/blobStore-d135961a-03cb-4542-af6d-cf378ff83c12/incoming/temp-00018669
at 
org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:191)
 ~[flink-dist_2.11-1.5.0.jar:1.5.0]
at 
org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
 ~[flink-dist_2.11-1.5.0.jar:1.5.0]
at 
org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:206)
 ~[flink-dist_2.11-1.5.0.jar:1.5.0]
at 
org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
 ~[flink-dist_2.11-1.5.0.jar:1.5.0]
at 
org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:863)
 [flink-dist_2.11-1.5.0.jar:1.5.0]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:579) 
[flink-dist_2.11-1.5.0.jar:1.5.0]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
Caused by: java.io.IOException: Could not connect to BlobServer at address 
flink-jobmanager:6124
at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:124) 
~[flink-dist_2.11-1.5.0.jar:1.5.0]
at 
org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:165)
 ~[flink-dist_2.11-1.5.0.jar:1.5.0]
... 6 more
Caused by: java.net.UnknownHostException: jobmanager
at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184) 
~[?:1.8.0_171]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) 
~[?:1.8.0_171]
at java.net.Socket.connect(Socket.java:589) ~[?:1.8.0_171]
at java.net.Socket.connect(Socket.java:538) ~[?:1.8.0_171]
at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:118) 
~[flink-dist_2.11-1.5.0.jar:1.5.0]
at 
org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:165)
 ~[flink-dist_2.11-1.5.0.jar:1.5.0]
... 6 more
{noformat}

Since the {{InetSocketAddress}} is re-used, you'll have repeated failures of 
any tasks that are executed on that task manager and the only current 
workaround is to manually restart the task manager.
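
A minimal sketch of the behavior described above and one possible guard (this 
is plain java.net code illustrating the constructor semantics, not the actual 
Flink fix; the host name is the same placeholder that appears in the stack 
trace):

{code:java}
import java.net.InetSocketAddress;

public class UnresolvedAddressCheck {

  public static void main(String[] args) {
    // Resolution is attempted once, in the constructor. If the lookup fails
    // (e.g. DNS is not ready yet), the address stays flagged as unresolved.
    InetSocketAddress blobServer = new InetSocketAddress("jobmanager", 6124);

    if (blobServer.isUnresolved()) {
      // One possible guard: retry the lookup rather than caching the failure.
      blobServer = new InetSocketAddress(blobServer.getHostString(), blobServer.getPort());
    }

    System.out.println("resolved = " + !blobServer.isUnresolved());
  }
}
{code}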





[jira] [Created] (FLINK-10135) The JobManager doesn't report the cluster-level metrics

2018-08-13 Thread Joey Echeverria (JIRA)
Joey Echeverria created FLINK-10135:
---

 Summary: The JobManager doesn't report the cluster-level metrics
 Key: FLINK-10135
 URL: https://issues.apache.org/jira/browse/FLINK-10135
 Project: Flink
  Issue Type: Bug
  Components: JobManager
Affects Versions: 1.5.0
Reporter: Joey Echeverria


In [the documentation for 
metrics|https://ci.apache.org/projects/flink/flink-docs-release-1.5/monitoring/metrics.html#cluster]
 in the Flink 1.5.0 release, it says that the following metrics are reported by 
the JobManager:
{noformat}
numRegisteredTaskManagers
numRunningJobs
taskSlotsAvailable
taskSlotsTotal
{noformat}

In the JobManager REST endpoint 
({{http://<jobmanager-host>:8081/jobmanager/metrics}}), those metrics don't appear.
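
For reference, this is the kind of request I'm making (the host is a 
placeholder for the JobManager's REST address; if I understand the metrics REST 
API correctly, the optional get parameter selects specific metrics by name):

{noformat}
curl "http://<jobmanager-host>:8081/jobmanager/metrics"
curl "http://<jobmanager-host>:8081/jobmanager/metrics?get=numRegisteredTaskManagers,numRunningJobs"
{noformat}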





[jira] [Created] (FLINK-10078) Dispatcher should only block job submission during recovery

2018-08-06 Thread Joey Echeverria (JIRA)
Joey Echeverria created FLINK-10078:
---

 Summary: Dispatcher should only block job submission during 
recovery
 Key: FLINK-10078
 URL: https://issues.apache.org/jira/browse/FLINK-10078
 Project: Flink
  Issue Type: Improvement
Affects Versions: 1.5.0
Reporter: Joey Echeverria


The Dispatcher currently doesn't confirm leadership until all jobs are 
recovered. This prevents any operations that require an active Dispatcher from 
working until after job recovery. This is primarily done to prevent race 
conditions between client retries and recovering jobs. An alternative approach 
would be to only block job submission while recovery is happening.

 

Note: we also need to check that no other RPCs change the internal state in 
such a way that it interferes with the job recovery.
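
As a rough sketch of the alternative (purely illustrative, not the actual 
Dispatcher code): confirm leadership immediately, keep a future that completes 
when recovery finishes, and gate only job submission on that future:

{code:java}
import java.util.concurrent.CompletableFuture;

/** Illustrative only: gate job submission on recovery without blocking other RPCs. */
class RecoveryGatedDispatcher {

  private final CompletableFuture<Void> recoveryFinished = new CompletableFuture<>();

  void grantLeadership() {
    // Confirm leadership right away and run job recovery in the background.
    CompletableFuture.runAsync(this::recoverPersistedJobs)
        .whenComplete((ignored, error) -> {
          if (error != null) {
            recoveryFinished.completeExceptionally(error);
          } else {
            recoveryFinished.complete(null);
          }
        });
  }

  CompletableFuture<String> submitJob(String jobGraph) {
    // Only submission waits for recovery; other RPCs (job status, metrics,
    // listing jobs, ...) can be served immediately.
    return recoveryFinished.thenApply(ignored -> "submitted: " + jobGraph);
  }

  private void recoverPersistedJobs() {
    // Placeholder for recovering the persisted job graphs.
  }
}
{code}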





[jira] [Created] (FLINK-10077) Web UI should respond when there is no elected Dispatcher

2018-08-06 Thread Joey Echeverria (JIRA)
Joey Echeverria created FLINK-10077:
---

 Summary: Web UI should respond when there is no elected Dispatcher
 Key: FLINK-10077
 URL: https://issues.apache.org/jira/browse/FLINK-10077
 Project: Flink
  Issue Type: Improvement
  Components: Webfrontend
Affects Versions: 1.5.0
Reporter: Joey Echeverria


The Dispatcher provides much of the relevant information that you see in the 
web UI, and currently the UI won't load at all until a Dispatcher leader is 
elected. An improved user experience would be for the UI to display whatever 
information it can and only omit the information that it cannot show 
(e.g. the number of running jobs).


