Hi Andreas,

Thanks for reaching out. This should not happen. Maybe your operating system has configured low limits for the number of concurrent connections / sockets. Maybe this thread is helpful: https://stackoverflow.com/questions/923990/why-do-i-get-connection-refused-after-1024-connections (there might be better SO threads, I didn't put much effort into searching :) )
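For a quick sanity check (just a minimal sketch, assuming a Linux host with Python on it), something along these lines prints the per-process file-descriptor limit and the kernel's listen-backlog cap, which is what the StackOverflow thread above is about. Run it on the machine that submits the jobs and, if possible, on the JobManager host:

    import resource

    # Per-process cap on open file descriptors; every TCP socket consumes one.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("RLIMIT_NOFILE: soft=%d hard=%d" % (soft, hard))

    # Kernel-wide cap on the accept backlog of listening sockets (Linux-specific path).
    with open("/proc/sys/net/core/somaxconn") as f:
        print("net.core.somaxconn =", f.read().strip())

If those limits look reasonable, you could also try raising the client-side REST retry settings (rest.retry.max-attempts / rest.retry.delay in flink-conf.yaml, if I remember the option names correctly) to ride out the transient refusals, but that would only treat the symptom.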
On Mon, Jul 27, 2020 at 6:31 PM Hailu, Andreas <andreas.ha...@gs.com> wrote:

> Hi team,
>
> We’ve observed that when we submit a decent number of jobs in parallel from a single Job Master, we encounter job failures due to Connection Refused exceptions. We’ve seen this behavior start at 30 jobs running in parallel. It’s seemingly transient, however, as upon several retries the job succeeds. The surface-level error varies, but digging deeper in stack traces it looks to stem from the Job Manager no longer accepting connections.
>
> I’ve included a couple of examples below from failed jobs’ driver logs, with different errors stemming from a connection refused error:
>
> First example: 15 Task Managers / 2 cores / 4096 Job Manager memory / 12288 Task Manager memory - 30 jobs submitted in parallel, each with parallelism of 1
>
> Job Manager is running @ d43723-563.dc.gs.com: Using job manager web tracking url <a href="http://d43723-563.dc.gs.com:41268"> Job Manager Web Interface (http://d43723-563.dc.gs.com:41268) </a>
>
> org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result. (JobID: 1dfef6303cf0e888231d4c57b4b4e0e6)
>     at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:255)
>     at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)
>     at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)
>     at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
>     ...
> Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
>     at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:273)
>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>     at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:341)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:327)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:343)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:591)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:508)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
>     ... 1 more
> Caused by: java.util.concurrent.CompletionException: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: d43723-563.dc.gs.com/10.47.126.221:41268
>     at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>     at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>     at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)
>     at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
>     ... 16 more
> Caused by: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: d43723-563.dc.gs.com/10.47.126.221:41268
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>     at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
>     ... 6 more
> Caused by: java.net.ConnectException: Connection refused
>
> Second example: 30 Task Managers / 2 cores / 4096 Job Manager memory / 12288 Task Manager memory - 60 jobs submitted in parallel, each with parallelism of 1
>
> Job Manager is running @ d43723-484.dc.gs.com: Using job manager web tracking url <a href="http://d43723-484.dc.gs.com:36757"> Job Manager Web Interface (http://d43723-484.dc.gs.com:36757) </a>
>
> org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result. (JobID: 9c4a797df26b510a92a843c756dc4b3d)
>     at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:255)
>     at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)
>     at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)
>     at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
>     ...
> Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
>     at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:382)
>     at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>     at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>     at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:263)
>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>     at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
>     at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:929)
>     at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>     ... 3 more
> Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Could not upload job files.]
>     at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:389)
>     at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:373)
>     at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)
>     at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
>     ... 4 more
> ... (this pattern repeats for number of unique JobIDs)
> Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Could not upload job files.]
>     at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:389)
>     at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:373)
>     at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)
>     at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
>     at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
>     ...
>
> 26 05:46:39,734 [CASHFLOW-18394] WARN FlinkClusterStateMonitor - Error while attempting to fetch job details for job 4d20537a676df2855e29b31b1de1ead5
> com.gs.ep.data.lake.refinerlib.restful.RestfulException: failed connecting to http://d43723-484.dc.gs.com:36757/jobs/4d20537a676df2855e29b31b1de1ead5 after 1 time(s)
> Caused by: java.net.ConnectException: Connection refused
>     at java.net.PlainSocketImpl.socketConnect(Native Method)
>     at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>     at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>     at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>     at java.net.Socket.connect(Socket.java:589)
>     at java.net.Socket.connect(Socket.java:538)
>     at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
>
> These connection refusal exceptions and their transient nature make me think that it might be a network-related issue. It’s not uncommon for us to need to run 100+ jobs in parallel. How can we investigate what’s causing the Job Manager to periodically refuse connections? I can see a Netty package in the first example’s stack trace – is there any option we can tune?
>
> ____________
>
> Andreas Hailu
> Data Lake Engineering | Goldman Sachs & Co.
>
> ------------------------------
>
> Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices