[jira] [Commented] (FLINK-16468) BlobClient rapid retrieval retries on failure opens too many sockets
[ https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17328004#comment-17328004 ]

Flink Jira Bot commented on FLINK-16468:
----------------------------------------

This major issue is unassigned, and neither it nor any of its sub-tasks has been updated for 30 days, so it has been labeled "stale-major". If this ticket is indeed "major", please either assign yourself or post an update, and then remove the label. In 7 days the issue will be deprioritized.

> BlobClient rapid retrieval retries on failure opens too many sockets
> --------------------------------------------------------------------
>
>                 Key: FLINK-16468
>                 URL: https://issues.apache.org/jira/browse/FLINK-16468
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.8.3, 1.9.2, 1.10.0
>         Environment: Linux Ubuntu servers, patch-current latest Ubuntu release, Java 8 JRE
>            Reporter: Jason Kania
>            Priority: Major
>              Labels: stale-major
>             Fix For: 1.13.0
>
> In situations where the BlobClient retrieval fails as in the following log, rapid retries will exhaust the open sockets. All the retries happen within a few milliseconds.
> {noformat}
> 2020-03-06 17:19:07,116 ERROR org.apache.flink.runtime.blob.BlobClient - Failed to fetch BLOB cddd17ef76291dd60eee9fd36085647a/p-bcd61652baba25d6863cf17843a2ef64f4c801d5-c1781532477cf65ff1c1e7d72dccabc7 from aaa-1/10.0.1.1:45145 and store it under /tmp/blobStore-7328ed37-8bc7-4af7-a56c-474e264157c9/incoming/temp-0004 Retrying...
> {noformat}
> The above is output repeatedly until the following error occurs:
> {noformat}
> java.io.IOException: Could not connect to BlobServer at address aaa-1/10.0.1.1:45145
> 	at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:100)
> 	at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:143)
> 	at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
> 	at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:202)
> 	at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
> 	at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:915)
> 	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:595)
> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
> 	at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.SocketException: Too many open files
> 	at java.net.Socket.createImpl(Socket.java:478)
> 	at java.net.Socket.connect(Socket.java:605)
> 	at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:95)
> 	... 8 more
> {noformat}
> The retries should have some form of backoff in this situation to avoid flooding the logs and exhausting other resources on the server.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
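The capped, increasing backoff suggested in the report could be sketched as follows. This is a hypothetical illustration only: the class and method names are invented for this sketch and are not part of Flink's BlobClient API.

```java
// Hypothetical sketch of a capped exponential backoff schedule for retries.
// All names and default values here are illustrative, not Flink's.
public class ExponentialBackoff {
    private final long initialDelayMs;
    private final long maxDelayMs;

    public ExponentialBackoff(long initialDelayMs, long maxDelayMs) {
        this.initialDelayMs = initialDelayMs;
        this.maxDelayMs = maxDelayMs;
    }

    /** Delay before the given retry attempt (attempt numbering starts at 0). */
    public long delayForAttempt(int attempt) {
        // Double the delay on each attempt; cap the shift to avoid overflow
        // and cap the result to avoid unbounded waits.
        long delay = initialDelayMs << Math.min(attempt, 30);
        return Math.min(delay, maxDelayMs);
    }

    public static void main(String[] args) {
        ExponentialBackoff backoff = new ExponentialBackoff(100, 10_000);
        for (int attempt = 0; attempt < 8; attempt++) {
            System.out.println("attempt " + attempt + ": "
                    + backoff.delayForAttempt(attempt) + " ms");
        }
    }
}
```

With an initial delay of 100 ms and a 10 s cap, the retries that previously all fired "within a few milliseconds" would instead be spread over roughly 25 seconds for the first eight attempts, which keeps the open-socket count bounded.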
[jira] [Commented] (FLINK-16468) BlobClient rapid retrieval retries on failure opens too many sockets
[ https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17227405#comment-17227405 ]

Jason Kania commented on FLINK-16468:
-------------------------------------

Up until now, I have been redirected away from these activities. The problem still exists in that rapid reconnections occur, but I have not had any chance to investigate, and it looks like I won't for a while. If you wish to close, feel free, and I will reference this issue if I am able to reopen and reinvestigate.
[jira] [Commented] (FLINK-16468) BlobClient rapid retrieval retries on failure opens too many sockets
[ https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17227323#comment-17227323 ]

Till Rohrmann commented on FLINK-16468:
---------------------------------------

[~longtimer] did you have the chance to reproduce the problem with debug logs?
[jira] [Commented] (FLINK-16468) BlobClient rapid retrieval retries on failure opens too many sockets
[ https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114792#comment-17114792 ]

Gary Yao commented on FLINK-16468:
----------------------------------

[~longtimer] No problem, take care!
[jira] [Commented] (FLINK-16468) BlobClient rapid retrieval retries on failure opens too many sockets
[ https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114190#comment-17114190 ]

Jason Kania commented on FLINK-16468:
-------------------------------------

Sorry [~gjy], not to this point. The current economic/health situation has resulted in a need to redirect our efforts for the moment. We have not done more testing in the short term.
[jira] [Commented] (FLINK-16468) BlobClient rapid retrieval retries on failure opens too many sockets
[ https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113920#comment-17113920 ]

Gary Yao commented on FLINK-16468:
----------------------------------

Is there any news, [~longtimer]?
[jira] [Commented] (FLINK-16468) BlobClient rapid retrieval retries on failure opens too many sockets
[ https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067499#comment-17067499 ]

Gary Yao commented on FLINK-16468:
----------------------------------

Thanks, looking forward to your reply.
[jira] [Commented] (FLINK-16468) BlobClient rapid retrieval retries on failure opens too many sockets
[ https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066231#comment-17066231 ]

Jason Kania commented on FLINK-16468:
-------------------------------------

[~gjy], the failure is not reliably reproducible, but I will try to collect some details over time. With Flink, Pulsar, Zookeeper, and database connections in the mix, we get different errors/exceptions each time there is a network failure (or we simulate one), depending on which component is the first to encounter the networking issue and when the different components attempt to recover.
[jira] [Commented] (FLINK-16468) BlobClient rapid retrieval retries on failure opens too many sockets
[ https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17065516#comment-17065516 ]

Gary Yao commented on FLINK-16468:
----------------------------------

{quote}The 1 second delay only seems to moderate the CPU utilization but not help with the applications giving up and being left in an unknown state.{quote}
That is valuable feedback. Are you able to reproduce this reliably, and can you share your log files with us (preferably at DEBUG level)? It would help us understand the problem in more depth, and we could run some experiments in a setup similar to yours.

{quote}Maybe having a pluggable restart strategy for all components could better allow users to handle the particulars of each installation?{quote}
This can be considered, but it will likely be a bigger effort. Before we can plan for that, we would like to understand the problem better.
[jira] [Commented] (FLINK-16468) BlobClient rapid retrieval retries on failure opens too many sockets
[ https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17065174#comment-17065174 ]

Jason Kania commented on FLINK-16468:
-------------------------------------

[~gjy], using my particular situation as a reference, the network outage seemed to be just under one minute, and that was more than enough to bring Flink, the running jobs, and the related queuing applications to the point where they were not recoverable on their own after the network recovered. What I hope for from a recovery strategy, and have implemented in the telecom industry in the past, is that this recovery can happen on its own.

The 1 second delay only seems to moderate the CPU utilization but not help with the applications giving up and being left in an unknown state. Without some form of increasing backoff, you either need to have a large number of retries or expect the application to give up. Since the applications will take at least 10 seconds to restart, the different components bouncing at the same time in an outage such as this means there is just too much instability for all the components to recover.

That said, I understand what you mean about not having control over libraries such as Curator. Just today, by playing with the network interface, I managed to create an unhandled exception where the Zookeeper client gave up, leaving Flink in a funny state.

I also think that the 1.10 documentation is an improvement; the test will be once we migrate to using it. Maybe having a pluggable restart strategy for all components could better allow users to handle the particulars of each installation?
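The "increasing backoff instead of a large number of retries" idea discussed above could be sketched as a retry helper along these lines. This is a hypothetical illustration, not Flink's actual BlobClient retry logic; the helper name and parameters are invented for this sketch.

```java
import java.util.concurrent.Callable;

// Hypothetical retry helper with capped exponential backoff, in the spirit of
// the "increasing backoff" discussed above. Names are illustrative only.
public class RetryWithBackoff {

    /** Runs the action until it succeeds or maxAttempts (>= 1) is exhausted. */
    public static <T> T retry(Callable<T> action, int maxAttempts,
                              long initialDelayMs, long maxDelayMs) throws Exception {
        long delay = initialDelayMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Wait before the next attempt, doubling the delay each
                    // time up to the cap, instead of retrying within
                    // milliseconds and exhausting sockets.
                    Thread.sleep(delay);
                    delay = Math.min(delay * 2, maxDelayMs);
                }
            }
        }
        throw last; // all attempts exhausted
    }
}
```

A bounded attempt count combined with a growing delay addresses both failure modes mentioned in the thread: the log flooding from millisecond retries and the "give up too soon" problem during an outage lasting around a minute.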
[jira] [Commented] (FLINK-16468) BlobClient rapid retrieval retries on failure opens too many sockets
[ https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17065098#comment-17065098 ]

Gary Yao commented on FLINK-16468:
----------------------------------

[~longtimer] Thanks for getting back so quickly. After reading your previous messages again, my understanding is that you would like to see a unified way of handling retries (delays, attempts, etc.) in the job restart and deployment path (please correct me if I am wrong). You are right that there is a plethora of knobs that can be tuned. However, it should not be required to tune them all manually, and we strive for sane defaults that work well for all users. Beginning with the latest Flink release (1.10.0), the [configuration options documentation|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/config.html] has been restructured to make it more obvious to users which options are important.

In my opinion, a unified retry configuration is difficult to implement because:
* Different parts of the system have different requirements. For example, I do not think it makes sense to wait exponentially in the BlobClient.
* When using libraries such as Curator and Akka, and even in the connectors, we do not always have full control over the restart policy.

Lastly, I wanted to ask whether you think this issue would have surfaced if your Flink cluster had used the new default restart delay of 1s. If it would not have surfaced, then I would question whether there is a need to change anything at all at the moment.
> BlobClient rapid retrieval retries on failure opens too many sockets > > > Key: FLINK-16468 > URL: https://issues.apache.org/jira/browse/FLINK-16468 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.8.3, 1.9.2, 1.10.0 > Environment: Linux ubuntu servers running, patch current latest > Ubuntu patch current release java 8 JRE >Reporter: Jason Kania >Priority: Major > Fix For: 1.11.0 > > > In situations where the BlobClient retrieval fails as in the following log, > rapid retries will exhaust the open sockets. All the retries happen within a > few milliseconds. > {noformat} > 2020-03-06 17:19:07,116 ERROR org.apache.flink.runtime.blob.BlobClient - > Failed to fetch BLOB > cddd17ef76291dd60eee9fd36085647a/p-bcd61652baba25d6863cf17843a2ef64f4c801d5-c1781532477cf65ff1c1e7d72dccabc7 > from aaa-1/10.0.1.1:45145 and store it under > /tmp/blobStore-7328ed37-8bc7-4af7-a56c-474e264157c9/incoming/temp-0004 > Retrying... > {noformat} > The above is output repeatedly until the following error occurs: > {noformat} > java.io.IOException: Could not connect to BlobServer at address > aaa-1/10.0.1.1:45145 > at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:100) > at > org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:143) > at > org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181) > at > org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:202) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120) > at > org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:915) > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:595) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.net.SocketException: Too many open files > at
java.net.Socket.createImpl(Socket.java:478) > at java.net.Socket.connect(Socket.java:605) > at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:95) > ... 8 more > {noformat} > The retries should have some form of backoff in this situation to avoid > flooding the logs and exhausting other resources on the server. -- This message was sent by Atlassian Jira (v8.3.4#803005)
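The fix the description asks for (retries with some form of backoff) can be sketched as follows. This is hypothetical code, not Flink's actual BlobClient implementation; {{BlobFetcher}} is an illustrative stand-in for the real download call.

```java
// Hypothetical sketch of the requested behavior; NOT Flink's actual
// BlobClient code. BlobFetcher stands in for downloadFromBlobServer().
import java.io.IOException;

public class BackoffRetrySketch {

    /** Illustrative stand-in for the real BlobServer download call. */
    interface BlobFetcher {
        byte[] fetch() throws IOException;
    }

    /**
     * Retries the fetch up to maxAttempts times, sleeping between attempts
     * instead of reconnecting immediately, so failed attempts do not pile
     * up sockets in TIME-WAIT within a few milliseconds.
     */
    static byte[] downloadWithBackoff(BlobFetcher fetcher, int maxAttempts, long delayMillis)
            throws IOException, InterruptedException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetcher.fetch();
            } catch (IOException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis); // back off before the next connection attempt
                }
            }
        }
        throw last;
    }
}
```

With a delay between attempts, a burst of failures spreads its connection attempts over time instead of opening them all within a few milliseconds.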
[jira] [Commented] (FLINK-16468) BlobClient rapid retrieval retries on failure opens too many sockets
[ https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17064899#comment-17064899 ] Jason Kania commented on FLINK-16468: - [~gjy], to me, as a user, your restart strategy seems to lack a consolidated approach, given the patchwork addition of individual small timeouts. I ask the open question of how many more small timeouts may have to be added for different components that exhibit the same instant-restart approach in the face of external failures. I also question where users can expect to learn about each individual timer in detail and what each should be set to. In my opinion, users are left finding and tuning these values in the face of infrequent failures rather than having the option to handle these situations strategically and proactively. Because the remedy you suggest seems unlikely to be effective in my situation, I am not interested in contributing to this particular proposed solution.
[jira] [Commented] (FLINK-16468) BlobClient rapid retrieval retries on failure opens too many sockets
[ https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17064757#comment-17064757 ] Gary Yao commented on FLINK-16468: -- {quote} I will happily update the user docs, but would appreciate some input on what the implications might be since my lack of experience on the implications was the part of the reason why this issue and https://issues.apache.org/jira/browse/FLINK-16470 were raised in the first place. {quote} We changed the default restart delay from 0s to 1s to mitigate restart storms (see [FLIP-62|https://cwiki.apache.org/confluence/display/FLINK/FLIP-62%3A+Set+default+restart+delay+for+FixedDelay-+and+FailureRateRestartStrategy+to+1s]). For example, if your job failed due to the data source being overloaded, frequent restarts will only worsen the situation as this will further increase load on the data source. {quote} Given the option, I would go with a backoff algorithm going something like 1,2,4,8,16... seconds which provides both user feedback and some chance for network recovery. {quote} I think waiting for 16s or more would be quite drastic. If a job restarts, the exception will be visible on the Web UI. However, if we sleep for (1 + 2 + 4 + 8 + 16) seconds on the TaskManager, the only feedback we provide to the user is through log files. I would be ok to introduce a configurable, low delay between BlobClient retries (e.g., by default 1s). Note, however, that this change would require a [FLIP|https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals#FlinkImprovementProposals-Whatisconsidereda%22majorchange%22thatneedsaFLIP?] since we would introduce a new public interface. All in all, I think this issue has low priority at the moment. 
[jira] [Commented] (FLINK-16468) BlobClient rapid retrieval retries on failure opens too many sockets
[ https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17064685#comment-17064685 ] Nico Kruber commented on FLINK-16468: - Actually, [~gjy] had a good point bringing in the task's restart delay because, obviously, both things are connected here. Therefore, I also think adding a simple backoff time in the {{BlobClient}} would be enough; let the rest be handled by the global restart strategy. As for that, I think the existing restart strategies are probably also enough (and we don't need an exponential one for now), but the documentation may need a few more details on the implications of selecting short restart delays (since we want to be nice to the user). As long as this delay does not exhaust the number of sockets, it is fine to just have a fixed delay. If you think an exponential restart delay is required, I would propose discussing this on the mailing list to gather input.
[jira] [Commented] (FLINK-16468) BlobClient rapid retrieval retries on failure opens too many sockets
[ https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061785#comment-17061785 ] Jason Kania commented on FLINK-16468: - [~gjy] I will happily update the user docs, but would appreciate some input on what the implications might be, since my lack of experience with the implications was part of the reason why this issue and https://issues.apache.org/jira/browse/FLINK-16470 were raised in the first place. If you assign this to me, I can provide a backoff implementation. However, you mentioned a backoff time, versus the backoff algorithm that [~NicoK] mentioned and that I was thinking of. Given the option, I would go with a backoff algorithm along the lines of 1, 2, 4, 8, 16... seconds, which provides both user feedback and some chance for network recovery. If the BlobClient follows the same approach, then it too would use the 'exponential' backoff, as would all the components currently tied to the restart delay.
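The 1, 2, 4, 8, 16... second sequence suggested above could be computed with a small helper like the one below. This is an illustrative sketch, not Flink code; the cap and the method name are assumptions.

```java
// Illustrative helper for the doubling delay sequence suggested in the
// comment above; the cap and naming are assumptions, not Flink code.
public class ExponentialBackoff {

    /** Delay before retry attempt n (1-based), doubling from 1s up to a cap. */
    static long delayMillis(int attempt, long capMillis) {
        // The shift is bounded so the computation cannot overflow for large attempt counts.
        long uncapped = 1_000L << Math.min(attempt - 1, 30);
        return Math.min(uncapped, capMillis);
    }
}
```

Capping the delay addresses the feedback concern raised earlier in the thread: the wait grows quickly at first but never exceeds a bounded maximum.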
[jira] [Commented] (FLINK-16468) BlobClient rapid retrieval retries on failure opens too many sockets
[ https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061487#comment-17061487 ] Gary Yao commented on FLINK-16468: -- We could say a few words about the implications of a low restart delay in the user docs. Since the implications are very extensive, I would keep the description general and not about the BlobClient specifically. If you want to improve the user docs, feel free to open a new JIRA issue and cc me. Introducing a backoff time makes sense since we currently just exhaust all retry attempts without giving the network/services time to recover. However, I would still keep the retry delays low (i.e., a few seconds) because otherwise the user is left without feedback about the state of the deployment. If you want to work on this issue, let me know and I will assign it to you. The current 1-second restart delay probably already mitigates the issue. There will be at most 300 (60*5) BlobClient retries per minute, and a socket leaves the TIME-WAIT state after [1 minute|https://github.com/torvalds/linux/blob/bd2463ac7d7ec51d432f23bf0e893fb371a908cd/include/net/tcp.h#L121]. Therefore, the current retry mechanism ties up at most 300 sockets per TM.
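The 300-sockets-per-TM figure above can be reproduced with a quick back-of-envelope calculation. The inputs reflect the values discussed in this thread (5 retries per restart from the {{blob.fetch.retries}} default, a 1s restart delay, and a 60s TIME-WAIT window); the helper itself is illustrative.

```java
// Back-of-envelope check of the per-TM socket figure discussed above.
// Assumptions (from the thread): retriesPerRestart connection attempts per
// job restart, one restart per restartDelaySeconds, and each closed socket
// lingering in TIME-WAIT for timeWaitSeconds.
public class SocketPressure {

    /** Worst-case number of sockets simultaneously stuck in TIME-WAIT. */
    static long maxTimeWaitSockets(int retriesPerRestart, int restartDelaySeconds, int timeWaitSeconds) {
        long restartsWhileLingering = timeWaitSeconds / restartDelaySeconds;
        return restartsWhileLingering * retriesPerRestart;
    }
}
```

With a 0s restart delay, by contrast, the restart rate is unbounded, which matches the socket-exhaustion behavior reported in this issue.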
[jira] [Commented] (FLINK-16468) BlobClient rapid retrieval retries on failure opens too many sockets
[ https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060282#comment-17060282 ] Jason Kania commented on FLINK-16468: - [~gjy] These were reported on the same day but were the result of distinct failures on different days. If they are both due to the restart delay, then I would suggest more detail be added to the restart delay documentation text, because right now its implications are not fully explained. Additionally, a 1-second default restart delay is still going to leave sockets in the TIME-WAIT state. This would suggest that a backoff algorithm would be more appropriate than a fixed delay.
[jira] [Commented] (FLINK-16468) BlobClient rapid retrieval retries on failure opens too many sockets
[ https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060264#comment-17060264 ] Gary Yao commented on FLINK-16468: -- [~longtimer] {{blob.fetch.retries}} is only 5 by default. Is this issue related to FLINK-16470, where you had a restart delay of 0?
[jira] [Commented] (FLINK-16468) BlobClient rapid retrieval retries on failure opens too many sockets
[ https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17057498#comment-17057498 ] Jason Kania commented on FLINK-16468: - [~NicoK], I will try the {{blob.fetch.retries}} setting and see. As for the deployment, it is only 4 slots in one task manager on a 2-CPU system, so I was not expecting to exhaust the number of sockets either. It seems to have been the sheer number of retries in quick succession that did it, so the cause may have been all the closed socket connections sitting in TIME-WAIT at the OS level that were not yet available for reuse. If the issue happens again, I will see if more information is available. However, a backoff algorithm does seem to be a good plan.
[jira] [Commented] (FLINK-16468) BlobClient rapid retrieval retries on failure opens too many sockets
[ https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056755#comment-17056755 ] Nico Kruber commented on FLINK-16468: - Hi [~longtimer], I verified in the code that the sockets which are opened for each retry are closed properly; however, TCP sockets enter a TIME-WAIT state and linger a little while longer until they are really cleaned up [1]. You could try changing your kernel's settings to enable fast reuse of sockets to cope with that [1], or increase the limit on the number of open sockets, but I agree that having some (exponential) back-off may be a better solution. Reducing {{blob.fetch.retries}} may also be an option for now. I'm still a bit puzzled as to why you exhaust the number of sockets. Are you maybe deploying to a larger set of TMs on the same machine, one TM with a lot of slots, or a huge number of tasks? [1] https://vincent.bernat.ch/en/blog/2014-tcp-time-wait-state-linux
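As an interim mitigation along the lines Nico suggests, the retry count can be lowered in the Flink configuration. The value shown here is purely illustrative (the thread states the default is 5):

```yaml
# flink-conf.yaml: reduce BlobClient connection retries (default: 5).
# Illustrative value only. Fewer retries means fewer sockets opened in a
# tight failure loop, at the cost of less resilience to transient errors.
blob.fetch.retries: 2
```

This only limits how many sockets a single failed download opens; it does not add any delay between the attempts.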
[jira] [Commented] (FLINK-16468) BlobClient rapid retrieval retries on failure opens too many sockets
[ https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056533#comment-17056533 ] Jason Kania commented on FLINK-16468: - [~azagrebin], the only thing I saw was a lot of repeats of the IOException before the last one included here and the following SocketException. I did not see anything preceding it and the logs were deleted because of the flood and excess disk utilization. The debug logs for the BlobClient are now enabled. I will update this issue if the error occurs again. The blob.fetch.retries was not modified from the default value. > BlobClient rapid retrieval retries on failure opens too many sockets > > > Key: FLINK-16468 > URL: https://issues.apache.org/jira/browse/FLINK-16468 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.9.2 > Environment: Linux ubuntu servers running, patch current latest > Ubuntu patch current release java 8 JRE >Reporter: Jason Kania >Priority: Major > > In situations where the BlobClient retrieval fails as in the following log, > rapid retries will exhaust the open sockets. All the retries happen within a > few milliseconds. 
[jira] [Commented] (FLINK-16468) BlobClient rapid retrieval retries on failure opens too many sockets
[ https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056020#comment-17056020 ] Andrey Zagrebin commented on FLINK-16468: - Thanks for reporting this, [~longtimer]. Could you attach the full logs? Could you enable debug logs for org.apache.flink.runtime.blob.BlobClient so we can see the underlying reason for each retry? Have you changed the "blob.fetch.retries" option? The problem may also be that the socket is not properly closed after some other failure. cc [~NicoK]
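The descriptor-leak scenario Andrey suspects can be illustrated with a small hypothetical sketch (the class and method names are invented for illustration; this is not the actual BlobClient constructor). If the Socket is abandoned when connect() throws, each failed retry leaks a file descriptor until the process hits "Too many open files"; closing it in the failure path avoids that:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical illustration of the suspected leak, not Flink's actual code.
public class SafeConnect {
    // Safe pattern: the socket is closed before the connect failure propagates,
    // so a failed attempt releases its file descriptor immediately.
    static Socket connect(InetSocketAddress addr, int timeoutMs) throws IOException {
        Socket socket = new Socket();
        try {
            socket.connect(addr, timeoutMs);
            return socket;
        } catch (IOException e) {
            socket.close(); // release the descriptor before rethrowing
            throw e;
        }
    }
}
```

The leaky variant would simply omit the catch block and let the open Socket go out of scope on failure; under rapid retries that is enough to exhaust the ulimit on open files.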