[jira] [Commented] (FLINK-12312) Temporarily disable CLI command for rescaling
[ https://issues.apache.org/jira/browse/FLINK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607104#comment-17607104 ]

Gary Yao commented on FLINK-12312:
----------------------------------

[~Zhanghao Chen] I updated the link in the issue. Note that the feature was disabled 3 years ago, so the code might be in a state that would allow re-implementing this feature.

> Temporarily disable CLI command for rescaling
> ---------------------------------------------
>
>                 Key: FLINK-12312
>                 URL: https://issues.apache.org/jira/browse/FLINK-12312
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>            Reporter: Gary Yao
>            Assignee: Gary Yao
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.9.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Temporarily remove support to rescale job via CLI. See this thread for more
> details: https://www.mail-archive.com/dev@flink.apache.org/msg25266.html

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (FLINK-12312) Temporarily disable CLI command for rescaling
[ https://issues.apache.org/jira/browse/FLINK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary Yao updated FLINK-12312:
-----------------------------
    Description:
Temporarily remove support to rescale job via CLI. See this thread for more details: https://www.mail-archive.com/dev@flink.apache.org/msg25266.html

  was:
Temporarily remove support to rescale job via CLI. See this thread for more details: https://lists.apache.org/thread/oby7fmz9crphonxw3l0g8b9zvybg3sno

> Temporarily disable CLI command for rescaling
> ---------------------------------------------

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Closed] (FLINK-17463) BlobCacheCleanupTest.testPermanentBlobCleanup:133->verifyJobCleanup:432 » FileAlreadyExists
[ https://issues.apache.org/jira/browse/FLINK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary Yao closed FLINK-17463.
----------------------------
    Resolution: Fixed

1.11: 647f76283c900048e12361cf96d26db2a184b10b
master: c22d01d3bfbb1384f98664361f1491b806e95798

> BlobCacheCleanupTest.testPermanentBlobCleanup:133->verifyJobCleanup:432 » FileAlreadyExists
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-17463
>                 URL: https://issues.apache.org/jira/browse/FLINK-17463
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Tests
>    Affects Versions: 1.11.0
>            Reporter: Robert Metzger
>            Assignee: Gary Yao
>            Priority: Critical
>              Labels: pull-request-available, test-stability
>             Fix For: 1.11.0
>
> CI run: https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=317&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=4ed44b66-cdd6-5dcf-5f6a-88b07dda665d
> {code}
> [ERROR] Tests run: 5, Failures: 0, Errors: 1, Skipped: 1, Time elapsed: 2.73 s <<< FAILURE! - in org.apache.flink.runtime.blob.BlobCacheCleanupTest
> [ERROR] testPermanentBlobCleanup(org.apache.flink.runtime.blob.BlobCacheCleanupTest)  Time elapsed: 2.028 s  <<< ERROR!
> java.nio.file.FileAlreadyExistsException: /tmp/junit7984674749832216773/junit1629420330972938723/blobStore-296d1a51-8917-4db1-a920-5d4e17e6fa36/job_3bafac5425979b4fe2fa2c7726f8dd5b
> 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:88)
> 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
> 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
> 	at sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384)
> 	at java.nio.file.Files.createDirectory(Files.java:674)
> 	at java.nio.file.Files.createAndCheckIsDirectory(Files.java:781)
> 	at java.nio.file.Files.createDirectories(Files.java:727)
> 	at org.apache.flink.runtime.blob.BlobUtils.getStorageLocation(BlobUtils.java:196)
> 	at org.apache.flink.runtime.blob.PermanentBlobCache.getStorageLocation(PermanentBlobCache.java:222)
> 	at org.apache.flink.runtime.blob.BlobServerCleanupTest.checkFilesExist(BlobServerCleanupTest.java:213)
> 	at org.apache.flink.runtime.blob.BlobCacheCleanupTest.verifyJobCleanup(BlobCacheCleanupTest.java:432)
> 	at org.apache.flink.runtime.blob.BlobCacheCleanupTest.testPermanentBlobCleanup(BlobCacheCleanupTest.java:133)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (FLINK-17463) BlobCacheCleanupTest.testPermanentBlobCleanup:133->verifyJobCleanup:432 » FileAlreadyExists
[ https://issues.apache.org/jira/browse/FLINK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117735#comment-17117735 ]

Gary Yao commented on FLINK-17463:
----------------------------------

This is most likely caused by concurrent calls of
{code}
Files.createDirectories(...);
{code}
and
{code}
FileUtils.deleteDirectory(...);
{code}
with the same arguments.

> BlobCacheCleanupTest.testPermanentBlobCleanup:133->verifyJobCleanup:432 » FileAlreadyExists
> -------------------------------------------------------------------------------------------

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
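The race named in the comment above can be absorbed on the caller side. Below is a minimal, hypothetical Java sketch (not Flink code): a wrapper around `Files.createDirectories` that retries when a concurrent recursive delete makes the call fail transiently with `FileAlreadyExistsException`. The class, method, and constant names are illustrative.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

public final class DirectoryRace {

    private static final int MAX_ATTEMPTS = 3; // illustrative bound

    /**
     * Files.createDirectories walks the path and creates missing components
     * one by one; a concurrent recursive delete of the same tree can make a
     * component appear "already existing" mid-walk. A bounded retry absorbs
     * that transient failure.
     */
    static Path ensureDirectory(Path dir) throws IOException {
        IOException last = null;
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            try {
                return Files.createDirectories(dir);
            } catch (FileAlreadyExistsException e) {
                last = e; // raced with a concurrent delete/create; retry
            }
        }
        throw last;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("blobStore-test");
        Path jobDir = tmp.resolve("job-dir");
        // Idempotent: calling twice must not fail.
        ensureDirectory(jobDir);
        ensureDirectory(jobDir);
        System.out.println(Files.isDirectory(jobDir));
    }
}
```

Without concurrency the retry never triggers, so the wrapper behaves exactly like `Files.createDirectories`.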
[jira] [Assigned] (FLINK-17463) BlobCacheCleanupTest.testPermanentBlobCleanup:133->verifyJobCleanup:432 » FileAlreadyExists
[ https://issues.apache.org/jira/browse/FLINK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary Yao reassigned FLINK-17463:
--------------------------------
    Assignee: Gary Yao

> BlobCacheCleanupTest.testPermanentBlobCleanup:133->verifyJobCleanup:432 » FileAlreadyExists
> -------------------------------------------------------------------------------------------

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Assigned] (FLINK-13553) KvStateServerHandlerTest.readInboundBlocking unstable on Travis
[ https://issues.apache.org/jira/browse/FLINK-13553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary Yao reassigned FLINK-13553:
--------------------------------
    Assignee: (was: Gary Yao)

> KvStateServerHandlerTest.readInboundBlocking unstable on Travis
> ---------------------------------------------------------------
>
>                 Key: FLINK-13553
>                 URL: https://issues.apache.org/jira/browse/FLINK-13553
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Queryable State
>    Affects Versions: 1.10.0, 1.11.0
>            Reporter: Till Rohrmann
>            Priority: Critical
>              Labels: pull-request-available, test-stability
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The {{KvStateServerHandlerTest.readInboundBlocking}} and {{KvStateServerHandlerTest.testQueryExecutorShutDown}} fail on Travis with a {{TimeoutException}}.
> https://api.travis-ci.org/v3/job/566420641/log.txt

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (FLINK-16468) BlobClient rapid retrieval retries on failure opens too many sockets
[ https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17114792#comment-17114792 ]

Gary Yao commented on FLINK-16468:
----------------------------------

[~longtimer] No problem, take care!

> BlobClient rapid retrieval retries on failure opens too many sockets
> --------------------------------------------------------------------
>
>                 Key: FLINK-16468
>                 URL: https://issues.apache.org/jira/browse/FLINK-16468
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.8.3, 1.9.2, 1.10.0
>         Environment: Linux Ubuntu servers, patch-current latest Ubuntu release, patch-current Java 8 JRE
>            Reporter: Jason Kania
>            Priority: Major
>             Fix For: 1.11.0
>
> In situations where the BlobClient retrieval fails as in the following log, rapid retries will exhaust the open sockets. All the retries happen within a few milliseconds.
> {noformat}
> 2020-03-06 17:19:07,116 ERROR org.apache.flink.runtime.blob.BlobClient - Failed to fetch BLOB cddd17ef76291dd60eee9fd36085647a/p-bcd61652baba25d6863cf17843a2ef64f4c801d5-c1781532477cf65ff1c1e7d72dccabc7 from aaa-1/10.0.1.1:45145 and store it under /tmp/blobStore-7328ed37-8bc7-4af7-a56c-474e264157c9/incoming/temp-0004 Retrying...
> {noformat}
> The above is output repeatedly until the following error occurs:
> {noformat}
> java.io.IOException: Could not connect to BlobServer at address aaa-1/10.0.1.1:45145
> 	at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:100)
> 	at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:143)
> 	at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
> 	at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:202)
> 	at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
> 	at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:915)
> 	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:595)
> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
> 	at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.SocketException: Too many open files
> 	at java.net.Socket.createImpl(Socket.java:478)
> 	at java.net.Socket.connect(Socket.java:605)
> 	at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:95)
> 	... 8 more
> {noformat}
> The retries should have some form of backoff in this situation to avoid flooding the logs and exhausting other resources on the server.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
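The backoff the reporter asks for can be sketched as a capped exponential delay between reconnection attempts (this is a hypothetical illustration, not the actual Flink fix; the constants are illustrative):

```java
/**
 * Capped exponential backoff: delay doubles per attempt up to a maximum,
 * so retries are spaced out instead of happening within milliseconds and
 * exhausting sockets and log space.
 */
public final class RetryBackoff {

    private final long initialDelayMs;
    private final long maxDelayMs;

    public RetryBackoff(long initialDelayMs, long maxDelayMs) {
        this.initialDelayMs = initialDelayMs;
        this.maxDelayMs = maxDelayMs;
    }

    /** Delay before the given attempt (0-based): initial * 2^attempt, capped. */
    public long delayMs(int attempt) {
        long delay = initialDelayMs << Math.min(attempt, 20); // clamp shift to avoid overflow
        return Math.min(delay, maxDelayMs);
    }

    public static void main(String[] args) {
        RetryBackoff backoff = new RetryBackoff(100, 5_000);
        for (int attempt = 0; attempt < 8; attempt++) {
            // Delays: 100, 200, 400, 800, 1600, 3200, 5000, 5000 ms
            System.out.println("attempt " + attempt + " -> sleep " + backoff.delayMs(attempt) + " ms");
        }
    }
}
```

A caller would sleep for `delayMs(attempt)` before each reconnect; adding random jitter on top is a common refinement when many clients retry against the same server.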
[jira] [Commented] (FLINK-13553) KvStateServerHandlerTest.readInboundBlocking unstable on Travis
[ https://issues.apache.org/jira/browse/FLINK-13553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17114285#comment-17114285 ]

Gary Yao commented on FLINK-13553:
----------------------------------

Added more debug/trace logs.

1.11:
7cbdd91413ee26d00d9015581ce2fa8538fd5963
3e18c109051821176575a15a6b10aaa5cc2e3e12

master:
564e8802a8f1a8c92d3c46686b109dfb826856fe
0cc7aae86dfdb5e51c661620d39caa79a16fd647

> KvStateServerHandlerTest.readInboundBlocking unstable on Travis
> ---------------------------------------------------------------

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Updated] (FLINK-16649) Support Java 14
[ https://issues.apache.org/jira/browse/FLINK-16649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary Yao updated FLINK-16649:
-----------------------------
    Description: Resolve issues occurring when using Flink with Java 14.

> Support Java 14
> ---------------
>
>                 Key: FLINK-16649
>                 URL: https://issues.apache.org/jira/browse/FLINK-16649
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Build System
>            Reporter: Chesnay Schepler
>            Priority: Major
>
> Resolve issues occurring when using Flink with Java 14.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (FLINK-16619) Misleading SlotManagerImpl logging for slot reports of unknown task manager
[ https://issues.apache.org/jira/browse/FLINK-16619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113922#comment-17113922 ]

Gary Yao commented on FLINK-16619:
----------------------------------

The proposal sounds reasonable to me.

> Misleading SlotManagerImpl logging for slot reports of unknown task manager
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-16619
>                 URL: https://issues.apache.org/jira/browse/FLINK-16619
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0
>            Reporter: Chesnay Schepler
>            Priority: Major
>
> If the SlotManager receives a slot report from an unknown task manager it logs 2 messages:
> {code}
> public boolean reportSlotStatus(InstanceID instanceId, SlotReport slotReport) {
>     [...]
>     LOG.debug("Received slot report from instance {}: {}.", instanceId, slotReport);
>
>     TaskManagerRegistration taskManagerRegistration = taskManagerRegistrations.get(instanceId);
>     if (null != taskManagerRegistration) {
>         [...]
>     } else {
>         LOG.debug("Received slot report for unknown task manager with instance id {}. Ignoring this report.", instanceId);
>         [...]
>     }
> }
> {code}
> This leads to misleading output since it appears like the slot manager received 2 separate slot reports, with the first being for a known instance, the latter for an unknown one. This cost some time as I couldn't figure out why the "latter" report was suddenly being rejected.
> I propose moving the first debug message into the non-null branch.
> [~trohrmann] WDYT?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
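The proposed change can be sketched with a simplified stand-in (this is not the real `SlotManagerImpl`; the string-based types and the in-memory log replace `InstanceID`, `SlotReport`, and `LOG.debug`): the first debug statement moves into the branch for known task managers, so an unknown instance produces exactly one "unknown task manager" message.

```java
import java.util.HashMap;
import java.util.Map;

public final class SlotReportLogging {

    private final Map<String, String> taskManagerRegistrations = new HashMap<>();
    private final StringBuilder log = new StringBuilder(); // stand-in for LOG.debug

    public void register(String instanceId) {
        taskManagerRegistrations.put(instanceId, instanceId);
    }

    public boolean reportSlotStatus(String instanceId, String slotReport) {
        String registration = taskManagerRegistrations.get(instanceId);
        if (registration != null) {
            // Log "received slot report" only once the instance is known.
            log.append("Received slot report from instance ").append(instanceId).append('\n');
            return true;
        } else {
            log.append("Received slot report for unknown task manager with instance id ")
               .append(instanceId).append(". Ignoring this report.\n");
            return false;
        }
    }

    public String logOutput() {
        return log.toString();
    }
}
```

With this ordering, a report from an unregistered instance no longer leaves the misleading "received slot report" line before the rejection.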
[jira] [Commented] (FLINK-16468) BlobClient rapid retrieval retries on failure opens too many sockets
[ https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113920#comment-17113920 ]

Gary Yao commented on FLINK-16468:
----------------------------------

Is there any news, [~longtimer]?

> BlobClient rapid retrieval retries on failure opens too many sockets
> --------------------------------------------------------------------

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Comment Edited] (FLINK-16605) Add max limitation to the total number of slots
[ https://issues.apache.org/jira/browse/FLINK-16605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113840#comment-17113840 ]

Gary Yao edited comment on FLINK-16605 at 5/22/20, 8:16 AM:
------------------------------------------------------------

[~karmagyz] Only in exceptional cases we add release notes for new features (see https://ci.apache.org/projects/flink/flink-docs-release-1.10/release-notes/flink-1.10.html). If you think it's justified, you can add a release note but it is at the discretion of the release manager to decide whether it will be included or not.

was (Author: gjy):
[~karmagyz] Only in exceptional cases we add release notes for new features (see https://ci.apache.org/projects/flink/flink-docs-release-1.10/release-notes/flink-1.10.html). If you think it's justified, you can add a release note but it is in the discretion of the release manager to decide whether it will be included or not.

> Add max limitation to the total number of slots
> -----------------------------------------------
>
>                 Key: FLINK-16605
>                 URL: https://issues.apache.org/jira/browse/FLINK-16605
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Coordination
>            Reporter: Yangze Guo
>            Assignee: Yangze Guo
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.11.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> As discussed in FLINK-15527 and FLINK-15959, we propose to add the max limit to the total number of slots. To be specific:
> - Introduce "cluster.number-of-slots.max" configuration option with default value MAX_INT
> - Make the SlotManager respect the max number of slots; when exceeded, it would not allocate resource anymore.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (FLINK-16605) Add max limitation to the total number of slots
[ https://issues.apache.org/jira/browse/FLINK-16605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113840#comment-17113840 ]

Gary Yao commented on FLINK-16605:
----------------------------------

[~karmagyz] Only in exceptional cases we add release notes for new features (see https://ci.apache.org/projects/flink/flink-docs-release-1.10/release-notes/flink-1.10.html). If you think it's justified, you can add a release note but it is at the discretion of the release manager to decide whether it will be included or not.

> Add max limitation to the total number of slots
> -----------------------------------------------

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
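The behavior the ticket proposes can be sketched as follows (hypothetical illustration, not Flink's actual `SlotManager`; only the option name "cluster.number-of-slots.max" comes from the ticket, the class and method names are invented):

```java
/**
 * Once the number of registered slots reaches the configured maximum,
 * further allocation requests are rejected instead of triggering new
 * resource requests. The default MAX_INT effectively disables the limit.
 */
public final class SlotLimiter {

    private final int maxSlots; // cluster.number-of-slots.max, default Integer.MAX_VALUE
    private int registeredSlots;

    public SlotLimiter(int maxSlots) {
        this.maxSlots = maxSlots;
    }

    /** Returns true if a new slot may still be allocated under the limit. */
    public boolean tryAllocateSlot() {
        if (registeredSlots >= maxSlots) {
            return false; // limit reached: do not allocate more resources
        }
        registeredSlots++;
        return true;
    }

    public int registeredSlots() {
        return registeredSlots;
    }
}
```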
[jira] [Closed] (FLINK-17794) Tear down installed software in reverse order in Jepsen Tests
[ https://issues.apache.org/jira/browse/FLINK-17794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary Yao closed FLINK-17794.
----------------------------
    Resolution: Fixed

1.11: 6a4714fdeff96d54db5fde5fac9b0eb355886b47
master: 2b2c574f102689b3cde9deac0bd1bcf78ad7ebc7

> Tear down installed software in reverse order in Jepsen Tests
> -------------------------------------------------------------
>
>                 Key: FLINK-17794
>                 URL: https://issues.apache.org/jira/browse/FLINK-17794
>             Project: Flink
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 1.10.1, 1.11.0
>            Reporter: Gary Yao
>            Assignee: Gary Yao
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.11.0
>
> Tear down installed software in reverse order in Jepsen tests. This mitigates the issue that sometimes YARN's NodeManager directories cannot be removed using {{rm -rf}} because Flink processes keep running and generate files after the YARN NodeManager is shut down. {{rm -r}} removes files recursively, but if files are created in the background concurrently, the command can still fail with a non-zero exit code.
> {noformat}
> sh -c \"cd /; rm -rf /opt/hadoop\"", :exit 1, :out "", :err "rm: cannot remove '/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-3587fdbb-15be-4482-94f2-338bfe6b1acc/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_5271c210329e73bd743f3227edfb3b71__27_30__uuid_02dbbf1e-d2d5-43e8-ab34-040345f96476/db': Directory not empty
> rm: cannot remove '/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-d14f2078-74ee-4b8b-aafe-4299577f214f/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_7d23c6ceabda05a587f0217e44f21301__17_30__uuid_2de2b67d-0767-4e32-99f0-ddd291460947/db': Directory not empty
> {noformat}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
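The idea behind the fix, recording installation steps and tearing them down LIFO so that software depending on something installed earlier (e.g. Flink on top of YARN) is removed first, can be sketched in Java (the Jepsen harness itself is written in Clojure; the names here are illustrative):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/**
 * Records installations on a stack and reports teardown in reverse
 * installation order, so nothing is removed while software installed
 * on top of it may still be writing into its directories.
 */
public final class TeardownOrder {

    private final Deque<String> installed = new ArrayDeque<>();

    public void install(String software) {
        installed.push(software); // most recently installed ends up on top
    }

    /** Teardown order: last installed first. */
    public List<String> teardownOrder() {
        // ArrayDeque iterates from the top of the stack (last installed) down.
        return new ArrayList<>(installed);
    }
}
```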
[jira] [Issue Comment Deleted] (FLINK-13553) KvStateServerHandlerTest.readInboundBlocking unstable on Travis
[ https://issues.apache.org/jira/browse/FLINK-13553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary Yao updated FLINK-13553:
-----------------------------
    Comment: was deleted

(was: Root cause in the new case is:
{code}
java.lang.IllegalStateException: Version Mismatch: Found 123238213, Expected: 2040641296.
	at org.apache.flink.util.Preconditions.checkState(Preconditions.java:195) ~[flink-core-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
	at org.apache.flink.queryablestate.network.messages.MessageSerializer.deserializeHeader(MessageSerializer.java:232) ~[flink-queryable-state-client-java-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
	at org.apache.flink.queryablestate.network.AbstractServerHandler.channelRead(AbstractServerHandler.java:110) [flink-queryable-state-client-java-1.11-SNAPSHOT.jar:?]
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
	at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:328) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
	at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:302) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
	at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1421) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
	at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
	at org.apache.flink.shaded.netty4.io.netty.channel.embedded.EmbeddedChannel.writeInbound(EmbeddedChannel.java:343) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
	at org.apache.flink.queryablestate.network.KvStateServerHandlerTest.testUnexpectedMessage(KvStateServerHandlerTest.java:491) [test-classes/:?]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_242]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_242]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_242]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_242]
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) [junit-4.12.jar:4.12]
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) [junit-4.12.jar:4.12]
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) [junit-4.12.jar:4.12]
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) [junit-4.12.jar:4.12]
	at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) [junit-4.12.jar:4.12]
	at org.junit.rules.RunRules.evaluate(RunRules.java:20) [junit-4.12.jar:4.12]
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) [junit-4.12.jar:4.12]
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) [junit-4.12.jar:4.12]
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) [junit-4.12.jar:4.12]
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) [junit-4.12.jar:4.12]
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) [junit-4.12.jar:4.1
[jira] [Commented] (FLINK-13553) KvStateServerHandlerTest.readInboundBlocking unstable on Travis
[ https://issues.apache.org/jira/browse/FLINK-13553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1724#comment-1724 ]

Gary Yao commented on FLINK-13553:
----------------------------------

Root cause in the new case is:
{code}
java.lang.IllegalStateException: Version Mismatch: Found 123238213, Expected: 2040641296.
	at org.apache.flink.util.Preconditions.checkState(Preconditions.java:195) ~[flink-core-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
	at org.apache.flink.queryablestate.network.messages.MessageSerializer.deserializeHeader(MessageSerializer.java:232) ~[flink-queryable-state-client-java-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
	at org.apache.flink.queryablestate.network.AbstractServerHandler.channelRead(AbstractServerHandler.java:110) [flink-queryable-state-client-java-1.11-SNAPSHOT.jar:?]
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
	at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:328) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
	at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:302) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
	at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1421) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
	at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
	at org.apache.flink.shaded.netty4.io.netty.channel.embedded.EmbeddedChannel.writeInbound(EmbeddedChannel.java:343) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
	at org.apache.flink.queryablestate.network.KvStateServerHandlerTest.testUnexpectedMessage(KvStateServerHandlerTest.java:491) [test-classes/:?]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_242]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_242]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_242]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_242]
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) [junit-4.12.jar:4.12]
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) [junit-4.12.jar:4.12]
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) [junit-4.12.jar:4.12]
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) [junit-4.12.jar:4.12]
	at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) [junit-4.12.jar:4.12]
	at org.junit.rules.RunRules.evaluate(RunRules.java:20) [junit-4.12.jar:4.12]
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) [junit-4.12.jar:4.12]
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) [junit-4.12.jar:4.12]
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) [junit-4.12.jar:4.12]
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) [junit-4.12.jar:4.12]
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.jav
[jira] [Assigned] (FLINK-17194) TPC-DS end-to-end test fails due to missing execution attempt
[ https://issues.apache.org/jira/browse/FLINK-17194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao reassigned FLINK-17194: Assignee: Gary Yao > TPC-DS end-to-end test fails due to missing execution attempt > - > > Key: FLINK-17194 > URL: https://issues.apache.org/jira/browse/FLINK-17194 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination, Tests >Affects Versions: 1.11.0 >Reporter: Chesnay Schepler >Assignee: Gary Yao >Priority: Critical > Labels: test-stability > Fix For: 1.11.0 > > > [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=7567&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5] > {code:java} > org.apache.flink.runtime.jobmaster.ExecutionGraphException: The execution > attempt d6bef26867c04f1c94903b06b60ec55f was not found. > at > org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:389) > ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (FLINK-17194) TPC-DS end-to-end test fails due to missing execution attempt
[ https://issues.apache.org/jira/browse/FLINK-17194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao reassigned FLINK-17194: Assignee: (was: Gary Yao) > TPC-DS end-to-end test fails due to missing execution attempt > - > > Key: FLINK-17194 > URL: https://issues.apache.org/jira/browse/FLINK-17194 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination, Tests >Affects Versions: 1.11.0 >Reporter: Chesnay Schepler >Priority: Critical > Labels: test-stability > Fix For: 1.11.0 > > > [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=7567&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5] > {code:java} > org.apache.flink.runtime.jobmaster.ExecutionGraphException: The execution > attempt d6bef26867c04f1c94903b06b60ec55f was not found. > at > org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:389) > ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (FLINK-13553) KvStateServerHandlerTest.readInboundBlocking unstable on Travis
[ https://issues.apache.org/jira/browse/FLINK-13553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao reassigned FLINK-13553: Assignee: Gary Yao > KvStateServerHandlerTest.readInboundBlocking unstable on Travis > --- > > Key: FLINK-13553 > URL: https://issues.apache.org/jira/browse/FLINK-13553 > Project: Flink > Issue Type: Bug > Components: Runtime / Queryable State >Affects Versions: 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Assignee: Gary Yao >Priority: Critical > Labels: pull-request-available, test-stability > Time Spent: 10m > Remaining Estimate: 0h > > The {{KvStateServerHandlerTest.readInboundBlocking}} and > {{KvStateServerHandlerTest.testQueryExecutorShutDown}} fail on Travis with a > {{TimeoutException}}. > https://api.travis-ci.org/v3/job/566420641/log.txt -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-17687) Collect TaskManager logs in Mesos Jepsen Tests
[ https://issues.apache.org/jira/browse/FLINK-17687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao closed FLINK-17687. Resolution: Fixed 1.11: 656d56e99e3c158c7252db04bc034cce77ad39ba aa2a5709309ef8149607cc6ac696cd990a8aef81 2aa45cdcc88d75837df165fcde71200d796deee7 master: be8c02e397943d668c4ff64e4c491a560136e2e1 12d662c9da2dc3e18fdd3d752ddeeb07df1f5945 ed74173c087fe879f5728b810e204bafb69bdae6 > Collect TaskManager logs in Mesos Jepsen Tests > -- > > Key: FLINK-17687 > URL: https://issues.apache.org/jira/browse/FLINK-17687 > Project: Flink > Issue Type: Bug > Components: Tests >Affects Versions: 1.11.0 >Reporter: Gary Yao >Assignee: Gary Yao >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > > TM logs are collected in standalone mode and YARN. However, for Mesos tests, > TM logs are not collected at the end of the test. We should download all log > files generated in the Mesos agent directories. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17687) Collect TaskManager logs in Mesos Jepsen Tests
[ https://issues.apache.org/jira/browse/FLINK-17687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-17687: - Description: TM logs are collected in standalone mode and YARN. However, for Mesos tests, TM logs are not collected at the end of the test. We should download all log files generated in the Mesos agent directories. > Collect TaskManager logs in Mesos Jepsen Tests > -- > > Key: FLINK-17687 > URL: https://issues.apache.org/jira/browse/FLINK-17687 > Project: Flink > Issue Type: Bug > Components: Tests >Affects Versions: 1.11.0 >Reporter: Gary Yao >Assignee: Gary Yao >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > > TM logs are collected in standalone mode and YARN. However, for Mesos tests, > TM logs are not collected at the end of the test. We should download all log > files generated in the Mesos agent directories. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17777) Make Mesos Jepsen Tests pass with Hadoop-free Flink
[ https://issues.apache.org/jira/browse/FLINK-17777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-17777: - Fix Version/s: 1.11.0 > Make Mesos Jepsen Tests pass with Hadoop-free Flink > > > Key: FLINK-17777 > URL: https://issues.apache.org/jira/browse/FLINK-17777 > Project: Flink > Issue Type: Bug > Components: Tests >Affects Versions: 1.11.0 >Reporter: Gary Yao >Assignee: Gary Yao >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > > Since FLINK-11086, we can no longer build a Flink distribution with Hadoop. > Therefore, we need to set the {{HADOOP_CLASSPATH}} environment variable for > the TM processes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-17777) Make Mesos Jepsen Tests pass with Hadoop-free Flink
[ https://issues.apache.org/jira/browse/FLINK-17777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao closed FLINK-17777. Resolution: Fixed 1.11: f46735cb4963af616c0e8538331bed8739a1d353 master: 81ffe8a271a3d4bf7867f7b8b75ffc4cc6707d85 > Make Mesos Jepsen Tests pass with Hadoop-free Flink > > > Key: FLINK-17777 > URL: https://issues.apache.org/jira/browse/FLINK-17777 > Project: Flink > Issue Type: Bug > Components: Tests >Affects Versions: 1.11.0 >Reporter: Gary Yao >Assignee: Gary Yao >Priority: Major > Labels: pull-request-available > > Since FLINK-11086, we can no longer build a Flink distribution with Hadoop. > Therefore, we need to set the {{HADOOP_CLASSPATH}} environment variable for > the TM processes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
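The fix above hinges on exporting {{HADOOP_CLASSPATH}} to the TaskManager processes on a Hadoop-free distribution. A minimal sketch of how a test harness might do that (the guard and echo are illustrative, not the Jepsen harness code):

```shell
#!/bin/sh
# Sketch only: export HADOOP_CLASSPATH before launching TaskManager processes
# on a Hadoop-free Flink distribution. `hadoop classpath` is the standard way
# to obtain Hadoop's jar paths; the guard keeps the script usable on hosts
# without a Hadoop installation.
if command -v hadoop >/dev/null 2>&1; then
  HADOOP_CLASSPATH="$(hadoop classpath)"
else
  HADOOP_CLASSPATH="${HADOOP_CLASSPATH:-}"
fi
export HADOOP_CLASSPATH
echo "HADOOP_CLASSPATH=${HADOOP_CLASSPATH}"
```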
[jira] [Updated] (FLINK-17687) Collect TaskManager logs in Mesos Jepsen Tests
[ https://issues.apache.org/jira/browse/FLINK-17687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-17687: - Issue Type: Bug (was: Improvement) > Collect TaskManager logs in Mesos Jepsen Tests > -- > > Key: FLINK-17687 > URL: https://issues.apache.org/jira/browse/FLINK-17687 > Project: Flink > Issue Type: Bug > Components: Tests >Affects Versions: 1.11.0 >Reporter: Gary Yao >Assignee: Gary Yao >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-17792) Failing to invoke jstack on TM processes should not fail Jepsen Tests
[ https://issues.apache.org/jira/browse/FLINK-17792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao closed FLINK-17792. Resolution: Fixed 1.11: d8a77cbf93007bf970963a4499aa06501c0d9808 master: 417936d8722c7b466f22bc13b9063e7298e0cbd6 > Failing to invoke jstack on TM processes should not fail Jepsen Tests > --- > > Key: FLINK-17792 > URL: https://issues.apache.org/jira/browse/FLINK-17792 > Project: Flink > Issue Type: Bug > Components: Tests >Affects Versions: 1.10.1, 1.11.0 >Reporter: Gary Yao >Assignee: Gary Yao >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > > {{jstack}} can fail if the JVM process exits prematurely before or while we > invoke {{jstack}}. If {{jstack}} fails, the exception propagates and exits > the Jepsen Tests prematurely. -- This message was sent by Atlassian Jira (v8.3.4#803005)
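The tolerant behavior described above can be sketched as a small shell helper; the function name and messages are hypothetical, not taken from the Jepsen harness:

```shell
#!/bin/sh
# Hypothetical helper: take a thread dump but never let a jstack failure
# propagate and abort the surrounding test run.
capture_thread_dump() {
  pid="$1"
  out="$2"
  if jstack "$pid" > "$out" 2>/dev/null; then
    echo "thread dump of pid $pid written to $out"
  else
    # The JVM may have exited before or while jstack ran; log and continue.
    echo "jstack failed for pid $pid; continuing"
  fi
}

# A nonexistent pid demonstrates the non-fatal failure path.
capture_thread_dump 99999999 /tmp/td.txt
```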
[jira] [Commented] (FLINK-17676) Is there some way to rollback the .out file of TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110192#comment-17110192 ] Gary Yao commented on FLINK-17676: -- [~rmetzger] I think Command Line Client is not the right component. > Is there some way to rollback the .out file of TaskManager > -- > > Key: FLINK-17676 > URL: https://issues.apache.org/jira/browse/FLINK-17676 > Project: Flink > Issue Type: Improvement > Components: Command Line Client >Reporter: JieFang.He >Priority: Major > > When using the .print() API, the results are all written to the .out file, but there is no > way to roll over the .out file. > > out in flink-daemon.sh > {code:java} > // $JAVA_RUN $JVM_ARGS ${FLINK_ENV_JAVA_OPTS} "${log_setting[@]}" -classpath > "`manglePathList "$FLINK_TM_CLASSPATH:$INTERNAL_HADOOP_CLASSPATHS"`" > ${CLASS_TO_RUN} "${ARGS[@]}" > "$out" 200<&- 2>&1 < /dev/null & > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
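Because flink-daemon.sh redirects stdout once at start-up, one common workaround (a sketch, not an official Flink feature; all paths here are assumptions) is to rotate the .out file externally with logrotate's copytruncate mode, which truncates the open file in place so the JVM's redirected file descriptor keeps working:

```shell
#!/bin/sh
# Sketch: write a logrotate stanza for TaskManager .out files.
# copytruncate copies then truncates the still-open file, so the process does
# not need to reopen its stdout; a small tail of output can be lost on rotate.
cat > /tmp/flink-out.logrotate <<'EOF'
/opt/flink/log/*.out {
    size 100M
    rotate 5
    copytruncate
    missingok
    compress
}
EOF
echo "wrote /tmp/flink-out.logrotate"
```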
[jira] [Commented] (FLINK-15813) Set default value of jobmanager.execution.failover-strategy to region
[ https://issues.apache.org/jira/browse/FLINK-15813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110130#comment-17110130 ] Gary Yao commented on FLINK-15813: -- This issue is probably only for the documentation page and doesn't require changing tests. {{FailoverStrategyFactoryLoader}} already uses "region" as the default value: {code} // the default NG failover strategy is the region failover strategy. // TODO: Remove the overridden default value when removing legacy scheduler // and change the default value of JobManagerOptions.EXECUTION_FAILOVER_STRATEGY // to be "region" final String strategyParam = config.getString( JobManagerOptions.EXECUTION_FAILOVER_STRATEGY, PIPELINED_REGION_RESTART_STRATEGY_NAME); {code} > Set default value of jobmanager.execution.failover-strategy to region > - > > Key: FLINK-15813 > URL: https://issues.apache.org/jira/browse/FLINK-15813 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.10.0 >Reporter: Till Rohrmann >Assignee: Zhu Zhu >Priority: Blocker > Labels: pull-request-available, usability > Fix For: 1.11.0 > > > We should set the default value of {{jobmanager.execution.failover-strategy}} > to {{region}}. This might require to adapt existing tests to make them pass. -- This message was sent by Atlassian Jira (v8.3.4#803005)
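Since the loader quoted above already defaults to the region strategy, the user-facing change is just one config key. A sketch of pinning it explicitly (writing to a local file here; the real file lives under Flink's conf/ directory):

```shell
#!/bin/sh
# Sketch: ensure jobmanager.execution.failover-strategy is set to "region"
# in a flink-conf.yaml, appending the key only if it is absent.
CONF="${FLINK_CONF:-./flink-conf.yaml}"
touch "$CONF"
grep -q '^jobmanager\.execution\.failover-strategy:' "$CONF" \
  || echo 'jobmanager.execution.failover-strategy: region' >> "$CONF"
grep '^jobmanager\.execution\.failover-strategy:' "$CONF"
```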
[jira] [Updated] (FLINK-16303) Add Rest Handler to list JM Logfiles and enable reading Logs by Filename
[ https://issues.apache.org/jira/browse/FLINK-16303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-16303: - Release Note: Requesting an unavailable log or stdout file from the JobManager's HTTP server returns status code 404 now. In previous releases, the HTTP server would return a file with '(file unavailable)' as its content. > Add Rest Handler to list JM Logfiles and enable reading Logs by Filename > > > Key: FLINK-16303 > URL: https://issues.apache.org/jira/browse/FLINK-16303 > Project: Flink > Issue Type: Sub-task > Components: Runtime / REST >Reporter: lining >Assignee: lining >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > Time Spent: 40m > Remaining Estimate: 0h > > * list jobmanager all log file > ** /jobmanager/logs > ** > {code:java} > { > "logs": [ > { > "name": "jobmanager.log", > "size": 12529 > } > ] > }{code} > * read jobmanager log file > ** /jobmanager/log/[filename] > ** response: same as jobmanager's log -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-16303) Add REST Log List and enable reading Logs by Filename
[ https://issues.apache.org/jira/browse/FLINK-16303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-16303: - Summary: Add REST Log List and enable reading Logs by Filename (was: Add Rest Handler to list JM Log Files and enable reading Logs by Filename) > Add REST Log List and enable reading Logs by Filename > - > > Key: FLINK-16303 > URL: https://issues.apache.org/jira/browse/FLINK-16303 > Project: Flink > Issue Type: Sub-task > Components: Runtime / REST >Reporter: lining >Assignee: lining >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > Time Spent: 40m > Remaining Estimate: 0h > > * list jobmanager all log file > ** /jobmanager/logs > ** > {code:java} > { > "logs": [ > { > "name": "jobmanager.log", > "size": 12529 > } > ] > }{code} > * read jobmanager log file > ** /jobmanager/log/[filename] > ** response: same as jobmanager's log -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-16303) Add Rest Handler to list JM Log Files and enable reading Logs by Filename
[ https://issues.apache.org/jira/browse/FLINK-16303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-16303: - Summary: Add Rest Handler to list JM Log Files and enable reading Logs by Filename (was: Add JobManager Log List and enable reading Logs by Filename) > Add Rest Handler to list JM Log Files and enable reading Logs by Filename > - > > Key: FLINK-16303 > URL: https://issues.apache.org/jira/browse/FLINK-16303 > Project: Flink > Issue Type: Sub-task > Components: Runtime / REST >Reporter: lining >Assignee: lining >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > Time Spent: 40m > Remaining Estimate: 0h > > * list jobmanager all log file > ** /jobmanager/logs > ** > {code:java} > { > "logs": [ > { > "name": "jobmanager.log", > "size": 12529 > } > ] > }{code} > * read jobmanager log file > ** /jobmanager/log/[filename] > ** response: same as jobmanager's log -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-16303) Add Rest Handler to list JM Logfiles and enable reading Logs by Filename
[ https://issues.apache.org/jira/browse/FLINK-16303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-16303: - Summary: Add Rest Handler to list JM Logfiles and enable reading Logs by Filename (was: Add REST Log List and enable reading Logs by Filename) > Add Rest Handler to list JM Logfiles and enable reading Logs by Filename > > > Key: FLINK-16303 > URL: https://issues.apache.org/jira/browse/FLINK-16303 > Project: Flink > Issue Type: Sub-task > Components: Runtime / REST >Reporter: lining >Assignee: lining >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > Time Spent: 40m > Remaining Estimate: 0h > > * list jobmanager all log file > ** /jobmanager/logs > ** > {code:java} > { > "logs": [ > { > "name": "jobmanager.log", > "size": 12529 > } > ] > }{code} > * read jobmanager log file > ** /jobmanager/log/[filename] > ** response: same as jobmanager's log -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-16303) Add JobManager Log List and enable reading Logs by Filename
[ https://issues.apache.org/jira/browse/FLINK-16303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-16303: - Summary: Add JobManager Log List and enable reading Logs by Filename (was: add log list and read log by name for jobmanager) > Add JobManager Log List and enable reading Logs by Filename > > > Key: FLINK-16303 > URL: https://issues.apache.org/jira/browse/FLINK-16303 > Project: Flink > Issue Type: Sub-task > Components: Runtime / REST >Reporter: lining >Assignee: lining >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > Time Spent: 40m > Remaining Estimate: 0h > > * list jobmanager all log file > ** /jobmanager/logs > ** > {code:java} > { > "logs": [ > { > "name": "jobmanager.log", > "size": 12529 > } > ] > }{code} > * read jobmanager log file > ** /jobmanager/log/[filename] > ** response: same as jobmanager's log -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-13987) Better TM/JM Log Display
[ https://issues.apache.org/jira/browse/FLINK-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-13987: - Summary: Better TM/JM Log Display (was: add log list and read log by name) > Better TM/JM Log Display > > > Key: FLINK-13987 > URL: https://issues.apache.org/jira/browse/FLINK-13987 > Project: Flink > Issue Type: New Feature > Components: Runtime / REST >Reporter: lining >Assignee: lining >Priority: Major > Fix For: 1.11.0 > > > As the job runs, the log files become large. > Since the application runs on a JVM, the user sometimes needs to see the > GC log, but it is not available. > Therefore, we need new APIs: > * list all TaskManager log files > ** /taskmanagers/taskmanagerid/logs > ** > {code:java} > { > "logs": [ > { > "name": "taskmanager.log", > "size": 12529 > } > ] > } {code} > * read a TaskManager log file > ** /taskmanagers/logs/[filename] > ** response: same as the TaskManager's log > * list all JobManager log files > ** /jobmanager/logs > ** > {code:java} > { > "logs": [ > { > "name": "jobmanager.log", > "size": 12529 > } > ] > }{code} > * read a JobManager log file > ** /jobmanager/logs/[filename] > ** response: same as the JobManager's log -- This message was sent by Atlassian Jira (v8.3.4#803005)
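The JobManager endpoints proposed in the issue can be exercised with plain curl once a cluster is running. A sketch follows; the host, port, and exact paths are taken from the issue text and may differ from the released API:

```shell
#!/bin/sh
# Sketch: build request URLs for the proposed log-listing endpoints.
# The HTTP calls themselves are commented out because they need a live cluster.
jm_logs_url()    { printf '%s/jobmanager/logs' "$1"; }
jm_logfile_url() { printf '%s/jobmanager/logs/%s' "$1" "$2"; }

JM="${JM:-http://localhost:8081}"
echo "list: $(jm_logs_url "$JM")"
echo "read: $(jm_logfile_url "$JM" jobmanager.log)"
# curl -s "$(jm_logs_url "$JM")"                     # JSON: {"logs":[{"name":...,"size":...}]}
# curl -s "$(jm_logfile_url "$JM" jobmanager.log)"   # raw log contents
```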
[jira] [Closed] (FLINK-13987) add log list and read log by name
[ https://issues.apache.org/jira/browse/FLINK-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao closed FLINK-13987. Resolution: Fixed > add log list and read log by name > - > > Key: FLINK-13987 > URL: https://issues.apache.org/jira/browse/FLINK-13987 > Project: Flink > Issue Type: New Feature > Components: Runtime / REST >Reporter: lining >Assignee: lining >Priority: Major > Fix For: 1.11.0 > > > As the job runs, the log files become large. > Since the application runs on a JVM, the user sometimes needs to see the > GC log, but it is not available. > Therefore, we need new APIs: > * list all TaskManager log files > ** /taskmanagers/taskmanagerid/logs > ** > {code:java} > { > "logs": [ > { > "name": "taskmanager.log", > "size": 12529 > } > ] > } {code} > * read a TaskManager log file > ** /taskmanagers/logs/[filename] > ** response: same as the TaskManager's log > * list all JobManager log files > ** /jobmanager/logs > ** > {code:java} > { > "logs": [ > { > "name": "jobmanager.log", > "size": 12529 > } > ] > }{code} > * read a JobManager log file > ** /jobmanager/logs/[filename] > ** response: same as the JobManager's log -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-13987) add log list and read log by name
[ https://issues.apache.org/jira/browse/FLINK-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-13987: - Fix Version/s: 1.11.0 > add log list and read log by name > - > > Key: FLINK-13987 > URL: https://issues.apache.org/jira/browse/FLINK-13987 > Project: Flink > Issue Type: New Feature > Components: Runtime / REST >Reporter: lining >Assignee: lining >Priority: Major > Fix For: 1.11.0 > > > As the job runs, the log files become large. > Since the application runs on a JVM, the user sometimes needs to see the > GC log, but it is not available. > Therefore, we need new APIs: > * list all TaskManager log files > ** /taskmanagers/taskmanagerid/logs > ** > {code:java} > { > "logs": [ > { > "name": "taskmanager.log", > "size": 12529 > } > ] > } {code} > * read a TaskManager log file > ** /taskmanagers/logs/[filename] > ** response: same as the TaskManager's log > * list all JobManager log files > ** /jobmanager/logs > ** > {code:java} > { > "logs": [ > { > "name": "jobmanager.log", > "size": 12529 > } > ] > }{code} > * read a JobManager log file > ** /jobmanager/logs/[filename] > ** response: same as the JobManager's log -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-16863) Sorting descendingly on the last modified date of LogInfo
[ https://issues.apache.org/jira/browse/FLINK-16863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-16863: - Parent: (was: FLINK-13987) Issue Type: Improvement (was: Sub-task) > Sorting descendingly on the last modified date of LogInfo > - > > Key: FLINK-16863 > URL: https://issues.apache.org/jira/browse/FLINK-16863 > Project: Flink > Issue Type: Improvement > Components: Runtime / REST >Affects Versions: 1.11.0 >Reporter: lining >Priority: Major > > Sorting in descending order by the last modified date would allow a user to see > the most recent files first. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-16863) Sorting descendingly on the last modified date of LogInfo
[ https://issues.apache.org/jira/browse/FLINK-16863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-16863: - Affects Version/s: 1.11.0 > Sorting descendingly on the last modified date of LogInfo > - > > Key: FLINK-16863 > URL: https://issues.apache.org/jira/browse/FLINK-16863 > Project: Flink > Issue Type: Sub-task > Components: Runtime / REST >Affects Versions: 1.11.0 >Reporter: lining >Priority: Major > > Sorting in descending order by the last modified date would allow a user to see > the most recent files first. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17794) Tear down installed software in reverse order in Jepsen Tests
[ https://issues.apache.org/jira/browse/FLINK-17794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-17794: - Description: Tear down installed software in reverse order in Jepsen tests. This mitigates the issue that sometimes YARN's NodeManager directories cannot be removed using {{rm -rf}} because Flink processes keep running and generate files after the YARN NodeManager is shut down. {{rm -r}} removes files recursively but if files are created in the background concurrently, the command can still fail with a non-zero exit code. {noformat} sh -c \"cd /; rm -rf /opt/hadoop\"", :exit 1, :out "", :err "rm: cannot remove '/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-3587fdbb-15be-4482-94f2-338bfe6b1acc/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_5271c210329e73bd743f3227edfb3b71__27_30__uuid_02dbbf1e-d2d5-43e8-ab34-040345f96476/db': Directory not empty\nrm: cannot remove '/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-d14f2078-74ee-4b8b-aafe-4299577f214f/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_7d23c6ceabda05a587f0217e44f21301__17_30__uuid_2de2b67d-0767-4e32-99f0-ddd291460947/db': Directory not empty {noformat} was: Tear down installed software in reverse order in Jepsen Tests. This mitigates the issue that sometimes YARN's NodeManager directories cannot be removed using {{rm -rf}} because Flink processes keep running and generate files after the YARN NodeManager is shut down. {{rm -r}} removes files recursively but if files are created in the background concurrently, the command can still fail with a non-zero exit code. 
{noformat} sh -c \"cd /; rm -rf /opt/hadoop\"", :exit 1, :out "", :err "rm: cannot remove '/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-3587fdbb-15be-4482-94f2-338bfe6b1acc/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_5271c210329e73bd743f3227edfb3b71__27_30__uuid_02dbbf1e-d2d5-43e8-ab34-040345f96476/db': Directory not empty\nrm: cannot remove '/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-d14f2078-74ee-4b8b-aafe-4299577f214f/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_7d23c6ceabda05a587f0217e44f21301__17_30__uuid_2de2b67d-0767-4e32-99f0-ddd291460947/db': Directory not empty {noformat} > Tear down installed software in reverse order in Jepsen Tests > - > > Key: FLINK-17794 > URL: https://issues.apache.org/jira/browse/FLINK-17794 > Project: Flink > Issue Type: Bug > Components: Tests >Affects Versions: 1.10.1, 1.11.0 >Reporter: Gary Yao >Assignee: Gary Yao >Priority: Major > Fix For: 1.11.0 > > > Tear down installed software in reverse order in Jepsen tests. This mitigates > the issue that sometimes YARN's NodeManager directories cannot be removed > using {{rm -rf}} because Flink processes keep running and generate files > after the YARN NodeManager is shut down. {{rm -r}} removes files recursively > but if files are created in the background concurrently, the command can > still fail with a non-zero exit code. 
> {noformat} > sh -c \"cd /; rm -rf /opt/hadoop\"", :exit 1, :out "", :err "rm: cannot > remove > '/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-3587fdbb-15be-4482-94f2-338bfe6b1acc/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_5271c210329e73bd743f3227edfb3b71__27_30__uuid_02dbbf1e-d2d5-43e8-ab34-040345f96476/db': > Directory not empty\nrm: cannot remove > '/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-d14f2078-74ee-4b8b-aafe-4299577f214f/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_7d23c6ceabda05a587f0217e44f21301__17_30__uuid_2de2b67d-0767-4e32-99f0-ddd291460947/db': > Directory not empty > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17794) Tear down installed software in reverse order in Jepsen Tests
[ https://issues.apache.org/jira/browse/FLINK-17794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-17794: - Description: Tear down installed software in reverse order in Jepsen Tests. This mitigates the issue that sometimes YARN's NodeManager directories cannot be removed using {{rm -rf}} because Flink processes keep running and generate files after the YARN NodeManager is shut down. {{rm -r}} removes files recursively but if files are created in the background concurrently, the command can still fail with a non-zero exit code. {noformat} sh -c \"cd /; rm -rf /opt/hadoop\"", :exit 1, :out "", :err "rm: cannot remove '/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-3587fdbb-15be-4482-94f2-338bfe6b1acc/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_5271c210329e73bd743f3227edfb3b71__27_30__uuid_02dbbf1e-d2d5-43e8-ab34-040345f96476/db': Directory not empty\nrm: cannot remove '/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-d14f2078-74ee-4b8b-aafe-4299577f214f/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_7d23c6ceabda05a587f0217e44f21301__17_30__uuid_2de2b67d-0767-4e32-99f0-ddd291460947/db': Directory not empty {noformat} was: Tear down installed software in reverse order in Jepsen Tests. This mitigates the issue that sometimes hadoop's node manager directories cannot be removed using {{rm -rf}} because Flink processes keep running and generate files after the YARN NodeManager is shut down. {{rm -r}} removes files recursively but if files are created in the background concurrently, the command can still fail with a non-zero exit code. 
{noformat} sh -c \"cd /; rm -rf /opt/hadoop\"", :exit 1, :out "", :err "rm: cannot remove '/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-3587fdbb-15be-4482-94f2-338bfe6b1acc/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_5271c210329e73bd743f3227edfb3b71__27_30__uuid_02dbbf1e-d2d5-43e8-ab34-040345f96476/db': Directory not empty\nrm: cannot remove '/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-d14f2078-74ee-4b8b-aafe-4299577f214f/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_7d23c6ceabda05a587f0217e44f21301__17_30__uuid_2de2b67d-0767-4e32-99f0-ddd291460947/db': Directory not empty {noformat} > Tear down installed software in reverse order in Jepsen Tests > - > > Key: FLINK-17794 > URL: https://issues.apache.org/jira/browse/FLINK-17794 > Project: Flink > Issue Type: Bug > Components: Tests >Affects Versions: 1.10.1, 1.11.0 >Reporter: Gary Yao >Assignee: Gary Yao >Priority: Major > Fix For: 1.11.0 > > > Tear down installed software in reverse order in Jepsen Tests. This mitigates > the issue that sometimes YARN's NodeManager directories cannot be removed > using {{rm -rf}} because Flink processes keep running and generate files > after the YARN NodeManager is shut down. {{rm -r}} removes files recursively > but if files are created in the background concurrently, the command can > still fail with a non-zero exit code. 
> {noformat} > sh -c \"cd /; rm -rf /opt/hadoop\"", :exit 1, :out "", :err "rm: cannot > remove > '/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-3587fdbb-15be-4482-94f2-338bfe6b1acc/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_5271c210329e73bd743f3227edfb3b71__27_30__uuid_02dbbf1e-d2d5-43e8-ab34-040345f96476/db': > Directory not empty\nrm: cannot remove > '/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-d14f2078-74ee-4b8b-aafe-4299577f214f/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_7d23c6ceabda05a587f0217e44f21301__17_30__uuid_2de2b67d-0767-4e32-99f0-ddd291460947/db': > Directory not empty > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
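The failure mode above (concurrent file creation making {{rm -rf}} exit non-zero) can also be mitigated with a bounded retry. The sketch below is illustrative only and is not the flink-jepsen code, which fixes the problem by tearing down software in reverse order:

```shell
#!/bin/sh
# Illustrative sketch (not the flink-jepsen implementation): retry
# `rm -rf` a few times, since it can exit non-zero when files are
# created concurrently inside the tree it is removing.
remove_dir_with_retry() {
  dir=$1
  attempts=${2:-3}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # succeed as soon as one attempt removes the tree completely
    rm -rf "$dir" 2>/dev/null && return 0
    i=$((i + 1))
  done
  rm -rf "$dir"  # final attempt; let its exit code propagate
}
```

Stopping the file-producing processes first (here, the Flink TaskManagers) remains the real fix; the retry only narrows the race window.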
[jira] [Created] (FLINK-17794) Tear down installed software in reverse order in Jepsen Tests
Gary Yao created FLINK-17794: Summary: Tear down installed software in reverse order in Jepsen Tests Key: FLINK-17794 URL: https://issues.apache.org/jira/browse/FLINK-17794 Project: Flink Issue Type: Bug Components: Tests Affects Versions: 1.10.1, 1.11.0 Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.11.0 Tear down installed software in reverse order in Jepsen Tests. This mitigates the issue that sometimes hadoop's node manager directories cannot be removed using {{rm -rf}} because Flink processes keep running and generate files after the YARN NodeManager is shut down. {{rm -r}} removes files recursively but if files are created in the background concurrently, the command can still fail with a non-zero exit code. {noformat} sh -c \"cd /; rm -rf /opt/hadoop\"", :exit 1, :out "", :err "rm: cannot remove '/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-3587fdbb-15be-4482-94f2-338bfe6b1acc/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_5271c210329e73bd743f3227edfb3b71__27_30__uuid_02dbbf1e-d2d5-43e8-ab34-040345f96476/db': Directory not empty\nrm: cannot remove '/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-d14f2078-74ee-4b8b-aafe-4299577f214f/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_7d23c6ceabda05a587f0217e44f21301__17_30__uuid_2de2b67d-0767-4e32-99f0-ddd291460947/db': Directory not empty {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17792) Failure to invoke jstack on TM processes should not fail Jepsen Tests
Gary Yao created FLINK-17792: Summary: Failure to invoke jstack on TM processes should not fail Jepsen Tests Key: FLINK-17792 URL: https://issues.apache.org/jira/browse/FLINK-17792 Project: Flink Issue Type: Bug Components: Tests Affects Versions: 1.10.1, 1.11.0 Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.11.0 {{jstack}} can fail if the JVM process exits prematurely, before or while we invoke {{jstack}}. If {{jstack}} fails, the exception propagates and exits the Jepsen Tests prematurely. -- This message was sent by Atlassian Jira (v8.3.4#803005)
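The intended behavior can be mirrored by a tolerant wrapper. The actual fix lives in the Clojure test harness; this shell sketch (function name and arguments are made up for illustration) only shows the idea of never propagating a {{jstack}} failure:

```shell
#!/bin/sh
# Sketch (names are illustrative): take a thread dump if possible, but
# never let a failed jstack invocation fail the surrounding test run.
dump_threads_best_effort() {
  pid=$1
  out=$2
  if jstack "$pid" > "$out" 2>/dev/null; then
    echo "thread dump written to $out"
  else
    # the JVM may have exited before or while jstack ran; just warn
    echo "WARN: jstack failed for pid $pid" >&2
  fi
  return 0
}
```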
[jira] [Reopened] (FLINK-17595) JobExceptionsInfo. ExecutionExceptionInfo miss getter method
[ https://issues.apache.org/jira/browse/FLINK-17595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao reopened FLINK-17595: -- > JobExceptionsInfo. ExecutionExceptionInfo miss getter method > > > Key: FLINK-17595 > URL: https://issues.apache.org/jira/browse/FLINK-17595 > Project: Flink > Issue Type: Bug > Components: Runtime / REST >Affects Versions: 1.10.0 >Reporter: Wei Zhang >Priority: Minor > Fix For: 1.11.0 > > > {code:java} > public static final class ExecutionExceptionInfo { > public static final String FIELD_NAME_EXCEPTION = "exception"; > public static final String FIELD_NAME_TASK = "task"; > public static final String FIELD_NAME_LOCATION = "location"; > public static final String FIELD_NAME_TIMESTAMP = "timestamp"; > @JsonProperty(FIELD_NAME_EXCEPTION) > private final String exception; > @JsonProperty(FIELD_NAME_TASK) > private final String task; > @JsonProperty(FIELD_NAME_LOCATION) > private final String location; > @JsonProperty(FIELD_NAME_TIMESTAMP) > private final long timestamp; > @JsonCreator > public ExecutionExceptionInfo( > @JsonProperty(FIELD_NAME_EXCEPTION) String exception, > @JsonProperty(FIELD_NAME_TASK) String task, > @JsonProperty(FIELD_NAME_LOCATION) String location, > @JsonProperty(FIELD_NAME_TIMESTAMP) long timestamp) { > this.exception = Preconditions.checkNotNull(exception); > this.task = Preconditions.checkNotNull(task); > this.location = Preconditions.checkNotNull(location); > this.timestamp = timestamp; > } > @Override > public boolean equals(Object o) { > if (this == o) { > return true; > } > if (o == null || getClass() != o.getClass()) { > return false; > } > JobExceptionsInfo.ExecutionExceptionInfo that = > (JobExceptionsInfo.ExecutionExceptionInfo) o; > return timestamp == that.timestamp && > Objects.equals(exception, that.exception) && > Objects.equals(task, that.task) && > Objects.equals(location, that.location); > } > @Override > public int hashCode() { > return Objects.hash(timestamp, exception, 
task, > location); > } > {code} > I found jobexceptionsinfo.executionexceptioninfo has no getter method for the > field, is it missing? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-17595) JobExceptionsInfo. ExecutionExceptionInfo miss getter method
[ https://issues.apache.org/jira/browse/FLINK-17595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao closed FLINK-17595. Resolution: Won't Do > JobExceptionsInfo. ExecutionExceptionInfo miss getter method > > > Key: FLINK-17595 > URL: https://issues.apache.org/jira/browse/FLINK-17595 > Project: Flink > Issue Type: Bug > Components: Runtime / REST >Affects Versions: 1.10.0 >Reporter: Wei Zhang >Priority: Minor > Fix For: 1.11.0 > > > {code:java} > public static final class ExecutionExceptionInfo { > public static final String FIELD_NAME_EXCEPTION = "exception"; > public static final String FIELD_NAME_TASK = "task"; > public static final String FIELD_NAME_LOCATION = "location"; > public static final String FIELD_NAME_TIMESTAMP = "timestamp"; > @JsonProperty(FIELD_NAME_EXCEPTION) > private final String exception; > @JsonProperty(FIELD_NAME_TASK) > private final String task; > @JsonProperty(FIELD_NAME_LOCATION) > private final String location; > @JsonProperty(FIELD_NAME_TIMESTAMP) > private final long timestamp; > @JsonCreator > public ExecutionExceptionInfo( > @JsonProperty(FIELD_NAME_EXCEPTION) String exception, > @JsonProperty(FIELD_NAME_TASK) String task, > @JsonProperty(FIELD_NAME_LOCATION) String location, > @JsonProperty(FIELD_NAME_TIMESTAMP) long timestamp) { > this.exception = Preconditions.checkNotNull(exception); > this.task = Preconditions.checkNotNull(task); > this.location = Preconditions.checkNotNull(location); > this.timestamp = timestamp; > } > @Override > public boolean equals(Object o) { > if (this == o) { > return true; > } > if (o == null || getClass() != o.getClass()) { > return false; > } > JobExceptionsInfo.ExecutionExceptionInfo that = > (JobExceptionsInfo.ExecutionExceptionInfo) o; > return timestamp == that.timestamp && > Objects.equals(exception, that.exception) && > Objects.equals(task, that.task) && > Objects.equals(location, that.location); > } > @Override > public int hashCode() { > return 
Objects.hash(timestamp, exception, task, > location); > } > {code} > I found jobexceptionsinfo.executionexceptioninfo has no getter method for the > field, is it missing? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17595) JobExceptionsInfo. ExecutionExceptionInfo miss getter method
[ https://issues.apache.org/jira/browse/FLINK-17595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-17595: - Fix Version/s: (was: 1.11.0) > JobExceptionsInfo. ExecutionExceptionInfo miss getter method > > > Key: FLINK-17595 > URL: https://issues.apache.org/jira/browse/FLINK-17595 > Project: Flink > Issue Type: Bug > Components: Runtime / REST >Affects Versions: 1.10.0 >Reporter: Wei Zhang >Priority: Minor > > {code:java} > public static final class ExecutionExceptionInfo { > public static final String FIELD_NAME_EXCEPTION = "exception"; > public static final String FIELD_NAME_TASK = "task"; > public static final String FIELD_NAME_LOCATION = "location"; > public static final String FIELD_NAME_TIMESTAMP = "timestamp"; > @JsonProperty(FIELD_NAME_EXCEPTION) > private final String exception; > @JsonProperty(FIELD_NAME_TASK) > private final String task; > @JsonProperty(FIELD_NAME_LOCATION) > private final String location; > @JsonProperty(FIELD_NAME_TIMESTAMP) > private final long timestamp; > @JsonCreator > public ExecutionExceptionInfo( > @JsonProperty(FIELD_NAME_EXCEPTION) String exception, > @JsonProperty(FIELD_NAME_TASK) String task, > @JsonProperty(FIELD_NAME_LOCATION) String location, > @JsonProperty(FIELD_NAME_TIMESTAMP) long timestamp) { > this.exception = Preconditions.checkNotNull(exception); > this.task = Preconditions.checkNotNull(task); > this.location = Preconditions.checkNotNull(location); > this.timestamp = timestamp; > } > @Override > public boolean equals(Object o) { > if (this == o) { > return true; > } > if (o == null || getClass() != o.getClass()) { > return false; > } > JobExceptionsInfo.ExecutionExceptionInfo that = > (JobExceptionsInfo.ExecutionExceptionInfo) o; > return timestamp == that.timestamp && > Objects.equals(exception, that.exception) && > Objects.equals(task, that.task) && > Objects.equals(location, that.location); > } > @Override > public int hashCode() { > return Objects.hash(timestamp, 
exception, task, > location); > } > {code} > I found jobexceptionsinfo.executionexceptioninfo has no getter method for the > field, is it missing? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17777) Make Mesos Jepsen Tests pass with Hadoop-free Flink
[ https://issues.apache.org/jira/browse/FLINK-17777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-17777: - Description: Since FLINK-11086, we can no longer build a Flink distribution with Hadoop. Therefore, we need to set the {{HADOOP_CLASSPATH}} environment variable for the TM processes. > Make Mesos Jepsen Tests pass with Hadoop-free Flink > > > Key: FLINK-17777 > URL: https://issues.apache.org/jira/browse/FLINK-17777 > Project: Flink > Issue Type: Bug > Components: Tests >Affects Versions: 1.11.0 >Reporter: Gary Yao >Assignee: Gary Yao >Priority: Major > > Since FLINK-11086, we can no longer build a Flink distribution with Hadoop. > Therefore, we need to set the {{HADOOP_CLASSPATH}} environment variable for > the TM processes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
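Providing Hadoop to a Hadoop-free distribution usually means exporting {{HADOOP_CLASSPATH}} before starting the processes. A sketch; the guard on a missing {{hadoop}} binary is an assumption about the test environment, not part of any Flink script:

```shell
#!/bin/sh
# Sketch: export HADOOP_CLASSPATH so a Hadoop-free Flink distribution
# picks up Hadoop dependencies at runtime. Succeeding as a no-op when
# no `hadoop` binary is installed is an illustrative assumption.
setup_hadoop_classpath() {
  if command -v hadoop >/dev/null 2>&1; then
    HADOOP_CLASSPATH=$(hadoop classpath)
    export HADOOP_CLASSPATH
  fi
  return 0
}
```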
[jira] [Updated] (FLINK-17522) Document flink-jepsen Command Line Options
[ https://issues.apache.org/jira/browse/FLINK-17522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-17522: - Description: Document command line options that can be passed to {{lein run test}}. > Document flink-jepsen Command Line Options > -- > > Key: FLINK-17522 > URL: https://issues.apache.org/jira/browse/FLINK-17522 > Project: Flink > Issue Type: Improvement > Components: Tests >Reporter: Gary Yao >Assignee: Gary Yao >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > > Document command line options that can be passed to {{lein run test}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-17522) Document flink-jepsen Command Line Options
[ https://issues.apache.org/jira/browse/FLINK-17522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao closed FLINK-17522. Resolution: Fixed master: b606cbaf36f9e206f44243dfdc7e8005e92d2d66 > Document flink-jepsen Command Line Options > -- > > Key: FLINK-17522 > URL: https://issues.apache.org/jira/browse/FLINK-17522 > Project: Flink > Issue Type: Improvement > Components: Tests >Reporter: Gary Yao >Assignee: Gary Yao >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > > Document command line options that can be passed to {{lein run test}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17522) Document flink-jepsen Command Line Options
[ https://issues.apache.org/jira/browse/FLINK-17522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-17522: - Fix Version/s: 1.11.0 > Document flink-jepsen Command Line Options > -- > > Key: FLINK-17522 > URL: https://issues.apache.org/jira/browse/FLINK-17522 > Project: Flink > Issue Type: Improvement > Components: Tests >Reporter: Gary Yao >Assignee: Gary Yao >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17777) Make Mesos Jepsen Tests pass with Hadoop-free Flink
Gary Yao created FLINK-17777: Summary: Make Mesos Jepsen Tests pass with Hadoop-free Flink Key: FLINK-17777 URL: https://issues.apache.org/jira/browse/FLINK-17777 Project: Flink Issue Type: Bug Components: Tests Affects Versions: 1.11.0 Reporter: Gary Yao Assignee: Gary Yao -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17676) Is there some way to rollback the .out file of TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107277#comment-17107277 ] Gary Yao commented on FLINK-17676: -- [~hejiefang] Can you use [logrotate|https://linux.die.net/man/8/logrotate]? > Is there some way to rollback the .out file of TaskManager > -- > > Key: FLINK-17676 > URL: https://issues.apache.org/jira/browse/FLINK-17676 > Project: Flink > Issue Type: Improvement >Reporter: JieFang.He >Priority: Major > > When use .print() API, the result all write to the out file, But there is no > way to rollback the out file. > > out in flink-daemon.sh > {code:java} > // $JAVA_RUN $JVM_ARGS ${FLINK_ENV_JAVA_OPTS} "${log_setting[@]}" -classpath > "`manglePathList "$FLINK_TM_CLASSPATH:$INTERNAL_HADOOP_CLASSPATHS"`" > ${CLASS_TO_RUN} "${ARGS[@]}" > "$out" 200<&- 2>&1 < /dev/null & > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
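Since the TaskManager keeps the .out file descriptor open, a copytruncate-style logrotate rule is the usual fit for the suggestion above. The path and limits below are illustrative assumptions, not configuration shipped with Flink:

```
# /etc/logrotate.d/flink-taskmanager-out (illustrative path and limits)
/opt/flink/log/*.out {
    size 100M
    rotate 5
    compress
    missingok
    copytruncate
}
```

Without {{copytruncate}}, the process would keep writing to the rotated (renamed) file through its open descriptor.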
[jira] [Created] (FLINK-17687) Collect TaskManager logs in Mesos Jepsen Tests
Gary Yao created FLINK-17687: Summary: Collect TaskManager logs in Mesos Jepsen Tests Key: FLINK-17687 URL: https://issues.apache.org/jira/browse/FLINK-17687 Project: Flink Issue Type: Improvement Components: Tests Affects Versions: 1.11.0 Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.11.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17676) Is there some way to rollback the .out file of TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107033#comment-17107033 ] Gary Yao commented on FLINK-17676: -- You are writing "rollback" but I suspect you mean to roll the .out file, i.e., close and archive the current .out file once it exceeds a certain size, and start writing into a new one. > Is there some way to rollback the .out file of TaskManager > -- > > Key: FLINK-17676 > URL: https://issues.apache.org/jira/browse/FLINK-17676 > Project: Flink > Issue Type: Improvement >Reporter: JieFang.He >Priority: Major > > When use .print() API, the result all write to the out file, But there is no > way to rollback the out file. > > out in flink-daemon.sh > {code:java} > // $JAVA_RUN $JVM_ARGS ${FLINK_ENV_JAVA_OPTS} "${log_setting[@]}" -classpath > "`manglePathList "$FLINK_TM_CLASSPATH:$INTERNAL_HADOOP_CLASSPATHS"`" > ${CLASS_TO_RUN} "${ARGS[@]}" > "$out" 200<&- 2>&1 < /dev/null & > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-17616) Temporarily increase akka.ask.timeout in TPC-DS e2e test
[ https://issues.apache.org/jira/browse/FLINK-17616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao closed FLINK-17616. Resolution: Fixed master: 56ea8d15dee58d2a79b6b9c646a8bfb2cb9f0c23 > Temporarily increase akka.ask.timeout in TPC-DS e2e test > > > Key: FLINK-17616 > URL: https://issues.apache.org/jira/browse/FLINK-17616 > Project: Flink > Issue Type: Task > Components: Runtime / Coordination, Tests >Affects Versions: 1.11.0 >Reporter: Gary Yao >Assignee: Gary Yao >Priority: Critical > Labels: pull-request-available > Fix For: 1.11.0 > > > Until FLINK-17558 is fixed, we should increase the akka.ask.timeout in the > e2e test to mitigate FLINK-17194 -- This message was sent by Atlassian Jira (v8.3.4#803005)
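The mitigation amounts to a one-line change in the test's flink-conf.yaml. The value shown is illustrative (the default is 10 s); the committed change may use a different value:

```
# flink-conf.yaml -- illustrative value; the committed change may differ
akka.ask.timeout: 60 s
```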
[jira] [Commented] (FLINK-17595) JobExceptionsInfo. ExecutionExceptionInfo miss getter method
[ https://issues.apache.org/jira/browse/FLINK-17595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17105322#comment-17105322 ] Gary Yao commented on FLINK-17595: -- The REST API, i.e., the specification of the JSON requests and responses, is a public API and guaranteed to be stable. However, the Java classes used to implement the REST API are not part of the public API (not annotated with {{@Public}} or {{@PublicEvolving}}). Even if we added a getter to {{ExecutionExceptionInfo}}, we would not guarantee that the class won't be renamed or moved to a different package in a next Flink release. Moreover, the {{RestClusterClient}} is not part of the public API; {{RestClusterClient#sendRequest()}} is even annotated with {{@VisibleForTesting}}. The problem seems to be that in the Flink project there is not yet a reference client implementation to programmatically interact with the REST API. All in all, I am currently against adding a getter to {{ExecutionExceptionInfo}}. What you can do in the meantime as a workaround is: * Access the required field by reflection * Copy the class and add a getter yourself * Implement your own client from scratch > JobExceptionsInfo. 
ExecutionExceptionInfo miss getter method > > > Key: FLINK-17595 > URL: https://issues.apache.org/jira/browse/FLINK-17595 > Project: Flink > Issue Type: Bug > Components: Runtime / REST >Affects Versions: 1.10.0 >Reporter: Wei Zhang >Priority: Minor > Fix For: 1.11.0 > > > {code:java} > public static final class ExecutionExceptionInfo { > public static final String FIELD_NAME_EXCEPTION = "exception"; > public static final String FIELD_NAME_TASK = "task"; > public static final String FIELD_NAME_LOCATION = "location"; > public static final String FIELD_NAME_TIMESTAMP = "timestamp"; > @JsonProperty(FIELD_NAME_EXCEPTION) > private final String exception; > @JsonProperty(FIELD_NAME_TASK) > private final String task; > @JsonProperty(FIELD_NAME_LOCATION) > private final String location; > @JsonProperty(FIELD_NAME_TIMESTAMP) > private final long timestamp; > @JsonCreator > public ExecutionExceptionInfo( > @JsonProperty(FIELD_NAME_EXCEPTION) String exception, > @JsonProperty(FIELD_NAME_TASK) String task, > @JsonProperty(FIELD_NAME_LOCATION) String location, > @JsonProperty(FIELD_NAME_TIMESTAMP) long timestamp) { > this.exception = Preconditions.checkNotNull(exception); > this.task = Preconditions.checkNotNull(task); > this.location = Preconditions.checkNotNull(location); > this.timestamp = timestamp; > } > @Override > public boolean equals(Object o) { > if (this == o) { > return true; > } > if (o == null || getClass() != o.getClass()) { > return false; > } > JobExceptionsInfo.ExecutionExceptionInfo that = > (JobExceptionsInfo.ExecutionExceptionInfo) o; > return timestamp == that.timestamp && > Objects.equals(exception, that.exception) && > Objects.equals(task, that.task) && > Objects.equals(location, that.location); > } > @Override > public int hashCode() { > return Objects.hash(timestamp, exception, task, > location); > } > {code} > I found jobexceptionsinfo.executionexceptioninfo has no getter method for the > field, is it missing? 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17536) Change the config option of slot max limitation to "slotmanager.number-of-slots.max"
[ https://issues.apache.org/jira/browse/FLINK-17536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-17536: - Affects Version/s: 1.11.0 > Change the config option of slot max limitation to > "slotmanager.number-of-slots.max" > > > Key: FLINK-17536 > URL: https://issues.apache.org/jira/browse/FLINK-17536 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Configuration >Affects Versions: 1.11.0 >Reporter: Yangze Guo >Assignee: Yangze Guo >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-17536) Change the config option of slot max limitation to "slotmanager.number-of-slots.max"
[ https://issues.apache.org/jira/browse/FLINK-17536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao closed FLINK-17536. Resolution: Fixed master: 9888a87495827b619f5b49dae5ad29a34931d0a9 > Change the config option of slot max limitation to > "slotmanager.number-of-slots.max" > > > Key: FLINK-17536 > URL: https://issues.apache.org/jira/browse/FLINK-17536 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Configuration >Affects Versions: 1.11.0 >Reporter: Yangze Guo >Assignee: Yangze Guo >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
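After the rename above, capping the total number of slots registered with the SlotManager looks like this in flink-conf.yaml; the numeric value is illustrative:

```
# flink-conf.yaml -- illustrative value
slotmanager.number-of-slots.max: 100
```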
[jira] [Closed] (FLINK-17130) Web UI: Enable listing JM Logs and displaying Logs by Filename
[ https://issues.apache.org/jira/browse/FLINK-17130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao closed FLINK-17130. Resolution: Fixed master: 74b850cd4c6aeac5dd0d20852677d642c9703970 > Web UI: Enable listing JM Logs and displaying Logs by Filename > -- > > Key: FLINK-17130 > URL: https://issues.apache.org/jira/browse/FLINK-17130 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Web Frontend >Affects Versions: 1.11.0 >Reporter: Yadong Xie >Assignee: Yadong Xie >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > Time Spent: 10m > Remaining Estimate: 0h > > add log list and read log by name for jobmanager in the web -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-17608) Add TM log and stdout page back
[ https://issues.apache.org/jira/browse/FLINK-17608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao closed FLINK-17608. Resolution: Fixed master: 7cfcd33e983c6e07eedf8c0d5514450a565710ff > Add TM log and stdout page back > --- > > Key: FLINK-17608 > URL: https://issues.apache.org/jira/browse/FLINK-17608 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Web Frontend >Affects Versions: 1.11.0 >Reporter: Yadong Xie >Assignee: Yadong Xie >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > > According to the discussion in > [https://github.com/apache/flink/pull/11731#issuecomment-620048458] > TM log and stdout page should be added in order not to break the previous > user experience. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17608) Add TM log and stdout page back
[ https://issues.apache.org/jira/browse/FLINK-17608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-17608: - Description: According to the discussion in [https://github.com/apache/flink/pull/11731#issuecomment-620048458] TM log and stdout page should be added in order not to break the previous user experience. was: According to the discussion in [https://github.com/apache/flink/pull/11731#issuecomment-620048458] TM log and stdout page should be added in order not to break the previous user exp > Add TM log and stdout page back > --- > > Key: FLINK-17608 > URL: https://issues.apache.org/jira/browse/FLINK-17608 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Web Frontend >Affects Versions: 1.11.0 >Reporter: Yadong Xie >Assignee: Yadong Xie >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > > According to the discussion in > [https://github.com/apache/flink/pull/11731#issuecomment-620048458] > TM log and stdout page should be added in order not to break the previous > user experience. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17621) Use default akka.ask.timeout in TPC-DS e2e test
Gary Yao created FLINK-17621: Summary: Use default akka.ask.timeout in TPC-DS e2e test Key: FLINK-17621 URL: https://issues.apache.org/jira/browse/FLINK-17621 Project: Flink Issue Type: Task Components: Runtime / Coordination, Tests Affects Versions: 1.11.0 Reporter: Gary Yao Revert the changes in FLINK-17616 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17536) Change the config option of slot max limitation to "slotmanager.number-of-slots.max"
[ https://issues.apache.org/jira/browse/FLINK-17536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17104372#comment-17104372 ] Gary Yao commented on FLINK-17536: -- Sorry, missed your message. I have assigned you now. > Change the config option of slot max limitation to > "slotmanager.number-of-slots.max" > > > Key: FLINK-17536 > URL: https://issues.apache.org/jira/browse/FLINK-17536 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Configuration >Reporter: Yangze Guo >Assignee: Yangze Guo >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (FLINK-17536) Change the config option of slot max limitation to "slotmanager.number-of-slots.max"
[ https://issues.apache.org/jira/browse/FLINK-17536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao reassigned FLINK-17536: Assignee: Yangze Guo (was: Gary Yao) > Change the config option of slot max limitation to > "slotmanager.number-of-slots.max" > > > Key: FLINK-17536 > URL: https://issues.apache.org/jira/browse/FLINK-17536 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Configuration >Reporter: Yangze Guo >Assignee: Yangze Guo >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (FLINK-17536) Change the config option of slot max limitation to "slotmanager.number-of-slots.max"
[ https://issues.apache.org/jira/browse/FLINK-17536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao reassigned FLINK-17536: Assignee: Gary Yao > Change the config option of slot max limitation to > "slotmanager.number-of-slots.max" > > > Key: FLINK-17536 > URL: https://issues.apache.org/jira/browse/FLINK-17536 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Configuration >Reporter: Yangze Guo >Assignee: Gary Yao >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17616) Temporarily increase akka.ask.timeout in TPC-DS e2e test
[ https://issues.apache.org/jira/browse/FLINK-17616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-17616: - Description: Until FLINK-17558 is fixed, we should > Temporarily increase akka.ask.timeout in TPC-DS e2e test > > > Key: FLINK-17616 > URL: https://issues.apache.org/jira/browse/FLINK-17616 > Project: Flink > Issue Type: Task > Components: Runtime / Coordination, Tests >Affects Versions: 1.11.0 >Reporter: Gary Yao >Assignee: Gary Yao >Priority: Critical > Fix For: 1.11.0 > > > Until FLINK-17558 is fixed, we should -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17616) Temporarily increase akka.ask.timeout in TPC-DS e2e test
[ https://issues.apache.org/jira/browse/FLINK-17616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-17616: - Priority: Critical (was: Major) > Temporarily increase akka.ask.timeout in TPC-DS e2e test > > > Key: FLINK-17616 > URL: https://issues.apache.org/jira/browse/FLINK-17616 > Project: Flink > Issue Type: Task > Components: Runtime / Coordination, Tests >Affects Versions: 1.11.0 >Reporter: Gary Yao >Assignee: Gary Yao >Priority: Critical > Fix For: 1.11.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17616) Temporarily increase akka.ask.timeout in TPC-DS e2e test
[ https://issues.apache.org/jira/browse/FLINK-17616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-17616: - Description: Until FLINK-17558 is fixed, we should increase the akka.ask.timeout in the e2e test to mitigate FLINK-17194 (was: Until FLINK-17558 is fixed, we should ) > Temporarily increase akka.ask.timeout in TPC-DS e2e test > > > Key: FLINK-17616 > URL: https://issues.apache.org/jira/browse/FLINK-17616 > Project: Flink > Issue Type: Task > Components: Runtime / Coordination, Tests >Affects Versions: 1.11.0 >Reporter: Gary Yao >Assignee: Gary Yao >Priority: Critical > Fix For: 1.11.0 > > > Until FLINK-17558 is fixed, we should increase the akka.ask.timeout in the > e2e test to mitigate FLINK-17194 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17616) Temporarily increase akka.ask.timeout in TPC-DS e2e test
Gary Yao created FLINK-17616: Summary: Temporarily increase akka.ask.timeout in TPC-DS e2e test Key: FLINK-17616 URL: https://issues.apache.org/jira/browse/FLINK-17616 Project: Flink Issue Type: Task Components: Runtime / Coordination, Tests Affects Versions: 1.11.0 Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.11.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-17369) Migrate RestartPipelinedRegionFailoverStrategyBuildingTest to PipelinedRegionComputeUtilTest
[ https://issues.apache.org/jira/browse/FLINK-17369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao closed FLINK-17369. Resolution: Fixed master: cba48459172f3888766c6ec753f3c45c6cd1d884 56cc76bbdaf6380781c118ad2e5d4fbfeca510ac fd0ef6e672b5ac2f7cbd01fc5704e9e06c748016 > Migrate RestartPipelinedRegionFailoverStrategyBuildingTest to > PipelinedRegionComputeUtilTest > > > Key: FLINK-17369 > URL: https://issues.apache.org/jira/browse/FLINK-17369 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Coordination, Tests >Affects Versions: 1.11.0 >Reporter: Gary Yao >Assignee: Gary Yao >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > > Tests in {{RestartPipelinedRegionFailoverStrategyBuildingTest}} are actually > testing the behavior of {{PipelinedRegionComputeUtil}}. Therefore, the tests > should be moved to a new class {{PipelinedRegionComputeUtilTest}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-17485) Add a thread dump REST API
[ https://issues.apache.org/jira/browse/FLINK-17485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao closed FLINK-17485. Resolution: Duplicate Closing because it is a duplicate. Feel free to re-open if you think otherwise. > Add a thread dump REST API > -- > > Key: FLINK-17485 > URL: https://issues.apache.org/jira/browse/FLINK-17485 > Project: Flink > Issue Type: Improvement > Components: Runtime / REST >Reporter: Xingxing Di >Priority: Major > > My team build a streaming computing platform based on flink in our company > internal. > As jobs and users grow, we spent lot's of time to help user with > troubleshooting. > Currently we must logon the server which running task manager, find the right > process through netstat -anp| grep "the flink data port", then run jstack > command. > We think it will be very convenient if flink provide a REST API for thread > dumping, with web UI support event better. > So we want to know: > * If community is already working on this > * Will this be a appropriate feature (add a REST API to dump threads), > because on the other hand, thread dump may be "expensive" -- This message was sent by Atlassian Jira (v8.3.4#803005)
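The manual workflow described in the issue (grep the data port out of {{netstat -anp}}, then run {{jstack}}) can be scripted. A sketch, with the parsing split into a function so it is testable; the port number and the field layout assume Linux {{netstat -anp}} output:

```shell
#!/bin/sh
# Sketch of the manual workflow from the issue: find the TaskManager
# PID by its data port in `netstat -anp` output, then thread-dump it.
# The port (6121) and field layout are illustrative assumptions.
pid_from_netstat() {
  # reads `netstat -anp` output on stdin; $1 is the port to look for;
  # the PID is the part before "/" in the last (PID/Program) column
  awk -v p=":$1" '$4 ~ p {split($NF, a, "/"); print a[1]; exit}'
}
# Usage (not run here):
#   pid=$(netstat -anp 2>/dev/null | pid_from_netstat 6121)
#   [ -n "$pid" ] && jstack "$pid"
```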
[jira] [Commented] (FLINK-17595) JobExceptionsInfo. ExecutionExceptionInfo miss getter method
[ https://issues.apache.org/jira/browse/FLINK-17595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17104196#comment-17104196 ] Gary Yao commented on FLINK-17595: -- [~zhangwei24] Can you explain why you need a getter? This class isn't public API. > JobExceptionsInfo. ExecutionExceptionInfo miss getter method > > > Key: FLINK-17595 > URL: https://issues.apache.org/jira/browse/FLINK-17595 > Project: Flink > Issue Type: Bug > Components: Runtime / REST >Affects Versions: 1.10.0 >Reporter: Wei Zhang >Priority: Minor > Fix For: 1.11.0 > > > {code:java} > public static final class ExecutionExceptionInfo { > public static final String FIELD_NAME_EXCEPTION = "exception"; > public static final String FIELD_NAME_TASK = "task"; > public static final String FIELD_NAME_LOCATION = "location"; > public static final String FIELD_NAME_TIMESTAMP = "timestamp"; > @JsonProperty(FIELD_NAME_EXCEPTION) > private final String exception; > @JsonProperty(FIELD_NAME_TASK) > private final String task; > @JsonProperty(FIELD_NAME_LOCATION) > private final String location; > @JsonProperty(FIELD_NAME_TIMESTAMP) > private final long timestamp; > @JsonCreator > public ExecutionExceptionInfo( > @JsonProperty(FIELD_NAME_EXCEPTION) String exception, > @JsonProperty(FIELD_NAME_TASK) String task, > @JsonProperty(FIELD_NAME_LOCATION) String location, > @JsonProperty(FIELD_NAME_TIMESTAMP) long timestamp) { > this.exception = Preconditions.checkNotNull(exception); > this.task = Preconditions.checkNotNull(task); > this.location = Preconditions.checkNotNull(location); > this.timestamp = timestamp; > } > @Override > public boolean equals(Object o) { > if (this == o) { > return true; > } > if (o == null || getClass() != o.getClass()) { > return false; > } > JobExceptionsInfo.ExecutionExceptionInfo that = > (JobExceptionsInfo.ExecutionExceptionInfo) o; > return timestamp == that.timestamp && > Objects.equals(exception, that.exception) && > Objects.equals(task, that.task) 
&& > Objects.equals(location, that.location); > } > @Override > public int hashCode() { > return Objects.hash(timestamp, exception, task, > location); > } > {code} > I found that {{JobExceptionsInfo.ExecutionExceptionInfo}} has no getter > methods for its fields; are they missing? -- This message was sent by Atlassian Jira (v8.3.4#803005)
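For illustration, the getters the reporter asks about would be mechanical to add. Below is a hedged sketch on a stripped-down copy of the class (Jackson annotations and equals/hashCode omitted; the class name is changed to mark it as hypothetical, not the actual Flink class):

```java
import java.util.Objects;

/** Stripped-down sketch of ExecutionExceptionInfo with the missing getters added. */
public final class ExecutionExceptionInfoSketch {
    private final String exception;
    private final String task;
    private final String location;
    private final long timestamp;

    public ExecutionExceptionInfoSketch(
            String exception, String task, String location, long timestamp) {
        this.exception = Objects.requireNonNull(exception);
        this.task = Objects.requireNonNull(task);
        this.location = Objects.requireNonNull(location);
        this.timestamp = timestamp;
    }

    // The getters the issue reports as missing:
    public String getException() { return exception; }
    public String getTask() { return task; }
    public String getLocation() { return location; }
    public long getTimestamp() { return timestamp; }
}
```

Note Gary's point above still stands: the real class is populated via Jackson's {{@JsonProperty}} field bindings and is not public API, so getters were simply never needed internally.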
[jira] [Assigned] (FLINK-17608) Add TM log and stdout page back
[ https://issues.apache.org/jira/browse/FLINK-17608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao reassigned FLINK-17608: Assignee: Yadong Xie > Add TM log and stdout page back > --- > > Key: FLINK-17608 > URL: https://issues.apache.org/jira/browse/FLINK-17608 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Web Frontend >Affects Versions: 1.11.0 >Reporter: Yadong Xie >Assignee: Yadong Xie >Priority: Major > Fix For: 1.11.0 > > > According to the discussion in > [https://github.com/apache/flink/pull/11731#issuecomment-620048458], the > TM log and stdout pages should be added back so as not to break the previous > user experience. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17194) TPC-DS end-to-end test fails due to missing execution attempt
[ https://issues.apache.org/jira/browse/FLINK-17194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101970#comment-17101970 ] Gary Yao commented on FLINK-17194: -- My theory is that we exhaust the IOPS credits towards the end of the tests and file I/O becomes really slow. Nonetheless, partitions should not be released in the main thread. I have created FLINK-17558 to track that issue. As a mitigation we could temporarily increase the akka ask timeout. > TPC-DS end-to-end test fails due to missing execution attempt > - > > Key: FLINK-17194 > URL: https://issues.apache.org/jira/browse/FLINK-17194 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination, Tests >Affects Versions: 1.11.0 >Reporter: Chesnay Schepler >Assignee: Gary Yao >Priority: Critical > Labels: test-stability > Fix For: 1.11.0 > > > [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=7567&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5] > {code:java} > org.apache.flink.runtime.jobmaster.ExecutionGraphException: The execution > attempt d6bef26867c04f1c94903b06b60ec55f was not found. > at > org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:389) > ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
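The mitigation mentioned above (temporarily increasing the Akka ask timeout) is a one-line configuration change. A sketch of the corresponding flink-conf.yaml entry is shown below; the 60 s value is an illustrative assumption, not a recommendation from this thread:

```yaml
# flink-conf.yaml -- raise the RPC ask timeout (default: 10 s) so that slow,
# blocking file I/O on the TaskExecutor main thread does not immediately cause
# RPC timeouts. This is a workaround; the underlying issue is FLINK-17558.
akka.ask.timeout: 60 s
```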
[jira] [Updated] (FLINK-17558) Partitions are released in TaskExecutor Main Thread
[ https://issues.apache.org/jira/browse/FLINK-17558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-17558: - Description: Partitions are released in the main thread of the TaskExecutor (see the stacktrace below). This can lead to missed heartbeats, RPC timeouts, etc., because deleting files is blocking I/O. The partitions should be released in a dedicated I/O thread pool ({{TaskExecutor#ioExecutor}} is a candidate but requires a higher default thread count). {noformat} 2020-05-06T19:13:12.4383402Z "flink-akka.actor.default-dispatcher-35" #3555 prio=5 os_prio=0 tid=0x7f7fcc071000 nid=0x1f3f9 runnable [0x7f7fd302c000] 2020-05-06T19:13:12.4383983Zjava.lang.Thread.State: RUNNABLE 2020-05-06T19:13:12.4384519Zat sun.nio.fs.UnixNativeDispatcher.unlink0(Native Method) 2020-05-06T19:13:12.4384971Zat sun.nio.fs.UnixNativeDispatcher.unlink(UnixNativeDispatcher.java:146) 2020-05-06T19:13:12.4385465Zat sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:231) 2020-05-06T19:13:12.4386000Zat sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103) 2020-05-06T19:13:12.4386458Zat java.nio.file.Files.delete(Files.java:1126) 2020-05-06T19:13:12.4386968Zat org.apache.flink.runtime.io.network.partition.FileChannelBoundedData.close(FileChannelBoundedData.java:93) 2020-05-06T19:13:12.4388088Zat org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartition.checkReaderReferencesAndDispose(BoundedBlockingSubpartition.java:247) 2020-05-06T19:13:12.4388765Zat org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartition.release(BoundedBlockingSubpartition.java:208) 2020-05-06T19:13:12.4389444Z- locked <0xff836d78> (a java.lang.Object) 2020-05-06T19:13:12.4389905Zat org.apache.flink.runtime.io.network.partition.ResultPartition.release(ResultPartition.java:290) 2020-05-06T19:13:12.4390481Zat 
org.apache.flink.runtime.io.network.partition.ResultPartitionManager.releasePartition(ResultPartitionManager.java:80) 2020-05-06T19:13:12.4391118Z- locked <0x9d452b90> (a java.util.HashMap) 2020-05-06T19:13:12.4391597Zat org.apache.flink.runtime.io.network.NettyShuffleEnvironment.releasePartitionsLocally(NettyShuffleEnvironment.java:153) 2020-05-06T19:13:12.4392267Zat org.apache.flink.runtime.io.network.partition.TaskExecutorPartitionTrackerImpl.stopTrackingAndReleaseJobPartitions(TaskExecutorPartitionTrackerImpl.java:62) 2020-05-06T19:13:12.4392914Zat org.apache.flink.runtime.taskexecutor.TaskExecutor.releaseOrPromotePartitions(TaskExecutor.java:776) 2020-05-06T19:13:12.4393366Zat sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source) 2020-05-06T19:13:12.4393813Zat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 2020-05-06T19:13:12.4394257Zat java.lang.reflect.Method.invoke(Method.java:498) 2020-05-06T19:13:12.4394693Zat org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:279) 2020-05-06T19:13:12.4395202Zat org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:199) 2020-05-06T19:13:12.4395686Zat org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) 2020-05-06T19:13:12.4396165Zat org.apache.flink.runtime.rpc.akka.AkkaRpcActor$$Lambda$72/775020844.apply(Unknown Source) 2020-05-06T19:13:12.4396606Zat akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) 2020-05-06T19:13:12.4397015Zat akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) 2020-05-06T19:13:12.4397447Zat scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) 2020-05-06T19:13:12.4397874Zat akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) 2020-05-06T19:13:12.4398414Zat scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) 2020-05-06T19:13:12.4398879Zat 
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) 2020-05-06T19:13:12.4399321Zat scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) 2020-05-06T19:13:12.4399737Zat akka.actor.Actor$class.aroundReceive(Actor.scala:517) 2020-05-06T19:13:12.4400138Zat akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) 2020-05-06T19:13:12.4400552Zat akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) 2020-05-06T19:13:12.4400930Zat akka.actor.ActorCell.invoke(ActorCell.scala:561) 2020-05-06T19:13:12.4401390Zat akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) 2020-05-06T19:13:12.4401763Zat akka.dispatch.Mailbox.run(Mailbox.scala:225) 2020-05-06T19:13:12.4402135Zat akka.dispatch.Mailbox.exec(Mailbox.scala:235) 2020-05-06T19:13:12.4402540Zat akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 2020-05-06T19:13:12.4402984Zat akka.d
[jira] [Created] (FLINK-17558) Partitions are released in TaskExecutor Main Thread
Gary Yao created FLINK-17558: Summary: Partitions are released in TaskExecutor Main Thread Key: FLINK-17558 URL: https://issues.apache.org/jira/browse/FLINK-17558 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.11.0 Reporter: Gary Yao Fix For: 1.11.0 Partitions are released in the main thread of the TaskExecutor (see the stacktrace below). This can lead to missed heartbeats, RPC timeouts, etc., because deleting files is blocking I/O. The partitions should be released in a dedicated I/O thread pool ({{TaskExecutor#ioExecutor}} is a candidate). {noformat} 2020-05-06T19:13:12.4383402Z "flink-akka.actor.default-dispatcher-35" #3555 prio=5 os_prio=0 tid=0x7f7fcc071000 nid=0x1f3f9 runnable [0x7f7fd302c000] 2020-05-06T19:13:12.4383983Zjava.lang.Thread.State: RUNNABLE 2020-05-06T19:13:12.4384519Zat sun.nio.fs.UnixNativeDispatcher.unlink0(Native Method) 2020-05-06T19:13:12.4384971Zat sun.nio.fs.UnixNativeDispatcher.unlink(UnixNativeDispatcher.java:146) 2020-05-06T19:13:12.4385465Zat sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:231) 2020-05-06T19:13:12.4386000Zat sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103) 2020-05-06T19:13:12.4386458Zat java.nio.file.Files.delete(Files.java:1126) 2020-05-06T19:13:12.4386968Zat org.apache.flink.runtime.io.network.partition.FileChannelBoundedData.close(FileChannelBoundedData.java:93) 2020-05-06T19:13:12.4388088Zat org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartition.checkReaderReferencesAndDispose(BoundedBlockingSubpartition.java:247) 2020-05-06T19:13:12.4388765Zat org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartition.release(BoundedBlockingSubpartition.java:208) 2020-05-06T19:13:12.4389444Z- locked <0xff836d78> (a java.lang.Object) 2020-05-06T19:13:12.4389905Zat org.apache.flink.runtime.io.network.partition.ResultPartition.release(ResultPartition.java:290) 2020-05-06T19:13:12.4390481Zat 
org.apache.flink.runtime.io.network.partition.ResultPartitionManager.releasePartition(ResultPartitionManager.java:80) 2020-05-06T19:13:12.4391118Z- locked <0x9d452b90> (a java.util.HashMap) 2020-05-06T19:13:12.4391597Zat org.apache.flink.runtime.io.network.NettyShuffleEnvironment.releasePartitionsLocally(NettyShuffleEnvironment.java:153) 2020-05-06T19:13:12.4392267Zat org.apache.flink.runtime.io.network.partition.TaskExecutorPartitionTrackerImpl.stopTrackingAndReleaseJobPartitions(TaskExecutorPartitionTrackerImpl.java:62) 2020-05-06T19:13:12.4392914Zat org.apache.flink.runtime.taskexecutor.TaskExecutor.releaseOrPromotePartitions(TaskExecutor.java:776) 2020-05-06T19:13:12.4393366Zat sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source) 2020-05-06T19:13:12.4393813Zat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 2020-05-06T19:13:12.4394257Zat java.lang.reflect.Method.invoke(Method.java:498) 2020-05-06T19:13:12.4394693Zat org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:279) 2020-05-06T19:13:12.4395202Zat org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:199) 2020-05-06T19:13:12.4395686Zat org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) 2020-05-06T19:13:12.4396165Zat org.apache.flink.runtime.rpc.akka.AkkaRpcActor$$Lambda$72/775020844.apply(Unknown Source) 2020-05-06T19:13:12.4396606Zat akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) 2020-05-06T19:13:12.4397015Zat akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) 2020-05-06T19:13:12.4397447Zat scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) 2020-05-06T19:13:12.4397874Zat akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) 2020-05-06T19:13:12.4398414Zat scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) 2020-05-06T19:13:12.4398879Zat 
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) 2020-05-06T19:13:12.4399321Zat scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) 2020-05-06T19:13:12.4399737Zat akka.actor.Actor$class.aroundReceive(Actor.scala:517) 2020-05-06T19:13:12.4400138Zat akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) 2020-05-06T19:13:12.4400552Zat akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) 2020-05-06T19:13:12.4400930Zat akka.actor.ActorCell.invoke(ActorCell.scala:561) 2020-05-06T19:13:12.4401390Zat akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) 2020-05-06T19:13:12.4401763Zat akka.dispatch.Mailbox.run(Mailbox.scala:225) 2020-05-06T19:13:12.4402135Zat akka.dis
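The fix direction described in FLINK-17558 (moving the blocking file deletion off the RPC main thread into a dedicated I/O pool) can be sketched as follows. This is a simplified illustration, not the actual TaskExecutor code; names such as releasePartitionAsync and releasePartitionBlocking are hypothetical:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Sketch: release partitions on a dedicated I/O pool instead of the main thread. */
public class PartitionReleaseSketch {
    // Stand-in for TaskExecutor#ioExecutor; a fixed pool keeps blocking deletes
    // from ever running on the RPC main thread.
    private final ExecutorService ioExecutor = Executors.newFixedThreadPool(4);

    /** Blocking part: deleting the spill file backing a BLOCKING partition. */
    private void releasePartitionBlocking(Path partitionFile) {
        try {
            Files.deleteIfExists(partitionFile); // the slow, blocking file I/O call
        } catch (IOException e) {
            // Real code would log this; swallowing keeps the sketch short.
        }
    }

    /** Non-blocking entry point, safe to call from the RPC main thread. */
    public void releasePartitionAsync(Path partitionFile) {
        ioExecutor.submit(() -> releasePartitionBlocking(partitionFile));
    }

    /** Drains pending releases; returns true if all completed in time. */
    public boolean shutdownAndAwait(long seconds) throws InterruptedException {
        ioExecutor.shutdown();
        return ioExecutor.awaitTermination(seconds, TimeUnit.SECONDS);
    }
}
```

With this shape, a slow unlink (as in the stack trace above) only stalls an I/O pool thread, so heartbeats and RPCs handled by the main thread are unaffected.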
[jira] [Commented] (FLINK-17194) TPC-DS end-to-end test fails due to missing execution attempt
[ https://issues.apache.org/jira/browse/FLINK-17194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101125#comment-17101125 ] Gary Yao commented on FLINK-17194: -- I am able to reliably reproduce this issue by setting {{akka.ask.timeout}} to {{5s}}, which should still be a generous timeout for a local Flink cluster. The problem seems to be that sometimes releasing a partition is slow (due to file I/O), and this blocks the TaskExecutor's main thread. I have attached an example below. {noformat} Run TPC-DS query 39b ... {noformat} {noformat} 2020-05-06 19:13:08,445 flink-akka.actor.default-dispatcher-35 DEBUG org.apache.flink.runtime.io.network.partition.ResultPartition [] - CsvTableSource(read fields: inv_date_sk, inv_item_sk, inv_warehouse_sk, inv_quantity_on_hand) -> SourceConversion(table=[default_catalog.default_database.inventory, source: [CsvTableSource(read fields: inv_date_sk, inv_item_sk , inv_warehouse_sk, inv_quantity_on_hand)]], fields=[inv_date_sk, inv_item_sk, inv_warehouse_sk, inv_quantity_on_hand]) (3/4) (76d0879cdd3bdb851b44f8dbb5b30999): Releasing ResultPartition feb9262b7de50f164c061797ec01ba64#2@76d0879cdd3bdb851b44f8dbb5b30999 [BLOCKING, 1 subpartitions]. 
2020-05-06 19:13:08,445 flink-akka.actor.default-dispatcher-35 DEBUG org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartition [] - Close org.apache.flink.runtime.io.network.partition.FileChannelBoundedData@201865e0 2020-05-06 19:13:17,771 flink-akka.actor.default-dispatcher-35 DEBUG org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartition [] - Closed org.apache.flink.runtime.io.network.partition.FileChannelBoundedData@201865e0 2020-05-06 19:13:17,771 flink-akka.actor.default-dispatcher-35 DEBUG org.apache.flink.runtime.io.network.partition.ResultPartition [] - CsvTableSource(read fields: inv_date_sk, inv_item_sk, inv_warehouse_sk, inv_quantity_on_hand) -> SourceConversion(table=[default_catalog.default_database.inventory, source: [CsvTableSource(read fields: inv_date_sk, inv_item_sk , inv_warehouse_sk, inv_quantity_on_hand)]], fields=[inv_date_sk, inv_item_sk, inv_warehouse_sk, inv_quantity_on_hand]) (3/4) (76d0879cdd3bdb851b44f8dbb5b30999): Released ResultPartition feb9262b7de50f164c061797ec01ba64#2@76d0879cdd3bdb851b44f8dbb5b30999 [BLOCKING, 1 subpartitions]. {noformat} Note that it takes more than 9 seconds to release the partition. I have added additional debug prints. I have also managed to invoke jstack at the right time on the TM process. The main thread is blocked on deleting {{FileChannelBoundedData#filePath}}. 
{noformat} 2020-05-06T19:13:12.4383402Z "flink-akka.actor.default-dispatcher-35" #3555 prio=5 os_prio=0 tid=0x7f7fcc071000 nid=0x1f3f9 runnable [0x7f7fd302c000] 2020-05-06T19:13:12.4383983Zjava.lang.Thread.State: RUNNABLE 2020-05-06T19:13:12.4384519Zat sun.nio.fs.UnixNativeDispatcher.unlink0(Native Method) 2020-05-06T19:13:12.4384971Zat sun.nio.fs.UnixNativeDispatcher.unlink(UnixNativeDispatcher.java:146) 2020-05-06T19:13:12.4385465Zat sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:231) 2020-05-06T19:13:12.4386000Zat sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103) 2020-05-06T19:13:12.4386458Zat java.nio.file.Files.delete(Files.java:1126) 2020-05-06T19:13:12.4386968Zat org.apache.flink.runtime.io.network.partition.FileChannelBoundedData.close(FileChannelBoundedData.java:93) 2020-05-06T19:13:12.4388088Zat org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartition.checkReaderReferencesAndDispose(BoundedBlockingSubpartition.java:247) 2020-05-06T19:13:12.4388765Zat org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartition.release(BoundedBlockingSubpartition.java:208) 2020-05-06T19:13:12.4389444Z- locked <0xff836d78> (a java.lang.Object) 2020-05-06T19:13:12.4389905Zat org.apache.flink.runtime.io.network.partition.ResultPartition.release(ResultPartition.java:290) 2020-05-06T19:13:12.4390481Zat org.apache.flink.runtime.io.network.partition.ResultPartitionManager.releasePartition(ResultPartitionManager.java:80) 2020-05-06T19:13:12.4391118Z- locked <0x9d452b90> (a java.util.HashMap) 2020-05-06T19:13:12.4391597Zat org.apache.flink.runtime.io.network.NettyShuffleEnvironment.releasePartitionsLocally(NettyShuffleEnvironment.java:153) 2020-05-06T19:13:12.4392267Zat org.apache.flink.runtime.io.network.partition.TaskExecutorPartitionTrackerImpl.stopTrackingAndReleaseJobPartitions(TaskExecutorPartitionTrackerImpl.java:62) 2020-05-06T19:13:12.4392914Zat 
org.apache.flink.runtime.taskexecutor.TaskExecutor.releaseOrPromotePartitions(TaskExecutor.java:776) 2020-05-06T19:13:12.4393366Zat sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source) 2020-05-06T19:13:12.4393813Zat sun.reflect.D
[jira] [Comment Edited] (FLINK-17485) Add a thread dump REST API
[ https://issues.apache.org/jira/browse/FLINK-17485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100643#comment-17100643 ] Gary Yao edited comment on FLINK-17485 at 5/6/20, 10:02 AM: [~dixingx...@yeah.net] Is FLINK-14816 what you need? was (Author: gjy): Is FLINK-14816 what you need? > Add a thread dump REST API > -- > > Key: FLINK-17485 > URL: https://issues.apache.org/jira/browse/FLINK-17485 > Project: Flink > Issue Type: Improvement > Components: Runtime / REST >Reporter: Xingxing Di >Priority: Major > > My team built a streaming computing platform based on Flink for internal use > at our company. > As jobs and users grow, we spend a lot of time helping users with > troubleshooting. > Currently we must log on to the server running the TaskManager, find the > right process via netstat -anp | grep "the flink data port", and then run the > jstack command. > We think it would be very convenient if Flink provided a REST API for thread > dumps; web UI support would be even better. > So we want to know: > * whether the community is already working on this > * whether this would be an appropriate feature (adding a REST API to dump > threads), since on the other hand a thread dump may be "expensive" -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17485) Add a thread dump REST API
[ https://issues.apache.org/jira/browse/FLINK-17485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100643#comment-17100643 ] Gary Yao commented on FLINK-17485: -- Is FLINK-14816 what you need? > Add a thread dump REST API > -- > > Key: FLINK-17485 > URL: https://issues.apache.org/jira/browse/FLINK-17485 > Project: Flink > Issue Type: Improvement > Components: Runtime / REST >Reporter: Xingxing Di >Priority: Major > > My team built a streaming computing platform based on Flink for internal use > at our company. > As jobs and users grow, we spend a lot of time helping users with > troubleshooting. > Currently we must log on to the server running the TaskManager, find the > right process via netstat -anp | grep "the flink data port", and then run the > jstack command. > We think it would be very convenient if Flink provided a REST API for thread > dumps; web UI support would be even better. > So we want to know: > * whether the community is already working on this > * whether this would be an appropriate feature (adding a REST API to dump > threads), since on the other hand a thread dump may be "expensive" -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-13553) KvStateServerHandlerTest.readInboundBlocking unstable on Travis
[ https://issues.apache.org/jira/browse/FLINK-13553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100568#comment-17100568 ] Gary Yao commented on FLINK-13553: -- In case it happens again, I added more debug information in c2540e44058a313a8dc7251dd5d37d2d52db2b44 > KvStateServerHandlerTest.readInboundBlocking unstable on Travis > --- > > Key: FLINK-13553 > URL: https://issues.apache.org/jira/browse/FLINK-13553 > Project: Flink > Issue Type: Bug > Components: Runtime / Queryable State >Affects Versions: 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Priority: Critical > Labels: pull-request-available, test-stability > Time Spent: 10m > Remaining Estimate: 0h > > The {{KvStateServerHandlerTest.readInboundBlocking}} and > {{KvStateServerHandlerTest.testQueryExecutorShutDown}} fail on Travis with a > {{TimeoutException}}. > https://api.travis-ci.org/v3/job/566420641/log.txt -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-17501) Improve logging in AbstractServerHandler#channelRead(ChannelHandlerContext, Object)
[ https://issues.apache.org/jira/browse/FLINK-17501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao closed FLINK-17501. Resolution: Fixed master: c2540e44058a313a8dc7251dd5d37d2d52db2b44 > Improve logging in AbstractServerHandler#channelRead(ChannelHandlerContext, > Object) > --- > > Key: FLINK-17501 > URL: https://issues.apache.org/jira/browse/FLINK-17501 > Project: Flink > Issue Type: Bug > Components: Runtime / Queryable State >Reporter: Gary Yao >Assignee: Gary Yao >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > > Improve logging in {{AbstractServerHandler#channelRead(ChannelHandlerContext, > Object)}}. If an Error is thrown, it should be logged as early as possible. > Currently we try to serialize and send an error response to the client before > logging the error; this can fail and mask the original exception. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17522) Document flink-jepsen Command Line Options
Gary Yao created FLINK-17522: Summary: Document flink-jepsen Command Line Options Key: FLINK-17522 URL: https://issues.apache.org/jira/browse/FLINK-17522 Project: Flink Issue Type: Improvement Components: Tests Reporter: Gary Yao Assignee: Gary Yao -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17501) Improve logging in AbstractServerHandler#channelRead(ChannelHandlerContext, Object)
Gary Yao created FLINK-17501: Summary: Improve logging in AbstractServerHandler#channelRead(ChannelHandlerContext, Object) Key: FLINK-17501 URL: https://issues.apache.org/jira/browse/FLINK-17501 Project: Flink Issue Type: Bug Components: Runtime / Queryable State Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.11.0 Improve logging in {{AbstractServerHandler#channelRead(ChannelHandlerContext, Object)}}. If an Error is thrown, it should be logged as early as possible. Currently we try to serialize and send an error response to the client before logging the error; this can fail and mask the original exception. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-17473) Remove unused classes ArchivedExecutionBuilder, ArchivedExecutionVertexBuilder, and ArchivedExecutionJobVertexBuilder
[ https://issues.apache.org/jira/browse/FLINK-17473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao closed FLINK-17473. Resolution: Fixed master: d385f20f607875a625df8b95917f7a8eaacea4a6 > Remove unused classes ArchivedExecutionBuilder, > ArchivedExecutionVertexBuilder, and ArchivedExecutionJobVertexBuilder > - > > Key: FLINK-17473 > URL: https://issues.apache.org/jira/browse/FLINK-17473 > Project: Flink > Issue Type: Task > Components: Runtime / Coordination, Tests >Reporter: Gary Yao >Assignee: Gary Yao >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > > Remove unused test classes {{ArchivedExecutionBuilder}}, > {{ArchivedExecutionVertexBuilder}}, and {{ArchivedExecutionJobVertexBuilder}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17473) Remove unused classes ArchivedExecutionBuilder, ArchivedExecutionVertexBuilder, and ArchivedExecutionJobVertexBuilder
[ https://issues.apache.org/jira/browse/FLINK-17473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-17473: - Description: Remove unused test classes {{ArchivedExecutionBuilder}}, {{ArchivedExecutionVertexBuilder}}, and {{ArchivedExecutionJobVertexBuilder}}. (was: Remove unused classes {{ArchivedExecutionVertexBuilder}} and {{ArchivedExecutionJobVertexBuilder}}) > Remove unused classes ArchivedExecutionBuilder, > ArchivedExecutionVertexBuilder, and ArchivedExecutionJobVertexBuilder > - > > Key: FLINK-17473 > URL: https://issues.apache.org/jira/browse/FLINK-17473 > Project: Flink > Issue Type: Task > Components: Runtime / Coordination, Tests >Reporter: Gary Yao >Assignee: Gary Yao >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > > Remove unused test classes {{ArchivedExecutionBuilder}}, > {{ArchivedExecutionVertexBuilder}}, and {{ArchivedExecutionJobVertexBuilder}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17473) Remove unused classes ArchivedExecutionBuilder, ArchivedExecutionVertexBuilder, and ArchivedExecutionJobVertexBuilder
[ https://issues.apache.org/jira/browse/FLINK-17473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-17473: - Summary: Remove unused classes ArchivedExecutionBuilder, ArchivedExecutionVertexBuilder, and ArchivedExecutionJobVertexBuilder (was: Remove unused classes ArchivedExecutionBuilder, ArchivedExecutionVertexBuilder and ArchivedExecutionJobVertexBuilder) > Remove unused classes ArchivedExecutionBuilder, > ArchivedExecutionVertexBuilder, and ArchivedExecutionJobVertexBuilder > - > > Key: FLINK-17473 > URL: https://issues.apache.org/jira/browse/FLINK-17473 > Project: Flink > Issue Type: Task > Components: Runtime / Coordination, Tests >Reporter: Gary Yao >Assignee: Gary Yao >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > > Remove unused classes {{ArchivedExecutionVertexBuilder}} and > {{ArchivedExecutionJobVertexBuilder}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17473) Remove unused classes ArchivedExecutionBuilder, ArchivedExecutionVertexBuilder and ArchivedExecutionJobVertexBuilder
[ https://issues.apache.org/jira/browse/FLINK-17473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-17473: - Summary: Remove unused classes ArchivedExecutionBuilder, ArchivedExecutionVertexBuilder and ArchivedExecutionJobVertexBuilder (was: Remove unused classes ArchivedExecutionVertexBuilder and ArchivedExecutionJobVertexBuilder) > Remove unused classes ArchivedExecutionBuilder, > ArchivedExecutionVertexBuilder and ArchivedExecutionJobVertexBuilder > > > Key: FLINK-17473 > URL: https://issues.apache.org/jira/browse/FLINK-17473 > Project: Flink > Issue Type: Task > Components: Runtime / Coordination, Tests >Reporter: Gary Yao >Assignee: Gary Yao >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > > Remove unused classes {{ArchivedExecutionVertexBuilder}} and > {{ArchivedExecutionJobVertexBuilder}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-13553) KvStateServerHandlerTest.readInboundBlocking unstable on Travis
[ https://issues.apache.org/jira/browse/FLINK-13553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17098710#comment-17098710 ] Gary Yao commented on FLINK-13553: -- I am unable to reproduce this locally or on Azure. On Travis, the builds are consistently timing out during the compilation stage. > KvStateServerHandlerTest.readInboundBlocking unstable on Travis > --- > > Key: FLINK-13553 > URL: https://issues.apache.org/jira/browse/FLINK-13553 > Project: Flink > Issue Type: Bug > Components: Runtime / Queryable State >Affects Versions: 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Assignee: Gary Yao >Priority: Critical > Labels: pull-request-available, test-stability > Time Spent: 10m > Remaining Estimate: 0h > > The {{KvStateServerHandlerTest.readInboundBlocking}} and > {{KvStateServerHandlerTest.testQueryExecutorShutDown}} fail on Travis with a > {{TimeoutException}}. > https://api.travis-ci.org/v3/job/566420641/log.txt -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (FLINK-13553) KvStateServerHandlerTest.readInboundBlocking unstable on Travis
[ https://issues.apache.org/jira/browse/FLINK-13553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao reassigned FLINK-13553: Assignee: (was: Gary Yao) > KvStateServerHandlerTest.readInboundBlocking unstable on Travis > --- > > Key: FLINK-13553 > URL: https://issues.apache.org/jira/browse/FLINK-13553 > Project: Flink > Issue Type: Bug > Components: Runtime / Queryable State >Affects Versions: 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Priority: Critical > Labels: pull-request-available, test-stability > Time Spent: 10m > Remaining Estimate: 0h > > The {{KvStateServerHandlerTest.readInboundBlocking}} and > {{KvStateServerHandlerTest.testQueryExecutorShutDown}} fail on Travis with a > {{TimeoutException}}. > https://api.travis-ci.org/v3/job/566420641/log.txt -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-13553) KvStateServerHandlerTest.readInboundBlocking unstable on Travis
[ https://issues.apache.org/jira/browse/FLINK-13553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17098710#comment-17098710 ] Gary Yao edited comment on FLINK-13553 at 5/4/20, 6:37 AM: --- I am unable to reproduce this locally or on Azure. On Travis, the builds are consistently timing out during the compilation stage. I have now unassigned myself since I am unable to make progress at the moment. was (Author: gjy): I am unable to reproduce this locally or on Azure. On Travis, the builds are consistently timing out during the compilation stage. > KvStateServerHandlerTest.readInboundBlocking unstable on Travis > --- > > Key: FLINK-13553 > URL: https://issues.apache.org/jira/browse/FLINK-13553 > Project: Flink > Issue Type: Bug > Components: Runtime / Queryable State >Affects Versions: 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Priority: Critical > Labels: pull-request-available, test-stability > Time Spent: 10m > Remaining Estimate: 0h > > The {{KvStateServerHandlerTest.readInboundBlocking}} and > {{KvStateServerHandlerTest.testQueryExecutorShutDown}} fail on Travis with a > {{TimeoutException}}. > https://api.travis-ci.org/v3/job/566420641/log.txt -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17473) Remove unused classes ArchivedExecutionVertexBuilder and ArchivedExecutionJobVertexBuilder
Gary Yao created FLINK-17473: Summary: Remove unused classes ArchivedExecutionVertexBuilder and ArchivedExecutionJobVertexBuilder Key: FLINK-17473 URL: https://issues.apache.org/jira/browse/FLINK-17473 Project: Flink Issue Type: Bug Components: Runtime / Coordination, Tests Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.11.0 Remove unused classes {{ArchivedExecutionVertexBuilder}} and {{ArchivedExecutionJobVertexBuilder}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17473) Remove unused classes ArchivedExecutionVertexBuilder and ArchivedExecutionJobVertexBuilder
[ https://issues.apache.org/jira/browse/FLINK-17473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-17473: - Issue Type: Task (was: Bug) > Remove unused classes ArchivedExecutionVertexBuilder and > ArchivedExecutionJobVertexBuilder > -- > > Key: FLINK-17473 > URL: https://issues.apache.org/jira/browse/FLINK-17473 > Project: Flink > Issue Type: Task > Components: Runtime / Coordination, Tests >Reporter: Gary Yao >Assignee: Gary Yao >Priority: Major > Fix For: 1.11.0 > > > Remove unused classes {{ArchivedExecutionVertexBuilder}} and > {{ArchivedExecutionJobVertexBuilder}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17194) TPC-DS end-to-end test fails due to missing execution attempt
[ https://issues.apache.org/jira/browse/FLINK-17194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17095327#comment-17095327 ] Gary Yao commented on FLINK-17194: -- The root exception seems to be
{code}
java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException: Invocation of public abstract java.util.concurrent.CompletableFuture org.apache.flink.runtime.taskexecutor.TaskExecutorGateway.submitTask(org.apache.flink.runtime.deployment.TaskDeploymentDescriptor,org.apache.flink.runtime.jobmaster.JobMasterId,org.apache.flink.api.common.time.Time) timed out.
    at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:925) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:913) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
    at org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:227) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
    at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:888) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at akka.dispatch.OnComplete.internal(Future.scala:263) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at akka.dispatch.OnComplete.internal(Future.scala:261) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:644) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_242]
Caused by: java.util.concurrent.TimeoutException: Invocation of public abstract java.util.concurrent.CompletableFuture org.apache.flink.runtime.taskexecutor.TaskExecutorGateway.submitTask(org.apache.flink.runtime.deployment.TaskDeploymentDescriptor,org.apache.flink.runtime.jobmaster.JobMasterId,org.apache.flink.api.common.time.Time) timed out.
    at org.apache.flink.runtime.jobmaster.RpcTaskManagerGateway.submitTask(RpcTaskManagerGateway.java:72) ~[flink-dis
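The failure mode visible in the trace above — an RPC future that is completed exceptionally with a TimeoutException when the ask timeout fires — can be reduced to a small sketch. This is illustrative only: the class and method names below are invented, and Flink's actual AkkaInvocationHandler is considerably more involved.

```java
import java.util.concurrent.*;

/** Sketch: an RPC future that fails with TimeoutException when no reply
 *  arrives in time, mirroring how the timeout in the trace surfaces to
 *  callers. Names are hypothetical, not Flink's actual API. */
public class RpcTimeoutSketch {

    /** Returns a future that is never completed normally; a timer fails it
     *  with a TimeoutException after the given timeout, standing in for a
     *  remote endpoint that never responds. */
    static CompletableFuture<String> invokeRpc(long timeoutMillis) {
        CompletableFuture<String> result = new CompletableFuture<>();
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.schedule(() -> {
            result.completeExceptionally(
                new TimeoutException("Invocation of submitTask timed out"));
        }, timeoutMillis, TimeUnit.MILLISECONDS);
        // Clean up the timer thread once the future is settled.
        result.whenComplete((r, t) -> timer.shutdown());
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        try {
            invokeRpc(50).get();
        } catch (ExecutionException e) {
            // Callers observe the timeout wrapped in an Execution/CompletionException,
            // just as the CompletionException wraps the TimeoutException above.
            System.out.println(e.getCause().getClass().getSimpleName()); // prints "TimeoutException"
        }
    }
}
```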
[jira] [Closed] (FLINK-16605) Add max limitation to the total number of slots
[ https://issues.apache.org/jira/browse/FLINK-16605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao closed FLINK-16605. Resolution: Fixed master: 026a2b6d8ed3aab5bc29d998ba6e585fa5b2d9ef 9e69b270c8b192876dae128541aa73ae6e788e2f dcf9cc601f6ee1bb90a5d548043564a1a0522a25 > Add max limitation to the total number of slots > --- > > Key: FLINK-16605 > URL: https://issues.apache.org/jira/browse/FLINK-16605 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Reporter: Yangze Guo >Assignee: Yangze Guo >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > Time Spent: 10m > Remaining Estimate: 0h > > As discussed in FLINK-15527 and FLINK-15959, we propose to add the max limit > to the total number of slots. > To be specific: > - Introduce "cluster.number-of-slots.max" configuration option with default > value MAX_INT > - Make the SlotManager respect the max number of slots, when exceeded, it > would not allocate resource anymore. -- This message was sent by Atlassian Jira (v8.3.4#803005)
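The limiting behavior the issue describes — stop allocating once the configured maximum number of slots is reached — can be sketched as follows. The class and method names are hypothetical; Flink's actual SlotManager API differs.

```java
/** Minimal sketch of a slot manager that enforces an upper bound on the
 *  total number of slots, in the spirit of "cluster.number-of-slots.max"
 *  (default MAX_INT). Names are illustrative, not Flink's actual API. */
public class BoundedSlotManager {
    private final int maxSlots;
    private int registeredSlots;

    public BoundedSlotManager(int maxSlots) {
        this.maxSlots = maxSlots;
    }

    /** Registers one more slot if the limit permits; when the limit is
     *  exceeded, no further resources are allocated. */
    public boolean tryRegisterSlot() {
        if (registeredSlots >= maxSlots) {
            return false; // limit reached: refuse allocation
        }
        registeredSlots++;
        return true;
    }

    public int getRegisteredSlots() {
        return registeredSlots;
    }

    public static void main(String[] args) {
        BoundedSlotManager manager = new BoundedSlotManager(2);
        System.out.println(manager.tryRegisterSlot()); // prints "true"
        System.out.println(manager.tryRegisterSlot()); // prints "true"
        System.out.println(manager.tryRegisterSlot()); // prints "false" (limit reached)
    }
}
```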
[jira] [Assigned] (FLINK-17194) TPC-DS end-to-end test fails due to missing execution attempt
[ https://issues.apache.org/jira/browse/FLINK-17194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao reassigned FLINK-17194: Assignee: Gary Yao > TPC-DS end-to-end test fails due to missing execution attempt > - > > Key: FLINK-17194 > URL: https://issues.apache.org/jira/browse/FLINK-17194 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination, Tests >Affects Versions: 1.11.0 >Reporter: Chesnay Schepler >Assignee: Gary Yao >Priority: Critical > Labels: test-stability > Fix For: 1.11.0 > > > [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=7567&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5] > {code:java} > org.apache.flink.runtime.jobmaster.ExecutionGraphException: The execution > attempt d6bef26867c04f1c94903b06b60ec55f was not found. > at > org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:389) > ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17369) Migrate RestartPipelinedRegionFailoverStrategyBuildingTest to PipelinedRegionComputeUtilTest
[ https://issues.apache.org/jira/browse/FLINK-17369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-17369: - Parent: FLINK-16430 Issue Type: Sub-task (was: Task) > Migrate RestartPipelinedRegionFailoverStrategyBuildingTest to > PipelinedRegionComputeUtilTest > > > Key: FLINK-17369 > URL: https://issues.apache.org/jira/browse/FLINK-17369 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Coordination, Tests >Affects Versions: 1.11.0 >Reporter: Gary Yao >Assignee: Gary Yao >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > > Tests in {{RestartPipelinedRegionFailoverStrategyBuildingTest}} are actually > testing the behavior of {{PipelinedRegionComputeUtil}}. Therefore, the tests > should be moved to a new class {{PipelinedRegionComputeUtilTest}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17369) Migrate RestartPipelinedRegionFailoverStrategyBuildingTest to PipelinedRegionComputeUtilTest
[ https://issues.apache.org/jira/browse/FLINK-17369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-17369: - Issue Type: Task (was: Bug) > Migrate RestartPipelinedRegionFailoverStrategyBuildingTest to > PipelinedRegionComputeUtilTest > > > Key: FLINK-17369 > URL: https://issues.apache.org/jira/browse/FLINK-17369 > Project: Flink > Issue Type: Task > Components: Runtime / Coordination, Tests >Affects Versions: 1.11.0 >Reporter: Gary Yao >Assignee: Gary Yao >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > > Tests in {{RestartPipelinedRegionFailoverStrategyBuildingTest}} are actually > testing the behavior of {{PipelinedRegionComputeUtil}}. Therefore, the tests > should be moved to a new class {{PipelinedRegionComputeUtilTest}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (FLINK-13553) KvStateServerHandlerTest.readInboundBlocking unstable on Travis
[ https://issues.apache.org/jira/browse/FLINK-13553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao reassigned FLINK-13553: Assignee: Gary Yao > KvStateServerHandlerTest.readInboundBlocking unstable on Travis > --- > > Key: FLINK-13553 > URL: https://issues.apache.org/jira/browse/FLINK-13553 > Project: Flink > Issue Type: Bug > Components: Runtime / Queryable State >Affects Versions: 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Assignee: Gary Yao >Priority: Critical > Labels: pull-request-available, test-stability > Time Spent: 10m > Remaining Estimate: 0h > > The {{KvStateServerHandlerTest.readInboundBlocking}} and > {{KvStateServerHandlerTest.testQueryExecutorShutDown}} fail on Travis with a > {{TimeoutException}}. > https://api.travis-ci.org/v3/job/566420641/log.txt -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-17180) Implement PipelinedRegion interface for SchedulingTopology
[ https://issues.apache.org/jira/browse/FLINK-17180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao closed FLINK-17180. Resolution: Fixed master: 23c13bbdbfa1b538a6c9e4e9622ef4563f69cd03 95b3c955f115dacb58b9695ae4192f729f5d5662 f9c23a0b86121d6361df403a05f75ba4b3902735 > Implement PipelinedRegion interface for SchedulingTopology > -- > > Key: FLINK-17180 > URL: https://issues.apache.org/jira/browse/FLINK-17180 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Coordination >Reporter: Gary Yao >Assignee: Gary Yao >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > > Implement {{Topology#getAllPipelinedRegions()}} and > {{Topology#getPipelinedRegionOfVertex(ExecutionVertexID)}} in > {{DefaultExecutionTopology}} to enable retrieval of pipelined regions. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17180) Implement PipelinedRegion interface for SchedulingTopology
[ https://issues.apache.org/jira/browse/FLINK-17180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Yao updated FLINK-17180: - Description: Implement {{Topology#getAllPipelinedRegions()}} and {{Topology#getPipelinedRegionOfVertex(ExecutionVertexID)}} in {{DefaultExecutionTopology}} to enable retrieval of pipelined regions. > Implement PipelinedRegion interface for SchedulingTopology > -- > > Key: FLINK-17180 > URL: https://issues.apache.org/jira/browse/FLINK-17180 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Coordination >Reporter: Gary Yao >Assignee: Gary Yao >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > > Implement {{Topology#getAllPipelinedRegions()}} and > {{Topology#getPipelinedRegionOfVertex(ExecutionVertexID)}} in > {{DefaultExecutionTopology}} to enable retrieval of pipelined regions. -- This message was sent by Atlassian Jira (v8.3.4#803005)
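The retrieval API described in the issue can be sketched roughly as below: every vertex belongs to exactly one pipelined region, and regions can be listed in full or looked up per vertex. The types and names here are simplified stand-ins (plain strings instead of ExecutionVertexID, sets instead of region objects), not Flink's actual SchedulingTopology interfaces.

```java
import java.util.*;

/** Simplified sketch of pipelined-region retrieval. Illustrative only:
 *  Flink's real Topology/DefaultExecutionTopology carry richer types. */
public class PipelinedRegionSketch {
    // Maps each vertex id to the (shared) set of vertices in its region.
    private final Map<String, Set<String>> regionOfVertex = new HashMap<>();

    /** Registers a region given by its member vertex ids. */
    public void addRegion(Set<String> vertices) {
        for (String v : vertices) {
            regionOfVertex.put(v, vertices);
        }
    }

    /** Analogue of Topology#getAllPipelinedRegions(). */
    public Set<Set<String>> getAllPipelinedRegions() {
        return new HashSet<>(regionOfVertex.values());
    }

    /** Analogue of Topology#getPipelinedRegionOfVertex(ExecutionVertexID). */
    public Set<String> getPipelinedRegionOfVertex(String vertexId) {
        Set<String> region = regionOfVertex.get(vertexId);
        if (region == null) {
            throw new IllegalArgumentException("Unknown vertex: " + vertexId);
        }
        return region;
    }

    public static void main(String[] args) {
        PipelinedRegionSketch topology = new PipelinedRegionSketch();
        topology.addRegion(new HashSet<>(Arrays.asList("v1", "v2")));
        topology.addRegion(new HashSet<>(Arrays.asList("v3")));
        System.out.println(topology.getAllPipelinedRegions().size());          // prints "2"
        System.out.println(topology.getPipelinedRegionOfVertex("v1").contains("v2")); // prints "true"
    }
}
```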
[jira] [Commented] (FLINK-17328) Expose network metric for job vertex in rest api
[ https://issues.apache.org/jira/browse/FLINK-17328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091368#comment-17091368 ] Gary Yao commented on FLINK-17328: -- I assigned you but I cannot promise a timely review at the moment. > Expose network metric for job vertex in rest api > > > Key: FLINK-17328 > URL: https://issues.apache.org/jira/browse/FLINK-17328 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Metrics, Runtime / REST >Reporter: lining >Assignee: lining >Priority: Major > > JobVertexDetailsHandler > * pool usage: outPoolUsageAvg, inputExclusiveBuffersUsageAvg, > inputFloatingBuffersUsageAvg > * back-pressured: shows whether the vertex is back pressured (merged over all of its subtasks) -- This message was sent by Atlassian Jira (v8.3.4#803005)