[jira] [Commented] (FLINK-12312) Temporarily disable CLI command for rescaling

2022-09-20 Thread Gary Yao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607104#comment-17607104
 ] 

Gary Yao commented on FLINK-12312:
--

[~Zhanghao Chen] I updated the link in the issue. Note that the feature was 
disabled 3 years ago so the code might be in a state that would allow 
re-implementing this feature.

> Temporarily disable CLI command for rescaling
> -
>
> Key: FLINK-12312
> URL: https://issues.apache.org/jira/browse/FLINK-12312
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.9.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Temporarily remove support to rescale job via CLI. See this thread for more 
> details: https://www.mail-archive.com/dev@flink.apache.org/msg25266.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-12312) Temporarily disable CLI command for rescaling

2022-09-20 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-12312:
-
Description: Temporarily remove support to rescale job via CLI. See this 
thread for more details: 
https://www.mail-archive.com/dev@flink.apache.org/msg25266.html  (was: 
Temporarily remove support to rescale job via CLI. See this thread for more 
details: https://lists.apache.org/thread/oby7fmz9crphonxw3l0g8b9zvybg3sno)

> Temporarily disable CLI command for rescaling
> -
>
> Key: FLINK-12312
> URL: https://issues.apache.org/jira/browse/FLINK-12312
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.9.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Temporarily remove support to rescale job via CLI. See this thread for more 
> details: https://www.mail-archive.com/dev@flink.apache.org/msg25266.html





[jira] [Closed] (FLINK-17463) BlobCacheCleanupTest.testPermanentBlobCleanup:133->verifyJobCleanup:432 » FileAlreadyExists

2020-05-29 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao closed FLINK-17463.

Resolution: Fixed

1.11: 647f76283c900048e12361cf96d26db2a184b10b
master: c22d01d3bfbb1384f98664361f1491b806e95798

> BlobCacheCleanupTest.testPermanentBlobCleanup:133->verifyJobCleanup:432 » 
> FileAlreadyExists
> ---
>
> Key: FLINK-17463
> URL: https://issues.apache.org/jira/browse/FLINK-17463
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.11.0
>Reporter: Robert Metzger
>Assignee: Gary Yao
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.11.0
>
>
> CI run: 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=317&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=4ed44b66-cdd6-5dcf-5f6a-88b07dda665d
> {code}
> [ERROR] Tests run: 5, Failures: 0, Errors: 1, Skipped: 1, Time elapsed: 2.73 
> s <<< FAILURE! - in org.apache.flink.runtime.blob.BlobCacheCleanupTest
> [ERROR] 
> testPermanentBlobCleanup(org.apache.flink.runtime.blob.BlobCacheCleanupTest)  
> Time elapsed: 2.028 s  <<< ERROR!
> java.nio.file.FileAlreadyExistsException: 
> /tmp/junit7984674749832216773/junit1629420330972938723/blobStore-296d1a51-8917-4db1-a920-5d4e17e6fa36/job_3bafac5425979b4fe2fa2c7726f8dd5b
>   at 
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:88)
>   at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>   at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
>   at 
> sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384)
>   at java.nio.file.Files.createDirectory(Files.java:674)
>   at java.nio.file.Files.createAndCheckIsDirectory(Files.java:781)
>   at java.nio.file.Files.createDirectories(Files.java:727)
>   at 
> org.apache.flink.runtime.blob.BlobUtils.getStorageLocation(BlobUtils.java:196)
>   at 
> org.apache.flink.runtime.blob.PermanentBlobCache.getStorageLocation(PermanentBlobCache.java:222)
>   at 
> org.apache.flink.runtime.blob.BlobServerCleanupTest.checkFilesExist(BlobServerCleanupTest.java:213)
>   at 
> org.apache.flink.runtime.blob.BlobCacheCleanupTest.verifyJobCleanup(BlobCacheCleanupTest.java:432)
>   at 
> org.apache.flink.runtime.blob.BlobCacheCleanupTest.testPermanentBlobCleanup(BlobCacheCleanupTest.java:133)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17463) BlobCacheCleanupTest.testPermanentBlobCleanup:133->verifyJobCleanup:432 » FileAlreadyExists

2020-05-27 Thread Gary Yao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117735#comment-17117735
 ] 

Gary Yao commented on FLINK-17463:
--

This is most likely caused by concurrent calls of
{code}
Files.createDirectories(...);
{code}

and 
{code}
FileUtils.deleteDirectory(...);
{code}

with the same arguments.
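
A minimal sketch of how a caller could tolerate this race. As far as the stack trace shows, {{Files.createDirectories}} rethrows {{FileAlreadyExistsException}} when the directory exists at creation time but is no longer a directory when it is re-checked, which a concurrent delete can cause. The helper name {{createDirectoriesWithRetry}} and the attempt limit are hypothetical, and this is not necessarily the fix that was merged.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

final class CreateDirRetry {

    /**
     * Hypothetical helper: retries Files.createDirectories a bounded number of
     * times when it fails with FileAlreadyExistsException, which can happen if
     * another thread deletes the directory between creation and the
     * is-directory check.
     */
    static Path createDirectoriesWithRetry(Path dir, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        FileAlreadyExistsException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return Files.createDirectories(dir);
            } catch (FileAlreadyExistsException e) {
                // Possibly a concurrent delete; remember the failure and try again.
                last = e;
            }
        }
        // All attempts failed, e.g. because the path is a regular file.
        throw last;
    }
}
```

Retrying only hides the symptom; serializing the create and delete calls on the same path would remove the race itself.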


> BlobCacheCleanupTest.testPermanentBlobCleanup:133->verifyJobCleanup:432 » 
> FileAlreadyExists
> ---
>
> Key: FLINK-17463
> URL: https://issues.apache.org/jira/browse/FLINK-17463
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.11.0
>Reporter: Robert Metzger
>Assignee: Gary Yao
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.11.0
>
>





[jira] [Assigned] (FLINK-17463) BlobCacheCleanupTest.testPermanentBlobCleanup:133->verifyJobCleanup:432 » FileAlreadyExists

2020-05-27 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao reassigned FLINK-17463:


Assignee: Gary Yao

> BlobCacheCleanupTest.testPermanentBlobCleanup:133->verifyJobCleanup:432 » 
> FileAlreadyExists
> ---
>
> Key: FLINK-17463
> URL: https://issues.apache.org/jira/browse/FLINK-17463
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.11.0
>Reporter: Robert Metzger
>Assignee: Gary Yao
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.11.0
>
>





[jira] [Assigned] (FLINK-13553) KvStateServerHandlerTest.readInboundBlocking unstable on Travis

2020-05-25 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao reassigned FLINK-13553:


Assignee: (was: Gary Yao)

> KvStateServerHandlerTest.readInboundBlocking unstable on Travis
> ---
>
> Key: FLINK-13553
> URL: https://issues.apache.org/jira/browse/FLINK-13553
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Queryable State
>Affects Versions: 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: pull-request-available, test-stability
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The {{KvStateServerHandlerTest.readInboundBlocking}} and 
> {{KvStateServerHandlerTest.testQueryExecutorShutDown}} fail on Travis with a 
> {{TimeoutException}}.
> https://api.travis-ci.org/v3/job/566420641/log.txt





[jira] [Commented] (FLINK-16468) BlobClient rapid retrieval retries on failure opens too many sockets

2020-05-23 Thread Gary Yao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17114792#comment-17114792
 ] 

Gary Yao commented on FLINK-16468:
--

[~longtimer] No problem, take care!

> BlobClient rapid retrieval retries on failure opens too many sockets
> 
>
> Key: FLINK-16468
> URL: https://issues.apache.org/jira/browse/FLINK-16468
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.8.3, 1.9.2, 1.10.0
> Environment: Linux ubuntu servers running, patch current latest 
> Ubuntu patch current release java 8 JRE
>Reporter: Jason Kania
>Priority: Major
> Fix For: 1.11.0
>
>
> In situations where the BlobClient retrieval fails as in the following log, 
> rapid retries will exhaust the open sockets. All the retries happen within a 
> few milliseconds.
> {noformat}
> 2020-03-06 17:19:07,116 ERROR org.apache.flink.runtime.blob.BlobClient - 
> Failed to fetch BLOB 
> cddd17ef76291dd60eee9fd36085647a/p-bcd61652baba25d6863cf17843a2ef64f4c801d5-c1781532477cf65ff1c1e7d72dccabc7
>  from aaa-1/10.0.1.1:45145 and store it under 
> /tmp/blobStore-7328ed37-8bc7-4af7-a56c-474e264157c9/incoming/temp-0004 
> Retrying...
> {noformat}
> The above is output repeatedly until the following error occurs:
> {noformat}
> java.io.IOException: Could not connect to BlobServer at address 
> aaa-1/10.0.1.1:45145
>  at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:100)
>  at 
> org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:143)
>  at 
> org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
>  at 
> org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:202)
>  at 
> org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
>  at 
> org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:915)
>  at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:595)
>  at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
>  at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.SocketException: Too many open files
>  at java.net.Socket.createImpl(Socket.java:478)
>  at java.net.Socket.connect(Socket.java:605)
>  at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:95)
>  ... 8 more
> {noformat}
>  The retries should have some form of backoff in this situation to avoid 
> flooding the logs and exhausting other resources on the server.
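
The backoff requested above could look roughly like the following sketch. It is not Flink's BlobClient code: the class {{BackoffRetry}}, its method names, and the delay parameters are all hypothetical, and it only illustrates capped exponential backoff between attempts that fail with an IOException.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

final class BackoffRetry {

    /** Delay before the retry following the given zero-based attempt: base * 2^attempt, capped at maxMillis. */
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis;
        for (int i = 0; i < attempt; i++) {
            if (delay > maxMillis / 2) {
                return maxMillis; // doubling again would exceed the cap
            }
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    /** Runs op up to maxAttempts times, sleeping with exponential backoff after each IOException. */
    static <T> T retry(Callable<T> op, int maxAttempts, long baseMillis, long maxMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (IOException e) {
                last = e;
                if (attempt + 1 < maxAttempts) {
                    // Wait before the next attempt instead of retrying immediately.
                    Thread.sleep(delayMillis(attempt, baseMillis, maxMillis));
                }
            }
        }
        throw last;
    }
}
```

With a base of 100 ms and a cap of a few seconds, the retries would be spread over seconds rather than milliseconds, which bounds both log volume and the rate at which new sockets are opened.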





[jira] [Commented] (FLINK-13553) KvStateServerHandlerTest.readInboundBlocking unstable on Travis

2020-05-22 Thread Gary Yao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17114285#comment-17114285
 ] 

Gary Yao commented on FLINK-13553:
--

Added more debug/trace logs.

1.11:
7cbdd91413ee26d00d9015581ce2fa8538fd5963
3e18c109051821176575a15a6b10aaa5cc2e3e12 

master: 
564e8802a8f1a8c92d3c46686b109dfb826856fe
0cc7aae86dfdb5e51c661620d39caa79a16fd647



> KvStateServerHandlerTest.readInboundBlocking unstable on Travis
> ---
>
> Key: FLINK-13553
> URL: https://issues.apache.org/jira/browse/FLINK-13553
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Queryable State
>Affects Versions: 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Assignee: Gary Yao
>Priority: Critical
>  Labels: pull-request-available, test-stability
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The {{KvStateServerHandlerTest.readInboundBlocking}} and 
> {{KvStateServerHandlerTest.testQueryExecutorShutDown}} fail on Travis with a 
> {{TimeoutException}}.
> https://api.travis-ci.org/v3/job/566420641/log.txt





[jira] [Updated] (FLINK-16649) Support Java 14

2020-05-22 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-16649:
-
Description: Resolve issues occurring when using Flink with Java 14.

> Support Java 14
> ---
>
> Key: FLINK-16649
> URL: https://issues.apache.org/jira/browse/FLINK-16649
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System
>Reporter: Chesnay Schepler
>Priority: Major
>
> Resolve issues occurring when using Flink with Java 14.





[jira] [Commented] (FLINK-16619) Misleading SlotManagerImpl logging for slot reports of unknown task manager

2020-05-22 Thread Gary Yao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113922#comment-17113922
 ] 

Gary Yao commented on FLINK-16619:
--

The proposal sounds reasonable to me. 

> Misleading SlotManagerImpl logging for slot reports of unknown task manager
> ---
>
> Key: FLINK-16619
> URL: https://issues.apache.org/jira/browse/FLINK-16619
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: Chesnay Schepler
>Priority: Major
>
> If the SlotManager receives a slot report from an unknown task manager it 
> logs 2 messages:
> {code}
> public boolean reportSlotStatus(InstanceID instanceId, SlotReport slotReport) 
> {
>   [...]
>   LOG.debug("Received slot report from instance {}: {}.", instanceId, 
> slotReport);
>   TaskManagerRegistration taskManagerRegistration = 
> taskManagerRegistrations.get(instanceId);
>   if (null != taskManagerRegistration) {
>   [...]
>   } else {
>   LOG.debug("Received slot report for unknown task manager with 
> instance id {}. Ignoring this report.", instanceId);
>   [...]
>   }
> }
> {code}
> This leads to misleading output since it appears like the slot manager 
> received 2 separate slot reports, with the first being for a known instance, 
> the latter for an unknown one. This cost some time as I couldn't figure out 
> why the "latter" report was suddenly being rejected.
> I propose moving the first debug message into the non-null branch.
> [~trohrmann] WDYT?
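
The proposed change can be illustrated with a simplified, self-contained sketch. It does not use Flink's real types: instance IDs and slot reports are plain strings, the registration map and the in-memory {{debugLog}} list stand in for the SlotManager state and the logger, and the boolean return values are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SlotReportLoggingSketch {

    // Stand-ins for the SlotManager's registrations and its debug logger.
    final Map<String, String> taskManagerRegistrations = new HashMap<>();
    final List<String> debugLog = new ArrayList<>();

    boolean reportSlotStatus(String instanceId, String slotReport) {
        String registration = taskManagerRegistrations.get(instanceId);
        if (registration != null) {
            // The "received" message is logged only for known task managers.
            debugLog.add("Received slot report from instance " + instanceId + ": " + slotReport + ".");
            return true;
        } else {
            // Unknown task managers produce exactly one message.
            debugLog.add("Received slot report for unknown task manager with instance id "
                    + instanceId + ". Ignoring this report.");
            return false;
        }
    }
}
```

With this arrangement each slot report yields a single debug line, so a rejected report no longer looks like two separate reports.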





[jira] [Commented] (FLINK-16468) BlobClient rapid retrieval retries on failure opens too many sockets

2020-05-22 Thread Gary Yao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113920#comment-17113920
 ] 

Gary Yao commented on FLINK-16468:
--

Is there any news, [~longtimer]?

> BlobClient rapid retrieval retries on failure opens too many sockets
> 
>
> Key: FLINK-16468
> URL: https://issues.apache.org/jira/browse/FLINK-16468
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.8.3, 1.9.2, 1.10.0
> Environment: Linux ubuntu servers running, patch current latest 
> Ubuntu patch current release java 8 JRE
>Reporter: Jason Kania
>Priority: Major
> Fix For: 1.11.0
>
>





[jira] [Comment Edited] (FLINK-16605) Add max limitation to the total number of slots

2020-05-22 Thread Gary Yao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113840#comment-17113840
 ] 

Gary Yao edited comment on FLINK-16605 at 5/22/20, 8:16 AM:


[~karmagyz] We add release notes for new features only in exceptional cases 
(see 
https://ci.apache.org/projects/flink/flink-docs-release-1.10/release-notes/flink-1.10.html).
 If you think it's justified, you can add a release note, but it is at the 
discretion of the release manager to decide whether it will be included or not.


was (Author: gjy):
[~karmagyz] Only in exceptional cases we add release notes for new features 
(see 
https://ci.apache.org/projects/flink/flink-docs-release-1.10/release-notes/flink-1.10.html).
 If you think it's justified, you can add a release note but it is in the 
discretion of the release manager to decide whether it will be included or not.

> Add max limitation to the total number of slots
> ---
>
> Key: FLINK-16605
> URL: https://issues.apache.org/jira/browse/FLINK-16605
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Coordination
>Reporter: Yangze Guo
>Assignee: Yangze Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> As discussed in FLINK-15527 and FLINK-15959, we propose to add the max limit 
> to the total number of slots.
> To be specific:
> - Introduce "cluster.number-of-slots.max" configuration option with default 
> value MAX_INT
> - Make the SlotManager respect the max number of slots; once the limit is 
> reached, it no longer allocates resources.
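
A minimal sketch of the limit check described in the two bullets above. Only the option name "cluster.number-of-slots.max" and its MAX_INT default come from the issue; the class {{SlotLimitSketch}} and the method {{canAllocate}} are hypothetical and do not reflect the SlotManager's actual API.

```java
final class SlotLimitSketch {

    // Default for "cluster.number-of-slots.max" as proposed in the issue.
    static final int DEFAULT_MAX_SLOTS = Integer.MAX_VALUE;

    /** Returns true if requestedSlots more slots fit under maxSlots given currentSlots already registered. */
    static boolean canAllocate(int currentSlots, int requestedSlots, int maxSlots) {
        if (currentSlots < 0 || requestedSlots < 0 || maxSlots < 0) {
            throw new IllegalArgumentException("slot counts must be non-negative");
        }
        // Written as a subtraction so that the check cannot overflow int.
        return requestedSlots <= maxSlots - currentSlots;
    }
}
```

With the default limit the check only rejects requests that would push the total past Integer.MAX_VALUE, so clusters that do not set the option see no change in behavior.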





[jira] [Commented] (FLINK-16605) Add max limitation to the total number of slots

2020-05-22 Thread Gary Yao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113840#comment-17113840
 ] 

Gary Yao commented on FLINK-16605:
--

[~karmagyz] Only in exceptional cases we add release notes for new features 
(see 
https://ci.apache.org/projects/flink/flink-docs-release-1.10/release-notes/flink-1.10.html).
 If you think it's justified, you can add a release note but it is in the 
discretion of the release manager to decide whether it will be included or not.

> Add max limitation to the total number of slots
> ---
>
> Key: FLINK-16605
> URL: https://issues.apache.org/jira/browse/FLINK-16605
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Coordination
>Reporter: Yangze Guo
>Assignee: Yangze Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> As discussed in FLINK-15527 and FLINK-15959, we propose to add the max limit 
> to the total number of slots.
> To be specific:
> - Introduce "cluster.number-of-slots.max" configuration option with default 
> value MAX_INT
> - Make the SlotManager respect the max number of slots; once the limit is 
> reached, it no longer allocates resources.





[jira] [Closed] (FLINK-17794) Tear down installed software in reverse order in Jepsen Tests

2020-05-21 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao closed FLINK-17794.

Resolution: Fixed

1.11: 6a4714fdeff96d54db5fde5fac9b0eb355886b47
master: 2b2c574f102689b3cde9deac0bd1bcf78ad7ebc7

> Tear down installed software in reverse order in Jepsen Tests
> -
>
> Key: FLINK-17794
> URL: https://issues.apache.org/jira/browse/FLINK-17794
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.10.1, 1.11.0
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> Tear down installed software in reverse order in Jepsen tests. This mitigates 
> the issue that sometimes YARN's NodeManager directories cannot be removed 
> using {{rm -rf}} because Flink processes keep running and generate files 
> after the YARN NodeManager is shut down. {{rm -r}} removes files recursively 
> but if files are created in the background concurrently, the command can 
> still fail with a non-zero exit code.
> {noformat}
> sh -c \"cd /; rm -rf /opt/hadoop\"", :exit 1, :out "", :err "rm: cannot 
> remove 
> '/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-3587fdbb-15be-4482-94f2-338bfe6b1acc/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_5271c210329e73bd743f3227edfb3b71__27_30__uuid_02dbbf1e-d2d5-43e8-ab34-040345f96476/db':
>  Directory not empty\nrm: cannot remove 
> '/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-d14f2078-74ee-4b8b-aafe-4299577f214f/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_7d23c6ceabda05a587f0217e44f21301__17_30__uuid_2de2b67d-0767-4e32-99f0-ddd291460947/db':
>  Directory not empty
> {noformat}
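
The reverse-order teardown described above can be sketched as a stack of teardown actions. The Jepsen tests themselves are not written in Java; the class {{ReverseTeardownSketch}} and its methods are hypothetical and only show the ordering idea: whatever was installed last (for example Flink on top of Hadoop) is torn down first, so it can no longer write into directories that an earlier component's teardown removes.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class ReverseTeardownSketch {

    // Teardown actions, most recently installed component on top.
    private final Deque<Runnable> teardownStack = new ArrayDeque<>();

    /** Runs the setup action and remembers how to undo it. */
    void install(Runnable setup, Runnable teardown) {
        setup.run();
        teardownStack.push(teardown);
    }

    /** Tears components down in the reverse of their installation order. */
    void teardownAll() {
        while (!teardownStack.isEmpty()) {
            teardownStack.pop().run();
        }
    }
}
```

Installing Hadoop and then Flink through {{install}} means {{teardownAll}} stops Flink before it removes the Hadoop directories.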





[jira] [Issue Comment Deleted] (FLINK-13553) KvStateServerHandlerTest.readInboundBlocking unstable on Travis

2020-05-19 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-13553:
-
Comment: was deleted

(was: Root cause in the new case is:
{code}
java.lang.IllegalStateException: Version Mismatch:  Found 123238213, Expected: 
2040641296.
at 
org.apache.flink.util.Preconditions.checkState(Preconditions.java:195) 
~[flink-core-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
at 
org.apache.flink.queryablestate.network.messages.MessageSerializer.deserializeHeader(MessageSerializer.java:232)
 ~[flink-queryable-state-client-java-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
at 
org.apache.flink.queryablestate.network.AbstractServerHandler.channelRead(AbstractServerHandler.java:110)
 [flink-queryable-state-client-java-1.11-SNAPSHOT.jar:?]
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
 [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
 [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
 [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
at 
org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:328)
 [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
at 
org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:302)
 [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
 [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
 [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
 [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
at 
org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1421)
 [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
 [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
 [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
at 
org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930)
 [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
at 
org.apache.flink.shaded.netty4.io.netty.channel.embedded.EmbeddedChannel.writeInbound(EmbeddedChannel.java:343)
 [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
at 
org.apache.flink.queryablestate.network.KvStateServerHandlerTest.testUnexpectedMessage(KvStateServerHandlerTest.java:491)
 [test-classes/:?]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
~[?:1.8.0_242]
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
~[?:1.8.0_242]
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 ~[?:1.8.0_242]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_242]
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
 [junit-4.12.jar:4.12]
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
 [junit-4.12.jar:4.12]
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
 [junit-4.12.jar:4.12]
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
 [junit-4.12.jar:4.12]
at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) 
[junit-4.12.jar:4.12]
at org.junit.rules.RunRules.evaluate(RunRules.java:20) 
[junit-4.12.jar:4.12]
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) 
[junit-4.12.jar:4.12]
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
 [junit-4.12.jar:4.12]
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
 [junit-4.12.jar:4.12]
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) 
[junit-4.12.jar:4.12]
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) 

[jira] [Commented] (FLINK-13553) KvStateServerHandlerTest.readInboundBlocking unstable on Travis

2020-05-19 Thread Gary Yao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1724#comment-1724
 ] 

Gary Yao commented on FLINK-13553:
--

Root cause in the new case is:
{code}
java.lang.IllegalStateException: Version Mismatch:  Found 123238213, Expected: 2040641296.
    at org.apache.flink.util.Preconditions.checkState(Preconditions.java:195) ~[flink-core-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at org.apache.flink.queryablestate.network.messages.MessageSerializer.deserializeHeader(MessageSerializer.java:232) ~[flink-queryable-state-client-java-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at org.apache.flink.queryablestate.network.AbstractServerHandler.channelRead(AbstractServerHandler.java:110) [flink-queryable-state-client-java-1.11-SNAPSHOT.jar:?]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
    at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:328) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
    at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:302) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1421) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
    at org.apache.flink.shaded.netty4.io.netty.channel.embedded.EmbeddedChannel.writeInbound(EmbeddedChannel.java:343) [flink-shaded-netty-4.1.39.Final-10.0.jar:?]
    at org.apache.flink.queryablestate.network.KvStateServerHandlerTest.testUnexpectedMessage(KvStateServerHandlerTest.java:491) [test-classes/:?]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_242]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_242]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_242]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_242]
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) [junit-4.12.jar:4.12]
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) [junit-4.12.jar:4.12]
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) [junit-4.12.jar:4.12]
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) [junit-4.12.jar:4.12]
    at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) [junit-4.12.jar:4.12]
    at org.junit.rules.RunRules.evaluate(RunRules.java:20) [junit-4.12.jar:4.12]
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) [junit-4.12.jar:4.12]
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) [junit-4.12.jar:4.12]
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) [junit-4.12.jar:4.12]
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) [junit-4.12.jar:4.12]
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) 

[jira] [Assigned] (FLINK-17194) TPC-DS end-to-end test fails due to missing execution attempt

2020-05-19 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao reassigned FLINK-17194:


Assignee: Gary Yao

> TPC-DS end-to-end test fails due to missing execution attempt
> -
>
> Key: FLINK-17194
> URL: https://issues.apache.org/jira/browse/FLINK-17194
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.11.0
>Reporter: Chesnay Schepler
>Assignee: Gary Yao
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.11.0
>
>
> [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=7567=logs=c88eea3b-64a0-564d-0031-9fdcd7b8abee=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5]
> {code:java}
> org.apache.flink.runtime.jobmaster.ExecutionGraphException: The execution 
> attempt d6bef26867c04f1c94903b06b60ec55f was not found.
>   at 
> org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:389)
>  ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-17194) TPC-DS end-to-end test fails due to missing execution attempt

2020-05-19 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao reassigned FLINK-17194:


Assignee: (was: Gary Yao)

> TPC-DS end-to-end test fails due to missing execution attempt
> -
>
> Key: FLINK-17194
> URL: https://issues.apache.org/jira/browse/FLINK-17194
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.11.0
>Reporter: Chesnay Schepler
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.11.0
>
>
> [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=7567=logs=c88eea3b-64a0-564d-0031-9fdcd7b8abee=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5]
> {code:java}
> org.apache.flink.runtime.jobmaster.ExecutionGraphException: The execution 
> attempt d6bef26867c04f1c94903b06b60ec55f was not found.
>   at 
> org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:389)
>  ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
>  {code}





[jira] [Assigned] (FLINK-13553) KvStateServerHandlerTest.readInboundBlocking unstable on Travis

2020-05-19 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao reassigned FLINK-13553:


Assignee: Gary Yao

> KvStateServerHandlerTest.readInboundBlocking unstable on Travis
> ---
>
> Key: FLINK-13553
> URL: https://issues.apache.org/jira/browse/FLINK-13553
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Queryable State
>Affects Versions: 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Assignee: Gary Yao
>Priority: Critical
>  Labels: pull-request-available, test-stability
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The {{KvStateServerHandlerTest.readInboundBlocking}} and 
> {{KvStateServerHandlerTest.testQueryExecutorShutDown}} fail on Travis with a 
> {{TimeoutException}}.
> https://api.travis-ci.org/v3/job/566420641/log.txt





[jira] [Closed] (FLINK-17687) Collect TaskManager logs in Mesos Jepsen Tests

2020-05-18 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao closed FLINK-17687.

Resolution: Fixed

1.11:
656d56e99e3c158c7252db04bc034cce77ad39ba
aa2a5709309ef8149607cc6ac696cd990a8aef81
2aa45cdcc88d75837df165fcde71200d796deee7

master:
be8c02e397943d668c4ff64e4c491a560136e2e1
12d662c9da2dc3e18fdd3d752ddeeb07df1f5945
ed74173c087fe879f5728b810e204bafb69bdae6

> Collect TaskManager logs in Mesos Jepsen Tests
> --
>
> Key: FLINK-17687
> URL: https://issues.apache.org/jira/browse/FLINK-17687
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.11.0
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> TM logs are collected in standalone mode and YARN. However, for Mesos tests, 
> TM logs are not collected at the end of the test. We should download all log 
> files generated in the Mesos agent directories.





[jira] [Updated] (FLINK-17687) Collect TaskManager logs in Mesos Jepsen Tests

2020-05-18 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-17687:
-
Description: TM logs are collected in standalone mode and YARN. However, 
for Mesos tests, TM logs are not collected at the end of the test. We should 
download all log files generated in the Mesos agent directories.

> Collect TaskManager logs in Mesos Jepsen Tests
> --
>
> Key: FLINK-17687
> URL: https://issues.apache.org/jira/browse/FLINK-17687
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.11.0
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> TM logs are collected in standalone mode and YARN. However, for Mesos tests, 
> TM logs are not collected at the end of the test. We should download all log 
> files generated in the Mesos agent directories.





[jira] [Updated] (FLINK-17777) Make Mesos Jepsen Tests pass with Hadoop-free Flink

2020-05-18 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-17777:
-
Fix Version/s: 1.11.0

> Make Mesos Jepsen Tests pass with Hadoop-free Flink 
> 
>
> Key: FLINK-17777
> URL: https://issues.apache.org/jira/browse/FLINK-17777
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.11.0
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> Since FLINK-11086, we can no longer build a Flink distribution with Hadoop. 
> Therefore, we need to set the {{HADOOP_CLASSPATH}} environment variable for 
> the TM processes.
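
As a rough illustration of the description above, a launcher could hand {{HADOOP_CLASSPATH}} to a child process through its environment. This is only a sketch: the class name TmLauncherSketch, the method name, and the classpath value used in the test are hypothetical, and the real Jepsen tests do not necessarily set the variable this way.

```java
import java.util.Map;

// Hypothetical sketch; not taken from the Flink Jepsen test code.
final class TmLauncherSketch {

    // Returns a ProcessBuilder for the given command with HADOOP_CLASSPATH
    // added to the child's environment. The parent's environment is untouched.
    static ProcessBuilder withHadoopClasspath(String hadoopClasspath, String... command) {
        ProcessBuilder builder = new ProcessBuilder(command);
        Map<String, String> env = builder.environment();
        env.put("HADOOP_CLASSPATH", hadoopClasspath);
        return builder;
    }
}
```

Calling start() on the returned builder would launch the process with the variable visible to it.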





[jira] [Closed] (FLINK-17777) Make Mesos Jepsen Tests pass with Hadoop-free Flink

2020-05-18 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao closed FLINK-17777.

Resolution: Fixed

1.11: f46735cb4963af616c0e8538331bed8739a1d353
master: 81ffe8a271a3d4bf7867f7b8b75ffc4cc6707d85

> Make Mesos Jepsen Tests pass with Hadoop-free Flink 
> 
>
> Key: FLINK-17777
> URL: https://issues.apache.org/jira/browse/FLINK-17777
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.11.0
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Major
>  Labels: pull-request-available
>
> Since FLINK-11086, we can no longer build a Flink distribution with Hadoop. 
> Therefore, we need to set the {{HADOOP_CLASSPATH}} environment variable for 
> the TM processes.





[jira] [Updated] (FLINK-17687) Collect TaskManager logs in Mesos Jepsen Tests

2020-05-18 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-17687:
-
Issue Type: Bug  (was: Improvement)

> Collect TaskManager logs in Mesos Jepsen Tests
> --
>
> Key: FLINK-17687
> URL: https://issues.apache.org/jira/browse/FLINK-17687
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.11.0
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>






[jira] [Closed] (FLINK-17792) Failing to invoking jstack on TM processes should not fail Jepsen Tests

2020-05-18 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao closed FLINK-17792.

Resolution: Fixed

1.11: d8a77cbf93007bf970963a4499aa06501c0d9808
master: 417936d8722c7b466f22bc13b9063e7298e0cbd6

> Failing to invoking jstack on TM processes should not fail Jepsen Tests
> ---
>
> Key: FLINK-17792
> URL: https://issues.apache.org/jira/browse/FLINK-17792
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.10.1, 1.11.0
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> {{jstack}} can fail if the JVM process exits prematurely while or before we 
> invoke {{jstack}}. If {{jstack}} fails, the exception propagates and exits 
> the Jepsen Tests prematurely.
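
The general idea of tolerating a failed {{jstack}} call can be sketched as follows. This is an assumption-laden illustration, not the actual fix: the class ThreadDumpSketch and its methods are hypothetical, and it simply treats "cannot start" or "non-zero exit code" as "no thread dump available" instead of letting an exception propagate.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical sketch; names are illustrative only.
final class ThreadDumpSketch {

    // Runs the command and returns its combined stdout/stderr, or
    // Optional.empty() if it cannot be started or exits with a non-zero code.
    static Optional<String> tryRun(String... command) {
        try {
            Process process = new ProcessBuilder(command).redirectErrorStream(true).start();
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (InputStream in = process.getInputStream()) {
                byte[] chunk = new byte[4096];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
            }
            int exitCode = process.waitFor();
            return exitCode == 0
                    ? Optional.of(new String(buffer.toByteArray(), StandardCharsets.UTF_8))
                    : Optional.empty();
        } catch (IOException e) {
            // The command could not be started or its output could not be read.
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        }
    }

    // Best-effort thread dump: an empty result means jstack failed, for
    // example because the JVM had already exited.
    static Optional<String> tryThreadDump(long pid) {
        return tryRun("jstack", Long.toString(pid));
    }
}
```

A caller can log the empty case and continue with the remaining teardown steps.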





[jira] [Commented] (FLINK-17676) Is there some way to rollback the .out file of TaskManager

2020-05-18 Thread Gary Yao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110192#comment-17110192
 ] 

Gary Yao commented on FLINK-17676:
--

[~rmetzger] I think Command Line Client is not the right component.

> Is there some way to rollback the .out file of TaskManager
> --
>
> Key: FLINK-17676
> URL: https://issues.apache.org/jira/browse/FLINK-17676
> Project: Flink
>  Issue Type: Improvement
>  Components: Command Line Client
>Reporter: JieFang.He
>Priority: Major
>
> When using the .print() API, all results are written to the .out file, but 
> there is no way to roll over (rotate) the .out file.
>  
> The .out redirection in flink-daemon.sh:
> {code:java}
> // $JAVA_RUN $JVM_ARGS ${FLINK_ENV_JAVA_OPTS} "${log_setting[@]}" -classpath 
> "`manglePathList "$FLINK_TM_CLASSPATH:$INTERNAL_HADOOP_CLASSPATHS"`" 
> ${CLASS_TO_RUN} "${ARGS[@]}" > "$out" 200<&- 2>&1 < /dev/null &
> {code}
>  





[jira] [Commented] (FLINK-15813) Set default value of jobmanager.execution.failover-strategy to region

2020-05-18 Thread Gary Yao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110130#comment-17110130
 ] 

Gary Yao commented on FLINK-15813:
--

This issue is probably only for the documentation page and doesn't require 
changing tests.

{{FailoverStrategyFactoryLoader}} already uses "region" as the default value:
{code}
// the default NG failover strategy is the region failover strategy.
// TODO: Remove the overridden default value when removing legacy scheduler
//  and change the default value of JobManagerOptions.EXECUTION_FAILOVER_STRATEGY
//  to be "region"
final String strategyParam = config.getString(
    JobManagerOptions.EXECUTION_FAILOVER_STRATEGY,
    PIPELINED_REGION_RESTART_STRATEGY_NAME);
{code}
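
The pattern in the snippet above, passing a fallback at the call site so that the effective default differs from the option's declared default, can be shown without any Flink dependency. The sketch below uses a plain Map in place of Flink's Configuration; the class FailoverStrategyLookupSketch is hypothetical, and only the key and the value "region" come from this issue.

```java
import java.util.Map;

// Hypothetical stand-in for the lookup; uses a Map, not Flink's Configuration.
final class FailoverStrategyLookupSketch {

    static final String KEY = "jobmanager.execution.failover-strategy";

    // Fallback supplied by the caller, overriding whatever the option declares.
    static final String OVERRIDDEN_DEFAULT = "region";

    // Returns the explicitly configured value, or "region" if the key is absent.
    static String resolve(Map<String, String> config) {
        return config.getOrDefault(KEY, OVERRIDDEN_DEFAULT);
    }
}
```

With this shape, an explicitly configured value always wins, and the override only applies when the user has set nothing.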

> Set default value of jobmanager.execution.failover-strategy to region
> -
>
> Key: FLINK-15813
> URL: https://issues.apache.org/jira/browse/FLINK-15813
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: Till Rohrmann
>Assignee: Zhu Zhu
>Priority: Blocker
>  Labels: pull-request-available, usability
> Fix For: 1.11.0
>
>
> We should set the default value of {{jobmanager.execution.failover-strategy}} 
> to {{region}}. This might require adapting existing tests to make them pass.





[jira] [Updated] (FLINK-16303) Add Rest Handler to list JM Logfiles and enable reading Logs by Filename

2020-05-18 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-16303:
-
Release Note: Requesting an unavailable log or stdout file from the 
JobManager's HTTP server returns status code 404 now. In previous releases, the 
HTTP server would return a file with '(file unavailable)' as its content.
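
For clients that must work against both behaviours, the release note can be read as the following sketch. It is illustrative only: the class LogResponseSketch is hypothetical, and it assumes that previous releases answered with status 200 and the placeholder text as the body, which the note implies but does not state.

```java
import java.util.Optional;

// Hypothetical client-side helper; not part of Flink.
final class LogResponseSketch {

    static final String LEGACY_PLACEHOLDER = "(file unavailable)";

    // Interprets a response to a log or stdout request. Returns the file
    // content, or Optional.empty() when the file is unavailable.
    static Optional<String> logContent(int statusCode, String body) {
        if (statusCode == 404) {
            // Behaviour described in the release note.
            return Optional.empty();
        }
        if (statusCode == 200) {
            // Assumed legacy behaviour: placeholder text instead of an error status.
            if (body != null && LEGACY_PLACEHOLDER.equals(body.trim())) {
                return Optional.empty();
            }
            return Optional.ofNullable(body);
        }
        throw new IllegalStateException("Unexpected status code: " + statusCode);
    }
}
```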

> Add Rest Handler to list JM Logfiles and enable reading Logs by Filename
> 
>
> Key: FLINK-16303
> URL: https://issues.apache.org/jira/browse/FLINK-16303
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / REST
>Reporter: lining
>Assignee: lining
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> * list all jobmanager log files
>  ** /jobmanager/logs
>  ** 
> {code:java}
> {
>   "logs": [
>     {
>       "name": "jobmanager.log",
>       "size": 12529
>     }
>   ]
> }{code}
>  * read a jobmanager log file
>  **  /jobmanager/log/[filename]
>  ** response: same as jobmanager's log





[jira] [Updated] (FLINK-16303) Add REST Log List and enable reading Logs by Filename

2020-05-18 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-16303:
-
Summary: Add REST Log List and enable reading Logs by Filename  (was: Add 
Rest Handler to list JM Log Files and enable reading Logs by Filename)

> Add REST Log List and enable reading Logs by Filename
> -
>
> Key: FLINK-16303
> URL: https://issues.apache.org/jira/browse/FLINK-16303
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / REST
>Reporter: lining
>Assignee: lining
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> * list all jobmanager log files
>  ** /jobmanager/logs
>  ** 
> {code:java}
> {
>   "logs": [
>     {
>       "name": "jobmanager.log",
>       "size": 12529
>     }
>   ]
> }{code}
>  * read a jobmanager log file
>  **  /jobmanager/log/[filename]
>  ** response: same as jobmanager's log





[jira] [Updated] (FLINK-16303) Add Rest Handler to list JM Log Files and enable reading Logs by Filename

2020-05-18 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-16303:
-
Summary: Add Rest Handler to list JM Log Files and enable reading Logs by 
Filename  (was: Add JobMananger Log List and enable reading Logs by Filename)

> Add Rest Handler to list JM Log Files and enable reading Logs by Filename
> -
>
> Key: FLINK-16303
> URL: https://issues.apache.org/jira/browse/FLINK-16303
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / REST
>Reporter: lining
>Assignee: lining
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> * list all jobmanager log files
>  ** /jobmanager/logs
>  ** 
> {code:java}
> {
>   "logs": [
>     {
>       "name": "jobmanager.log",
>       "size": 12529
>     }
>   ]
> }{code}
>  * read a jobmanager log file
>  **  /jobmanager/log/[filename]
>  ** response: same as jobmanager's log





[jira] [Updated] (FLINK-16303) Add Rest Handler to list JM Logfiles and enable reading Logs by Filename

2020-05-18 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-16303:
-
Summary: Add Rest Handler to list JM Logfiles and enable reading Logs by 
Filename  (was: Add REST Log List and enable reading Logs by Filename)

> Add Rest Handler to list JM Logfiles and enable reading Logs by Filename
> 
>
> Key: FLINK-16303
> URL: https://issues.apache.org/jira/browse/FLINK-16303
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / REST
>Reporter: lining
>Assignee: lining
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> * list all jobmanager log files
>  ** /jobmanager/logs
>  ** 
> {code:java}
> {
>   "logs": [
>     {
>       "name": "jobmanager.log",
>       "size": 12529
>     }
>   ]
> }{code}
>  * read a jobmanager log file
>  **  /jobmanager/log/[filename]
>  ** response: same as jobmanager's log





[jira] [Updated] (FLINK-16303) Add JobMananger Log List and enable reading Logs by Filename

2020-05-18 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-16303:
-
Summary: Add JobMananger Log List and enable reading Logs by Filename  
(was: add log list and read log by name for jobmanager)

> Add JobMananger Log List and enable reading Logs by Filename
> 
>
> Key: FLINK-16303
> URL: https://issues.apache.org/jira/browse/FLINK-16303
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / REST
>Reporter: lining
>Assignee: lining
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> * list all jobmanager log files
>  ** /jobmanager/logs
>  ** 
> {code:java}
> {
>   "logs": [
>     {
>       "name": "jobmanager.log",
>       "size": 12529
>     }
>   ]
> }{code}
>  * read a jobmanager log file
>  **  /jobmanager/log/[filename]
>  ** response: same as jobmanager's log





[jira] [Updated] (FLINK-13987) Better TM/JM Log Display

2020-05-18 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-13987:
-
Summary: Better TM/JM Log Display  (was: add log list and read log by name)

> Better TM/JM Log Display
> 
>
> Key: FLINK-13987
> URL: https://issues.apache.org/jira/browse/FLINK-13987
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / REST
>Reporter: lining
>Assignee: lining
>Priority: Major
> Fix For: 1.11.0
>
>
> As the job runs, the log files become large.
> Because the application runs on the JVM, the user sometimes needs to see the 
> GC log, but this content is not available.
> To sum up, we need new APIs:
>  *  list all taskmanager log files
>  ** /taskmanagers/taskmanagerid/logs
>  ** 
> {code:java}
> {
>   "logs": [
>     {
>       "name": "taskmanager.log",
>       "size": 12529
>     }
>   ]
> } {code}
>  * read a taskmanager log file
>  **  /taskmanagers/logs/[filename]
>  ** response: same as taskmanager’s log
>  * list all jobmanager log files
>  ** /jobmanager/logs
>  ** 
> {code:java}
> {
>   "logs": [
>     {
>       "name": "jobmanager.log",
>       "size": 12529
>     }
>   ]
> }{code}
>  * read a jobmanager log file
>  **  /jobmanager/logs/[filename]
>  ** response: same as jobmanager's log





[jira] [Closed] (FLINK-13987) add log list and read log by name

2020-05-18 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao closed FLINK-13987.

Resolution: Fixed

> add log list and read log by name
> -
>
> Key: FLINK-13987
> URL: https://issues.apache.org/jira/browse/FLINK-13987
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / REST
>Reporter: lining
>Assignee: lining
>Priority: Major
> Fix For: 1.11.0
>
>
> As the job runs, the log files become large.
> Because the application runs on the JVM, the user sometimes needs to see the 
> GC log, but this content is not available.
> To sum up, we need new APIs:
>  *  list all taskmanager log files
>  ** /taskmanagers/taskmanagerid/logs
>  ** 
> {code:java}
> {
>   "logs": [
>     {
>       "name": "taskmanager.log",
>       "size": 12529
>     }
>   ]
> } {code}
>  * read a taskmanager log file
>  **  /taskmanagers/logs/[filename]
>  ** response: same as taskmanager’s log
>  * list all jobmanager log files
>  ** /jobmanager/logs
>  ** 
> {code:java}
> {
>   "logs": [
>     {
>       "name": "jobmanager.log",
>       "size": 12529
>     }
>   ]
> }{code}
>  * read a jobmanager log file
>  **  /jobmanager/logs/[filename]
>  ** response: same as jobmanager's log





[jira] [Updated] (FLINK-13987) add log list and read log by name

2020-05-18 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-13987:
-
Fix Version/s: 1.11.0

> add log list and read log by name
> -
>
> Key: FLINK-13987
> URL: https://issues.apache.org/jira/browse/FLINK-13987
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / REST
>Reporter: lining
>Assignee: lining
>Priority: Major
> Fix For: 1.11.0
>
>
> As the job runs, the log files become large.
> Because the application runs on the JVM, the user sometimes needs to see the 
> GC log, but this content is not available.
> To sum up, we need new APIs:
>  *  list all taskmanager log files
>  ** /taskmanagers/taskmanagerid/logs
>  ** 
> {code:java}
> {
>   "logs": [
>     {
>       "name": "taskmanager.log",
>       "size": 12529
>     }
>   ]
> } {code}
>  * read a taskmanager log file
>  **  /taskmanagers/logs/[filename]
>  ** response: same as taskmanager’s log
>  * list all jobmanager log files
>  ** /jobmanager/logs
>  ** 
> {code:java}
> {
>   "logs": [
>     {
>       "name": "jobmanager.log",
>       "size": 12529
>     }
>   ]
> }{code}
>  * read a jobmanager log file
>  **  /jobmanager/logs/[filename]
>  ** response: same as jobmanager's log





[jira] [Updated] (FLINK-16863) Sorting descendingly on the last modified date of LogInfo

2020-05-18 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-16863:
-
Parent: (was: FLINK-13987)
Issue Type: Improvement  (was: Sub-task)

> Sorting descendingly on the last modified date of LogInfo
> -
>
> Key: FLINK-16863
> URL: https://issues.apache.org/jira/browse/FLINK-16863
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / REST
>Affects Versions: 1.11.0
>Reporter: lining
>Priority: Major
>
> Sorting in descending order by last-modified date would let a user see 
> the most recent files first.
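
A minimal sketch of the requested ordering, assuming each log entry carries a last-modified timestamp. The LogEntry class below is a hypothetical stand-in with an assumed lastModifiedMillis field; it is not Flink's LogInfo, and this is not the proposed implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; LogEntry is not Flink's LogInfo.
final class LogSortSketch {

    // Stand-in for a log file entry; the timestamp field is an assumption.
    static final class LogEntry {
        final String name;
        final long lastModifiedMillis;

        LogEntry(String name, long lastModifiedMillis) {
            this.name = name;
            this.lastModifiedMillis = lastModifiedMillis;
        }
    }

    // Returns a new list sorted by last-modified time, newest first.
    // The input list is left unchanged.
    static List<LogEntry> newestFirst(List<LogEntry> entries) {
        List<LogEntry> sorted = new ArrayList<>(entries);
        sorted.sort(Comparator.comparingLong((LogEntry e) -> e.lastModifiedMillis).reversed());
        return sorted;
    }
}
```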





[jira] [Updated] (FLINK-16863) Sorting descendingly on the last modified date of LogInfo

2020-05-18 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-16863:
-
Affects Version/s: 1.11.0

> Sorting descendingly on the last modified date of LogInfo
> -
>
> Key: FLINK-16863
> URL: https://issues.apache.org/jira/browse/FLINK-16863
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / REST
>Affects Versions: 1.11.0
>Reporter: lining
>Priority: Major
>
> Sorting in descending order by last-modified date would let a user see 
> the most recent files first.





[jira] [Updated] (FLINK-17794) Tear down installed software in reverse order in Jepsen Tests

2020-05-18 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-17794:
-
Description: 
Tear down installed software in reverse order in Jepsen tests. This mitigates 
the issue that sometimes YARN's NodeManager directories cannot be removed using 
{{rm -rf}} because Flink processes keep running and generate files after the 
YARN NodeManager is shut down. {{rm -r}} removes files recursively but if files 
are created in the background concurrently, the command can still fail with a 
non-zero exit code.

{noformat}
sh -c \"cd /; rm -rf /opt/hadoop\"", :exit 1, :out "", :err "rm: cannot remove 
'/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-3587fdbb-15be-4482-94f2-338bfe6b1acc/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_5271c210329e73bd743f3227edfb3b71__27_30__uuid_02dbbf1e-d2d5-43e8-ab34-040345f96476/db':
 Directory not empty\nrm: cannot remove 
'/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-d14f2078-74ee-4b8b-aafe-4299577f214f/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_7d23c6ceabda05a587f0217e44f21301__17_30__uuid_2de2b67d-0767-4e32-99f0-ddd291460947/db':
 Directory not empty
{noformat}

  was:
Tear down installed software in reverse order in Jepsen Tests. This mitigates 
the issue that sometimes YARN's NodeManager directories cannot be removed using 
{{rm -rf}} because Flink processes keep running and generate files after the 
YARN NodeManager is shut down. {{rm -r}} removes files recursively but if files 
are created in the background concurrently, the command can still fail with a 
non-zero exit code.

{noformat}
sh -c \"cd /; rm -rf /opt/hadoop\"", :exit 1, :out "", :err "rm: cannot remove 
'/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-3587fdbb-15be-4482-94f2-338bfe6b1acc/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_5271c210329e73bd743f3227edfb3b71__27_30__uuid_02dbbf1e-d2d5-43e8-ab34-040345f96476/db':
 Directory not empty\nrm: cannot remove 
'/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-d14f2078-74ee-4b8b-aafe-4299577f214f/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_7d23c6ceabda05a587f0217e44f21301__17_30__uuid_2de2b67d-0767-4e32-99f0-ddd291460947/db':
 Directory not empty
{noformat}


> Tear down installed software in reverse order in Jepsen Tests
> -
>
> Key: FLINK-17794
> URL: https://issues.apache.org/jira/browse/FLINK-17794
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.10.1, 1.11.0
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Major
> Fix For: 1.11.0
>
>
> Tear down installed software in reverse order in Jepsen tests. This mitigates 
> the issue that sometimes YARN's NodeManager directories cannot be removed 
> using {{rm -rf}} because Flink processes keep running and generate files 
> after the YARN NodeManager is shut down. {{rm -r}} removes files recursively 
> but if files are created in the background concurrently, the command can 
> still fail with a non-zero exit code.
> {noformat}
> sh -c \"cd /; rm -rf /opt/hadoop\"", :exit 1, :out "", :err "rm: cannot 
> remove 
> '/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-3587fdbb-15be-4482-94f2-338bfe6b1acc/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_5271c210329e73bd743f3227edfb3b71__27_30__uuid_02dbbf1e-d2d5-43e8-ab34-040345f96476/db':
>  Directory not empty\nrm: cannot remove 
> '/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-d14f2078-74ee-4b8b-aafe-4299577f214f/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_7d23c6ceabda05a587f0217e44f21301__17_30__uuid_2de2b67d-0767-4e32-99f0-ddd291460947/db':
>  Directory not empty
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17794) Tear down installed software in reverse order in Jepsen Tests

2020-05-18 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-17794:
-
Description: 
Tear down installed software in reverse order in Jepsen Tests. This mitigates 
the issue that sometimes YARN's NodeManager directories cannot be removed using 
{{rm -rf}} because Flink processes keep running and generate files after the 
YARN NodeManager is shut down. {{rm -r}} removes files recursively but if files 
are created in the background concurrently, the command can still fail with a 
non-zero exit code.

{noformat}
sh -c \"cd /; rm -rf /opt/hadoop\"", :exit 1, :out "", :err "rm: cannot remove 
'/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-3587fdbb-15be-4482-94f2-338bfe6b1acc/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_5271c210329e73bd743f3227edfb3b71__27_30__uuid_02dbbf1e-d2d5-43e8-ab34-040345f96476/db':
 Directory not empty\nrm: cannot remove 
'/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-d14f2078-74ee-4b8b-aafe-4299577f214f/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_7d23c6ceabda05a587f0217e44f21301__17_30__uuid_2de2b67d-0767-4e32-99f0-ddd291460947/db':
 Directory not empty
{noformat}

  was:
Tear down installed software in reverse order in Jepsen Tests. This mitigates 
the issue that sometimes hadoop's node manager directories cannot be removed 
using {{rm -rf}} because Flink processes keep running and generate files after 
the YARN NodeManager is shut down. {{rm -r}} removes files recursively but if 
files are created in the background concurrently, the command can still fail 
with a non-zero exit code.

{noformat}
sh -c \"cd /; rm -rf /opt/hadoop\"", :exit 1, :out "", :err "rm: cannot remove 
'/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-3587fdbb-15be-4482-94f2-338bfe6b1acc/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_5271c210329e73bd743f3227edfb3b71__27_30__uuid_02dbbf1e-d2d5-43e8-ab34-040345f96476/db':
 Directory not empty\nrm: cannot remove 
'/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-d14f2078-74ee-4b8b-aafe-4299577f214f/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_7d23c6ceabda05a587f0217e44f21301__17_30__uuid_2de2b67d-0767-4e32-99f0-ddd291460947/db':
 Directory not empty
{noformat}


> Tear down installed software in reverse order in Jepsen Tests
> -
>
> Key: FLINK-17794
> URL: https://issues.apache.org/jira/browse/FLINK-17794
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.10.1, 1.11.0
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Major
> Fix For: 1.11.0
>
>
> Tear down installed software in reverse order in Jepsen Tests. This mitigates 
> the issue that sometimes YARN's NodeManager directories cannot be removed 
> using {{rm -rf}} because Flink processes keep running and generate files 
> after the YARN NodeManager is shut down. {{rm -r}} removes files recursively 
> but if files are created in the background concurrently, the command can 
> still fail with a non-zero exit code.
> {noformat}
> sh -c \"cd /; rm -rf /opt/hadoop\"", :exit 1, :out "", :err "rm: cannot 
> remove 
> '/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-3587fdbb-15be-4482-94f2-338bfe6b1acc/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_5271c210329e73bd743f3227edfb3b71__27_30__uuid_02dbbf1e-d2d5-43e8-ab34-040345f96476/db':
>  Directory not empty\nrm: cannot remove 
> '/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-d14f2078-74ee-4b8b-aafe-4299577f214f/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_7d23c6ceabda05a587f0217e44f21301__17_30__uuid_2de2b67d-0767-4e32-99f0-ddd291460947/db':
>  Directory not empty
> {noformat}





[jira] [Created] (FLINK-17794) Tear down installed software in reverse order in Jepsen Tests

2020-05-18 Thread Gary Yao (Jira)
Gary Yao created FLINK-17794:


 Summary: Tear down installed software in reverse order in Jepsen 
Tests
 Key: FLINK-17794
 URL: https://issues.apache.org/jira/browse/FLINK-17794
 Project: Flink
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.10.1, 1.11.0
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.11.0


Tear down installed software in reverse order in Jepsen Tests. This mitigates 
the issue that sometimes hadoop's node manager directories cannot be removed 
using {{rm -rf}} because Flink processes keep running and generate files after 
the YARN NodeManager is shut down. {{rm -r}} removes files recursively but if 
files are created in the background concurrently, the command can still fail 
with a non-zero exit code.

{noformat}
sh -c \"cd /; rm -rf /opt/hadoop\"", :exit 1, :out "", :err "rm: cannot remove 
'/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-3587fdbb-15be-4482-94f2-338bfe6b1acc/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_5271c210329e73bd743f3227edfb3b71__27_30__uuid_02dbbf1e-d2d5-43e8-ab34-040345f96476/db':
 Directory not empty\nrm: cannot remove 
'/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-d14f2078-74ee-4b8b-aafe-4299577f214f/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_7d23c6ceabda05a587f0217e44f21301__17_30__uuid_2de2b67d-0767-4e32-99f0-ddd291460947/db':
 Directory not empty
{noformat}
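
The idea of tearing down in reverse order of installation can be sketched as 
follows. This is only an illustration in Java, not the flink-jepsen code 
(which is written in Clojure); the Component interface and the component 
names "hadoop" and "flink" are hypothetical and chosen for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public final class ReverseTeardown {

    /** Hypothetical abstraction of an installable piece of software. */
    interface Component {
        String name();
        void install();
        void teardown();
    }

    // Components are pushed in installation order, so the most recently
    // installed component is always on top of the stack.
    private final Deque<Component> installed = new ArrayDeque<>();

    void install(Component component) {
        component.install();
        installed.push(component);
    }

    /** Tears down everything in reverse installation order; returns that order. */
    List<String> teardownAll() {
        List<String> order = new ArrayList<>();
        while (!installed.isEmpty()) {
            Component component = installed.pop();
            component.teardown();
            order.add(component.name());
        }
        return order;
    }

    /** Creates a no-op component that only carries a name (for illustration). */
    static Component named(String name) {
        return new Component() {
            @Override public String name() { return name; }
            @Override public void install() { }
            @Override public void teardown() { }
        };
    }

    public static void main(String[] args) {
        ReverseTeardown setup = new ReverseTeardown();
        setup.install(named("hadoop")); // installed first, torn down last
        setup.install(named("flink"));  // installed last, torn down first
        System.out.println(setup.teardownAll());
    }
}
```

With this ordering, the processes that depend on a component (and may still 
write into its directories) are stopped before that component's directories 
are removed.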





[jira] [Created] (FLINK-17792) Failing to invoke jstack on TM processes should not fail Jepsen Tests

2020-05-18 Thread Gary Yao (Jira)
Gary Yao created FLINK-17792:


 Summary: Failing to invoke jstack on TM processes should not 
fail Jepsen Tests
 Key: FLINK-17792
 URL: https://issues.apache.org/jira/browse/FLINK-17792
 Project: Flink
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.10.1, 1.11.0
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.11.0


{{jstack}} can fail if the JVM process exits prematurely while or before we 
invoke {{jstack}}. If {{jstack}} fails, the exception propagates and exits the 
Jepsen Tests prematurely.
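
A minimal sketch of the intended behavior, written in Java for illustration 
only (the actual tests are written in Clojure): a diagnostic command is run on 
a best-effort basis, and any failure is logged and swallowed instead of 
propagating. The class name BestEffortCommand, the helper tryRun, and the 
process id used in main are assumptions made for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;

public final class BestEffortCommand {

    /**
     * Runs a command and returns its combined output, or Optional.empty() if
     * the command cannot be started, is interrupted, or exits with a
     * non-zero exit code. Failures are reported on stderr but never thrown.
     */
    static Optional<String> tryRun(List<String> command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true) // merge stderr into stdout
                    .start();
            String output = new String(
                    process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                System.err.println("Ignoring failed command " + command
                        + " (exit code " + exitCode + ")");
                return Optional.empty();
            }
            return Optional.of(output);
        } catch (IOException e) {
            // For example, the binary does not exist or cannot be executed.
            System.err.println("Ignoring command that could not be run: " + e.getMessage());
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Hypothetical process id; the target JVM may already have exited.
        String pid = args.length > 0 ? args[0] : "12345";
        Optional<String> dump = tryRun(List.of("jstack", pid));
        System.out.println(dump.orElse("no thread dump collected"));
    }
}
```

The caller treats a missing thread dump as an expected outcome, so a JVM that 
exits before or during the jstack invocation does not abort the test run.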






[jira] [Reopened] (FLINK-17595) JobExceptionsInfo. ExecutionExceptionInfo miss getter method

2020-05-18 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao reopened FLINK-17595:
--

> JobExceptionsInfo. ExecutionExceptionInfo miss getter method
> 
>
> Key: FLINK-17595
> URL: https://issues.apache.org/jira/browse/FLINK-17595
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / REST
>Affects Versions: 1.10.0
>Reporter: Wei Zhang
>Priority: Minor
> Fix For: 1.11.0
>
>
> {code:java}
> public static final class ExecutionExceptionInfo {
>
>     public static final String FIELD_NAME_EXCEPTION = "exception";
>     public static final String FIELD_NAME_TASK = "task";
>     public static final String FIELD_NAME_LOCATION = "location";
>     public static final String FIELD_NAME_TIMESTAMP = "timestamp";
>
>     @JsonProperty(FIELD_NAME_EXCEPTION)
>     private final String exception;
>
>     @JsonProperty(FIELD_NAME_TASK)
>     private final String task;
>
>     @JsonProperty(FIELD_NAME_LOCATION)
>     private final String location;
>
>     @JsonProperty(FIELD_NAME_TIMESTAMP)
>     private final long timestamp;
>
>     @JsonCreator
>     public ExecutionExceptionInfo(
>             @JsonProperty(FIELD_NAME_EXCEPTION) String exception,
>             @JsonProperty(FIELD_NAME_TASK) String task,
>             @JsonProperty(FIELD_NAME_LOCATION) String location,
>             @JsonProperty(FIELD_NAME_TIMESTAMP) long timestamp) {
>         this.exception = Preconditions.checkNotNull(exception);
>         this.task = Preconditions.checkNotNull(task);
>         this.location = Preconditions.checkNotNull(location);
>         this.timestamp = timestamp;
>     }
>
>     @Override
>     public boolean equals(Object o) {
>         if (this == o) {
>             return true;
>         }
>         if (o == null || getClass() != o.getClass()) {
>             return false;
>         }
>         JobExceptionsInfo.ExecutionExceptionInfo that =
>                 (JobExceptionsInfo.ExecutionExceptionInfo) o;
>         return timestamp == that.timestamp &&
>                 Objects.equals(exception, that.exception) &&
>                 Objects.equals(task, that.task) &&
>                 Objects.equals(location, that.location);
>     }
>
>     @Override
>     public int hashCode() {
>         return Objects.hash(timestamp, exception, task, location);
>     }
> }
> {code}
> I found that JobExceptionsInfo.ExecutionExceptionInfo has no getter methods 
> for its fields. Are they missing?





[jira] [Closed] (FLINK-17595) JobExceptionsInfo. ExecutionExceptionInfo miss getter method

2020-05-18 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao closed FLINK-17595.

Resolution: Won't Do

> JobExceptionsInfo. ExecutionExceptionInfo miss getter method
> 
>
> Key: FLINK-17595
> URL: https://issues.apache.org/jira/browse/FLINK-17595
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / REST
>Affects Versions: 1.10.0
>Reporter: Wei Zhang
>Priority: Minor
> Fix For: 1.11.0
>
>
> {code:java}
> public static final class ExecutionExceptionInfo {
>
>     public static final String FIELD_NAME_EXCEPTION = "exception";
>     public static final String FIELD_NAME_TASK = "task";
>     public static final String FIELD_NAME_LOCATION = "location";
>     public static final String FIELD_NAME_TIMESTAMP = "timestamp";
>
>     @JsonProperty(FIELD_NAME_EXCEPTION)
>     private final String exception;
>
>     @JsonProperty(FIELD_NAME_TASK)
>     private final String task;
>
>     @JsonProperty(FIELD_NAME_LOCATION)
>     private final String location;
>
>     @JsonProperty(FIELD_NAME_TIMESTAMP)
>     private final long timestamp;
>
>     @JsonCreator
>     public ExecutionExceptionInfo(
>             @JsonProperty(FIELD_NAME_EXCEPTION) String exception,
>             @JsonProperty(FIELD_NAME_TASK) String task,
>             @JsonProperty(FIELD_NAME_LOCATION) String location,
>             @JsonProperty(FIELD_NAME_TIMESTAMP) long timestamp) {
>         this.exception = Preconditions.checkNotNull(exception);
>         this.task = Preconditions.checkNotNull(task);
>         this.location = Preconditions.checkNotNull(location);
>         this.timestamp = timestamp;
>     }
>
>     @Override
>     public boolean equals(Object o) {
>         if (this == o) {
>             return true;
>         }
>         if (o == null || getClass() != o.getClass()) {
>             return false;
>         }
>         JobExceptionsInfo.ExecutionExceptionInfo that =
>                 (JobExceptionsInfo.ExecutionExceptionInfo) o;
>         return timestamp == that.timestamp &&
>                 Objects.equals(exception, that.exception) &&
>                 Objects.equals(task, that.task) &&
>                 Objects.equals(location, that.location);
>     }
>
>     @Override
>     public int hashCode() {
>         return Objects.hash(timestamp, exception, task, location);
>     }
> }
> {code}
> I found that JobExceptionsInfo.ExecutionExceptionInfo has no getter methods 
> for its fields. Are they missing?





[jira] [Updated] (FLINK-17595) JobExceptionsInfo. ExecutionExceptionInfo miss getter method

2020-05-18 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-17595:
-
Fix Version/s: (was: 1.11.0)

> JobExceptionsInfo. ExecutionExceptionInfo miss getter method
> 
>
> Key: FLINK-17595
> URL: https://issues.apache.org/jira/browse/FLINK-17595
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / REST
>Affects Versions: 1.10.0
>Reporter: Wei Zhang
>Priority: Minor
>
> {code:java}
> public static final class ExecutionExceptionInfo {
>
>     public static final String FIELD_NAME_EXCEPTION = "exception";
>     public static final String FIELD_NAME_TASK = "task";
>     public static final String FIELD_NAME_LOCATION = "location";
>     public static final String FIELD_NAME_TIMESTAMP = "timestamp";
>
>     @JsonProperty(FIELD_NAME_EXCEPTION)
>     private final String exception;
>
>     @JsonProperty(FIELD_NAME_TASK)
>     private final String task;
>
>     @JsonProperty(FIELD_NAME_LOCATION)
>     private final String location;
>
>     @JsonProperty(FIELD_NAME_TIMESTAMP)
>     private final long timestamp;
>
>     @JsonCreator
>     public ExecutionExceptionInfo(
>             @JsonProperty(FIELD_NAME_EXCEPTION) String exception,
>             @JsonProperty(FIELD_NAME_TASK) String task,
>             @JsonProperty(FIELD_NAME_LOCATION) String location,
>             @JsonProperty(FIELD_NAME_TIMESTAMP) long timestamp) {
>         this.exception = Preconditions.checkNotNull(exception);
>         this.task = Preconditions.checkNotNull(task);
>         this.location = Preconditions.checkNotNull(location);
>         this.timestamp = timestamp;
>     }
>
>     @Override
>     public boolean equals(Object o) {
>         if (this == o) {
>             return true;
>         }
>         if (o == null || getClass() != o.getClass()) {
>             return false;
>         }
>         JobExceptionsInfo.ExecutionExceptionInfo that =
>                 (JobExceptionsInfo.ExecutionExceptionInfo) o;
>         return timestamp == that.timestamp &&
>                 Objects.equals(exception, that.exception) &&
>                 Objects.equals(task, that.task) &&
>                 Objects.equals(location, that.location);
>     }
>
>     @Override
>     public int hashCode() {
>         return Objects.hash(timestamp, exception, task, location);
>     }
> }
> {code}
> I found that JobExceptionsInfo.ExecutionExceptionInfo has no getter methods 
> for its fields. Are they missing?





[jira] [Updated] (FLINK-17777) Make Mesos Jepsen Tests pass with Hadoop-free Flink

2020-05-17 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-17777:
-
Description: Since FLINK-11086, we can no longer build a Flink distribution 
with Hadoop. Therefore, we need to set the {{HADOOP_CLASSPATH}} environment 
variable for the TM processes.

> Make Mesos Jepsen Tests pass with Hadoop-free Flink 
> 
>
> Key: FLINK-17777
> URL: https://issues.apache.org/jira/browse/FLINK-17777
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.11.0
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Major
>
> Since FLINK-11086, we can no longer build a Flink distribution with Hadoop. 
> Therefore, we need to set the {{HADOOP_CLASSPATH}} environment variable for 
> the TM processes.





[jira] [Updated] (FLINK-17522) Document flink-jepsen Command Line Options

2020-05-17 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-17522:
-
Description: Document command line options that can be passed to {{lein run 
test}}.

> Document flink-jepsen Command Line Options
> --
>
> Key: FLINK-17522
> URL: https://issues.apache.org/jira/browse/FLINK-17522
> Project: Flink
>  Issue Type: Improvement
>  Components: Tests
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> Document command line options that can be passed to {{lein run test}}.





[jira] [Closed] (FLINK-17522) Document flink-jepsen Command Line Options

2020-05-17 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao closed FLINK-17522.

Resolution: Fixed

master: b606cbaf36f9e206f44243dfdc7e8005e92d2d66

> Document flink-jepsen Command Line Options
> --
>
> Key: FLINK-17522
> URL: https://issues.apache.org/jira/browse/FLINK-17522
> Project: Flink
>  Issue Type: Improvement
>  Components: Tests
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> Document command line options that can be passed to {{lein run test}}.





[jira] [Updated] (FLINK-17522) Document flink-jepsen Command Line Options

2020-05-17 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-17522:
-
Fix Version/s: 1.11.0

> Document flink-jepsen Command Line Options
> --
>
> Key: FLINK-17522
> URL: https://issues.apache.org/jira/browse/FLINK-17522
> Project: Flink
>  Issue Type: Improvement
>  Components: Tests
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>






[jira] [Created] (FLINK-17777) Make Mesos Jepsen Tests pass with Hadoop-free Flink

2020-05-17 Thread Gary Yao (Jira)
Gary Yao created FLINK-17777:


 Summary: Make Mesos Jepsen Tests pass with Hadoop-free Flink 
 Key: FLINK-17777
 URL: https://issues.apache.org/jira/browse/FLINK-17777
 Project: Flink
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.11.0
Reporter: Gary Yao
Assignee: Gary Yao








[jira] [Commented] (FLINK-17676) Is there some way to rollback the .out file of TaskManager

2020-05-14 Thread Gary Yao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17107277#comment-17107277
 ] 

Gary Yao commented on FLINK-17676:
--

[~hejiefang] Can you use [logrotate|https://linux.die.net/man/8/logrotate]?

> Is there some way to rollback the .out file of TaskManager
> --
>
> Key: FLINK-17676
> URL: https://issues.apache.org/jira/browse/FLINK-17676
> Project: Flink
>  Issue Type: Improvement
>Reporter: JieFang.He
>Priority: Major
>
> When using the .print() API, all results are written to the .out file, but 
> there is no way to rollback the .out file.
>  
> out in flink-daemon.sh
> {code:java}
> // $JAVA_RUN $JVM_ARGS ${FLINK_ENV_JAVA_OPTS} "${log_setting[@]}" -classpath 
> "`manglePathList "$FLINK_TM_CLASSPATH:$INTERNAL_HADOOP_CLASSPATHS"`" 
> ${CLASS_TO_RUN} "${ARGS[@]}" > "$out" 200<&- 2>&1 < /dev/null &
> {code}
>  





[jira] [Created] (FLINK-17687) Collect TaskManager logs in Mesos Jepsen Tests

2020-05-14 Thread Gary Yao (Jira)
Gary Yao created FLINK-17687:


 Summary: Collect TaskManager logs in Mesos Jepsen Tests
 Key: FLINK-17687
 URL: https://issues.apache.org/jira/browse/FLINK-17687
 Project: Flink
  Issue Type: Improvement
  Components: Tests
Affects Versions: 1.11.0
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.11.0








[jira] [Commented] (FLINK-17676) Is there some way to rollback the .out file of TaskManager

2020-05-14 Thread Gary Yao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17107033#comment-17107033
 ] 

Gary Yao commented on FLINK-17676:
--

You wrote "rollback", but I suspect you mean rolling the .out file, i.e., 
closing and archiving the current .out file once it exceeds a certain size and 
starting to write to a new one.

> Is there some way to rollback the .out file of TaskManager
> --
>
> Key: FLINK-17676
> URL: https://issues.apache.org/jira/browse/FLINK-17676
> Project: Flink
>  Issue Type: Improvement
>Reporter: JieFang.He
>Priority: Major
>
> When using the .print() API, all results are written to the .out file, but 
> there is no way to rollback the .out file.
>  
> out in flink-daemon.sh
> {code:java}
> // $JAVA_RUN $JVM_ARGS ${FLINK_ENV_JAVA_OPTS} "${log_setting[@]}" -classpath 
> "`manglePathList "$FLINK_TM_CLASSPATH:$INTERNAL_HADOOP_CLASSPATHS"`" 
> ${CLASS_TO_RUN} "${ARGS[@]}" > "$out" 200<&- 2>&1 < /dev/null &
> {code}
>  





[jira] [Closed] (FLINK-17616) Temporarily increase akka.ask.timeout in TPC-DS e2e test

2020-05-13 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao closed FLINK-17616.

Resolution: Fixed

master: 56ea8d15dee58d2a79b6b9c646a8bfb2cb9f0c23

> Temporarily increase akka.ask.timeout in TPC-DS e2e test
> 
>
> Key: FLINK-17616
> URL: https://issues.apache.org/jira/browse/FLINK-17616
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.11.0
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> Until FLINK-17558 is fixed, we should increase the akka.ask.timeout in the 
> e2e test to mitigate FLINK-17194 





[jira] [Commented] (FLINK-17595) JobExceptionsInfo. ExecutionExceptionInfo miss getter method

2020-05-12 Thread Gary Yao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17105322#comment-17105322
 ] 

Gary Yao commented on FLINK-17595:
--

The REST API, i.e., the specification of the JSON requests and responses, is a 
public API and guaranteed to be stable. However, the Java classes used to 
implement the REST API are not part of the public API (not annotated with 
{{@Public}} or {{@PublicEvolving}}). Even if we added a getter to 
{{ExecutionExceptionInfo}}, we would not guarantee that the class won't be 
renamed or moved to a different package in a future Flink release. Moreover, the 
{{RestClusterClient}} is not part of the public API; 
{{RestClusterClient#sendRequest()}} is even annotated with 
{{@VisibleForTesting}}. The problem seems to be that in the Flink project there 
is not yet a reference client implementation to programmatically interact with 
the REST API. All in all, I am currently against adding a getter to 
{{ExecutionExceptionInfo}}. What you can do in the meantime as a workaround is:

* Access the required field by reflection
* Copy the class and add a getter yourself
* Implement your own client from scratch
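
The first workaround (reading the field by reflection) could look roughly like 
the sketch below. It is an illustration only: ExceptionInfoLike is a 
hypothetical stand-in for a class with private fields and no getters, and 
readPrivateField is a helper name invented for this example. Like the class 
itself, the field names are not part of the public API and may change between 
releases.

```java
import java.lang.reflect.Field;

public final class ReflectiveFieldAccess {

    /** Hypothetical stand-in for a response class without getters. */
    static final class ExceptionInfoLike {
        private final String exception;
        private final long timestamp;

        ExceptionInfoLike(String exception, long timestamp) {
            this.exception = exception;
            this.timestamp = timestamp;
        }
    }

    /**
     * Reads a field declared directly on the target's class, ignoring its
     * access modifier. Primitive values are returned boxed.
     */
    static Object readPrivateField(Object target, String fieldName) {
        try {
            Field field = target.getClass().getDeclaredField(fieldName);
            field.setAccessible(true); // bypass the private modifier
            return field.get(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                    "Cannot read field '" + fieldName + "' of " + target.getClass(), e);
        }
    }

    public static void main(String[] args) {
        ExceptionInfoLike info =
                new ExceptionInfoLike("java.lang.RuntimeException: boom", 42L);
        System.out.println(readPrivateField(info, "exception"));
        System.out.println(readPrivateField(info, "timestamp"));
    }
}
```

Because the lookup is by field name, a renamed field only fails at runtime, 
which is the main drawback of this workaround compared to implementing a 
client against the JSON specification.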


> JobExceptionsInfo. ExecutionExceptionInfo miss getter method
> 
>
> Key: FLINK-17595
> URL: https://issues.apache.org/jira/browse/FLINK-17595
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / REST
>Affects Versions: 1.10.0
>Reporter: Wei Zhang
>Priority: Minor
> Fix For: 1.11.0
>
>
> {code:java}
> public static final class ExecutionExceptionInfo {
>
>     public static final String FIELD_NAME_EXCEPTION = "exception";
>     public static final String FIELD_NAME_TASK = "task";
>     public static final String FIELD_NAME_LOCATION = "location";
>     public static final String FIELD_NAME_TIMESTAMP = "timestamp";
>
>     @JsonProperty(FIELD_NAME_EXCEPTION)
>     private final String exception;
>
>     @JsonProperty(FIELD_NAME_TASK)
>     private final String task;
>
>     @JsonProperty(FIELD_NAME_LOCATION)
>     private final String location;
>
>     @JsonProperty(FIELD_NAME_TIMESTAMP)
>     private final long timestamp;
>
>     @JsonCreator
>     public ExecutionExceptionInfo(
>             @JsonProperty(FIELD_NAME_EXCEPTION) String exception,
>             @JsonProperty(FIELD_NAME_TASK) String task,
>             @JsonProperty(FIELD_NAME_LOCATION) String location,
>             @JsonProperty(FIELD_NAME_TIMESTAMP) long timestamp) {
>         this.exception = Preconditions.checkNotNull(exception);
>         this.task = Preconditions.checkNotNull(task);
>         this.location = Preconditions.checkNotNull(location);
>         this.timestamp = timestamp;
>     }
>
>     @Override
>     public boolean equals(Object o) {
>         if (this == o) {
>             return true;
>         }
>         if (o == null || getClass() != o.getClass()) {
>             return false;
>         }
>         JobExceptionsInfo.ExecutionExceptionInfo that =
>                 (JobExceptionsInfo.ExecutionExceptionInfo) o;
>         return timestamp == that.timestamp &&
>                 Objects.equals(exception, that.exception) &&
>                 Objects.equals(task, that.task) &&
>                 Objects.equals(location, that.location);
>     }
>
>     @Override
>     public int hashCode() {
>         return Objects.hash(timestamp, exception, task, location);
>     }
> }
> {code}
> I found that JobExceptionsInfo.ExecutionExceptionInfo has no getter methods 
> for its fields. Are they missing?





[jira] [Updated] (FLINK-17536) Change the config option of slot max limitation to "slotmanager.number-of-slots.max"

2020-05-12 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-17536:
-
Affects Version/s: 1.11.0

> Change the config option of slot max limitation to 
> "slotmanager.number-of-slots.max"
> 
>
> Key: FLINK-17536
> URL: https://issues.apache.org/jira/browse/FLINK-17536
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Configuration
>Affects Versions: 1.11.0
>Reporter: Yangze Guo
>Assignee: Yangze Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>






[jira] [Closed] (FLINK-17536) Change the config option of slot max limitation to "slotmanager.number-of-slots.max"

2020-05-12 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao closed FLINK-17536.

Resolution: Fixed

master: 9888a87495827b619f5b49dae5ad29a34931d0a9

> Change the config option of slot max limitation to 
> "slotmanager.number-of-slots.max"
> 
>
> Key: FLINK-17536
> URL: https://issues.apache.org/jira/browse/FLINK-17536
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Configuration
>Affects Versions: 1.11.0
>Reporter: Yangze Guo
>Assignee: Yangze Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>






[jira] [Closed] (FLINK-17130) Web UI: Enable listing JM Logs and displaying Logs by Filename

2020-05-12 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao closed FLINK-17130.

Resolution: Fixed

master: 74b850cd4c6aeac5dd0d20852677d642c9703970

> Web UI: Enable listing JM Logs and displaying Logs by Filename
> --
>
> Key: FLINK-17130
> URL: https://issues.apache.org/jira/browse/FLINK-17130
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Web Frontend
>Affects Versions: 1.11.0
>Reporter: Yadong Xie
>Assignee: Yadong Xie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Add a log list and the ability to read a log by file name for the JobManager 
> in the web UI.





[jira] [Closed] (FLINK-17608) Add TM log and stdout page back

2020-05-11 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao closed FLINK-17608.

Resolution: Fixed

master: 7cfcd33e983c6e07eedf8c0d5514450a565710ff

> Add TM log and stdout page back
> ---
>
> Key: FLINK-17608
> URL: https://issues.apache.org/jira/browse/FLINK-17608
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Web Frontend
>Affects Versions: 1.11.0
>Reporter: Yadong Xie
>Assignee: Yadong Xie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> According to the discussion in 
> [https://github.com/apache/flink/pull/11731#issuecomment-620048458]
> TM log and stdout page should be added in order not to break the previous 
> user experience.





[jira] [Updated] (FLINK-17608) Add TM log and stdout page back

2020-05-11 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-17608:
-
Description: 
According to the discussion in 
[https://github.com/apache/flink/pull/11731#issuecomment-620048458]

TM log and stdout page should be added in order not to break the previous user 
experience.

  was:
According to the discussion in 
[https://github.com/apache/flink/pull/11731#issuecomment-620048458]

TM log and stdout page should be added in order not to break the previous user 
exp


> Add TM log and stdout page back
> ---
>
> Key: FLINK-17608
> URL: https://issues.apache.org/jira/browse/FLINK-17608
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Web Frontend
>Affects Versions: 1.11.0
>Reporter: Yadong Xie
>Assignee: Yadong Xie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> According to the discussion in 
> [https://github.com/apache/flink/pull/11731#issuecomment-620048458]
> TM log and stdout page should be added in order not to break the previous 
> user experience.





[jira] [Created] (FLINK-17621) Use default akka.ask.timeout in TPC-DS e2e test

2020-05-11 Thread Gary Yao (Jira)
Gary Yao created FLINK-17621:


 Summary: Use default akka.ask.timeout in TPC-DS e2e test
 Key: FLINK-17621
 URL: https://issues.apache.org/jira/browse/FLINK-17621
 Project: Flink
  Issue Type: Task
  Components: Runtime / Coordination, Tests
Affects Versions: 1.11.0
Reporter: Gary Yao


Revert the changes in FLINK-17616





[jira] [Commented] (FLINK-17536) Change the config option of slot max limitation to "slotmanager.number-of-slots.max"

2020-05-11 Thread Gary Yao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17104372#comment-17104372
 ] 

Gary Yao commented on FLINK-17536:
--

Sorry, missed your message. I have assigned you now.

> Change the config option of slot max limitation to 
> "slotmanager.number-of-slots.max"
> 
>
> Key: FLINK-17536
> URL: https://issues.apache.org/jira/browse/FLINK-17536
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Configuration
>Reporter: Yangze Guo
>Assignee: Yangze Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>






[jira] [Assigned] (FLINK-17536) Change the config option of slot max limitation to "slotmanager.number-of-slots.max"

2020-05-11 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao reassigned FLINK-17536:


Assignee: Gary Yao

> Change the config option of slot max limitation to 
> "slotmanager.number-of-slots.max"
> 
>
> Key: FLINK-17536
> URL: https://issues.apache.org/jira/browse/FLINK-17536
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Configuration
>Reporter: Yangze Guo
>Assignee: Gary Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>






[jira] [Assigned] (FLINK-17536) Change the config option of slot max limitation to "slotmanager.number-of-slots.max"

2020-05-11 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao reassigned FLINK-17536:


Assignee: Yangze Guo  (was: Gary Yao)

> Change the config option of slot max limitation to 
> "slotmanager.number-of-slots.max"
> 
>
> Key: FLINK-17536
> URL: https://issues.apache.org/jira/browse/FLINK-17536
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Configuration
>Reporter: Yangze Guo
>Assignee: Yangze Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>






[jira] [Updated] (FLINK-17616) Temporarily increase akka.ask.timeout in TPC-DS e2e test

2020-05-11 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-17616:
-
Description: Until FLINK-17558 is fixed, we should 

> Temporarily increase akka.ask.timeout in TPC-DS e2e test
> 
>
> Key: FLINK-17616
> URL: https://issues.apache.org/jira/browse/FLINK-17616
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.11.0
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Critical
> Fix For: 1.11.0
>
>
> Until FLINK-17558 is fixed, we should 





[jira] [Updated] (FLINK-17616) Temporarily increase akka.ask.timeout in TPC-DS e2e test

2020-05-11 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-17616:
-
Priority: Critical  (was: Major)

> Temporarily increase akka.ask.timeout in TPC-DS e2e test
> 
>
> Key: FLINK-17616
> URL: https://issues.apache.org/jira/browse/FLINK-17616
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.11.0
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Critical
> Fix For: 1.11.0
>
>






[jira] [Updated] (FLINK-17616) Temporarily increase akka.ask.timeout in TPC-DS e2e test

2020-05-11 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-17616:
-
Description: Until FLINK-17558 is fixed, we should increase the 
akka.ask.timeout in the e2e test to mitigate FLINK-17194   (was: Until 
FLINK-17558 is fixed, we should )

> Temporarily increase akka.ask.timeout in TPC-DS e2e test
> 
>
> Key: FLINK-17616
> URL: https://issues.apache.org/jira/browse/FLINK-17616
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.11.0
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Critical
> Fix For: 1.11.0
>
>
> Until FLINK-17558 is fixed, we should increase the akka.ask.timeout in the 
> e2e test to mitigate FLINK-17194 





[jira] [Created] (FLINK-17616) Temporarily increase akka.ask.timeout in TPC-DS e2e test

2020-05-11 Thread Gary Yao (Jira)
Gary Yao created FLINK-17616:


 Summary: Temporarily increase akka.ask.timeout in TPC-DS e2e test
 Key: FLINK-17616
 URL: https://issues.apache.org/jira/browse/FLINK-17616
 Project: Flink
  Issue Type: Task
  Components: Runtime / Coordination, Tests
Affects Versions: 1.11.0
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.11.0








[jira] [Closed] (FLINK-17369) Migrate RestartPipelinedRegionFailoverStrategyBuildingTest to PipelinedRegionComputeUtilTest

2020-05-11 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao closed FLINK-17369.

Resolution: Fixed

master:
cba48459172f3888766c6ec753f3c45c6cd1d884
56cc76bbdaf6380781c118ad2e5d4fbfeca510ac
fd0ef6e672b5ac2f7cbd01fc5704e9e06c748016

> Migrate RestartPipelinedRegionFailoverStrategyBuildingTest to 
> PipelinedRegionComputeUtilTest
> 
>
> Key: FLINK-17369
> URL: https://issues.apache.org/jira/browse/FLINK-17369
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.11.0
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> Tests in {{RestartPipelinedRegionFailoverStrategyBuildingTest}} are actually 
> testing the behavior of {{PipelinedRegionComputeUtil}}. Therefore, the tests 
> should be moved to a new class {{PipelinedRegionComputeUtilTest}}.





[jira] [Closed] (FLINK-17485) Add a thread dump REST API

2020-05-11 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao closed FLINK-17485.

Resolution: Duplicate

Closing because it is a duplicate. Feel free to re-open if you think otherwise.

> Add a thread dump REST API
> --
>
> Key: FLINK-17485
> URL: https://issues.apache.org/jira/browse/FLINK-17485
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / REST
>Reporter: Xingxing Di
>Priority: Major
>
> My team built a stream-processing platform based on Flink for internal use at 
> our company.
> As the number of jobs and users grows, we spend a lot of time helping users 
> with troubleshooting.
> Currently we must log on to the server that runs the task manager, find the 
> right process through netstat -anp | grep "the flink data port", and then run 
> the jstack command.
> We think it would be very convenient if Flink provided a REST API for thread 
> dumps; web UI support would be even better.
> So we want to know:
>  * Whether the community is already working on this
>  * Whether this would be an appropriate feature (adding a REST API to dump 
> threads), since a thread dump may be "expensive"
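
For context, the JVM can produce a jstack-like dump from inside the process 
through the standard java.lang.management API, which is the kind of call a REST 
handler for this request could wrap. The following is a minimal sketch under 
that assumption. The names ThreadDumpSketch and dumpThreads are hypothetical and 
are not taken from Flink or from FLINK-14816.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Hypothetical helper (not Flink code): collects a thread dump of the current JVM. */
final class ThreadDumpSketch {

    private ThreadDumpSketch() {}

    /** Returns one text block per live thread, similar in spirit to jstack output. */
    static String dumpThreads() {
        ThreadMXBean threadMxBean = ManagementFactory.getThreadMXBean();
        // Lock information makes the dump more useful but also more expensive,
        // so it is only requested when the JVM reports support for it.
        ThreadInfo[] infos = threadMxBean.dumpAllThreads(
                threadMxBean.isObjectMonitorUsageSupported(),
                threadMxBean.isSynchronizerUsageSupported());

        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue; // defensive: skip threads that vanished meanwhile
            }
            sb.append('"').append(info.getThreadName()).append("\" ")
                    .append(info.getThreadState()).append('\n');
            // Print frames explicitly; ThreadInfo#toString truncates deep stacks.
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("\tat ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

Because every live thread is captured, the cost concern raised above still 
applies, and a real endpoint would likely need rate limiting or access control.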





[jira] [Commented] (FLINK-17595) JobExceptionsInfo. ExecutionExceptionInfo miss getter method

2020-05-11 Thread Gary Yao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17104196#comment-17104196
 ] 

Gary Yao commented on FLINK-17595:
--

[~zhangwei24] Can you explain why you need a getter? This class isn't public 
API.

> JobExceptionsInfo. ExecutionExceptionInfo miss getter method
> 
>
> Key: FLINK-17595
> URL: https://issues.apache.org/jira/browse/FLINK-17595
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / REST
>Affects Versions: 1.10.0
>Reporter: Wei Zhang
>Priority: Minor
> Fix For: 1.11.0
>
>
> {code:java}
> public static final class ExecutionExceptionInfo {
>     public static final String FIELD_NAME_EXCEPTION = "exception";
>     public static final String FIELD_NAME_TASK = "task";
>     public static final String FIELD_NAME_LOCATION = "location";
>     public static final String FIELD_NAME_TIMESTAMP = "timestamp";
>
>     @JsonProperty(FIELD_NAME_EXCEPTION)
>     private final String exception;
>
>     @JsonProperty(FIELD_NAME_TASK)
>     private final String task;
>
>     @JsonProperty(FIELD_NAME_LOCATION)
>     private final String location;
>
>     @JsonProperty(FIELD_NAME_TIMESTAMP)
>     private final long timestamp;
>
>     @JsonCreator
>     public ExecutionExceptionInfo(
>             @JsonProperty(FIELD_NAME_EXCEPTION) String exception,
>             @JsonProperty(FIELD_NAME_TASK) String task,
>             @JsonProperty(FIELD_NAME_LOCATION) String location,
>             @JsonProperty(FIELD_NAME_TIMESTAMP) long timestamp) {
>         this.exception = Preconditions.checkNotNull(exception);
>         this.task = Preconditions.checkNotNull(task);
>         this.location = Preconditions.checkNotNull(location);
>         this.timestamp = timestamp;
>     }
>
>     @Override
>     public boolean equals(Object o) {
>         if (this == o) {
>             return true;
>         }
>         if (o == null || getClass() != o.getClass()) {
>             return false;
>         }
>         JobExceptionsInfo.ExecutionExceptionInfo that =
>                 (JobExceptionsInfo.ExecutionExceptionInfo) o;
>         return timestamp == that.timestamp &&
>                 Objects.equals(exception, that.exception) &&
>                 Objects.equals(task, that.task) &&
>                 Objects.equals(location, that.location);
>     }
>
>     @Override
>     public int hashCode() {
>         return Objects.hash(timestamp, exception, task, location);
>     }
> }
> {code}
> I found that JobExceptionsInfo.ExecutionExceptionInfo has no getter methods 
> for its fields. Are they missing?





[jira] [Assigned] (FLINK-17608) Add TM log and stdout page back

2020-05-11 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao reassigned FLINK-17608:


Assignee: Yadong Xie

> Add TM log and stdout page back
> ---
>
> Key: FLINK-17608
> URL: https://issues.apache.org/jira/browse/FLINK-17608
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Web Frontend
>Affects Versions: 1.11.0
>Reporter: Yadong Xie
>Assignee: Yadong Xie
>Priority: Major
> Fix For: 1.11.0
>
>
> According to the discussion in 
> [https://github.com/apache/flink/pull/11731#issuecomment-620048458]
> TM log and stdout page should be added in order not to break the previous 
> user exp





[jira] [Commented] (FLINK-17194) TPC-DS end-to-end test fails due to missing execution attempt

2020-05-07 Thread Gary Yao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101970#comment-17101970
 ] 

Gary Yao commented on FLINK-17194:
--

My theory is that we exhaust the IOPS credits towards the end of the tests and 
file I/O becomes really slow. Nonetheless, partitions should not be released in 
the main thread. I have created FLINK-17558 to track that issue. As a 
mitigation we could temporarily increase the akka ask timeout. 
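
The idea of keeping blocking file deletion off the main thread can be sketched 
as follows. This is a minimal illustration using only the JDK. The names 
PartitionFileReleaser and releaseAsync, and the executor passed in, are 
hypothetical; the sketch does not reflect the actual TaskExecutor code or the 
eventual fix for FLINK-17558.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

/** Hypothetical helper (not Flink code): deletes partition files on an I/O executor. */
final class PartitionFileReleaser {

    private final Executor ioExecutor;

    PartitionFileReleaser(Executor ioExecutor) {
        this.ioExecutor = ioExecutor;
    }

    /**
     * Schedules the blocking delete on the I/O executor and returns immediately,
     * so the calling (main) thread stays free to process heartbeats and RPCs.
     */
    CompletableFuture<Void> releaseAsync(Path partitionFile) {
        return CompletableFuture.runAsync(() -> {
            try {
                Files.deleteIfExists(partitionFile);
            } catch (IOException e) {
                // Surface the failure through the returned future.
                throw new UncheckedIOException(e);
            }
        }, ioExecutor);
    }
}
```

The caller gets a future back instead of waiting, so a slow disk delays only 
the I/O thread and not message handling on the main thread.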

> TPC-DS end-to-end test fails due to missing execution attempt
> -
>
> Key: FLINK-17194
> URL: https://issues.apache.org/jira/browse/FLINK-17194
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.11.0
>Reporter: Chesnay Schepler
>Assignee: Gary Yao
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.11.0
>
>
> [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=7567=logs=c88eea3b-64a0-564d-0031-9fdcd7b8abee=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5]
> {code:java}
> org.apache.flink.runtime.jobmaster.ExecutionGraphException: The execution 
> attempt d6bef26867c04f1c94903b06b60ec55f was not found.
>   at 
> org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:389)
>  ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
>  {code}





[jira] [Updated] (FLINK-17558) Partitions are released in TaskExecutor Main Thread

2020-05-07 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-17558:
-
Description: 
Partitions are released in the main thread of the TaskExecutor (see the 
stack trace below). This can lead to missed heartbeats, RPC timeouts, etc. 
because deleting files is blocking I/O. The partitions should be released in a 
dedicated I/O thread pool ({{TaskExecutor#ioExecutor}} is a candidate but 
requires a higher default thread count). 

{noformat}
2020-05-06T19:13:12.4383402Z "flink-akka.actor.default-dispatcher-35" #3555 prio=5 os_prio=0 tid=0x7f7fcc071000 nid=0x1f3f9 runnable [0x7f7fd302c000]
2020-05-06T19:13:12.4383983Z    java.lang.Thread.State: RUNNABLE
2020-05-06T19:13:12.4384519Z        at sun.nio.fs.UnixNativeDispatcher.unlink0(Native Method)
2020-05-06T19:13:12.4384971Z        at sun.nio.fs.UnixNativeDispatcher.unlink(UnixNativeDispatcher.java:146)
2020-05-06T19:13:12.4385465Z        at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:231)
2020-05-06T19:13:12.4386000Z        at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
2020-05-06T19:13:12.4386458Z        at java.nio.file.Files.delete(Files.java:1126)
2020-05-06T19:13:12.4386968Z        at org.apache.flink.runtime.io.network.partition.FileChannelBoundedData.close(FileChannelBoundedData.java:93)
2020-05-06T19:13:12.4388088Z        at org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartition.checkReaderReferencesAndDispose(BoundedBlockingSubpartition.java:247)
2020-05-06T19:13:12.4388765Z        at org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartition.release(BoundedBlockingSubpartition.java:208)
2020-05-06T19:13:12.4389444Z        - locked <0xff836d78> (a java.lang.Object)
2020-05-06T19:13:12.4389905Z        at org.apache.flink.runtime.io.network.partition.ResultPartition.release(ResultPartition.java:290)
2020-05-06T19:13:12.4390481Z        at org.apache.flink.runtime.io.network.partition.ResultPartitionManager.releasePartition(ResultPartitionManager.java:80)
2020-05-06T19:13:12.4391118Z        - locked <0x9d452b90> (a java.util.HashMap)
2020-05-06T19:13:12.4391597Z        at org.apache.flink.runtime.io.network.NettyShuffleEnvironment.releasePartitionsLocally(NettyShuffleEnvironment.java:153)
2020-05-06T19:13:12.4392267Z        at org.apache.flink.runtime.io.network.partition.TaskExecutorPartitionTrackerImpl.stopTrackingAndReleaseJobPartitions(TaskExecutorPartitionTrackerImpl.java:62)
2020-05-06T19:13:12.4392914Z        at org.apache.flink.runtime.taskexecutor.TaskExecutor.releaseOrPromotePartitions(TaskExecutor.java:776)
2020-05-06T19:13:12.4393366Z        at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
2020-05-06T19:13:12.4393813Z        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2020-05-06T19:13:12.4394257Z        at java.lang.reflect.Method.invoke(Method.java:498)
2020-05-06T19:13:12.4394693Z        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:279)
2020-05-06T19:13:12.4395202Z        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:199)
2020-05-06T19:13:12.4395686Z        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
2020-05-06T19:13:12.4396165Z        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$$Lambda$72/775020844.apply(Unknown Source)
2020-05-06T19:13:12.4396606Z        at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
2020-05-06T19:13:12.4397015Z        at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
2020-05-06T19:13:12.4397447Z        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
2020-05-06T19:13:12.4397874Z        at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
2020-05-06T19:13:12.4398414Z        at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
2020-05-06T19:13:12.4398879Z        at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
2020-05-06T19:13:12.4399321Z        at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
2020-05-06T19:13:12.4399737Z        at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
2020-05-06T19:13:12.4400138Z        at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
2020-05-06T19:13:12.4400552Z        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
2020-05-06T19:13:12.4400930Z        at akka.actor.ActorCell.invoke(ActorCell.scala:561)
2020-05-06T19:13:12.4401390Z        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
2020-05-06T19:13:12.4401763Z        at akka.dispatch.Mailbox.run(Mailbox.scala:225)
2020-05-06T19:13:12.4402135Z        at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
2020-05-06T19:13:12.4402540Z        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
2020-05-06T19:13:12.4402984Z        at 

[jira] [Created] (FLINK-17558) Partitions are released in TaskExecutor Main Thread

2020-05-07 Thread Gary Yao (Jira)
Gary Yao created FLINK-17558:


 Summary: Partitions are released in TaskExecutor Main Thread
 Key: FLINK-17558
 URL: https://issues.apache.org/jira/browse/FLINK-17558
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.11.0
Reporter: Gary Yao
 Fix For: 1.11.0


Partitions are released in the main thread of the TaskExecutor (see the 
stack trace below). This can lead to missed heartbeats, RPC timeouts, etc. 
because deleting files is blocking I/O. The partitions should be released in a 
dedicated I/O thread pool ({{TaskExecutor#ioExecutor}} is a candidate). 

{noformat}
2020-05-06T19:13:12.4383402Z "flink-akka.actor.default-dispatcher-35" #3555 prio=5 os_prio=0 tid=0x7f7fcc071000 nid=0x1f3f9 runnable [0x7f7fd302c000]
2020-05-06T19:13:12.4383983Z    java.lang.Thread.State: RUNNABLE
2020-05-06T19:13:12.4384519Z        at sun.nio.fs.UnixNativeDispatcher.unlink0(Native Method)
2020-05-06T19:13:12.4384971Z        at sun.nio.fs.UnixNativeDispatcher.unlink(UnixNativeDispatcher.java:146)
2020-05-06T19:13:12.4385465Z        at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:231)
2020-05-06T19:13:12.4386000Z        at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
2020-05-06T19:13:12.4386458Z        at java.nio.file.Files.delete(Files.java:1126)
2020-05-06T19:13:12.4386968Z        at org.apache.flink.runtime.io.network.partition.FileChannelBoundedData.close(FileChannelBoundedData.java:93)
2020-05-06T19:13:12.4388088Z        at org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartition.checkReaderReferencesAndDispose(BoundedBlockingSubpartition.java:247)
2020-05-06T19:13:12.4388765Z        at org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartition.release(BoundedBlockingSubpartition.java:208)
2020-05-06T19:13:12.4389444Z        - locked <0xff836d78> (a java.lang.Object)
2020-05-06T19:13:12.4389905Z        at org.apache.flink.runtime.io.network.partition.ResultPartition.release(ResultPartition.java:290)
2020-05-06T19:13:12.4390481Z        at org.apache.flink.runtime.io.network.partition.ResultPartitionManager.releasePartition(ResultPartitionManager.java:80)
2020-05-06T19:13:12.4391118Z        - locked <0x9d452b90> (a java.util.HashMap)
2020-05-06T19:13:12.4391597Z        at org.apache.flink.runtime.io.network.NettyShuffleEnvironment.releasePartitionsLocally(NettyShuffleEnvironment.java:153)
2020-05-06T19:13:12.4392267Z        at org.apache.flink.runtime.io.network.partition.TaskExecutorPartitionTrackerImpl.stopTrackingAndReleaseJobPartitions(TaskExecutorPartitionTrackerImpl.java:62)
2020-05-06T19:13:12.4392914Z        at org.apache.flink.runtime.taskexecutor.TaskExecutor.releaseOrPromotePartitions(TaskExecutor.java:776)
2020-05-06T19:13:12.4393366Z        at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
2020-05-06T19:13:12.4393813Z        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2020-05-06T19:13:12.4394257Z        at java.lang.reflect.Method.invoke(Method.java:498)
2020-05-06T19:13:12.4394693Z        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:279)
2020-05-06T19:13:12.4395202Z        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:199)
2020-05-06T19:13:12.4395686Z        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
2020-05-06T19:13:12.4396165Z        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$$Lambda$72/775020844.apply(Unknown Source)
2020-05-06T19:13:12.4396606Z        at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
2020-05-06T19:13:12.4397015Z        at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
2020-05-06T19:13:12.4397447Z        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
2020-05-06T19:13:12.4397874Z        at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
2020-05-06T19:13:12.4398414Z        at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
2020-05-06T19:13:12.4398879Z        at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
2020-05-06T19:13:12.4399321Z        at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
2020-05-06T19:13:12.4399737Z        at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
2020-05-06T19:13:12.4400138Z        at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
2020-05-06T19:13:12.4400552Z        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
2020-05-06T19:13:12.4400930Z        at akka.actor.ActorCell.invoke(ActorCell.scala:561)
2020-05-06T19:13:12.4401390Z        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
2020-05-06T19:13:12.4401763Z        at akka.dispatch.Mailbox.run(Mailbox.scala:225)
2020-05-06T19:13:12.4402135Z        at 

[jira] [Commented] (FLINK-17194) TPC-DS end-to-end test fails due to missing execution attempt

2020-05-06 Thread Gary Yao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101125#comment-17101125
 ] 

Gary Yao commented on FLINK-17194:
--

I am able to reliably reproduce this issue by setting {{akka.ask.timeout}} to 
{{5s}}, which should still be a generous timeout for a local Flink cluster.
The problem seems to be that releasing a partition is sometimes slow (due to 
file I/O), and this blocks the TaskExecutor's main thread. I have attached an 
example below. 


{noformat}
Run TPC-DS query 39b ...
{noformat}


{noformat}
2020-05-06 19:13:08,445 flink-akka.actor.default-dispatcher-35 DEBUG 
org.apache.flink.runtime.io.network.partition.ResultPartition [] - 
CsvTableSource(read fields: inv_date_sk, inv_item_sk, inv_warehouse_sk, 
inv_quantity_on_hand) -> 
SourceConversion(table=[default_catalog.default_database.inventory, source: 
[CsvTableSource(read fields: inv_date_sk, inv_item_sk
, inv_warehouse_sk, inv_quantity_on_hand)]], fields=[inv_date_sk, inv_item_sk, 
inv_warehouse_sk, inv_quantity_on_hand]) (3/4) 
(76d0879cdd3bdb851b44f8dbb5b30999): Releasing ResultPartition 
feb9262b7de50f164c061797ec01ba64#2@76d0879cdd3bdb851b44f8dbb5b30999 [BLOCKING, 
1 subpartitions].
2020-05-06 19:13:08,445 flink-akka.actor.default-dispatcher-35 DEBUG 
org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartition [] - 
Close 
org.apache.flink.runtime.io.network.partition.FileChannelBoundedData@201865e0
2020-05-06 19:13:17,771 flink-akka.actor.default-dispatcher-35 DEBUG 
org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartition [] - 
Closed 
org.apache.flink.runtime.io.network.partition.FileChannelBoundedData@201865e0
2020-05-06 19:13:17,771 flink-akka.actor.default-dispatcher-35 DEBUG 
org.apache.flink.runtime.io.network.partition.ResultPartition [] - 
CsvTableSource(read fields: inv_date_sk, inv_item_sk, inv_warehouse_sk, 
inv_quantity_on_hand) -> 
SourceConversion(table=[default_catalog.default_database.inventory, source: 
[CsvTableSource(read fields: inv_date_sk, inv_item_sk
, inv_warehouse_sk, inv_quantity_on_hand)]], fields=[inv_date_sk, inv_item_sk, 
inv_warehouse_sk, inv_quantity_on_hand]) (3/4) 
(76d0879cdd3bdb851b44f8dbb5b30999): Released ResultPartition 
feb9262b7de50f164c061797ec01ba64#2@76d0879cdd3bdb851b44f8dbb5b30999 [BLOCKING, 
1 subpartitions].
{noformat}

Note that it takes more than 9 seconds to release the partition. I have added 
additional debug prints.

I have also managed to invoke jstack at the right time on the TM process. The 
main thread is blocked on deleting {{FileChannelBoundedData#filePath}}.

{noformat}
2020-05-06T19:13:12.4383402Z "flink-akka.actor.default-dispatcher-35" #3555 prio=5 os_prio=0 tid=0x7f7fcc071000 nid=0x1f3f9 runnable [0x7f7fd302c000]
2020-05-06T19:13:12.4383983Z    java.lang.Thread.State: RUNNABLE
2020-05-06T19:13:12.4384519Z        at sun.nio.fs.UnixNativeDispatcher.unlink0(Native Method)
2020-05-06T19:13:12.4384971Z        at sun.nio.fs.UnixNativeDispatcher.unlink(UnixNativeDispatcher.java:146)
2020-05-06T19:13:12.4385465Z        at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:231)
2020-05-06T19:13:12.4386000Z        at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
2020-05-06T19:13:12.4386458Z        at java.nio.file.Files.delete(Files.java:1126)
2020-05-06T19:13:12.4386968Z        at org.apache.flink.runtime.io.network.partition.FileChannelBoundedData.close(FileChannelBoundedData.java:93)
2020-05-06T19:13:12.4388088Z        at org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartition.checkReaderReferencesAndDispose(BoundedBlockingSubpartition.java:247)
2020-05-06T19:13:12.4388765Z        at org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartition.release(BoundedBlockingSubpartition.java:208)
2020-05-06T19:13:12.4389444Z        - locked <0xff836d78> (a java.lang.Object)
2020-05-06T19:13:12.4389905Z        at org.apache.flink.runtime.io.network.partition.ResultPartition.release(ResultPartition.java:290)
2020-05-06T19:13:12.4390481Z        at org.apache.flink.runtime.io.network.partition.ResultPartitionManager.releasePartition(ResultPartitionManager.java:80)
2020-05-06T19:13:12.4391118Z        - locked <0x9d452b90> (a java.util.HashMap)
2020-05-06T19:13:12.4391597Z        at org.apache.flink.runtime.io.network.NettyShuffleEnvironment.releasePartitionsLocally(NettyShuffleEnvironment.java:153)
2020-05-06T19:13:12.4392267Z        at org.apache.flink.runtime.io.network.partition.TaskExecutorPartitionTrackerImpl.stopTrackingAndReleaseJobPartitions(TaskExecutorPartitionTrackerImpl.java:62)
2020-05-06T19:13:12.4392914Z        at org.apache.flink.runtime.taskexecutor.TaskExecutor.releaseOrPromotePartitions(TaskExecutor.java:776)
2020-05-06T19:13:12.4393366Z        at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
2020-05-06T19:13:12.4393813Z        at 

[jira] [Comment Edited] (FLINK-17485) Add a thread dump REST API

2020-05-06 Thread Gary Yao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100643#comment-17100643
 ] 

Gary Yao edited comment on FLINK-17485 at 5/6/20, 10:02 AM:


[~dixingx...@yeah.net] Is FLINK-14816 what you need?


was (Author: gjy):
Is FLINK-14816 what you need?

> Add a thread dump REST API
> --
>
> Key: FLINK-17485
> URL: https://issues.apache.org/jira/browse/FLINK-17485
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / REST
>Reporter: Xingxing Di
>Priority: Major
>
> My team built a stream-processing platform based on Flink for internal use at 
> our company.
> As the number of jobs and users grows, we spend a lot of time helping users 
> with troubleshooting.
> Currently we must log on to the server that runs the task manager, find the 
> right process through netstat -anp | grep "the flink data port", and then run 
> the jstack command.
> We think it would be very convenient if Flink provided a REST API for thread 
> dumps; web UI support would be even better.
> So we want to know:
>  * Whether the community is already working on this
>  * Whether this would be an appropriate feature (adding a REST API to dump 
> threads), since a thread dump may be "expensive"





[jira] [Commented] (FLINK-17485) Add a thread dump REST API

2020-05-06 Thread Gary Yao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100643#comment-17100643
 ] 

Gary Yao commented on FLINK-17485:
--

Is FLINK-14816 what you need?

> Add a thread dump REST API
> --
>
> Key: FLINK-17485
> URL: https://issues.apache.org/jira/browse/FLINK-17485
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / REST
>Reporter: Xingxing Di
>Priority: Major
>
> My team built a stream-processing platform based on Flink for internal use at 
> our company.
> As the number of jobs and users grows, we spend a lot of time helping users 
> with troubleshooting.
> Currently we must log on to the server that runs the task manager, find the 
> right process through netstat -anp | grep "the flink data port", and then run 
> the jstack command.
> We think it would be very convenient if Flink provided a REST API for thread 
> dumps; web UI support would be even better.
> So we want to know:
>  * Whether the community is already working on this
>  * Whether this would be an appropriate feature (adding a REST API to dump 
> threads), since a thread dump may be "expensive"





[jira] [Commented] (FLINK-13553) KvStateServerHandlerTest.readInboundBlocking unstable on Travis

2020-05-06 Thread Gary Yao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100568#comment-17100568
 ] 

Gary Yao commented on FLINK-13553:
--

In case it happens again, I added more debug information in 
c2540e44058a313a8dc7251dd5d37d2d52db2b44

> KvStateServerHandlerTest.readInboundBlocking unstable on Travis
> ---
>
> Key: FLINK-13553
> URL: https://issues.apache.org/jira/browse/FLINK-13553
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Queryable State
>Affects Versions: 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: pull-request-available, test-stability
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The {{KvStateServerHandlerTest.readInboundBlocking}} and 
> {{KvStateServerHandlerTest.testQueryExecutorShutDown}} fail on Travis with a 
> {{TimeoutException}}.
> https://api.travis-ci.org/v3/job/566420641/log.txt





[jira] [Closed] (FLINK-17501) Improve logging in AbstractServerHandler#channelRead(ChannelHandlerContext, Object)

2020-05-06 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao closed FLINK-17501.

Resolution: Fixed

master: c2540e44058a313a8dc7251dd5d37d2d52db2b44

> Improve logging in AbstractServerHandler#channelRead(ChannelHandlerContext, 
> Object)
> ---
>
> Key: FLINK-17501
> URL: https://issues.apache.org/jira/browse/FLINK-17501
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Queryable State
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> Improve logging in {{AbstractServerHandler#channelRead(ChannelHandlerContext, 
> Object)}}. If an Error is thrown, it should be logged as early as possible. 
> Currently we try to serialize and send an error response to the client before 
> logging the error; this can fail and mask the original exception.





[jira] [Created] (FLINK-17522) Document flink-jepsen Command Line Options

2020-05-05 Thread Gary Yao (Jira)
Gary Yao created FLINK-17522:


 Summary: Document flink-jepsen Command Line Options
 Key: FLINK-17522
 URL: https://issues.apache.org/jira/browse/FLINK-17522
 Project: Flink
  Issue Type: Improvement
  Components: Tests
Reporter: Gary Yao
Assignee: Gary Yao








[jira] [Created] (FLINK-17501) Improve logging in AbstractServerHandler#channelRead(ChannelHandlerContext, Object)

2020-05-04 Thread Gary Yao (Jira)
Gary Yao created FLINK-17501:


 Summary: Improve logging in 
AbstractServerHandler#channelRead(ChannelHandlerContext, Object)
 Key: FLINK-17501
 URL: https://issues.apache.org/jira/browse/FLINK-17501
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Queryable State
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.11.0


Improve logging in {{AbstractServerHandler#channelRead(ChannelHandlerContext, 
Object)}}. If an Error is thrown, it should be logged as early as possible. 
Currently we try to serialize and send an error response to the client before 
logging the error; this can fail and mask the original exception.
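
The ordering problem described above can be illustrated with a minimal sketch. It is not the actual AbstractServerHandler code: the class LogFirstHandler, its in-memory log list, and the sendErrorResponse callback are hypothetical stand-ins chosen so the example is self-contained. The point is only that the original throwable is recorded before any step that can itself fail.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for a server handler; not Flink's AbstractServerHandler.
final class LogFirstHandler {
    /** Collects log lines in order so the ordering can be inspected. */
    final List<String> log = new ArrayList<>();

    /**
     * Runs the request. On failure, the original throwable is logged first;
     * only then is the error response attempted, which may fail on its own.
     */
    void handle(Runnable request, Consumer<Throwable> sendErrorResponse) {
        try {
            request.run();
        } catch (Throwable original) {
            // Log before doing anything else that can throw.
            log.add("request failed: " + original);
            try {
                sendErrorResponse.accept(original);
            } catch (Throwable secondary) {
                // Attach the secondary failure instead of letting it replace the original.
                original.addSuppressed(secondary);
                log.add("could not send error response: " + secondary);
            }
        }
    }
}
```

With this ordering, a failure while serializing or sending the response shows up as a second log entry rather than hiding the first one.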





[jira] [Closed] (FLINK-17473) Remove unused classes ArchivedExecutionBuilder, ArchivedExecutionVertexBuilder, and ArchivedExecutionJobVertexBuilder

2020-05-04 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao closed FLINK-17473.

Resolution: Fixed

master: d385f20f607875a625df8b95917f7a8eaacea4a6

> Remove unused classes ArchivedExecutionBuilder, 
> ArchivedExecutionVertexBuilder, and ArchivedExecutionJobVertexBuilder
> -
>
> Key: FLINK-17473
> URL: https://issues.apache.org/jira/browse/FLINK-17473
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Coordination, Tests
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> Remove unused test classes {{ArchivedExecutionBuilder}}, 
> {{ArchivedExecutionVertexBuilder}}, and {{ArchivedExecutionJobVertexBuilder}}.





[jira] [Updated] (FLINK-17473) Remove unused classes ArchivedExecutionBuilder, ArchivedExecutionVertexBuilder, and ArchivedExecutionJobVertexBuilder

2020-05-04 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-17473:
-
Description: Remove unused test classes {{ArchivedExecutionBuilder}}, 
{{ArchivedExecutionVertexBuilder}}, and {{ArchivedExecutionJobVertexBuilder}}.  
(was: Remove unused classes {{ArchivedExecutionVertexBuilder}} and 
{{ArchivedExecutionJobVertexBuilder}})

> Remove unused classes ArchivedExecutionBuilder, 
> ArchivedExecutionVertexBuilder, and ArchivedExecutionJobVertexBuilder
> -
>
> Key: FLINK-17473
> URL: https://issues.apache.org/jira/browse/FLINK-17473
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Coordination, Tests
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> Remove unused test classes {{ArchivedExecutionBuilder}}, 
> {{ArchivedExecutionVertexBuilder}}, and {{ArchivedExecutionJobVertexBuilder}}.





[jira] [Updated] (FLINK-17473) Remove unused classes ArchivedExecutionBuilder, ArchivedExecutionVertexBuilder, and ArchivedExecutionJobVertexBuilder

2020-05-04 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-17473:
-
Summary: Remove unused classes ArchivedExecutionBuilder, 
ArchivedExecutionVertexBuilder, and ArchivedExecutionJobVertexBuilder  (was: 
Remove unused classes ArchivedExecutionBuilder, ArchivedExecutionVertexBuilder 
and ArchivedExecutionJobVertexBuilder)

> Remove unused classes ArchivedExecutionBuilder, 
> ArchivedExecutionVertexBuilder, and ArchivedExecutionJobVertexBuilder
> -
>
> Key: FLINK-17473
> URL: https://issues.apache.org/jira/browse/FLINK-17473
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Coordination, Tests
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> Remove unused classes {{ArchivedExecutionVertexBuilder}} and 
> {{ArchivedExecutionJobVertexBuilder}}





[jira] [Updated] (FLINK-17473) Remove unused classes ArchivedExecutionBuilder, ArchivedExecutionVertexBuilder and ArchivedExecutionJobVertexBuilder

2020-05-04 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-17473:
-
Summary: Remove unused classes ArchivedExecutionBuilder, 
ArchivedExecutionVertexBuilder and ArchivedExecutionJobVertexBuilder  (was: 
Remove unused classes ArchivedExecutionVertexBuilder and 
ArchivedExecutionJobVertexBuilder)

> Remove unused classes ArchivedExecutionBuilder, 
> ArchivedExecutionVertexBuilder and ArchivedExecutionJobVertexBuilder
> 
>
> Key: FLINK-17473
> URL: https://issues.apache.org/jira/browse/FLINK-17473
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Coordination, Tests
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> Remove unused classes {{ArchivedExecutionVertexBuilder}} and 
> {{ArchivedExecutionJobVertexBuilder}}





[jira] [Commented] (FLINK-13553) KvStateServerHandlerTest.readInboundBlocking unstable on Travis

2020-05-04 Thread Gary Yao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17098710#comment-17098710
 ] 

Gary Yao commented on FLINK-13553:
--

I am unable to reproduce this locally or on Azure. On Travis the builds are 
timing out consistently during the compilation stage.

> KvStateServerHandlerTest.readInboundBlocking unstable on Travis
> ---
>
> Key: FLINK-13553
> URL: https://issues.apache.org/jira/browse/FLINK-13553
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Queryable State
>Affects Versions: 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Assignee: Gary Yao
>Priority: Critical
>  Labels: pull-request-available, test-stability
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The {{KvStateServerHandlerTest.readInboundBlocking}} and 
> {{KvStateServerHandlerTest.testQueryExecutorShutDown}} fail on Travis with a 
> {{TimeoutException}}.
> https://api.travis-ci.org/v3/job/566420641/log.txt





[jira] [Assigned] (FLINK-13553) KvStateServerHandlerTest.readInboundBlocking unstable on Travis

2020-05-04 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao reassigned FLINK-13553:


Assignee: (was: Gary Yao)

> KvStateServerHandlerTest.readInboundBlocking unstable on Travis
> ---
>
> Key: FLINK-13553
> URL: https://issues.apache.org/jira/browse/FLINK-13553
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Queryable State
>Affects Versions: 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: pull-request-available, test-stability
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The {{KvStateServerHandlerTest.readInboundBlocking}} and 
> {{KvStateServerHandlerTest.testQueryExecutorShutDown}} fail on Travis with a 
> {{TimeoutException}}.
> https://api.travis-ci.org/v3/job/566420641/log.txt





[jira] [Comment Edited] (FLINK-13553) KvStateServerHandlerTest.readInboundBlocking unstable on Travis

2020-05-04 Thread Gary Yao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17098710#comment-17098710
 ] 

Gary Yao edited comment on FLINK-13553 at 5/4/20, 6:37 AM:
---

I am unable to reproduce this locally or on Azure. On Travis the builds are 
timing out consistently during the compilation stage. I have unassigned myself now 
since I am unable to make progress at the moment.


was (Author: gjy):
I am unable to reproduce this locally or on Azure. On Travis the builds are 
timing out consistently during the compilation stage.

> KvStateServerHandlerTest.readInboundBlocking unstable on Travis
> ---
>
> Key: FLINK-13553
> URL: https://issues.apache.org/jira/browse/FLINK-13553
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Queryable State
>Affects Versions: 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: pull-request-available, test-stability
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The {{KvStateServerHandlerTest.readInboundBlocking}} and 
> {{KvStateServerHandlerTest.testQueryExecutorShutDown}} fail on Travis with a 
> {{TimeoutException}}.
> https://api.travis-ci.org/v3/job/566420641/log.txt





[jira] [Created] (FLINK-17473) Remove unused classes ArchivedExecutionVertexBuilder and ArchivedExecutionJobVertexBuilder

2020-04-30 Thread Gary Yao (Jira)
Gary Yao created FLINK-17473:


 Summary: Remove unused classes ArchivedExecutionVertexBuilder and 
ArchivedExecutionJobVertexBuilder
 Key: FLINK-17473
 URL: https://issues.apache.org/jira/browse/FLINK-17473
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination, Tests
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.11.0


Remove unused classes {{ArchivedExecutionVertexBuilder}} and 
{{ArchivedExecutionJobVertexBuilder}}





[jira] [Updated] (FLINK-17473) Remove unused classes ArchivedExecutionVertexBuilder and ArchivedExecutionJobVertexBuilder

2020-04-30 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-17473:
-
Issue Type: Task  (was: Bug)

> Remove unused classes ArchivedExecutionVertexBuilder and 
> ArchivedExecutionJobVertexBuilder
> --
>
> Key: FLINK-17473
> URL: https://issues.apache.org/jira/browse/FLINK-17473
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Coordination, Tests
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Major
> Fix For: 1.11.0
>
>
> Remove unused classes {{ArchivedExecutionVertexBuilder}} and 
> {{ArchivedExecutionJobVertexBuilder}}





[jira] [Commented] (FLINK-17194) TPC-DS end-to-end test fails due to missing execution attempt

2020-04-29 Thread Gary Yao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095327#comment-17095327
 ] 

Gary Yao commented on FLINK-17194:
--

The root exception seems to be

{code}
java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException: Invocation of public abstract java.util.concurrent.CompletableFuture org.apache.flink.runtime.taskexecutor.TaskExecutorGateway.submitTask(org.apache.flink.runtime.deployment.TaskDeploymentDescriptor,org.apache.flink.runtime.jobmaster.JobMasterId,org.apache.flink.api.common.time.Time) timed out.
    at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:925) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:913) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
    at org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:227) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
    at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:888) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at akka.dispatch.OnComplete.internal(Future.scala:263) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at akka.dispatch.OnComplete.internal(Future.scala:261) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:644) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_242]
Caused by: java.util.concurrent.TimeoutException: Invocation of public abstract java.util.concurrent.CompletableFuture org.apache.flink.runtime.taskexecutor.TaskExecutorGateway.submitTask(org.apache.flink.runtime.deployment.TaskDeploymentDescriptor,org.apache.flink.runtime.jobmaster.JobMasterId,org.apache.flink.api.common.time.Time) timed out.
    at org.apache.flink.runtime.jobmaster.RpcTaskManagerGateway.submitTask(RpcTaskManagerGateway.java:72)
 

[jira] [Closed] (FLINK-16605) Add max limitation to the total number of slots

2020-04-29 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao closed FLINK-16605.

Resolution: Fixed

master:
026a2b6d8ed3aab5bc29d998ba6e585fa5b2d9ef
9e69b270c8b192876dae128541aa73ae6e788e2f
dcf9cc601f6ee1bb90a5d548043564a1a0522a25

> Add max limitation to the total number of slots
> ---
>
> Key: FLINK-16605
> URL: https://issues.apache.org/jira/browse/FLINK-16605
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Reporter: Yangze Guo
>Assignee: Yangze Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> As discussed in FLINK-15527 and FLINK-15959, we propose adding a maximum limit 
> on the total number of slots.
> Specifically:
> - Introduce a "cluster.number-of-slots.max" configuration option with default 
> value MAX_INT
> - Make the SlotManager respect the maximum number of slots; once the limit is 
> reached, it no longer allocates resources.
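
The proposal above can be sketched roughly as follows. This is an illustration under assumptions, not Flink's SlotManager: the class BoundedSlotAllocator and its methods are hypothetical, and the rule that a request exceeding the limit is refused as a whole is one possible reading of the description. Integer.MAX_VALUE plays the role of the proposed MAX_INT default.

```java
// Hypothetical sketch of a slot limit check; not Flink's SlotManager.
final class BoundedSlotAllocator {
    private final int maxSlots;
    private int allocatedSlots;

    /** A maxSlots of Integer.MAX_VALUE mirrors the proposed MAX_INT default. */
    BoundedSlotAllocator(int maxSlots) {
        if (maxSlots < 0) {
            throw new IllegalArgumentException("maxSlots must be non-negative");
        }
        this.maxSlots = maxSlots;
    }

    /** Allocates the requested slots, or refuses the whole request if the limit would be exceeded. */
    boolean tryAllocate(int requestedSlots) {
        if (requestedSlots < 0) {
            throw new IllegalArgumentException("requestedSlots must be non-negative");
        }
        // Compare against the remaining capacity to avoid int overflow near Integer.MAX_VALUE.
        if (requestedSlots > maxSlots - allocatedSlots) {
            return false;
        }
        allocatedSlots += requestedSlots;
        return true;
    }

    int getAllocatedSlots() {
        return allocatedSlots;
    }
}
```

Comparing the request with the remaining capacity, rather than adding first, keeps the check correct even when the limit is left at its default.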





[jira] [Assigned] (FLINK-17194) TPC-DS end-to-end test fails due to missing execution attempt

2020-04-28 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao reassigned FLINK-17194:


Assignee: Gary Yao

> TPC-DS end-to-end test fails due to missing execution attempt
> -
>
> Key: FLINK-17194
> URL: https://issues.apache.org/jira/browse/FLINK-17194
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.11.0
>Reporter: Chesnay Schepler
>Assignee: Gary Yao
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.11.0
>
>
> [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=7567=logs=c88eea3b-64a0-564d-0031-9fdcd7b8abee=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5]
> {code:java}
> org.apache.flink.runtime.jobmaster.ExecutionGraphException: The execution attempt d6bef26867c04f1c94903b06b60ec55f was not found.
>   at org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:389) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
>  {code}





[jira] [Updated] (FLINK-17369) Migrate RestartPipelinedRegionFailoverStrategyBuildingTest to PipelinedRegionComputeUtilTest

2020-04-28 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-17369:
-
Parent: FLINK-16430
Issue Type: Sub-task  (was: Task)

> Migrate RestartPipelinedRegionFailoverStrategyBuildingTest to 
> PipelinedRegionComputeUtilTest
> 
>
> Key: FLINK-17369
> URL: https://issues.apache.org/jira/browse/FLINK-17369
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.11.0
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> Tests in {{RestartPipelinedRegionFailoverStrategyBuildingTest}} are actually 
> testing the behavior of {{PipelinedRegionComputeUtil}}. Therefore, the tests 
> should be moved to a new class {{PipelinedRegionComputeUtilTest}}.





[jira] [Updated] (FLINK-17369) Migrate RestartPipelinedRegionFailoverStrategyBuildingTest to PipelinedRegionComputeUtilTest

2020-04-28 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-17369:
-
Issue Type: Task  (was: Bug)

> Migrate RestartPipelinedRegionFailoverStrategyBuildingTest to 
> PipelinedRegionComputeUtilTest
> 
>
> Key: FLINK-17369
> URL: https://issues.apache.org/jira/browse/FLINK-17369
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.11.0
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> Tests in {{RestartPipelinedRegionFailoverStrategyBuildingTest}} are actually 
> testing the behavior of {{PipelinedRegionComputeUtil}}. Therefore, the tests 
> should be moved to a new class {{PipelinedRegionComputeUtilTest}}.





[jira] [Assigned] (FLINK-13553) KvStateServerHandlerTest.readInboundBlocking unstable on Travis

2020-04-28 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao reassigned FLINK-13553:


Assignee: Gary Yao

> KvStateServerHandlerTest.readInboundBlocking unstable on Travis
> ---
>
> Key: FLINK-13553
> URL: https://issues.apache.org/jira/browse/FLINK-13553
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Queryable State
>Affects Versions: 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Assignee: Gary Yao
>Priority: Critical
>  Labels: pull-request-available, test-stability
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The {{KvStateServerHandlerTest.readInboundBlocking}} and 
> {{KvStateServerHandlerTest.testQueryExecutorShutDown}} fail on Travis with a 
> {{TimeoutException}}.
> https://api.travis-ci.org/v3/job/566420641/log.txt





[jira] [Closed] (FLINK-17180) Implement PipelinedRegion interface for SchedulingTopology

2020-04-24 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao closed FLINK-17180.

Resolution: Fixed

master:
23c13bbdbfa1b538a6c9e4e9622ef4563f69cd03
95b3c955f115dacb58b9695ae4192f729f5d5662
f9c23a0b86121d6361df403a05f75ba4b3902735

> Implement PipelinedRegion interface for SchedulingTopology
> --
>
> Key: FLINK-17180
> URL: https://issues.apache.org/jira/browse/FLINK-17180
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> Implement {{Topology#getAllPipelinedRegions()}} and 
> {{Topology#getPipelinedRegionOfVertex(ExecutionVertexID)}} in 
> {{DefaultExecutionTopology}} to enable retrieval of pipelined regions.
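
As a rough illustration of the two lookups named in the description, the sketch below keeps a list of regions plus a vertex-to-region map. It is not the DefaultExecutionTopology implementation: the class RegionIndex is hypothetical, vertices are plain String IDs instead of ExecutionVertexID, and the regions are assumed to be precomputed and passed in.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical index over precomputed pipelined regions; vertex IDs are plain strings here.
final class RegionIndex {
    private final List<Set<String>> regions = new ArrayList<>();
    private final Map<String, Set<String>> regionByVertex = new HashMap<>();

    RegionIndex(Collection<? extends Collection<String>> regionGroups) {
        for (Collection<String> group : regionGroups) {
            Set<String> region = Collections.unmodifiableSet(new LinkedHashSet<>(group));
            regions.add(region);
            for (String vertexId : region) {
                // Each vertex belongs to exactly one region.
                if (regionByVertex.put(vertexId, region) != null) {
                    throw new IllegalArgumentException("Vertex in more than one region: " + vertexId);
                }
            }
        }
    }

    /** Counterpart of getAllPipelinedRegions() in this sketch. */
    List<Set<String>> getAllPipelinedRegions() {
        return Collections.unmodifiableList(regions);
    }

    /** Counterpart of getPipelinedRegionOfVertex(ExecutionVertexID) in this sketch. */
    Set<String> getPipelinedRegionOfVertex(String vertexId) {
        Set<String> region = regionByVertex.get(vertexId);
        if (region == null) {
            throw new IllegalArgumentException("Unknown vertex: " + vertexId);
        }
        return region;
    }
}
```

Building the map once at construction makes the per-vertex lookup a constant-time operation, which seems a reasonable choice if regions are queried often.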





[jira] [Updated] (FLINK-17180) Implement PipelinedRegion interface for SchedulingTopology

2020-04-24 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-17180:
-
Description: Implement {{Topology#getAllPipelinedRegions()}} and 
{{Topology#getPipelinedRegionOfVertex(ExecutionVertexID)}} in 
{{DefaultExecutionTopology}} to enable retrieval of pipelined regions.

> Implement PipelinedRegion interface for SchedulingTopology
> --
>
> Key: FLINK-17180
> URL: https://issues.apache.org/jira/browse/FLINK-17180
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> Implement {{Topology#getAllPipelinedRegions()}} and 
> {{Topology#getPipelinedRegionOfVertex(ExecutionVertexID)}} in 
> {{DefaultExecutionTopology}} to enable retrieval of pipelined regions.





[jira] [Commented] (FLINK-17328) Expose network metric for job vertex in rest api

2020-04-24 Thread Gary Yao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091368#comment-17091368
 ] 

Gary Yao commented on FLINK-17328:
--

I assigned the issue to you, but I cannot promise a timely review at the moment.

> Expose network metric for job vertex in rest api
> 
>
> Key: FLINK-17328
> URL: https://issues.apache.org/jira/browse/FLINK-17328
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Metrics, Runtime / REST
>Reporter: lining
>Assignee: lining
>Priority: Major
>
> JobVertexDetailsHandler
>  * pool usage: outPoolUsageAvg, inputExclusiveBuffersUsageAvg, 
> inputFloatingBuffersUsageAvg
>  * back-pressured: shows whether the vertex is back-pressured (merged across all 
> of its subtasks)
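
One way to read "merge all subtasks" is sketched below. The issue does not specify the merge rule, so both choices here are assumptions: the usage value is averaged over subtasks, and the vertex counts as back-pressured if any subtask is. The class VertexNetworkSummary and its merge method are hypothetical and unrelated to the real JobVertexDetailsHandler.

```java
import java.util.List;

// Hypothetical per-vertex summary built from per-subtask values.
final class VertexNetworkSummary {
    final double outPoolUsageAvg;
    final boolean backPressured;

    private VertexNetworkSummary(double outPoolUsageAvg, boolean backPressured) {
        this.outPoolUsageAvg = outPoolUsageAvg;
        this.backPressured = backPressured;
    }

    /**
     * Merges subtask values: usage is averaged (0.0 for an empty list), and the
     * vertex is flagged as back-pressured if at least one subtask is (assumed rule).
     */
    static VertexNetworkSummary merge(List<Double> outPoolUsagePerSubtask,
                                      List<Boolean> backPressuredPerSubtask) {
        double avg = outPoolUsagePerSubtask.stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(0.0);
        boolean any = backPressuredPerSubtask.stream().anyMatch(Boolean::booleanValue);
        return new VertexNetworkSummary(avg, any);
    }
}
```

The same averaging would apply to inputExclusiveBuffersUsageAvg and inputFloatingBuffersUsageAvg if they are collected per subtask in the same way.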




