[GitHub] [flink] HuangZhenQiu commented on pull request #11541: [FLINK-15416][network] add task manager netty client retry mechenism
HuangZhenQiu commented on pull request #11541: URL: https://github.com/apache/flink/pull/11541#issuecomment-647147788 @flinkbot run azure This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] HuangZhenQiu commented on pull request #11541: [FLINK-15416][network] add task manager netty client retry mechenism
HuangZhenQiu commented on pull request #11541: URL: https://github.com/apache/flink/pull/11541#issuecomment-647147774 @flinkbot run travis This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] HuangZhenQiu commented on pull request #11541: [FLINK-15416][network] add task manager netty client retry mechenism
HuangZhenQiu commented on pull request #11541: URL: https://github.com/apache/flink/pull/11541#issuecomment-647079845 @flinkbot run travis This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] HuangZhenQiu commented on pull request #11541: [FLINK-15416][network] add task manager netty client retry mechenism
HuangZhenQiu commented on pull request #11541: URL: https://github.com/apache/flink/pull/11541#issuecomment-647079689 @flinkbot run azure This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] HuangZhenQiu commented on pull request #11541: [FLINK-15416][network] add task manager netty client retry mechenism
HuangZhenQiu commented on pull request #11541: URL: https://github.com/apache/flink/pull/11541#issuecomment-646864181 @pnowojski @rkhachatryan I think what @rkhachatryan proposed is the most concise and scalable solution. I rebased master with some small changes. Please take a look. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] HuangZhenQiu commented on pull request #11541: [FLINK-15416][network] add task manager netty client retry mechenism
HuangZhenQiu commented on pull request #11541: URL: https://github.com/apache/flink/pull/11541#issuecomment-646809118 @rkhachatryan I think the solution you proposed it much better. Let me update it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] HuangZhenQiu commented on pull request #11541: [FLINK-15416][network] add task manager netty client retry mechenism
HuangZhenQiu commented on pull request #11541: URL: https://github.com/apache/flink/pull/11541#issuecomment-646736438 @pnowojski If different threads are going to request PartitionRequestClient for different connectionId, there is no resource conflicts. Thus, they can parallel run. But if more than two threads try to build PartitionRequestClient for the same connectionId, as @rkhachatryan pointed out the deadlock will happen. @pnowojski Would you please help to trigger to Azure test? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] HuangZhenQiu commented on pull request #11541: [FLINK-15416][network] add task manager netty client retry mechenism
HuangZhenQiu commented on pull request #11541: URL: https://github.com/apache/flink/pull/11541#issuecomment-646733511 @rkhachatryan Thanks for giving feedback on the PR. I think the PR is to tackle different things compare to taskmanager.network.request-backoff.xxx. taskmanager.network.request-backoff.xxx is translated into delayMs in NettyPartitionRequestClient to request sub Partition. It is after the TCP connection is built. The retry added here is to make TCP connection creation more reliable. Did you test the latest version that has added synchronized on connectionID. It works locally from my side now. The Azure test has been not triggered it yet. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] HuangZhenQiu commented on pull request #11541: [FLINK-15416][network] add task manager netty client retry mechenism
HuangZhenQiu commented on pull request #11541: URL: https://github.com/apache/flink/pull/11541#issuecomment-643845760 @pnowojski Yes, you are right. It is deadline for creating channel clients for the same connection id. To achieve a parallel connection building process for different connection id and also a synchronized way for the requests to the same connection id, I chose to synchronize connectionId in the createPartitionRequestClient function. It can guarantee the correctness with scarifying the scalability of the connection creation. Actually, we don't need to even use concurrent map. How do you think? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] HuangZhenQiu commented on pull request #11541: [FLINK-15416][network] add task manager netty client retry mechenism
HuangZhenQiu commented on pull request #11541: URL: https://github.com/apache/flink/pull/11541#issuecomment-642838317 @pnowojski Oh, I see. I didn't check the class level to ignore. how about putting ignore only in the method testResourceReleaseAfterInterruptedConnect? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] HuangZhenQiu commented on pull request #11541: [FLINK-15416][network] add task manager netty client retry mechenism
HuangZhenQiu commented on pull request #11541: URL: https://github.com/apache/flink/pull/11541#issuecomment-638957276 @pnowojski For your comments. 1) There is no test that is marked to ignore by this PR. Would you please double-check? 2) Rebased master again. 3) Add join logic to wait for all of the threads successfully executed. I will try my best to make the PR needs the requirement so that it will use less time to merge it later. As it is needed for our internal release, hopefully, it can be merged soon. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] HuangZhenQiu commented on pull request #11541: [FLINK-15416][network] add task manager netty client retry mechenism
HuangZhenQiu commented on pull request #11541: URL: https://github.com/apache/flink/pull/11541#issuecomment-638588430 @pnowojski @zhijiangW I agree that the global lock will be concise in terms of implementation. But it will be big change in the class, I also prefer to leave it as a separate PR. According to your suggestions, I added a multiple thread test case. Please take a look when you have time. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org