Re: apache flink: Why checkpoint coordinator takes long time to get completion
Hi Zili,

Here are the release notes for 1.8.1: https://flink.apache.org/news/2019/07/02/release-1.8.1.html

But I could not find any ticket related to the "unexpected time-consuming" operation. I have just tested our application with both versions: the issue is reproducible every time with version 1.8.0, and so far it has not happened with version 1.8.1.

Best regards
Xiangyu

On Tue, 23 Jul 2019 at 08:49, Zili Chen wrote:
> Hi Xiangyu,
>
> Could you share the corresponding JIRA that fixed this issue?
>
> Best,
> tison.
Re: apache flink: Why checkpoint coordinator takes long time to get completion
Hi Xiangyu,

Could you share the corresponding JIRA that fixed this issue?

Best,
tison.

On Fri, 19 Jul 2019 at 20:47, Xiangyu Su wrote:
> btw. it seems like this issue has been fixed in 1.8.1
Re: apache flink: Why checkpoint coordinator takes long time to get completion
btw. it seems like this issue has been fixed in 1.8.1

On Fri, 19 Jul 2019 at 12:21, Xiangyu Su wrote:
> Ok, thanks.
>
> and this time-consuming until now always happens after 3rd checkpointing,
> and this unexpected time-consuming was always consistent (~ 4 min by under
> 4G/min incoming traffic).

--
Xiangyu Su
Java Developer
xian...@smaato.com

Smaato Inc.
San Francisco - New York - Hamburg - Singapore
www.smaato.com

Germany:
Valentinskamp 70, Emporio, 19th Floor
20355 Hamburg
M 0049(176)22943076

The information contained in this communication may be CONFIDENTIAL and is intended only for the use of the recipient(s) named above. If you are not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication, or any of its contents, is strictly prohibited. If you have received this communication in error, please notify the sender and delete/destroy the original message and any copy of it from your computer or paper files.
Re: apache flink: Why checkpoint coordinator takes long time to get completion
Ok, thanks.

So far, this unexpected delay always happens after the 3rd checkpoint, and its length has been consistent every time (~4 min, with under 4G/min of incoming traffic).

On Fri, 19 Jul 2019 at 11:06, Biao Liu wrote:
> Hi Xiangyu,
>
> Just took a glance at the relevant codes. There is a gap between
> calculating the duration and logging it out. I guess the checkpoint 4 is
> finished in 1 minute, but there is an unexpected time-consuming operation
> during that time. But I can't tell which part it is.

--
Xiangyu Su
Re: apache flink: Why checkpoint coordinator takes long time to get completion
Hi Xiangyu,

I just took a glance at the relevant code. There is a gap between calculating the duration and logging it. My guess is that checkpoint 4 finished within about 1 minute, but some unexpected time-consuming operation happened in between. I can't tell which part it is, though.

On Fri, 19 Jul 2019 at 16:14, Xiangyu Su wrote:
> here is some logging from job manager:
>
> 2019-07-10 11:59:03,893 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 4 @ 1562759941082 for job e7a97014f5799458f1c656135712813d.
> 2019-07-10 12:05:01,836 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 4 for job e7a97014f5799458f1c656135712813d (22387207650 bytes in 58645 ms).
>
> As my understanding the logging above, the completedCheckpoint(CheckpointCoordinator)
> object has been completed in 58645 ms, but the whole checkpointing process
> took ~ 6min.
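Biao's reading can be checked with plain arithmetic on the two job manager log lines from the original question. The Java sketch below is an illustration only, not Flink code: the class and method names are hypothetical, the regular expressions match just the two line shapes shown in this thread, and it assumes (1) the log timestamps are in UTC and (2) the reported "in N ms" value is measured from the trigger timestamp.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: compares the wall-clock time between "Triggering checkpoint N @ <ts>"
// and the "Completed checkpoint N" log line with the duration that line reports.
final class CheckpointLogGap {

    // Timestamp layout of the job manager log lines, e.g. "2019-07-10 12:05:01,836".
    private static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Captures the checkpoint id and the trigger timestamp (epoch millis).
    private static final Pattern TRIGGER =
            Pattern.compile("Triggering checkpoint (\\d+) @ (\\d+)");

    // Captures the log timestamp, checkpoint id, state size and reported duration.
    private static final Pattern COMPLETED = Pattern.compile(
            "^(\\S+ \\S+) .*Completed checkpoint (\\d+) .*\\((\\d+) bytes in (\\d+) ms\\)");

    static final class Result {
        final long wallClockMillis; // trigger timestamp -> "Completed" log line
        final long reportedMillis;  // the "in N ms" value from the log line

        Result(long wallClockMillis, long reportedMillis) {
            this.wallClockMillis = wallClockMillis;
            this.reportedMillis = reportedMillis;
        }

        // Time that passed before the completion was logged but is not in the reported duration.
        long unreportedMillis() {
            return wallClockMillis - reportedMillis;
        }
    }

    private CheckpointLogGap() {}

    static Result analyze(String triggerLine, String completedLine) {
        Matcher t = TRIGGER.matcher(triggerLine);
        Matcher c = COMPLETED.matcher(completedLine);
        if (!t.find() || !c.find()) {
            throw new IllegalArgumentException("unrecognized log line");
        }
        if (!t.group(1).equals(c.group(2))) {
            throw new IllegalArgumentException("log lines refer to different checkpoints");
        }
        Instant triggered = Instant.ofEpochMilli(Long.parseLong(t.group(2)));
        // ASSUMPTION: log timestamps are UTC; use the job manager's actual zone otherwise.
        Instant logged = LocalDateTime.parse(c.group(1), LOG_TIME).toInstant(ZoneOffset.UTC);
        long wallClock = Duration.between(triggered, logged).toMillis();
        return new Result(wallClock, Long.parseLong(c.group(4)));
    }
}
```

For the checkpoint 4 lines, 360,754 ms pass between the trigger timestamp (11:59:01.082 UTC) and the "Completed checkpoint" line, of which only 58,645 ms are reported, so `unreportedMillis()` is 302,109 ms, roughly 5 minutes. The UTC assumption is at least consistent with the trigger line itself being logged at 11:59:03,893, about 2.8 s after its own trigger timestamp.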
apache flink: Why checkpoint coordinator takes long time to get completion
Dear Flink community,

We are running a POC with Flink (1.8) to process data in real time, using global checkpointing (S3) and local checkpointing (EBS), with the cluster deployed on EKS. Our application consumes data from Kinesis.

In my test, for example, I am using a checkpointing interval of 5 min and a minimum pause of 2 min.

The issue we saw is that the Flink checkpointing process seems to be idle for 3-4 min before the job manager gets the completion notification.

Here is some logging from the job manager:

2019-07-10 11:59:03,893 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 4 @ 1562759941082 for job e7a97014f5799458f1c656135712813d.
2019-07-10 12:05:01,836 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 4 for job e7a97014f5799458f1c656135712813d (22387207650 bytes in 58645 ms).

As I understand the logging above, the completedCheckpoint (CheckpointCoordinator) object was completed in 58645 ms, but the whole checkpointing process took ~6 min.

This logging is for the 4th checkpoint; the first 3 checkpoints finished on time. Could you please tell me why Flink checkpointing in my test starts "idling" for a few minutes after 3 checkpoints?

Best regards
--
Xiangyu Su
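In Flink 1.8 the two settings from the question are set on the job itself, via `StreamExecutionEnvironment.enableCheckpointing(...)` (interval) and `CheckpointConfig.setMinPauseBetweenCheckpoints(...)` (minimum pause). How they interact can be sketched with a deliberately simplified model, which is an assumption and not Flink's actual scheduler: at most one checkpoint is in flight, and the next one starts no earlier than one interval after the previous trigger and no earlier than the minimum pause after the previous checkpoint was considered complete. The class name is hypothetical.

```java
import java.time.Duration;

// Hypothetical, simplified model of checkpoint scheduling (not Flink's CheckpointCoordinator):
// one checkpoint in flight at a time, periodic triggers every `interval`, and a
// `minPause` that must elapse after the previous checkpoint completes.
final class NextCheckpointModel {

    private NextCheckpointModel() {}

    // Earliest epoch-millis time at which the next checkpoint may be triggered.
    static long earliestNextTrigger(long lastTriggerMillis, long lastCompletionMillis,
                                    Duration interval, Duration minPause) {
        if (lastCompletionMillis < lastTriggerMillis) {
            throw new IllegalArgumentException("completion precedes trigger");
        }
        long nextPeriodicTick = lastTriggerMillis + interval.toMillis(); // regular cadence
        long pauseEnds = lastCompletionMillis + minPause.toMillis();     // pause after completion
        return Math.max(nextPeriodicTick, pauseEnds);
    }
}
```

Plugging in checkpoint 4 (triggered at 1562759941082) with a 5 min interval and 2 min pause: if completion is counted 58,645 ms after the trigger, the model returns the regular tick at trigger + 300,000 ms; if it is counted at the "Completed checkpoint" log line (trigger + 360,754 ms), the next checkpoint cannot start before trigger + 480,754 ms, about 8 minutes after checkpoint 4 was triggered. Which completion time Flink 1.8 actually uses is not established in this thread.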