Re: apache flink: Why checkpoint coordinator takes long time to get completion

2019-07-23 Thread Xiangyu Su
Hi Zili,

Here are the release notes for 1.8.1:
https://flink.apache.org/news/2019/07/02/release-1.8.1.html
However, I could not find any ticket related to the "unexpected time-consuming"
behavior. I have just tested our application with both versions: the issue is
reproducible every time with version 1.8.0, and so far it has not occurred
with version 1.8.1.

Best regards
Xiangyu


Re: apache flink: Why checkpoint coordinator takes long time to get completion

2019-07-23 Thread Zili Chen
Hi Xiangyu,

Could you share the corresponding JIRA that fixed this issue?

Best,
tison.




Re: apache flink: Why checkpoint coordinator takes long time to get completion

2019-07-19 Thread Xiangyu Su
By the way, it seems this issue has been fixed in 1.8.1.



-- 
Xiangyu Su
Java Developer
xian...@smaato.com

Smaato Inc.
San Francisco - New York - Hamburg - Singapore
www.smaato.com

Germany:
Valentinskamp 70, Emporio, 19th Floor
20355 Hamburg
M 0049(176)22943076

The information contained in this communication may be CONFIDENTIAL and is
intended only for the use of the recipient(s) named above. If you are not
the intended recipient, you are hereby notified that any dissemination,
distribution, or copying of this communication, or any of its contents, is
strictly prohibited. If you have received this communication in error,
please notify the sender and delete/destroy the original message and any
copy of it from your computer or paper files.


Re: apache flink: Why checkpoint coordinator takes long time to get completion

2019-07-19 Thread Xiangyu Su
Ok, thanks.

So far, this delay has always appeared after the 3rd checkpoint, and its
length has been consistent (~4 min with under 4G/min of incoming traffic).


-- 
Xiangyu Su


Re: apache flink: Why checkpoint coordinator takes long time to get completion

2019-07-19 Thread Biao Liu
Hi Xiangyu,

I just took a glance at the relevant code. There is a gap between
calculating the duration and logging it. I guess checkpoint 4 finished
in about 1 minute, but an unexpected time-consuming operation ran during
that gap. I can't tell which part it is, though.
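
To illustrate the kind of gap described above, here is a toy sketch in Python.
It is not Flink code: FakeClock, complete_checkpoint and the 300000 ms slow
step are hypothetical names and values chosen only for illustration. It shows
how a duration measured before a slow step can look short even though the
log line that reports it appears much later.

```python
class FakeClock:
    """Deterministic stand-in for a wall clock (hypothetical, not a Flink class)."""

    def __init__(self, now_ms=0):
        self.now_ms = now_ms

    def advance(self, ms):
        self.now_ms += ms


def complete_checkpoint(clock, trigger_ms, slow_step_ms):
    # The duration is fixed at this point, before any later work happens.
    duration_ms = clock.now_ms - trigger_ms
    # Hypothetical slow operation between measuring and logging.
    clock.advance(slow_step_ms)
    # The "Completed checkpoint" line would only be written now.
    logged_at_ms = clock.now_ms
    return duration_ms, logged_at_ms


clock = FakeClock()
trigger_ms = clock.now_ms
clock.advance(58645)  # the checkpoint itself, using the duration from the log
duration_ms, logged_at_ms = complete_checkpoint(clock, trigger_ms, slow_step_ms=300000)
print(f"reported duration: {duration_ms} ms, "
      f"logged {logged_at_ms - trigger_ms} ms after trigger")
```

In this toy run the reported duration stays at 58645 ms while the log line
appears 358645 ms after the trigger, which is the same shape as the symptom
in the job manager log.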




apache flink: Why checkpoint coordinator takes long time to get completion

2019-07-19 Thread Xiangyu Su
Dear flink community,

We are running a POC with Flink (1.8) to process data in real time, using
global checkpointing (S3) and local checkpointing (EBS), with the cluster
deployed on EKS. Our application consumes data from Kinesis.

For my test I am using a checkpointing interval of 5 min and a minimum pause
of 2 min.

The issue we see is that the Flink checkpointing process seems to be idle
for 3-4 min before the job manager gets the completion notification.

Here is some logging from the job manager:

2019-07-10 11:59:03,893 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator -
Triggering checkpoint 4 @ 1562759941082 for job
e7a97014f5799458f1c656135712813d.
2019-07-10 12:05:01,836 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator -
Completed checkpoint 4 for job e7a97014f5799458f1c656135712813d
(22387207650 bytes in 58645 ms).

As I understand the logging above, the
completedCheckpoint (CheckpointCoordinator)
object was completed in 58645 ms, but the whole checkpointing process
took ~6 min.
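
The mismatch can be quantified directly from the two log lines. The following
is a minimal Python sketch that uses only the standard library; the helper
names (log_time, checkpoint_gap_ms) are made up for this example, and it
assumes both lines carry timestamps in the same time zone and in the
"yyyy-MM-dd HH:mm:ss,SSS" layout shown above.

```python
import re
from datetime import datetime, timedelta

# The two job manager log lines quoted above, joined back into single lines.
LOG_LINES = [
    "2019-07-10 11:59:03,893 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - "
    "Triggering checkpoint 4 @ 1562759941082 for job e7a97014f5799458f1c656135712813d.",
    "2019-07-10 12:05:01,836 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - "
    "Completed checkpoint 4 for job e7a97014f5799458f1c656135712813d "
    "(22387207650 bytes in 58645 ms).",
]

TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def log_time(line):
    """Parse the leading 23-character timestamp of a log line."""
    return datetime.strptime(line[:23], TS_FORMAT)


def checkpoint_gap_ms(trigger_line, completed_line):
    """Return (wall-clock ms between the two log lines, reported ms, difference)."""
    wall_ms = (log_time(completed_line) - log_time(trigger_line)) // timedelta(milliseconds=1)
    reported_ms = int(re.search(r"in (\d+) ms", completed_line).group(1))
    return wall_ms, reported_ms, wall_ms - reported_ms


wall_ms, reported_ms, unaccounted_ms = checkpoint_gap_ms(*LOG_LINES)
print(f"wall clock: {wall_ms} ms, reported: {reported_ms} ms, "
      f"unaccounted: {unaccounted_ms} ms")
```

For these two lines the wall-clock distance is 357943 ms against a reported
58645 ms, leaving 299298 ms (roughly 5 min) not covered by the reported
duration.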

This logging is for the 4th checkpoint; the first 3 checkpoints
finished on time.
Could you please tell me why Flink checkpointing in my test starts to
"idle" for a few minutes after the first 3 checkpoints?

Best Regards
-- 
Xiangyu Su