Hi,

from your screenshot, it looks like everything is running fine once the 
snapshots are actually running; the sync and async part times are normal. So I 
think the explanation is the time that the checkpoint barrier needs to reach 
this particular operator. It seems like there is a large queue of events in the 
buffers of that operator, in front of the barrier, and/or the operator is very 
slow at processing events. Given the breakdown at hand, the time must be spent 
between the triggering of the checkpoint and the point where the barrier 
reaches the operator that lags behind.
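
That would also explain why the measured alignment, sync, and async times look 
fine: those timers only start once a barrier has reached the operator. While 
you track down the backpressure, here is a rough sketch of how to keep 
checkpoints from piling up behind a slow one (the interval and pause values are 
only illustrative assumptions, and I am assuming the DataStream API):

    import org.apache.flink.streaming.api.environment.CheckpointConfig;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.enableCheckpointing(10 * 60 * 1000L);           // trigger a checkpoint every 10 minutes (illustrative)

    CheckpointConfig cfg = env.getCheckpointConfig();
    cfg.setMaxConcurrentCheckpoints(1);                 // keep the default: only one checkpoint in flight
    cfg.setMinPauseBetweenCheckpoints(5 * 60 * 1000L);  // let the backlog drain before the next barrier is injected

This does not remove the queue in front of the barrier; it only avoids several 
slow checkpoints running back to back while you investigate.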

Best,
Stefan

> On 18.09.2018, at 09:39, Renjie Liu <liurenjie2...@gmail.com> wrote:
> 
> I think what's weird is that none of the three stages (alignment, sync cp, 
> async cp) takes much time.
> 
> On Tue, Sep 18, 2018 at 3:20 PM Till Rohrmann <trohrm...@apache.org 
> <mailto:trohrm...@apache.org>> wrote:
> This behavior seems very odd, Julio. Could you indeed share the debug logs of 
> all Flink processes in order to see why things are taking so long?
> 
> The checkpoint size of task #8 is twice as big as the second biggest 
> checkpoint. But that alone should not cause the checkpoint time to increase 
> by a factor of 8.
> 
> Cheers,
> Till
> 
> On Mon, Sep 17, 2018 at 5:25 AM Renjie Liu <liurenjie2...@gmail.com 
> <mailto:liurenjie2...@gmail.com>> wrote:
> Hi, Julio:
> Does this happen frequently? Which state backend do you use? The async and 
> sync checkpoint durations seem normal compared to the others; it looks like 
> most of the time is spent acking the checkpoint.
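> 
> If it turns out to be a heap or filesystem backend and that operator's state 
> is large, one thing worth trying (just a sketch under that assumption; the 
> HDFS path below is made up) is RocksDB with incremental checkpoints, so only 
> the changed state has to be written out before the ack:
> 
>     // Requires the flink-statebackend-rocksdb dependency on the classpath.
>     import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
>     import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
> 
>     StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>     // The second argument enables incremental checkpoints: only the delta
>     // since the previous checkpoint is uploaded to HDFS.
>     env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));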
> 
> On Sun, Sep 16, 2018 at 9:24 AM vino yang <yanghua1...@gmail.com 
> <mailto:yanghua1...@gmail.com>> wrote:
> Hi Julio,
> 
> Yes, fifty-five minutes does seem really long. 
> However, it appears proportional to the time and size of the adjacent task 
> shown in the diagram. 
> I think the real question for your application is why Flink accesses HDFS so 
> slowly. 
> You can enable DEBUG logging to see if you can find any clues, or post the 
> logs to the mailing list so others can help you analyze the problem.
> 
> Thanks, vino.
> 
> Julio Biason <julio.bia...@azion.com <mailto:julio.bia...@azion.com>> wrote 
> on Sat, Sep 15, 2018 at 7:03 AM:
> (Just an addendum: although it's not a huge problem -- we can always increase 
> the checkpoint timeout -- this anomalous situation makes me think there is 
> something wrong in our pipeline or in our cluster, and that is what is 
> making the checkpoint creation go crazy.)
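> 
> The timeout knob I mean is roughly this (assuming the DataStream API; the 
> 90-minute value is only an example):
> 
>     StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>     // Allow up to 90 minutes before an in-flight checkpoint is declared expired.
>     env.getCheckpointConfig().setCheckpointTimeout(90 * 60 * 1000L);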
> 
> On Fri, Sep 14, 2018 at 8:00 PM, Julio Biason <julio.bia...@azion.com 
> <mailto:julio.bia...@azion.com>> wrote:
> Hey guys,
> 
> In our pipeline, we have a single slot that is taking longer to create the 
> checkpoint than the other slots, and we are wondering what could be causing 
> it.
> 
> The operator in question is the window metric -- the only element in the 
> pipeline that actually uses state. While the other slots take 7 mins to 
> create the checkpoint, this one -- and only this one -- takes 55 mins.
> 
> Is there something I should look at to understand what's going on?
> 
> (We are storing all checkpoints in HDFS, in case that helps.)
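> 
> Roughly, the checkpointing setup looks like this (the backend class and the 
> path here are simplified stand-ins rather than our exact configuration):
> 
>     import org.apache.flink.runtime.state.filesystem.FsStateBackend;
>     import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
> 
>     StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>     env.enableCheckpointing(5 * 60 * 1000L);  // illustrative interval
>     // State snapshots are written into this HDFS directory; the ack back to
>     // the JobManager then only carries the resulting state handles.
>     env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));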
> 
> -- 
> Julio Biason, Software Engineer
> AZION  |  Deliver. Accelerate. Protect.
> Office: +55 51 3083 8101  |  Mobile: +55 51 99907 0554
> 
> 
> 
> -- 
> Julio Biason, Software Engineer
> AZION  |  Deliver. Accelerate. Protect.
> Office: +55 51 3083 8101  |  Mobile: +55 51 99907 0554
> -- 
> Liu, Renjie
> Software Engineer, MVAD
> -- 
> Liu, Renjie
> Software Engineer, MVAD
