Hey guys, Stefan,
Yeah, sorry about the stacks. Completely forgot about them.
But I think we figured out why it's taking so long (and yeah, Stefan was
right from the start): This specific slot is receiving 5x more records than
any other slot (on a recent run, it had 10x more records than the seco
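A minimal sketch of how the skew described above could be quantified, assuming per-slot record counts have already been read off the Flink web UI or metrics. The slot names and counts below are hypothetical, and `skew_report` is an illustrative helper, not a Flink API:

```python
def skew_report(records_per_slot):
    """Return (busiest slot, ratio of its record count to the runner-up's)."""
    if len(records_per_slot) < 2:
        raise ValueError("need at least two slots to compare")
    # Rank slots by record count, largest first.
    ranked = sorted(records_per_slot.items(), key=lambda kv: kv[1], reverse=True)
    (top_slot, top_count), (_, second_count) = ranked[0], ranked[1]
    if second_count == 0:
        return top_slot, float("inf")
    return top_slot, top_count / second_count


# Hypothetical counts, for illustration only.
slot, ratio = skew_report({"slot-3": 90, "slot-8": 1000, "slot-0": 100})
print(f"{slot} received {ratio:.1f}x the records of the next busiest slot")
```

A ratio far above 1 points at a hot key (or a small set of hot keys) rather than at the storage layer.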
A debug log for state backend and checkpoint coordinator could also help.
> On 20.09.2018 at 14:19, Stefan Richter wrote:
>
> Hi,
>
> if some tasks take like 50 minutes, could you wait until such a checkpoint is
> in progress and (let’s say after 10 minutes) log into the node and create a
>
Hi,
if some tasks take around 50 minutes, could you wait until such a checkpoint is
in progress and (let’s say after 10 minutes) log into the node and create one
or more thread dumps over time for the JVM that runs the slow checkpointing
task? This could help to figure out where it is stuck
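One way to collect several thread dumps over time is sketched below. It assumes the JDK's `jstack` tool is on the PATH of the node and that the PID of the TaskManager JVM is already known; `collect_dumps` and its parameters are hypothetical names chosen for illustration:

```python
import subprocess
import time


def jstack_command(pid):
    """Build the jstack invocation; -l adds lock information to the dump."""
    return ["jstack", "-l", str(pid)]


def collect_dumps(pid, count=3, interval_s=60.0, run=None, sleep=time.sleep):
    """Take `count` thread dumps of JVM `pid`, `interval_s` seconds apart."""
    if run is None:
        def run(cmd):
            # Runs jstack and returns its stdout; raises if jstack fails.
            return subprocess.run(
                cmd, capture_output=True, text=True, check=True
            ).stdout
    dumps = []
    for i in range(count):
        dumps.append(run(jstack_command(pid)))
        if i < count - 1:
            sleep(interval_s)
    return dumps
```

Comparing consecutive dumps shows whether the checkpointing thread sits in the same stack frame (for example, waiting on a lock) or is making progress.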
Hey guys,
So, switching to Ceph/S3 didn't shed any new light on the issue. Although
the times are a bit higher, just a few slots are taking an order of magnitude
longer to save. So I switched the logs to DEBUG.
The problem is: I'm not seeing anything that seems relevant; only pings
from ZooKeeper, heartb
Ok, thanks for the update.
On Tue, Sep 18, 2018, 17:34 Julio Biason wrote:
> Hey Till (and others),
>
> We don't have debug logs yet, but we decided to remove a related
> component: HDFS.
>
> We are moving the storage to our Ceph install (using S3), which has been
> running for longer than our HDFS in
Hey Till (and others),
We don't have debug logs yet, but we decided to remove a related component:
HDFS.
We are moving the storage to our Ceph install (using S3), which has been
running for longer than our HDFS install and which we know, for sure, runs
without any problems (especially because we have more p
Hi,
an alignment time much lower than the whole checkpoint duration may be due to
the downstream task receiving the first barrier and starting alignment at
5 min, then aligning for 2 min to complete the checkpoint. So the whole
duration can be 7 min with an average alignment time of only 2 min. I am just
stating a possibility
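The arithmetic in this scenario can be written out as a small sketch. It is a deliberately simplified model that assumes the sync and async snapshot phases are negligible, and `checkpoint_duration_min` is a hypothetical helper, not part of Flink:

```python
def checkpoint_duration_min(first_barrier_at_min, alignment_min):
    """End-to-end duration = wait until the first barrier arrives + alignment."""
    return first_barrier_at_min + alignment_min


# Numbers from the scenario above: first barrier after 5 min, 2 min alignment.
total = checkpoint_duration_min(first_barrier_at_min=5, alignment_min=2)
print(f"end-to-end: {total} min, reported alignment: 2 min")
```

Under this model the reported alignment time says nothing about how long the barrier took to reach the operator in the first place.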
T
Adding to my previous email, I am starting to doubt the explanation a little,
because the alignment times are also very low. Could it be that the checkpoint
operation takes very long (for whatever reason) to acquire the checkpointing
lock?
> On 18.09.2018 at 11:58, Stefan Rich
Hi,
from your screenshot, it looks like everything is running fine as soon as the
snapshots are actually running; the sync and async part times are normal. So I
think the explanation is the time that the checkpoint barrier needs to reach
this particular operator. It seems like there is a large queu
I think what's weird is that none of the three stages (alignment, sync
checkpoint, async checkpoint) takes much time.
On Tue, Sep 18, 2018 at 3:20 PM Till Rohrmann wrote:
> This behavior seems very odd, Julio. Could you indeed share the debug logs
> of all Flink processes in order to see why things are taking so
This behavior seems very odd, Julio. Could you indeed share the debug logs
of all Flink processes in order to see why things are taking so long?
The checkpoint size of task #8 is twice as big as the second biggest
checkpoint. But this should not increase the checkpoint time by a
factor of 8
Hi, Julio:
Does this happen frequently? Which state backend do you use? The async and
sync checkpoint durations seem normal compared to the others; it seems that
most of the time is spent acking the checkpoint.
On Sun, Sep 16, 2018 at 9:24 AM vino yang wrote:
> Hi Julio,
>
> Yes, it
Hi Julio,
Yes, it seems that fifty-five minutes is really long.
However, it scales linearly with the time and size of the adjacent previous
task in the diagram.
I think your real concern is why Flink accesses HDFS so slowly.
You can enable the DEBUG log to see if you can find any
(Just an addendum: Although it's not a huge problem -- we can always
increase the checkpoint timeout -- this anomalous situation makes me
think there is something wrong in our pipeline or in our cluster, and that
is what is making the checkpoint creation go crazy.)
On Fri, Sep 14, 2018 at 8:0