Re: instable checkpointing after migration to flink 1.8
Hi Yun,
first of all, the reported problem appears to have been resolved for two days
now, right after we changed the type and number of our nodes to give more heap
to the task managers and to have more task managers as well.
Previously, our job
napshotStrategy.java#L249
Best
Yun Tang
From: Congxian Qiu
Sent: Thursday, September 5, 2019 10:38
To: Bekir Oguz
Cc: Stephan Ewen ; dev ; Niels
Alebregtse ; Vladislav Bakayev
Subject: Re: instable checkpointing after migration to flink 1.8
Another piece of information from our private emails:
there are ALWAYS Kafka AbstractCoordinator logs about a lost connection to
Kafka at the same time the checkpoints are confirmed. Bekir checked the
Kafka broker log, but did not find anything interesting there.
Best,
Congxian
Congxian Qiu
Hi Bekir,
If the question is about the storage place for timers: with
RocksDBStateBackend, timers can be stored either on the heap or in RocksDB [1][2]
[1]
https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/large_state_tuning.html#tuning-rocksdb
[2]
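For reference, the timer storage for RocksDBStateBackend is chosen via configuration; a sketch for flink-conf.yaml, assuming the `state.backend.rocksdb.timer-service.factory` option documented for Flink 1.8 (valid values HEAP and ROCKSDB):

```yaml
# Keep timers on the JVM heap (default in 1.8). Heap timers fire faster,
# but their state may be written in the synchronous part of the checkpoint;
# ROCKSDB stores timers in RocksDB alongside keyed state.
state.backend.rocksdb.timer-service.factory: HEAP
```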
Hi all!
A thought would be that this has something to do with timers. Does the task
with that behavior use timers (windows, or process function)?
If that is the case, some theories to check:
- Could it be a timer firing storm coinciding with a checkpoint?
Currently, that storm synchronously
CC flink dev mail list
Update for those who may be interested in this issue: we're still
diagnosing this problem.
Best,
Congxian
Congxian Qiu 于2019年8月29日周四 下午8:58写道:
> Hi Bekir
>
> Currently, from what we have diagnosed, there are some tasks that complete
> their checkpoint too late (maybe
Hi Bekir
Could you please also share the following information:
- jobmanager.log
- taskmanager.log (with debug info enabled) for the problematic subtask.
- the DAG of your program (providing the skeleton program would be even
better -- you can send it to me privately)
For the subIndex, maybe we can use the deploy
I forgot to add the checkpoint details after it completed. This is for the
long-running checkpoint with id 95632.
> On 2 Aug 2019, at 11:18, Bekir Oguz wrote the
> following:
>
> Hi Congxian,
> I was able to fetch the logs of the task manager (attached) and the
> screenshots of
cc Bekir
Best,
Congxian
Congxian Qiu 于2019年8月2日周五 下午12:23写道:
> Hi Bekir
> I’ll first summarize the problem here (please correct me if I’m wrong):
> 1. The same program running on 1.6 never encountered such problems
> 2. Some checkpoints took too long to complete (15+ min), while other normal
> checkpoints
Hi Bekir
I’ll first summarize the problem here (please correct me if I’m wrong):
1. The same program running on 1.6 never encountered such problems
2. Some checkpoints took too long to complete (15+ min), while other normal
checkpoints complete in less than 1 min
3. Some bad checkpoints will have a large sync time,
Hi Bekir
I'll first comb through all the information here and try to find out the
reason with you; I may need you to share some more information :)
Best,
Congxian
Bekir Oguz 于2019年8月1日周四 下午5:00写道:
> Hi Fabian,
> Thanks for sharing this with us, but we’re already on version 1.8.1.
>
> What I
Hi Bekir,
Another user reported checkpointing issues with Flink 1.8.0 [1].
These seem to be resolved with Flink 1.8.1.
Hope this helps,
Fabian
[1]
https://lists.apache.org/thread.html/991fe3b09fd6a052ff52e5f7d9cdd9418545e68b02e23493097d9bc4@%3Cuser.flink.apache.org%3E
On Wed., 17 July 2019 at
Hi, Bekir
First, the e2e time for a subtask is $ack_time_received_in_JM -
$trigger_time_in_JM. A checkpoint also includes several steps on the task
side, such as: 1) receiving the first barrier; 2) barrier alignment (for
exactly once); 3) the operator snapshot sync part; 4) the operator snapshot
async part; the images you
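The decomposition above can be sketched as a quick calculation. The function names and numbers below are illustrative only (not Flink metric APIs); they just show how a long e2e time can dwarf the task-side phases:

```python
# Sketch of the e2e checkpoint time decomposition described above.
# All names and numbers are hypothetical; Flink shows the real values
# in the checkpoint details UI. This only illustrates the arithmetic.

def e2e_time(ack_time_received_in_jm, trigger_time_in_jm):
    """End-to-end time for a subtask, as seen from the JobManager."""
    return ack_time_received_in_jm - trigger_time_in_jm

def unaccounted_time(e2e, alignment, sync, async_part):
    """Time not covered by the task-side phases, e.g. the barrier's
    travel time from the sources to this operator."""
    return e2e - (alignment + sync + async_part)

# Example: a 15-minute e2e checkpoint whose task-side parts are small,
# matching the symptom reported in this thread (all values in seconds).
e2e = e2e_time(ack_time_received_in_jm=900.0, trigger_time_in_jm=0.0)
gap = unaccounted_time(e2e, alignment=5.0, sync=2.0, async_part=30.0)
print(e2e, gap)  # 900.0 863.0
```

A large `gap` suggests the time went into barrier propagation or alignment upstream, not into the snapshot itself.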
Hi Congxian,
Starting from this morning we have more issues with checkpointing in
production. What we see is that the sync and async durations for some subtasks
are very long, but what is strange is that the total of the sync and async
durations is much less than the total end-to-end duration. Please check the
Hi Congxian,
Yes, we have incremental checkpointing enabled on RocksDBBackend. For further
investigation, I logged into one task manager node which had 15-minute-long
snapshotting and found the logs under some /tmp directory. Attaching 2 log
files, one for a long/problematic snapshotting and one
ing from 1.6 to 1.8?
Best,
Congxian
Bekir Oguz 于2019年7月17日周三 下午5:15写道:
> Sending again with reduced image sizes due to Apache mail server error.
>
> Begin forwarded message:
>
> *From: *Bekir Oguz
> *Subject: **Re: instable checkpointing after migration to flink 1.8*
> *Date
Hi Bekir
First of all, I think there is something wrong: the state sizes are almost
the same, but the durations differ so much.
A checkpoint for RocksDBStateBackend dumps sst files, then copies the
needed sst files (if you enable incremental checkpointing, the sst files
already on remote will
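The incremental-checkpoint behavior described here (only sst files not yet on remote storage are copied) can be sketched roughly. This is a simplified illustration of the idea, not Flink's actual implementation:

```python
# Simplified sketch of incremental checkpointing for RocksDB state:
# each checkpoint uploads only the sst files that are not already on
# remote storage. Not Flink's real code; file names are made up.

def files_to_upload(local_sst_files, remote_sst_files):
    """Return the sst files that must be copied for this checkpoint."""
    return sorted(set(local_sst_files) - set(remote_sst_files))

local = ["000012.sst", "000015.sst", "000017.sst"]   # current RocksDB files
remote = ["000012.sst", "000015.sst"]                # shipped earlier
print(files_to_upload(local, remote))  # ['000017.sst']
```

This is why a checkpoint's async duration depends on how many new sst files compaction produced since the last checkpoint, not just on the total state size.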