Re: instable checkpointing after migration to flink 1.8

2019-09-11 Thread Yun Tang
Hi Yun, first of all, this reported problem looks resolved for 2 days already, right after we changed the type and number of our nodes to give more heap to the task managers and to have more task managers as well. Previously, our job

Re: instable checkpointing after migration to flink 1.8

2019-09-05 Thread Yun Tang
napshotStrategy.java#L249 Best Yun Tang From: Congxian Qiu Sent: Thursday, September 5, 2019 10:38 To: Bekir Oguz Cc: Stephan Ewen ; dev ; Niels Alebregtse ; Vladislav Bakayev Subject: Re: instable checkpointing after migration to flink 1.8 Another information from our

Re: instable checkpointing after migration to flink 1.8

2019-09-04 Thread Congxian Qiu
Another piece of information from our private emails: there are ALWAYS Kafka AbstractCoordinator logs about a lost connection to Kafka at the same time the checkpoints are confirmed. Bekir checked the Kafka broker log, but did not find anything interesting there. Best, Congxian Congxian Qiu

Re: instable checkpointing after migration to flink 1.8

2019-09-04 Thread Congxian Qiu
Hi Bekir, If it is the storage place for timers, for RocksDBStateBackend, timers can be stored in Heap or RocksDB[1][2] [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/large_state_tuning.html#tuning-rocksdb [2]
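For reference, a minimal sketch of how the timer storage choice from [1][2] can be selected programmatically; the class, method and enum names follow the 1.8-era RocksDB backend API (worth verifying against the Javadoc), and the checkpoint URI is a placeholder. The same switch is also exposed as the state.backend.rocksdb.timer-service.factory option in flink-conf.yaml.

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class HeapTimersWithRocksDB {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // placeholder checkpoint URI
            RocksDBStateBackend backend = new RocksDBStateBackend("hdfs:///flink/checkpoints");
            // keep timer state on the JVM heap instead of inside RocksDB
            backend.setPriorityQueueStateType(RocksDBStateBackend.PriorityQueueStateType.HEAP);
            env.setStateBackend(backend);
        }
    }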

Re: instable checkpointing after migration to flink 1.8

2019-08-30 Thread Stephan Ewen
Hi all! A thought would be that this has something to do with timers. Does the task with that behavior use timers (windows, or process function)? If that is the case, some theories to check: - Could it be a timer firing storm coinciding with a checkpoint? Currently, that storm synchronously
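To make the "uses timers" condition concrete, here is a minimal, hypothetical example of the kind of operator this theory targets: a keyed process function that registers one processing-time timer per element, so bursts of elements produce bursts of timers that can all fire close together and coincide with a checkpoint.

    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    public class TimerPerElementFunction extends KeyedProcessFunction<String, String, String> {
        @Override
        public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
            // one timer per element, one minute in the future
            long fireAt = ctx.timerService().currentProcessingTime() + 60_000L;
            ctx.timerService().registerProcessingTimeTimer(fireAt);
            out.collect(value);
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
            // timer callbacks run on the task thread; heavy work here delays the
            // synchronous part of a checkpoint that arrives during a firing storm
        }
    }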

Re: instable checkpointing after migration to flink 1.8

2019-08-30 Thread Congxian Qiu
CC flink dev mail list Update for those who may be interested in this issue: we're still diagnosing this problem. Best, Congxian Congxian Qiu wrote on Thu, Aug 29, 2019 at 8:58 PM: > Hi Bekir > > Currently, from what we have diagnosed, there is some task that completes its > checkpoint too late (maybe

Re: instable checkpointing after migration to flink 1.8

2019-08-02 Thread Congxian Qiu
Hi Bekir Could you please also share the information below: - jobmanager.log - taskmanager.log (with debug info enabled) for the problematic subtask. - the DAG of your program (providing the skeleton program would be even better -- you can send it to me privately) For the subIndex, maybe we can use the deploy
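A hedged sketch of what "taskmanager.log with debug info enabled" could look like in conf/log4j.properties, assuming the stock log4j 1.x setup shipped with Flink 1.8 (the appender name "file" matches the default config; adjust if yours differs):

    # raise the root level from INFO to DEBUG
    log4j.rootLogger=DEBUG, file
    # optionally keep very chatty dependencies at INFO to limit log volume
    log4j.logger.org.apache.kafka=INFO
    log4j.logger.akka=INFO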

Re: instable checkpointing after migration to flink 1.8

2019-08-02 Thread Bekir Oguz
Forgot to add the checkpoint details after it completed. This is for that long-running checkpoint with id 95632. > On 2 Aug 2019, at 11:18, Bekir Oguz wrote the > following: > > Hi Congxian, > I was able to fetch the logs of the task manager (attached) and the > screenshots of

Re: instable checkpointing after migration to flink 1.8

2019-08-01 Thread Congxian Qiu
cc Bekir Best, Congxian Congxian Qiu wrote on Fri, Aug 2, 2019 at 12:23 PM: > Hi Bekir > I'll first summarize the problem here (please correct me if I'm wrong): > 1. The same program running on 1.6 never encountered such problems > 2. Some checkpoints take far too long to complete (15+ min), while other normal > checkpoints

Re: instable checkpointing after migration to flink 1.8

2019-08-01 Thread Congxian Qiu
Hi Bekir I'll first summarize the problem here (please correct me if I'm wrong): 1. The same program running on 1.6 never encountered such problems 2. Some checkpoints take far too long to complete (15+ min), while other normal checkpoints complete in less than 1 min 3. Some bad checkpoints have a large sync time,

Re: instable checkpointing after migration to flink 1.8

2019-08-01 Thread Congxian Qiu
Hi Bekir I'll first comb through all the information here and try to find out the reason with you; I may need you to share some more information :) Best, Congxian Bekir Oguz wrote on Thu, Aug 1, 2019 at 5:00 PM: > Hi Fabian, > Thanks for sharing this with us, but we're already on version 1.8.1. > > What I

Re: instable checkpointing after migration to flink 1.8

2019-07-23 Thread Fabian Hueske
Hi Bekir, Another user reported checkpointing issues with Flink 1.8.0 [1]. These seem to be resolved with Flink 1.8.1. Hope this helps, Fabian [1] https://lists.apache.org/thread.html/991fe3b09fd6a052ff52e5f7d9cdd9418545e68b02e23493097d9bc4@%3Cuser.flink.apache.org%3E On Wed, Jul 17, 2019 at

Re: instable checkpointing after migration to flink 1.8 (production issue)

2019-07-18 Thread Congxian Qiu
Hi, Bekir First, the e2e time for a subtask is $ack_time_received_in_JM - $trigger_time_in_JM. And a checkpoint includes some steps on the task side, such as 1) receive the first barrier; 2) barrier alignment (for exactly-once); 3) operator snapshot sync part; 4) operator snapshot async part; the images you
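Spelled out, the relationship described here (term names are informal, not Flink metric names) is roughly:

    e2e_time(subtask) = ack_time_received_in_JM - trigger_time_in_JM
                      ~ barrier_travel_and_alignment_time
                        + operator_snapshot_sync_duration
                        + operator_snapshot_async_duration
                        + ack_transfer_time_to_JM

so the sync and async durations shown in the web UI can be far smaller than the end-to-end time when the barrier arrives late or alignment takes long.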

Re: instable checkpointing after migration to flink 1.8 (production issue)

2019-07-18 Thread Bekir Oguz
Hi Congxian, Starting from this morning we have more issues with checkpointing in production. What we see is that the sync and async durations for some subtasks are very long, but what is strange is that the total of the sync and async durations is much less than the total end-to-end duration. Please check the

Re: instable checkpointing after migration to flink 1.8

2019-07-17 Thread Bekir Oguz
Hi Congxian, Yes, we have incremental checkpointing enabled on RocksDBBackend. For further investigation, I logged into one task manager node which had 15-minute-long snapshotting and found the logs under some /tmp directory. Attaching 2 log files, one for a long/problematic snapshotting and one
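For context, a minimal sketch of how incremental checkpointing is typically switched on for the RocksDB backend (checkpoint URI and interval are placeholders; the boolean flag is the enableIncrementalCheckpointing switch of the 1.8-era constructor):

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class IncrementalCheckpointSetup {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // second argument enables incremental checkpoints
            env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));
            env.enableCheckpointing(60_000L); // checkpoint every 60 s (illustrative value)
        }
    }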

Re: instable checkpointing after migration to flink 1.8

2019-07-17 Thread Congxian Qiu
ing from 1.6 to 1.8? Best, Congxian Bekir Oguz wrote on Wed, Jul 17, 2019 at 5:15 PM: > Sending again with reduced image sizes due to Apache mail server error. > > Begin forwarded message: > > *From: *Bekir Oguz > *Subject: **Re: instable checkpointing after migration to flink 1.8* > *Date

Re: instable checkpointing after migration to flink 1.8

2019-07-17 Thread Congxian Qiu
Hi Bekir First of all, I think there is something wrong: the state size is almost the same, but the durations differ so much. A checkpoint for RocksDBStateBackend dumps sst files, then copies the needed sst files (if you enable incremental checkpoints, the sst files already on the remote will
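In other words, per checkpoint the RocksDB backend roughly uploads:

    full / first checkpoint:    all live sst files of the local RocksDB instance
    incremental checkpoint N:   only the sst files created since checkpoint N-1
                                (files already present on the remote store are
                                 re-referenced, not re-uploaded)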