Hi Sushant,

It's hard to tell what's going on.
Maybe the thread pool of the async I/O operator is too small for the
ingested data rate?
That could cause the backpressure on the source and eventually also the
failing checkpoints.
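For reference, the operator's capacity is set when the async operator is
applied. A minimal sketch, assuming AsyncDataStream.unorderedWait is used
(the stream, function, and type names below are illustrative, not your job):

import java.util.concurrent.TimeUnit;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;

// The last argument is the operator's capacity, i.e. the maximum
// number of concurrently in-flight async requests. Once all slots
// are occupied, the operator backpressures upstream to the source.
DataStream<Result> results = AsyncDataStream.unorderedWait(
        input,                        // hypothetical input stream
        new MyAsyncFunction(),        // hypothetical AsyncFunction
        1000, TimeUnit.MILLISECONDS,  // timeout per async request
        100);                         // capacity: max in-flight requests

Raising the capacity (and the thread pool of the client inside your
AsyncFunction) can relieve the backpressure if the external calls are
the bottleneck.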

Which Flink version are you using?

Best, Fabian


On Thu, 29 Aug 2019 at 12:07, Sushant Sawant <
sushantsawant7...@gmail.com> wrote:

> Hi Fabian,
> Sorry for the one-to-one mail.
> Could you help me out with this? I have been stuck on this issue for over a week now.
>
> Thanks & Regards,
> Sushant Sawant
>
>
>
> On Tue, 27 Aug 2019, 15:23 Sushant Sawant, <sushantsawant7...@gmail.com>
> wrote:
>
>> Hi, first of all, thanks for replying.
>>
>> Here it is: the configuration related to checkpointing.
>>
>> CheckpointingMode checkpointMode =
>>     CheckpointingMode.valueOf("AT_LEAST_ONCE");

>> Long checkpointInterval =
>>     Long.valueOf(parameterMap.get(Checkpoint.CHECKPOINT_INTERVAL.getKey()));

>> // checkpoint directory URI elided here
>> StateBackend sb = new FsStateBackend("file:////");

>> env.setStateBackend(sb);

>> // note: the parsed checkpointInterval above is not passed in here;
>> // the interval is hard-coded to 300000 ms (5 min)
>> env.enableCheckpointing(300000, checkpointMode);

>> env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5000);

>> // checkpoints taking longer than 3 min are declared failed
>> env.getCheckpointConfig().setCheckpointTimeout(180000);

>> env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
>> Thanks & Regards,
>> Sushant Sawant
>>
>> On Tue, 27 Aug 2019, 14:09 pengcheng...@bonc.com.cn, <
>> pengcheng...@bonc.com.cn> wrote:
>>
>>> Hi, what's your checkpoint config?
>>>
>>> ------------------------------
>>> pengcheng...@bonc.com.cn
>>>
>>>
>>> *From:* Sushant Sawant <sushantsawant7...@gmail.com>
>>> *Date:* 2019-08-27 15:31
>>> *To:* user <user@flink.apache.org>
>>> *Subject:* Re: checkpoint failure suddenly even state size less than 1
>>> mb
>>> Hi team,
>>> Anyone for help/suggestions? We have now stopped all input in Kafka;
>>> there is no processing and no sink activity, but checkpointing is still failing.
>>> Is it the case that once a checkpoint fails, it keeps failing forever until
>>> the job is restarted?
>>>
>>> Help appreciated.
>>>
>>> Thanks & Regards,
>>> Sushant Sawant
>>>
>>> On 23 Aug 2019 12:56 p.m., "Sushant Sawant" <sushantsawant7...@gmail.com>
>>> wrote:
>>>
>>> Hi all,
>>> I'm facing two issues which I believe are correlated:
>>> 1. The Kafka source shows high backpressure.
>>> 2. Sudden checkpoint failures for an entire day until restart.
>>>
>>> My job does the following (a rough sketch follows the list):
>>> a. Read from Kafka
>>> b. Async I/O call to an external system
>>> c. Sink to Cassandra and Elasticsearch
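>>>
>>> As a sketch, with illustrative class and variable names (not the
>>> actual job code):
>>>
>>> // source: consume the raw events from Kafka
>>> DataStream<String> events = env.addSource(kafkaConsumer);
>>>
>>> // async enrichment against the external system
>>> DataStream<Enriched> enriched = AsyncDataStream.unorderedWait(
>>>         events, new ExternalLookupFunction(),
>>>         1000, TimeUnit.MILLISECONDS, 100);
>>>
>>> // fan out to both stores
>>> enriched.addSink(cassandraSink);       // Cassandra
>>> enriched.addSink(elasticsearchSink);   // Elasticsearch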
>>>
>>> Checkpointing uses the filesystem state backend.
>>> This Flink job is proven under high load,
>>> around 5000 events/sec throughput.
>>> But recently we scaled down the parallelism since there wasn't any load in
>>> production, and that's when these issues started.
>>>
>>> Please find the status shown by the Flink dashboard.
>>> This GitHub folder contains images from when there was high backpressure
>>> and checkpoint failures:
>>>
>>> https://github.com/sushantbprise/flink-dashboard/tree/master/failed-checkpointing
>>>
>>> and the "everything is fine" images after restart are in this folder:
>>>
>>> https://github.com/sushantbprise/flink-dashboard/tree/master/working-checkpointing
>>>
>>> --
>>> Could anyone point me in the right direction as to what might have gone
>>> wrong, or how to troubleshoot this?
>>>
>>>
>>> Thanks & Regards,
>>> Sushant Sawant
>>>
>>>
>>>
>>>
