subject:"Flink 从checkpoint恢复时，部分状态没有正确恢复"

Re: Flink 从checkpoint恢复时，部分状态没有正确恢复

2021-08-30 文章 Benchao Li

这个问题已经在1.12中修复了，参考:
https://issues.apache.org/jira/browse/FLINK-18688

Benchao Li  于2021年8月30日周一 下午7:49写道：

> Hi xingxing，
>
> 看起来你可能也遇到了这个bug了。
> 我们遇到过一个bug是这样的，在group by的多个字段里面，如果有2个及以上变长字段的话，会导致底层的BinaryRow序列化
> 的结果不稳定，进而导致状态恢复会错误。
> 先解释下变长字段，这个指的是4byte存不下的数据类型，比如典型的varchar、list、map、row等等；
> 然后再解释下这个结果不稳定的原因，这个是因为在底层的代码生成里面有一个优化，会按照类型进行分组，然后进行优化，
> 但是这个分组的过程用的是一个HashMap[1]，会导致字段顺序不是确定性的，有时候是这个顺序，有时候又是另外一个顺序，
> 导致最终的BinaryRow的序列化结果是不稳定的，进而导致无法从checkpoint恢复。
>
> 然后典型的多种变长类型，其实是varchar nullable 和 varchar not null 以及
> char(n)，尤其是你这种用了很多常量字符串的场景，
> 容易产生后两种类型，在加上普通字段以及函数都会产生的第一种类型，就会触发这个bug了。
>
> [1]
> https://github.com/apache/flink/blob/release-1.9/flink-table/flink-table-planner-blink/src/main/scala/org/apache/flink/table/planner/codegen/ProjectionCodeGenerator.scala#L92
>
> dixingxing  于2021年8月25日周三 下午8:57写道：
>
>> Hi Flink 社区：
>> 我们的Flink版本是1.9.2，用的是blink planer，我们今天遇到一个问题，目前没有定位到原因，现象如下：
>>
>> 手动重启任务时，指定了从一个固定的checkpoint恢复时，有一定的概率，一部分状态数据无法正常恢复，启动后Flink任务本身可以正常运行，且日志中没有明显的报错信息。
>> 具体现象是：type=realshow的数据没有从状态恢复，也就是从0开始累加，而type=show和type=click的数据是正常从状态恢复的。
>>
>>
>> SQL大致如下：
>> createview view1 as
>> select event_id, act_time, device_id
>> from table1
>> where `getStringFromJson`(`act_argv`, 'ispin', '') <>'1'
>> and event_id in
>> ('article_newest_list_show','article_newest_list_sight_show',
>> 'article_list_item_click', 'article_auto_video_play_click');
>>
>>
>> --天的数据
>> insertinto table2
>> select platform, type, `time`, count(1) as pv, hll_uv(device_id) as uv
>> from
>> (select'03'as platform, trim(casewhen event_id
>> ='article_newest_list_show'then'show'
>> when event_id ='article_newest_list_sight_show'then'realshow'
>> when event_id ='article_list_item_click'then'click'else''end) astype,
>> `date_parse`(`act_time`, '-MM-dd HH:mm:ss', 'MMdd') as `time`,
>> device_id
>> from view1
>> where event_id in
>> ('article_newest_list_show','article_newest_list_sight_show',
>> 'article_list_item_click')
>> unionall
>> select'03'as platform, 'click_total'astype,
>> `date_parse`(`act_time`, '-MM-dd HH:mm:ss', 'MMdd') as `time`,
>> device_id
>> from view1
>> where event_id in ('article_list_item_click',
>> 'article_auto_video_play_click'))a
>> groupby platform, type, `time`;
>>
>>
>> 期待大家的帮助与回复，希望能给些问题排查的思路！
>>
>>
>>
>>
>
> --
>
> Best,
> Benchao Li
>


-- 

Best,
Benchao Li

Re: Flink 从checkpoint恢复时，部分状态没有正确恢复

2021-08-30 文章 Benchao Li

Hi xingxing，

看起来你可能也遇到了这个bug了。
我们遇到过一个bug是这样的，在group by的多个字段里面，如果有2个及以上变长字段的话，会导致底层的BinaryRow序列化
的结果不稳定，进而导致状态恢复会错误。
先解释下变长字段，这个指的是4byte存不下的数据类型，比如典型的varchar、list、map、row等等；
然后再解释下这个结果不稳定的原因，这个是因为在底层的代码生成里面有一个优化，会按照类型进行分组，然后进行优化，
但是这个分组的过程用的是一个HashMap[1]，会导致字段顺序不是确定性的，有时候是这个顺序，有时候又是另外一个顺序，
导致最终的BinaryRow的序列化结果是不稳定的，进而导致无法从checkpoint恢复。

然后典型的多种变长类型，其实是varchar nullable 和 varchar not null 以及
char(n)，尤其是你这种用了很多常量字符串的场景，
容易产生后两种类型，在加上普通字段以及函数都会产生的第一种类型，就会触发这个bug了。

[1]
https://github.com/apache/flink/blob/release-1.9/flink-table/flink-table-planner-blink/src/main/scala/org/apache/flink/table/planner/codegen/ProjectionCodeGenerator.scala#L92

dixingxing  于2021年8月25日周三 下午8:57写道：

> Hi Flink 社区：
> 我们的Flink版本是1.9.2，用的是blink planer，我们今天遇到一个问题，目前没有定位到原因，现象如下：
>
> 手动重启任务时，指定了从一个固定的checkpoint恢复时，有一定的概率，一部分状态数据无法正常恢复，启动后Flink任务本身可以正常运行，且日志中没有明显的报错信息。
> 具体现象是：type=realshow的数据没有从状态恢复，也就是从0开始累加，而type=show和type=click的数据是正常从状态恢复的。
>
>
> SQL大致如下：
> createview view1 as
> select event_id, act_time, device_id
> from table1
> where `getStringFromJson`(`act_argv`, 'ispin', '') <>'1'
> and event_id in
> ('article_newest_list_show','article_newest_list_sight_show',
> 'article_list_item_click', 'article_auto_video_play_click');
>
>
> --天的数据
> insertinto table2
> select platform, type, `time`, count(1) as pv, hll_uv(device_id) as uv
> from
> (select'03'as platform, trim(casewhen event_id
> ='article_newest_list_show'then'show'
> when event_id ='article_newest_list_sight_show'then'realshow'
> when event_id ='article_list_item_click'then'click'else''end) astype,
> `date_parse`(`act_time`, '-MM-dd HH:mm:ss', 'MMdd') as `time`,
> device_id
> from view1
> where event_id in
> ('article_newest_list_show','article_newest_list_sight_show',
> 'article_list_item_click')
> unionall
> select'03'as platform, 'click_total'astype,
> `date_parse`(`act_time`, '-MM-dd HH:mm:ss', 'MMdd') as `time`,
> device_id
> from view1
> where event_id in ('article_list_item_click',
> 'article_auto_video_play_click'))a
> groupby platform, type, `time`;
>
>
> 期待大家的帮助与回复，希望能给些问题排查的思路！
>
>
>
>

-- 

Best,
Benchao Li

Flink 从checkpoint恢复时，部分状态没有正确恢复

2021-08-25 文章 dixingxing

Hi Flink 社区：
我们的Flink版本是1.9.2，用的是blink planer，我们今天遇到一个问题，目前没有定位到原因，现象如下：
手动重启任务时，指定了从一个固定的checkpoint恢复时，有一定的概率，一部分状态数据无法正常恢复，启动后Flink任务本身可以正常运行，且日志中没有明显的报错信息。
具体现象是：type=realshow的数据没有从状态恢复，也就是从0开始累加，而type=show和type=click的数据是正常从状态恢复的。


SQL大致如下：
createview view1 as
select event_id, act_time, device_id
from table1
where `getStringFromJson`(`act_argv`, 'ispin', '') <>'1'
and event_id in ('article_newest_list_show','article_newest_list_sight_show', 
'article_list_item_click', 'article_auto_video_play_click');


--天的数据
insertinto table2
select platform, type, `time`, count(1) as pv, hll_uv(device_id) as uv
from
(select'03'as platform, trim(casewhen event_id 
='article_newest_list_show'then'show'
when event_id ='article_newest_list_sight_show'then'realshow'
when event_id ='article_list_item_click'then'click'else''end) astype,
`date_parse`(`act_time`, '-MM-dd HH:mm:ss', 'MMdd') as `time`, device_id
from view1
where event_id in ('article_newest_list_show','article_newest_list_sight_show', 
'article_list_item_click')
unionall
select'03'as platform, 'click_total'astype,
`date_parse`(`act_time`, '-MM-dd HH:mm:ss', 'MMdd') as `time`, device_id
from view1
where event_id in ('article_list_item_click', 'article_auto_video_play_click'))a
groupby platform, type, `time`;


期待大家的帮助与回复，希望能给些问题排查的思路！

Re: Flink 从checkpoint恢复时，部分状态没有正确恢复

Re: Flink 从checkpoint恢复时，部分状态没有正确恢复

Flink 从checkpoint恢复时，部分状态没有正确恢复

3 matches

Site Navigation

Mail list logo

Footer information