Hi all,

Is a performance ramp-up expected when a Flink SQL job consumes from Kafka and joins a regular table?


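Background: Flink 1.12.0, with the Flink SQL job deployed on YARN in per-job mode. It consumes the Kafka topic trades and joins it with shop_meta, a dimension table stored in a database.
Whenever the job is restarted, or the upstream suddenly writes a large burst of data, the job's consumption rate only climbs gradually. This builds up a backlog upstream, and the backlog is worked off only after the consumption rate has come up.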


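More information:
trades is in Avro format with about 10 fields, one of which, full_info, is a large JSON document. I wrote a UDF to process the JSON, so that large JSON has to be processed once for every extracted field. In the end nearly 25 fields are produced and written to a downstream Kafka topic.
shop_meta is a regular table with no time field, 12 fields in total, and about 30000 rows. Data and indexes together take 16 MB, and the table is updated very rarely. The current JDBC read settings are lookup.cache.max-rows = 20000; lookup.cache.ttl = 2h; scan.fetch-size = 1000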
A sample of the SQL:
```
SELECT
  t.shop_id, s.shop_name,
  ...
  CAST(json_path_to_str(full_info, '$.response.trade.price', '0.0') AS DOUBLE) price,
  CAST(json_path_to_str(full_info, '$.response.trade.payment', '0.0') AS DOUBLE) payment,
  CAST(json_path_to_str(full_info, '$.response.trade.total_fee', '0.0') AS DOUBLE) total_fee,
  CAST(json_path_to_str(full_info, '$.response.trade.discount_fee', '0.0') AS DOUBLE) discount_fee,
  CAST(json_path_to_str(full_info, '$.response.trade.adjust_fee', '0.0') AS DOUBLE) adjust_fee,
  CAST(json_path_to_str(full_info, '$.response.trade.received_payment', '0.0') AS DOUBLE) received_payment,
  CAST(json_path_to_str(full_info, '$.response.trade.post_fee', '0.0') AS DOUBLE) post_fee,
  json_path_to_str(full_info, '$.response.trade.receiver_name', '') receiver_name,
  json_path_to_str(full_info, '$.response.trade.receiver_country', '') receiver_country,
  json_path_to_str(full_info, '$.response.trade.receiver_state', '') receiver_state,
  json_path_to_str(full_info, '$.response.trade.receiver_city', '') receiver_city
FROM trades t LEFT JOIN shop_meta FOR SYSTEM_TIME AS OF t.proc_time AS s
ON t.shop_id = s.shop_id
```
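
For context, the two tables in the query are declared roughly as in the sketch below. It is not the exact DDL: the column lists and types are abbreviated guesses, and the broker address, consumer group, startup mode, and JDBC URL (including the choice of MySQL) are placeholders. Only the topic name, the Avro format, and the three JDBC options are the real settings described above.

```sql
-- Sketch only: columns, types, and connection values are assumptions.
CREATE TABLE trades (
  shop_id   BIGINT,                      -- assumed type
  full_info STRING,                      -- the large JSON payload
  -- remaining fields omitted
  proc_time AS PROCTIME()                -- processing-time attribute used by the lookup join
) WITH (
  'connector' = 'kafka',
  'topic' = 'trades',
  'properties.bootstrap.servers' = 'kafka-host:9092',   -- placeholder
  'properties.group.id' = 'trades-etl',                 -- placeholder
  'scan.startup.mode' = 'group-offsets',                -- assumed
  'format' = 'avro'
);

CREATE TABLE shop_meta (
  shop_id   BIGINT,                      -- assumed type
  shop_name STRING,
  -- remaining fields omitted
  PRIMARY KEY (shop_id) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://db-host:3306/mydb',             -- placeholder, database type assumed
  'table-name' = 'shop_meta',
  'lookup.cache.max-rows' = '20000',
  'lookup.cache.ttl' = '2h',
  'scan.fetch-size' = '1000'
);
```

With these values the lookup cache can hold at most 20000 of the roughly 30000 rows in shop_meta.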


Since the whole job is only simple ETL with no intermediate state, the TaskManager is configured with
taskmanager.memory.managed.fraction = 0.1
taskmanager.memory.network.fraction = 0.05
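At runtime the TaskManager has 6 GB of total memory and 6 slots, and the maximum parallelism is 6, so there is only one TaskManager.
The monitoring page shows task heap = 4.13 GB, and the heap_used metric stays fairly stable.
The monitoring page also shows that as the consumption rate increases, TaskManager CPU utilization keeps rising, KafkaConsumer_topic_partition_currentOffsets - KafkaConsumer_topic_partition_committedOffsets grows along with the consumption rate, and the young-generation GC count and time go up as well.
Once the backlog has been consumed, the first two metrics drop and young-generation GC levels off.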


Are there any directions I could take to investigate or resolve this?
Thanks, everyone.
