# Subject: Seeking Guidance: Implementing Stream-Batch Integration with Paimon's Partition-Mark-Done

Dear Paimon Community,

I hope this message finds you well. I'm reaching out to seek insights and practical guidance on leveraging Paimon's `partition-mark-done` feature to build a stream-batch integrated pipeline. Below is a detailed overview of our current setup, configuration plans, and the key questions we need help with.

## 1. Paimon Source Table Configuration (Partition-Mark-Done)

Our Paimon table is partitioned by `dt` (date, format: `yyyyMMdd`) and `hour` (format: `HH`). We plan to enable partition done marking with the following core configurations (a rough DDL sketch follows this list):

- `partition.mark-done-action`: `success-file,done-partition` (generate a `_SUCCESS` file in the partition directory and record the done status in the metastore)
- `partition.idle-time-to-done`: `1h` (trigger the done mark when a partition has received no new data for 1 hour, adjustable based on business latency)
- `partition.mark-done-action.mode`: `watermark` (to ensure accurate done timing for delayed data)
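For concreteness, here is a rough Flink SQL DDL sketch of how we intend to declare the table. Everything except the `dt`/`hour` partitioning and the three options above is a placeholder (table name, data columns), and we assume the table lives in a Paimon catalog; we are not yet sure all the option keys/values are valid in Paimon 1.2.0, which is part of what we'd like to confirm.

```sql
-- Hypothetical table definition inside a Paimon catalog; only the partition
-- columns and the partition-mark-done options reflect our actual plan.
CREATE TABLE IF NOT EXISTS ods_events (
    event_id   STRING,
    event_time TIMESTAMP(3),
    payload    STRING,
    dt         STRING,  -- yyyyMMdd
    `hour`     STRING   -- HH
) PARTITIONED BY (dt, `hour`) WITH (
    'partition.mark-done-action'      = 'success-file,done-partition',
    'partition.idle-time-to-done'     = '1h',
    'partition.mark-done-action.mode' = 'watermark'
);
```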
## 2. Stream-Batch Integration Architecture (Two Flink Jobs)

I want to design two Flink jobs to cover both the real-time and the offline analytics requirements (a rough SQL sketch of the streaming side follows this list):

- Real-Time Reporting Job: a streaming Flink job that reads the Paimon table with `scan.mode = 'latest'`. It consumes incremental data continuously and feeds real-time dashboards, unaffected by the `done` mark.
- Offline Batch Reporting Job: a batch Flink job intended to process only partitions marked as `done` (to ensure data completeness) and write the results to offline data warehouses (e.g., Hive). However, I'm unsure about the specific operational steps for this batch job, particularly how to accurately target, read, and process only the partitions that have been marked as done.
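To illustrate the streaming side, here is roughly what we have in mind. The sink table `rt_dashboard_sink` and the aggregation are placeholders; only the Paimon source table and `scan.mode = 'latest'` reflect the actual plan.

```sql
-- Real-time reporting job (sketch): continuous incremental read, independent of the done mark.
-- rt_dashboard_sink is a hypothetical upsert-capable sink (e.g., a KV store or upsert-kafka).
SET 'execution.runtime-mode' = 'streaming';

INSERT INTO rt_dashboard_sink
SELECT dt, `hour`, COUNT(*) AS pv
FROM ods_events /*+ OPTIONS('scan.mode' = 'latest') */
GROUP BY dt, `hour`;
```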
## 3. Key Challenges & Requests for Guidance/Examples

While we understand the basic functionality of `partition-mark-done` (generating `_SUCCESS` files or metastore marks), we're stuck on the practical implementation of the offline batch job. Specifically:

1. How can a Flink SQL batch job **accurately identify and consume the latest `done` partitions** (filtered by `dt` and `hour`)? Are there built-in functions or configuration parameters that directly filter `done` partitions? (A sketch of what we currently imagine is shown after this list.)
2. What are the best practices to **avoid duplicate processing** of `done` partitions that the batch job has already handled?
3. Are there reference cases, sample Flink SQL scripts, or integration examples with scheduling systems (e.g., Airflow, DolphinScheduler) for triggering batch jobs based on the `done` status of Paimon partitions?
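To make questions 1 and 2 more concrete, this is the rough shape of the batch job we currently imagine, with the done partition values injected from outside (for example by a scheduler that has detected the `_SUCCESS` file). The Hive sink table and the `${...}` placeholders are purely illustrative; how to discover and track the done partitions reliably is exactly what we're asking about.

```sql
-- Offline batch job (sketch): process exactly one partition that is already marked done.
-- ${done_dt} / ${done_hour} would be substituted by the scheduler; dwd_events_hive is
-- a hypothetical partitioned Hive sink table. We assume INSERT OVERWRITE only rewrites
-- the partitions present in the result, which is how we hope to keep re-runs idempotent.
SET 'execution.runtime-mode' = 'batch';

INSERT OVERWRITE dwd_events_hive
SELECT event_id, event_time, payload, dt, `hour`
FROM ods_events
WHERE dt = '${done_dt}' AND `hour` = '${done_hour}';
```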
We've searched the official documentation and existing community discussions but haven't found detailed implementation guidance for this specific stream-batch integration scenario. Any practical examples, code snippets, or architectural suggestions would be greatly appreciated.

Additional context: we're using Flink 1.20 and Paimon 1.2.0.

Thank you for your time and support! Looking forward to your valuable insights.

Best regards,
Q7
