Hi Q7,

At present, done-mark handling is only integrated within our internal system; for open-source systems you need to write a script to handle it.
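For example, the check itself can be as simple as looking for the `_success` file that the `success-file` action writes into each partition directory. A minimal sketch in Python (the warehouse directory layout below is an assumption based on the default filesystem catalog; adapt the path to your own setup):

```python
import os

def partition_is_done(warehouse_path: str, db: str, table: str,
                      dt: str, hour: str) -> bool:
    """Return True if the partition's _success marker file exists.

    Assumes the 'success-file' mark-done action and a filesystem layout
    of <warehouse>/<db>.db/<table>/dt=<dt>/hour=<hour>/; adjust the path
    construction for your catalog and warehouse configuration.
    """
    partition_dir = os.path.join(
        warehouse_path, f"{db}.db", table, f"dt={dt}", f"hour={hour}"
    )
    return os.path.exists(os.path.join(partition_dir, "_success"))
```

A scheduler task can call such a check before launching the downstream batch job.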
I think you can take a look at your scheduling system's related capabilities, which allow you to write scripts to check the done mark. I am not familiar with open-source scheduling systems.

Best,
Jingsong

On Sat, Nov 1, 2025 at 4:43 PM Hongze Qi <[email protected]> wrote:
>
> # Subject: Seeking Guidance: Implementing Stream-Batch Integration with Paimon's Partition-Mark-Done
>
> Dear Paimon Community,
>
> I hope this message finds you well. I'm reaching out to seek insights and practical guidance on leveraging Paimon's `partition-mark-done` feature to build a stream-batch integrated pipeline. Below is a detailed overview of our current setup, configuration plans, and key questions that require clarification.
>
> 1. Paimon Source Table Configuration (Partition-Mark-Done)
>
> Our Paimon table is partitioned by `dt` (date, format: `yyyyMMdd`) and `hour` (format: `HH`). We plan to enable partition done marking with the following core configurations:
>
> - `partition.mark-done-action`: `success-file,done-partition` (generate a `_success` file in the partition directory and record the done status in the metastore)
> - `partition.idle-time-to-done`: `1h` (trigger the done mark when a partition has received no new data for 1 hour; adjustable based on business latency)
> - `partition.mark-done-action.mode`: `watermark` (to ensure accurate done timing for delayed data)
>
> 2. Stream-Batch Integration Architecture (Two Flink Jobs)
>
> I want to design two Flink jobs to fulfill both real-time and offline analytics requirements:
>
> - Real-Time Reporting Job: A streaming Flink job that reads the Paimon table with the default `scan.mode = 'latest'`. It consumes incremental data continuously and outputs to real-time dashboards, unaffected by the `done` mark.
> - Offline Batch Reporting Job: A batch Flink job intended to process only partitions marked as `done` (to ensure data completeness) and write results to offline data warehouses (e.g., Hive).
> However, I'm unsure about the specific operational steps for the batch job, particularly how to accurately target, read, and process only the partitions that have been marked as done.
>
> 3. Key Challenges & Requests for Guidance/Examples
>
> While we understand the basic functionality of `partition-mark-done` (generating `_success` files or metastore marks), we're stuck on the practical implementation of the offline batch job. Specifically:
>
> 1. How can a Flink SQL batch job **accurately identify and consume the latest `done` partitions** (filtered by `dt` and `hour`)? Are there built-in functions or configuration parameters to directly filter `done` partitions?
> 2. What are the best practices to **avoid duplicate processing** of already processed `done` partitions in the batch job?
> 3. Are there reference cases, sample Flink SQL scripts, or integration examples with scheduling systems (e.g., Airflow, DolphinScheduler) to trigger batch jobs based on the `done` status of Paimon partitions?
>
> We've searched the official documentation and existing community discussions but haven't found detailed implementation guidance for this specific stream-batch integration scenario. Any practical examples, code snippets, or architectural suggestions would be greatly appreciated.
>
> Additional context: We're using Flink 1.20 and Paimon 1.2.0.
>
> Thank you for your time and support! Looking forward to your valuable insights.
>
> Best regards,
> Q7
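Regarding question 2 above (avoiding duplicate processing): one simple scheduler-side pattern is to keep a local record of the done partitions that have already been submitted, and trigger the batch job only for new ones. A rough sketch, assuming a plain-text state file and a hypothetical `submit_batch_job` callback that launches the Flink batch job (both names are illustrative, not Paimon APIs):

```python
import os

STATE_FILE = "processed_partitions.txt"  # hypothetical local state file

def load_processed(state_file: str = STATE_FILE) -> set:
    """Load the set of partitions already handed to the batch job."""
    if not os.path.exists(state_file):
        return set()
    with open(state_file) as f:
        return {line.strip() for line in f if line.strip()}

def process_new_done_partitions(done_partitions, submit_batch_job,
                                state_file: str = STATE_FILE):
    """Submit the batch job once per newly-done partition.

    done_partitions: iterable of partition keys, e.g. "dt=20251101/hour=08",
    that your checking script found marked done (e.g. via _success files).
    submit_batch_job: callback that launches the batch job; a partition is
    recorded as processed only if the callback does not raise.
    """
    processed = load_processed(state_file)
    for partition in done_partitions:
        if partition in processed:
            continue  # already handled: skip to avoid duplicate processing
        submit_batch_job(partition)
        with open(state_file, "a") as f:
            f.write(partition + "\n")
```

In production, the state file would typically be replaced by the scheduler's own run history or a small metadata table, which gives the same idempotency guarantee.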
