Hi Q7,

At present, done-mark handling is only integrated within our internal system; for open-source systems you need to write a script to handle it.
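For example, the check itself can be as simple as looking for the `_success` file that the `success-file` action writes into each partition directory. A minimal sketch in Python (the warehouse directory layout below is an assumption based on the default filesystem catalog; adapt the path to your own setup):

```python
import os

def partition_is_done(warehouse_path: str, db: str, table: str,
                      dt: str, hour: str) -> bool:
    """Return True if the partition's _success marker file exists.

    Assumes the 'success-file' mark-done action and a filesystem layout
    of <warehouse>/<db>.db/<table>/dt=<dt>/hour=<hour>/; adjust the path
    construction for your catalog and warehouse configuration.
    """
    partition_dir = os.path.join(
        warehouse_path, f"{db}.db", table, f"dt={dt}", f"hour={hour}"
    )
    return os.path.exists(os.path.join(partition_dir, "_success"))
```

A scheduler task can call such a check before launching the downstream batch job.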
I think you can take a look at your scheduling system's related capabilities, which allow you to write scripts to check the done mark. I am not familiar with open-source scheduling systems.

Best,
Jingsong

On Sat, Nov 1, 2025 at 4:43 PM Hongze Qi <[email protected]> wrote:
>
> # Subject: Seeking Guidance: Implementing Stream-Batch Integration with Paimon's Partition-Mark-Done
>
> Dear Paimon Community,
>
> I hope this message finds you well. I'm reaching out to seek insights and practical guidance on leveraging Paimon's `partition-mark-done` feature to build a stream-batch integrated pipeline. Below is a detailed overview of our current setup, configuration plans, and key questions that require clarification.
>
> 1. Paimon Source Table Configuration (Partition-Mark-Done)
>
> Our Paimon table is partitioned by `dt` (date, format: `yyyyMMdd`) and `hour` (format: `HH`). We plan to enable partition done marking with the following core configurations:
>
> - `partition.mark-done-action`: `success-file,done-partition` (generate a `_success` file in the partition directory and record the done status in the metastore)
> - `partition.idle-time-to-done`: `1h` (trigger the done mark when a partition has received no new data for 1 hour; adjustable based on business latency)
> - `partition.mark-done-action.mode`: `watermark` (to ensure accurate done timing for delayed data)
>
> 2. Stream-Batch Integration Architecture (Two Flink Jobs)
>
> I want to design two Flink jobs to fulfill both real-time and offline analytics requirements:
>
> - Real-Time Reporting Job: A streaming Flink job that reads the Paimon table with the default `scan.mode = 'latest'`. It consumes incremental data continuously and outputs to real-time dashboards, unaffected by the `done` mark.
> - Offline Batch Reporting Job: A batch Flink job intended to process only partitions marked as `done` (to ensure data completeness) and write results to offline data warehouses (e.g., Hive).
> However, I'm unsure about the specific operational steps for the batch job, particularly how to accurately target, read, and process only the partitions that have been marked as done.
>
> 3. Key Challenges & Requests for Guidance/Examples
>
> While we understand the basic functionality of `partition-mark-done` (generating `_success` files or metastore marks), we're stuck on the practical implementation of the offline batch job. Specifically:
>
> 1. How can a Flink SQL batch job **accurately identify and consume the latest `done` partitions** (filtered by `dt` and `hour`)? Are there built-in functions or configuration parameters to directly filter `done` partitions?
> 2. What are the best practices to **avoid duplicate processing** of already processed `done` partitions in the batch job?
> 3. Are there reference cases, sample Flink SQL scripts, or integration examples with scheduling systems (e.g., Airflow, DolphinScheduler) to trigger batch jobs based on the `done` status of Paimon partitions?
>
> We've searched the official documentation and existing community discussions but haven't found detailed implementation guidance for this specific stream-batch integration scenario. Any practical examples, code snippets, or architectural suggestions would be greatly appreciated.
>
> Additional context: We're using Flink 1.20 and Paimon 1.2.0.
>
> Thank you for your time and support! Looking forward to your valuable insights.
>
> Best regards,
> Q7
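Regarding question 2 above (avoiding duplicate processing): one simple scheduler-side pattern is to keep a local record of the done partitions that have already been submitted, and trigger the batch job only for new ones. A rough sketch, assuming a plain-text state file and a hypothetical `submit_batch_job` callback that launches the Flink batch job (both names are illustrative, not Paimon APIs):

```python
import os

STATE_FILE = "processed_partitions.txt"  # hypothetical local state file

def load_processed(state_file: str = STATE_FILE) -> set:
    """Load the set of partitions already handed to the batch job."""
    if not os.path.exists(state_file):
        return set()
    with open(state_file) as f:
        return {line.strip() for line in f if line.strip()}

def process_new_done_partitions(done_partitions, submit_batch_job,
                                state_file: str = STATE_FILE):
    """Submit the batch job once per newly-done partition.

    done_partitions: iterable of partition keys, e.g. "dt=20251101/hour=08",
    that your checking script found marked done (e.g. via _success files).
    submit_batch_job: callback that launches the batch job; a partition is
    recorded as processed only if the callback does not raise.
    """
    processed = load_processed(state_file)
    for partition in done_partitions:
        if partition in processed:
            continue  # already handled: skip to avoid duplicate processing
        submit_batch_job(partition)
        with open(state_file, "a") as f:
            f.write(partition + "\n")
```

In production, the state file would typically be replaced by the scheduler's own run history or a small metadata table, which gives the same idempotency guarantee.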
