Hi Q7,

For me, batch processing is generally a rigorous process, and deterministically executing an explicitly specified partition is the better practice.

To my knowledge, there is no built-in way to "automatically identify and read Paimon partitions marked as done", so the scheduling side still has to check the done mark and tell the batch job which partition to process.
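For illustration, here is a rough, untested sketch of what such a scheduler-side script could look like: it checks the partition directory for the success-file marker, skips partitions it has already handled, and then submits a Flink SQL batch statement that filters on that one explicit partition. The warehouse path, marker file name, sink table, SQL template, and the bookkeeping file are all placeholders to adapt to your environment and Paimon version.

```python
#!/usr/bin/env python3
"""Rough sketch: check a Paimon partition's done mark, then run the batch job.

Saved e.g. as /data/etl/run_done_partition.py (name/path illustrative).
Assumptions (adjust to your environment):
  * the table lives in a filesystem warehouse and the success-file action
    writes a marker file (e.g. _SUCCESS) into each done partition directory;
  * the batch job is a Flink SQL statement file submitted via sql-client.sh;
  * processed partitions are remembered in a plain bookkeeping file.
"""
import pathlib
import subprocess
import sys

TABLE_DIR = pathlib.Path("/data/paimon/warehouse/my_db.db/ods_events")  # placeholder
PROCESSED = pathlib.Path("/data/etl/processed_partitions.txt")          # placeholder
FLINK_HOME = pathlib.Path("/opt/flink")                                 # placeholder

BATCH_SQL = """
SET 'execution.runtime-mode' = 'batch';

INSERT INTO dw_hourly_report                       -- placeholder sink table
SELECT dt, `hour`, COUNT(*) AS pv, SUM(amount) AS amount   -- placeholder aggregation
FROM ods_events
WHERE dt = '{dt}' AND `hour` = '{hour}'
GROUP BY dt, `hour`;
"""

def is_done(dt: str, hour: str) -> bool:
    # Marker file name depends on the Paimon version/configuration.
    return (TABLE_DIR / f"dt={dt}" / f"hour={hour}" / "_SUCCESS").exists()

def already_processed(partition: str) -> bool:
    return PROCESSED.exists() and partition in PROCESSED.read_text().splitlines()

def run_batch(dt: str, hour: str) -> None:
    sql_file = pathlib.Path(f"/tmp/report_dt={dt}_hour={hour}.sql")
    sql_file.write_text(BATCH_SQL.format(dt=dt, hour=hour))
    # '-f' executes a statement file in the Flink SQL client; the required
    # catalogs are assumed to be registered (e.g. via an '-i' init file).
    subprocess.run([str(FLINK_HOME / "bin" / "sql-client.sh"), "-f", str(sql_file)],
                   check=True)

if __name__ == "__main__":
    dt, hour = sys.argv[1], sys.argv[2]          # e.g. 20251105 10, from the scheduler
    partition = f"dt={dt}/hour={hour}"
    if already_processed(partition):
        sys.exit(0)                              # idempotent re-run: nothing to do
    if not is_done(dt, hour):
        sys.exit(1)                              # not done yet: let the scheduler retry
    run_batch(dt, hour)
    with PROCESSED.open("a") as f:               # record it only after success
        f.write(partition + "\n")
```

A scheduler such as Azkaban can then run this once per hour with the partition values as arguments and retry while the exit code is non-zero; the bookkeeping file is what keeps re-runs from processing the same done partition twice.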
Best,
Jingsong

On Thu, Nov 6, 2025 at 6:34 PM Hongze Qi <[email protected]> wrote:
>
> Hi Jingsong,
>
> Thank you so much for your prompt and helpful reply; it's really valuable
> for my implementation! I'm aligning with your suggestion to leverage the
> script and scheduling-system capabilities, and here's a quick recap of my
> current plan for the stream-batch integration with Paimon:
>
> - For real-time reporting: I'll use a Flink SQL streaming job with
>   scan.mode = 'latest' to consume incremental data continuously, which
>   works independently of the done mark, as expected.
> - For offline batch processing: I plan to use Flink SQL in batch mode to
>   process hourly partitions that have just been marked as done (ensuring
>   data completeness).
>
> I still have two key questions to clarify further, since I'm focusing on
> using Flink SQL for both the streaming and the batch side:
>
> 1. Can Flink SQL batch jobs automatically identify and read Paimon
>    partitions marked as done? Are there built-in syntax, functions, or
>    configurations to filter these done partitions directly (without extra
>    scripts)?
> 2. If Flink batch can't natively recognize the done status, do I still
>    need to rely on a scheduling system like Azkaban to trigger the batch
>    job and pass the done partition list (e.g., dt=20251105 + hour=10) to
>    the Flink SQL job?
>
> We're using Flink 1.20 and Paimon 1.2.0, and I'm trying to keep the
> offline pipeline as simple as possible with Flink SQL. Any additional
> insights or hints on these two questions would be greatly appreciated!
>
> Thanks again for your support. Looking forward to your further guidance.
>
> Best regards,
> Q7
>
> At 2025-11-05 17:58:05, "Jingsong Li" <[email protected]> wrote:
> > Hi Q7,
> >
> > At present, the handling of the done mark is only integrated within our
> > internal system, and open-source systems require writing scripts to
> > complete it.
> >
> > I think you can take a look at the scheduling system's related
> > capabilities, which allow you to write scripts to check the done mark.
> > I am not familiar with open-source scheduling systems.
> >
> > Best,
> > Jingsong
> >
> > On Sat, Nov 1, 2025 at 4:43 PM Hongze Qi <[email protected]> wrote:
> >>
> >> # Subject: Seeking Guidance: Implementing Stream-Batch Integration with
> >> Paimon's Partition-Mark-Done
> >>
> >> Dear Paimon Community,
> >>
> >> I hope this message finds you well. I'm reaching out to seek insights
> >> and practical guidance on leveraging Paimon's `partition-mark-done`
> >> feature to build a stream-batch integrated pipeline. Below is a detailed
> >> overview of our current setup, configuration plans, and the key
> >> questions that require clarification.
> >>
> >> 1. Paimon Source Table Configuration (Partition-Mark-Done)
> >>
> >> Our Paimon table is partitioned by `dt` (date, format: `yyyyMMdd`) and
> >> `hour` (format: `HH`). We plan to enable partition done marking with the
> >> following core configurations:
> >> - `partition.mark-done-action`: `success-file,done-partition` (generate
> >>   a `_success` file in the partition directory and record the done
> >>   status in the metastore)
> >> - `partition.idle-time-to-done`: `1h` (trigger the done mark when a
> >>   partition has had no new data for 1 hour, adjustable based on business
> >>   latency)
> >> - `partition.mark-done-action.mode`: `watermark` (to ensure accurate
> >>   done timing for delayed data)
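As a side note on the configuration quoted above: here is a minimal, untested PyFlink-style sketch of applying these options when creating the table (the embedded DDL is the same in the SQL client). The catalog name, warehouse path, and schema are placeholders, and the option names and values are taken from the list above, so please verify them against the Paimon 1.2.0 documentation.

```python
# Minimal, untested sketch: creating the partitioned table with the mark-done
# options listed above. Assumes the Paimon Flink bundle jar is on the
# classpath; catalog name, warehouse path, and schema are placeholders.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE CATALOG paimon WITH (
        'type' = 'paimon',
        'warehouse' = 'file:///data/paimon/warehouse'
    )
""")
t_env.execute_sql("USE CATALOG paimon")

# Note: per the quoted mail, the 'done-partition' action records the done
# status in the metastore, so that part needs a metastore-backed catalog
# (e.g. Hive) rather than the plain filesystem warehouse shown here.
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS ods_events (
        user_id BIGINT,
        amount  DOUBLE,
        dt      STRING,
        `hour`  STRING
    ) PARTITIONED BY (dt, `hour`) WITH (
        'partition.idle-time-to-done' = '1h',
        'partition.mark-done-action'  = 'success-file,done-partition'
    )
""")
```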
> >> 2. Stream-Batch Integration Architecture (Two Flink Jobs)
> >>
> >> I want to design two Flink jobs to fulfill both real-time and offline
> >> analytics requirements:
> >> - Real-Time Reporting Job: a streaming Flink job that reads the Paimon
> >>   table with the default `scan.mode = 'latest'`. It consumes incremental
> >>   data continuously and outputs to real-time dashboards, unaffected by
> >>   the `done` mark.
> >> - Offline Batch Reporting Job: a batch Flink job intended to process
> >>   only partitions marked as `done` (to ensure data completeness) and
> >>   write results to offline data warehouses (e.g., Hive). However, I'm
> >>   unsure about the specific operational steps for the batch job,
> >>   particularly how to accurately target, read, and process only the
> >>   partitions that have been marked as done.
> >>
> >> 3. Key Challenges & Requests for Guidance/Examples
> >>
> >> While we understand the basic functionality of `partition-mark-done`
> >> (generating `_success` files or metastore marks), we're stuck on the
> >> practical implementation of the offline batch job. Specifically:
> >> 1. How can a Flink SQL batch job **accurately identify and consume the
> >>    latest `done` partitions** (filtered by `dt` and `hour`)? Are there
> >>    built-in functions or configuration parameters to directly filter
> >>    `done` partitions?
> >> 2. What are the best practices to **avoid duplicate processing** of
> >>    already processed `done` partitions in the batch job?
> >> 3. Are there reference cases, sample Flink SQL scripts, or integration
> >>    examples with scheduling systems (e.g., Airflow, DolphinScheduler) to
> >>    trigger batch jobs based on the `done` status of Paimon partitions?
> >>
> >> We've searched the official documentation and existing community
> >> discussions but haven't found detailed implementation guidance for this
> >> specific stream-batch integration scenario. Any practical examples, code
> >> snippets, or architectural suggestions would be greatly appreciated.
> >>
> >> Additional context: We're using Flink 1.20 and Paimon 1.2.0.
> >>
> >> Thank you for your time and support! Looking forward to your valuable
> >> insights.
> >>
> >> Best regards,
> >> Q7
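Regarding point 3 in the quoted mail (scheduler integration): since the done mark has to be checked outside of Flink SQL anyway, the trigger flow maps naturally onto a scheduler. Below is a rough, untested Airflow-style sketch; the DAG id, schedule, paths, marker-file name, and the call to the check-and-run script sketched earlier in this thread are all placeholders.

```python
# Rough, untested Airflow 2.x sketch: wait for a partition's done mark, then
# run the Flink SQL batch job for that explicit partition. DAG id, schedule,
# paths, marker file name, and the invoked script are all placeholders.
import pathlib
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.python import PythonSensor

TABLE_DIR = "/data/paimon/warehouse/my_db.db/ods_events"   # placeholder

def partition_is_done(dt: str, hour: str) -> bool:
    # Same convention as before: the success-file action leaves a marker
    # file in the partition directory (exact name depends on the setup).
    return pathlib.Path(f"{TABLE_DIR}/dt={dt}/hour={hour}/_SUCCESS").exists()

with DAG(
    dag_id="paimon_done_partition_report",
    start_date=datetime(2025, 11, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    # Partition values derived from the hourly data interval.
    dt = "{{ data_interval_start.strftime('%Y%m%d') }}"
    hour = "{{ data_interval_start.strftime('%H') }}"

    wait_for_done = PythonSensor(
        task_id="wait_for_done_mark",
        python_callable=partition_is_done,
        op_args=[dt, hour],
        poke_interval=300,        # re-check every 5 minutes
        timeout=6 * 60 * 60,      # give up after 6 hours
        mode="reschedule",
    )

    run_batch = BashOperator(
        task_id="run_flink_batch",
        # Reuses the check-and-run script sketched earlier in this thread.
        bash_command=f"python3 /data/etl/run_done_partition.py {dt} {hour}",
    )

    wait_for_done >> run_batch
```

DolphinScheduler or Azkaban can model the same two steps; the essential point is that the scheduler, not the Flink SQL job itself, decides which partition is done and passes dt and hour explicitly.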
