Hi Q7,

In my view, batch processing is generally a rigorous process, and
explicitly specifying which partition to execute is the better practice.

To my knowledge, there is no built-in way to "automatically identify
and read Paimon partitions marked as done".

Best,
Jingsong

On Thu, Nov 6, 2025 at 6:34 PM Hongze Qi <[email protected]> wrote:
>
> Hi Jingsong,
> Thank you so much for your prompt and helpful reply—it’s really valuable for 
> my implementation!
> I’m aligning with your suggestion to leverage script and scheduling system 
> capabilities, and here’s a quick recap of my current plan for the 
> stream-batch integration with Paimon:
>
> - For real-time reporting: I’ll use a Flink SQL streaming job with scan.mode =
>   'latest' to consume incremental data continuously, which works independently
>   of the done mark as expected.
> - For offline batch processing: I plan to use Flink SQL in batch mode to
>   process hourly partitions that have just been marked as done (ensuring data
>   completeness).
>
> I still have two key questions to clarify further, as I’m focusing on using 
> Flink SQL for both streams and batches:
>
> 1. Can Flink SQL batch jobs automatically identify and read Paimon partitions
>    marked as done? Are there built-in syntax, functions, or configurations to
>    filter these done partitions directly (without extra scripts)?
> 2. If Flink batch can’t natively recognize the done status, do I still need to
>    rely on a scheduling system like Azkaban to trigger the batch job and pass
>    the done partition list (e.g., dt=20251105 + hour=10) to the Flink SQL job?
>
> We’re using Flink 1.20 and Paimon 1.2.0, and I’m trying to keep the offline 
> pipeline as simple as possible with Flink SQL. Any additional insights or 
> hints on these two questions would be greatly appreciated!
> Thanks again for your support. Looking forward to your further guidance.
> Best regards,
> Q7
>
> At 2025-11-05 17:58:05, "Jingsong Li" <[email protected]> wrote:
> >Hi Q7,
> >
> >At present, the handling of done mark is only integrated within our
> >internal system, and open-source systems require writing scripts to
> >complete it.
> >
> >I think you can take a look at the scheduling system's related
> >capabilities, which allow you to write scripts to check the done mark.
> >I am not familiar with open-source scheduling systems.
> >
> >Best,
> >Jingsong
> >
> >On Sat, Nov 1, 2025 at 4:43 PM Hongze Qi <[email protected]> wrote:
> >>
> >> Subject: Seeking Guidance: Implementing Stream-Batch Integration with
> >> Paimon's Partition-Mark-Done
> >>
> >> Dear Paimon Community,
> >>
> >> I hope this message finds you well. I’m reaching out to seek insights and
> >> practical guidance on leveraging Paimon’s `partition-mark-done` feature to
> >> build a stream-batch integrated pipeline. Below is a detailed overview of
> >> our current setup, configuration plans, and key questions that require
> >> clarification.
> >>
> >> 1. Paimon Source Table Configuration (Partition-Mark-Done)
> >>
> >> Our Paimon table is partitioned by `dt` (date, format: `yyyyMMdd`) and
> >> `hour` (format: `HH`). We plan to enable partition done marking with the
> >> following core configurations:
> >> - `partition.mark-done-action`: `success-file,done-partition` (generate a
> >>   `_success` file in the partition directory and record the done status in
> >>   the metastore)
> >> - `partition.idle-time-to-done`: `1h` (trigger the done mark when a
> >>   partition has had no new data for 1 hour, adjustable based on business
> >>   latency)
> >> - `partition.mark-done-action.mode`: `watermark` (to ensure accurate done
> >>   timing for delayed data)
> >>
> >> 2. Stream-Batch Integration Architecture (Two Flink Jobs)
> >>
> >> I want to design two Flink jobs to fulfill both real-time and offline
> >> analytics requirements:
> >> - Real-Time Reporting Job: A streaming Flink job that reads the Paimon
> >>   table with the default `scan.mode = 'latest'`. It consumes incremental
> >>   data continuously and outputs to real-time dashboards, unaffected by the
> >>   `done` mark.
> >> - Offline Batch Reporting Job: A batch Flink job intended to process only
> >>   partitions marked as `done` (to ensure data completeness) and write
> >>   results to offline data warehouses (e.g., Hive). However, I’m unsure
> >>   about the specific operational steps for the batch job, particularly how
> >>   to accurately target, read, and process only the partitions that have
> >>   been marked as done.
> >>
> >> 3. Key Challenges & Requests for Guidance/Examples
> >>
> >> While we understand the basic functionality of `partition-mark-done`
> >> (generating `_success` files or metastore marks), we’re stuck on the
> >> practical implementation of the offline batch job. Specifically:
> >> 1. How can a Flink SQL batch job **accurately identify and consume the
> >>    latest `done` partitions** (filtered by `dt` and `hour`)? Are there
> >>    built-in functions or configuration parameters to directly filter
> >>    `done` partitions?
> >> 2. What are the best practices to **avoid duplicate processing** of
> >>    already processed `done` partitions in the batch job?
> >> 3. Are there reference cases, sample Flink SQL scripts, or integration
> >>    examples with scheduling systems (e.g., Airflow, DolphinScheduler) to
> >>    trigger batch jobs based on the `done` status of Paimon partitions?
> >>
> >> We’ve searched the official documentation and existing community
> >> discussions but haven’t found detailed implementation guidance for this
> >> specific stream-batch integration scenario. Any practical examples, code
> >> snippets, or architectural suggestions would be greatly appreciated.
> >>
> >> Additional context: We’re using Flink 1.20 and Paimon 1.2.0.
> >>
> >> Thank you for your time and support! Looking forward to your valuable
> >> insights.
> >>
> >> Best regards,
> >> Q7
