Hi Jingsong,

Thank you so much for your prompt and helpful reply. It's really valuable for my implementation! I'm aligned with your suggestion to leverage scripts and the scheduling system's capabilities. Here is a quick recap of my current plan for the stream-batch integration with Paimon:

- Real-time reporting: a Flink SQL streaming job with scan.mode = 'latest' that consumes incremental data continuously, which works independently of the done mark as expected.
- Offline batch processing: Flink SQL in batch mode that processes hourly partitions which have just been marked as done (to ensure data completeness).

Since I'm focusing on Flink SQL for both the streaming and batch sides, I still have two key questions to clarify:

1. Can a Flink SQL batch job automatically identify and read Paimon partitions marked as done? Is there any built-in syntax, function, or configuration to filter these done partitions directly (without extra scripts)?
2. If Flink batch can't natively recognize the done status, do I still need to rely on a scheduling system like Azkaban to trigger the batch job and pass it the done partition values (e.g., dt=20251105 and hour=10)?

We're using Flink 1.20 and Paimon 1.2.0, and I'm trying to keep the offline pipeline as simple as possible with Flink SQL. Any additional insights or hints on these two questions would be greatly appreciated!

Thanks again for your support. Looking forward to your further guidance.

Best regards,
Q7
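P.S. To make the second question more concrete, below is a rough sketch of the scheduler-driven flow I have in mind, along the lines of the script-based check you suggested. It is only a sketch under my own assumptions: the warehouse path, catalog/database/table/sink names, the aggregation columns, and the script name are all placeholders, and I'm assuming partition.mark-done-action includes 'success-file' so that a _SUCCESS file shows up in the partition directory. Please correct me if the done mark should be checked differently (for example via the metastore when 'done-partition' is used).

#!/usr/bin/env bash
# check_and_run_offline_report.sh (hypothetical name)
# Rough sketch: an Azkaban task would call this as
#   ./check_and_run_offline_report.sh 20251105 10
# Assumptions (placeholders, not our real setup):
#   - 'partition.mark-done-action' includes 'success-file', so the done mark
#     is a _SUCCESS file under the partition directory;
#   - the Paimon warehouse lives on HDFS at $WAREHOUSE below;
#   - catalog/database/table/sink names and the report SQL are illustrative only;
#   - FLINK_HOME points at the Flink 1.20 installation.
set -euo pipefail

DT="$1"      # e.g. 20251105, passed in by the scheduler
HOUR="$2"    # e.g. 10

WAREHOUSE="hdfs:///paimon/warehouse"
PARTITION_DIR="${WAREHOUSE}/mydb.db/my_table/dt=${DT}/hour=${HOUR}"

# 1) Check the done mark written by partition-mark-done.
if ! hdfs dfs -test -e "${PARTITION_DIR}/_SUCCESS"; then
  echo "Partition dt=${DT}/hour=${HOUR} not marked done yet, skipping." >&2
  exit 1   # let the scheduler retry on its next cycle
fi

# 2) Generate the batch SQL with the done partition values filled in.
SQL_FILE="/tmp/offline_report_${DT}_${HOUR}.sql"
cat > "${SQL_FILE}" <<EOF
SET 'execution.runtime-mode' = 'batch';
-- make the SQL client wait for the batch job to finish
SET 'table.dml-sync' = 'true';

CREATE CATALOG paimon WITH (
  'type' = 'paimon',
  'warehouse' = '${WAREHOUSE}'
);
USE CATALOG paimon;

-- Sink is a placeholder table partitioned by (dt, hour); INSERT OVERWRITE on a
-- single partition keeps re-runs of the same done partition idempotent.
INSERT OVERWRITE mydb.offline_hourly_report PARTITION (dt = '${DT}', \`hour\` = '${HOUR}')
SELECT item_id, COUNT(*) AS pv    -- placeholder aggregation
FROM mydb.my_table
WHERE dt = '${DT}' AND \`hour\` = '${HOUR}'
GROUP BY item_id;
EOF

# 3) Run it with the Flink SQL client in batch mode.
"${FLINK_HOME}/bin/sql-client.sh" -f "${SQL_FILE}"

The idea is that the scheduler only hands a partition to the batch job after the _SUCCESS check passes, and INSERT OVERWRITE on that single partition makes reprocessing idempotent, which I hope also covers the duplicate-processing concern from my earlier mail. If there is a more idiomatic way to do the done check from Flink SQL itself (without the wrapper script), I'd love to hear it.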
At 2025-11-05 17:58:05, "Jingsong Li" <[email protected]> wrote: >Hi Q7, > >At present, the handling of done mark is only integrated within our >internal system, and open-source systems require writing scripts to >complete it. > >I think you can take a look at the scheduling system's related >capabilities, which allow you to write scripts to check the done mark. >I am not familiar with open-source scheduling systems. > >Best, >Jingsong > >On Sat, Nov 1, 2025 at 4:43 PM Hongze Qi <[email protected]> wrote: >> >> # Subject: Seeking Guidance: Implementing Stream-Batch Integration with >> Paimon's Partition-Mark-Done Dear Paimon Community, I hope this message >> finds you well. I’m reaching out to seek insights and practical guidance on >> leveraging Paimon’s `partition-mark-done` feature to build a stream-batch >> integrated pipeline. Below is a detailed overview of our current setup, >> configuration plans, and key questions that require clarification. 1. Paimon >> Source Table Configuration (Partition-Mark-Done) Our Paimon table is >> partitioned by `dt` (date, format: `yyyyMMdd`) and `hour` (format: `HH`). We >> plan to enable partition done marking with the following core >> configurations: - `partition.mark-done-action`: >> `success-file,done-partition` (generate `_success` file in the partition >> directory and record done status in metastore) - >> `partition.idle-time-to-done`: `1h` (trigger done mark when a partition has >> no new data for 1 hour, adjustable based on business latency) - >> `partition.mark-done-action.mode`: `watermark` (to ensure accurate done >> timing for delayed data) 2. Stream-Batch Integration Architecture (Two Flink >> Jobs) I want to design two Flink jobs to fulfill both real-time and offline >> analytics requirements: - Real-Time Reporting Job: A streaming Flink job >> that reads the Paimon table with the default `scan.mode = 'latest'`. It >> consumes incremental data continuously and outputs to real-time dashboards, >> unaffected by the `done` mark. - Offline Batch Reporting Job: A batch Flink >> job intended to process only partitions marked as `done` (to ensure data >> completeness) and write results to offline data warehouses (e.g., Hive). >> However, I’m unsure about the specific operational steps for the batch >> job—particularly how to accurately target, read, and process only the >> partitions that have been marked as done. 3. Key Challenges & Requests for >> Guidance/Examples While we understand the basic functionality of >> `partition-mark-done` (generating `_success` files or metastore marks), >> we’re stuck on the practical implementation of the offline batch job. >> Specifically: 1. How can a Flink SQL batch job **accurately identify and >> consume the latest `done` partitions** (filtered by `dt` and `hour`)? Are >> there built-in functions or configuration parameters to directly filter >> `done` partitions? 2. What are the best practices to **avoid duplicate >> processing** of already processed `done` partitions in the batch job? 3. Are >> there reference cases, sample Flink SQL scripts, or integration examples >> with scheduling systems (e.g., Airflow, DolphinScheduler) to trigger batch >> jobs based on the `done` status of Paimon partitions? We’ve searched the >> official documentation and existing community discussions but haven’t found >> detailed implementation guidance for this specific stream-batch integration >> scenario. Any practical examples, code snippets, or architectural >> suggestions would be greatly appreciated. 
Additional context: We’re using >> Flink 1.20 and Paimon 1.2.0. Thank you for your time and support! Looking >> forward to your valuable insights. Best regards, Q7
