# Subject: Seeking Guidance: Implementing Stream-Batch Integration with Paimon's Partition-Mark-Done

Dear Paimon Community,

I hope this message finds you well. I'm reaching out to seek insights and practical guidance on leveraging Paimon's `partition-mark-done` feature to build a stream-batch integrated pipeline. Below is a detailed overview of our current setup, our configuration plans, and the key questions we need to clarify.

## 1. Paimon Source Table Configuration (Partition-Mark-Done)

Our Paimon table is partitioned by `dt` (date, format: `yyyyMMdd`) and `hour` (format: `HH`). We plan to enable partition done marking with the following core configurations:

- `partition.mark-done-action`: `success-file,done-partition` (generate a `_SUCCESS` file in the partition directory and record the done status in the metastore)
- `partition.idle-time-to-done`: `1h` (mark a partition as done after it has received no new data for 1 hour; adjustable based on business latency)
- `partition.mark-done-action.mode`: `watermark` (to ensure accurate done timing for delayed data)
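For concreteness, here is a minimal sketch of the DDL we have in mind. The table name `ods_events` and its columns are placeholders, and the two `partition.timestamp-*` options are our own assumption about how Paimon should map the `dt`/`hour` values to a partition time; please correct us if they are unnecessary or wrong:

```sql
-- Placeholder table and columns; options mirror the configuration plan above.
CREATE TABLE ods_events (
    event_id   STRING,
    event_time TIMESTAMP(3),
    payload    STRING,
    dt         STRING,
    `hour`     STRING  -- backticks because HOUR is a Flink SQL keyword
) PARTITIONED BY (dt, `hour`) WITH (
    'partition.mark-done-action'      = 'success-file,done-partition',
    'partition.idle-time-to-done'     = '1h',
    'partition.mark-done-action.mode' = 'watermark',
    -- Our assumption: needed so Paimon can parse dt/hour into a partition time.
    'partition.timestamp-pattern'     = '$dt $hour',
    'partition.timestamp-formatter'   = 'yyyyMMdd HH'
);
```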
## 2. Stream-Batch Integration Architecture (Two Flink Jobs)

I want to design two Flink jobs to cover both real-time and offline analytics requirements:

- **Real-Time Reporting Job**: a streaming Flink job that reads the Paimon table with `scan.mode = 'latest'`. It consumes incremental data continuously and writes to real-time dashboards, unaffected by the `done` mark.
- **Offline Batch Reporting Job**: a batch Flink job intended to process only partitions marked as `done` (to ensure data completeness) and write the results to an offline data warehouse (e.g., Hive). However, I'm unsure about the concrete operational steps for this job, in particular how to accurately target, read, and process only the partitions that have already been marked as done. A rough sketch of what we envision follows below.
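Roughly, the two jobs would look like this (simplified; `rt_dashboard_sink` and `hive_report_sink` are placeholder sinks, and the `${done_dt}`/`${done_hour}` variables in the batch job are placeholders we imagine a scheduler substituting, because deriving them from the done mark is exactly what we are asking about):

```sql
-- Job A: streaming, real-time dashboard (simplified).
SET 'execution.runtime-mode' = 'streaming';
INSERT INTO rt_dashboard_sink
SELECT dt, `hour`, COUNT(*) AS pv
FROM ods_events /*+ OPTIONS('scan.mode' = 'latest') */
GROUP BY dt, `hour`;

-- Job B: batch, offline report (simplified).
-- ${done_dt}/${done_hour} are placeholders to be filled in by an external
-- scheduler once the partition is marked done; see the questions below.
SET 'execution.runtime-mode' = 'batch';
INSERT INTO hive_report_sink
SELECT dt, `hour`, COUNT(*) AS pv
FROM ods_events
WHERE dt = '${done_dt}' AND `hour` = '${done_hour}'
GROUP BY dt, `hour`;
```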
## 3. Key Challenges & Requests for Guidance/Examples

While we understand the basic functionality of `partition-mark-done` (generating `_SUCCESS` files or metastore marks), we are stuck on the practical implementation of the offline batch job. Specifically:

1. How can a Flink SQL batch job **accurately identify and consume the latest `done` partitions** (filtered by `dt` and `hour`)? Are there built-in functions or configuration parameters to directly filter for `done` partitions?
2. What are the best practices to **avoid duplicate processing** of already processed `done` partitions in the batch job?
3. Are there reference cases, sample Flink SQL scripts, or integration examples with scheduling systems (e.g., Airflow, DolphinScheduler) for triggering batch jobs based on the `done` status of Paimon partitions? Our current rough idea is sketched after this list.
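For reference, this is what we currently imagine the scheduler doing before it triggers the batch job. It is only an assumption on our part (including the `$partitions` system-table query and its `partition` column), not something we have validated:

```sql
-- Our current assumption of the pre-trigger check (not validated):
-- 1) list the partitions that exist so far via the partitions system table,
SELECT `partition` FROM `ods_events$partitions`;
-- 2) then check the corresponding partition directory for the _SUCCESS file
--    (or the metastore mark) before substituting ${done_dt}/${done_hour}
--    into the batch job above.
```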
We've searched the official documentation and existing community discussions but haven't found detailed implementation guidance for this specific stream-batch integration scenario. Any practical examples, code snippets, or architectural suggestions would be greatly appreciated.

Additional context: we're using Flink 1.20 and Paimon 1.2.0.

Thank you for your time and support! Looking forward to your valuable insights.

Best regards,
Q7