What confuses me most is why, when I set the snapshot ID to the current latest, the job still produces records. From the documentation, my understanding is that in streaming mode this usage should not consume that snapshot itself, but rather use it as the starting point to consume incremental changes. This should also be one of the main differences between from-snapshot and from-snapshot-full, right?
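
To make the comparison concrete, here is a rough sketch of the two reads I have in mind. It reuses `my_table` and snapshot 53 from my original message. The comments describe my reading of the `scan.mode` documentation, not confirmed behavior, and I have only actually run the first query.

```sql
SET 'execution.runtime-mode' = 'streaming';

-- Case 1: what I ran. scan.mode is left at its default, which I assume
-- resolves to from-snapshot once scan.snapshot-id is set.
-- My expectation from the docs: no full snapshot is produced at startup.
SELECT * FROM my_table /*+ OPTIONS('scan.snapshot-id'='53') */;

-- Case 2: explicit from-snapshot-full.
-- My expectation from the docs: the full view of snapshot 53 is produced
-- on first startup, then later changes are read continuously.
SELECT * FROM my_table
  /*+ OPTIONS('scan.mode'='from-snapshot-full', 'scan.snapshot-id'='53') */;
```

If Case 1 also emits records for snapshot 53, then the difference between the two modes would be whether startup produces the changes of snapshot 53 or its full view, which is the part I would like confirmed.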

Yunfeng Zhou <[email protected]> wrote on Sat, Oct 11, 2025 at 11:15:

> Hi lec ssmi,
>
> Most of your questions and understandings might be addressed by the
> documentation of the `scan.mode` configuration. You can find it here:
> https://paimon.apache.org/docs/master/maintenance/configurations/
> It explains the differences between modes like "from-snapshot" and
> "from-timestamp-full" (which correspond to your exclusive/inclusive
> concerns) and the different behaviors in batch and streaming mode.
>
> The streaming read result does differ according to the changelog producer
> of the table. You can find the corresponding behaviors here:
> https://paimon.apache.org/docs/master/primary-key-table/changelog-producer/
>
> Best,
> Yunfeng
>
> On Oct 11, 2025, at 09:38, lec ssmi <[email protected]> wrote:
>
> Hi Paimon community,
>
> I’d like to confirm the intended semantics of *scan.snapshot-id* in
> Apache Paimon *1.2* when reading with Flink.
>
> What I see
>
> - Table’s latest snapshot ID is *53*.
> - I start a *streaming* query with:
>
>     SET 'execution.runtime-mode' = 'streaming';
>     SELECT * FROM my_table /*+ OPTIONS('scan.snapshot-id'='53') */;
>
> - The job *emits records immediately*, even though there is *no
>   snapshot 54* yet. After that initial output, it waits for new
>   snapshots and continues normally when new data arrives.
>
> My understanding
>
> - In streaming mode, when scan.snapshot-id is provided (and scan.mode
>   defaults to from-snapshot), the source *reads changes starting from
>   that snapshot* (i.e., includes the changes produced by snapshot S
>   itself, then S+1, S+2, …), and it *does not* first produce a full
>   snapshot at startup.
> - In batch mode, using scan.snapshot-id = S should return the *full
>   view of snapshot S* only (no subsequent changes, no waiting).
>
> Questions
>
> 1. *Streaming semantics:* Is it *by design* that from-snapshot is
>    *inclusive* of the starting snapshot’s changes (ΔS), i.e., it will
>    output ΔS even if there is no S+1 yet?
> 2. *Batch semantics:* Is it correct that batch + scan.snapshot-id = S
>    always returns *Full(S)* (and never ΔS)? Are there any exceptions
>    depending on table type?
> 3. *PK vs non-PK tables:* Should we expect any observable difference
>    here depending on whether the table is a primary-key table and/or the
>    changelog-producer is enabled (lookup, full-compaction, input)?
> 4. *Exclusive start recommendation:* If a user wants to *strictly
>    start from S+1* (i.e., exclude ΔS), is the recommended approach to:
>    - wait until S+1 exists and set scan.snapshot-id = S+1, or
>    - use incremental-between='S,S+1' for a bounded read?
> 5. *Docs wording:* If the above is the intended behavior, would it
>    make sense to emphasize the *inclusive* nature of from-snapshot in
>    streaming (vs. the *(start, end]* semantics of incremental-between) to
>    help users avoid confusion?
>
> Environment:
>
> - Paimon: 1.2
> - Engine: Flink 1.19
> - scan.mode: default (from-snapshot)
>
> Thanks a lot for confirming the expected behavior.
>
> Best regards
