Hi Jingsong,

Thanks a lot for the quick reply and for pointing me to the Paimon metrics documentation; that is very helpful.
I fully agree that, at the connector level, relying on the metrics exposed by the Flink Paimon Source is the most accurate way to observe its runtime behavior.

However, from a platform or engine-agnostic monitoring perspective, this also highlights a broader challenge. If source throughput detection relies on connector-specific metrics, then monitoring inevitably becomes customized per connector. In practice, Flink has many connectors, and each of them may expose different metrics, follow different execution models, or even change behavior depending on configuration. Paimon is a good example of this: different consumption modes (e.g., with or without consumer-id) lead to quite different execution graphs and operator responsibilities.

From the perspective of a generic Flink platform or monitoring system, implementing and maintaining connector-specific logic for each source type (Kafka, Paimon, filesystem, etc.), and even for different modes of the same connector, is neither friendly nor scalable.

This is why I’m particularly interested in whether there is (or could be) a more stable semantic definition of a “logical ingress point” in Flink, one that monitoring systems could rely on without deeply understanding each connector’s internal implementation. Today, the physical JobGraph structure and the actual data-ingesting operator do not always align in a uniform way across connectors.

I understand this may be more of a Flink-level abstraction question than a Paimon-only one, but Paimon’s design makes the issue especially visible.

Thanks again for the clarification and for the great work on Paimon.

Best regards,
Lec

On Mon, Feb 2, 2026 at 10:39 Jingsong Li <[email protected]> wrote:
> Hi Lec,
>
> You can take a look at Flink Paimon's Metrics, which contains a wealth of metrics. Paimon is just a lake format, so it doesn't have any metrics, but the Flink Source we implemented produces a large number of metrics.
>
> See doc: https://paimon.apache.org/docs/master/maintenance/metrics/
>
> Best,
> Jingsong
>
> On Thu, Jan 29, 2026 at 10:17 PM lec ssmi <[email protected]> wrote:
> >
> > Hi Paimon community,
> >
> > I have a question regarding the runtime execution model and throughput semantics when reading Paimon tables in streaming mode with consumer-id.
> >
> > From my understanding and observations, when consumer-id is specified, the execution graph generated by Flink is different from some other common sources (e.g. Kafka). Instead of a single source operator that directly emits records, the graph usually contains:
> >
> > - A monitor-like source (often with parallelism = 1), which tracks snapshot changes and produces snapshot/split events
> > - One or more downstream read operators, which receive those splits and perform the actual file reading, emitting the real RowData records
> >
> > In this setup, the “source” node in the execution graph mainly emits metadata events (snapshot IDs / splits), while the real data throughput is produced by the downstream read operators.
> >
> > This leads to a practical issue for platform-level monitoring tools. In many Flink platforms, source throughput (records/s, bytes/s) is commonly measured by observing the source vertex metrics. That approach works well for sources like Kafka, where the source operator itself emits user records. However, in the Paimon + consumer-id case, monitoring only the source vertex seems misleading, because it does not reflect the actual data ingestion rate.
> >
> > So my questions are:
> >
> > 1. Is this monitor + reader split in the execution graph an intentional and stable design for Paimon streaming reads with consumer-id?
> > 2. From the Paimon/Flink semantics perspective, which operator should be considered the “ingress point” for measuring real data throughput?
> > 3. Is there any recommended or documented way for external monitoring systems to correctly identify the operator that represents actual data ingestion when reading from Paimon?
> >
> > The motivation here is to build a connector-agnostic source rate detection mechanism, and understanding the intended semantics on the Paimon side would be very helpful.
> >
> > Thanks in advance for your insights, and thanks for the great work on Paimon.
> >
> > Best regards.
>
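
P.S. To make the platform-side problem concrete, below is a rough sketch of the kind of connector-agnostic probe I have in mind. It is purely my own illustration: the REST base address and the job id are placeholders, and the exact endpoints and metric names (/jobs/<id>/plan, /jobs/<id>/vertices/<vid>/subtasks/metrics, numRecordsOutPerSecond) reflect my understanding of the Flink REST API and may need adjusting per Flink version. It treats every job-plan vertex without inputs as a "source" and sums numRecordsOutPerSecond over its subtasks. For a Kafka source that number is the real ingestion rate; for a Paimon streaming read with consumer-id it mostly counts the snapshot/split events emitted by the monitor vertex, which is exactly the mismatch described above.

import json
import urllib.request

FLINK_REST = "http://localhost:8081"  # assumption: JobManager REST address

def get_json(path):
    # GET a Flink REST path and decode the JSON response.
    with urllib.request.urlopen(FLINK_REST + path) as resp:
        return json.load(resp)

def source_rates(job_id):
    # Treat every job-plan node without "inputs" as a source vertex.
    plan = get_json("/jobs/%s/plan" % job_id)["plan"]
    sources = {n["id"]: n["description"]
               for n in plan["nodes"] if not n.get("inputs")}
    rates = {}
    for vertex_id, name in sources.items():
        # Aggregated subtask metrics; "sum" adds the per-subtask rates together.
        metrics = get_json(
            "/jobs/%s/vertices/%s/subtasks/metrics?get=numRecordsOutPerSecond"
            % (job_id, vertex_id))
        rates[name] = sum(m.get("sum", 0.0) for m in metrics)
    return rates

if __name__ == "__main__":
    job_id = "<job-id>"  # placeholder, not a real id
    for name, rate in source_rates(job_id).items():
        print("%s: %.1f records/s" % (name, rate))

Making this correct for Paimon would require picking the right downstream read vertex instead, which is exactly the per-connector customization I would like to avoid, hence my interest in a stable "logical ingress point" definition.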
