Hi devs and users,

It looks like we are getting an initial consensus in the discussion so I
started a voting thread [1] just now. Looking forward to your feedback!

[1] https://lists.apache.org/thread/ozlf82mkm6ndx2n1vdgq532h156p4lt6

Best,
Qingsheng


On Thu, Oct 13, 2022 at 10:41 PM Jing Ge <j...@ververica.com> wrote:

> Hi Qingsheng,
>
> Thanks for the clarification. +1, I like the idea. Pointing both numXXXOut
> and numXXXSend to the same external data transfer metric does not really
> break the new SinkV2 design, since there was no requirement to monitor the
> internal traffic. So, I think both developer and user can live with it. It
> might not be the perfect solution but is indeed the currently best
> trade-off solution after considering the backward compatibility.  I would
> suggest firing a follow-up ticket after the PR to take care of the new
> metric for the internal traffic in the future.
>
> Best regards,
> Jing
>
>
> On Thu, Oct 13, 2022 at 3:08 PM Qingsheng Ren <re...@apache.org> wrote:
>
>> Hi Jing,
>>
>> Thanks for the reply!
>>
>> Let me rephrase my proposal: we’d like to use numXXXOut registered on
>> SinkWriterOperator to reflect the traffic to the external system for
>> compatibility with old versions before 1.15, and make numXXXSend have the
>> same value as numXXXOut for compatibility within 1.15. That means both
>> numXXXOut and numXXXSend are used for external data transfers, which end
>> users care more about. As for the internal traffic within the sink, we
>> could name a new metric for it because this is a _new_ feature in the _new_
>> sink, and end users usually don’t pay attention to internal implementation.
>> The name of the new metric could be discussed later after 1.16 release.
>>
>> > but it might end up with monitoring unexpected metrics, which is even
>> worse for users, i.e. I didn't change anything, but something has been
>> broken since the last update.
>>
>> Yeah this is exactly what we are trying to fix with this proposal. I
>> believe users are more concerned with the output to the external system
>> than the internal data delivery in the sink, so I think we’ll have more
>> cases reporting like “I set up a panel on numRecordsOut in sink to monitor
>> the output of the job, but after upgrading to 1.15 this value is extremely
>> low and I didn’t change anything” if we stick to the current situation. I
>> think only a few end users care about the number of committables sending to
>> downstream as most of them don’t care how the sink works.
>>
>> We do need a re-design to fully distinguish the internal and external
>> traffic on metrics, not only in sink but in all operators as it’s quite
>> common for operators to make IO. This needs time to design, discuss, adjust
>> and vote, but considering this is blocking 1.16, maybe it’s better to
>> rescue the compatibility for now, and leave the huge reconstruction to
>> future versions (maybe 2.0).
>>
>> Best,
>> Qingsheng
>>
>> On Wed, Oct 12, 2022 at 7:21 PM Jing Ge <j...@ververica.com> wrote:
>>
>>> Hi Qingsheng,
>>>
>>> Just want to make sure we are on the same page. Are you suggesting
>>> switching the naming between "numXXXSend" and "numXXXOut" or reverting all
>>> the changes we did with FLINK-26126 and FLINK-26492?
>>>
>>> For the naming switch, please pay attention that the behaviour has been
>>> changed since we introduced SinkV2[1]. So, please be aware of different
>>> numbers(behaviour change) even with the same metrics name. Sticking with
>>> the old name with the new behaviour (very bad idea, IMHO) might seem like
>>> saving the effort in the first place, but it might end up with monitoring
>>> unexpected metrics, which is even worse for users, i.e. I didn't change
>>> anything, but something has been broken since the last update.
>>>
>>> For reverting, I am not sure how to fix the issue mentioned in
>>> FLINK-26126 after reverting all changes. Like Chesnay has already pointed
>>> out, with SinkV2 we have two different output lines - one with the external
>>> system and the other with the downstream operator. In this case,
>>> "numXXXSend" is rather a new metric than a replacement of "numXXXOut". The
>>> "numXXXOut" metric can still be used, depending on what the user wants to
>>> monitor.
>>>
>>>
>>> Best regards,
>>> Jing
>>>
>>> [1]
>>> https://github.com/apache/flink/blob/51fc20db30d001a95de95b3b9993eeb06f558f6c/flink-metrics/flink-metrics-core/src/main/java/org/apache/flink/metrics/groups/SinkWriterMetricGroup.java#L48
>>>
>>>
>>> On Wed, Oct 12, 2022 at 12:48 PM Qingsheng Ren <re...@apache.org> wrote:
>>>
>>>> As a supplement, considering it could be a big reconstruction
>>>> redefining internal and external traffic and touching metric names in
>>>> almost all operators, this requires a lot of discussions and we might
>>>> do it finally in Flink 2.0. I think compatibility is a bigger blocker
>>>> in front of us, as the output of sink is a metric that users care a
>>>> lot about.
>>>>
>>>> Thanks,
>>>> Qingsheng
>>>>
>>>> On Wed, Oct 12, 2022 at 6:20 PM Qingsheng Ren <re...@apache.org> wrote:
>>>> >
>>>> > Thanks Chesnay for the reply. +1 for making a unified and clearer
>>>> > metric definition distinguishing internal and external data transfers.
>>>> > As you described, having IO in operators is quite common such as
>>>> > dimension tables in Table/SQL API. This definitely deserves a FLIP and
>>>> > an overall design.
>>>> >
>>>> > However I think it's necessary to change the metric back to
>>>> > numRecordsOut instead of sticking with numRecordsSend in 1.15 and
>>>> > 1.16. The most important argument is for compatibility as I mentioned
>>>> > in my previous email, otherwise all users have to modify their configs
>>>> > of metric systems after upgrading to Flink 1.15+, and all custom
>>>> > connectors have to change their implementations to migrate to the new
>>>> > metric name. I believe other ones participating and approving this
>>>> > proposal share the same concern about compatibility too. Also
>>>> > considering this issue is blocking the release of 1.16, maybe we could
>>>> > fix this asap, and as for defining a new metric for internal data
>>>> > transfers we can have an in-depth discussion later. WDYT?
>>>> >
>>>> > Best,
>>>> > Qingsheng
>>>> >
>>>> > On Tue, Oct 11, 2022 at 6:06 PM Chesnay Schepler <ches...@apache.org>
>>>> wrote:
>>>> > >
>>>> > > Currently I think that would be a mistake.
>>>> > >
>>>> > > Ultimately what we have here is the culmination of us never really
>>>> considering how the numRecordsOut metric should behave for operators that
>>>> emit data to other operators _and_ external systems. This goes beyond 
>>>> sinks.
>>>> > > This even applies to numRecordsIn, for cases where functions
>>>> query/write data from/to the outside, (e.g., Async IO).
>>>> > >
>>>> > > Having 2 separate metrics for that, 1 exclusively for internal data
>>>> transfers, and 1 exclusively for external data transfers, is the only way
>>>> to get a consistent metric definition in the long-run.
>>>> > > We can jump back-and-forth now or just commit to it.
>>>> > >
>>>> > > I don't think we can really judge this based on FLIP-33. It was
>>>> IIRC written before the two phase sinks were added, which heavily blurred
>>>> the lines of what a sink even is. Because it definitely is _not_ the last
>>>> operator in a chain anymore.
>>>> > >
>>>> > > What I would suggest is to stick with what we got (although I
>>>> despise the name numRecordsSend), and alias the numRecordsOut metric for
>>>> all non-TwoPhaseCommittingSink.
>>>> > >
>>>> > > On 11/10/2022 05:54, Qingsheng Ren wrote:
>>>> > >
>>>> > > Thanks for the details Chesnay!
>>>> > >
>>>> > > By “alias” I mean to respect the original definition made in
>>>> FLIP-33 for numRecordsOut, which is the number of records written to the
>>>> external system, and keep numRecordsSend as the same value as numRecordsOut
>>>> for compatibility.
>>>> > >
>>>> > > I think keeping numRecordsOut for the output to the external system
>>>> is more intuitive to end users because in most cases the metric of data
>>>> flow output is more essential. I agree with you that a new metric is
>>>> required, but considering compatibility and users’ intuition I prefer to
>>>> keep the initial definition of numRecordsOut in FLIP-33 and name a new
>>>> metric for sink writer’s output to downstream operators. This might be
>>>> against consistency with metrics in other operators in Flink but maybe it’s
>>>> acceptable to have the sink as a special case.
>>>> > >
>>>> > > Best,
>>>> > > Qingsheng
>>>> > > On Oct 10, 2022, 19:13 +0800, Chesnay Schepler <ches...@apache.org>,
>>>> wrote:
>>>> > >
>>>> > > > I’m with Xintong’s idea to treat numXXXSend as an alias of
>>>> numXXXOut
>>>> > >
>>>> > > But that's not possible. If it were that simple there would have
>>>> never been a need to introduce another metric in the first place.
>>>> > >
>>>> > > It's a rather fundamental issue with how the new sinks work, in
>>>> that they emit data to the external system (usually considered as
>>>> "numRecordsOut" of sinks) while _also_ sending data to a downstream
>>>> operator (usually considered as "numRecordsOut" of tasks).
>>>> > > The original issue was that the numRecordsOut of the sink counted
>>>> both (which is completely wrong).
>>>> > >
>>>> > > A new metric was always required; otherwise you inevitably end up
>>>> breaking some semantic.
>>>> > > Adding a new metric for what the sink writes to the external system
>>>> is, for better or worse, more consistent with how these metrics usually
>>>> work in Flink.
>>>> > >
>>>> > > On 10/10/2022 12:45, Qingsheng Ren wrote:
>>>> > >
>>>> > > Thanks everyone for joining the discussion!
>>>> > >
>>>> > > > Do you have any idea what has happened in the process here?
>>>> > >
>>>> > > The discussion in this PR [1] shows some details and could be
>>>> helpful to understand the original motivation of the renaming. We do have a
>>>> test case for guarding metrics but unfortunaly the case was also modified
>>>> so the defense was broken.
>>>> > >
>>>> > > I think the reason why both the developer and the reviewer forgot
>>>> to trigger an discussion and gave a green pass on the change is that
>>>> metrics are quite “trivial” to be noticed as public APIs. As mentioned by
>>>> Martijn I couldn’t find a place noting that metrics are public APIs and
>>>> should be treated carefully while contributing and reviewing.
>>>> > >
>>>> > > IMHO three actions could be made to prevent this kind of changes in
>>>> the future:
>>>> > >
>>>> > > a. Add test case for metrics (which we already have in
>>>> SinkMetricsITCase)
>>>> > > b. We emphasize that any public-interface breaking changes should
>>>> be proposed by a FLIP or discussed in mailing list, and should be listed in
>>>> the release note.
>>>> > > c. We remind contributors and reviewers about what should be
>>>> considered as public API, and include metric names in it.
>>>> > >
>>>> > > For b and c these two pages [2][3] might be proper places.
>>>> > >
>>>> > > About the patch to revert this, it looks like we have a consensus
>>>> on 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I
>>>> didn’t see complaints about this for now so it should be OK to save the
>>>> situation asap. I’m with Xintong’s idea to treat numXXXSend as an alias of
>>>> numXXXOut considering there could possibly some users have already adapted
>>>> their system to the new naming, and have another internal metric for
>>>> reflecting number of outgoing committable batches (actually the
>>>> numRecordsIn of sink committer operator should be carrying this info
>>>> already).
>>>> > >
>>>> > > [1] https://github.com/apache/flink/pull/18825
>>>> > > [2] https://flink.apache.org/contributing/contribute-code.html
>>>> > > [3] https://flink.apache.org/contributing/reviewing-prs.html
>>>> > >
>>>> > > Best,
>>>> > > Qingsheng
>>>> > > On Oct 10, 2022, 17:40 +0800, Xintong Song <tonysong...@gmail.com>,
>>>> wrote:
>>>> > >
>>>> > > +1 for reverting these changes in Flink 1.16.
>>>> > >
>>>> > > For 1.15.3, can we make these metrics available via both names
>>>> (numXXXOut and numXXXSend)? In this way we don't break it for those who
>>>> already migrated to 1.15 and numXXXSend. That means we still need to change
>>>> SinkWriterOperator to use another metric name in 1.15.3, which IIUC is
>>>> internal to Flink sink.
>>>> > >
>>>> > > I'm overall +1 to change numXXXOut back to its original semantics.
>>>> AFAIK (from meetup / flink-forward questionaires), most users do not
>>>> migrate to a new Flink release immediately, until the next 1-2 major
>>>> releases are out.
>>>> > >
>>>> > > Best,
>>>> > >
>>>> > > Xintong
>>>> > >
>>>> > >
>>>> > >
>>>> > > On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser <
>>>> martijnvis...@apache.org> wrote:
>>>> > >>
>>>> > >> Hi Qingsheng,
>>>> > >>
>>>> > >> Do you have any idea what has happened in the process here? Do we
>>>> know why
>>>> > >> they were changed? I was under the impression that these metric
>>>> names were
>>>> > >> newly introduced due to the new interfaces and because it still
>>>> depends on
>>>> > >> each connector implementing these.
>>>> > >>
>>>> > >> Sidenote: metric names are not mentioned in the FLIP process as a
>>>> public
>>>> > >> API. Might make sense to have a separate follow-up to add that to
>>>> the list
>>>> > >> (I do think we should list them there).
>>>> > >>
>>>> > >> +1 for reverting this and make this change in Flink 1.16
>>>> > >>
>>>> > >> I'm not in favour of releasing a Flink 1.15.3 with this change: I
>>>> think the
>>>> > >> impact is too big for a patch version, especially given how long
>>>> Flink 1.15
>>>> > >> is already out there.
>>>> > >>
>>>> > >> Best regards,
>>>> > >>
>>>> > >> Martijn
>>>> > >>
>>>> > >> On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xbjt...@gmail.com>
>>>> wrote:
>>>> > >>
>>>> > >> > Thanks Qingsheng for starting this thread.
>>>> > >> >
>>>> > >> > +1 on reverting sink metric name and releasing 1.15.3 to fix this
>>>> > >> > inconsistent behavior.
>>>> > >> >
>>>> > >> >
>>>> > >> > Best,
>>>> > >> > Leonard
>>>> > >> >
>>>> > >> >
>>>> > >> >
>>>> > >> >
>>>> > >> >
>>>> > >> > 2022年10月10日 下午3:06,Jark Wu <imj...@gmail.com> 写道:
>>>> > >> >
>>>> > >> > Thanks for discovering this problem, Qingsheng!
>>>> > >> >
>>>> > >> > I'm also +1 for reverting the breaking changes.
>>>> > >> >
>>>> > >> > IIUC, currently, the behavior of "numXXXOut" metrics of the new
>>>> and old
>>>> > >> > sink is inconsistent.
>>>> > >> > We have to break one of them to have consistent behavior. Sink
>>>> V2 is an
>>>> > >> > evolving API which is just introduced in 1.15.
>>>> > >> > I think it makes sense to break the unstable API instead of the
>>>> stable API
>>>> > >> > which many connectors and users depend on.
>>>> > >> >
>>>> > >> > Best,
>>>> > >> > Jark
>>>> > >> >
>>>> > >> >
>>>> > >> >
>>>> > >> > On Mon, 10 Oct 2022 at 11:36, Jingsong Li <
>>>> jingsongl...@gmail.com> wrote:
>>>> > >> >
>>>> > >> >> Thanks for driving, Qingsheng.
>>>> > >> >>
>>>> > >> >> +1 for reverting sink metric name.
>>>> > >> >>
>>>> > >> >> We often forget that metric is also one of the important APIs.
>>>> > >> >>
>>>> > >> >> +1 for releasing 1.15.3 to fix this.
>>>> > >> >>
>>>> > >> >> Best,
>>>> > >> >> Jingsong
>>>> > >> >>
>>>> > >> >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <
>>>> becket....@gmail.com> wrote:
>>>> > >> >> >
>>>> > >> >> > Thanks for raising the discussion, Qingsheng,
>>>> > >> >> >
>>>> > >> >> > +1 on reverting the breaking changes.
>>>> > >> >> >
>>>> > >> >> > In addition, we might want to release a 1.15.3 to fix this
>>>> and update
>>>> > >> >> the previous release docs with this known issue, so that users
>>>> can upgrade
>>>> > >> >> to 1.15.3 when they hit it. It would also be good to add some
>>>> backwards
>>>> > >> >> compatibility tests on metrics to avoid unintended breaking
>>>> changes like
>>>> > >> >> this in the future.
>>>> > >> >> >
>>>> > >> >> > Thanks,
>>>> > >> >> >
>>>> > >> >> > Jiangjie (Becket) Qin
>>>> > >> >> >
>>>> > >> >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <
>>>> re...@apache.org> wrote:
>>>> > >> >> >>
>>>> > >> >> >> Hi devs and users,
>>>> > >> >> >>
>>>> > >> >> >> I’d like to start a discussion about reverting a breaking
>>>> change about
>>>> > >> >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492
>>>> [2].
>>>> > >> >> >>
>>>> > >> >> >> TL;DR
>>>> > >> >> >>
>>>> > >> >> >> All sink metrics with name “numXXXOut” defined in FLIP-33
>>>> are replace
>>>> > >> >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering
>>>> metric names
>>>> > >> >> are public APIs, this is a breaking change to end users and not
>>>> backward
>>>> > >> >> compatible. Also unfortunately this breaking change was not
>>>> discussed in
>>>> > >> >> the mailing list before.
>>>> > >> >> >>
>>>> > >> >> >> Background
>>>> > >> >> >>
>>>> > >> >> >> As defined previously in FLIP-33 (the FLIP page has been
>>>> changed so
>>>> > >> >> please refer to the old version [3] ), metric “numRecordsOut”
>>>> is used for
>>>> > >> >> reporting the total number of output records since the sink
>>>> started (number
>>>> > >> >> of records written to the external system), and similarly for
>>>> > >> >> “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond”
>>>> and
>>>> > >> >> “numRecordsOutError”. Most sinks are following this naming and
>>>> definition.
>>>> > >> >> However, these metrics are ambiguous in the new Sink API as
>>>> “numXXXOut”
>>>> > >> >> could be used by the output of SinkWriterOperator for reporting
>>>> number of
>>>> > >> >> Committables delivered to SinkCommitterOperator. In order to
>>>> resolve the
>>>> > >> >> conflict, FLINK-26126 and FLINK-26492 changed names of these
>>>> metrics with
>>>> > >> >> “numXXXSend”.
>>>> > >> >> >>
>>>> > >> >> >> Necessity of reverting this change
>>>> > >> >> >>
>>>> > >> >> >> - Metric names are actually public API, as end users need to
>>>> configure
>>>> > >> >> metric collecting and alerting system with metric names. Users
>>>> have to
>>>> > >> >> reset all configurations related to affected metrics.
>>>> > >> >> >> - This could also affect custom and external sinks not
>>>> maintained by
>>>> > >> >> Flink, which might have implemented with numXXXOut metrics.
>>>> > >> >> >> - The number of records sent to external system is way more
>>>> important
>>>> > >> >> than the number of Committables sent to SinkCommitterOperator,
>>>> as the
>>>> > >> >> latter one is just an internal implementation of sink. We could
>>>> have a new
>>>> > >> >> metric name for the latter one instead.
>>>> > >> >> >> - We could avoid splitting the project by version (like “plz
>>>> use
>>>> > >> >> numXXXOut before 1.15 and use numXXXSend after”) if we revert
>>>> it ASAP,
>>>> > >> >> cosidering 1.16 is still not released for now.
>>>> > >> >> >>
>>>> > >> >> >> As a consequence, I’d like to hear from devs and users about
>>>> your
>>>> > >> >> opinion on changing these metrics back to “numXXXOut”.
>>>> > >> >> >>
>>>> > >> >> >> Looking forward to your reply!
>>>> > >> >> >>
>>>> > >> >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
>>>> > >> >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
>>>> > >> >> >> [1] FLIP-33, version 18:
>>>> > >> >>
>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
>>>> > >> >> >>
>>>> > >> >> >> Best,
>>>> > >> >> >> Qingsheng
>>>> > >> >>
>>>> > >> >
>>>> > >> >
>>>> > >
>>>> > >
>>>> > >
>>>>
>>>

Reply via email to