[jira] [Commented] (FLINK-31963) java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned checkpoints
[ https://issues.apache.org/jira/browse/FLINK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724113#comment-17724113 ] Weijie Guo commented on FLINK-31963: It seems that all pull request has been merged but this ticket not closed. Include this fix in rc1 of 1.16.2 and close it. > java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned > checkpoints > - > > Key: FLINK-31963 > URL: https://issues.apache.org/jira/browse/FLINK-31963 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.17.0, 1.16.1, 1.15.4, 1.18.0 > Environment: Flink: 1.17.0 > FKO: 1.4.0 > StateBackend: RocksDB(Genetic Incremental Checkpoint & Unaligned Checkpoint > enabled) >Reporter: Tan Kim >Assignee: Stefan Richter >Priority: Critical > Labels: stability > Fix For: 1.16.2, 1.18.0, 1.17.1 > > Attachments: image-2023-04-29-02-49-05-607.png, jobmanager_error.txt, > taskmanager_error.txt > > > I'm testing Autoscaler through Kubernetes Operator and I'm facing the > following issue. > As you know, when a job is scaled down through the autoscaler, the job > manager and task manager go down and then back up again. > When this happens, an index out of bounds exception is thrown and the state > is not restored from a checkpoint. > [~gyfora] told me via the Flink Slack troubleshooting channel that this is > likely an issue with Unaligned Checkpoint and not an issue with the > autoscaler, but I'm opening a ticket with Gyula for more clarification. > Please see the attached JM and TM error logs. > Thank you. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-31963) java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned checkpoints
[ https://issues.apache.org/jira/browse/FLINK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17723126#comment-17723126 ] Piotr Nowojski commented on FLINK-31963: To clarify impact of this bug. This is a rare issue that can happen in every not backpressured job. The problem is that if the input buffers of a downstream subtask are empty AND the output buffers of the upstream subtask are not empty, then in-flight data are incorrectly restored from such checkpoint during a recovery attempt combined with rescaling. This can lead to variety of issues: * ArrayIndexOutOfBoundException when downscaling (as reported here) * in-flight records sent to incorrect downstream subtasks during scaling up or down. This for keyed exchanges will cause an immediate failure when trying to match key group on the downstream subtask. For non keyed exchanges the misalignment can remain undetected, causing incorrect results. Checkpoint itself is not corrupted, so recovery attempt without rescaling would work without without problems. Also recovery and rescaling from such checkpoint using a Flink version that has this bug fixed will also work correctly. > java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned > checkpoints > - > > Key: FLINK-31963 > URL: https://issues.apache.org/jira/browse/FLINK-31963 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.17.0, 1.16.1, 1.15.4, 1.18.0 > Environment: Flink: 1.17.0 > FKO: 1.4.0 > StateBackend: RocksDB(Genetic Incremental Checkpoint & Unaligned Checkpoint > enabled) >Reporter: Tan Kim >Assignee: Stefan Richter >Priority: Critical > Labels: stability > Attachments: image-2023-04-29-02-49-05-607.png, jobmanager_error.txt, > taskmanager_error.txt > > > I'm testing Autoscaler through Kubernetes Operator and I'm facing the > following issue. > As you know, when a job is scaled down through the autoscaler, the job > manager and task manager go down and then back up again. > When this happens, an index out of bounds exception is thrown and the state > is not restored from a checkpoint. > [~gyfora] told me via the Flink Slack troubleshooting channel that this is > likely an issue with Unaligned Checkpoint and not an issue with the > autoscaler, but I'm opening a ticket with Gyula for more clarification. > Please see the attached JM and TM error logs. > Thank you. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-31963) java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned checkpoints
[ https://issues.apache.org/jira/browse/FLINK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17723051#comment-17723051 ] Stefan Richter commented on FLINK-31963: Yes, PR is currently in review here: https://github.com/apache/flink/pull/22584 > java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned > checkpoints > - > > Key: FLINK-31963 > URL: https://issues.apache.org/jira/browse/FLINK-31963 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.17.0, 1.16.1, 1.15.4, 1.18.0 > Environment: Flink: 1.17.0 > FKO: 1.4.0 > StateBackend: RocksDB(Genetic Incremental Checkpoint & Unaligned Checkpoint > enabled) >Reporter: Tan Kim >Assignee: Stefan Richter >Priority: Critical > Labels: stability > Attachments: image-2023-04-29-02-49-05-607.png, jobmanager_error.txt, > taskmanager_error.txt > > > I'm testing Autoscaler through Kubernetes Operator and I'm facing the > following issue. > As you know, when a job is scaled down through the autoscaler, the job > manager and task manager go down and then back up again. > When this happens, an index out of bounds exception is thrown and the state > is not restored from a checkpoint. > [~gyfora] told me via the Flink Slack troubleshooting channel that this is > likely an issue with Unaligned Checkpoint and not an issue with the > autoscaler, but I'm opening a ticket with Gyula for more clarification. > Please see the attached JM and TM error logs. > Thank you. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-31963) java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned checkpoints
[ https://issues.apache.org/jira/browse/FLINK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17723037#comment-17723037 ] Qingsheng Ren commented on FLINK-31963: --- [~srichter] Is there any updates on this issue? Thanks > java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned > checkpoints > - > > Key: FLINK-31963 > URL: https://issues.apache.org/jira/browse/FLINK-31963 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.17.0, 1.16.1, 1.15.4, 1.18.0 > Environment: Flink: 1.17.0 > FKO: 1.4.0 > StateBackend: RocksDB(Genetic Incremental Checkpoint & Unaligned Checkpoint > enabled) >Reporter: Tan Kim >Assignee: Stefan Richter >Priority: Critical > Labels: stability > Attachments: image-2023-04-29-02-49-05-607.png, jobmanager_error.txt, > taskmanager_error.txt > > > I'm testing Autoscaler through Kubernetes Operator and I'm facing the > following issue. > As you know, when a job is scaled down through the autoscaler, the job > manager and task manager go down and then back up again. > When this happens, an index out of bounds exception is thrown and the state > is not restored from a checkpoint. > [~gyfora] told me via the Flink Slack troubleshooting channel that this is > likely an issue with Unaligned Checkpoint and not an issue with the > autoscaler, but I'm opening a ticket with Gyula for more clarification. > Please see the attached JM and TM error logs. > Thank you. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-31963) java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned checkpoints
[ https://issues.apache.org/jira/browse/FLINK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17722120#comment-17722120 ] Stefan Richter commented on FLINK-31963: Seems that this is similar to the problem described in FLINK-27031. > java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned > checkpoints > - > > Key: FLINK-31963 > URL: https://issues.apache.org/jira/browse/FLINK-31963 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.17.0 > Environment: Flink: 1.17.0 > FKO: 1.4.0 > StateBackend: RocksDB(Genetic Incremental Checkpoint & Unaligned Checkpoint > enabled) >Reporter: Tan Kim >Assignee: Stefan Richter >Priority: Critical > Labels: stability > Attachments: image-2023-04-29-02-49-05-607.png, jobmanager_error.txt, > taskmanager_error.txt > > > I'm testing Autoscaler through Kubernetes Operator and I'm facing the > following issue. > As you know, when a job is scaled down through the autoscaler, the job > manager and task manager go down and then back up again. > When this happens, an index out of bounds exception is thrown and the state > is not restored from a checkpoint. > [~gyfora] told me via the Flink Slack troubleshooting channel that this is > likely an issue with Unaligned Checkpoint and not an issue with the > autoscaler, but I'm opening a ticket with Gyula for more clarification. > Please see the attached JM and TM error logs. > Thank you. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-31963) java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned checkpoints
[ https://issues.apache.org/jira/browse/FLINK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17721788#comment-17721788 ] Stefan Richter commented on FLINK-31963: I have a local reproducer as well as a fix, will open a PR once I have written the tests. > java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned > checkpoints > - > > Key: FLINK-31963 > URL: https://issues.apache.org/jira/browse/FLINK-31963 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.17.0 > Environment: Flink: 1.17.0 > FKO: 1.4.0 > StateBackend: RocksDB(Genetic Incremental Checkpoint & Unaligned Checkpoint > enabled) >Reporter: Tan Kim >Assignee: Stefan Richter >Priority: Critical > Labels: stability > Attachments: image-2023-04-29-02-49-05-607.png, jobmanager_error.txt, > taskmanager_error.txt > > > I'm testing Autoscaler through Kubernetes Operator and I'm facing the > following issue. > As you know, when a job is scaled down through the autoscaler, the job > manager and task manager go down and then back up again. > When this happens, an index out of bounds exception is thrown and the state > is not restored from a checkpoint. > [~gyfora] told me via the Flink Slack troubleshooting channel that this is > likely an issue with Unaligned Checkpoint and not an issue with the > autoscaler, but I'm opening a ticket with Gyula for more clarification. > Please see the attached JM and TM error logs. > Thank you. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-31963) java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned checkpoints
[ https://issues.apache.org/jira/browse/FLINK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17721777#comment-17721777 ] Piotr Nowojski commented on FLINK-31963: We have managed to reproduce and find the bug. Thank you for reporting the issue and help with analysing [~tanee.kim] and [~masteryhx]. We are now working on fixing it. > java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned > checkpoints > - > > Key: FLINK-31963 > URL: https://issues.apache.org/jira/browse/FLINK-31963 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.17.0 > Environment: Flink: 1.17.0 > FKO: 1.4.0 > StateBackend: RocksDB(Genetic Incremental Checkpoint & Unaligned Checkpoint > enabled) >Reporter: Tan Kim >Assignee: Stefan Richter >Priority: Critical > Labels: stability > Attachments: image-2023-04-29-02-49-05-607.png, jobmanager_error.txt, > taskmanager_error.txt > > > I'm testing Autoscaler through Kubernetes Operator and I'm facing the > following issue. > As you know, when a job is scaled down through the autoscaler, the job > manager and task manager go down and then back up again. > When this happens, an index out of bounds exception is thrown and the state > is not restored from a checkpoint. > [~gyfora] told me via the Flink Slack troubleshooting channel that this is > likely an issue with Unaligned Checkpoint and not an issue with the > autoscaler, but I'm opening a ticket with Gyula for more clarification. > Please see the attached JM and TM error logs. > Thank you. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-31963) java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned checkpoints
[ https://issues.apache.org/jira/browse/FLINK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17721592#comment-17721592 ] Hangxiang Yu commented on FLINK-31963: -- [~srichter] No, I haven't used side-outputs. The problematic nodes / connection in my job: keyedProcessFunction -> sink (The partitioner type is rebalance). The exception is thrown while scaling down sink node. > java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned > checkpoints > - > > Key: FLINK-31963 > URL: https://issues.apache.org/jira/browse/FLINK-31963 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.17.0 > Environment: Flink: 1.17.0 > FKO: 1.4.0 > StateBackend: RocksDB(Genetic Incremental Checkpoint & Unaligned Checkpoint > enabled) >Reporter: Tan Kim >Priority: Critical > Labels: stability > Attachments: image-2023-04-29-02-49-05-607.png, jobmanager_error.txt, > taskmanager_error.txt > > > I'm testing Autoscaler through Kubernetes Operator and I'm facing the > following issue. > As you know, when a job is scaled down through the autoscaler, the job > manager and task manager go down and then back up again. > When this happens, an index out of bounds exception is thrown and the state > is not restored from a checkpoint. > [~gyfora] told me via the Flink Slack troubleshooting channel that this is > likely an issue with Unaligned Checkpoint and not an issue with the > autoscaler, but I'm opening a ticket with Gyula for more clarification. > Please see the attached JM and TM error logs. > Thank you. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-31963) java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned checkpoints
[ https://issues.apache.org/jira/browse/FLINK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17721387#comment-17721387 ] Stefan Richter commented on FLINK-31963: [~masteryhx] Did your job also make use of side-outputs? Just fishing among things that are potentially "unusual" about the jobs. > java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned > checkpoints > - > > Key: FLINK-31963 > URL: https://issues.apache.org/jira/browse/FLINK-31963 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.17.0 > Environment: Flink: 1.17.0 > FKO: 1.4.0 > StateBackend: RocksDB(Genetic Incremental Checkpoint & Unaligned Checkpoint > enabled) >Reporter: Tan Kim >Priority: Critical > Labels: stability > Attachments: image-2023-04-29-02-49-05-607.png, jobmanager_error.txt, > taskmanager_error.txt > > > I'm testing Autoscaler through Kubernetes Operator and I'm facing the > following issue. > As you know, when a job is scaled down through the autoscaler, the job > manager and task manager go down and then back up again. > When this happens, an index out of bounds exception is thrown and the state > is not restored from a checkpoint. > [~gyfora] told me via the Flink Slack troubleshooting channel that this is > likely an issue with Unaligned Checkpoint and not an issue with the > autoscaler, but I'm opening a ticket with Gyula for more clarification. > Please see the attached JM and TM error logs. > Thank you. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-31963) java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned checkpoints
[ https://issues.apache.org/jira/browse/FLINK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17721179#comment-17721179 ] Hangxiang Yu commented on FLINK-31963: -- Hi, [~pnowojski]. I am a bit sure that it may not be related to unified file mergeing of unaligned checkpoints because I meet above exception in 1.15. My job is a bit complicated so I tried to simplify it to reproduce it. But I haven't currently. I will share more if I can reproduce it by a simple job or an ITCase. > java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned > checkpoints > - > > Key: FLINK-31963 > URL: https://issues.apache.org/jira/browse/FLINK-31963 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.17.0 > Environment: Flink: 1.17.0 > FKO: 1.4.0 > StateBackend: RocksDB(Genetic Incremental Checkpoint & Unaligned Checkpoint > enabled) >Reporter: Tan Kim >Priority: Critical > Labels: stability > Attachments: image-2023-04-29-02-49-05-607.png, jobmanager_error.txt, > taskmanager_error.txt > > > I'm testing Autoscaler through Kubernetes Operator and I'm facing the > following issue. > As you know, when a job is scaled down through the autoscaler, the job > manager and task manager go down and then back up again. > When this happens, an index out of bounds exception is thrown and the state > is not restored from a checkpoint. > [~gyfora] told me via the Flink Slack troubleshooting channel that this is > likely an issue with Unaligned Checkpoint and not an issue with the > autoscaler, but I'm opening a ticket with Gyula for more clarification. > Please see the attached JM and TM error logs. > Thank you. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-31963) java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned checkpoints
[ https://issues.apache.org/jira/browse/FLINK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17720610#comment-17720610 ] Piotr Nowojski commented on FLINK-31963: Additionally to what [~srichter] asked. [~tanee.kim], would it be possible for you to provide the checkpoint files from when the failure was happening, so that we could reproduce it more easily? > java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned > checkpoints > - > > Key: FLINK-31963 > URL: https://issues.apache.org/jira/browse/FLINK-31963 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.17.0 > Environment: Flink: 1.17.0 > FKO: 1.4.0 > StateBackend: RocksDB(Genetic Incremental Checkpoint & Unaligned Checkpoint > enabled) >Reporter: Tan Kim >Priority: Critical > Labels: stability > Attachments: image-2023-04-29-02-49-05-607.png, jobmanager_error.txt, > taskmanager_error.txt > > > I'm testing Autoscaler through Kubernetes Operator and I'm facing the > following issue. > As you know, when a job is scaled down through the autoscaler, the job > manager and task manager go down and then back up again. > When this happens, an index out of bounds exception is thrown and the state > is not restored from a checkpoint. > [~gyfora] told me via the Flink Slack troubleshooting channel that this is > likely an issue with Unaligned Checkpoint and not an issue with the > autoscaler, but I'm opening a ticket with Gyula for more clarification. > Please see the attached JM and TM error logs. > Thank you. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-31963) java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned checkpoints
[ https://issues.apache.org/jira/browse/FLINK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17720596#comment-17720596 ] Stefan Richter commented on FLINK-31963: Hi, just to clarify: when you say a checkpoint that fails once fails always - does this only apply for restore with rescaling or can you also not recover from the CP when the parallelism remains unchanged? If it only happens with rescaling, can you at least recover for some parallelism values or for no change at all? > java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned > checkpoints > - > > Key: FLINK-31963 > URL: https://issues.apache.org/jira/browse/FLINK-31963 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.17.0 > Environment: Flink: 1.17.0 > FKO: 1.4.0 > StateBackend: RocksDB(Genetic Incremental Checkpoint & Unaligned Checkpoint > enabled) >Reporter: Tan Kim >Priority: Critical > Labels: stability > Attachments: image-2023-04-29-02-49-05-607.png, jobmanager_error.txt, > taskmanager_error.txt > > > I'm testing Autoscaler through Kubernetes Operator and I'm facing the > following issue. > As you know, when a job is scaled down through the autoscaler, the job > manager and task manager go down and then back up again. > When this happens, an index out of bounds exception is thrown and the state > is not restored from a checkpoint. > [~gyfora] told me via the Flink Slack troubleshooting channel that this is > likely an issue with Unaligned Checkpoint and not an issue with the > autoscaler, but I'm opening a ticket with Gyula for more clarification. > Please see the attached JM and TM error logs. > Thank you. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-31963) java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned checkpoints
[ https://issues.apache.org/jira/browse/FLINK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17720049#comment-17720049 ] Tan Kim commented on FLINK-31963: - Q1 If it happens once during recovery from an unaligned checkpoint, it will always happen from the same checkpoint. Q2 If the numRecordsOut metric applies to all operators, including chaining, then I may have jumped the gun. Since scaling usually takes time and source & downstream scaling conditions are different, I guess I should have monitored it more closely. Thanks for answering my question. > java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned > checkpoints > - > > Key: FLINK-31963 > URL: https://issues.apache.org/jira/browse/FLINK-31963 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.17.0 > Environment: Flink: 1.17.0 > FKO: 1.4.0 > StateBackend: RocksDB(Genetic Incremental Checkpoint & Unaligned Checkpoint > enabled) >Reporter: Tan Kim >Priority: Critical > Labels: stability > Attachments: image-2023-04-29-02-49-05-607.png, jobmanager_error.txt, > taskmanager_error.txt > > > I'm testing Autoscaler through Kubernetes Operator and I'm facing the > following issue. > As you know, when a job is scaled down through the autoscaler, the job > manager and task manager go down and then back up again. > When this happens, an index out of bounds exception is thrown and the state > is not restored from a checkpoint. > [~gyfora] told me via the Flink Slack troubleshooting channel that this is > likely an issue with Unaligned Checkpoint and not an issue with the > autoscaler, but I'm opening a ticket with Gyula for more clarification. > Please see the attached JM and TM error logs. > Thank you. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-31963) java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned checkpoints
[ https://issues.apache.org/jira/browse/FLINK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17718610#comment-17718610 ] Piotr Nowojski commented on FLINK-31963: Thanks for the answers [~tanee.kim] {quote} It happens a couple of times, but not always. {quote} But once it happened once during a recovery from an unaligned checkpoint, does it happen always for that same checkpoint? Or even that is indeterministic and retrying recovery from the same checkpoint can sucede? {quote} A question unrelated to this ticket, but if the subtasks that exist in the above jobgraph all appear to be one, why is that? In order to do source scaling, the outputRecords value needs to be non-zero, but since the downstream after the kafka source stream is not separated on the jobgraph, the outputRecords is getting zero, so we explicitly added a keyBy operator to the kafka source stream so that we can intentionally separate them and then calculate the outputRecords value. (I don't think this is very good for performance) Is there any other way to ensure that the streams are separated into two at the desired location in the jobgraph? {quote} You can brake chains via {{startNewChain}} or {{disableChaining}} https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/overview/#task-chaining-and-resource-groups . However this doesn't seem like the right think to do. What do you mean by {{outputRecords}}? {{numRecordsOut}} metric should be available for all operators, including chained source operators. > java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned > checkpoints > - > > Key: FLINK-31963 > URL: https://issues.apache.org/jira/browse/FLINK-31963 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.17.0 > Environment: Flink: 1.17.0 > FKO: 1.4.0 > StateBackend: RocksDB(Genetic Incremental Checkpoint & Unaligned Checkpoint > enabled) >Reporter: Tan Kim >Priority: Critical > Labels: stability > Attachments: image-2023-04-29-02-49-05-607.png, jobmanager_error.txt, > taskmanager_error.txt > > > I'm testing Autoscaler through Kubernetes Operator and I'm facing the > following issue. > As you know, when a job is scaled down through the autoscaler, the job > manager and task manager go down and then back up again. > When this happens, an index out of bounds exception is thrown and the state > is not restored from a checkpoint. > [~gyfora] told me via the Flink Slack troubleshooting channel that this is > likely an issue with Unaligned Checkpoint and not an issue with the > autoscaler, but I'm opening a ticket with Gyula for more clarification. > Please see the attached JM and TM error logs. > Thank you. -- This message was sent by Atlassian Jira (v8.20.10#820010)