[jira] [Commented] (FLINK-23556) SQLClientSchemaRegistryITCase fails with " Subject ... not found"
[ https://issues.apache.org/jira/browse/FLINK-23556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17402730#comment-17402730 ]

Biao Geng commented on FLINK-23556:
-----------------------------------

hi [~xtsong], I think my pull request is ready. I ran the e2e tests three times and this case passed every time. I would appreciate it a lot if you could help me find a reviewer for it. Thanks! PR link: [https://github.com/apache/flink/pull/16864]

> SQLClientSchemaRegistryITCase fails with "Subject ... not found"
> ----------------------------------------------------------------
>
>                 Key: FLINK-23556
>                 URL: https://issues.apache.org/jira/browse/FLINK-23556
>             Project: Flink
>          Issue Type: Bug
>          Components: Table SQL / Ecosystem
>    Affects Versions: 1.14.0
>            Reporter: Dawid Wysakowicz
>            Assignee: Biao Geng
>            Priority: Blocker
>              Labels: pull-request-available, stale-blocker, test-stability
>             Fix For: 1.14.0
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=21129=logs=91bf6583-3fb2-592f-e4d4-d79d79c3230a=cc5499f8-bdde-5157-0d76-b6528ecd808e=25337
> {code}
> Jul 28 23:37:48 [ERROR] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 209.44 s <<< FAILURE! - in org.apache.flink.tests.util.kafka.SQLClientSchemaRegistryITCase
> Jul 28 23:37:48 [ERROR] testWriting(org.apache.flink.tests.util.kafka.SQLClientSchemaRegistryITCase)  Time elapsed: 81.146 s  <<< ERROR!
> Jul 28 23:37:48 io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Subject 'test-user-behavior-d18d4af2-3830-4620-9993-340c13f50cc2-value' not found.; error code: 40401
> Jul 28 23:37:48 	at io.confluent.kafka.schemaregistry.client.rest.RestService.sendHttpRequest(RestService.java:292)
> Jul 28 23:37:48 	at io.confluent.kafka.schemaregistry.client.rest.RestService.httpRequest(RestService.java:352)
> Jul 28 23:37:48 	at io.confluent.kafka.schemaregistry.client.rest.RestService.getAllVersions(RestService.java:769)
> Jul 28 23:37:48 	at io.confluent.kafka.schemaregistry.client.rest.RestService.getAllVersions(RestService.java:760)
> Jul 28 23:37:48 	at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getAllVersions(CachedSchemaRegistryClient.java:364)
> Jul 28 23:37:48 	at org.apache.flink.tests.util.kafka.SQLClientSchemaRegistryITCase.getAllVersions(SQLClientSchemaRegistryITCase.java:230)
> Jul 28 23:37:48 	at org.apache.flink.tests.util.kafka.SQLClientSchemaRegistryITCase.testWriting(SQLClientSchemaRegistryITCase.java:195)
> Jul 28 23:37:48 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> Jul 28 23:37:48 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> Jul 28 23:37:48 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Jul 28 23:37:48 	at java.lang.reflect.Method.invoke(Method.java:498)
> Jul 28 23:37:48 	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> Jul 28 23:37:48 	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> Jul 28 23:37:48 	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
> Jul 28 23:37:48 	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> Jul 28 23:37:48 	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
> Jul 28 23:37:48 	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
> Jul 28 23:37:48 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> Jul 28 23:37:48 	at java.lang.Thread.run(Thread.java:748)
> Jul 28 23:37:48
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Comment Edited] (FLINK-23556) SQLClientSchemaRegistryITCase fails with " Subject ... not found"
[ https://issues.apache.org/jira/browse/FLINK-23556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17401174#comment-17401174 ]

Biao Geng edited comment on FLINK-23556 at 8/18/21, 4:09 PM:
------------------------------------------------------------

Hi [~xtsong], I may need one more day to verify the PR after today's discussion with [~renqs]. Once I think it is ready, I will contact you for a review. Thanks!

was (Author: bgeng777): Hi [~xtsong], I may need one day to verify the PR after today's discussion with [~renqs]. Once I think it is ready, I will contact you for a review. Thanks!
[jira] [Commented] (FLINK-23556) SQLClientSchemaRegistryITCase fails with " Subject ... not found"
[ https://issues.apache.org/jira/browse/FLINK-23556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17400465#comment-17400465 ]

Biao Geng commented on FLINK-23556:
-----------------------------------

hi [~xtsong], thanks a lot for your support! I have dug into this IT case for some days and created a [PR|https://github.com/apache/flink/pull/16864] to make it more stable. My current fix does not actually guarantee that this case succeeds; it should just make it more stable. My summary:
* AFAIK, the logic and implementation of this case are correct.
* Due to the complexity of this case, its running time should not be limited to 120 seconds. In my own tests, even the internal Flink job on its own can easily exceed that limit.
* I believe the failure of this case is not caused by Flink itself. Instead, I suspect a network issue among the test containers (i.e. a Kafka container, a Flink container, and a Schema Registry container) is the more likely root cause, but I need more time to verify this guess.
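The timing argument above suggests the shape such a stabilization usually takes: instead of a single `getAllVersions` lookup that fails if the sink job has not registered the subject yet, poll until the subject appears or an attempt budget runs out. The helper below is a hypothetical sketch, not code from the actual pull request; the class and method names are invented for illustration.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch (NOT the code from the actual PR): retry a condition,
// e.g. "does the subject exist in the schema registry yet?", until it holds
// or the attempt budget is exhausted.
public class RetryUntil {

    /**
     * Invokes {@code condition} up to {@code maxAttempts} times, sleeping
     * {@code sleepMillis} between attempts. Returns true as soon as the
     * condition holds, false if it never does.
     */
    public static boolean retryUntil(BooleanSupplier condition, int maxAttempts, long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Simulate a subject that only becomes visible on the third poll.
        final int[] calls = {0};
        boolean found = retryUntil(() -> ++calls[0] >= 3, 10, 1L);
        System.out.println("found=" + found + " after " + calls[0] + " attempts");
        // prints: found=true after 3 attempts
    }
}
```

Note that wrapping the lookup this way only tolerates slowness; if the root cause really is container networking, polling longer would not fix it.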
[jira] [Comment Edited] (FLINK-23556) SQLClientSchemaRegistryITCase fails with " Subject ... not found"
[ https://issues.apache.org/jira/browse/FLINK-23556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17402730#comment-17402730 ]

Biao Geng edited comment on FLINK-23556 at 8/24/21, 4:29 AM:
------------------------------------------------------------

hi [~xtsong], I think my pull request is ready. I ran the e2e tests three times and this case passed every time. I would appreciate it a lot if you could help me find a reviewer for it. Thanks! PR link: [https://github.com/apache/flink/pull/16952]

was (Author: bgeng777): hi [~xtsong], I think my pull request is ready. I ran the e2e tests three times and this case passed every time. I would appreciate it a lot if you could help me find a reviewer for it. Thanks! PR link: [https://github.com/apache/flink/pull/16864]
[jira] [Commented] (FLINK-23556) SQLClientSchemaRegistryITCase fails with " Subject ... not found"
[ https://issues.apache.org/jira/browse/FLINK-23556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396751#comment-17396751 ]

Biao Geng commented on FLINK-23556:
-----------------------------------

I have read the test code and I suspect that the port in `avro-confluent.url` may be wrong due to the container's port mapping. I will discuss with [~renqs] to verify this guess.
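To make the port-mapping hypothesis concrete: inside the Docker network a container is reached via its network alias and its internal port, while a client on the other side of the mapping must use the host address and the dynamically mapped port. The sketch below is purely illustrative; the alias, both port numbers, and the helper are invented, and the real test wires this up through its container framework.

```java
// Purely illustrative: the alias "schema-registry" and the ports 8082/55021
// are invented values, not taken from the actual test.
public class RegistryUrls {

    static String url(String host, int port) {
        return "http://" + host + ":" + port;
    }

    public static void main(String[] args) {
        // Reachable only by containers on the same Docker network:
        String insideNetwork = url("schema-registry", 8082);
        // Reachable from outside, after Docker maps 8082 to a random host port:
        String fromHost = url("localhost", 55021);
        // If a client is handed the URL that is valid on the *other* side of
        // the mapping, its requests go to a registry that is not there, which
        // can surface as a "Subject ... not found" failure later on.
        System.out.println(insideNetwork);
        System.out.println(fromHost);
    }
}
```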
[jira] [Commented] (FLINK-24897) Enable application mode on YARN to use usrlib
[ https://issues.apache.org/jira/browse/FLINK-24897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17446258#comment-17446258 ]

Biao Geng commented on FLINK-24897:
-----------------------------------

Hi [~trohrmann], would you mind commenting on whether we promise that jars under {{usrlib}} will always be loaded by the user classloader? Besides, after the above discussion with Yang, the current solution is:
1. {{usrlib}} will be shipped automatically if it exists.
2. If {{usrlib}} is added to the ship files again, an exception will be thrown.
3. {{usrlib}} will work for both per-job and application mode.
4. Jars in {{usrlib}} will be loaded by the user classloader only when UserJarInclusion is DISABLED. In all other cases, the AppClassLoader will be used.
Thanks.

> Enable application mode on YARN to use usrlib
> ---------------------------------------------
>
>                 Key: FLINK-24897
>                 URL: https://issues.apache.org/jira/browse/FLINK-24897
>             Project: Flink
>          Issue Type: Improvement
>          Components: Deployment / YARN
>            Reporter: Biao Geng
>            Priority: Major
>
> Hi there,
> I am working on using application mode to submit Flink jobs to a YARN cluster, but I find that there is currently no easy way to ship my user-defined jars (e.g. some custom connectors or UDF jars that would be shared by several jobs) and have the FlinkUserCodeClassLoader load classes from these jars.
> I checked some relevant jiras, like FLINK-21289. In k8s mode, there is a solution: users can put their user-defined jars in the `usrlib` directory, and these jars will be loaded by the FlinkUserCodeClassLoader when the job is executed on the JM/TM.
> But in YARN mode, `usrlib` does not work like that:
> In the method org.apache.flink.yarn.YarnClusterDescriptor#addShipFiles, if I want to use `yarn.ship-files` to ship `usrlib` from my Flink client (on my local machine) to the remote cluster, I must not set UserJarInclusion to DISABLED due to the checkArgument(). However, if I do not set that option to DISABLED, the user jars to be shipped will be added to the systemClassPaths. As a result, classes in those user jars will be loaded by the AppClassLoader.
> But if I do not ship these jars, there is no convenient way to use them in my `flink run` command. Currently, all I can do is use the `-C` option, which means I have to upload my jars to some shared store first and then use those remote paths. This is not ideal, as we already make it possible to ship jars or files directly, and we have also introduced `usrlib` in application mode on YARN. It would be more user-friendly if we allowed shipping `usrlib` from local to the remote cluster while using the FlinkUserCodeClassLoader to load classes from the jars in `usrlib`.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
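From the client side, the behavior summarized in points 1-4 above could look roughly like the sketch below. The directory layout, jar names, and job jar are hypothetical, and the exact configuration key behind UserJarInclusion (shown here as `yarn.per-job-cluster.include-user-jar`, the key in current releases) may differ in the final change.

```shell
# Hypothetical client-side layout; paths and jar names are made up:
#
#   $FLINK_HOME/
#     lib/                   # Flink runtime jars -> AppClassLoader
#     usrlib/                # user jars -> intended for the user classloader
#       my-udf.jar
#       custom-connector.jar
#
# Under the proposed behavior, usrlib/ is shipped automatically if it exists,
# so listing it again in yarn.ship-files would raise an exception (point 2).
# Per point 4, the user classloader is only used when UserJarInclusion is
# DISABLED:
flink run-application -t yarn-application \
  -Dyarn.per-job-cluster.include-user-jar=DISABLED \
  MyJob.jar
```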
[jira] [Commented] (FLINK-24897) Enable application mode on YARN to use usrlib
[ https://issues.apache.org/jira/browse/FLINK-24897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17450424#comment-17450424 ]

Biao Geng commented on FLINK-24897:
-----------------------------------

Thank you very much. I will follow the above discussion and create a pull request this (hopefully) or next week.
[jira] [Updated] (FLINK-24897) Enable application mode on YARN to use usrlib
[ https://issues.apache.org/jira/browse/FLINK-24897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-24897:
---
Description:
Hi there, I am working to use application mode to submit Flink jobs to a YARN cluster, but I find that there is currently no easy way to ship user-defined jars (e.g. custom connectors or UDF jars shared by several jobs) and have the FlinkUserCodeClassLoader load the classes in these jars.

I checked some relevant JIRAs, such as FLINK-21289. In Kubernetes mode there is a solution: users can put their user-defined jars in the `usrlib` directory, and these jars are loaded by FlinkUserCodeClassLoader when the job is executed on the JM/TM.

On YARN, however, `usrlib` does not work that way. In the method org.apache.flink.yarn.YarnClusterDescriptor#addShipFiles, if I want to use `yarn.ship-files` to ship `usrlib` from my Flink client (on my local machine) to the remote cluster, I must not set UserJarInclusion to DISABLED, due to the checkArgument(). However, if I do not set that option to DISABLED, the user jars to be shipped are added to systemClassPaths, so classes in those jars are loaded by the AppClassLoader.

If I do not ship these jars, there is no convenient way to use them in my flink run command. Currently, all I can do is use the `-C` option, which means I have to upload my jars to some shared store first and then use those remote paths. This is not ideal, since we have already made it possible to ship jars or files directly, and we have also introduced `usrlib` in application mode on YARN. It would be more user-friendly if we could allow shipping `usrlib` from the local machine to the remote cluster while using FlinkUserCodeClassLoader to load the classes in the jars in `usrlib`.

> Enable application mode on YARN to use usrlib
> ---------------------------------------------
>
> Key: FLINK-24897
> URL: https://issues.apache.org/jira/browse/FLINK-24897
> Project: Flink
> Issue Type: Improvement
> Components: Deployment / YARN
> Reporter: Biao Geng
> Priority: Major
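The interaction described in this issue can be illustrated with a minimal sketch. All names here are hypothetical simplifications; the real check lives in org.apache.flink.yarn.YarnClusterDescriptor#addShipFiles, and the enum values mirror the shape of UserJarInclusion (exact values are an assumption):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the behavior described above (hypothetical names):
// ship files are rejected outright when user-jar inclusion is DISABLED,
// and otherwise every shipped jar lands on the system classpath, so the
// AppClassLoader (not FlinkUserCodeClassLoader) loads its classes.
public class ShipFilesSketch {
    enum UserJarInclusion { ORDER, FIRST, LAST, DISABLED }

    private final UserJarInclusion inclusion;
    private final List<String> systemClassPaths = new ArrayList<>();

    ShipFilesSketch(UserJarInclusion inclusion) {
        this.inclusion = inclusion;
    }

    // Mirrors the checkArgument() mentioned in the issue description.
    void addShipFiles(List<String> files) {
        if (inclusion == UserJarInclusion.DISABLED) {
            throw new IllegalArgumentException(
                    "Ship files are not allowed when user-jar inclusion is DISABLED");
        }
        // Not DISABLED: the shipped jars are added to the system classpaths.
        systemClassPaths.addAll(files);
    }

    List<String> systemClassPaths() {
        return systemClassPaths;
    }

    public static void main(String[] args) {
        ShipFilesSketch ok = new ShipFilesSketch(UserJarInclusion.ORDER);
        ok.addShipFiles(List.of("usrlib/my-udf.jar"));
        System.out.println("on system classpath: " + ok.systemClassPaths());

        ShipFilesSketch disabled = new ShipFilesSketch(UserJarInclusion.DISABLED);
        try {
            disabled.addShipFiles(List.of("usrlib/my-udf.jar"));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Either way the user loses: shipping `usrlib` forces the jars onto the system classpath, while not shipping them leaves only the `-C` workaround.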
[jira] [Commented] (FLINK-24897) Enable application mode on YARN to use usrlib
[ https://issues.apache.org/jira/browse/FLINK-24897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17443680#comment-17443680 ] Biao Geng commented on FLINK-24897:
---
Hi [~trohrmann] [~wangyang0918] [~xtsong], very sorry to bother you, but you folks are experts in this area. Would you mind giving some feedback on this issue? Thanks a lot! I am also willing to work on this issue if we decide to add this ability.
--
This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-24897) Enable application mode on YARN to use usrlib
Biao Geng created FLINK-24897:
-
Summary: Enable application mode on YARN to use usrlib
Key: FLINK-24897
URL: https://issues.apache.org/jira/browse/FLINK-24897
Project: Flink
Issue Type: Improvement
Components: Deployment / YARN
Reporter: Biao Geng
[jira] [Commented] (FLINK-24897) Enable application mode on YARN to use usrlib
[ https://issues.apache.org/jira/browse/FLINK-24897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17445748#comment-17445748 ] Biao Geng commented on FLINK-24897:
---
[~wangyang0918] Thanks for the reply! I also thought about throwing an exception when considering the 3 options, and I can see its good points. My only concern is that it is somewhat different from what we do when users specify {{lib}} in the ship files as well: I do not find any constraint on specifying {{lib}} when shipping files. If we do not need to worry about that, reporting an error makes sense to me too.

For your third point, I believe it would bring the fewest changes to our current codebase. I just have one question about the history of {{usrlib}}: do we have any contract with users that jars in {{usrlib}} will be loaded by the user classloader by default? If not, I think your solution is the cleanest one.

BTW, the name {{yarn.per-job-cluster.include-user-jar}} may need to change to something like {{yarn.include-user-jar}} if we reach consensus on point 3.
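If point 3 and the rename were adopted, a user configuration might look like the following sketch. {{yarn.include-user-jar}} is only the name proposed in this discussion, not a released option; {{yarn.ship-files}} is the existing ship option:

```
# flink-conf.yaml (sketch; `yarn.include-user-jar` is only a proposed
# rename from this discussion, not an existing configuration key)
yarn.ship-files: /path/to/extra-artifacts
# Keep shipped user jars off the system classpath so that
# FlinkUserCodeClassLoader loads the jars in usrlib.
yarn.include-user-jar: DISABLED
```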
[jira] [Commented] (FLINK-24897) Enable application mode on YARN to use usrlib
[ https://issues.apache.org/jira/browse/FLINK-24897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17445631#comment-17445631 ] Biao Geng commented on FLINK-24897:
---
Hi [~trohrmann] and [~wangyang0918], thank you very much for your replies. I agree with Till's suggestion to reuse the existing logic to include {{usrlib}} in the user classloader. Yang's questions are also helpful and critical.

*A summary of my answers about {{usrlib}}:*
0. We should ship {{usrlib}} by default, as we already do for the {{lib}} dir.
1. If users specify {{usrlib}} again in the {{yarn.ship-files}} option, we should avoid uploading it twice and should not add its classes to the system classpath.
2. It should also work for per-job mode.
3. {{usrlib}} takes effect in per-job mode only when UserJarInclusion is DISABLED, so we should reconsider the default value of the {{UserJarInclusion}} option.

*Detail:*
Q1: We should ship {{usrlib}} by default if it exists, because AFAIK {{usrlib}} is the default user classpath defined by Flink. If we ask users to explicitly specify it, we somewhat waste Flink's contract with users. When users specify a shipped directory named "usrlib", I see 3 options:
Option 1: skip it.
Option 2: report an error.
Option 3: do nothing special; just upload it and add the files in {{usrlib}} to the system classpaths.
Option 1 seems easiest, just as we do for {{flink_dist.jar}} when users specify {{lib}} in the ship files. Option 3 is worth mentioning: if users specify {{usrlib}} in the ship files, its files are added to the system classpaths, and with the child-first resolve order they are also loaded by the user classloader, since they are on the user classpath as well. But if users choose the parent-first resolve order, the files in {{usrlib}} are loaded by the AppClassLoader, which breaks the design. So, in summary, I think skipping it is the better choice.

Q2: After checking the code around {{FileJobGraphRetriever}} and {{YarnJobClusterEntrypoint}}, I think we are already prepared to use {{usrlib}} once we upload it to the cluster.

Q3: I agree that {{usrlib}} takes effect in per-job mode only when UserJarInclusion is DISABLED. However, the current default value of UserJarInclusion is {{ORDER}}, and it applies to all 3 modes (per-job, session, application). If we agree that {{usrlib}} should be shipped automatically, we may need to reconsider the default value of this option so that the user classloader loads the jars in {{usrlib}}.
--
This message was sent by Atlassian Jira (v8.20.1#820001)
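Option 1 from Q1 above (skipping a user-specified `usrlib` among the ship files, analogous to how a user-specified `lib` overlapping {{flink_dist.jar}} is handled) could be sketched as follows. All names are hypothetical and only illustrate the filtering:

```java
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

// Minimal sketch of Option 1: silently drop any user-specified "usrlib"
// entry from the ship list, because it is shipped (and class-loaded by
// the user classloader) through a dedicated path instead.
// All names are hypothetical.
public class SkipUsrLib {
    static final String DEFAULT_FLINK_USR_LIB_DIR = "usrlib";

    static List<Path> filterShipFiles(List<Path> shipFiles) {
        return shipFiles.stream()
                // keep everything except an entry literally named "usrlib"
                .filter(p -> !DEFAULT_FLINK_USR_LIB_DIR.equals(
                        p.getFileName().toString()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Path> ship = List.of(
                Path.of("plugins"), Path.of("usrlib"), Path.of("conf/extra"));
        // "usrlib" is dropped; the other entries are shipped as usual.
        System.out.println(filterShipFiles(ship));
    }
}
```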
[jira] [Commented] (FLINK-24897) Enable application mode on YARN to use usrlib
[ https://issues.apache.org/jira/browse/FLINK-24897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17445605#comment-17445605 ] Biao Geng commented on FLINK-24897: --- [~long jiang], thanks a lot for your advice. I have focused on a local `usrlib`, but your suggestion of also supporting HDFS is a good point. I am also wondering: when you implemented `usrlib` support yourselves, did you ship it by default or ask the user to specify it, especially once you supported directories on HDFS? -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-24494) Avro Confluent Registry SQL kafka connector fails to write to the topic with schema
[ https://issues.apache.org/jira/browse/FLINK-24494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437208#comment-17437208 ] Biao Geng commented on FLINK-24494: --- Hi [~mahen], I have met this problem as well. I implemented a simple catalog that connects to the Schema Registry (SR) service to fetch the schema string for a Kafka topic whose format is Confluent Avro. I also used the 'AvroSchemaConverter#convertToDataType()' method to parse the Confluent Avro schema string from SR, and it works well in a Flink SQL job like 'SELECT * FROM xx_topic'. However, when I tried to insert new records into the topic by executing a statement like 'INSERT INTO xx_topic SELECT * FROM xx_topic LIMIT 2', I found via the SR API a new schema version whose name is 'record' and whose namespace is erased. That matches [~MartijnVisser]'s description. I wonder whether there is a reason for hardcoding the record name in the schema as 'record', and whether there is a plan to fix this. Thank you very much.
> Avro Confluent Registry SQL kafka connector fails to write to the topic with > schema > - > > Key: FLINK-24494 > URL: https://issues.apache.org/jira/browse/FLINK-24494 > Project: Flink > Issue Type: Bug > Components: Connectors / Kafka, Table SQL / API >Affects Versions: 1.14.0, 1.13.2 >Reporter: Mahendran Ponnusamy >Priority: Critical > Attachments: Screen Shot 2021-10-12 at 10.38.55 AM.png, image-2021-10-12-12-09-30-374.png, image-2021-10-12-12-11-04-664.png, image-2021-10-12-12-18-53-016.png, image-2021-10-12-12-19-37-227.png, image-2021-10-12-12-21-02-008.png, image-2021-10-12-12-21-46-177.png > > > *Summary:* > Given a schema registered to a topic with a name and namespace, when Flink SQL with the upsert-kafka connector writes to the topic, it fails because the row it tries to produce is not compatible with the registered schema. > > *Root cause:* > The upsert-kafka connector auto-generates a schema with the +*name `record` and no namespace*+. The schema below is generated by the connector. I expect the connector to pull the schema from the subject and serialize with it via a ConfluentAvroRowSerialization (which I believe does not exist today). > Schema generated by the upsert-kafka connector, which uses AvroRowSerializer internally: !image-2021-10-12-12-21-46-177.png|width=813,height=23!
> Schema registered to the subject:
> {code:json}
> {
>   "type" : "record",
>   "name" : "SimpleCustomer",
>   "namespace" : "com...example.model",
>   "fields" : [ {
>     "name" : "customerId",
>     "type" : "string"
>   }, {
>     "name" : "age",
>     "type" : "int",
>     "default" : 0
>   } ]
> }
> {code}
> Table SQL with upsert-kafka connector !image-2021-10-12-12-18-53-016.png|width=351,height=176! > The name "record" hardcoded !image-2021-10-12-12-21-02-008.png|width=464,height=138! -- This message was sent by Atlassian Jira (v8.3.4#803005)
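The mismatch reported here (registered name `SimpleCustomer` with a namespace vs. generated name `record` with none) can be illustrated with a short sketch. This is hypothetical illustration only; the namespace below is a made-up placeholder, and the comparison mimics how Avro derives a record's full name:

```python
# The schema the connector generates uses the hardcoded name "record" and
# drops the namespace, so its Avro full name differs from the schema
# registered under the subject -- the registry sees a different type and
# records a new schema version.
registered = {
    "type": "record",
    "name": "SimpleCustomer",
    "namespace": "com.example.model",  # placeholder namespace
    "fields": [
        {"name": "customerId", "type": "string"},
        {"name": "age", "type": "int", "default": 0},
    ],
}
generated = {k: v for k, v in registered.items() if k != "namespace"}
generated["name"] = "record"

def full_name(schema):
    # Avro identifies a record type by namespace.name; a different full
    # name means a different type.
    ns = schema.get("namespace")
    return f"{ns}.{schema['name']}" if ns else schema["name"]

print(full_name(registered))  # com.example.model.SimpleCustomer
print(full_name(generated))   # record
```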
[jira] [Comment Edited] (FLINK-24494) Avro Confluent Registry SQL kafka connector fails to write to the topic with schema
[ https://issues.apache.org/jira/browse/FLINK-24494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437208#comment-17437208 ] Biao Geng edited comment on FLINK-24494 at 11/2/21, 9:56 AM: - Hi [~mahen], I have met this problem as well. I implemented a simple catalog that connects to the Schema Registry (SR) service to fetch the schema string for a Kafka topic whose format is Confluent Avro. I also used the 'AvroSchemaConverter#convertToDataType()' method to parse the Confluent Avro schema string from SR, so I can get the CatalogTable directly from my catalog. It works well in a Flink SQL job like 'SELECT * FROM xx_topic'. However, when I tried to insert new records into the topic by executing a statement like 'INSERT INTO xx_topic SELECT * FROM xx_topic LIMIT 2', the SQL job itself works, but the SR API shows a new schema version whose name is 'record' and whose namespace is erased. Same as [~MartijnVisser]'s description. I wonder whether there is a reason for hardcoding the record name in the schema as 'record'. Thank you very much.
[jira] [Comment Edited] (FLINK-24494) Avro Confluent Registry SQL kafka connector fails to write to the topic with schema
[ https://issues.apache.org/jira/browse/FLINK-24494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437208#comment-17437208 ] Biao Geng edited comment on FLINK-24494 at 11/2/21, 9:10 AM: - Hi [~mahen], I have met this problem as well. I implemented a simple catalog that connects to the Schema Registry (SR) service to fetch the schema string for a Kafka topic whose format is Confluent Avro. I also used the 'AvroSchemaConverter#convertToDataType()' method to parse the Confluent Avro schema string from SR, so I can get the CatalogTable directly from my catalog. It works well in a Flink SQL job like 'SELECT * FROM xx_topic'. However, when I tried to insert new records into the topic by executing a statement like 'INSERT INTO xx_topic SELECT * FROM xx_topic LIMIT 2', I found via the SR API a new schema version whose name is 'record' and whose namespace is erased. I believe that matches [~MartijnVisser]'s description. I wonder whether there is a reason for hardcoding the record name in the schema as 'record', and whether there is a plan to fix this. Thank you very much.
[jira] [Created] (FLINK-24682) Unify the -C option behavior in both yarn application and per-job mode
Biao Geng created FLINK-24682: - Summary: Unify the -C option behavior in both yarn application and per-job mode Key: FLINK-24682 URL: https://issues.apache.org/jira/browse/FLINK-24682 Project: Flink Issue Type: Improvement Components: Deployment / YARN Affects Versions: 1.12.3 Environment: flink 1.12.3 yarn 2.8.5 Reporter: Biao Geng Recently, when switching the job submission mode from per-job mode to application mode on YARN, we found the behavior of '-C' ('--classpath') somewhat misleading: In per-job mode, the main() method of the program is executed on the local machine, and the '-C' option works well when we use it to specify local user jars like -C file://xx.jar. But in application mode, this option works differently: since the main() method is executed on the job manager in the cluster, it is unclear where a URL like `file://xx.jar` points. Judging from the code, `file://xx.jar` appears to be resolved on the job manager machine in the cluster. If that is true, it may mislead users, because in per-job mode the same URL refers to jars on the client machine. In summary, unifying the -C option behavior between yarn application and per-job mode would help users switch to application mode more smoothly and, more importantly, make it much easier to specify local jars on the client machine that should be loaded by the user classloader. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-24682) Unify the -C option behavior in both yarn application and per-job mode
[ https://issues.apache.org/jira/browse/FLINK-24682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17440335#comment-17440335 ] Biao Geng commented on FLINK-24682: --- Hi, [~trohrmann] [~xtsong], would you mind leaving any comments on this issue? Thanks -- This message was sent by Atlassian Jira (v8.20.1#820001)
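The asymmetry FLINK-24682 describes comes down to which process executes main(). A toy model (not Flink code; the mode names and return values are illustrative) of where a '-C file://...' URL gets resolved:

```python
# A file:// URL on the classpath is always looked up by whichever process
# runs the program's main() method, which is what makes the same '-C'
# argument behave differently across submission modes.
def classpath_lookup_host(submission_mode: str) -> str:
    main_runs_on = {
        "per-job": "client",         # main() runs on the local client
        "application": "jobmanager"  # main() runs on the JM in the cluster
    }
    return main_runs_on[submission_mode]

print(classpath_lookup_host("per-job"))      # client
print(classpath_lookup_host("application"))  # jobmanager
```

Unifying the behavior would mean making the same URL resolve against the same machine in both modes, for example by materializing client-local paths before submission.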
[jira] [Commented] (FLINK-26605) JobManagerDeploymentStatus.READY does not correctly reflect the real Flink cluster status
[ https://issues.apache.org/jira/browse/FLINK-26605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17505984#comment-17505984 ] Biao Geng commented on FLINK-26605: --- Hi [~wangyang0918], the simple REST call to check the cluster state makes sense, and I am willing to take this robustness issue. Would you mind assigning it to me? > JobManagerDeploymentStatus.READY does not correctly reflect the real Flink > cluster status > - > > Key: FLINK-26605 > URL: https://issues.apache.org/jira/browse/FLINK-26605 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Yang Wang >Priority: Major > > Follow the discussion in the PR [1]. > {{JobManagerDeploymentStatus.READY}} does not mean the Flink cluster is ready to accept REST API calls. This causes problems when we are running a session cluster. > For example, if something is wrong or very slow with the leader election, {{JobManagerDeploymentStatus}} is {{READY}} but the session cluster is not ready to accept job submissions. The proposal is to update the {{JobManagerDeploymentStatus}} to {{READY}} only when the Flink cluster is actually working. > > [1]. https://github.com/apache/flink-kubernetes-operator/pull/51#discussion_r824359419 -- This message was sent by Atlassian Jira (v8.20.1#820001)
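The REST-call check discussed above can be sketched as follows. This is a hedged illustration, not the operator's actual implementation: the endpoint path (GET /jobs, mirroring what a listJobs() call would hit) and the idea of returning a boolean readiness flag are assumptions for this sketch.

```python
import json
import urllib.request
import urllib.error

# Treat the JobManager as READY only once a real REST call succeeds,
# instead of trusting the deployment status alone.
def jobmanager_ready(rest_url: str, timeout: float = 2.0) -> bool:
    try:
        with urllib.request.urlopen(f"{rest_url}/jobs", timeout=timeout) as r:
            return r.status == 200 and "jobs" in json.load(r)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or an unexpected payload: not ready.
        return False

print(jobmanager_ready("http://127.0.0.1:1"))  # False (nothing listening)
```

This covers both failure modes mentioned in the issue: a deployment that exists but whose REST endpoint is not yet serving, and one whose endpoint stopped serving after the status was set.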
[jira] [Commented] (FLINK-26702) Sporadic failures in JobObserverTest
[ https://issues.apache.org/jira/browse/FLINK-26702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508100#comment-17508100 ] Biao Geng commented on FLINK-26702: --- Hi [~mbalassi], you are not alone. I also saw this failure this morning in the CI run of my open PR on my own GitHub fork. I am not sure why that CI run was triggered, but I had configured my own CI build to show more info for other issues, so I got the full error message of this failure: {quote}[INFO] Running org.apache.flink.kubernetes.operator.observer.JobObserverTest 2022-03-17 03:23:31,091 o.a.f.k.o.o.JobObserver [INFO ] [.] Getting job statuses for test-cluster 2022-03-17 03:23:31,091 o.a.f.k.o.o.JobObserver [INFO ] [.] Job statuses updated for test-cluster 2022-03-17 03:23:31,091 o.a.f.k.o.o.JobObserver [INFO ] [.] Getting job statuses for test-cluster 2022-03-17 03:23:31,091 o.a.f.k.o.o.JobObserver [INFO ] [.] Job statuses updated for test-cluster 2022-03-17 03:23:31,091 o.a.f.k.o.o.JobObserver [INFO ] [.] Getting job statuses for test-cluster 2022-03-17 03:23:31,091 o.a.f.k.o.o.JobObserver [INFO ] [.] Job statuses updated for test-cluster 2022-03-17 03:23:31,093 o.a.f.k.o.o.JobObserver [INFO ] [.] JobManager deployment test-cluster in namespace flink-operator-test exists but not ready, status DeploymentStatus(availableReplicas=1, collisionCount=null, conditions=[], observedGeneration=null, readyReplicas=null, replicas=1, unavailableReplicas=null, updatedReplicas=null, additionalProperties={}) 2022-03-17 03:23:31,093 o.a.f.k.o.o.JobObserver [INFO ] [.] JobManager deployment test-cluster in namespace flink-operator-test exists but not ready, status DeploymentStatus(availableReplicas=1, collisionCount=null, conditions=[], observedGeneration=null, readyReplicas=null, replicas=1, unavailableReplicas=null, updatedReplicas=null, additionalProperties={}) 2022-03-17 03:23:31,093 o.a.f.k.o.o.JobObserver [INFO ] [.] 
JobManager deployment test-cluster in namespace flink-operator-test port ready, waiting for the REST API... 2022-03-17 03:23:31,093 o.a.f.k.o.o.JobObserver [INFO ] [.] Getting job statuses for test-cluster 2022-03-17 03:23:31,093 o.a.f.k.o.o.JobObserver [INFO ] [.] Job statuses updated for test-cluster 2022-03-17 03:23:31,093 o.a.f.k.o.o.JobObserver [INFO ] [.] Getting job statuses for test-cluster 2022-03-17 03:23:31,093 o.a.f.k.o.o.JobObserver [INFO ] [.] Job statuses updated for test-cluster Error: Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.039 s <<< FAILURE! - in org.apache.flink.kubernetes.operator.observer.JobObserverTest Error: org.apache.flink.kubernetes.operator.observer.JobObserverTest.observeApplicationCluster Time elapsed: 0.013 s <<< FAILURE! org.opentest4j.AssertionFailedError: expected: <0> but was: <1> at org.apache.flink.kubernetes.operator.observer.JobObserverTest.observeApplicationCluster(JobObserverTest.java:99) {quote} > Sporadic failures in JobObserverTest > > > Key: FLINK-26702 > URL: https://issues.apache.org/jira/browse/FLINK-26702 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Reporter: Márton Balassi >Priority: Major > > I have occasionally observed the following failure during the regular build: > > {code:java} > mvn clean install > ... 
> [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.244 > s - in org.apache.flink.kubernetes.operator.service.FlinkServiceTest > [INFO] Running > org.apache.flink.kubernetes.operator.validation.DeploymentValidatorTest > [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.007 > s - in org.apache.flink.kubernetes.operator.validation.DeploymentValidatorTest > [INFO] > [INFO] Results: > [INFO] > [ERROR] Failures: > [ERROR] JobObserverTest.observeApplicationCluster:99 expected: <0> but was: > <1> > [INFO] > [ERROR] Tests run: 34, Failures: 1, Errors: 0, Skipped: 0 > [INFO] > [INFO] > > [INFO] Reactor Summary: > [INFO] > [INFO] Flink Kubernetes: .. SUCCESS [ 4.223 > s] > [INFO] Flink Kubernetes Shaded SUCCESS [ 5.097 > s] > [INFO] Flink Kubernetes Operator .. FAILURE [ 34.596 > s] > [INFO] Flink Kubernetes Webhook ... SKIPPED > [INFO] > > [INFO] BUILD FAILURE > [INFO] > > [INFO] Total time: 44.065 s > [INFO] Finished at: 2022-03-17T09:43:22+01:00 > [INFO] Final Memory: 160M/554M > [INFO] > > [ERROR] Failed to execute goal >
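The failing assertion ("expected: <0> but was: <1>") looks like a race between the test's assertion and the observer's asynchronous status update. One generic way to deflake such tests, sketched below in Python (this is an illustration of the polling pattern, not necessarily the fix the project adopted):

```python
import time

# Poll the condition until it holds or a deadline passes, instead of
# asserting the asynchronously-updated state immediately.
def await_condition(cond, timeout_s=5.0, interval_s=0.05):
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if cond():
            return True
        time.sleep(interval_s)
    return cond()  # one final check at the deadline

# Simulated observer state that only settles after a few polls.
polls = {"n": 0}
def job_count_is_zero():
    polls["n"] += 1
    return polls["n"] >= 3  # "job count reaches 0" on the third poll

print(await_condition(job_count_is_zero))  # True
```

In Java test code the same pattern is usually expressed with a library such as Awaitility or a hand-rolled retry loop around the assertion.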
[jira] [Created] (FLINK-26671) Update 'Developer Guide' in README
Biao Geng created FLINK-26671: - Summary: Update 'Developer Guide' in README Key: FLINK-26671 URL: https://issues.apache.org/jira/browse/FLINK-26671 Project: Flink Issue Type: Sub-task Components: Kubernetes Operator Reporter: Biao Geng Attachments: image-2022-03-16-14-05-56-936.png The shell commands in the README are now relative to the root dir of the repo. In the Developer Guide, we should accordingly update the {{helm install ...}} command to {{helm install flink-operator helm/flink-operator --set image.repository=/flink-java-operator --set image.tag=latest}} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26655) Merge isJobManagerPortReady and isJobManagerServing to check if JM pod can work correctly
Biao Geng created FLINK-26655: - Summary: Merge isJobManagerPortReady and isJobManagerServing to check if JM pod can work correctly Key: FLINK-26655 URL: https://issues.apache.org/jira/browse/FLINK-26655 Project: Flink Issue Type: Sub-task Reporter: Biao Geng We currently consider that the JM pod can be in 2 possible states after launching: # JM is launched but the port is not ready. # JM is launched and the port is ready, but the REST service is not ready. It looks like these can be merged, since what we really care about is whether the JM can serve REST calls correctly, not whether the JM port is ready. With the above observation, we can merge {{isJobManagerPortReady}} and {{isJobManagerServing}} into a single check of whether the JM pod can serve correctly. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (FLINK-26605) Check if JM can serve rest api calls every time before reconcile
[ https://issues.apache.org/jira/browse/FLINK-26605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-26605: -- Summary: Check if JM can serve rest api calls every time before reconcile (was: Improve the observe logic in SessionObserver) > Check if JM can serve rest api calls every time before reconcile > > > Key: FLINK-26605 > URL: https://issues.apache.org/jira/browse/FLINK-26605 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Yang Wang >Assignee: Biao Geng >Priority: Major > Labels: pull-request-available > > Follow the discussion in the PR [1]. > {{JobManagerDeploymentStatus.READY}} does not mean the Flink cluster is ready to accept REST API calls. This causes problems when we are running a session cluster. > For example, if something is wrong or very slow with the leader election, {{JobManagerDeploymentStatus}} is {{READY}} but the session cluster is not ready to accept job submissions. The proposal is to update the {{JobManagerDeploymentStatus}} to {{READY}} only when the Flink cluster is actually working. > Another case: even when the JobManager is in crash-loop backoff, the {{JobManagerDeploymentStatus}} is still {{READY}}. > > [1]. https://github.com/apache/flink-kubernetes-operator/pull/51#discussion_r824359419 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (FLINK-26655) Improve the observe logic in SessionObserver
[ https://issues.apache.org/jira/browse/FLINK-26655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-26655: -- Summary: Improve the observe logic in SessionObserver (was: Merge isJobManagerPortReady and isJobManagerServing to check if JM pod can work correctly) > Improve the observe logic in SessionObserver > > > Key: FLINK-26655 > URL: https://issues.apache.org/jira/browse/FLINK-26655 > Project: Flink > Issue Type: Sub-task >Reporter: Biao Geng >Priority: Major > > -We now consider that JM pod can have 2 possible states after launching:- > # -JM is launched but port is not ready.- > # -JM is launched, port is ready but rest service is not ready.- > -It looks that they can be merged as what we really care is if the JM can > serve REST calls correctly, not if the JM port is ready.- > -With above observation, we can merge {{isJobManagerPortReady}} and > {{isJobManagerServing}} to check if JM pod can serve correctly.- -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (FLINK-26655) Merge isJobManagerPortReady and isJobManagerServing to check if JM pod can work correctly
[ https://issues.apache.org/jira/browse/FLINK-26655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-26655: -- Description: -We now consider that JM pod can have 2 possible states after launching:- # -JM is launched but port is not ready.- # -JM is launched, port is ready but rest service is not ready.- -It looks that they can be merged as what we really care is if the JM can serve REST calls correctly, not if the JM port is ready.- -With above observation, we can merge {{isJobManagerPortReady}} and {{isJobManagerServing}} to check if JM pod can serve correctly.- was: We now consider that JM pod can have 2 possible states after launching: # JM is launched but port is not ready. # JM is launched, port is ready but rest service is not ready. It looks that they can be merged as what we really care is if the JM can serve REST calls correctly, not if the JM port is ready. With above observation, we can merge {{isJobManagerPortReady}} and {{isJobManagerServing}} to check if JM pod can serve correctly. > Merge isJobManagerPortReady and isJobManagerServing to check if JM pod can > work correctly > - > > Key: FLINK-26655 > URL: https://issues.apache.org/jira/browse/FLINK-26655 > Project: Flink > Issue Type: Sub-task >Reporter: Biao Geng >Priority: Major > > -We now consider that JM pod can have 2 possible states after launching:- > # -JM is launched but port is not ready.- > # -JM is launched, port is ready but rest service is not ready.- > -It looks that they can be merged as what we really care is if the JM can > serve REST calls correctly, not if the JM port is ready.- > -With above observation, we can merge {{isJobManagerPortReady}} and > {{isJobManagerServing}} to check if JM pod can serve correctly.- -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (FLINK-26605) Improve the observe logic in SessionObserver
[ https://issues.apache.org/jira/browse/FLINK-26605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-26605: -- Summary: Improve the observe logic in SessionObserver (was: JobManagerDeploymentStatus.READY does not correctly reflect the real Flink cluster status) > Improve the observe logic in SessionObserver > > > Key: FLINK-26605 > URL: https://issues.apache.org/jira/browse/FLINK-26605 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Yang Wang >Assignee: Biao Geng >Priority: Major > Labels: pull-request-available > > Follow the discussion in the PR [1]. > {{JobManagerDeploymentStatus.READY}} does not mean the Flink cluster is ready to accept REST API calls. This causes problems when we are running a session cluster. > For example, if something is wrong or very slow with the leader election, {{JobManagerDeploymentStatus}} is {{READY}} but the session cluster is not ready to accept job submissions. The proposal is to update the {{JobManagerDeploymentStatus}} to {{READY}} only when the Flink cluster is actually working. > Another case: even when the JobManager is in crash-loop backoff, the {{JobManagerDeploymentStatus}} is still {{READY}}. > > [1]. https://github.com/apache/flink-kubernetes-operator/pull/51#discussion_r824359419 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (FLINK-26655) Improve the observe logic in SessionObserver
[ https://issues.apache.org/jira/browse/FLINK-26655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-26655: -- Description: -We now consider that JM pod can have 2 possible states after launching:- # -JM is launched but port is not ready.- # -JM is launched, port is ready but rest service is not ready.- -It looks that they can be merged as what we really care is if the JM can serve REST calls correctly, not if the JM port is ready.- -With above observation, we can merge {{isJobManagerPortReady}} and {{isJobManagerServing}} to check if JM pod can serve correctly.- Following the discussion in PR [62|https://github.com/apache/flink-kubernetes-operator/pull/62], we currently discard {{isJobManagerServing}} and use the {{isJobManagerPortReady}} together with a call to `flinkService.listJobs()`to make sure job manager can serve rest call correctly. The original question of this jira is solved. Now I adjust this jira to track the improvement of the observe logic in SessionObserver due to the review comments. was: -We now consider that JM pod can have 2 possible states after launching:- # -JM is launched but port is not ready.- # -JM is launched, port is ready but rest service is not ready.- -It looks that they can be merged as what we really care is if the JM can serve REST calls correctly, not if the JM port is ready.- -With above observation, we can merge {{isJobManagerPortReady}} and {{isJobManagerServing}} to check if JM pod can serve correctly.- Following the discussion in PR [62|https://github.com/apache/flink-kubernetes-operator/pull/62], we currently discard {{isJobManagerServing}} and use the {{isJobManagerPortReady}} together with a call to `flinkService.listJobs()`to make sure job manager can serve rest call correctly. The original question of this jira is solved. 
Now I adjust this jira to track the improvement of the > Improve the observe logic in SessionObserver > > > Key: FLINK-26655 > URL: https://issues.apache.org/jira/browse/FLINK-26655 > Project: Flink > Issue Type: Sub-task >Reporter: Biao Geng >Priority: Major > > -We now consider that JM pod can have 2 possible states after launching:- > # -JM is launched but port is not ready.- > # -JM is launched, port is ready but rest service is not ready.- > -It looks that they can be merged as what we really care is if the JM can > serve REST calls correctly, not if the JM port is ready.- > -With above observation, we can merge {{isJobManagerPortReady}} and > {{isJobManagerServing}} to check if JM pod can serve correctly.- > > Following the discussion in PR > [62|https://github.com/apache/flink-kubernetes-operator/pull/62], we > currently discard {{isJobManagerServing}} and use the > {{isJobManagerPortReady}} together with a call to `flinkService.listJobs()`to > make sure job manager can serve rest call correctly. The original question of > this jira is solved. > Now I adjust this jira to track the improvement of the observe logic in > SessionObserver due to the review comments. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (FLINK-26655) Improve the observe logic in SessionObserver
[ https://issues.apache.org/jira/browse/FLINK-26655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-26655: -- Description: -We now consider that JM pod can have 2 possible states after launching:- # -JM is launched but port is not ready.- # -JM is launched, port is ready but rest service is not ready.- -It looks that they can be merged as what we really care is if the JM can serve REST calls correctly, not if the JM port is ready.- -With above observation, we can merge {{isJobManagerPortReady}} and {{isJobManagerServing}} to check if JM pod can serve correctly.- Following the discussion in PR [62|https://github.com/apache/flink-kubernetes-operator/pull/62], we currently discard {{isJobManagerServing}} and use the {{isJobManagerPortReady}} together with a call to `flinkService.listJobs()`to make sure job manager can serve rest call correctly. The original question of this jira is solved. Now I adjust this jira to track the improvement of the was: -We now consider that JM pod can have 2 possible states after launching:- # -JM is launched but port is not ready.- # -JM is launched, port is ready but rest service is not ready.- -It looks that they can be merged as what we really care is if the JM can serve REST calls correctly, not if the JM port is ready.- -With above observation, we can merge {{isJobManagerPortReady}} and {{isJobManagerServing}} to check if JM pod can serve correctly.- > Improve the observe logic in SessionObserver > > > Key: FLINK-26655 > URL: https://issues.apache.org/jira/browse/FLINK-26655 > Project: Flink > Issue Type: Sub-task >Reporter: Biao Geng >Priority: Major > > -We now consider that JM pod can have 2 possible states after launching:- > # -JM is launched but port is not ready.- > # -JM is launched, port is ready but rest service is not ready.- > -It looks that they can be merged as what we really care is if the JM can > serve REST calls correctly, not if the JM port is ready.- > -With above 
observation, we can merge {{isJobManagerPortReady}} and > {{isJobManagerServing}} to check if JM pod can serve correctly.- > > Following the discussion in PR > [62|https://github.com/apache/flink-kubernetes-operator/pull/62], we > currently discard {{isJobManagerServing}} and use the > {{isJobManagerPortReady}} together with a call to `flinkService.listJobs()`to > make sure job manager can serve rest call correctly. The original question of > this jira is solved. > Now I adjust this jira to track the improvement of the -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-26655) Improve the observe logic in SessionObserver
[ https://issues.apache.org/jira/browse/FLINK-26655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17509168#comment-17509168 ] Biao Geng commented on FLINK-26655: --- I have rewritten this jira after our previous discussion about the check of the rest service. Now it is used to track the improvement of the observe logic in SessionObserver according to your review. cc [~wangyang0918] > Improve the observe logic in SessionObserver > > > Key: FLINK-26655 > URL: https://issues.apache.org/jira/browse/FLINK-26655 > Project: Flink > Issue Type: Sub-task >Reporter: Biao Geng >Priority: Major > > -We now consider that JM pod can have 2 possible states after launching:- > # -JM is launched but port is not ready.- > # -JM is launched, port is ready but rest service is not ready.- > -It looks that they can be merged as what we really care is if the JM can > serve REST calls correctly, not if the JM port is ready.- > -With above observation, we can merge {{isJobManagerPortReady}} and > {{isJobManagerServing}} to check if JM pod can serve correctly.- > > Following the discussion in PR > [62|https://github.com/apache/flink-kubernetes-operator/pull/62], we > currently discard {{isJobManagerServing}} and use {{isJobManagerPortReady}} > together with a call to `flinkService.listJobs()` to > make sure the job manager can serve REST calls correctly. The original question of > this jira is solved. > Now I adjust this jira to track the improvement of the observe logic in > SessionObserver due to the review comments. -- This message was sent by Atlassian Jira (v8.20.1#820001)
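The readiness check discussed in this thread (port reachable, then a cheap REST call such as listJobs to prove the JM can actually serve) can be sketched as follows. This is a hypothetical, self-contained illustration: the names `JmReadinessSketch` and `isJobManagerReady` are invented here, and the suppliers stand in for the operator's real `isJobManagerPortReady` probe and `flinkService.listJobs()` call.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Hedged sketch of the two-step readiness check: the JM is considered ready
// only if the REST port accepts connections AND a REST round-trip succeeds.
public class JmReadinessSketch {

    static boolean isJobManagerReady(Supplier<Boolean> portReady,
                                     Supplier<List<String>> listJobs) {
        if (!portReady.get()) {
            return false; // pod is up, but the REST endpoint is not yet bound
        }
        try {
            listJobs.get(); // any successful REST call proves the JM can serve
            return true;
        } catch (RuntimeException e) {
            return false;  // port open but REST layer (e.g. leader election) not ready
        }
    }

    public static void main(String[] args) {
        boolean ready = isJobManagerReady(() -> true, Collections::emptyList);
        boolean notServing = isJobManagerReady(
                () -> true, () -> { throw new RuntimeException("no leader"); });
        boolean portClosed = isJobManagerReady(() -> false, Collections::emptyList);
        System.out.println(ready + " " + notServing + " " + portClosed); // true false false
    }
}
```

The key design point from PR 62 is that a successful REST call subsumes the old `isJobManagerServing` check, so only one extra probe is needed after the port check.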
[jira] [Comment Edited] (FLINK-26163) Refactor FlinkUtils#getEffectiveConfig into smaller pieces
[ https://issues.apache.org/jira/browse/FLINK-26163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17493096#comment-17493096 ] Biao Geng edited comment on FLINK-26163 at 2/16/22, 9:18 AM: - hi [~gyfora] , I am happy to take this ticket to refactor the configuration generation. Would you mind assign it to me? Just happened to do something similar in the [[FLINK-26030|https://issues.apache.org/jira/browse/FLINK-26030]]. was (Author: bgeng777): hi [~gyfora] , I am happy to take this ticket to refactor the configuration generation. Would you mind assign it to me? Just happened to do something similar in the [FLINK-26030](https://issues.apache.org/jira/browse/FLINK-26030). > Refactor FlinkUtils#getEffectiveConfig into smaller pieces > -- > > Key: FLINK-26163 > URL: https://issues.apache.org/jira/browse/FLINK-26163 > Project: Flink > Issue Type: Sub-task > Components: Deployment / Kubernetes >Reporter: Gyula Fora >Priority: Major > > There is currently one very large method dealing with Configuration > management. This logic probably deserves it's own utility class and a few > more modular methods. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-26163) Refactor FlinkUtils#getEffectiveConfig into smaller pieces
[ https://issues.apache.org/jira/browse/FLINK-26163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17493096#comment-17493096 ] Biao Geng commented on FLINK-26163: --- hi [~gyfora] , I am happy to take this ticket to refactor the configuration generation. Would you mind assigning it to me? I just happened to do something similar in [FLINK-26030](https://issues.apache.org/jira/browse/FLINK-26030). > Refactor FlinkUtils#getEffectiveConfig into smaller pieces > -- > > Key: FLINK-26163 > URL: https://issues.apache.org/jira/browse/FLINK-26163 > Project: Flink > Issue Type: Sub-task > Components: Deployment / Kubernetes >Reporter: Gyula Fora >Priority: Major > > There is currently one very large method dealing with Configuration > management. This logic probably deserves its own utility class and a few > more modular methods. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-26178) Use the same enum for expected and observed jobstate (JobState / JobStatus.state)
[ https://issues.apache.org/jira/browse/FLINK-26178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497233#comment-17497233 ] Biao Geng commented on FLINK-26178: --- Hi [~gyfora], it seems this ticket has been open for some days. I am happy to take it; would you mind assigning it to me? > Use the same enum for expected and observed jobstate (JobState / > JobStatus.state) > - > > Key: FLINK-26178 > URL: https://issues.apache.org/jira/browse/FLINK-26178 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Gyula Fora >Priority: Major > > We should consolidate these two and maybe even use Flink's own job state enum > here if it makes sense -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (FLINK-26178) Use the same enum for expected and observed jobstate (JobState / JobStatus.state)
[ https://issues.apache.org/jira/browse/FLINK-26178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497521#comment-17497521 ] Biao Geng edited comment on FLINK-26178 at 2/24/22, 4:52 PM: - Some thoughts in doing this ticket: As the description of this ticket said, Flink has its own {{org.apache.flink.api.common.JobStatus}} (including INITIALIZING, RUNNING, SUSPENDED, ...), which may be included in {{org.apache.flink.runtime.client.JobStatusMessage}}. But here I prefer to use the {{org.apache.flink.kubernetes.operator.crd.spec.JobState}} for 2 reasons: 1. {{crd.spec.JobState}} is mainly for reconciling in this k8s operator while {{api.common.JobStatus}} is used by the JM in Flink to orchestrate the job. Though the reconciling process may be driven by the {{api.common.JobStatus}}, these 2 fields are used for different purposes and should be separated. 2. {{crd.spec.JobState}} currently only includes RUNNING and SUSPENDED. We should consider FLINK-26139 first before we draw the conclusion that {{org.apache.flink.api.common.JobStatus}} is all we need. was (Author: bgeng777): Some thoughts in doing this ticket: As the description of this ticket said, Flink has its own {{org.apache.flink.api.common.JobStatus}} (including INITIALIZING, RUNNING, SUSPENDED, ...), which may be included in {{{}org.apache.flink.runtime.client.JobStatusMessage{}}}. But here I prefer to use the {{org.apache.flink.kubernetes.operator.crd.spec.JobState}} for 2 reasons: 1. {{crd.spec.JobState}} is mainly for reconciling in this k8s operator while {{api.common.JobStatus}} is used by JM in Flink to orchestrates the job. Thougth the reconciling process may be drived by the {{{}api.common.JobStatus{}}}, these 2 fields are used for different purpose and should be separated. 2. {{crd.spec.JobState}} currently only includes RUNNING and SUSPENDED. 
More state will increase the complexity of the reconciling quickly, which should be considered carefully and only when we find some strong cases. > Use the same enum for expected and observed jobstate (JobState / > JobStatus.state) > - > > Key: FLINK-26178 > URL: https://issues.apache.org/jira/browse/FLINK-26178 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Gyula Fora >Assignee: Biao Geng >Priority: Major > > We should consolidate these two and maybe even use Flink's own job state enum > here if makes sense -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (FLINK-26178) Use the same enum for expected and observed jobstate (JobState / JobStatus.state)
[ https://issues.apache.org/jira/browse/FLINK-26178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497521#comment-17497521 ] Biao Geng edited comment on FLINK-26178 at 2/24/22, 4:34 PM: - Some thoughts in doing this ticket: As the description of this ticket said, Flink has its own {{org.apache.flink.api.common.JobStatus}} (including INITIALIZING, RUNNING, SUSPENDED, ...), which may be included in {{{}org.apache.flink.runtime.client.JobStatusMessage{}}}. But here I prefer to use the {{org.apache.flink.kubernetes.operator.crd.spec.JobState}} for 2 reasons: 1. {{crd.spec.JobState}} is mainly for reconciling in this k8s operator while {{api.common.JobStatus}} is used by JM in Flink to orchestrates the job. Thougth the reconciling process may be drived by the {{{}api.common.JobStatus{}}}, these 2 fields are used for different purpose and should be separated. 2. {{crd.spec.JobState}} currently only includes RUNNING and SUSPENDED. More state will increase the complexity of the reconciling quickly, which should be considered carefully and only when we find some strong cases. was (Author: bgeng777): Some thoughts in doing this ticket: As the description of this ticket said, Flink has its own {{org.apache.flink.api.common.JobStatus}} (including INITIALIZING, RUNNING, SUSPENDED, ...), which may be included in {{org.apache.flink.runtime.client.JobStatusMessage}}. But here I prefer to use the {{org.apache.flink.kubernetes.operator.crd.spec.JobState}} for 2 reasons: 1. {{crd.spec.JobState}} is mainly for reconciling in this k8s operator while {{api.common.JobStatus}} is used by JM in Flink to orchestrates the job. Thought the reconciling process may be drived by the {{org.apache.flink.api.common.JobStatus}}, they are used for different purpose. 2. {{crd.spec.JobState}} currently only includes RUNNING and SUSPENDED. 
More state will increase the complexity of the reconciling quickly, which should be considered carefully and only when we find some strong cases. > Use the same enum for expected and observed jobstate (JobState / > JobStatus.state) > - > > Key: FLINK-26178 > URL: https://issues.apache.org/jira/browse/FLINK-26178 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Gyula Fora >Assignee: Biao Geng >Priority: Major > > We should consolidate these two and maybe even use Flink's own job state enum > here if makes sense -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-26178) Use the same enum for expected and observed jobstate (JobState / JobStatus.state)
[ https://issues.apache.org/jira/browse/FLINK-26178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497521#comment-17497521 ] Biao Geng commented on FLINK-26178: --- Some thoughts in doing this ticket: As the description of this ticket said, Flink has its own {{org.apache.flink.api.common.JobStatus}} (including INITIALIZING, RUNNING, SUSPENDED, ...), which may be included in {{org.apache.flink.runtime.client.JobStatusMessage}}. But here I prefer to use the {{org.apache.flink.kubernetes.operator.crd.spec.JobState}} for 2 reasons: 1. {{crd.spec.JobState}} is mainly for reconciling in this k8s operator while {{api.common.JobStatus}} is used by the JM in Flink to orchestrate the job. Though the reconciling process may be driven by the {{org.apache.flink.api.common.JobStatus}}, they are used for different purposes. 2. {{crd.spec.JobState}} currently only includes RUNNING and SUSPENDED. More states will quickly increase the complexity of the reconciling, which should be considered carefully and only when we find some strong cases. > Use the same enum for expected and observed jobstate (JobState / > JobStatus.state) > - > > Key: FLINK-26178 > URL: https://issues.apache.org/jira/browse/FLINK-26178 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Gyula Fora >Assignee: Biao Geng >Priority: Major > > We should consolidate these two and maybe even use Flink's own job state enum > here if makes sense -- This message was sent by Atlassian Jira (v8.20.1#820001)
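The separation argued for above (a coarse spec-level state in the CR versus the fine-grained observed status reported by the JM) can be illustrated with a small sketch. Both enums are redefined locally so the example is self-contained; the names mirror `crd.spec.JobState` and `org.apache.flink.api.common.JobStatus` but are illustrative, not the operator's actual types.

```java
// Hedged sketch: keep the user-facing desired state small, and let the
// reconciler coarsen the many observed JM statuses onto it.
public class JobStateSketch {

    /** Desired state in the CR spec: only what the user can ask for. */
    enum SpecJobState { RUNNING, SUSPENDED }

    /** Observed status reported by the JM (subset of Flink's JobStatus). */
    enum ObservedJobStatus { INITIALIZING, CREATED, RUNNING, FAILING, CANCELLING,
                             RESTARTING, RECONCILING, SUSPENDED }

    /** Map an observed status onto the coarse spec-level view. */
    static SpecJobState coarsen(ObservedJobStatus observed) {
        switch (observed) {
            case SUSPENDED:
                return SpecJobState.SUSPENDED;
            default:
                // All non-suspended statuses are treated as RUNNING from the
                // operator's point of view; the JM drives internal transitions.
                return SpecJobState.RUNNING;
        }
    }

    public static void main(String[] args) {
        System.out.println(coarsen(ObservedJobStatus.RESTARTING)); // RUNNING
        System.out.println(coarsen(ObservedJobStatus.SUSPENDED));  // SUSPENDED
    }
}
```

This matches reason 1 in the comment: the two enums serve different purposes, so consolidating them would force the spec to carry JM-internal states it never needs.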
[jira] [Commented] (FLINK-26216) Make 'replicas' work in JobManager Spec
[ https://issues.apache.org/jira/browse/FLINK-26216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17494981#comment-17494981 ] Biao Geng commented on FLINK-26216: --- hi [~ConradJam], thanks a lot for your words, but I do not have permission to assign this ticket. Before we ask for help from others, I just want to mention that in the [Jira|https://issues.apache.org/jira/browse/FLINK-26163?filter=-1] for refactoring FlinkUtils, I have added code like: {quote}effectiveConfig.setInteger(KubernetesConfigOptions.KUBERNETES_JOBMANAGER_REPLICAS, spec.getJobManager().getReplicas()); {quote} I verified it on my own k8s cluster and 2 JMs can be started correctly. Please let me know if that is your expected fix. If so, maybe we do not need to do it twice (sorry for the late update in this jira). If that is not your case, I believe we can ask others to assign this ticket to you. > Make 'replicas' work in JobManager Spec > --- > > Key: FLINK-26216 > URL: https://issues.apache.org/jira/browse/FLINK-26216 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Biao Geng >Priority: Minor > > In our flink kubernetes operator's CR, we allow users to set the replicas of > the JobManager. > But in our {{FlinkUtils#getEffectiveConfig}} method, we currently do not set > this value from the yaml file and as a result, the {{replicas}} will not work > and the default value (i.e. 1) will be applied. > Though we believe one JM with KubernetesHaService should be enough for most > HA cases, the {{replicas}} field of JM also makes sense since more than one > JM can reduce downtime and make recovery from JM failure faster. -- This message was sent by Atlassian Jira (v8.20.1#820001)
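The one-line fix quoted in the comment above can be sketched in isolation. A plain map stands in for Flink's `Configuration` so the example runs on its own; the option key mirrors `kubernetes.jobmanager.replicas` (the real code uses `KubernetesConfigOptions.KUBERNETES_JOBMANAGER_REPLICAS`), and `applyJobManagerSpec` is a hypothetical helper name.

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch: copy the replicas value from the JobManager spec into the
// effective configuration so it is no longer silently defaulted to 1.
public class EffectiveConfigSketch {
    static final String JM_REPLICAS_KEY = "kubernetes.jobmanager.replicas";

    static Map<String, String> applyJobManagerSpec(Map<String, String> effectiveConfig,
                                                   int specReplicas) {
        // Without this line the default (1) wins even if the CR asks for more.
        effectiveConfig.put(JM_REPLICAS_KEY, Integer.toString(specReplicas));
        return effectiveConfig;
    }

    public static void main(String[] args) {
        Map<String, String> conf = applyJobManagerSpec(new HashMap<>(), 2);
        System.out.println(conf.get(JM_REPLICAS_KEY)); // 2
    }
}
```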
[jira] [Updated] (FLINK-26405) Add validation check of num of JM replica
[ https://issues.apache.org/jira/browse/FLINK-26405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-26405: -- External issue URL: (was: https://issues.apache.org/jira/browse/FLINK-26216) > Add validation check of num of JM replica > - > > Key: FLINK-26405 > URL: https://issues.apache.org/jira/browse/FLINK-26405 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Biao Geng >Priority: Minor > > When HA is enabled, the replicas of JM can be 1 or 2, > When HA is not set, the replicas of JM should always be 1. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26405) Add validation check of num of JM replica
Biao Geng created FLINK-26405: - Summary: Add validation check of num of JM replica Key: FLINK-26405 URL: https://issues.apache.org/jira/browse/FLINK-26405 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Reporter: Biao Geng When HA is enabled, the replicas of JM can be 1 or 2. When HA is not set, the replicas of JM should always be 1. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (FLINK-26405) Add validation check of num of JM replica
[ https://issues.apache.org/jira/browse/FLINK-26405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-26405: -- Parent: FLINK-25963 Issue Type: Sub-task (was: Improvement) > Add validation check of num of JM replica > - > > Key: FLINK-26405 > URL: https://issues.apache.org/jira/browse/FLINK-26405 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Biao Geng >Priority: Minor > > When HA is enabled, the replicas of JM can be 1 or 2, > When HA is not set, the replicas of JM should always be 1. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (FLINK-26405) Add validation check of num of JM replica
[ https://issues.apache.org/jira/browse/FLINK-26405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-26405: -- External issue URL: https://issues.apache.org/jira/browse/FLINK-26216 > Add validation check of num of JM replica > - > > Key: FLINK-26405 > URL: https://issues.apache.org/jira/browse/FLINK-26405 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Biao Geng >Priority: Minor > > When HA is enabled, the replicas of JM can be 1 or 2, > When HA is not set, the replicas of JM should always be 1. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-26405) Add validation check of num of JM replica
[ https://issues.apache.org/jira/browse/FLINK-26405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17499295#comment-17499295 ] Biao Geng commented on FLINK-26405: --- Hi [~wangyang0918] I want to take this Jira to make the support of JM replicas more robust. Could you please assign this one to me? > Add validation check of num of JM replica > - > > Key: FLINK-26405 > URL: https://issues.apache.org/jira/browse/FLINK-26405 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Biao Geng >Priority: Minor > > When HA is enabled, the replicas of JM can be 1 or 2, > When HA is not set, the replicas of JM should always be 1. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (FLINK-26405) Add validation check of num of JM replica
[ https://issues.apache.org/jira/browse/FLINK-26405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-26405: -- Description: When HA is enabled, the replicas of JM can be 1 or more. When HA is not set, the replicas of JM should always be 1. was: When HA is enabled, the replicas of JM can be 1 or 2, When HA is not set, the replicas of JM should always be 1. > Add validation check of num of JM replica > - > > Key: FLINK-26405 > URL: https://issues.apache.org/jira/browse/FLINK-26405 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Biao Geng >Assignee: Biao Geng >Priority: Minor > > When HA is enabled, the replicas of JM can be 1 or more. > When HA is not set, the replicas of JM should always be 1. -- This message was sent by Atlassian Jira (v8.20.1#820001)
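The validation rule stated in this ticket (with HA enabled the JM replica count may be 1 or more; without HA it must be exactly 1) can be sketched as a small check. The class and method names and the error strings are illustrative, not the operator's actual validator API.

```java
import java.util.Optional;

// Hedged sketch of the FLINK-26405 rule: reject invalid replica counts up
// front instead of letting the reconcile phase fail later.
public class JmReplicaValidatorSketch {

    /** Returns an error message if the spec is invalid, empty otherwise. */
    static Optional<String> validateJmReplicas(int replicas, boolean haEnabled) {
        if (replicas < 1) {
            return Optional.of("JobManager replicas must be at least 1");
        }
        if (!haEnabled && replicas > 1) {
            return Optional.of("JobManager replicas > 1 requires HA to be enabled");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(validateJmReplicas(2, true).isPresent());  // false: valid
        System.out.println(validateJmReplicas(2, false).isPresent()); // true: rejected
        System.out.println(validateJmReplicas(1, false).isPresent()); // false: valid
    }
}
```

Running this kind of check in the admission/validation step mirrors the "raise config errors up front" point made in the FLINK-26216 discussion.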
[jira] [Created] (FLINK-26216) Make 'replicas' work in JobManager Spec
Biao Geng created FLINK-26216: - Summary: Make 'replicas' work in JobManager Spec Key: FLINK-26216 URL: https://issues.apache.org/jira/browse/FLINK-26216 Project: Flink Issue Type: Bug Components: Kubernetes Operator Reporter: Biao Geng In our flink kubernetes operator's CR, we allow users to set the replicas of the JobManager. But in our {{FlinkUtils#getEffectiveConfig}} method, we currently do not set this value from the yaml file and as a result, the {{replicas}} setting will not work and the default value (i.e. 1) will be applied. Though we believe one JM with KubernetesHaService should be enough for most HA cases, the {{replicas}} field of JM also makes sense since more than one JM can reduce downtime and make recovery from JM failure faster. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (FLINK-26216) Make 'replicas' work in JobManager Spec
[ https://issues.apache.org/jira/browse/FLINK-26216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-26216: -- Parent: FLINK-25963 Issue Type: Sub-task (was: Bug) > Make 'replicas' work in JobManager Spec > --- > > Key: FLINK-26216 > URL: https://issues.apache.org/jira/browse/FLINK-26216 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Biao Geng >Priority: Minor > > In our flink kubernetes operator's cr, we allow users to set the replica of > JobManager. > But in our {{FlinkUtils#getEffectiveConfig}} method, we currently not set > this value from the yaml file and as a result, the {{replicas}} will not work > and the default value(i.e. 1) will be applied. > Though we believe one JM with KubernetesHaService should be enough for most > HA cases, the {{replicas}} field of JM also makes sense since more than one > JM can reduce down time and make recovery of JM failure faster. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-26216) Make 'replicas' work in JobManager Spec
[ https://issues.apache.org/jira/browse/FLINK-26216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17495930#comment-17495930 ] Biao Geng commented on FLINK-26216: --- [~gyfora] thanks; I will draft the pr ASAP. > Make 'replicas' work in JobManager Spec > --- > > Key: FLINK-26216 > URL: https://issues.apache.org/jira/browse/FLINK-26216 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Biao Geng >Assignee: Biao Geng >Priority: Minor > > In our flink kubernetes operator's cr, we allow users to set the replica of > JobManager. > But in our {{FlinkUtils#getEffectiveConfig}} method, we currently not set > this value from the yaml file and as a result, the {{replicas}} will not work > and the default value(i.e. 1) will be applied. > Though we believe one JM with KubernetesHaService should be enough for most > HA cases, the {{replicas}} field of JM also makes sense since more than one > JM can reduce down time and make recovery of JM failure faster. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-26139) Improve JobStatus tracking and handle different job states
[ https://issues.apache.org/jira/browse/FLINK-26139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498139#comment-17498139 ] Biao Geng commented on FLINK-26139: --- After checking the job state transitions in Flink, I think in this operator, for our flink application, we may only need to process 3 kinds of states: {{RUNNING}}, {{SUSPENDED}} and {{TERMINATED}}. {{RUNNING}}: The application cluster is running. The actual job status could be any non-terminal state (i.e. INITIALIZING, CREATED, RUNNING, FAILING, CANCELLING, RESTARTING or RECONCILING). The JM is responsible for managing the internal state transitions and the operator just considers it as running. {{TERMINATED}}: The application is finished due to the completion, failure or cancellation of the job. In this state, the operator should clean up the deployment and all other resources. No potential upgrades anymore. {{SUSPENDED}}: From the API doc: "The job has been suspended which means that it has been stopped but not been removed from a potential HA job store." There could be potential upgrades happening in this state, like modifying the parallelism, and then the state can be transferred to {{RUNNING}}. The state machine can be as simple as the following picture: !image-2022-02-25-21-22-08-636.png|width=200,height=119! I am a little uncertain whether we should split the {{Terminal}} state into more specific states like FAILED, FINISHED and CANCELLED so that we can track the status better on the k8s side. Does the above design make sense? cc [~wangyang0918] [~gyfora] > Improve JobStatus tracking and handle different job states > -- > > Key: FLINK-26139 > URL: https://issues.apache.org/jira/browse/FLINK-26139 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Gyula Fora >Priority: Major > Attachments: image-2022-02-25-21-22-08-636.png > > > Currently we do not handle any job status changes such as cancellations, > errors or job completions. 
> We should introduce some mechanism to react and deal with these changes and > expose them in the status as they can potentially affect upgrades. -- This message was sent by Atlassian Jira (v8.20.1#820001)
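The simple state machine proposed in the comment above (RUNNING and SUSPENDED can reach each other, both can reach TERMINATED, and TERMINATED is final) can be sketched with a transition table. The enum and class names are illustrative, assuming the three-state design from the comment rather than the operator's eventual implementation.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

// Hedged sketch of the proposed operator-level job state machine.
public class JmStateMachineSketch {
    enum OperatorJobState { RUNNING, SUSPENDED, TERMINATED }

    static final Map<OperatorJobState, EnumSet<OperatorJobState>> TRANSITIONS =
            new EnumMap<>(OperatorJobState.class);
    static {
        TRANSITIONS.put(OperatorJobState.RUNNING,
                EnumSet.of(OperatorJobState.SUSPENDED, OperatorJobState.TERMINATED));
        TRANSITIONS.put(OperatorJobState.SUSPENDED,
                EnumSet.of(OperatorJobState.RUNNING, OperatorJobState.TERMINATED));
        // TERMINATED is terminal: clean up resources, no further upgrades.
        TRANSITIONS.put(OperatorJobState.TERMINATED,
                EnumSet.noneOf(OperatorJobState.class));
    }

    static boolean canTransition(OperatorJobState from, OperatorJobState to) {
        return TRANSITIONS.get(from).contains(to);
    }

    public static void main(String[] args) {
        System.out.println(canTransition(
                OperatorJobState.SUSPENDED, OperatorJobState.RUNNING));   // true
        System.out.println(canTransition(
                OperatorJobState.TERMINATED, OperatorJobState.RUNNING));  // false
    }
}
```

If TERMINATED were later split into FAILED/FINISHED/CANCELLED as the comment wonders, only the enum and the table rows would grow; the transition check itself would stay the same.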
[jira] [Updated] (FLINK-26139) Improve JobStatus tracking and handle different job states
[ https://issues.apache.org/jira/browse/FLINK-26139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-26139: -- Attachment: image-2022-02-25-21-22-08-636.png > Improve JobStatus tracking and handle different job states > -- > > Key: FLINK-26139 > URL: https://issues.apache.org/jira/browse/FLINK-26139 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Gyula Fora >Priority: Major > Attachments: image-2022-02-25-21-22-08-636.png > > > Currently we do not handle any job status changes such as cancellations, > errors or job completions. > We should introduce some mechanism to react and deal with these changes and > expose them in the status as they can potentially affect upgrades. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (FLINK-26216) Make 'replicas' work in JobManager Spec
[ https://issues.apache.org/jira/browse/FLINK-26216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17495290#comment-17495290 ] Biao Geng edited comment on FLINK-26216 at 2/21/22, 2:55 AM: - [~ConradJam] Thanks for you reply! You give a very valuable point and in my opinion, such straightforward config error should be raised in front instead of in reconcile phase, just like the target of this https://issues.apache.org/jira/browse/FLINK-26136 . I will try to finish my refactor PR with the simple fix of JM HA ASAP and for the validation part, I think you can either try to take FLINK-26136 or submit a following PR for this ticket only. I believe both choice will be encouraged by our community. was (Author: bgeng777): [~ConradJam] You give a very valuable point and in my opinion, such straightforward config error should be raised in front instead of in reconcile phase, just like the target of this https://issues.apache.org/jira/browse/FLINK-26136 . I will try to finish my refactor PR with the simple fix of JM HA ASAP and for the validation part, I think you can either try to take FLINK-26136 or submit a following PR for this ticket only. I believe both choice will be encouraged by our community. > Make 'replicas' work in JobManager Spec > --- > > Key: FLINK-26216 > URL: https://issues.apache.org/jira/browse/FLINK-26216 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Biao Geng >Priority: Minor > > In our flink kubernetes operator's cr, we allow users to set the replica of > JobManager. > But in our {{FlinkUtils#getEffectiveConfig}} method, we currently not set > this value from the yaml file and as a result, the {{replicas}} will not work > and the default value(i.e. 1) will be applied. > Though we believe one JM with KubernetesHaService should be enough for most > HA cases, the {{replicas}} field of JM also makes sense since more than one > JM can reduce down time and make recovery of JM failure faster. 
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-26825) Document State Machine Graph for JM Deployment
[ https://issues.apache.org/jira/browse/FLINK-26825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511100#comment-17511100 ] Biao Geng commented on FLINK-26825: --- Hi [~gyfora], would you mind assigning this ticket to me? > Document State Machine Graph for JM Deployment > -- > > Key: FLINK-26825 > URL: https://issues.apache.org/jira/browse/FLINK-26825 > Project: Flink > Issue Type: Sub-task >Reporter: Biao Geng >Priority: Major > > We may need a state machine graph for JM deployment or the whole cluster, > like > [lyft’s|https://github.com/lyft/flinkk8soperator/blob/master/docs/state_machine.md], > in our documentation to help our developers refine our code later and our users > learn how the operator reconciles. > The state machine is highly likely to change when we introduce more > features (e.g. rollback strategy), and we should update the doc when such > a change happens. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26825) Document State Machine Graph for JM Deployment
Biao Geng created FLINK-26825: - Summary: Document State Machine Graph for JM Deployment Key: FLINK-26825 URL: https://issues.apache.org/jira/browse/FLINK-26825 Project: Flink Issue Type: Sub-task Reporter: Biao Geng We may need a state machine graph for JM deployment or the whole cluster, like [lyft’s|https://github.com/lyft/flinkk8soperator/blob/master/docs/state_machine.md], in our documentation to help our developers refine our code later and our users learn how the operator reconciles. The state machine is highly likely to change when we introduce more features (e.g. rollback strategy), and we should update the doc when such a change happens. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26832) Output more status info for JobObserver
Biao Geng created FLINK-26832: - Summary: Output more status info for JobObserver Key: FLINK-26832 URL: https://issues.apache.org/jira/browse/FLINK-26832 Project: Flink Issue Type: Sub-task Reporter: Biao Geng For {{JobObserver#observeFlinkJobStatus()}}, we currently only call {{logger.info("Job status successfully updated");}}. This could be more informative if we output the actual job status here, so that users can check the job status from the Flink operator's log instead of depending only on the Flink web UI. The proposed change looks like: {{logger.info("Job status successfully updated from {} to {}", currentState, targetState);}}. -- This message was sent by Atlassian Jira (v8.20.1#820001)
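The proposed change above can be sketched as follows. This is an illustrative sketch only: the class and helper-method names below are hypothetical, not the actual JobObserver code; only the log-message shape comes from the ticket.

```java
public class JobStatusLogSketch {
    // Hypothetical helper that builds the same message the proposed
    // parameterized SLF4J call would emit:
    // logger.info("Job status successfully updated from {} to {}", currentState, targetState);
    static String statusUpdateMessage(String currentState, String targetState) {
        return String.format("Job status successfully updated from %s to %s",
                currentState, targetState);
    }

    public static void main(String[] args) {
        // Example of what would land in the operator's log.
        System.out.println(statusUpdateMessage("RECONCILING", "RUNNING"));
    }
}
```

With SLF4J's `{}` placeholders, the message is only formatted when the INFO level is enabled, so the richer message costs nothing when the log level is raised.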
[jira] [Commented] (FLINK-26611) Document operator config options
[ https://issues.apache.org/jira/browse/FLINK-26611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516254#comment-17516254 ] Biao Geng commented on FLINK-26611: --- [~gyfora] I see this Jira is open and planned for the 0.1.0 release. If no one has started on it, I can take it. > Document operator config options > > > Key: FLINK-26611 > URL: https://issues.apache.org/jira/browse/FLINK-26611 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Gyula Fora >Priority: Major > Fix For: kubernetes-operator-0.1.0 > > > Similar to how other Flink configs are documented, we should also document the > configs found in OperatorConfigOptions. > Generating the documentation might be possible, but maybe it's easier to do it > manually. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-27009) Support SQL job submission in flink kubernetes operator
Biao Geng created FLINK-27009: - Summary: Support SQL job submission in flink kubernetes operator Key: FLINK-27009 URL: https://issues.apache.org/jira/browse/FLINK-27009 Project: Flink Issue Type: New Feature Components: Kubernetes Operator Reporter: Biao Geng Currently, the Flink Kubernetes operator only supports jar jobs, using an application or session cluster. For SQL jobs, there is no out-of-the-box solution in the operator. One simple, short-term solution is to wrap the SQL script into a jar job using the Table API, with limitations. The long-term solution may work with [FLINK-26541|https://issues.apache.org/jira/browse/FLINK-26541] to achieve full support. -- This message was sent by Atlassian Jira (v8.20.1#820001)
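A minimal sketch of the short-term workaround mentioned above: wrap a SQL script into a regular jar job. The class and method names here are hypothetical; only {{TableEnvironment#executeSql}} is real Flink API, and a production wrapper would need a proper SQL parser rather than the naive statement splitter below (which breaks on `;` inside string literals or comments).

```java
import java.util.ArrayList;
import java.util.List;

public class SqlRunnerSketch {

    // Naively split a script into statements on ';'. Illustrative only;
    // a real implementation needs to respect quoting and comments.
    static List<String> splitStatements(String script) {
        List<String> statements = new ArrayList<>();
        for (String part : script.split(";")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                statements.add(trimmed);
            }
        }
        return statements;
    }

    public static void main(String[] args) {
        String script = "CREATE TABLE src (id INT) WITH ('connector' = 'datagen'); "
                + "INSERT INTO sink SELECT * FROM src;";
        for (String stmt : splitStatements(script)) {
            // In the wrapper jar this would execute each statement, e.g.:
            // TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
            // tEnv.executeSql(stmt);
            System.out.println(stmt);
        }
    }
}
```

Such a wrapper jar could then be submitted through the operator exactly like any other application-mode jar job, with the script path passed as a program argument.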
[jira] [Commented] (FLINK-26892) Observe current status before validating CR changes
[ https://issues.apache.org/jira/browse/FLINK-26892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17513858#comment-17513858 ] Biao Geng commented on FLINK-26892: --- Hi [~gyfora], I see your point. The reason I asked the above question first is that I did some experiments and found the wrong config did not affect the deployment. Later I found it was because I had enabled the webhook, which performs a check first. I removed the webhook, applied some wrong config (e.g. JM replicas = -1), and our {{reconcile()}} got stuck on the error. Your proposed solution is necessary for those who do not enable the webhook check. > Observe current status before validating CR changes > --- > > Key: FLINK-26892 > URL: https://issues.apache.org/jira/browse/FLINK-26892 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Gyula Fora >Priority: Major > Fix For: kubernetes-operator-1.0.0 > > > Currently, validation is the first step in the controller loop, which means > that when there is a validation error, we fail to observe the status of > currently running deployments. > We should change the order of operations and observe before validation. > Furthermore, observe should use the previous configuration, not the one after > the CR change. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-26892) Observe current status before validating CR changes
[ https://issues.apache.org/jira/browse/FLINK-26892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17513791#comment-17513791 ] Biao Geng commented on FLINK-26892: --- In our current implementation, a validation error should only happen when users apply a changed YAML file with wrong configs, right? It seems that nothing bad will happen: the triggered {{FlinkDeploymentController#reconcile()}} simply returns with no further reschedule. The operator keeps using the previous configs and {{reconcile()}} logic (so {{reconcile()}} will be triggered again after 15 secs and the state of the JM/job can be updated). Is there anything I missed, or a strong reason to trigger a new observe() once the YAML file is applied? > Observe current status before validating CR changes > --- > > Key: FLINK-26892 > URL: https://issues.apache.org/jira/browse/FLINK-26892 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Gyula Fora >Priority: Major > Fix For: kubernetes-operator-1.0.0 > > > Currently, validation is the first step in the controller loop, which means > that when there is a validation error, we fail to observe the status of > currently running deployments. > We should change the order of operations and observe before validation. > Furthermore, observe should use the previous configuration, not the one after > the CR change. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (FLINK-26892) Observe current status before validating CR changes
[ https://issues.apache.org/jira/browse/FLINK-26892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17513791#comment-17513791 ] Biao Geng edited comment on FLINK-26892 at 3/29/22, 4:10 AM: - In our current code, IIUC, a validation error should only happen when users apply a changed YAML file with wrong configs, right? It seems that nothing bad will happen: the triggered {{FlinkDeploymentController#reconcile()}} simply returns with no further reschedule. The operator keeps using the previous configs and {{reconcile()}} logic (so {{reconcile()}} will be triggered again after 15 secs and the state of the JM/job can be updated). Is there anything I missed, or a strong reason to trigger a new observe() once the YAML file is applied? was (Author: bgeng777): In our current code, IIUC, validation error should only happen when users apply a changed yaml file with wrong configs, right? It seems that no bad things will happen: the triggered {{FlinkDeploymentController# reconcile()}} directly returns with no further reschedule. The operator keeps using previous configs and {{reconcile() }} logic(so, the {{reconcile()}} will be triggered after 15 secs and the state of the JM/job can be updated). Is anything I missed or strong point to trigger a new observe() once the yaml file is applied? > Observe current status before validating CR changes > --- > > Key: FLINK-26892 > URL: https://issues.apache.org/jira/browse/FLINK-26892 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Gyula Fora >Priority: Major > Fix For: kubernetes-operator-1.0.0 > > > Currently, validation is the first step in the controller loop, which means > that when there is a validation error, we fail to observe the status of > currently running deployments. > We should change the order of operations and observe before validation. > Furthermore, observe should use the previous configuration, not the one after > the CR change. 
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-26892) Observe current status before validating CR changes
[ https://issues.apache.org/jira/browse/FLINK-26892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17513971#comment-17513971 ] Biao Geng commented on FLINK-26892: --- Let me try to summarize what we should do for this case: 1. Save the latest validated config ({{lastReconciledSpec}} should already be sufficient). 2. Adjust the current logic to: observe with the latest validated config -> validate the new config -> reconcile with the new config. Are there any drawbacks to the above idea? If it looks fine, I can take this ticket and work on it. > Observe current status before validating CR changes > --- > > Key: FLINK-26892 > URL: https://issues.apache.org/jira/browse/FLINK-26892 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Gyula Fora >Priority: Major > Fix For: kubernetes-operator-1.0.0 > > > Currently, validation is the first step in the controller loop, which means > that when there is a validation error, we fail to observe the status of > currently running deployments. > We should change the order of operations and observe before validation. > Furthermore, observe should use the previous configuration, not the one after > the CR change. -- This message was sent by Atlassian Jira (v8.20.1#820001)
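The two steps above can be sketched as the following toy controller loop. This is illustrative Java only, not the actual FlinkDeploymentController: the string-based "spec", the class name, and the validation rule are all stand-ins; only the ordering (observe with the last validated config, then validate, then reconcile) comes from the ticket.

```java
public class ReconcileOrderSketch {
    // Stand-in for the last validated/reconciled spec stored in status.
    String lastValidatedSpec;

    ReconcileOrderSketch(String initialSpec) {
        this.lastValidatedSpec = initialSpec;
    }

    // Stand-in for real validation, e.g. rejecting JM replicas < 1.
    boolean validate(String spec) {
        return !spec.contains("replicas=-1");
    }

    // 1. Observe always runs first, against the last validated spec,
    //    so status is updated even if the newly applied spec is invalid.
    String observe() {
        return "observed with " + lastValidatedSpec;
    }

    // Returns true if the new spec was accepted for reconciliation.
    boolean reconcile(String newSpec) {
        observe();
        // 2. Validate the new spec; on error keep using the previous one.
        if (!validate(newSpec)) {
            return false;
        }
        // 3. Only a valid spec becomes the basis for reconciliation.
        lastValidatedSpec = newSpec;
        return true;
    }

    public static void main(String[] args) {
        ReconcileOrderSketch controller = new ReconcileOrderSketch("replicas=1");
        System.out.println(controller.reconcile("replicas=-1")); // rejected, keeps old spec
        System.out.println(controller.reconcile("replicas=2"));  // accepted
    }
}
```

The key property this ordering buys is that an invalid CR change can never prevent the operator from observing an already-running deployment.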
[jira] [Commented] (FLINK-24897) Enable application mode on YARN to use usrlib
[ https://issues.apache.org/jira/browse/FLINK-24897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17483689#comment-17483689 ] Biao Geng commented on FLINK-24897: --- hi [~wangyang0918], I have completed the [PR|https://github.com/apache/flink/pull/18531]. Would you mind reviewing it in the coming days? Thanks! Besides, I find that the method `ClusterEntrypointUtils#tryFindUserLibDirectory` may be buggy: it is a method in ClusterEntrypointUtils that is used by entrypoint classes to check whether there is a `usrlib` directory in the standalone/K8s/YARN cluster. However, the code shows that it uses `FLINK_LIB_DIR` to look for `usrlib` on the remote side. But in YARN, the uploaded `usrlib` is located in the YARN cache dir (e.g. appcache/application_xx/container_yy). As a result, if we set `FLINK_LIB_DIR` by default in the remote cluster, `tryFindUserLibDirectory` will try to find `usrlib` in the wrong location. The reason this bug does not affect correctness is that, in most cases, `FLINK_LIB_DIR` is not set in the remote cluster's ENV, and the current code then uses the working dir, which is System.getProperty("user.dir") by default. The working dir is the correct choice in most cases. I am not sure whether this bug should be fixed in this PR or in a new one, since it will influence the standalone/K8s/YARN ClusterEntrypoint. > Enable application mode on YARN to use usrlib > - > > Key: FLINK-24897 > URL: https://issues.apache.org/jira/browse/FLINK-24897 > Project: Flink > Issue Type: Improvement > Components: Deployment / YARN >Reporter: Biao Geng >Assignee: Biao Geng >Priority: Major > Labels: pull-request-available > > Hi there, > I am working to utilize application mode to submit Flink jobs to a YARN cluster, > but I find that currently there is no easy way to ship my user-defined > jars (e.g. some custom connectors or UDF jars that would be shared by some > jobs) and ask the FlinkUserCodeClassLoader to load classes in these jars. 
> I checked some relevant Jiras, like FLINK-21289. In K8s mode, there is a > solution: users can use the `usrlib` directory to store their user-defined > jars, and these jars will be loaded by FlinkUserCodeClassLoader when the job > is executed on the JM/TM. > But in YARN mode, `usrlib` does not work like that: > In this method (org.apache.flink.yarn.YarnClusterDescriptor#addShipFiles), if > I want to use `yarn.ship-files` to ship `usrlib` from my Flink client (on my > local machine) to the remote cluster, I must not set UserJarInclusion to > DISABLED due to the checkArgument(). However, if I do not set that option to > DISABLED, the user jars to be shipped will be added to systemClassPaths. As > a result, classes in those user jars will be loaded by the AppClassLoader. > But if I do not ship these jars, there is no convenient way to utilize these > jars in my flink run command. Currently, all I can do, it seems, is use the `-C` > option, which means I have to upload my jars to some shared store first and > then use these remote paths. This is not ideal, as we have already made it > possible to ship jars or files directly, and we have also introduced `usrlib` in > application mode on YARN. It would be more user-friendly if we could allow > shipping `usrlib` from local to the remote cluster while using > FlinkUserCodeClassLoader to load classes in the jars in `usrlib`. > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (FLINK-24897) Make usrlib work for YARN application and per-job
[ https://issues.apache.org/jira/browse/FLINK-24897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-24897: -- Summary: Make usrlib work for YARN application and per-job (was: Make usrlib could work for YARN application and per-job) > Make usrlib work for YARN application and per-job > - > > Key: FLINK-24897 > URL: https://issues.apache.org/jira/browse/FLINK-24897 > Project: Flink > Issue Type: Improvement > Components: Deployment / YARN >Reporter: Biao Geng >Assignee: Biao Geng >Priority: Major > Labels: pull-request-available > > Hi there, > I am working to utilize application mode to submit flink jobs to YARN cluster > but I find that currently there is no easy way to ship my user-defined > jars(e.g. some custom connectors or udf jars that would be shared by some > jobs) and ask the FlinkUserCodeClassLoader to load classes in these jars. > I checked some relevant jiras, like FLINK-21289. In k8s mode, there is a > solution that users can use `usrlib` directory to store their user-defined > jars and these jars would be loaded by FlinkUserCodeClassLoader when the job > is executed on JM/TM. > But on YARN mode, `usrlib` does not work as that: > In this method(org.apache.flink.yarn.YarnClusterDescriptor#addShipFiles), if > I want to use `yarn.ship-files` to ship `usrlib` from my flink client(in my > local machine) to remote cluster, I must not set UserJarInclusion to > DISABLED due to the checkArgument(). However, if I do not set that option to > DISABLED, the user jars to be shipped will be added into systemClassPaths. As > a result, classes in those user jars will be loaded by AppClassLoader. > But if I do not ship these jars, there is no convenient way to utilize these > jars in my flink run command. Currently, all I can do seems to use `-C` > option, which means I have to upload my jars to some shared store first and > then use these remote paths. 
This is not ideal, as we have already made it > possible to ship jars or files directly, and we have also introduced `usrlib` in > application mode on YARN. It would be more user-friendly if we could allow > shipping `usrlib` from local to the remote cluster while using > FlinkUserCodeClassLoader to load classes in the jars in `usrlib`. > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (FLINK-24897) Make usrlib could work for YARN application and per-job
[ https://issues.apache.org/jira/browse/FLINK-24897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17446258#comment-17446258 ] Biao Geng edited comment on FLINK-24897 at 2/8/22, 11:01 AM: - Hi [~trohrmann], would you mind giving any comments on whether we can promise that jars under {{usrlib}} will always be loaded by the user classloader? Besides, after the above discussion with Yang, the current solution is: 1. {{usrlib}} will be shipped automatically if it exists. 2. If we add {{usrlib}} to the ship files again, we will throw an exception. 3. {{usrlib}} will work for both per-job and application mode. 4. Jars in {{usrlib}} will be loaded by the user classloader only when UserJarInclusion is DISABLED. In other cases, the AppClassLoader will be used. 5. The {{usrlib}} directory should not exist in {{yarn.provided.lib.dirs}} or {{yarn.ship-files}}. If users want to ship {{usrlib}}, they should just create {{usrlib}} under the FLINK_HOME dir. Thanks. was (Author: bgeng777): Hi [~trohrmann] , would you mind giving any comments on if we have the promise that jars under {{usrlib}} will always be loaded by user classloader? Besides, after above discussion with Yang, the current solution is: 1. {{usrlib}} will be shipped automatically if it exists. 2. If we add {{usrlib}} in ship files again, we will throw exception. 3. {{usrlib}} will work for both per job and application mode. 4. Jars in {{usrlib}} will be loaded by user classloader only when UserJarInclusion is DISABLED. In other cases, AppClassLoader will be used. Thanks. 
> Make usrlib could work for YARN application and per-job > --- > > Key: FLINK-24897 > URL: https://issues.apache.org/jira/browse/FLINK-24897 > Project: Flink > Issue Type: Improvement > Components: Deployment / YARN >Reporter: Biao Geng >Assignee: Biao Geng >Priority: Major > Labels: pull-request-available > > Hi there, > I am working to utilize application mode to submit flink jobs to YARN cluster > but I find that currently there is no easy way to ship my user-defined > jars(e.g. some custom connectors or udf jars that would be shared by some > jobs) and ask the FlinkUserCodeClassLoader to load classes in these jars. > I checked some relevant jiras, like FLINK-21289. In k8s mode, there is a > solution that users can use `usrlib` directory to store their user-defined > jars and these jars would be loaded by FlinkUserCodeClassLoader when the job > is executed on JM/TM. > But on YARN mode, `usrlib` does not work as that: > In this method(org.apache.flink.yarn.YarnClusterDescriptor#addShipFiles), if > I want to use `yarn.ship-files` to ship `usrlib` from my flink client(in my > local machine) to remote cluster, I must not set UserJarInclusion to > DISABLED due to the checkArgument(). However, if I do not set that option to > DISABLED, the user jars to be shipped will be added into systemClassPaths. As > a result, classes in those user jars will be loaded by AppClassLoader. > But if I do not ship these jars, there is no convenient way to utilize these > jars in my flink run command. Currently, all I can do seems to use `-C` > option, which means I have to upload my jars to some shared store first and > then use these remote paths. It is not so perfect as we have already make it > possible to ship jars or files directly and we also introduce `usrlib` in > application mode on YARN. It would be more user-friendly if we can allow > shipping `usrlib` from local to remote cluster while using > FlinkUserCodeClassLoader to load classes in the jars in `usrlib`. 
> -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (FLINK-26030) Set FLINK_LIB_DIR to 'lib' under working dir in YARN containers
[ https://issues.apache.org/jira/browse/FLINK-26030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-26030: -- Component/s: Deployment / YARN > Set FLINK_LIB_DIR to 'lib' under working dir in YARN containers > --- > > Key: FLINK-26030 > URL: https://issues.apache.org/jira/browse/FLINK-26030 > Project: Flink > Issue Type: Improvement > Components: Deployment / YARN >Reporter: Biao Geng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (FLINK-26030) Set FLINK_LIB_DIR to 'lib' under working dir in YARN containers
[ https://issues.apache.org/jira/browse/FLINK-26030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-26030: -- Summary: Set FLINK_LIB_DIR to 'lib' under working dir in YARN containers (was: Set FLINK_LIB_DIR to lib under working dir in YARN containers) > Set FLINK_LIB_DIR to 'lib' under working dir in YARN containers > --- > > Key: FLINK-26030 > URL: https://issues.apache.org/jira/browse/FLINK-26030 > Project: Flink > Issue Type: Improvement >Reporter: Biao Geng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26030) Set FLINK_LIB_DIR to lib under working dir in YARN containers
Biao Geng created FLINK-26030: - Summary: Set FLINK_LIB_DIR to lib under working dir in YARN containers Key: FLINK-26030 URL: https://issues.apache.org/jira/browse/FLINK-26030 Project: Flink Issue Type: Improvement Reporter: Biao Geng -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (FLINK-26030) Set FLINK_LIB_DIR to 'lib' under working dir in YARN containers
[ https://issues.apache.org/jira/browse/FLINK-26030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-26030: -- Description: Currently, we utilize {{org.apache.flink.runtime.entrypoint.ClusterEntrypointUtils#tryFindUserLibDirectory}} to locate usrlib on both the Flink client and cluster side. This method relies on the value of the environment variable {{FLINK_LIB_DIR}} to find {{usrlib}}. It makes sense on the client side since, in {{bin/config.sh}}, {{FLINK_LIB_DIR}} will be set by default (i.e. to {{FLINK_HOME/lib}}) if it does not exist. But in a YARN cluster's containers, when we want to reuse this method to find {{usrlib}}, YARN usually starts the process using commands like bq. /bin/bash -c /usr/lib/jvm/java-1.8.0/bin/java -Xmx1073741824 -Xms1073741824 -XX:MaxMetaspaceSize=268435456 org.apache.flink.yarn.entrypoint.YarnJobClusterEntrypoint -D jobmanager.memory.off-heap.size=134217728b -D jobmanager.memory.jvm-overhead.min=201326592b -D jobmanager.memory.jvm-metaspace.size=268435456b -D jobmanager.memory.heap.size=1073741824b -D jobmanager.memory.jvm-overhead.max=201326592b ... so {{FLINK_LIB_DIR}} is not guaranteed to be set in such a case. The current code will use the current working dir to locate {{usrlib}}, which is correct in most cases. But bad things can happen if the machine the YARN container resides on has already set {{FLINK_LIB_DIR}} to a different folder. In that case, the code will try to find {{usrlib}} in an undesired place. > Set FLINK_LIB_DIR to 'lib' under working dir in YARN containers > --- > > Key: FLINK-26030 > URL: https://issues.apache.org/jira/browse/FLINK-26030 > Project: Flink > Issue Type: Improvement > Components: Deployment / YARN >Reporter: Biao Geng >Priority: Major > > Currently, we utilize > {{org.apache.flink.runtime.entrypoint.ClusterEntrypointUtils#tryFindUserLibDirectory}} > to locate usrlib on both the Flink client and cluster side. 
> This method relies on the value of the environment variable {{FLINK_LIB_DIR}} to > find {{usrlib}}. > It makes sense on the client side since, in {{bin/config.sh}}, {{FLINK_LIB_DIR}} > will be set by default (i.e. to {{FLINK_HOME/lib}}) if it does not exist. But in a YARN > cluster's containers, when we want to reuse this method to find {{usrlib}}, > YARN usually starts the process using commands like > bq. /bin/bash -c /usr/lib/jvm/java-1.8.0/bin/java -Xmx1073741824 > -Xms1073741824 > -XX:MaxMetaspaceSize=268435456 org.apache.flink.yarn.entrypoint.YarnJobClusterEntrypoint > -D jobmanager.memory.off-heap.size=134217728b -D > jobmanager.memory.jvm-overhead.min=201326592b -D > jobmanager.memory.jvm-metaspace.size=268435456b -D > jobmanager.memory.heap.size=1073741824b -D > jobmanager.memory.jvm-overhead.max=201326592b ... > so {{FLINK_LIB_DIR}} is not guaranteed to be set in such a case. The current code > will use the current working dir to locate {{usrlib}}, which is correct in > most cases. But bad things can happen if the machine the YARN container > resides on has already set {{FLINK_LIB_DIR}} to a different folder. In that > case, the code will try to find {{usrlib}} in an undesired place. -- This message was sent by Atlassian Jira (v8.20.1#820001)
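The lookup behavior described in this ticket can be sketched as below. This is a simplified, hypothetical stand-in for {{ClusterEntrypointUtils#tryFindUserLibDirectory}}, not the actual Flink code: it assumes {{usrlib}} is searched as a sibling of the resolved lib directory, falling back to the process working directory when {{FLINK_LIB_DIR}} is unset.

```java
import java.io.File;

public class UserLibLookupSketch {

    // If FLINK_LIB_DIR is set, usrlib is looked up next to it (its parent dir);
    // otherwise the working directory is used. This mirrors the problem in the
    // ticket: an inherited FLINK_LIB_DIR pointing at a machine-wide Flink
    // install makes the lookup miss the usrlib that YARN shipped into the
    // container working directory (e.g. appcache/application_xx/container_yy).
    static File resolveUsrLib(String flinkLibDirEnv, String workingDir) {
        String base = (flinkLibDirEnv != null)
                ? new File(flinkLibDirEnv).getParent()
                : workingDir;
        return new File(base, "usrlib");
    }

    public static void main(String[] args) {
        // With FLINK_LIB_DIR unset, usrlib is found under the container working dir.
        System.out.println(resolveUsrLib(null, "/yarn/appcache/application_xx/container_yy"));
        // With a stale machine-wide FLINK_LIB_DIR, the lookup goes to the wrong place.
        System.out.println(resolveUsrLib("/opt/flink/lib", "/yarn/appcache/application_xx/container_yy"));
    }
}
```

The fix proposed in the ticket title, pointing {{FLINK_LIB_DIR}} at the {{lib}} directory under the container working dir, makes both branches of this lookup agree.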
[jira] [Updated] (FLINK-26030) Set FLINK_LIB_DIR to 'lib' under working dir in YARN containers
[ https://issues.apache.org/jira/browse/FLINK-26030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-26030: -- Description: Currently, we utilize {{org.apache.flink.runtime.entrypoint.ClusterEntrypointUtils#tryFindUserLibDirectory}} to locate {{usrlib}} on both the Flink client and cluster side. This method relies on the value of the environment variable {{FLINK_LIB_DIR}} to find {{usrlib}}. That makes sense on the client side, since {{bin/config.sh}} sets {{FLINK_LIB_DIR}} by default (i.e. to {{FLINK_HOME/lib}}) if it is not already set. But in a YARN cluster's containers, where we want to reuse this method to find {{usrlib}}, YARN usually starts the process with a command like
bq. /bin/bash -c /usr/lib/jvm/java-1.8.0/bin/java -Xmx1073741824 -Xms1073741824 -XX:MaxMetaspaceSize=268435456 org.apache.flink.yarn.entrypoint.YarnJobClusterEntrypoint -D jobmanager.memory.off-heap.size=134217728b -D jobmanager.memory.jvm-overhead.min=201326592b -D jobmanager.memory.jvm-metaspace.size=268435456b -D jobmanager.memory.heap.size=1073741824b -D jobmanager.memory.jvm-overhead.max=201326592b ...
so {{FLINK_LIB_DIR}} is not guaranteed to be set. The current code then uses the current working dir to locate {{usrlib}}, which is correct in most cases. But bad things can happen if the machine the YARN container runs on has already set {{FLINK_LIB_DIR}} to a different folder: the code will then look for {{usrlib}} in an undesired place. One possible solution would be overriding {{FLINK_LIB_DIR}} in the YARN container env to point at the {{lib}} dir under YARN's working dir.

was: Currently, we utilize {{org.apache.flink.runtime.entrypoint.ClusterEntrypointUtils#tryFindUserLibDirectory}} to locate {{usrlib}} on both the Flink client and cluster side. This method relies on the value of the environment variable {{FLINK_LIB_DIR}} to find {{usrlib}}. That makes sense on the client side, since {{bin/config.sh}} sets {{FLINK_LIB_DIR}} by default (i.e. to {{FLINK_HOME/lib}}) if it is not already set. But in a YARN cluster's containers, where we want to reuse this method to find {{usrlib}}, YARN usually starts the process with a command like
bq. /bin/bash -c /usr/lib/jvm/java-1.8.0/bin/java -Xmx1073741824 -Xms1073741824 -XX:MaxMetaspaceSize=268435456 org.apache.flink.yarn.entrypoint.YarnJobClusterEntrypoint -D jobmanager.memory.off-heap.size=134217728b -D jobmanager.memory.jvm-overhead.min=201326592b -D jobmanager.memory.jvm-metaspace.size=268435456b -D jobmanager.memory.heap.size=1073741824b -D jobmanager.memory.jvm-overhead.max=201326592b ...
so {{FLINK_LIB_DIR}} is not guaranteed to be set. The current code then uses the current working dir to locate {{usrlib}}, which is correct in most cases. But bad things can happen if the machine the YARN container runs on has already set {{FLINK_LIB_DIR}} to a different folder: the code will then look for {{usrlib}} in an undesired place.

> Set FLINK_LIB_DIR to 'lib' under working dir in YARN containers
> ---
>
> Key: FLINK-26030
> URL: https://issues.apache.org/jira/browse/FLINK-26030
> Project: Flink
> Issue Type: Improvement
> Components: Deployment / YARN
> Reporter: Biao Geng
> Priority: Major
>
> Currently, we utilize {{org.apache.flink.runtime.entrypoint.ClusterEntrypointUtils#tryFindUserLibDirectory}} to locate {{usrlib}} on both the Flink client and cluster side.
> This method relies on the value of the environment variable {{FLINK_LIB_DIR}} to find {{usrlib}}.
> That makes sense on the client side, since {{bin/config.sh}} sets {{FLINK_LIB_DIR}} by default (i.e. to {{FLINK_HOME/lib}}) if it is not already set. But in a YARN cluster's containers, where we want to reuse this method to find {{usrlib}}, YARN usually starts the process with a command like
> bq. /bin/bash -c /usr/lib/jvm/java-1.8.0/bin/java -Xmx1073741824 -Xms1073741824 -XX:MaxMetaspaceSize=268435456 org.apache.flink.yarn.entrypoint.YarnJobClusterEntrypoint -D jobmanager.memory.off-heap.size=134217728b -D jobmanager.memory.jvm-overhead.min=201326592b -D jobmanager.memory.jvm-metaspace.size=268435456b -D jobmanager.memory.heap.size=1073741824b -D jobmanager.memory.jvm-overhead.max=201326592b ...
> so {{FLINK_LIB_DIR}} is not guaranteed to be set. The current code then uses the current working dir to locate {{usrlib}}, which is correct in most cases. But bad things can happen if the machine the YARN container runs on has already set {{FLINK_LIB_DIR}} to a different folder: the code will then look for {{usrlib}} in an undesired place.
> One possible solution would be overriding {{FLINK_LIB_DIR}} in the YARN container env to point at the {{lib}} dir under YARN's working dir.
-- This message was sent by Atlassian Jira (v8.20.1#820001)
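The lookup order described in this issue can be sketched as follows. This is a hedged, self-contained re-implementation for illustration only, not Flink's actual {{tryFindUserLibDirectory}} code; the class and method names here are made up:

```java
import java.io.File;

public class UserLibLocator {
    // Sketch of the described lookup: if FLINK_LIB_DIR is set, usrlib is
    // expected to sit next to the lib dir (i.e. $FLINK_LIB_DIR/../usrlib);
    // otherwise it is looked up under the current working directory. This is
    // exactly the fallback that goes wrong when a YARN node happens to export
    // FLINK_LIB_DIR pointing somewhere unrelated to the container.
    static String userLibCandidate(String flinkLibDirEnv, String workingDir) {
        File base = (flinkLibDirEnv != null)
                ? new File(flinkLibDirEnv).getParentFile()
                : new File(workingDir);
        return new File(base, "usrlib").getPath();
    }
}
```

With {{FLINK_LIB_DIR=/opt/flink/lib}} the candidate is {{/opt/flink/usrlib}}; when the variable is unset, the candidate falls under the working dir, which is the correct location inside a YARN container.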
[jira] [Created] (FLINK-26047) Support usrlib in HDFS for YARN application mode
Biao Geng created FLINK-26047: - Summary: Support usrlib in HDFS for YARN application mode Key: FLINK-26047 URL: https://issues.apache.org/jira/browse/FLINK-26047 Project: Flink Issue Type: Improvement Components: Deployment / YARN Reporter: Biao Geng In YARN application mode, we currently support using user jars and lib jars from HDFS. For example, we can run commands like:
{quote}./bin/flink run-application -t yarn-application \
-Dyarn.provided.lib.dirs="hdfs://myhdfs/my-remote-flink-dist-dir" \
hdfs://myhdfs/jars/my-application.jar{quote}
For {{usrlib}}, we currently only support a local directory. I propose to add HDFS support for {{usrlib}} so that it works better with CLASSPATH_INCLUDE_USER_JAR. It can also benefit cases like using a notebook to submit Flink jobs. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (FLINK-24897) Enable application mode on YARN to use usrlib
[ https://issues.apache.org/jira/browse/FLINK-24897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17483689#comment-17483689 ] Biao Geng edited comment on FLINK-24897 at 2/7/22, 9:05 AM: hi [~wangyang0918], I have completed the [PR|https://github.com/apache/flink/pull/18531]. Would you mind reviewing it in the coming days? Thanks!
Besides, I find that the method `ClusterEntrypointUtils#tryFindUserLibDirectory` may be buggy: it is a method in ClusterEntrypointUtils that is used by the entrypoint classes to check whether there is a `usrlib` directory in a standalone/k8s/yarn cluster. However, the code shows that it uses `FLINK_LIB_DIR` to look for `usrlib` on the remote side. But in YARN, the uploaded `usrlib` is located in the YARN cache dir (e.g. appcache/application_xx/container_yy). As a result, if `FLINK_LIB_DIR` is set by default in the remote cluster, `tryFindUserLibDirectory` will try to find `usrlib` in the wrong location. The reason this bug does not affect correctness is that in most cases `FLINK_LIB_DIR` is not set in the remote cluster's env, and the current code then uses the working dir, which is System.getProperty("user.dir") by default. The working dir is the correct choice in most cases.
Add: discussed with Yang offline, and the above behavior is by design.

was (Author: bgeng777): hi [~wangyang0918], I have completed the [PR|https://github.com/apache/flink/pull/18531]. Would you mind reviewing it in the coming days? Thanks!
Besides, I find that the method `ClusterEntrypointUtils#tryFindUserLibDirectory` may be buggy: it is a method in ClusterEntrypointUtils that is used by the entrypoint classes to check whether there is a `usrlib` directory in a standalone/k8s/yarn cluster. However, the code shows that it uses `FLINK_LIB_DIR` to look for `usrlib` on the remote side. But in YARN, the uploaded `usrlib` is located in the YARN cache dir (e.g. appcache/application_xx/container_yy). As a result, if `FLINK_LIB_DIR` is set by default in the remote cluster, `tryFindUserLibDirectory` will try to find `usrlib` in the wrong location. The reason this bug does not affect correctness is that in most cases `FLINK_LIB_DIR` is not set in the remote cluster's env, and the current code then uses the working dir, which is System.getProperty("user.dir") by default. The working dir is the correct choice in most cases. I am not sure whether this bug should be fixed in this PR or whether we should create a new one, since it will influence the standalone/k8s/yarn ClusterEntrypoint.

> Enable application mode on YARN to use usrlib
> ---
>
> Key: FLINK-24897
> URL: https://issues.apache.org/jira/browse/FLINK-24897
> Project: Flink
> Issue Type: Improvement
> Components: Deployment / YARN
> Reporter: Biao Geng
> Assignee: Biao Geng
> Priority: Major
> Labels: pull-request-available
>
> Hi there,
> I am working on utilizing application mode to submit Flink jobs to a YARN cluster, but I find that currently there is no easy way to ship my user-defined jars (e.g. some custom connectors or udf jars shared by several jobs) and ask the FlinkUserCodeClassLoader to load classes from these jars.
> I checked some relevant jiras, like FLINK-21289. In k8s mode, there is a solution: users can put their user-defined jars in the `usrlib` directory, and these jars will be loaded by FlinkUserCodeClassLoader when the job is executed on the JM/TM.
> But in YARN mode, `usrlib` does not work like that:
> In this method (org.apache.flink.yarn.YarnClusterDescriptor#addShipFiles), if I want to use `yarn.ship-files` to ship `usrlib` from my Flink client (on my local machine) to the remote cluster, I must not set UserJarInclusion to DISABLED due to the checkArgument(). However, if I do not set that option to DISABLED, the user jars to be shipped will be added to systemClassPaths. As a result, classes in those user jars will be loaded by the AppClassLoader.
> But if I do not ship these jars, there is no convenient way to utilize them in my flink run command. Currently, all I can do seems to be to use the `-C` option, which means I have to upload my jars to some shared store first and then use those remote paths. This is not ideal, as we have already made it possible to ship jars or files directly, and we have also introduced `usrlib` in application mode on YARN. It would be more user-friendly if we could allow shipping `usrlib` from local to the remote cluster while using FlinkUserCodeClassLoader to load classes from the jars in `usrlib`.
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (FLINK-24897) Enable application mode on YARN to use usrlib
[ https://issues.apache.org/jira/browse/FLINK-24897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17483689#comment-17483689 ] Biao Geng edited comment on FLINK-24897 at 2/7/22, 9:07 AM: hi [~wangyang0918], I have completed the [PR|https://github.com/apache/flink/pull/18531]. Would you mind reviewing it in the coming days? Thanks!
Besides, I find that the method `ClusterEntrypointUtils#tryFindUserLibDirectory` may be buggy: it is a method in ClusterEntrypointUtils that is used by the entrypoint classes to check whether there is a `usrlib` directory in a standalone/k8s/yarn cluster. However, the code shows that it uses `FLINK_LIB_DIR` to look for `usrlib` on the remote side. But in YARN, the uploaded `usrlib` is located in the YARN cache dir (e.g. appcache/application_xx/container_yy). As a result, if `FLINK_LIB_DIR` is set by default in the remote cluster, `tryFindUserLibDirectory` will try to find `usrlib` in the wrong location. The reason this bug does not affect correctness is that in most cases `FLINK_LIB_DIR` is not set in the remote cluster's env, and the current code then uses the working dir, which is System.getProperty("user.dir") by default. The working dir is the correct choice in most cases.

was (Author: bgeng777): hi [~wangyang0918], I have completed the [PR|https://github.com/apache/flink/pull/18531]. Would you mind reviewing it in the coming days? Thanks!
Besides, I find that the method `ClusterEntrypointUtils#tryFindUserLibDirectory` may be buggy: it is a method in ClusterEntrypointUtils that is used by the entrypoint classes to check whether there is a `usrlib` directory in a standalone/k8s/yarn cluster. However, the code shows that it uses `FLINK_LIB_DIR` to look for `usrlib` on the remote side. But in YARN, the uploaded `usrlib` is located in the YARN cache dir (e.g. appcache/application_xx/container_yy). As a result, if `FLINK_LIB_DIR` is set by default in the remote cluster, `tryFindUserLibDirectory` will try to find `usrlib` in the wrong location. The reason this bug does not affect correctness is that in most cases `FLINK_LIB_DIR` is not set in the remote cluster's env, and the current code then uses the working dir, which is System.getProperty("user.dir") by default. The working dir is the correct choice in most cases.
Add: discussed with Yang offline, and the above behavior is by design.

> Enable application mode on YARN to use usrlib
> ---
>
> Key: FLINK-24897
> URL: https://issues.apache.org/jira/browse/FLINK-24897
> Project: Flink
> Issue Type: Improvement
> Components: Deployment / YARN
> Reporter: Biao Geng
> Assignee: Biao Geng
> Priority: Major
> Labels: pull-request-available
>
> Hi there,
> I am working on utilizing application mode to submit Flink jobs to a YARN cluster, but I find that currently there is no easy way to ship my user-defined jars (e.g. some custom connectors or udf jars shared by several jobs) and ask the FlinkUserCodeClassLoader to load classes from these jars.
> I checked some relevant jiras, like FLINK-21289. In k8s mode, there is a solution: users can put their user-defined jars in the `usrlib` directory, and these jars will be loaded by FlinkUserCodeClassLoader when the job is executed on the JM/TM.
> But in YARN mode, `usrlib` does not work like that:
> In this method (org.apache.flink.yarn.YarnClusterDescriptor#addShipFiles), if I want to use `yarn.ship-files` to ship `usrlib` from my Flink client (on my local machine) to the remote cluster, I must not set UserJarInclusion to DISABLED due to the checkArgument(). However, if I do not set that option to DISABLED, the user jars to be shipped will be added to systemClassPaths. As a result, classes in those user jars will be loaded by the AppClassLoader.
> But if I do not ship these jars, there is no convenient way to utilize them in my flink run command. Currently, all I can do seems to be to use the `-C` option, which means I have to upload my jars to some shared store first and then use those remote paths. This is not ideal, as we have already made it possible to ship jars or files directly, and we have also introduced `usrlib` in application mode on YARN. It would be more user-friendly if we could allow shipping `usrlib` from local to the remote cluster while using FlinkUserCodeClassLoader to load classes from the jars in `usrlib`.
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-26030) Set FLINK_LIB_DIR to 'lib' under working dir in YARN containers
[ https://issues.apache.org/jira/browse/FLINK-26030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489957#comment-17489957 ] Biao Geng commented on FLINK-26030: --- Yes, it should be a bug. I believe that, to be consistent with the behavior on the Flink client side, on the YARN cluster side we should set {{FLINK_LIB_DIR}} in the YARN container env to the {{lib}} dir under YARN's working dir. If that is the case, I may start investigating how to implement it properly. Let me know if the above assumption makes sense. Thanks!
> Set FLINK_LIB_DIR to 'lib' under working dir in YARN containers
> ---
>
> Key: FLINK-26030
> URL: https://issues.apache.org/jira/browse/FLINK-26030
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN
> Reporter: Biao Geng
> Priority: Minor
>
> Currently, we utilize {{org.apache.flink.runtime.entrypoint.ClusterEntrypointUtils#tryFindUserLibDirectory}} to locate {{usrlib}} on both the Flink client and cluster side.
> This method relies on the value of the environment variable {{FLINK_LIB_DIR}} to find {{usrlib}}.
> That makes sense on the client side, since {{bin/config.sh}} sets {{FLINK_LIB_DIR}} by default (i.e. to {{FLINK_HOME/lib}}) if it is not already set. But in a YARN cluster's containers, where we want to reuse this method to find {{usrlib}}, YARN usually starts the process with a command like
> bq. /bin/bash -c /usr/lib/jvm/java-1.8.0/bin/java -Xmx1073741824 -Xms1073741824 -XX:MaxMetaspaceSize=268435456 org.apache.flink.yarn.entrypoint.YarnJobClusterEntrypoint -D jobmanager.memory.off-heap.size=134217728b -D jobmanager.memory.jvm-overhead.min=201326592b -D jobmanager.memory.jvm-metaspace.size=268435456b -D jobmanager.memory.heap.size=1073741824b -D jobmanager.memory.jvm-overhead.max=201326592b ...
> so {{FLINK_LIB_DIR}} is not guaranteed to be set. The current code then uses the current working dir to locate {{usrlib}}, which is correct in most cases.
> But bad things can happen if the machine the YARN container runs on has already set {{FLINK_LIB_DIR}} to a different folder: the code will then look for {{usrlib}} in an undesired place.
> One possible solution would be overriding {{FLINK_LIB_DIR}} in the YARN container env to point at the {{lib}} dir under YARN's working dir.
-- This message was sent by Atlassian Jira (v8.20.1#820001)
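The proposed fix can be sketched as below: a minimal illustration that assumes the container env is assembled as a plain map before launch. The helper name {{withOverriddenLibDir}} is hypothetical, not Flink's actual API:

```java
import java.util.HashMap;
import java.util.Map;

public class ContainerEnv {
    // Force FLINK_LIB_DIR to "<container working dir>/lib" so that any value
    // inherited from the YARN node's own environment cannot leak into the
    // container and misdirect the usrlib lookup.
    static Map<String, String> withOverriddenLibDir(Map<String, String> env, String containerWorkingDir) {
        Map<String, String> result = new HashMap<>(env);
        result.put("FLINK_LIB_DIR", containerWorkingDir + "/lib");
        return result;
    }
}
```

The override is unconditional on purpose: whether or not the node exported a stale {{FLINK_LIB_DIR}}, the container always sees the {{lib}} dir under its own working dir.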
[jira] [Updated] (FLINK-26030) Set FLINK_LIB_DIR to 'lib' under working dir in YARN containers
[ https://issues.apache.org/jira/browse/FLINK-26030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-26030: -- Description: Currently, we utilize {{org.apache.flink.runtime.entrypoint.ClusterEntrypointUtils#tryFindUserLibDirectory}} to locate {{usrlib}} on both the Flink client and cluster side. This method relies on the value of the environment variable {{FLINK_LIB_DIR}} to find {{usrlib}}. That makes sense on the client side, since {{bin/config.sh}} sets {{FLINK_LIB_DIR}} by default (i.e. to {{FLINK_HOME/lib}}) if it is not already set. But in a YARN cluster's containers, where we want to reuse this method to find {{usrlib}}, YARN usually starts the process with a command like
{quote}/bin/bash -c /usr/lib/jvm/java-1.8.0/bin/java -Xmx1073741824 -Xms1073741824 -XX:MaxMetaspaceSize=268435456 org.apache.flink.yarn.entrypoint.YarnJobClusterEntrypoint -D jobmanager.memory.off-heap.size=134217728b -D jobmanager.memory.jvm-overhead.min=201326592b -D jobmanager.memory.jvm-metaspace.size=268435456b -D jobmanager.memory.heap.size=1073741824b -D jobmanager.memory.jvm-overhead.max=201326592b ...
{quote}
{{FLINK_LIB_DIR}} is not guaranteed to be set in such a case. The current code then uses the current working dir to locate {{usrlib}}, which is correct in most cases. But bad things can happen if the machine the YARN container runs on has already set {{FLINK_LIB_DIR}} to a different folder: the code will then look for {{usrlib}} in an undesired place. One possible solution would be overriding {{FLINK_LIB_DIR}} in the YARN container env to point at the {{lib}} dir under YARN's working dir.

was: Currently, we utilize {{org.apache.flink.runtime.entrypoint.ClusterEntrypointUtils#tryFindUserLibDirectory}} to locate {{usrlib}} on both the Flink client and cluster side. This method relies on the value of the environment variable {{FLINK_LIB_DIR}} to find {{usrlib}}. That makes sense on the client side, since {{bin/config.sh}} sets {{FLINK_LIB_DIR}} by default (i.e. to {{FLINK_HOME/lib}}) if it is not already set. But in a YARN cluster's containers, where we want to reuse this method to find {{usrlib}}, YARN usually starts the process with a command like
bq. /bin/bash -c /usr/lib/jvm/java-1.8.0/bin/java -Xmx1073741824 -Xms1073741824 -XX:MaxMetaspaceSize=268435456 org.apache.flink.yarn.entrypoint.YarnJobClusterEntrypoint -D jobmanager.memory.off-heap.size=134217728b -D jobmanager.memory.jvm-overhead.min=201326592b -D jobmanager.memory.jvm-metaspace.size=268435456b -D jobmanager.memory.heap.size=1073741824b -D jobmanager.memory.jvm-overhead.max=201326592b ...
{{FLINK_LIB_DIR}} is not guaranteed to be set in such a case. The current code then uses the current working dir to locate {{usrlib}}, which is correct in most cases. But bad things can happen if the machine the YARN container runs on has already set {{FLINK_LIB_DIR}} to a different folder: the code will then look for {{usrlib}} in an undesired place. One possible solution would be overriding {{FLINK_LIB_DIR}} in the YARN container env to point at the {{lib}} dir under YARN's working dir.

> Set FLINK_LIB_DIR to 'lib' under working dir in YARN containers
> ---
>
> Key: FLINK-26030
> URL: https://issues.apache.org/jira/browse/FLINK-26030
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN
> Reporter: Biao Geng
> Priority: Minor
>
> Currently, we utilize {{org.apache.flink.runtime.entrypoint.ClusterEntrypointUtils#tryFindUserLibDirectory}} to locate {{usrlib}} on both the Flink client and cluster side.
> This method relies on the value of the environment variable {{FLINK_LIB_DIR}} to find {{usrlib}}.
> That makes sense on the client side, since {{bin/config.sh}} sets {{FLINK_LIB_DIR}} by default (i.e. to {{FLINK_HOME/lib}}) if it is not already set.
> But in a YARN cluster's containers, where we want to reuse this method to find {{usrlib}}, YARN usually starts the process with a command like
> {quote}/bin/bash -c /usr/lib/jvm/java-1.8.0/bin/java -Xmx1073741824 -Xms1073741824 -XX:MaxMetaspaceSize=268435456 org.apache.flink.yarn.entrypoint.YarnJobClusterEntrypoint -D jobmanager.memory.off-heap.size=134217728b -D jobmanager.memory.jvm-overhead.min=201326592b -D jobmanager.memory.jvm-metaspace.size=268435456b -D jobmanager.memory.heap.size=1073741824b -D jobmanager.memory.jvm-overhead.max=201326592b ...
> {quote}
> {{FLINK_LIB_DIR}} is not guaranteed to be set in such a case. The current code then uses the current working dir to locate {{usrlib}}, which is correct in most cases. But bad things can happen if the machine the YARN container runs on has already set {{FLINK_LIB_DIR}} to a different folder: the code will then look for {{usrlib}} in an undesired place.
[jira] [Comment Edited] (FLINK-24897) Make usrlib work for YARN application and per-job
[ https://issues.apache.org/jira/browse/FLINK-24897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17446258#comment-17446258 ] Biao Geng edited comment on FLINK-24897 at 2/10/22, 2:35 AM: - Hi [~trohrmann], would you mind commenting on whether we have the guarantee that jars under {{usrlib}} will always be loaded by the user classloader?
Besides, after the above discussion with Yang, the current solution is:
1. {{usrlib}} will be shipped automatically if it exists.
2. If {{usrlib}} is added to the ship files again, we will throw an exception.
3. {{usrlib}} will work for both per-job and application mode.
4. Jars in {{usrlib}} will be loaded by the user classloader only when UserJarInclusion is DISABLED. In other cases, the AppClassLoader will be used.
5. The {{usrlib}} directory should not exist in {{yarn.provided.lib.dirs}} or {{yarn.ship-files}}. If users want to ship {{usrlib}}, they should just create {{usrlib}} under the FLINK_HOME dir.
6. The key of CLASSPATH_INCLUDE_USER_JAR will be renamed from {{yarn.per-job-cluster.include-user-jar}} to {{yarn.classpath.include-user-jar}} for better naming.
Thanks.

was (Author: bgeng777): Hi [~trohrmann], would you mind commenting on whether we have the guarantee that jars under {{usrlib}} will always be loaded by the user classloader?
Besides, after the above discussion with Yang, the current solution is:
1. {{usrlib}} will be shipped automatically if it exists.
2. If {{usrlib}} is added to the ship files again, we will throw an exception.
3. {{usrlib}} will work for both per-job and application mode.
4. Jars in {{usrlib}} will be loaded by the user classloader only when UserJarInclusion is DISABLED. In other cases, the AppClassLoader will be used.
5. The {{usrlib}} directory should not exist in {{yarn.provided.lib.dirs}} or {{yarn.ship-files}}. If users want to ship {{usrlib}}, they should just create {{usrlib}} under the FLINK_HOME dir.
Thanks.

> Make usrlib work for YARN application and per-job
> ---
>
> Key: FLINK-24897
> URL: https://issues.apache.org/jira/browse/FLINK-24897
> Project: Flink
> Issue Type: Improvement
> Components: Deployment / YARN
> Reporter: Biao Geng
> Assignee: Biao Geng
> Priority: Major
> Labels: pull-request-available
>
> Hi there,
> I am working on utilizing application mode to submit Flink jobs to a YARN cluster, but I find that currently there is no easy way to ship my user-defined jars (e.g. some custom connectors or udf jars shared by several jobs) and ask the FlinkUserCodeClassLoader to load classes from these jars.
> I checked some relevant jiras, like FLINK-21289. In k8s mode, there is a solution: users can put their user-defined jars in the `usrlib` directory, and these jars will be loaded by FlinkUserCodeClassLoader when the job is executed on the JM/TM.
> But in YARN mode, `usrlib` does not work like that:
> In this method (org.apache.flink.yarn.YarnClusterDescriptor#addShipFiles), if I want to use `yarn.ship-files` to ship `usrlib` from my Flink client (on my local machine) to the remote cluster, I must not set UserJarInclusion to DISABLED due to the checkArgument(). However, if I do not set that option to DISABLED, the user jars to be shipped will be added to systemClassPaths. As a result, classes in those user jars will be loaded by the AppClassLoader.
> But if I do not ship these jars, there is no convenient way to utilize them in my flink run command. Currently, all I can do seems to be to use the `-C` option, which means I have to upload my jars to some shared store first and then use those remote paths. This is not ideal, as we have already made it possible to ship jars or files directly, and we have also introduced `usrlib` in application mode on YARN. It would be more user-friendly if we could allow shipping `usrlib` from local to the remote cluster while using FlinkUserCodeClassLoader to load classes from the jars in `usrlib`.
-- This message was sent by Atlassian Jira (v8.20.1#820001)
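Points 1, 2 and 5 in the comment above amount to a simple up-front validation rule. A hedged sketch follows; the class {{ShipFileChecker}} and its method are illustrative only, not the actual {{YarnClusterDescriptor}} code:

```java
import java.io.File;
import java.util.List;

public class ShipFileChecker {
    // usrlib is shipped automatically when it exists under FLINK_HOME, so
    // listing it again in yarn.ship-files is rejected before submission.
    static void checkShipFiles(List<File> shipFiles) {
        for (File file : shipFiles) {
            if ("usrlib".equals(file.getName())) {
                throw new IllegalArgumentException(
                        "usrlib must not be added to ship files; "
                                + "it is shipped automatically if it exists under FLINK_HOME.");
            }
        }
    }
}
```

Failing fast here is friendlier than shipping the directory twice and letting the classloader behavior silently diverge from point 4.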
[jira] [Commented] (FLINK-26047) Support usrlib in HDFS for YARN application mode
[ https://issues.apache.org/jira/browse/FLINK-26047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489955#comment-17489955 ] Biao Geng commented on FLINK-26047: --- hi [~wangyang0918] Sure. I will take the ticket and start working ASAP.
> Support usrlib in HDFS for YARN application mode
> ---
>
> Key: FLINK-26047
> URL: https://issues.apache.org/jira/browse/FLINK-26047
> Project: Flink
> Issue Type: Improvement
> Components: Deployment / YARN
> Reporter: Biao Geng
> Priority: Major
>
> In YARN application mode, we currently support using user jars and lib jars from HDFS. For example, we can run commands like:
> {quote}./bin/flink run-application -t yarn-application \
> -Dyarn.provided.lib.dirs="hdfs://myhdfs/my-remote-flink-dist-dir" \
> hdfs://myhdfs/jars/my-application.jar{quote}
> For {{usrlib}}, we currently only support a local directory. I propose to add HDFS support for {{usrlib}} so that it works better with CLASSPATH_INCLUDE_USER_JAR. It can also benefit cases like using a notebook to submit Flink jobs.
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (FLINK-24682) Unify the -C option behavior in both yarn application and per-job mode
[ https://issues.apache.org/jira/browse/FLINK-24682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng resolved FLINK-24682. --- Resolution: Won't Do
> Unify the -C option behavior in both yarn application and per-job mode
> ---
>
> Key: FLINK-24682
> URL: https://issues.apache.org/jira/browse/FLINK-24682
> Project: Flink
> Issue Type: Improvement
> Components: Deployment / YARN
> Affects Versions: 1.12.3
> Environment: flink 1.12.3
> yarn 2.8.5
> Reporter: Biao Geng
> Priority: Major
>
> Recently, when switching the job submission mode from per-job mode to application mode on YARN, we found the behavior of '-C' ('--classpath') somewhat misleading:
> In per-job mode, the `main()` method of the program is executed on the local machine, and the '-C' option works well when we use it to specify some local user jars like -C file://xx.jar.
> But in application mode, this option works differently: as the `main()` method will be executed on the job manager in the cluster, it is unclear where a url like `file://xx.jar` points. According to the code, it seems that `file://xx.jar` is resolved on the job manager machine in the cluster. If that is true, it may mislead users, since in per-job mode it refers to the jars on the client machine.
> In summary, if we can unify the -C option behavior in both YARN application and per-job mode, it will help users switch to application mode more smoothly and, more importantly, make it much easier to specify local jars on the client machine that should be loaded by the UserClassLoader.
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-24682) Unify the -C option behavior in both yarn application and per-job mode
[ https://issues.apache.org/jira/browse/FLINK-24682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489959#comment-17489959 ] Biao Geng commented on FLINK-24682: --- Per-job mode will be deprecated due to https://issues.apache.org/jira/browse/FLINK-25999. Some relevant use cases of loading user-specified jars with the UserClassLoader can be achieved with https://issues.apache.org/jira/browse/FLINK-24897. Thus this Jira will be closed.
> Unify the -C option behavior in both yarn application and per-job mode
> ---
>
> Key: FLINK-24682
> URL: https://issues.apache.org/jira/browse/FLINK-24682
> Project: Flink
> Issue Type: Improvement
> Components: Deployment / YARN
> Affects Versions: 1.12.3
> Environment: flink 1.12.3
> yarn 2.8.5
> Reporter: Biao Geng
> Priority: Major
>
> Recently, when switching the job submission mode from per-job mode to application mode on YARN, we found the behavior of '-C' ('--classpath') somewhat misleading:
> In per-job mode, the `main()` method of the program is executed on the local machine, and the '-C' option works well when we use it to specify some local user jars like -C file://xx.jar.
> But in application mode, this option works differently: as the `main()` method will be executed on the job manager in the cluster, it is unclear where a url like `file://xx.jar` points. According to the code, it seems that `file://xx.jar` is resolved on the job manager machine in the cluster. If that is true, it may mislead users, since in per-job mode it refers to the jars on the client machine.
> In summary, if we can unify the -C option behavior in both YARN application and per-job mode, it will help users switch to application mode more smoothly and, more importantly, make it much easier to specify local jars on the client machine that should be loaded by the UserClassLoader.
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (FLINK-27109) Improve the creation of ClusterRole in Flink K8s operator
[ https://issues.apache.org/jira/browse/FLINK-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-27109: -- Priority: Not a Priority (was: Major)
> Improve the creation of ClusterRole in Flink K8s operator
> ---
>
> Key: FLINK-27109
> URL: https://issues.apache.org/jira/browse/FLINK-27109
> Project: Flink
> Issue Type: Improvement
> Components: Kubernetes Operator
> Reporter: Biao Geng
> Priority: Not a Priority
>
> As the k8s [doc|https://kubernetes.io/docs/reference/access-authn-authz/rbac/#clusterrole-example] says, ClusterRole is a non-namespaced resource.
> In our helm chart, we currently define the ClusterRole with the name 'flink-operator', and the namespace field in its metadata is omitted. As a result, if a user wants to install multiple flink-kubernetes-operator releases in different namespaces, the ClusterRole 'flink-operator' will be created multiple times.
> Errors like
> {quote}Error: INSTALLATION FAILED: rendered manifests contain a resource that already exists. Unable to continue with install: ClusterRole "flink-operator" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-namespace" must equal "c-8725bcef1dc84d6f": current value is "default"
> {quote}
> will be thrown.
> Solution 1 could be adding the namespace as a postfix to the name of the ClusterRole.
> Solution 2 is to add an if/else check like [this|https://stackoverflow.com/questions/65110332/clusterrole-exists-and-cannot-be-imported-into-the-current-release] to avoid creating an existing resource. One important drawback of solution 2 is that when uninstalling one helm release, the created ClusterRole will be removed as well.
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-27109) Improve the creation of ClusterRole in Flink K8s operator
[ https://issues.apache.org/jira/browse/FLINK-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17518699#comment-17518699 ] Biao Geng commented on FLINK-27109: --- [~wangyang0918] I see your point. My bad for overlooking the usage of `watchNamespaces`. I will utilize this field.
> Improve the creation of ClusterRole in Flink K8s operator
> ---
>
> Key: FLINK-27109
> URL: https://issues.apache.org/jira/browse/FLINK-27109
> Project: Flink
> Issue Type: Improvement
> Components: Kubernetes Operator
> Reporter: Biao Geng
> Priority: Major
>
> As the k8s [doc|https://kubernetes.io/docs/reference/access-authn-authz/rbac/#clusterrole-example] says, ClusterRole is a non-namespaced resource.
> In our helm chart, we currently define the ClusterRole with the name 'flink-operator', and the namespace field in its metadata is omitted. As a result, if a user wants to install multiple flink-kubernetes-operator releases in different namespaces, the ClusterRole 'flink-operator' will be created multiple times.
> Errors like
> {quote}Error: INSTALLATION FAILED: rendered manifests contain a resource that already exists. Unable to continue with install: ClusterRole "flink-operator" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-namespace" must equal "c-8725bcef1dc84d6f": current value is "default"
> {quote}
> will be thrown.
> Solution 1 could be adding the namespace as a postfix to the name of the ClusterRole.
> Solution 2 is to add an if/else check like [this|https://stackoverflow.com/questions/65110332/clusterrole-exists-and-cannot-be-imported-into-the-current-release] to avoid creating an existing resource. One important drawback of solution 2 is that when uninstalling one helm release, the created ClusterRole will be removed as well.
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (FLINK-27109) Improve the creation of ClusterRole in Flink K8s operator
[ https://issues.apache.org/jira/browse/FLINK-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-27109: -- Description: As the k8s [doc|https://kubernetes.io/docs/reference/access-authn-authz/rbac/#clusterrole-example] says, ClusterRole is a non-namespaced resource. In our helm chart, we currently define the ClusterRole with the name 'flink-operator', and the namespace field in its metadata is omitted. As a result, if a user wants to install multiple flink-kubernetes-operator releases in different namespaces, the ClusterRole 'flink-operator' will be created multiple times. Errors like
{quote}Error: INSTALLATION FAILED: rendered manifests contain a resource that already exists. Unable to continue with install: ClusterRole "flink-operator" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-namespace" must equal "c-8725bcef1dc84d6f": current value is "default"
{quote}
will be thrown. One solution could be adding the namespace as a postfix to the name of the ClusterRole. Another possible solution is to add an if/else check like [this|https://stackoverflow.com/questions/65110332/clusterrole-exists-and-cannot-be-imported-into-the-current-release] to avoid creating an existing resource.

was: As the k8s [doc|https://kubernetes.io/docs/reference/access-authn-authz/rbac/#clusterrole-example] says, ClusterRole is a non-namespaced resource. In our helm chart, we currently define the ClusterRole with the name 'flink-operator', and the namespace field in its metadata is omitted. As a result, if a user wants to install multiple flink-kubernetes-operator releases in different namespaces, the ClusterRole 'flink-operator' will be created multiple times. Errors like
{quote}Error: INSTALLATION FAILED: rendered manifests contain a resource that already exists. Unable to continue with install: ClusterRole "flink-operator" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-namespace" must equal "c-8725bcef1dc84d6f": current value is "default"
{quote}
will be thrown. One solution could be adding the namespace as a postfix to the name of the ClusterRole. Another possible solution is to add an if/else check to avoid creating an existing resource.

> Improve the creation of ClusterRole in Flink K8s operator
> ---
>
> Key: FLINK-27109
> URL: https://issues.apache.org/jira/browse/FLINK-27109
> Project: Flink
> Issue Type: Improvement
> Components: Kubernetes Operator
> Reporter: Biao Geng
> Priority: Major
>
> As the k8s [doc|https://kubernetes.io/docs/reference/access-authn-authz/rbac/#clusterrole-example] says, ClusterRole is a non-namespaced resource.
> In our helm chart, we currently define the ClusterRole with the name 'flink-operator', and the namespace field in its metadata is omitted. As a result, if a user wants to install multiple flink-kubernetes-operator releases in different namespaces, the ClusterRole 'flink-operator' will be created multiple times.
> Errors like
> {quote}Error: INSTALLATION FAILED: rendered manifests contain a resource that already exists. Unable to continue with install: ClusterRole "flink-operator" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-namespace" must equal "c-8725bcef1dc84d6f": current value is "default"
> {quote}
> will be thrown.
> One solution could be adding the namespace as a postfix to the name of the ClusterRole.
> Another possible solution is to add an if/else check like [this|https://stackoverflow.com/questions/65110332/clusterrole-exists-and-cannot-be-imported-into-the-current-release] to avoid creating an existing resource.
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-27109) The naming pattern of ClusterRole in Flink K8s operator should consider namespace
Biao Geng created FLINK-27109: - Summary: The naming pattern of ClusterRole in Flink K8s operator should consider namespace Key: FLINK-27109 URL: https://issues.apache.org/jira/browse/FLINK-27109 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Reporter: Biao Geng As the k8s [doc|https://kubernetes.io/docs/reference/access-authn-authz/rbac/#clusterrole-example] says, a ClusterRole is a non-namespaced resource. In our helm chart, we currently define the ClusterRole with the name 'flink-operator', and the namespace field in its metadata is omitted. As a result, if a user wants to install multiple flink-kubernetes-operators in different namespaces, the ClusterRole 'flink-operator' will be created multiple times. Errors like {quote}Error: INSTALLATION FAILED: rendered manifests contain a resource that already exists. Unable to continue with install: ClusterRole "flink-operator" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-namespace" must equal "c-8725bcef1dc84d6f": current value is "default" {quote} will be thrown. One solution could be adding the namespace as a suffix to the name of the ClusterRole. Another possible solution is to add an if-else check to avoid creating an existing resource. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (FLINK-27109) Improve the creation of ClusterRole in Flink K8s operator
[ https://issues.apache.org/jira/browse/FLINK-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-27109: -- Description: As the [doc|https://kubernetes.io/docs/reference/access-authn-authz/rbac/#clusterrole-example] of k8s said, ClusterRole is one kind of non-namespaced resource. In our helm chart, we now define the ClusterRole with name 'flink-operator' and the namespace field in metadata will be omitted. As a result, if a user wants to install multiple flink-kubernetes-operator in different namespace, the ClusterRole 'flink-operator' will be created multiple times. Errors like {quote}Error: INSTALLATION FAILED: rendered manifests contain a resource that already exists. Unable to continue with install: ClusterRole "flink-operator" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-namespace" must equal "c-8725bcef1dc84d6f": current value is "default" {quote} will be thrown. Solution 1 could be adding the namespace as a postfix in the name of ClusterRole. Solution 2 is to add if else check like [this|https://stackoverflow.com/questions/65110332/clusterrole-exists-and-cannot-be-imported-into-the-current-release] to avoid creating existed resource. One important drawback of solution 2 is that when uninstalling one helm release, the created ClusterRole will be removed as well. was: As the [doc|https://kubernetes.io/docs/reference/access-authn-authz/rbac/#clusterrole-example] of k8s said, ClusterRole is one kind of non-namespaced resource. In our helm chart, we now define the ClusterRole with name 'flink-operator' and the namespace field in metadata will be omitted. As a result, if a user wants to install multiple flink-kubernetes-operator in different namespace, the ClusterRole 'flink-operator' will be created multiple times. 
Errors like {quote}Error: INSTALLATION FAILED: rendered manifests contain a resource that already exists. Unable to continue with install: ClusterRole "flink-operator" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-namespace" must equal "c-8725bcef1dc84d6f": current value is "default" {quote} will be thrown. One solution could be adding the namespace as a postfix in the name of ClusterRole. Another possible solution is to add if else check like [this|https://stackoverflow.com/questions/65110332/clusterrole-exists-and-cannot-be-imported-into-the-current-release] to avoid creating existed resource. > Improve the creation of ClusterRole in Flink K8s operator > - > > Key: FLINK-27109 > URL: https://issues.apache.org/jira/browse/FLINK-27109 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Biao Geng >Priority: Major > > As the > [doc|https://kubernetes.io/docs/reference/access-authn-authz/rbac/#clusterrole-example] > of k8s said, ClusterRole is one kind of non-namespaced resource. > In our helm chart, we now define the ClusterRole with name 'flink-operator' > and the namespace field in metadata will be omitted. As a result, if a user > wants to install multiple flink-kubernetes-operator in different namespace, > the ClusterRole 'flink-operator' will be created multiple times. > Errors like > {quote}Error: INSTALLATION FAILED: rendered manifests contain a resource that > already exists. Unable to continue with install: ClusterRole "flink-operator" > in namespace "" exists and cannot be imported into the current release: > invalid ownership metadata; annotation validation error: key > "meta.helm.sh/release-namespace" must equal "c-8725bcef1dc84d6f": current > value is "default" > {quote} > will be thrown. > Solution 1 could be adding the namespace as a postfix in the name of > ClusterRole. 
> Solution 2 is to add if else check like > [this|https://stackoverflow.com/questions/65110332/clusterrole-exists-and-cannot-be-imported-into-the-current-release] > to avoid creating existed resource. One important drawback of solution 2 is > that when uninstalling one helm release, the created ClusterRole will be > removed as well. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Closed] (FLINK-27109) Improve the creation of ClusterRole in Flink K8s operator
[ https://issues.apache.org/jira/browse/FLINK-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng closed FLINK-27109. - Resolution: Won't Fix > Improve the creation of ClusterRole in Flink K8s operator > - > > Key: FLINK-27109 > URL: https://issues.apache.org/jira/browse/FLINK-27109 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Biao Geng >Priority: Not a Priority > > As the > [doc|https://kubernetes.io/docs/reference/access-authn-authz/rbac/#clusterrole-example] > of k8s said, ClusterRole is one kind of non-namespaced resource. > In our helm chart, we now define the ClusterRole with name 'flink-operator' > and the namespace field in metadata will be omitted. As a result, if a user > wants to install multiple flink-kubernetes-operator in different namespace, > the ClusterRole 'flink-operator' will be created multiple times. > Errors like > {quote}Error: INSTALLATION FAILED: rendered manifests contain a resource that > already exists. Unable to continue with install: ClusterRole "flink-operator" > in namespace "" exists and cannot be imported into the current release: > invalid ownership metadata; annotation validation error: key > "meta.helm.sh/release-namespace" must equal "c-8725bcef1dc84d6f": current > value is "default" > {quote} > will be thrown. > Solution 1 could be adding the namespace as a postfix in the name of > ClusterRole. > Solution 2 is to add if else check like > [this|https://stackoverflow.com/questions/65110332/clusterrole-exists-and-cannot-be-imported-into-the-current-release] > to avoid creating existed resource. One important drawback of solution 2 is > that when uninstalling one helm release, the created ClusterRole will be > removed as well. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (FLINK-27109) Improve the creation of ClusterRole in Flink K8s operator
[ https://issues.apache.org/jira/browse/FLINK-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-27109: -- Summary: Improve the creation of ClusterRole in Flink K8s operator (was: The naming pattern of ClusterRole in Flink K8s operator should consider namespace) > Improve the creation of ClusterRole in Flink K8s operator > - > > Key: FLINK-27109 > URL: https://issues.apache.org/jira/browse/FLINK-27109 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Biao Geng >Priority: Major > > As the > [doc|https://kubernetes.io/docs/reference/access-authn-authz/rbac/#clusterrole-example] > of k8s said, ClusterRole is one kind of non-namespaced resource. > In our helm chart, we now define the ClusterRole with name 'flink-operator' > and the namespace field in metadata will be omitted. As a result, if a user > wants to install multiple flink-kubernetes-operator in different namespace, > the ClusterRole 'flink-operator' will be created multiple times. > Errors like > {quote}Error: INSTALLATION FAILED: rendered manifests contain a resource that > already exists. Unable to continue with install: ClusterRole "flink-operator" > in namespace "" exists and cannot be imported into the current release: > invalid ownership metadata; annotation validation error: key > "meta.helm.sh/release-namespace" must equal "c-8725bcef1dc84d6f": current > value is "default" > {quote} > will be thrown. > One solution could be adding the namespace as a postfix in the name of > ClusterRole. > Another possible solution is to add if else check to avoid creating existed > resource. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-27109) Improve the creation of ClusterRole in Flink K8s operator
[ https://issues.apache.org/jira/browse/FLINK-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17518673#comment-17518673 ] Biao Geng commented on FLINK-27109: --- cc [~morhidi] [~mbalassi] Do you have any suggestion? > Improve the creation of ClusterRole in Flink K8s operator > - > > Key: FLINK-27109 > URL: https://issues.apache.org/jira/browse/FLINK-27109 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Biao Geng >Priority: Major > > As the > [doc|https://kubernetes.io/docs/reference/access-authn-authz/rbac/#clusterrole-example] > of k8s said, ClusterRole is one kind of non-namespaced resource. > In our helm chart, we now define the ClusterRole with name 'flink-operator' > and the namespace field in metadata will be omitted. As a result, if a user > wants to install multiple flink-kubernetes-operator in different namespace, > the ClusterRole 'flink-operator' will be created multiple times. > Errors like > {quote}Error: INSTALLATION FAILED: rendered manifests contain a resource that > already exists. Unable to continue with install: ClusterRole "flink-operator" > in namespace "" exists and cannot be imported into the current release: > invalid ownership metadata; annotation validation error: key > "meta.helm.sh/release-namespace" must equal "c-8725bcef1dc84d6f": current > value is "default" > {quote} > will be thrown. > Solution 1 could be adding the namespace as a postfix in the name of > ClusterRole. > Solution 2 is to add if else check like > [this|https://stackoverflow.com/questions/65110332/clusterrole-exists-and-cannot-be-imported-into-the-current-release] > to avoid creating existed resource. One important drawback of solution 2 is > that when uninstalling one helm release, the created ClusterRole will be > removed as well. -- This message was sent by Atlassian Jira (v8.20.1#820001)
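Solution 1 above can be sketched as a small change in the helm template. The snippet below is a hypothetical illustration only (the rule list is a placeholder, not the operator's actual RBAC rules): templating the release namespace into the ClusterRole name gives each installation a unique, non-conflicting resource.

```yaml
# Hypothetical sketch of Solution 1: suffix the ClusterRole name with the
# release namespace so that installing the operator into several namespaces
# never collides on the cluster-scoped resource name.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  # Renders e.g. "flink-operator-ns-a", "flink-operator-ns-b", ...
  name: flink-operator-{{ .Release.Namespace }}
rules:
  # Placeholder rules for illustration only.
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
```

Solution 2 would instead guard the manifest (e.g. with Helm's `lookup` function) so the chart skips creation when the ClusterRole already exists, with the drawback noted in the ticket: uninstalling one release deletes the ClusterRole that other releases still rely on.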
[jira] [Created] (FLINK-27329) Add default value of replica of JM pod and remove declaring it in example yamls
Biao Geng created FLINK-27329: - Summary: Add default value of replica of JM pod and remove declaring it in example yamls Key: FLINK-27329 URL: https://issues.apache.org/jira/browse/FLINK-27329 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Reporter: Biao Geng Currently, we do not explicitly set the default value of `replica` in `JobManagerSpec`. As a result, Java sets the default value to zero. Besides, in our examples, we explicitly declare `replica` in `JobManagerSpec` to be 1. After a deeper look while debugging the exception thrown in FLINK-27310, we find it would be better to set the default value to 1 for the `replica` field and remove the declaration in the examples, for the following reasons: 1. A normal Session or Application cluster should have at least one JM. The current default value, zero, does not match the common case. 2. One JM works for k8s HA mode as well, and if users really want to launch a standby JM for faster recovery, they can declare the `replica` field in the yaml file. In the examples, just using the new default value (i.e. 1) should be fine. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (FLINK-27329) Add default value of replica of JM pod and not declare it in example yamls
[ https://issues.apache.org/jira/browse/FLINK-27329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-27329: -- Description: Currently, we do not explicitly set the default value of `replica` in `JobManagerSpec`. As a result, Java sets the default value to zero. Besides, in our examples, we explicitly declare `replica` in `JobManagerSpec` to be 1. After a deeper look while debugging the exception thrown in FLINK-27310, we find it would be better to set the default value to 1 for the `replica` field and remove the declaration in the examples, for the following reasons: 1. A normal Session or Application cluster should have at least one JM. The current default value, zero, does not match the common case. 2. One JM works for k8s HA mode as well, and if users really want to launch a standby JM for faster recovery, they can declare the value of the `replica` field in the yaml file. In the examples, just using the new default value (i.e. 1) should be fine. was: Currently, we do not explicitly set the default value of `replica` in `JobManagerSpec`. As a result, Java sets the default value to zero. Besides, in our examples, we explicitly declare `replica` in `JobManagerSpec` to be 1. After a deeper look while debugging the exception thrown in FLINK-27310, we find it would be better to set the default value to 1 for the `replica` field and remove the declaration in the examples, for the following reasons: 1. A normal Session or Application cluster should have at least one JM. The current default value, zero, does not match the common case. 2. One JM works for k8s HA mode as well, and if users really want to launch a standby JM for faster recovery, they can declare the `replica` field in the yaml file. In the examples, just using the new default value (i.e. 1) should be fine.
> Add default value of replica of JM pod and not declare it in example yamls > -- > > Key: FLINK-27329 > URL: https://issues.apache.org/jira/browse/FLINK-27329 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Biao Geng >Assignee: Biao Geng >Priority: Major > > Currently, we do not explicitly set the default value of `replica` in > `JobManagerSpec`. As a result, Java sets the default value to zero. > Besides, in our examples, we explicitly declare `replica` in `JobManagerSpec` > to be 1. > After a deeper look while debugging the exception thrown in FLINK-27310, we > find it would be better to set the default value to 1 for the `replica` field > and remove the declaration in the examples, for the following reasons: > 1. A normal Session or Application cluster should have at least one JM. The > current default value, zero, does not match the common case. > 2. One JM works for k8s HA mode as well, and if users really want to launch > a standby JM for faster recovery, they can declare the value of the `replica` > field in the yaml file. In the examples, just using the new default value (i.e. > 1) should be fine. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (FLINK-27329) Add default value of replica of JM pod and not declare it in example yamls
[ https://issues.apache.org/jira/browse/FLINK-27329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-27329: -- Summary: Add default value of replica of JM pod and not declare it in example yamls (was: Add default value of replica of JM pod and remove declaring it in example yamls) > Add default value of replica of JM pod and not declare it in example yamls > -- > > Key: FLINK-27329 > URL: https://issues.apache.org/jira/browse/FLINK-27329 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Biao Geng >Assignee: Biao Geng >Priority: Major > > Currently, we do not explicitly set the default value of `replica` in > `JobManagerSpec`. As a result, Java sets the default value to be zero. > Besides, in our examples, we explicitly declare `replica` in `JobManagerSpec` > to be 1. > After a deeper look when debugging the exception thrown in FLINK-27310, we > find it would be better to set the default value to 1 for `replica` fields > and remove the declaration in examples due to following reasons: > 1. A normal Session or Application cluster should have at least one JM. The > current default value, zero, does not follow the common case. > 2. One JM can work for k8s HA mode as well and if users really want to launch > a standby JM for faster recorvery, they can declare the `replica` field in > the yaml file. In examples, we just use the new default valu(i.e. 1) should > be fine. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-27309) Allow to load default flink configs in the k8s operator dynamically
[ https://issues.apache.org/jira/browse/FLINK-27309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17525012#comment-17525012 ] Biao Geng commented on FLINK-27309: --- Thanks [~gyfora]! Besides [~aitozi]'s valuable point, I am also thinking about whether changed default configs should take effect for a running job deployment. From the k8s side, it may make sense to treat such a change as `specChanged`, so that all running jobs are suspended and restarted. From Flink users' side, however, I am not sure it is proper to restart running jobs (i.e. to let changed default configs affect running jobs). > Allow to load default flink configs in the k8s operator dynamically > --- > > Key: FLINK-27309 > URL: https://issues.apache.org/jira/browse/FLINK-27309 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Biao Geng >Assignee: Gyula Fora >Priority: Major > > The default configs used by the k8s operator are saved in the > /opt/flink/conf dir in the k8s operator pod and are loaded only once when > the operator is created. > Since the flink k8s operator can be a long-running service and users may > want to modify the default configs (e.g. the metric reporter sampling interval) > for newly created deployments, it may be better to load the default configs > dynamically (i.e. parse the latest /opt/flink/conf/flink-conf.yaml) in the > {{ReconcilerFactory}} and {{ObserverFactory}}, instead of redeploying the > operator. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-27329) Add default value of replica of JM pod and remove declaring it in example yamls
[ https://issues.apache.org/jira/browse/FLINK-27329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17525020#comment-17525020 ] Biao Geng commented on FLINK-27329: --- cc [~wangyang0918] I can take this ticket~ > Add default value of replica of JM pod and remove declaring it in example > yamls > --- > > Key: FLINK-27329 > URL: https://issues.apache.org/jira/browse/FLINK-27329 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Biao Geng >Priority: Major > > Currently, we do not explicitly set the default value of `replica` in > `JobManagerSpec`. As a result, Java sets the default value to be zero. > Besides, in our examples, we explicitly declare `replica` in `JobManagerSpec` > to be 1. > After a deeper look when debugging the exception thrown in FLINK-27310, we > find it would be better to set the default value to 1 for `replica` fields > and remove the declaration in examples due to following reasons: > 1. A normal Session or Application cluster should have at least one JM. The > current default value, zero, does not follow the common case. > 2. One JM can work for k8s HA mode as well and if users really want to launch > a standby JM for faster recorvery, they can declare the `replica` field in > the yaml file. In examples, we just use the new default valu(i.e. 1) should > be fine. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-24960) YARNSessionCapacitySchedulerITCase.testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots ha
[ https://issues.apache.org/jira/browse/FLINK-24960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524860#comment-17524860 ] Biao Geng commented on FLINK-24960: --- Hi [~mapohl], I tried to do some research, but in the latest [failure|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=34743=logs=298e20ef-7951-5965-0e79-ea664ddc435e=d4c90338-c843-57b0-3232-10ae74f00347=36026] posted by Yun, the {{Extracted hostname:port: }} line was not shown. In the description of this ticket, though, the old CI test shows {{Extracted hostname:port: 5718b812c7ab:38622}}, which seems to be correct. I plan to first verify that {{yarnSessionClusterRunner.sendStop();}} works fine (i.e. the session cluster is stopped normally), but I have not found a way to run only the cron build's "test_cron_jdk11 misc" test on my own [azure pipeline|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=109=results], which makes the verification pretty slow and hard. Are there any guidelines about debugging the azure pipeline with some specific tests? > YARNSessionCapacitySchedulerITCase.testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots > hangs on azure > --- > > Key: FLINK-24960 > URL: https://issues.apache.org/jira/browse/FLINK-24960 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.15.0, 1.14.3 >Reporter: Yun Gao >Assignee: Niklas Semmler >Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.16.0 > > > {code:java} > Nov 18 22:37:08 > > Nov 18 22:37:08 Test > testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots(org.apache.flink.yarn.YARNSessionCapacitySchedulerITCase) > is running. 
> Nov 18 22:37:08 > > Nov 18 22:37:25 22:37:25,470 [main] INFO > org.apache.flink.yarn.YARNSessionCapacitySchedulerITCase [] - Extracted > hostname:port: 5718b812c7ab:38622 > Nov 18 22:52:36 > == > Nov 18 22:52:36 Process produced no output for 900 seconds. > Nov 18 22:52:36 > == > {code} > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=26722=logs=f450c1a5-64b1-5955-e215-49cb1ad5ec88=cc452273-9efa-565d-9db8-ef62a38a0c10=36395 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-27329) Add default value of replica of JM pod and not declare it in example yamls
[ https://issues.apache.org/jira/browse/FLINK-27329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17525432#comment-17525432 ] Biao Geng commented on FLINK-27329: --- [~wangyang0918] Reviewing all default values makes sense to me. I will check the specs under the {{org.apache.flink.kubernetes.operator.crd.spec}} package to find whether there are other fields like the JM replica that do not have a proper default value. > Add default value of replica of JM pod and not declare it in example yamls > -- > > Key: FLINK-27329 > URL: https://issues.apache.org/jira/browse/FLINK-27329 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Biao Geng >Assignee: Biao Geng >Priority: Major > > Currently, we do not explicitly set the default value of `replica` in > `JobManagerSpec`. As a result, Java sets the default value to zero. > Besides, in our examples, we explicitly declare `replica` in `JobManagerSpec` > to be 1. > After a deeper look while debugging the exception thrown in FLINK-27310, we > find it would be better to set the default value to 1 for the `replica` field > and remove the declaration in the examples, for the following reasons: > 1. A normal Session or Application cluster should have at least one JM. The > current default value, zero, does not match the common case. > 2. One JM works for k8s HA mode as well, and if users really want to launch > a standby JM for faster recovery, they can declare the value of the `replica` > field in the yaml file. In the examples, just using the new default value (i.e. > 1) should be fine. -- This message was sent by Atlassian Jira (v8.20.7#820007)
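The proposed change can be sketched in a few lines of Java. This is a hypothetical illustration of the idea, not the operator's actual `JobManagerSpec` class: giving the field an explicit initializer replaces Java's implicit default of 0 for an uninitialized `int`.

```java
// Hypothetical sketch: an explicit default of 1 for the JobManager replica
// count, so that an omitted field no longer falls back to Java's implicit 0.
class JobManagerSpec {
    // Default to a single JobManager; users can still override this in yaml,
    // e.g. to run a standby JM for faster recovery under k8s HA.
    private int replicas = 1;

    int getReplicas() {
        return replicas;
    }

    void setReplicas(int replicas) {
        this.replicas = replicas;
    }
}
```

With such a default, the `replica` declaration can be dropped from the example yamls, while a yaml that does set the field still takes precedence because deserialization goes through the setter.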
[jira] [Comment Edited] (FLINK-27310) FlinkOperatorITCase failure due to JobManager replicas less than one
[ https://issues.apache.org/jira/browse/FLINK-27310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524291#comment-17524291 ] Biao Geng edited comment on FLINK-27310 at 4/19/22 1:00 PM: It seems that this ITCase is not really executed due to its naming (it does not follow the *Test* pattern). For the error in the posted log, I believe it is because we have not added {{jm.setReplicas(1);}} to set the replica of the JM. was (Author: bgeng777): It seems that this ITCase is not really executed due to its naming (it does not follow the *Test* pattern). For the error in the posted log, I believe it is because we have not added {{jm.setReplicas(1);}} to set the replica of the JM. > FlinkOperatorITCase failure due to JobManager replicas less than one > > > Key: FLINK-27310 > URL: https://issues.apache.org/jira/browse/FLINK-27310 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Reporter: Usamah Jassat >Priority: Minor > > The FlinkOperatorITCase test is currently failing, even in the CI pipeline > > {code:java} > INFO] Running org.apache.flink.kubernetes.operator.FlinkOperatorITCase > > > > Error: Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: > 3.178 s <<< FAILURE! - in > org.apache.flink.kubernetes.operator.FlinkOperatorITCase > > > > Error: org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test > Time elapsed: 2.664 s <<< ERROR! > > > > io.fabric8.kubernetes.client.KubernetesClientException: Failure > executing: POST at: > https://192.168.49.2:8443/apis/flink.apache.org/v1beta1/namespaces/flink-operator-test/flinkdeployments. > Message: Forbidden! User minikube doesn't have permission. admission webhook > "vflinkdeployments.flink.apache.org" denied the request: JobManager replicas > should not be configured less than one.. 
> > > > at > flink.kubernetes.operator@1.0-SNAPSHOT/org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test(FlinkOperatorITCase.java:86){code} > > While the test is failing the CI test run is passing which also should be > fixed then to fail on the test failure. -- This message was sent by Atlassian Jira (v8.20.7#820007)
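The naming problem mentioned in the comment above comes from Maven's default test discovery: Surefire only picks up classes matching patterns such as *Test, so a class named FlinkOperatorITCase is silently skipped unless it is wired up explicitly. A hypothetical sketch of one common fix (the plugin configuration below is illustrative, not the project's actual build setup) is to run *ITCase classes through the Failsafe plugin:

```xml
<!-- Hypothetical sketch: bind *ITCase classes to the Failsafe plugin so they
     actually run during the build instead of being skipped by Surefire's
     default *Test include pattern. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-failsafe-plugin</artifactId>
  <configuration>
    <includes>
      <include>**/*ITCase.java</include>
    </includes>
  </configuration>
  <executions>
    <execution>
      <goals>
        <goal>integration-test</goal>
        <!-- The "verify" goal makes the build fail on test failures, which
             would also address the ticket's second point: CI currently
             passes even though the test fails. -->
        <goal>verify</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```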
[jira] [Comment Edited] (FLINK-27310) FlinkOperatorITCase failure due to JobManager replicas less than one
[ https://issues.apache.org/jira/browse/FLINK-27310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524291#comment-17524291 ] Biao Geng edited comment on FLINK-27310 at 4/19/22 1:00 PM: It seems that this ITCase is not really executed due to its naming(not follow *Test{*}*{*} pattern). For the error in the posted log, I believe it is because we have not added {{jm.setReplicas(1);}} to set the the replica of JM. was (Author: bgeng777): It seems that this ITCase is not really executed due to its naming(not follow *Test* pattern). For the error in the posted log, I believe it is because we have not added {{jm.setReplicas(1);}} to set the the replica of JM. > FlinkOperatorITCase failure due to JobManager replicas less than one > > > Key: FLINK-27310 > URL: https://issues.apache.org/jira/browse/FLINK-27310 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Reporter: Usamah Jassat >Priority: Minor > > The FlinkOperatorITCase test is currently failing, even in the CI pipeline > > {code:java} > INFO] Running org.apache.flink.kubernetes.operator.FlinkOperatorITCase > > > > Error: Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: > 3.178 s <<< FAILURE! - in > org.apache.flink.kubernetes.operator.FlinkOperatorITCase > > > > Error: org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test > Time elapsed: 2.664 s <<< ERROR! > > > > io.fabric8.kubernetes.client.KubernetesClientException: Failure > executing: POST at: > https://192.168.49.2:8443/apis/flink.apache.org/v1beta1/namespaces/flink-operator-test/flinkdeployments. > Message: Forbidden! User minikube doesn't have permission. admission webhook > "vflinkdeployments.flink.apache.org" denied the request: JobManager replicas > should not be configured less than one.. 
> > > > at > flink.kubernetes.operator@1.0-SNAPSHOT/org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test(FlinkOperatorITCase.java:86){code} > > While the test is failing the CI test run is passing which also should be > fixed then to fail on the test failure. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Comment Edited] (FLINK-27310) FlinkOperatorITCase failure due to JobManager replicas less than one
[ https://issues.apache.org/jira/browse/FLINK-27310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524291#comment-17524291 ] Biao Geng edited comment on FLINK-27310 at 4/19/22 1:01 PM: It seems that this ITCase is not really executed due to its naming(not follow *Test* pattern). For the error in the posted log, I believe it is because we have not added {{jm.setReplicas(1);}} to set the the replica of JM. was (Author: bgeng777): It seems that this ITCase is not really executed due to its naming(not follow *Test{*}*{*} pattern). For the error in the posted log, I believe it is because we have not added {{jm.setReplicas(1);}} to set the the replica of JM. > FlinkOperatorITCase failure due to JobManager replicas less than one > > > Key: FLINK-27310 > URL: https://issues.apache.org/jira/browse/FLINK-27310 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Reporter: Usamah Jassat >Priority: Minor > > The FlinkOperatorITCase test is currently failing, even in the CI pipeline > > {code:java} > INFO] Running org.apache.flink.kubernetes.operator.FlinkOperatorITCase > > > > Error: Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: > 3.178 s <<< FAILURE! - in > org.apache.flink.kubernetes.operator.FlinkOperatorITCase > > > > Error: org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test > Time elapsed: 2.664 s <<< ERROR! > > > > io.fabric8.kubernetes.client.KubernetesClientException: Failure > executing: POST at: > https://192.168.49.2:8443/apis/flink.apache.org/v1beta1/namespaces/flink-operator-test/flinkdeployments. > Message: Forbidden! User minikube doesn't have permission. admission webhook > "vflinkdeployments.flink.apache.org" denied the request: JobManager replicas > should not be configured less than one.. 
> at flink.kubernetes.operator@1.0-SNAPSHOT/org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test(FlinkOperatorITCase.java:86){code}
> While the test is failing, the CI run still passes; the CI should also be fixed so that it fails on this test failure.
-- This message was sent by Atlassian Jira (v8.20.7#820007)
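For context on the naming point in the comment above: Maven Surefire's default includes only pick up classes named like `*Test`, `Test*`, `*Tests`, or `*TestCase`, while `*ITCase` classes are conventionally run by the Failsafe plugin instead. A minimal sketch of a Failsafe configuration that would execute this class follows; the plugin version and the exact includes are illustrative assumptions, not taken from the operator's actual pom.xml:

```xml
<!-- Hypothetical pom.xml fragment: binds maven-failsafe-plugin so that
     classes matching *ITCase.java run during the integration-test phase.
     The version number here is illustrative only. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-failsafe-plugin</artifactId>
  <version>3.0.0-M7</version>
  <configuration>
    <includes>
      <include>**/*ITCase.java</include>
    </includes>
  </configuration>
  <executions>
    <execution>
      <goals>
        <goal>integration-test</goal>
        <goal>verify</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```

With such a binding in place, `mvn verify` would run the ITCase and fail the build on the error above, which addresses the "CI passes while the test fails" symptom.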
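The admission-webhook rejection quoted in the log comes from a replica-count validation on the FlinkDeployment spec. A minimal self-contained sketch of that kind of check is below; the class and method names are illustrative, and only the error string is taken from the posted log (the real validator lives in the operator's webhook code):

```java
import java.util.Optional;

public class JobManagerReplicaCheck {
    // Mirrors the rule quoted in the webhook error: JobManager replicas
    // must be configured to at least one. Returns an error message when
    // the configured value is invalid, empty otherwise.
    static Optional<String> validateReplicas(int replicas) {
        if (replicas < 1) {
            return Optional.of(
                "JobManager replicas should not be configured less than one.");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // An unset replica count defaults to 0 and is rejected,
        // which is why the test needs jm.setReplicas(1).
        System.out.println(validateReplicas(0).orElse("OK"));
        System.out.println(validateReplicas(1).orElse("OK")); // prints "OK"
    }
}
```

This illustrates why the fix in the comment is simply to call `jm.setReplicas(1);` on the JobManager spec before submitting the deployment: with no value set, the spec never passes the webhook.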