[jira] [Commented] (FLINK-33694) GCS filesystem does not respect gs.storage.root.url config option
[ https://issues.apache.org/jira/browse/FLINK-33694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792901#comment-17792901 ] Patrick Lucas commented on FLINK-33694: --- I've updated the PR to more explicitly look for the specific Hadoop connector config for this option instead of defining it as a Flink {{ConfigOption}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-33694) GCS filesystem does not respect gs.storage.root.url config option
[ https://issues.apache.org/jira/browse/FLINK-33694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792050#comment-17792050 ] Patrick Lucas commented on FLINK-33694: --- [~martijnvisser] my comment is perhaps not precise, in that "support has been added to configure credentials correctly" should be qualified with the comment from the docs about which credentials mechanisms are supported. But it is still true that other potentially interesting options are not proxied through, such as the root URL. This change as written doesn't affect any credentials handling; it only adds support for this one additional option. However, I could see an argument for implementing the behavior in {{org.apache.flink.fs.gs.utils.ConfigUtils}}, as the credentials behavior is, rather than in {{GSFileSystemOptions}} as I did to start with.
[jira] [Commented] (FLINK-33694) GCS filesystem does not respect gs.storage.root.url config option
[ https://issues.apache.org/jira/browse/FLINK-33694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791670#comment-17791670 ] Patrick Lucas commented on FLINK-33694: --- [~martijnvisser] aye, I have a PR open, but Azure needs a re-run because I missed a license header. I'm not sure if flinkbot listens to me for commands.
[jira] [Created] (FLINK-33694) GCS filesystem does not respect gs.storage.root.url config option
Patrick Lucas created FLINK-33694: - Summary: GCS filesystem does not respect gs.storage.root.url config option Key: FLINK-33694 URL: https://issues.apache.org/jira/browse/FLINK-33694 Project: Flink Issue Type: Bug Components: FileSystems Affects Versions: 1.17.2, 1.18.0 Reporter: Patrick Lucas The GCS FileSystem's RecoverableWriter implementation uses the GCS SDK directly rather than going through Hadoop. While support has been added to configure credentials correctly based on the standard Hadoop implementation configuration, no other options are passed through to the underlying client. Because this only affects the RecoverableWriter-related codepaths, it can result in very surprising differing behavior whether the FileSystem is being used as a source or a sink—while a {{{}gs://{}}}-URI FileSource may work fine, a {{{}gs://{}}}-URI FileSink may not work at all. We use [fake-gcs-server|https://github.com/fsouza/fake-gcs-server] in testing, and so we override the Hadoop GCS FileSystem config option {{{}gs.storage.root.url{}}}. However, because this option is not considered when creating the GCS client for the RecoverableWriter codepath, in a FileSink the GCS FileSystem attempts to write to the real GCS service rather than fake-gcs-server. At the same time, a FileSource works as expected, reading from fake-gcs-server. The fix should be fairly straightforward, reading the {{gs.storage.root.url}} config option from the Hadoop FileSystem config in [{{GSFileSystemOptions}}|https://github.com/apache/flink/blob/release-1.18.0/flink-filesystems/flink-gs-fs-hadoop/src/main/java/org/apache/flink/fs/gs/GSFileSystemOptions.java#L30] and, if set, passing it to {{storageOptionsBuilder}} in [{{GSFileSystemFactory}}|https://github.com/apache/flink/blob/release-1.18.0/flink-filesystems/flink-gs-fs-hadoop/src/main/java/org/apache/flink/fs/gs/GSFileSystemFactory.java]. The only workaround for this is to build a custom flink-gs-fs-hadoop JAR with a patch and use it as a plugin. 
-- This message was sent by Atlassian Jira (v8.20.10#820010)
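The pass-through described in the issue (read {{gs.storage.root.url}} from the Hadoop config and, only if set, override the endpoint on the storage options builder) can be sketched with plain-Java stand-ins. The class and method names below are illustrative, not Flink's actual code; the real implementation would read {{org.apache.hadoop.conf.Configuration}} and call the GCS SDK's {{StorageOptions.Builder}}.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class RootUrlPassThroughSketch {
    // Stand-in for reading gs.storage.root.url from the Hadoop FileSystem config.
    static Optional<String> storageRootUrl(Map<String, String> hadoopConfig) {
        return Optional.ofNullable(hadoopConfig.get("gs.storage.root.url"));
    }

    // Stand-in for the factory wiring: only override the endpoint when the
    // Hadoop option is actually set, so the SDK default (the real GCS
    // service) is kept otherwise.
    static String resolveEndpoint(Map<String, String> hadoopConfig) {
        return storageRootUrl(hadoopConfig).orElse("https://storage.googleapis.com");
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Without the option: the RecoverableWriter client talks to real GCS.
        System.out.println(resolveEndpoint(conf));
        // With the option: both source and sink codepaths hit fake-gcs-server.
        conf.put("gs.storage.root.url", "http://fake-gcs-server:4443");
        System.out.println(resolveEndpoint(conf));
    }
}
```

The key point is the conditional: the override must be applied only when the option is present, matching the Hadoop connector's default behavior in every other case.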
[jira] [Commented] (FLINK-32722) 239 exit code in flink-runtime
[ https://issues.apache.org/jira/browse/FLINK-32722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17749863#comment-17749863 ] Patrick Lucas commented on FLINK-32722: --- Possibly related: FLINK-18290 > 239 exit code in flink-runtime > -- > > Key: FLINK-32722 > URL: https://issues.apache.org/jira/browse/FLINK-32722 > Project: Flink > Issue Type: Bug > Components: Test Infrastructure >Affects Versions: 1.16.2 >Reporter: Matthias Pohl >Priority: Major > Labels: test-stability > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=51852&view=logs&j=d89de3df-4600-5585-dadc-9bbc9a5e661c&t=be5a4b15-4b23-56b1-7582-795f58a645a2&l=8418 > {code:java} > Aug 01 01:03:49 [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M5:test (default-test) > on project flink-runtime: There are test failures. > Aug 01 01:03:49 [ERROR] > Aug 01 01:03:49 [ERROR] Please refer to > /__w/2/s/flink-runtime/target/surefire-reports for the individual test > results. > Aug 01 01:03:49 [ERROR] Please refer to dump files (if any exist) > [date].dump, [date]-jvmRun[N].dump and [date].dumpstream. > Aug 01 01:03:49 [ERROR] ExecutionException The forked VM terminated without > properly saying goodbye. VM crash or System.exit called? > Aug 01 01:03:49 [ERROR] Command was /bin/sh -c cd /__w/2/s/flink-runtime && > /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -XX:+UseG1GC -Xms256m -Xmx768m > -jar > /__w/2/s/flink-runtime/target/surefire/surefirebooter1803120121827605294.jar > /__w/2/s/flink-runtime/target/surefire 2023-08-01T00-58-17_520-jvmRun1 > surefire9107652818401825168tmp surefire_261701267003130520249tmp > Aug 01 01:03:49 [ERROR] Error occurred in starting fork, check output in log > Aug 01 01:03:49 [ERROR] Process Exit Code: 239 > Aug 01 01:03:49 [ERROR] > org.apache.maven.surefire.booter.SurefireBooterForkException: > ExecutionException The forked VM terminated without properly saying goodbye. > VM crash or System.exit called? 
> Aug 01 01:03:49 [ERROR] Command was /bin/sh -c cd /__w/2/s/flink-runtime && > /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -XX:+UseG1GC -Xms256m -Xmx768m > -jar > /__w/2/s/flink-runtime/target/surefire/surefirebooter1803120121827605294.jar > /__w/2/s/flink-runtime/target/surefire 2023-08-01T00-58-17_520-jvmRun1 > surefire9107652818401825168tmp surefire_261701267003130520249tmp > Aug 01 01:03:49 [ERROR] Error occurred in starting fork, check output in log > Aug 01 01:03:49 [ERROR] Process Exit Code: 239 > Aug 01 01:03:49 [ERROR] at > org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:532) > Aug 01 01:03:49 [ERROR] at > org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkOnceMultiple(ForkStarter.java:405) > Aug 01 01:03:49 [ERROR] at > org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:321) > Aug 01 01:03:49 [ERROR] at > org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:266) > Aug 01 01:03:49 [ERROR] at > org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1314) > Aug 01 01:03:49 [ERROR] at > org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1159) > Aug 01 01:03:49 [ERROR] at > org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:932) > Aug 01 01:03:49 [ERROR] at > org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:132) > Aug 01 01:03:49 [ERROR] at > org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208) > Aug 01 01:03:49 [ERROR] at > org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153) > Aug 01 01:03:49 [ERROR] at > org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145) > Aug 01 01:03:49 [ERROR] at > 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116) > Aug 01 01:03:49 [ERROR] at > org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80) > Aug 01 01:03:49 [ERROR] at > org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51) > Aug 01 01:03:49 [ERROR] at > org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:120) > Aug 01 01:03:49 [ERROR] at > org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:355) > Aug 01 01:03:49 [ERROR] at > org.apache.maven.DefaultMaven.execute(DefaultMaven.java:155) > Aug 01 01:03:49 [ERROR] at > org.apache.maven.cli.MavenCli.execute(MavenCli.java:584) > Aug 01 01:03:49 [ERROR] at > org.apache.maven.cli.MavenCli.doMain(MavenCli.java:216) > Aug 01
[jira] [Comment Edited] (FLINK-32583) RestClient can deadlock if request made after Netty event executor terminated
[ https://issues.apache.org/jira/browse/FLINK-32583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17749829#comment-17749829 ] Patrick Lucas edited comment on FLINK-32583 at 8/1/23 2:33 PM: --- [~mapohl] regarding the release note, I just wanted to clarify something. {quote}Fixes a race condition in the RestClient that could happen when submitting a request while closing the client. There was a small chance that the request submission wouldn't complete.{quote} That is one effect of this change, but the primary problem it solves is not a race condition: if a request is made _any time after_ the client is closed, the future will definitely never be resolved, and no exception is thrown. If a caller were to call {{.join()}} on the returned {{CompletableFuture}}—which does not take a timeout—the caller would block forever. I think this is a more serious bug than the very rare edge case that required most of the complexity in the change. Suggested release note: {quote}Fixes a bug in the RestClient where the response future of a request made after the client was closed is never completed.{quote} or {quote}Fixes a bug in the RestClient where making a request after the client was closed returns a future that never completes.{quote}
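The failure mode discussed above (a returned future that nothing will ever complete) can be reproduced with a plain {{CompletableFuture}}. This is a minimal stdlib sketch, not Flink code; it also shows why the bounded-wait workaround at least surfaces an error instead of blocking forever.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class NeverCompletedFutureDemo {
    public static void main(String[] args) throws Exception {
        // Models the future RestClient returns when the completion listener
        // was silently rejected: nothing will ever complete it.
        CompletableFuture<String> response = new CompletableFuture<>();

        // response.join() here would block forever, since join() takes no
        // timeout. A bounded get() at least turns the hang into an exception.
        try {
            response.get(100, TimeUnit.MILLISECONDS);
            System.out.println("completed");
        } catch (TimeoutException e) {
            System.out.println("timed out waiting for response");
        }
    }
}
```

As the issue notes, the downside of the timeout workaround is that a {{TimeoutException}} carries no information about why the request never completed.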
[jira] [Commented] (FLINK-32583) RestClient can deadlock if request made after Netty event executor terminated
[ https://issues.apache.org/jira/browse/FLINK-32583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17749829#comment-17749829 ] Patrick Lucas commented on FLINK-32583: --- [~mapohl] regarding the release note, I just wanted to clarify something. {quote}Fixes a race condition in the RestClient that could happen when submitting a request while closing the client. There was a small chance that the request submission wouldn't complete.{quote} That is one effect of this change, but the primary problem it solves is not a race condition, but that if a request is made _any time after_ the client is closed, then the future will definitely never be resolved, and no exception thrown. If a caller were to call `.join()` on the returned `CompletableFuture`—which does not take a timeout—the caller will actually block forever. I think this is the more serious bug being fixed here than the very rare edge case that required most of the complexity in the change. Suggested release note: {quote}Fixes a bug in the RestClient where the response future of a request made after the client was closed is never completed.{quote} or {quote}Fixes a bug in the RestClient where making a request after the client was closed returns a future that never completes.{quote}
[jira] [Commented] (FLINK-32583) RestClient can deadlock if request made after Netty event executor terminated
[ https://issues.apache.org/jira/browse/FLINK-32583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742435#comment-17742435 ] Patrick Lucas commented on FLINK-32583: --- Further investigation suggests that this can only happen when a request is made after the RestClient itself is closed, which asynchronously shuts down the event loop used by Netty. Many uses of RestClient use try-with-resources, where this wouldn't be an issue, but the client should still behave correctly in other contexts where a request may be attempted after shutdown has already started.
[jira] [Created] (FLINK-32583) RestClient can deadlock if request made after Netty event executor terminated
Patrick Lucas created FLINK-32583: - Summary: RestClient can deadlock if request made after Netty event executor terminated Key: FLINK-32583 URL: https://issues.apache.org/jira/browse/FLINK-32583 Project: Flink Issue Type: Bug Components: Runtime / REST Affects Versions: 1.18.0, 1.16.3, 1.17.2 Reporter: Patrick Lucas The RestClient can deadlock if a request is made after the Netty event executor has terminated. This is due to the listener that would resolve the CompletableFuture that is attached to the ChannelFuture returned by the call to Netty to connect not being able to run because the executor to run it rejects the execution. [RestClient.java|https://github.com/apache/flink/blob/release-1.17.1/flink-runtime/src/main/java/org/apache/flink/runtime/rest/RestClient.java#L471-L482]: {code:java} final ChannelFuture connectFuture = bootstrap.connect(targetAddress, targetPort); final CompletableFuture channelFuture = new CompletableFuture<>(); connectFuture.addListener( (ChannelFuture future) -> { if (future.isSuccess()) { channelFuture.complete(future.channel()); } else { channelFuture.completeExceptionally(future.cause()); } }); {code} In this code, the call to {{addListener()}} can fail silently (only logging to the console), meaning any code waiting on the CompletableFuture returned by this method will deadlock. 
There was some work in Netty around this back in 2015, but it's unclear to me how this situation is expected to be handled given the discussion and changes from these issues: * [https://github.com/netty/netty/issues/3449] (open) * [https://github.com/netty/netty/pull/3483] (closed) * [https://github.com/netty/netty/pull/3566] (closed) * [https://github.com/netty/netty/pull/5087] (merged) I think a reasonable fix for Flink would be to check the state of {{connectFuture}} and {{channelFuture}} immediately after the call to {{addListener()}}, resolving {{channelFuture}} with {{completeExceptionally()}} if {{connectFuture}} is done and failed and {{channelFuture}} has not been completed. In the possible race condition where the listener was attached successfully and the connection fails instantly, the result is the same, as calls to {{CompletableFuture#completeExceptionally()}} are idempotent. A workaround for users of RestClient is to call {{CompletableFuture#get(long timeout, TimeUnit unit)}} rather than {{#get()}} or {{#join()}} on the CompletableFutures it returns. However, if the call throws TimeoutException, the cause of the failure cannot easily be determined. -- This message was sent by Atlassian Jira (v8.20.10#820010)
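The guard proposed in this issue (re-check the connect future's state immediately after attaching the listener, and complete the channel future exceptionally if the connect already failed) can be sketched with {{CompletableFuture}} standing in for Netty's {{ChannelFuture}}. Names are illustrative, and note that a plain {{CompletableFuture}} always runs its callbacks, so the guard branch here exists only to show the shape of the fix.

```java
import java.util.concurrent.CompletableFuture;

public class ConnectGuardSketch {
    static CompletableFuture<String> connect(CompletableFuture<String> connectFuture) {
        CompletableFuture<String> channelFuture = new CompletableFuture<>();
        // "Listener": mirrors the connect result into channelFuture. In the
        // real RestClient this listener can be silently rejected once the
        // Netty event executor has terminated.
        connectFuture.whenComplete((channel, cause) -> {
            if (cause == null) {
                channelFuture.complete(channel);
            } else {
                channelFuture.completeExceptionally(cause);
            }
        });
        // Proposed guard: if the connect future is already failed and
        // channelFuture is still dangling, fail it here. Completion is
        // first-wins, so racing with the listener is harmless.
        if (connectFuture.isCompletedExceptionally() && !channelFuture.isDone()) {
            channelFuture.completeExceptionally(
                    new IllegalStateException("connect failed before listener ran"));
        }
        return channelFuture;
    }

    public static void main(String[] args) {
        CompletableFuture<String> failed = new CompletableFuture<>();
        failed.completeExceptionally(new RuntimeException("connect refused"));
        // The caller always gets a completed (here: failed) future, never a
        // future that hangs forever.
        System.out.println(connect(failed).isCompletedExceptionally());
    }
}
```

This mirrors the idempotency argument in the issue: even if both the listener and the guard fire, only the first {{completeExceptionally()}} takes effect.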
[jira] [Commented] (FLINK-13703) AvroTypeInfo requires objects to be strict POJOs (mutable, with setters)
[ https://issues.apache.org/jira/browse/FLINK-13703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363519#comment-17363519 ] Patrick Lucas commented on FLINK-13703: --- Hit a very similar issue (though didn't think it was the same at first), when using the option `gettersReturnOptional`. With this option, all getters return Optional (mainly useful along with `optionalGettersForNullableFieldsOnly` to limit it to only nullable fields), which breaks Flink's POJO analyzer in a similar way and results in the same exception. I never hit this before because I've tended to not have nullable fields. > AvroTypeInfo requires objects to be strict POJOs (mutable, with setters) > > > Key: FLINK-13703 > URL: https://issues.apache.org/jira/browse/FLINK-13703 > Project: Flink > Issue Type: Improvement > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) >Reporter: Alexander Fedulov >Priority: Minor > > There exists an option to generate Avro sources which would represent > immutable objects (`createSetters` option set to false) > [\[1\]|https://github.com/commercehub-oss/gradle-avro-plugin] , > [\[2\]|https://avro.apache.org/docs/current/api/java/org/apache/avro/mojo/AbstractAvroMojo.html]. > Those objects still have full arguments constructors and are being correctly > dealt with by Avro. > `AvroTypeInfo` in Flink performs a check to verify if a Class complies to > the strict POJO requirements (including setters) and throws an > IllegalStateException("Expecting type to be a PojoTypeInfo") otherwise. Can > this check be relaxed to provide better immutability support? > +Steps to reproduce:+ > 1) Generate Avro sources from schema using `createSetters` option. > 2) Use generated class in > `ConfluentRegistryAvroDeserializationSchema.forSpecific(GeneratedClass.class, > schemaRegistryUrl)` -- This message was sent by Atlassian Jira (v8.3.4#803005)
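For reference, the generator options discussed in this thread are set in the gradle-avro-plugin's extension block. This is a sketch only; the exact DSL varies by plugin version, so verify the option names against the plugin documentation for the version in use.

```groovy
// build.gradle — gradle-avro-plugin generator options (illustrative sketch)
avro {
    createSetters = false                         // immutable-style generated classes
    gettersReturnOptional = true                  // getters return Optional<T>
    optionalGettersForNullableFieldsOnly = true   // limit Optional to nullable fields
}
```

Any of these settings can break Flink's POJO analysis of the generated classes, as described above.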
[jira] [Commented] (FLINK-13703) AvroTypeInfo requires objects to be strict POJOs (mutable, with setters)
[ https://issues.apache.org/jira/browse/FLINK-13703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290867#comment-17290867 ] Patrick Lucas commented on FLINK-13703: --- I ran into this as well and was thrown off by the confusing error message—glad I found this issue to help explain it, thanks [~afedulov]. I have more experience with protobuf which tends to be immutable-by-default, so I set Avro's {{createSetters}} flag when starting out. I'll undo that for now, but supporting immutable specific records would still be nice. > AvroTypeInfo requires objects to be strict POJOs (mutable, with setters) > > > Key: FLINK-13703 > URL: https://issues.apache.org/jira/browse/FLINK-13703 > Project: Flink > Issue Type: Improvement > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) >Reporter: Alexander Fedulov >Priority: Minor > > There exists an option to generate Avro sources which would represent > immutable objects (`createSetters` option set to false) > [\[1\]|https://github.com/commercehub-oss/gradle-avro-plugin] , > [\[2\]|https://avro.apache.org/docs/current/api/java/org/apache/avro/mojo/AbstractAvroMojo.html]. > Those objects still have full arguments constructors and are being correctly > dealt with by Avro. > `AvroTypeInfo` in Flink performs a check to verify if a Class complies to > the strict POJO requirements (including setters) and throws an > IllegalStateException("Expecting type to be a PojoTypeInfo") otherwise. Can > this check be relaxed to provide better immutability support? > +Steps to reproduce:+ > 1) Generate Avro sources from schema using `createSetters` option. > 2) Use generated class in > `ConfluentRegistryAvroDeserializationSchema.forSpecific(GeneratedClass.class, > schemaRegistryUrl)` -- This message was sent by Atlassian Jira (v8.3.4#803005)
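For context, a minimal sketch of the shape Flink's POJO analysis expects (the class and field names here are hypothetical): a public class with a public no-argument constructor and conventional getter/setter pairs.

```java
// Hypothetical example of a class that passes Flink's POJO analysis.
// Dropping the setter (createSetters=false) or making the getter return
// Optional (gettersReturnOptional) breaks the convention, and AvroTypeInfo
// then fails with "Expecting type to be a PojoTypeInfo".
public class Event {
    private String id;

    public Event() {}  // public no-arg constructor required

    public String getId() { return id; }            // conventional getter
    public void setId(String id) { this.id = id; }  // conventional setter
}
```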
[jira] [Commented] (FLINK-19159) Using Scalafmt to format scala source code
[ https://issues.apache.org/jira/browse/FLINK-19159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192174#comment-17192174 ] Patrick Lucas commented on FLINK-19159: --- [~aljoscha] some plugins, like [Spotless|https://github.com/diffplug/spotless/tree/main/plugin-maven] (which supports Maven & scalafmt), have an option to only enforce style on a file when that file has changed from a given Git ref (["ratchet"|https://github.com/diffplug/spotless/tree/main/plugin-maven#ratchet]). There would still be some noise as files are formatted, chiefly impacting large files, but maybe that's a compromise that works for Flink? > Using Scalafmt to format scala source code > -- > > Key: FLINK-19159 > URL: https://issues.apache.org/jira/browse/FLINK-19159 > Project: Flink > Issue Type: Improvement >Reporter: darion yaphet >Priority: Minor > > Scalafmt is a code formatter for Scala. It can help developers avoid code > style conflicts. -- This message was sent by Atlassian Jira (v8.3.4#803005)
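A sketch of what that could look like with the Spotless Maven plugin (the version number and ref are illustrative; `ratchetFrom` is the "ratchet" option described above):

```xml
<plugin>
  <groupId>com.diffplug.spotless</groupId>
  <artifactId>spotless-maven-plugin</artifactId>
  <version>2.0.0</version>
  <configuration>
    <!-- only enforce formatting on files changed relative to this Git ref -->
    <ratchetFrom>origin/master</ratchetFrom>
    <scala>
      <scalafmt/>
    </scala>
  </configuration>
</plugin>
```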
[jira] [Commented] (FLINK-17324) Move the image we use to generate the docker official-images changes into flink-docker
[ https://issues.apache.org/jira/browse/FLINK-17324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089631#comment-17089631 ] Patrick Lucas commented on FLINK-17324: --- That image is just a convenience wrapper for [bashbrew|https://github.com/docker-library/official-images/tree/master/bashbrew]. Here's the whole Dockerfile, assuming the binary {{bashbrew-amd64}} is in the directory: {code} FROM alpine RUN apk --no-cache add bash git COPY bashbrew-amd64 /usr/local/bin/bashbrew {code} There are definitely plenty of other options for how to make this available to the {{generate-stackbrew-library.sh}} script, I'm really not opinionated on how it gets done. The tool is used to enumerate the architectures supported by the base image referenced in the FROM line of the Flink images, and there might also be another way to accomplish the same thing. > Move the image we use to generate the docker official-images changes into > flink-docker > -- > > Key: FLINK-17324 > URL: https://issues.apache.org/jira/browse/FLINK-17324 > Project: Flink > Issue Type: Improvement > Components: Release System / Docker >Reporter: Ismaël Mejía >Assignee: Ismaël Mejía >Priority: Minor > > Before the docker official image was repatriated into Apache Flink we used a > docker image that contained the scripts to generate the release here: > https://github.com/apache/flink-docker/blob/e0107b0f85d9b3630db39568729e36408cdc7b78/generate-stackbrew-library-docker.sh#L5. > {quote}docker run --rm > --volume ~/projects/docker-flink:/build > plucas/docker-flink-build > /{{build/generate-stackbrew-library.sh > ~/projects/official-images > /library/flink}} > {quote} > Notice that this docker image tool 'plucas/docker-flink-build' is not part of > upstream Flink so we need to move it there into some sort of tools section in > the flink-docker repo or document an alternative to it. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-15828) Integrate docker-flink/docker-flink into Flink release process
[ https://issues.apache.org/jira/browse/FLINK-15828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Lucas closed FLINK-15828. - We can continue to improve this integration in new issues. > Integrate docker-flink/docker-flink into Flink release process > -- > > Key: FLINK-15828 > URL: https://issues.apache.org/jira/browse/FLINK-15828 > Project: Flink > Issue Type: Improvement > Components: Deployment / Docker, Release System >Reporter: Ufuk Celebi >Priority: Major > > This ticket tracks the first phase of Flink Docker image build consolidation. > The goal of this story is to integrate Docker image publication with the > Flink release process and provide convenience packages of released Flink > artifacts on DockerHub. > For more details, check the > [DISCUSS|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Integrate-Flink-Docker-image-publication-into-Flink-release-process-td36139.html] > and > [VOTE|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Integrate-Flink-Docker-image-publication-into-Flink-release-process-td36982.html] > threads on the mailing list. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (FLINK-15828) Integrate docker-flink/docker-flink into Flink release process
[ https://issues.apache.org/jira/browse/FLINK-15828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Lucas resolved FLINK-15828. --- Resolution: Fixed > Integrate docker-flink/docker-flink into Flink release process > -- > > Key: FLINK-15828 > URL: https://issues.apache.org/jira/browse/FLINK-15828 > Project: Flink > Issue Type: Improvement > Components: Deployment / Docker, Release System >Reporter: Ufuk Celebi >Priority: Major > > This ticket tracks the first phase of Flink Docker image build consolidation. > The goal of this story is to integrate Docker image publication with the > Flink release process and provide convenience packages of released Flink > artifacts on DockerHub. > For more details, check the > [DISCUSS|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Integrate-Flink-Docker-image-publication-into-Flink-release-process-td36139.html] > and > [VOTE|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Integrate-Flink-Docker-image-publication-into-Flink-release-process-td36982.html] > threads on the mailing list. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-15829) Request apache/flink-docker repository
[ https://issues.apache.org/jira/browse/FLINK-15829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Lucas closed FLINK-15829. - > Request apache/flink-docker repository > -- > > Key: FLINK-15829 > URL: https://issues.apache.org/jira/browse/FLINK-15829 > Project: Flink > Issue Type: Sub-task >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-15831) Add Docker image publication to release documentation
[ https://issues.apache.org/jira/browse/FLINK-15831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Lucas closed FLINK-15831. - > Add Docker image publication to release documentation > - > > Key: FLINK-15831 > URL: https://issues.apache.org/jira/browse/FLINK-15831 > Project: Flink > Issue Type: Sub-task > Components: Documentation >Reporter: Ufuk Celebi >Assignee: Patrick Lucas >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > The [release documentation in the project > Wiki|https://cwiki.apache.org/confluence/display/FLINK/Creating+a+Flink+Release] > describes the release process. > We need to add a note to follow up with the Docker image publication process > as part of the release checklist. The actual documentation should probably be > self-contained in the apache/flink-docker repository, but we should > definitely link to it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (FLINK-15831) Add Docker image publication to release documentation
[ https://issues.apache.org/jira/browse/FLINK-15831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Lucas resolved FLINK-15831. --- Resolution: Fixed > Add Docker image publication to release documentation > - > > Key: FLINK-15831 > URL: https://issues.apache.org/jira/browse/FLINK-15831 > Project: Flink > Issue Type: Sub-task > Components: Documentation >Reporter: Ufuk Celebi >Assignee: Patrick Lucas >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > The [release documentation in the project > Wiki|https://cwiki.apache.org/confluence/display/FLINK/Creating+a+Flink+Release] > describes the release process. > We need to add a note to follow up with the Docker image publication process > as part of the release checklist. The actual documentation should probably be > self-contained in the apache/flink-docker repository, but we should > definitely link to it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-15831) Add Docker image publication to release documentation
[ https://issues.apache.org/jira/browse/FLINK-15831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033556#comment-17033556 ] Patrick Lucas commented on FLINK-15831: --- After discussion on the mailing list, I updated the [Flink release guide|https://cwiki.apache.org/confluence/display/FLINK/Creating+a+Flink+Release], and once [#5|https://github.com/apache/flink-docker/pull/5] has been merged I think we can close this issue. There's still room for improvement here, for example adding more automation to the workflow and somehow making release candidate images available during the release process, but we can tackle that in a follow-up issue. > Add Docker image publication to release documentation > - > > Key: FLINK-15831 > URL: https://issues.apache.org/jira/browse/FLINK-15831 > Project: Flink > Issue Type: Sub-task > Components: Documentation >Reporter: Ufuk Celebi >Assignee: Patrick Lucas >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The [release documentation in the project > Wiki|https://cwiki.apache.org/confluence/display/FLINK/Creating+a+Flink+Release] > describes the release process. > We need to add a note to follow up with the Docker image publication process > as part of the release checklist. The actual documentation should probably be > self-contained in the apache/flink-docker repository, but we should > definitely link to it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-14973) OSS filesystem does not relocate many dependencies
Patrick Lucas created FLINK-14973: - Summary: OSS filesystem does not relocate many dependencies Key: FLINK-14973 URL: https://issues.apache.org/jira/browse/FLINK-14973 Project: Flink Issue Type: Bug Components: FileSystems Affects Versions: 1.9.1 Reporter: Patrick Lucas Whereas the Azure and S3 Hadoop filesystem jars relocate all of their dependencies: {noformat} $ jar tf opt/flink-azure-fs-hadoop-1.9.1.jar | grep -v '^org/apache/fs/shaded/' | grep -v '^META-INF' | grep '/' | cut -f -2 -d / | sort | uniq org/ org/apache $ jar tf opt/flink-s3-fs-hadoop-1.9.1.jar | grep -v '^org/apache/fs/shaded/' | grep -v '^META-INF' | grep '/' | cut -f -2 -d / | sort | uniq org/ org/apache {noformat} The OSS Hadoop filesystem leaves many things un-relocated: {noformat} $ jar tf opt/flink-oss-fs-hadoop-1.9.1.jar | grep -v '^org/apache/fs/shaded/' | grep -v '^META-INF' | grep '/' | cut -f -2 -d / | sort | uniq assets/ assets/org avro/ avro/shaded com/ com/ctc com/fasterxml com/google com/jcraft com/nimbusds com/sun com/thoughtworks javax/ javax/activation javax/el javax/servlet javax/ws javax/xml jersey/ jersey/repackaged licenses/ licenses/LICENSE.asm licenses/LICENSE.cddlv1.0 licenses/LICENSE.cddlv1.1 licenses/LICENSE.jdom licenses/LICENSE.jzlib licenses/LICENSE.paranamer licenses/LICENSE.protobuf licenses/LICENSE.re2j licenses/LICENSE.stax2api net/ net/jcip net/minidev org/ org/apache org/codehaus org/eclipse org/jdom org/objectweb org/tukaani org/xerial {noformat} The first symptom of this I ran into was that Flink is unable to restore from a savepoint if both the OSS and Azure Hadoop filesystems are on the classpath, but I assume this has the potential to cause further problems, at least until more progress is made on the module/classloading front. h3. 
Steps to reproduce # Copy both the Azure and OSS Hadoop filesystem JARs from opt/ into lib/ # Run a job that restores from a savepoint (the savepoint might need to be stored on OSS) # See a crash and traceback like: {noformat} 2019-11-26 15:59:25,318 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint[] - Fatal error occurred in the cluster entrypoint. org.apache.flink.runtime.dispatcher.DispatcherException: Failed to take leadership with session id ----. at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$30(Dispatcher.java:915) ~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2] at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_232] at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_232] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_232] at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_232] at org.apache.flink.runtime.concurrent.FutureUtils$WaitingConjunctFuture.handleCompletedFuture(FutureUtils.java:691) ~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2] at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_232] at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_232] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_232] at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575) ~[?:1.8.0_232] at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:753) ~[?:1.8.0_232] at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) ~[?:1.8.0_232] at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397) ~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2] at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190) ~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2] at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) ~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2] at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) ~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2] at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2] at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2] at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) [flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2] at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) [flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
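For illustration, relocation of this kind is normally done with the maven-shade-plugin; a sketch follows (the `com.fasterxml` pattern is just one of the offending packages from the listing above, and the shaded prefix shown is illustrative, not necessarily the one Flink uses):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <!-- move each third-party package under a filesystem-private prefix -->
          <relocation>
            <pattern>com.fasterxml</pattern>
            <shadedPattern>org.apache.flink.fs.shaded.com.fasterxml</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```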
[jira] [Commented] (FLINK-14812) Add metric libs to Flink classpath with an environment variable.
[ https://issues.apache.org/jira/browse/FLINK-14812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16978444#comment-16978444 ] Patrick Lucas commented on FLINK-14812: --- Supporting some env var to augment the classpath used when running Flink sounds totally reasonable to me, though for users who want variations on the upstream image I still don't think it's a bad idea for them to create their own images—especially if they are already set up to build Docker images. In this case, a Dockerfile could be: {code} FROM flink:1.9 RUN ln -s ${FLINK_HOME}/opt/flink-metrics-prometheus-*.jar ${FLINK_HOME}/lib/ {code} > Add metric libs to Flink classpath with an environment variable. > > > Key: FLINK-14812 > URL: https://issues.apache.org/jira/browse/FLINK-14812 > Project: Flink > Issue Type: New Feature > Components: Deployment / Kubernetes, Deployment / Scripts >Reporter: Eui Heo >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > To use the Flink metric lib you need to add it to the flink classpath. The > documentation explains to put the jar file in the lib path. > https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#prometheus-orgapacheflinkmetricsprometheusprometheusreporter > However, to deploy metric-enabled Flinks on a kubernetes cluster, we have the > burden of creating and managing another container image. It would be more > efficient to add the classpath using environment variables inside the > constructFlinkClassPath function in the config.sh file. -- This message was sent by Atlassian Jira (v8.3.4#803005)
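As a sketch of the proposed config.sh change (the environment variable and function names here are hypothetical), the idea is just to append extra jars to the computed classpath:

```shell
# Hypothetical helper: append jars listed in FLINK_EXTRA_CLASSPATH to the
# classpath that constructFlinkClassPath would normally produce.
augment_classpath() {
  local base="$1"
  if [ -n "${FLINK_EXTRA_CLASSPATH:-}" ]; then
    echo "${base}:${FLINK_EXTRA_CLASSPATH}"
  else
    echo "${base}"
  fi
}
```

With something like that in place, enabling the Prometheus reporter on Kubernetes would be a matter of setting one env var on the container rather than building a derived image.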
[jira] [Closed] (FLINK-12191) Flink SVGs on "Material" page broken, render incorrectly on Firefox
[ https://issues.apache.org/jira/browse/FLINK-12191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Lucas closed FLINK-12191. - Resolution: Fixed > Flink SVGs on "Material" page broken, render incorrectly on Firefox > --- > > Key: FLINK-12191 > URL: https://issues.apache.org/jira/browse/FLINK-12191 > Project: Flink > Issue Type: Bug > Components: Project Website >Reporter: Patrick Lucas >Assignee: Patrick Lucas >Priority: Major > Labels: pull-request-available > Attachments: Screen Shot 2019-04-15 at 09.48.15.png > > Time Spent: 20m > Remaining Estimate: 0h > > Like FLINK-11043, the Flink SVGs on the [Material page of the Flink > website|https://flink.apache.org/material.html] are invalid and do not render > correctly on Firefox. > I'm not sure if there is an additional source-of-truth for these images, or > if these hosted on the website are canonical, but I can fix them nonetheless. > I also noticed that one of the squirrels in both {{color_black.svg}} and > {{color_white.svg}} is missing the eye gradient, which can also be easily > fixed. > !Screen Shot 2019-04-15 at 09.48.15.png|thumbnail! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12191) Flink SVGs on "Material" page broken, render incorrectly on Firefox
Patrick Lucas created FLINK-12191: - Summary: Flink SVGs on "Material" page broken, render incorrectly on Firefox Key: FLINK-12191 URL: https://issues.apache.org/jira/browse/FLINK-12191 Project: Flink Issue Type: Bug Components: Project Website Reporter: Patrick Lucas Assignee: Patrick Lucas Attachments: Screen Shot 2019-04-15 at 09.48.15.png Like FLINK-11043, the Flink SVGs on the [Material page of the Flink website|https://flink.apache.org/material.html] are invalid and do not render correctly on Firefox. I'm not sure if there is an additional source-of-truth for these images, or if these hosted on the website are canonical, but I can fix them nonetheless. I also noticed that one of the squirrels in both {{color_black.svg}} and {{color_white.svg}} is missing the eye gradient, which can also be easily fixed. !Screen Shot 2019-04-15 at 09.48.15.png|thumbnail! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11462) "Powered by Flink" page does not display correctly on Firefox
Patrick Lucas created FLINK-11462: - Summary: "Powered by Flink" page does not display correctly on Firefox Key: FLINK-11462 URL: https://issues.apache.org/jira/browse/FLINK-11462 Project: Flink Issue Type: Bug Components: Project Website Reporter: Patrick Lucas Assignee: Patrick Lucas Attachments: Screen Shot 2019-01-30 at 12.13.03.png The JavaScript that sets the height of each "tile" does not work as expected in Firefox, causing them to overlap (see attached screenshot). [This Stack Overflow post|https://stackoverflow.com/questions/12184133/jquery-wrong-values-when-trying-to-get-div-height] suggests using jQuery's {{outerHeight}} instead of {{height}} method, and making this change seems to make the page display correctly in both Firefox and Chrome. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
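The difference is that jQuery's {{height}} returns only the content height, while {{outerHeight}} also counts padding and border; a plain-JavaScript sketch of the mismatch, with illustrative numbers:

```javascript
// Model of the two jQuery measurements (values are illustrative).
function contentHeight(box) {
  return box.content;
}
function outerHeight(box) {
  // what jQuery's .outerHeight() reports: content + padding + border
  return box.content + 2 * box.padding + 2 * box.border;
}

const tile = { content: 100, padding: 10, border: 1 };
// Sizing the row with contentHeight(tile) ignores 22px per tile, so the
// tiles overlap; outerHeight(tile) reflects the space actually occupied.
```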
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720154#comment-16720154 ] Patrick Lucas commented on FLINK-11127: --- StatefulSets now support the podManagementPolicy option of "Parallel", which allows all the Pods to launch or terminate at the same time, which partially solves the problem, though experimentation is needed to really know whether that is preferable to using a Deployment. (See: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#pod-management-policies) Would it be feasible for the JMs to connect to the TMs by IP address (as they do naturally for other types of communication)? > Make metrics query service establish connection to JobManager > - > > Key: FLINK-11127 > URL: https://issues.apache.org/jira/browse/FLINK-11127 > Project: Flink > Issue Type: Improvement > Components: Distributed Coordination, Kubernetes, Metrics >Affects Versions: 1.7.0 >Reporter: Ufuk Celebi >Priority: Major > > As part of FLINK-10247, the internal metrics query service has been separated > into its own actor system. Before this change, the JobManager (JM) queried > TaskManager (TM) metrics via the TM actor. Now, the JM needs to establish a > separate connection to the TM metrics query service actor. > In the context of Kubernetes, this is problematic as the JM will typically > *not* be able to resolve the TMs by name, resulting in warnings as follows: > {code} > 2018-12-11 08:32:33,962 WARN akka.remote.ReliableDeliverySupervisor > - Association with remote system > [akka.tcp://flink-metrics@flink-task-manager-64b868487c-x9l4b:39183] has > failed, address is now gated for [50] ms. 
Reason: [Association failed with > [akka.tcp://flink-metrics@flink-task-manager-64b868487c-x9l4b:39183]] Caused > by: [flink-task-manager-64b868487c-x9l4b: Name does not resolve] > {code} > In order to expose the TMs by name in Kubernetes, users require a service > *for each* TM instance which is not practical. > This currently results in the web UI not being able to display some basic metrics > about number of sent records. You can reproduce this by following the READMEs > in {{flink-container/kubernetes}}. > This worked before, because the JM is typically exposed via a service with a > known name and the TMs establish the connection to it which the metrics query > service piggybacked on. > A potential solution to this might be to let the query service connect to the > JM similar to how the TMs register. > I tagged this ticket as an improvement, but in the context of Kubernetes I > would consider this to be a bug. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
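For reference, the option mentioned above, in a sketch of a TaskManager StatefulSet (names, image tag, and replica count are illustrative):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: flink-task-manager
spec:
  podManagementPolicy: Parallel  # launch/terminate all TM pods at once
  serviceName: flink-task-manager  # headless service providing per-pod DNS
  replicas: 3
  selector:
    matchLabels:
      app: flink-task-manager
  template:
    metadata:
      labels:
        app: flink-task-manager
    spec:
      containers:
        - name: task-manager
          image: flink:1.7
          args: ["taskmanager"]
```

The headless service gives each TM pod a stable, resolvable DNS name, which is what would let the JM reach the metrics query service by name.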
[jira] [Closed] (FLINK-11043) Website Flink logo broken on Firefox
[ https://issues.apache.org/jira/browse/FLINK-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Lucas closed FLINK-11043. - Resolution: Fixed > Website Flink logo broken on Firefox > > > Key: FLINK-11043 > URL: https://issues.apache.org/jira/browse/FLINK-11043 > Project: Flink > Issue Type: Bug > Components: Project Website >Reporter: Patrick Lucas >Assignee: Patrick Lucas >Priority: Minor > Labels: pull-request-available > Attachments: Screen Shot 2018-11-30 at 14.48.43.png > > > The gradients in the Flink SVG logo do not render on Firefox due to this > (12-year-old) bug: [https://bugzilla.mozilla.org/show_bug.cgi?id=353575] > Example: > !Screen Shot 2018-11-30 at 14.48.43.png! > [This StackOverflow > post|https://stackoverflow.com/questions/12867704/svg-linear-gradient-does-not-work-in-firefox] > links to the above issue and explains that Firefox does not render gradient > elements that are inside a "symbol" element in SVGs, and the fix is to > extract all the gradient definitions into a top-level "defs" section. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
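The fix can be sketched as follows (the gradient id, colors, and shapes are illustrative, not the actual logo contents); Firefox renders the gradient only when it lives in a top-level "defs", not inside the "symbol":

```xml
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
  <!-- gradients hoisted out of the symbol into a top-level defs section -->
  <defs>
    <linearGradient id="flink-gradient">
      <stop offset="0%" stop-color="#e6526f"/>
      <stop offset="100%" stop-color="#f8a055"/>
    </linearGradient>
  </defs>
  <symbol id="flink-logo">
    <!-- shapes keep referencing the gradient by id as before -->
    <rect width="100" height="100" fill="url(#flink-gradient)"/>
  </symbol>
  <use xlink:href="#flink-logo"/>
</svg>
```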
[jira] [Updated] (FLINK-11043) Website Flink logo broken on Firefox
[ https://issues.apache.org/jira/browse/FLINK-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Lucas updated FLINK-11043: -- Description: The gradients the Flink SVG logo do not render on Firefox due to this (12-year-old) bug: [https://bugzilla.mozilla.org/show_bug.cgi?id=353575] Example: !Screen Shot 2018-11-30 at 14.48.43.png! [This StackOverflow post|https://stackoverflow.com/questions/12867704/svg-linear-gradient-does-not-work-in-firefox] links to the above issue and explains that Firefox does not render gradient elements that are inside a "symbol" element in SVGs, and the fix is to extract all the gradient definitions into a top-level "defs" section. was: The gradients the Flink SVG logo do not render on Firefox due to this (12-year-old) bug: [https://bugzilla.mozilla.org/show_bug.cgi?id=353575] Example: !Screen Shot 2018-11-30 at 14.48.43.png! This StackOverflow post links to the above issue and explains that Firefox does not render gradient elements that are inside a "symbol" element in SVGs, and the fix is to extract all the gradient definitions into a top-level "defs" section. > Website Flink logo broken on Firefox > > > Key: FLINK-11043 > URL: https://issues.apache.org/jira/browse/FLINK-11043 > Project: Flink > Issue Type: Bug > Components: Project Website >Reporter: Patrick Lucas >Assignee: Patrick Lucas >Priority: Minor > Labels: pull-request-available > Attachments: Screen Shot 2018-11-30 at 14.48.43.png > > > The gradients the Flink SVG logo do not render on Firefox due to this > (12-year-old) bug: [https://bugzilla.mozilla.org/show_bug.cgi?id=353575] > Example: > !Screen Shot 2018-11-30 at 14.48.43.png! 
> [This StackOverflow > post|https://stackoverflow.com/questions/12867704/svg-linear-gradient-does-not-work-in-firefox] > links to the above issue and explains that Firefox does not render gradient > elements that are inside a "symbol" element in SVGs, and the fix is to > extract all the gradient definitions into a top-level "defs" section. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11043) Website Flink logo broken on Firefox
Patrick Lucas created FLINK-11043: - Summary: Website Flink logo broken on Firefox Key: FLINK-11043 URL: https://issues.apache.org/jira/browse/FLINK-11043 Project: Flink Issue Type: Bug Components: Project Website Reporter: Patrick Lucas Assignee: Patrick Lucas Attachments: Screen Shot 2018-11-30 at 14.48.43.png The gradients in the Flink SVG logo do not render on Firefox due to this (12-year-old) bug: [https://bugzilla.mozilla.org/show_bug.cgi?id=353575] Example: !Screen Shot 2018-11-30 at 14.48.43.png! This StackOverflow post links to the above issue and explains that Firefox does not render gradient elements that are inside a "symbol" element in SVGs, and the fix is to extract all the gradient definitions into a top-level "defs" section. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-11043) Website Flink logo broken on Firefox
[ https://issues.apache.org/jira/browse/FLINK-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Lucas updated FLINK-11043: -- Priority: Minor (was: Major) > Website Flink logo broken on Firefox > > > Key: FLINK-11043 > URL: https://issues.apache.org/jira/browse/FLINK-11043 > Project: Flink > Issue Type: Bug > Components: Project Website >Reporter: Patrick Lucas >Assignee: Patrick Lucas >Priority: Minor > Labels: pull-request-available > Attachments: Screen Shot 2018-11-30 at 14.48.43.png > > > The gradients in the Flink SVG logo do not render on Firefox due to this > (12-year-old) bug: [https://bugzilla.mozilla.org/show_bug.cgi?id=353575] > Example: > !Screen Shot 2018-11-30 at 14.48.43.png! > This StackOverflow post links to the above issue and explains that Firefox > does not render gradient elements that are inside a "symbol" element in SVGs, > and the fix is to extract all the gradient definitions into a top-level > "defs" section. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-10007) Security vulnerability in website build infrastructure
[ https://issues.apache.org/jira/browse/FLINK-10007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566645#comment-16566645 ] Patrick Lucas commented on FLINK-10007: --- FWIW, this vulnerability has to do with a specific JSON string being able to crash ruby, which could result in DoS to a service running it. It's basically inconsequential to us, but might as well check it off the list I guess. https://www.cvedetails.com/cve/CVE-2017-16516/ > Security vulnerability in website build infrastructure > -- > > Key: FLINK-10007 > URL: https://issues.apache.org/jira/browse/FLINK-10007 > Project: Flink > Issue Type: Bug > Components: Project Website >Reporter: Fabian Hueske >Priority: Critical > > We've got a notification from Apache INFRA about a potential security > vulnerability: > {quote} > We found a potential security vulnerability in a repository for which you > have been granted security alert access. > @apache apache/flink-web > Known high severity security vulnerability detected in yajl-ruby < 1.3.1 > defined in Gemfile. > Gemfile update suggested: yajl-ruby ~> 1.3.1. > {quote} > This is a problem with the build environment of the website, i.e., this > dependency is not distributed or executed with Flink but only run when the > website is updated. > Nonetheless, we should of course update the dependency. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-9806) Add a canonical link element to documentation HTML
[ https://issues.apache.org/jira/browse/FLINK-9806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563249#comment-16563249 ] Patrick Lucas commented on FLINK-9806: -- I'm pleased with the impact of this so far: three test Google queries, "flink metrics", "flink queryable state", and "flink flatmap", all have a link from the stable docs as the first result. However, DuckDuckGo, Bing, and Baidu still show terribly out of date results for these queries, some even with the Flink 1.1.2 docs as the top result and no "stable" URLs in sight. Improving the experience on Google is still a good result. > Add a canonical link element to documentation HTML > -- > > Key: FLINK-9806 > URL: https://issues.apache.org/jira/browse/FLINK-9806 > Project: Flink > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.5.0 >Reporter: Patrick Lucas >Assignee: Patrick Lucas >Priority: Major > Labels: pull-request-available > Fix For: 1.3.4, 1.4.3, 1.5.3, 1.6.0 > > > Flink has suffered for a while with non-optimal SEO for its documentation, > meaning a web search for a topic covered in the documentation often produces > results for many versions of Flink, even preferring older versions since > those pages have been around for longer. > Using a canonical link element (see references) may alleviate this by > informing search engines about where to find the latest documentation (i.e. > pages hosted under [https://ci.apache.org/projects/flink/flink-docs-master/).] > I think this is at least worth experimenting with, and if it doesn't cause > problems, even backporting it to the older release branches to eventually > clean up the Flink docs' SEO and converge on advertising only the latest docs > (unless a specific version is specified). 
> References: > * [https://moz.com/learn/seo/canonicalization] > * [https://yoast.com/rel-canonical/] > * [https://support.google.com/webmasters/answer/139066?hl=en] > * [https://en.wikipedia.org/wiki/Canonical_link_element] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
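As a concrete illustration of the canonical link element proposed above: it is a single tag in each versioned docs page's HTML head, pointing at the preferred copy of that page. A minimal sketch (the exact href shown here is illustrative, not taken from the Flink docs build):

```html
<!-- In the <head> of every versioned docs page, point search engines at the
     corresponding page under the preferred docs URL (illustrative href): -->
<link rel="canonical"
      href="https://ci.apache.org/projects/flink/flink-docs-master/dev/datastream_api.html" />
```

Search engines that honor the element then consolidate ranking signals from the many per-version copies onto the canonical URL, which is exactly the convergence described in the issue.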
[jira] [Commented] (FLINK-9914) Flink docker information in official website is out of date and should be update
[ https://issues.apache.org/jira/browse/FLINK-9914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553007#comment-16553007 ] Patrick Lucas commented on FLINK-9914: -- The version numbers in those docs are just for example purposes, and don't need to be updated for each release. However, there have been some slight changes to the images, so updating these docs anyway is in order. > Flink docker information in official website is out of date and should be > update > > > Key: FLINK-9914 > URL: https://issues.apache.org/jira/browse/FLINK-9914 > Project: Flink > Issue Type: Bug >Reporter: vinoyang >Assignee: vinoyang >Priority: Minor > > The documentation still use Flink 1.2.1, but the latest Flink version is 1.5.x -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (FLINK-9806) Add a canonical link element to documentation HTML
[ https://issues.apache.org/jira/browse/FLINK-9806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Lucas reassigned FLINK-9806: Assignee: Patrick Lucas > Add a canonical link element to documentation HTML > -- > > Key: FLINK-9806 > URL: https://issues.apache.org/jira/browse/FLINK-9806 > Project: Flink > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.5.0 >Reporter: Patrick Lucas >Assignee: Patrick Lucas >Priority: Major > Labels: pull-request-available > > Flink has suffered for a while with non-optimal SEO for its documentation, > meaning a web search for a topic covered in the documentation often produces > results for many versions of Flink, even preferring older versions since > those pages have been around for longer. > Using a canonical link element (see references) may alleviate this by > informing search engines about where to find the latest documentation (i.e. > pages hosted under [https://ci.apache.org/projects/flink/flink-docs-master/).] > I think this is at least worth experimenting with, and if it doesn't cause > problems, even backporting it to the older release branches to eventually > clean up the Flink docs' SEO and converge on advertising only the latest docs > (unless a specific version is specified). > References: > * [https://moz.com/learn/seo/canonicalization] > * [https://yoast.com/rel-canonical/] > * [https://support.google.com/webmasters/answer/139066?hl=en] > * [https://en.wikipedia.org/wiki/Canonical_link_element] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-9806) Add a canonical link element to documentation HTML
Patrick Lucas created FLINK-9806: Summary: Add a canonical link element to documentation HTML Key: FLINK-9806 URL: https://issues.apache.org/jira/browse/FLINK-9806 Project: Flink Issue Type: Improvement Components: Documentation Affects Versions: 1.5.0 Reporter: Patrick Lucas Flink has suffered for a while with non-optimal SEO for its documentation, meaning a web search for a topic covered in the documentation often produces results for many versions of Flink, even preferring older versions since those pages have been around for longer. Using a canonical link element (see references) may alleviate this by informing search engines about where to find the latest documentation (i.e. pages hosted under [https://ci.apache.org/projects/flink/flink-docs-master/).] I think this is at least worth experimenting with, and if it doesn't cause problems, even backporting it to the older release branches to eventually clean up the Flink docs' SEO and converge on advertising only the latest docs (unless a specific version is specified). References: * [https://moz.com/learn/seo/canonicalization] * [https://yoast.com/rel-canonical/] * [https://support.google.com/webmasters/answer/139066?hl=en] * [https://en.wikipedia.org/wiki/Canonical_link_element] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-8154) JobSubmissionClientActor submited job,but there is no connection to a JobManager
[ https://issues.apache.org/jira/browse/FLINK-8154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16266782#comment-16266782 ] Patrick Lucas commented on FLINK-8154: -- Is `act-monitor-flink-jobmanager` a Kubernetes Service? If so, it's looking like the Service's IP is `10.3.0.81`, but the Pod's IP is `10.2.43.51`. Depending on your network configuration, these may not be routable to each other. Can you confirm connectivity by, for example, running `nc -v 10.2.43.51 6123` from within the Pod? (You may have to install `netcat`) > JobSubmissionClientActor submited job,but there is no connection to a > JobManager > - > > Key: FLINK-8154 > URL: https://issues.apache.org/jira/browse/FLINK-8154 > Project: Flink > Issue Type: Bug > Components: JobManager >Affects Versions: 1.3.2 > Environment: Kubernetes 1.8.3, Platform "Linux/amd64" >Reporter: Gregory Melekh >Priority: Blocker > > There is JobManager log file bellow. > 2017-11-26 08:17:13,435 INFO org.apache.flink.client.CliFrontend > - > > 2017-11-26 08:17:13,437 INFO org.apache.flink.client.CliFrontend > - Starting Command Line Client (Version: 1.3.2, Rev:0399bee, > Date:03.08.2017 @ 10:23:11 UTC) > 2017-11-26 08:17:13,437 INFO org.apache.flink.client.CliFrontend > - Current user: root > 2017-11-26 08:17:13,437 INFO org.apache.flink.client.CliFrontend > - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - > 1.8/25.131-b11 > 2017-11-26 08:17:13,437 INFO org.apache.flink.client.CliFrontend > - Maximum heap size: 6252 MiBytes > 2017-11-26 08:17:13,437 INFO org.apache.flink.client.CliFrontend > - JAVA_HOME: /usr/lib/jvm/java-1.8-openjdk/jre > 2017-11-26 08:17:13,439 INFO org.apache.flink.client.CliFrontend > - Hadoop version: 2.7.2 > 2017-11-26 08:17:13,440 INFO org.apache.flink.client.CliFrontend > - JVM Options: > 2017-11-26 08:17:13,440 INFO org.apache.flink.client.CliFrontend > - > -Dlog.file=/opt/flink/log/flink--client-act-monitor-flink-jobmanager-66cd4bdb5c-8kxbh.log > 2017-11-26 08:17:13,440 
INFO org.apache.flink.client.CliFrontend > - -Dlog4j.configuration=file:/etc/flink/log4j-cli.properties > 2017-11-26 08:17:13,440 INFO org.apache.flink.client.CliFrontend > - -Dlogback.configurationFile=file:/etc/flink/logback.xml > 2017-11-26 08:17:13,440 INFO org.apache.flink.client.CliFrontend > - Program Arguments: > 2017-11-26 08:17:13,440 INFO org.apache.flink.client.CliFrontend > - run > 2017-11-26 08:17:13,440 INFO org.apache.flink.client.CliFrontend > - -c > 2017-11-26 08:17:13,440 INFO org.apache.flink.client.CliFrontend > - monitoring.flow.AccumulateAll > 2017-11-26 08:17:13,440 INFO org.apache.flink.client.CliFrontend > - /tmp/monitoring-0.0.1-SNAPSHOT.jar > 2017-11-26 08:17:13,440 INFO org.apache.flink.client.CliFrontend > - Classpath: > /opt/flink/lib/flink-python_2.11-1.3.2.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.3.2.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.3.2.jar::: > 2017-11-26 08:17:13,440 INFO org.apache.flink.client.CliFrontend > - > > 2017-11-26 08:17:13,440 INFO org.apache.flink.client.CliFrontend > - Using configuration directory /etc/flink > 2017-11-26 08:17:13,441 INFO org.apache.flink.client.CliFrontend > - Trying to load configuration file > 2017-11-26 08:17:13,443 INFO > org.apache.flink.configuration.GlobalConfiguration- Loading > configuration property: blob.server.port, 6124 > 2017-11-26 08:17:13,443 INFO > org.apache.flink.configuration.GlobalConfiguration- Loading > configuration property: jobmanager.rpc.address, act-monitor-flink-jobmanager > 2017-11-26 08:17:13,443 INFO > org.apache.flink.configuration.GlobalConfiguration- Loading > configuration property: jobmanager.rpc.port, 6123 > 2017-11-26 08:17:13,443 INFO > org.apache.flink.configuration.GlobalConfiguration- Loading > configuration property: jobmanager.heap.mb, 1024 > 2017-11-26 08:17:13,443 INFO >
[jira] [Commented] (FLINK-6369) Better support for overlay networks
[ https://issues.apache.org/jira/browse/FLINK-6369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155461#comment-16155461 ] Patrick Lucas commented on FLINK-6369: -- [~till.rohrmann]: yeah that's what I was getting at. It doesn't really solve our problem for that reason---though if you just Googled around a bit you might come across that documentation and think it does. I definitely agree that an HTTP control mechanism is the way to go. > Better support for overlay networks > --- > > Key: FLINK-6369 > URL: https://issues.apache.org/jira/browse/FLINK-6369 > Project: Flink > Issue Type: Improvement > Components: Docker, Network >Affects Versions: 1.2.0 >Reporter: Patrick Lucas > Fix For: 1.4.0 > > > Running Flink in an environment that utilizes an overlay network > (containerized environments like Kubernetes or Docker Compose, or cloud > platforms like AWS VPC) poses various challenges related to networking. > The core problem is that in these environments, applications are frequently > addressed by a name different from that with which the application sees > itself. > For instance, it is plausible that the Flink UI (served by the Jobmanager) is > accessed via an ELB, which poses a problem in HA mode when the non-leader UI > returns an HTTP redirect to the leader—but the user may not be able to > connect directly to the leader. > Or, if a user is using [Docker > Compose|https://github.com/apache/flink/blob/aa21f853ab0380ec1f68ae1d0b7c8d9268da4533/flink-contrib/docker-flink/docker-compose.yml], > they cannot submit a job via the CLI since there is a mismatch between the > name used to address the Jobmanager and what the Jobmanager perceives as its > hostname. (see \[1] below for more detail) > > h3. Problems and proposed solutions > There are four instances of this issue that I've run into so far: > h4. 
Jobmanagers must be addressed by the same name they are configured with > due to limitations of Akka > Akka enforces that messages it receives are addressed with the hostname it is > configured with. Newer versions of Akka (>= 2.4) than what Flink uses > (2.3-custom) have support for accepting messages with the "wrong" hostname, > but it is limited to a single "external" hostname. > In many environments, it is likely that not all parties that want to connect > to the Jobmanager have the same way of addressing it (e.g. the ELB example > above). Other similarly-used protocols like HTTP generally don't have this > restriction: if you connect on a socket and send a well-formed message, the > system assumes that it is the desired recipient. > One solution is to not use Akka at all when communicating with the cluster > from the outside, perhaps using an HTTP API instead. This would be somewhat > involved, and probably best left as a longer-term goal. > A more immediate solution would be to override this behavior within Flakka, > the custom fork of Akka currently in use by Flink. I'm not sure how much > effort this would take. > h4. The Jobmanager needs to be able to address the Taskmanagers for e.g. > metrics collection > Having the Taskmanagers register themselves by IP is probably the best > solution here. It's a reasonable assumption that IPs can always be used for > communication between the nodes of a single cluster. Asking that each > Taskmanager container have a resolvable hostname is unreasonable. > h4. Jobmanagers in HA mode send HTTP redirects to URLs that aren't externally > resolvable/routable > If multiple Jobmanagers are used in HA mode, HTTP requests to non-leaders > (such as if you put a Kubernetes Service in front of all Jobmanagers in a > cluster) get redirected to the (supposed) hostname of the leader, but this is > potentially unresolvable/unroutable externally. > Enabling non-leader Jobmanagers to proxy API calls to the leader would solve > this. 
The non-leaders could even serve static asset requests (e.g. for css or > js files) directly. > h4. Queryable state requests involve direct communication with Taskmanagers > Currently, queryable state requests involve communication between the client > and the Jobmanager (for key partitioning lookups) and between the client and > all Taskmanagers. > If the client is inside the network (as would be common in production > use-cases where high-volume lookups are required) this is a non-issue, but > problems crop up if the client is outside the network. > For the communication with the Jobmanager, a similar solution as above can be > used: if all Jobmanagers can service all key partitioning lookup requests > (e.g. by proxying) then a simple Service can be used. > The story is a bit different for the Taskmanagers. The partitioning lookup to > the Jobmanager would return the name of the particular Taskmanager that owned >
[jira] [Created] (FLINK-7155) Add Influxdb metrics reporter
Patrick Lucas created FLINK-7155: Summary: Add Influxdb metrics reporter Key: FLINK-7155 URL: https://issues.apache.org/jira/browse/FLINK-7155 Project: Flink Issue Type: Improvement Reporter: Patrick Lucas Assignee: Patrick Lucas Fix For: 1.4.0 [~jgrier] has a [simple Influxdb metrics reporter for Flink|https://github.com/jgrier/flink-stuff/tree/master/flink-influx-reporter] that is a thin wrapper around [a lightweight, public-domain Influxdb reporter|https://github.com/davidB/metrics-influxdb] for Codahale metrics. We can implement this very easily in Java in the same way as flink-metrics-graphite. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (FLINK-6369) Better support for overlay networks
[ https://issues.apache.org/jira/browse/FLINK-6369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Lucas updated FLINK-6369: - Fix Version/s: (was: 1.3.1) 1.4.0 > Better support for overlay networks > --- > > Key: FLINK-6369 > URL: https://issues.apache.org/jira/browse/FLINK-6369 > Project: Flink > Issue Type: Improvement > Components: Docker, Network >Affects Versions: 1.2.0 >Reporter: Patrick Lucas > Fix For: 1.4.0 > > > Running Flink in an environment that utilizes an overlay network > (containerized environments like Kubernetes or Docker Compose, or cloud > platforms like AWS VPC) poses various challenges related to networking. > The core problem is that in these environments, applications are frequently > addressed by a name different from that with which the application sees > itself. > For instance, it is plausible that the Flink UI (served by the Jobmanager) is > accessed via an ELB, which poses a problem in HA mode when the non-leader UI > returns an HTTP redirect to the leader—but the user may not be able to > connect directly to the leader. > Or, if a user is using [Docker > Compose|https://github.com/apache/flink/blob/aa21f853ab0380ec1f68ae1d0b7c8d9268da4533/flink-contrib/docker-flink/docker-compose.yml], > they cannot submit a job via the CLI since there is a mismatch between the > name used to address the Jobmanager and what the Jobmanager perceives as its > hostname. (see \[1] below for more detail) > > h3. Problems and proposed solutions > There are four instances of this issue that I've run into so far: > h4. Jobmanagers must be addressed by the same name they are configured with > due to limitations of Akka > Akka enforces that messages it receives are addressed with the hostname it is > configured with. Newer versions of Akka (>= 2.4) than what Flink uses > (2.3-custom) have support for accepting messages with the "wrong" hostname, > but it is limited to a single "external" hostname. 
> In many environments, it is likely that not all parties that want to connect > to the Jobmanager have the same way of addressing it (e.g. the ELB example > above). Other similarly-used protocols like HTTP generally don't have this > restriction: if you connect on a socket and send a well-formed message, the > system assumes that it is the desired recipient. > One solution is to not use Akka at all when communicating with the cluster > from the outside, perhaps using an HTTP API instead. This would be somewhat > involved, and probably best left as a longer-term goal. > A more immediate solution would be to override this behavior within Flakka, > the custom fork of Akka currently in use by Flink. I'm not sure how much > effort this would take. > h4. The Jobmanager needs to be able to address the Taskmanagers for e.g. > metrics collection > Having the Taskmanagers register themselves by IP is probably the best > solution here. It's a reasonable assumption that IPs can always be used for > communication between the nodes of a single cluster. Asking that each > Taskmanager container have a resolvable hostname is unreasonable. > h4. Jobmanagers in HA mode send HTTP redirects to URLs that aren't externally > resolvable/routable > If multiple Jobmanagers are used in HA mode, HTTP requests to non-leaders > (such as if you put a Kubernetes Service in front of all Jobmanagers in a > cluster) get redirected to the (supposed) hostname of the leader, but this is > potentially unresolvable/unroutable externally. > Enabling non-leader Jobmanagers to proxy API calls to the leader would solve > this. The non-leaders could even serve static asset requests (e.g. for css or > js files) directly. > h4. Queryable state requests involve direct communication with Taskmanagers > Currently, queryable state requests involve communication between the client > and the Jobmanager (for key partitioning lookups) and between the client and > all Taskmanagers. 
> If the client is inside the network (as would be common in production > use-cases where high-volume lookups are required) this is a non-issue, but > problems crop up if the client is outside the network. > For the communication with the Jobmanager, a similar solution as above can be > used: if all Jobmanagers can service all key partitioning lookup requests > (e.g. by proxying) then a simple Service can be used. > The story is a bit different for the Taskmanagers. The partitioning lookup to > the Jobmanager would return the name of the particular Taskmanager that owned > the desired data, but that name (likely an IP, as proposed in the second > section above) is not necessarily resolvable/routable from the client. > In the context of Kubernetes, where individual containers are generally not > addressible, a very ugly
[jira] [Closed] (FLINK-3026) Publish the flink docker container to the docker registry
[ https://issues.apache.org/jira/browse/FLINK-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Lucas closed FLINK-3026. Resolution: Fixed > Publish the flink docker container to the docker registry > - > > Key: FLINK-3026 > URL: https://issues.apache.org/jira/browse/FLINK-3026 > Project: Flink > Issue Type: Task > Components: Build System, Docker >Reporter: Omer Katz >Assignee: Patrick Lucas > Labels: Deployment, Docker > > There's a dockerfile that can be used to build a docker container already in > the repository. It'd be awesome to just be able to pull it instead of > building it ourselves. > The dockerfile can be found at > https://github.com/apache/flink/tree/master/flink-contrib/docker-flink > It also doesn't point to the latest version of Flink which I fixed in > https://github.com/apache/flink/pull/1366 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (FLINK-3026) Publish the flink docker container to the docker registry
[ https://issues.apache.org/jira/browse/FLINK-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16004389#comment-16004389 ] Patrick Lucas commented on FLINK-3026: -- We now have [official Docker images|https://hub.docker.com/_/flink/]! Any remaining discussion from this issue should be moved to a new one, as this one is now resolved. > Publish the flink docker container to the docker registry > - > > Key: FLINK-3026 > URL: https://issues.apache.org/jira/browse/FLINK-3026 > Project: Flink > Issue Type: Task > Components: Build System, Docker >Reporter: Omer Katz >Assignee: Patrick Lucas > Labels: Deployment, Docker > > There's a dockerfile that can be used to build a docker container already in > the repository. It'd be awesome to just be able to pull it instead of > building it ourselves. > The dockerfile can be found at > https://github.com/apache/flink/tree/master/flink-contrib/docker-flink > It also doesn't point to the latest version of Flink which I fixed in > https://github.com/apache/flink/pull/1366 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (FLINK-5997) Docker build tools should inline GPG key to verify HTTP Flink release download
[ https://issues.apache.org/jira/browse/FLINK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16004386#comment-16004386 ] Patrick Lucas commented on FLINK-5997: -- This is done in the [official Docker images|https://github.com/docker-flink/docker-flink/blob/39a2b675f0ed0f8bf53054a66d67efebc6758009/Dockerfile-debian.template#L66]. > Docker build tools should inline GPG key to verify HTTP Flink release download > -- > > Key: FLINK-5997 > URL: https://issues.apache.org/jira/browse/FLINK-5997 > Project: Flink > Issue Type: Improvement > Components: Docker >Reporter: Patrick Lucas >Assignee: Patrick Lucas > > See examples > [here|https://github.com/docker-solr/docker-solr/blob/8afee8bf17a7a3d1fdc453643abfdb5cf2e822d0/6.4/Dockerfile] > and > [here|https://github.com/docker-library/official-images/blob/8b10b36d19a8cfba56bf4352c8ae98d3b6697eb3/README.md#init]. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Closed] (FLINK-5997) Docker build tools should inline GPG key to verify HTTP Flink release download
[ https://issues.apache.org/jira/browse/FLINK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Lucas closed FLINK-5997. Resolution: Fixed > Docker build tools should inline GPG key to verify HTTP Flink release download > -- > > Key: FLINK-5997 > URL: https://issues.apache.org/jira/browse/FLINK-5997 > Project: Flink > Issue Type: Improvement > Components: Docker >Reporter: Patrick Lucas >Assignee: Patrick Lucas > > See examples > [here|https://github.com/docker-solr/docker-solr/blob/8afee8bf17a7a3d1fdc453643abfdb5cf2e822d0/6.4/Dockerfile] > and > [here|https://github.com/docker-library/official-images/blob/8b10b36d19a8cfba56bf4352c8ae98d3b6697eb3/README.md#init]. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (FLINK-6487) Remove JobManager local mode
[ https://issues.apache.org/jira/browse/FLINK-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001269#comment-16001269 ] Patrick Lucas commented on FLINK-6487: -- This appears to work:
{code}
#!/bin/bash

JM_PID=
TM_PID=

function die() {
  kill $JM_PID $TM_PID
}
trap die SIGINT SIGTERM

jobmanager.sh start-foreground cluster &
JM_PID=$!

taskmanager.sh start-foreground &
TM_PID=$!

wait $JM_PID $TM_PID
wait $JM_PID $TM_PID
{code}
The double wait is intentional. When a trap fires it immediately returns all pending {{wait}}s, so you have to {{wait}} again on the same pids to actually block until they exit. There might be a nicer way to do that. > Remove JobManager local mode > > > Key: FLINK-6487 > URL: https://issues.apache.org/jira/browse/FLINK-6487 > Project: Flink > Issue Type: Bug > Components: JobManager >Reporter: Stephan Ewen > > We should remove the "local" mode from the JobManager. > Currently, the JobManager has the strange "local" mode where it also starts > an embedded Task Manager. > I think that mode has caused confusion / problems: > - No TaskManagers can join the cluster > - TaskManagers do not support queryable state > - It is redundant code to maintain > At the same time, the mode does not help at all: > - The MiniCluster does not use that mode > - Starting from scripts, the {{start-cluster.sh}} works out of the box, > creating a proper local cluster, but in two processes, rather than one. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
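The trap/double-{{wait}} behavior described in that comment can be demonstrated in isolation. The following is a self-contained sketch that substitutes {{sleep}} processes for the Flink launcher scripts (the variable names mirror the script above, but the children and the self-delivered signal are stand-ins for illustration only):

```shell
#!/bin/bash
# Stand-ins for jobmanager.sh / taskmanager.sh: two long-running children.
sleep 30 & JM_PID=$!
sleep 30 & TM_PID=$!

# On SIGINT/SIGTERM, forward the signal to both children.
trap 'kill $JM_PID $TM_PID 2>/dev/null' SIGINT SIGTERM

# Simulate a termination signal arriving one second from now.
( sleep 1; kill -TERM $$ ) &

# The first wait returns as soon as the trap fires; the second wait is the
# one that actually blocks until both children have exited and been reaped.
# (|| true because the waits return the children's non-zero exit statuses.)
wait $JM_PID $TM_PID || true
wait $JM_PID $TM_PID || true
DEMO_DONE=1
```

Without the second {{wait}}, the script would fall off the end while the children might still be shutting down, which matters when the script is PID 1 in a container.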
[jira] [Commented] (FLINK-6487) Remove JobManager local mode
[ https://issues.apache.org/jira/browse/FLINK-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001228#comment-16001228 ] Patrick Lucas commented on FLINK-6487: -- I think we have to be cleverer (more clever?) than that—to propagate those signals in bash properly you need to exec (which of course you could only do once) or use trap... which might actually work okay. I'll give it a shot. Killing PID 1 with this approach as-is will orphan the cat and a bash that has two java child processes. > Remove JobManager local mode > > > Key: FLINK-6487 > URL: https://issues.apache.org/jira/browse/FLINK-6487 > Project: Flink > Issue Type: Bug > Components: JobManager >Reporter: Stephan Ewen > > We should remove the "local" mode from the JobManager. > Currently, the JobManager has the strange "local" mode where it also starts > an embedded Task Manager. > I think that mode has caused confusion / problems: > - No TaskManagers can join the cluster > - TaskManagers do not support querable state > - It is redundant code to maintain > At the same time, the mode does not help at all: > - The MiniCluster does not use that mode > - Starting from scripts, the {{start-cluster.sh}} works out of the box, > creating a proper local cluster, but in two processes, rather than one. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (FLINK-6487) Remove JobManager local mode
[ https://issues.apache.org/jira/browse/FLINK-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000980#comment-16000980 ] Patrick Lucas commented on FLINK-6487: -- Something like that should be fine. Even just a supervisord.conf that runs both with the right options could work, though unfortunately installing supervisord adds ~22MB to the image. We really don't need anything fancy, just something that spawns two processes, propagates signals to them properly, and multiplexes their stdouts into its own stdout. I wonder what the simplest approach for that would be. > Remove JobManager local mode > > > Key: FLINK-6487 > URL: https://issues.apache.org/jira/browse/FLINK-6487 > Project: Flink > Issue Type: Bug > Components: JobManager >Reporter: Stephan Ewen > > We should remove the "local" mode from the JobManager. > Currently, the JobManager has the strange "local" mode where it also starts > an embedded Task Manager. > I think that mode has caused confusion / problems: > - No TaskManagers can join the cluster > - TaskManagers do not support querable state > - It is redundant code to maintain > At the same time, the mode does not help at all: > - The MiniCluster does not use that mode > - Starting from scripts, the {{start-cluster.sh}} works out of the box, > creating a proper local cluster, but in two processes, rather than one. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
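For reference, the supervisord option mentioned above would amount to a config roughly like the following hypothetical {{supervisord.conf}} (the program commands and log settings are illustrative assumptions, not an actual Flink config):

```
[supervisord]
nodaemon=true               ; stay in the foreground as the container's PID 1

[program:jobmanager]
command=jobmanager.sh start-foreground cluster
stdout_logfile=/dev/stdout  ; multiplex child stdout into supervisord's stdout
stdout_logfile_maxbytes=0   ; required when logging to a non-seekable file

[program:taskmanager]
command=taskmanager.sh start-foreground
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
```

supervisord handles the signal propagation and restart policy for free; the trade-off, as noted, is the extra size it adds to the image.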
[jira] [Comment Edited] (FLINK-6487) Remove JobManager local mode
[ https://issues.apache.org/jira/browse/FLINK-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000909#comment-16000909 ] Patrick Lucas edited comment on FLINK-6487 at 5/8/17 3:28 PM: -- One nice reason to have local mode is for docker, so users can, for example, run {{docker run flink local}} to get a local single-taskmanager cluster. (In [{{docker-entrypoint.sh}}|https://github.com/docker-flink/docker-flink/blob/e92920ff33107e6f01abaddf7b0195570f4d06e1/docker-entrypoint.sh#L56] this is currently turned into a call to {{jobmanager.sh start-foreground local}}). Right now, {{start-cluster.sh}} has the flaws that it doesn't have an option to run in the foreground and there is no real process management for the two processes it spawns. I think to properly replicate the behavior of {{jobmanager.sh start-foreground local}} we either need to inline something like [supervisord|http://supervisord.org/], making sure to multiplex [the processes' stdouts|https://github.com/Supervisor/supervisor/issues/511] or have the jobmanager fork (to keep the two-process paradigm). was (Author: plucas): One nice reason to have local mode is for docker, so users can, for example, run {{docker run flink local}} to get a local single-taskmanager cluster. (In [{{docker-entrypoint.sh}}|https://github.com/docker-flink/docker-flink/blob/e92920ff33107e6f01abaddf7b0195570f4d06e1/docker-entrypoint.sh#L56] this is turned into a call to {{jobmanager.sh start-foreground local}}. Right now, {{start-cluster.sh}} has the flaws that it doesn't have an option to run in the foreground and there is no real process management for the two processes it spawns. 
I think to properly replicate the behavior of {{jobmanager.sh start-foreground local}} we either need to inline something like [supervisord|http://supervisord.org/], making sure to multiplex [the processes' stdouts|https://github.com/Supervisor/supervisor/issues/511] or have the jobmanager fork (to keep the two-process paradigm). > Remove JobManager local mode > > > Key: FLINK-6487 > URL: https://issues.apache.org/jira/browse/FLINK-6487 > Project: Flink > Issue Type: Bug > Components: JobManager >Reporter: Stephan Ewen > > We should remove the "local" mode from the JobManager. > Currently, the JobManager has the strange "local" mode where it also starts > an embedded Task Manager. > I think that mode has caused confusion / problems: > - No TaskManagers can join the cluster > - TaskManagers do not support querable state > - It is redundant code to maintain > At the same time, the mode does not help at all: > - The MiniCluster does not use that mode > - Starting from scripts, the {{start-cluster.sh}} works out of the box, > creating a proper local cluster, but in two processes, rather than one. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (FLINK-6487) Remove JobManager local mode
[ https://issues.apache.org/jira/browse/FLINK-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000909#comment-16000909 ] Patrick Lucas commented on FLINK-6487: -- One nice reason to have local mode is for docker, so users can, for example, run {{docker run flink local}} to get a local single-taskmanager cluster. (In [{{docker-entrypoint.sh}}|https://github.com/docker-flink/docker-flink/blob/e92920ff33107e6f01abaddf7b0195570f4d06e1/docker-entrypoint.sh#L56] this is turned into a call to {{jobmanager.sh start-foreground local}}. Right now, {{start-cluster.sh}} has the flaws that it doesn't have an option to run in the foreground and there is no real process management for the two processes it spawns. I think to properly replicate the behavior of {{jobmanager.sh start-foreground local}} we either need to inline something like [supervisord|http://supervisord.org/], making sure to multiplex [the processes' stdouts|https://github.com/Supervisor/supervisor/issues/511] or have the jobmanager fork (to keep the two-process paradigm). > Remove JobManager local mode > > > Key: FLINK-6487 > URL: https://issues.apache.org/jira/browse/FLINK-6487 > Project: Flink > Issue Type: Bug > Components: JobManager >Reporter: Stephan Ewen > > We should remove the "local" mode from the JobManager. > Currently, the JobManager has the strange "local" mode where it also starts > an embedded Task Manager. > I think that mode has caused confusion / problems: > - No TaskManagers can join the cluster > - TaskManagers do not support querable state > - It is redundant code to maintain > At the same time, the mode does not help at all: > - The MiniCluster does not use that mode > - Starting from scripts, the {{start-cluster.sh}} works out of the box, > creating a proper local cluster, but in two processes, rather than one. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Closed] (FLINK-5460) Add documentation how to use Flink with Docker
[ https://issues.apache.org/jira/browse/FLINK-5460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Lucas closed FLINK-5460. Resolution: Duplicate Fix Version/s: (was: 1.3.0) (was: 1.2.0) > Add documentation how to use Flink with Docker > -- > > Key: FLINK-5460 > URL: https://issues.apache.org/jira/browse/FLINK-5460 > Project: Flink > Issue Type: Sub-task > Components: Docker, Documentation >Reporter: Stephan Ewen > > {{docs/setup/docker.md}} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (FLINK-6399) Update default Hadoop download version
[ https://issues.apache.org/jira/browse/FLINK-6399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15996641#comment-15996641 ] Patrick Lucas commented on FLINK-6399: -- This duplicates FLINK-6021. (feel free to close either in favor of the other) > Update default Hadoop download version > -- > > Key: FLINK-6399 > URL: https://issues.apache.org/jira/browse/FLINK-6399 > Project: Flink > Issue Type: Bug > Components: Project Website >Reporter: Greg Hogan >Assignee: mingleizhang > > [Update|http://flink.apache.org/downloads.html] "If you don’t want to do > this, pick the Hadoop 1 version." since Hadoop 1 versions are no longer > provided. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (FLINK-5460) Add documentation how to use Flink with Docker
[ https://issues.apache.org/jira/browse/FLINK-5460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Lucas updated FLINK-5460: - Component/s: Docker > Add documentation how to use Flink with Docker > -- > > Key: FLINK-5460 > URL: https://issues.apache.org/jira/browse/FLINK-5460 > Project: Flink > Issue Type: Sub-task > Components: Docker, Documentation >Reporter: Stephan Ewen > Fix For: 1.2.0, 1.3.0 > > > {{docs/setup/docker.md}} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (FLINK-6369) Better support for overlay networks
[ https://issues.apache.org/jira/browse/FLINK-6369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15981409#comment-15981409 ] Patrick Lucas commented on FLINK-6369: -- I think there are a number of solutions that fix these problems without compromising any preexisting SSL support within Flink. For instance, having Taskmanagers register their IP with the Jobmanager by default (afaik like it was in a pretty recent version of Flink) but allowing that to be overridden. Also, relying on SSL hostname verification in an environment like Kubernetes seems pretty dubious. Even if you have a wildcard cert to use, you'd have to mount or ship the cert and key to _every Flink container_. At that point, why not just also include a trust store that contained only a single cert the whole cluster uses, and ignore hostname verification? For external communication with the cluster: absolutely. It totally makes sense to have a "real" cert in front of the web UI, for instance, but then it's up to you to have the CN/alt names set up appropriately. But for intra-cluster communication I think that it should be possible and supported to use hostname-valid certs, but the default behavior should keep things simple for the vastly more common case of SSL-lessness. > Better support for overlay networks > --- > > Key: FLINK-6369 > URL: https://issues.apache.org/jira/browse/FLINK-6369 > Project: Flink > Issue Type: Improvement > Components: Docker, Network >Affects Versions: 1.2.0 >Reporter: Patrick Lucas > Fix For: 1.3.1 > > > Running Flink in an environment that utilizes an overlay network > (containerized environments like Kubernetes or Docker Compose, or cloud > platforms like AWS VPC) poses various challenges related to networking. > The core problem is that in these environments, applications are frequently > addressed by a name different from that with which the application sees > itself. 
> For instance, it is plausible that the Flink UI (served by the Jobmanager) is > accessed via an ELB, which poses a problem in HA mode when the non-leader UI > returns an HTTP redirect to the leader—but the user may not be able to > connect directly to the leader. > Or, if a user is using [Docker > Compose|https://github.com/apache/flink/blob/aa21f853ab0380ec1f68ae1d0b7c8d9268da4533/flink-contrib/docker-flink/docker-compose.yml], > they cannot submit a job via the CLI since there is a mismatch between the > name used to address the Jobmanager and what the Jobmanager perceives as its > hostname. (see \[1] below for more detail) > > h3. Problems and proposed solutions > There are four instances of this issue that I've run into so far: > h4. Jobmanagers must be addressed by the same name they are configured with > due to limitations of Akka > Akka enforces that messages it receives are addressed with the hostname it is > configured with. Newer versions of Akka (>= 2.4) than what Flink uses > (2.3-custom) have support for accepting messages with the "wrong" hostname, > but it is limited to a single "external" hostname. > In many environments, it is likely that not all parties that want to connect > to the Jobmanager have the same way of addressing it (e.g. the ELB example > above). Other similarly-used protocols like HTTP generally don't have this > restriction: if you connect on a socket and send a well-formed message, the > system assumes that it is the desired recipient. > One solution is to not use Akka at all when communicating with the cluster > from the outside, perhaps using an HTTP API instead. This would be somewhat > involved, and probably best left as a longer-term goal. > A more immediate solution would be to override this behavior within Flakka, > the custom fork of Akka currently in use by Flink. I'm not sure how much > effort this would take. > h4. The Jobmanager needs to be able to address the Taskmanagers for e.g. 
> metrics collection > Having the Taskmanagers register themselves by IP is probably the best > solution here. It's a reasonable assumption that IPs can always be used for > communication between the nodes of a single cluster. Asking that each > Taskmanager container have a resolvable hostname is unreasonable. > h4. Jobmanagers in HA mode send HTTP redirects to URLs that aren't externally > resolvable/routable > If multiple Jobmanagers are used in HA mode, HTTP requests to non-leaders > (such as if you put a Kubernetes Service in front of all Jobmanagers in a > cluster) get redirected to the (supposed) hostname of the leader, but this is > potentially unresolvable/unroutable externally. > Enabling non-leader Jobmanagers to proxy API calls to the leader would solve > this. The non-leaders could even serve static asset requests (e.g. for css or > js files) directly. > h4. Queryable state
[jira] [Created] (FLINK-6369) Better support for overlay networks
Patrick Lucas created FLINK-6369: Summary: Better support for overlay networks Key: FLINK-6369 URL: https://issues.apache.org/jira/browse/FLINK-6369 Project: Flink Issue Type: Improvement Components: Docker, Network Affects Versions: 1.2.0 Reporter: Patrick Lucas Running Flink in an environment that utilizes an overlay network (containerized environments like Kubernetes or Docker Compose, or cloud platforms like AWS VPC) poses various challenges related to networking. The core problem is that in these environments, applications are frequently addressed by a name different from that with which the application sees itself. For instance, it is plausible that the Flink UI (served by the Jobmanager) is accessed via an ELB, which poses a problem in HA mode when the non-leader UI returns an HTTP redirect to the leader—but the user may not be able to connect directly to the leader. Or, if a user is using [Docker Compose|https://github.com/apache/flink/blob/aa21f853ab0380ec1f68ae1d0b7c8d9268da4533/flink-contrib/docker-flink/docker-compose.yml], they cannot submit a job via the CLI since there is a mismatch between the name used to address the Jobmanager and what the Jobmanager perceives as its hostname. (see \[1] below for more detail) h3. Problems and proposed solutions There are four instances of this issue that I've run into so far: h4. Jobmanagers must be addressed by the same name they are configured with due to limitations of Akka Akka enforces that messages it receives are addressed with the hostname it is configured with. Newer versions of Akka (>= 2.4) than what Flink uses (2.3-custom) have support for accepting messages with the "wrong" hostname, but it is limited to a single "external" hostname. In many environments, it is likely that not all parties that want to connect to the Jobmanager have the same way of addressing it (e.g. the ELB example above). 
Other similarly-used protocols like HTTP generally don't have this restriction: if you connect on a socket and send a well-formed message, the system assumes that it is the desired recipient. One solution is to not use Akka at all when communicating with the cluster from the outside, perhaps using an HTTP API instead. This would be somewhat involved, and probably best left as a longer-term goal. A more immediate solution would be to override this behavior within Flakka, the custom fork of Akka currently in use by Flink. I'm not sure how much effort this would take. h4. The Jobmanager needs to be able to address the Taskmanagers for e.g. metrics collection Having the Taskmanagers register themselves by IP is probably the best solution here. It's a reasonable assumption that IPs can always be used for communication between the nodes of a single cluster. Asking that each Taskmanager container have a resolvable hostname is unreasonable. h4. Jobmanagers in HA mode send HTTP redirects to URLs that aren't externally resolvable/routable If multiple Jobmanagers are used in HA mode, HTTP requests to non-leaders (such as if you put a Kubernetes Service in front of all Jobmanagers in a cluster) get redirected to the (supposed) hostname of the leader, but this is potentially unresolvable/unroutable externally. Enabling non-leader Jobmanagers to proxy API calls to the leader would solve this. The non-leaders could even serve static asset requests (e.g. for css or js files) directly. h4. Queryable state requests involve direct communication with Taskmanagers Currently, queryable state requests involve communication between the client and the Jobmanager (for key partitioning lookups) and between the client and all Taskmanagers. If the client is inside the network (as would be common in production use-cases where high-volume lookups are required) this is a non-issue, but problems crop up if the client is outside the network. 
For the communication with the Jobmanager, a similar solution as above can be used: if all Jobmanagers can service all key partitioning lookup requests (e.g. by proxying) then a simple Service can be used. The story is a bit different for the Taskmanagers. The partitioning lookup to the Jobmanager would return the name of the particular Taskmanager that owned the desired data, but that name (likely an IP, as proposed in the second section above) is not necessarily resolvable/routable from the client. In the context of Kubernetes, where individual containers are generally not addressable, a very ugly solution would involve creating a Service for each Taskmanager, then cleverly configuring things such that the same name could be used to address a specific Taskmanager both inside and outside the network. \[2] A much nicer solution would be, like in the previous section, to enable Taskmanagers to proxy any queryable state lookup to the appropriate
[jira] [Updated] (FLINK-6369) Better support for overlay networks
[ https://issues.apache.org/jira/browse/FLINK-6369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Lucas updated FLINK-6369: - Fix Version/s: 1.3.1 > Better support for overlay networks > --- > > Key: FLINK-6369 > URL: https://issues.apache.org/jira/browse/FLINK-6369 > Project: Flink > Issue Type: Improvement > Components: Docker, Network >Affects Versions: 1.2.0 >Reporter: Patrick Lucas > Fix For: 1.3.1 > > > Running Flink in an environment that utilizes an overlay network > (containerized environments like Kubernetes or Docker Compose, or cloud > platforms like AWS VPC) poses various challenges related to networking. > The core problem is that in these environments, applications are frequently > addressed by a name different from that with which the application sees > itself. > For instance, it is plausible that the Flink UI (served by the Jobmanager) is > accessed via an ELB, which poses a problem in HA mode when the non-leader UI > returns an HTTP redirect to the leader—but the user may not be able to > connect directly to the leader. > Or, if a user is using [Docker > Compose|https://github.com/apache/flink/blob/aa21f853ab0380ec1f68ae1d0b7c8d9268da4533/flink-contrib/docker-flink/docker-compose.yml], > they cannot submit a job via the CLI since there is a mismatch between the > name used to address the Jobmanager and what the Jobmanager perceives as its > hostname. (see \[1] below for more detail) > > h3. Problems and proposed solutions > There are four instances of this issue that I've run into so far: > h4. Jobmanagers must be addressed by the same name they are configured with > due to limitations of Akka > Akka enforces that messages it receives are addressed with the hostname it is > configured with. Newer versions of Akka (>= 2.4) than what Flink uses > (2.3-custom) have support for accepting messages with the "wrong" hostname, > but it is limited to a single "external" hostname. 
> In many environments, it is likely that not all parties that want to connect > to the Jobmanager have the same way of addressing it (e.g. the ELB example > above). Other similarly-used protocols like HTTP generally don't have this > restriction: if you connect on a socket and send a well-formed message, the > system assumes that it is the desired recipient. > One solution is to not use Akka at all when communicating with the cluster > from the outside, perhaps using an HTTP API instead. This would be somewhat > involved, and probably best left as a longer-term goal. > A more immediate solution would be to override this behavior within Flakka, > the custom fork of Akka currently in use by Flink. I'm not sure how much > effort this would take. > h4. The Jobmanager needs to be able to address the Taskmanagers for e.g. > metrics collection > Having the Taskmanagers register themselves by IP is probably the best > solution here. It's a reasonable assumption that IPs can always be used for > communication between the nodes of a single cluster. Asking that each > Taskmanager container have a resolvable hostname is unreasonable. > h4. Jobmanagers in HA mode send HTTP redirects to URLs that aren't externally > resolvable/routable > If multiple Jobmanagers are used in HA mode, HTTP requests to non-leaders > (such as if you put a Kubernetes Service in front of all Jobmanagers in a > cluster) get redirected to the (supposed) hostname of the leader, but this is > potentially unresolvable/unroutable externally. > Enabling non-leader Jobmanagers to proxy API calls to the leader would solve > this. The non-leaders could even serve static asset requests (e.g. for css or > js files) directly. > h4. Queryable state requests involve direct communication with Taskmanagers > Currently, queryable state requests involve communication between the client > and the Jobmanager (for key partitioning lookups) and between the client and > all Taskmanagers. 
> If the client is inside the network (as would be common in production > use-cases where high-volume lookups are required) this is a non-issue, but > problems crop up if the client is outside the network. > For the communication with the Jobmanager, a similar solution as above can be > used: if all Jobmanagers can service all key partitioning lookup requests > (e.g. by proxying) then a simple Service can be used. > The story is a bit different for the Taskmanagers. The partitioning lookup to > the Jobmanager would return the name of the particular Taskmanager that owned > the desired data, but that name (likely an IP, as proposed in the second > section above) is not necessarily resolvable/routable from the client. > In the context of Kubernetes, where individual containers are generally not > addressable, a very ugly solution would involve creating a Service
[jira] [Created] (FLINK-6330) Improve Docker documentation
Patrick Lucas created FLINK-6330: Summary: Improve Docker documentation Key: FLINK-6330 URL: https://issues.apache.org/jira/browse/FLINK-6330 Project: Flink Issue Type: Bug Components: Docker Affects Versions: 1.2.0 Reporter: Patrick Lucas Assignee: Patrick Lucas Fix For: 1.2.2 The "Docker" page in the docs exists but is blank. Add something useful here, including references to the official images that should exist once 1.2.1 is released, and add a brief "Kubernetes" page as well, referencing the [helm chart|https://github.com/docker-flink/examples]. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (FLINK-6322) Mesos task labels
[ https://issues.apache.org/jira/browse/FLINK-6322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15973018#comment-15973018 ] Patrick Lucas commented on FLINK-6322: -- This could likely be modeled similarly to `yarn.tags`. > Mesos task labels > - > > Key: FLINK-6322 > URL: https://issues.apache.org/jira/browse/FLINK-6322 > Project: Flink > Issue Type: Improvement > Components: Mesos >Reporter: Eron Wright >Priority: Minor > > Task labels serve many purposes in Mesos, such as tagging tasks for > log-aggregation purposes. > I propose a new configuration setting for a list of 'key=value' labels to be > applied to TM instances. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (FLINK-6300) PID1 of docker images does not behave correctly
[ https://issues.apache.org/jira/browse/FLINK-6300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Lucas updated FLINK-6300: - Affects Version/s: (was: 2.0.0) 1.2.0 > PID1 of docker images does not behave correctly > --- > > Key: FLINK-6300 > URL: https://issues.apache.org/jira/browse/FLINK-6300 > Project: Flink > Issue Type: Bug > Components: Docker >Affects Versions: 1.2.0, 1.1.4 > Environment: all >Reporter: kathleen sharp >Assignee: Patrick Lucas >Priority: Minor > > When running the task manager and job manager docker images the process with > PID1 is a bash script. > There is a problem in using bash as the PID1 process in a docker > container as docker sends SIGTERM, but bash doesn't send this to its > child processes. > This means for example that if a container was ever killed and a child > process had a file open then the file may get corrupted. > It's covered in more detail in a blog post here: > https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/ > From the mailing list (Nico): > "Some background: > Although docker-entrypoint.sh uses "exec" to run succeeding bash scripts for > jobmanager.sh and taskmanager.sh, respectively, and thus replaces itself with > these scripts, they do not seem to use exec themselves for foreground > processes and thus may run into the problem you described. > I may be wrong, but I did not find any other fallback to handle this in the > current code base." > Potentially useful information: > Docker 1.13 added an init flag: > "You can use the --init flag to indicate that an init process should be used > as the PID 1 in the container. Specifying an init process ensures the usual > responsibilities of an init system, such as reaping zombie processes, are > performed inside the created container." 
> from: > https://docs.docker.com/engine/reference/run/#restart-policies---restart > perhaps the fix could be just to update readme for these images to specify to > use this flag. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
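The entrypoint fix Nico alludes to (ending scripts with `exec` for the foreground process) works because `exec` replaces the shell rather than forking, so the long-running process keeps the shell's PID and receives docker's SIGTERM directly. A minimal demonstration of that property, unrelated to the Flink scripts themselves:

```shell
# `echo $$` prints the shell's PID; after `exec`, the replacement shell has
# the same PID, so the two lines below are identical. A forked child (no
# exec) would get a new PID and would not be PID 1's signal recipient.
pids=$(sh -c 'echo $$; exec sh -c "echo \$\$"')
echo "$pids"
```

The alternative mentioned in the report, `docker run --init`, sidesteps the problem entirely by putting a minimal init in front of the container command as PID 1.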
[jira] [Assigned] (FLINK-6300) PID1 of docker images does not behave correctly
[ https://issues.apache.org/jira/browse/FLINK-6300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Lucas reassigned FLINK-6300: Assignee: Patrick Lucas > PID1 of docker images does not behave correctly > --- > > Key: FLINK-6300 > URL: https://issues.apache.org/jira/browse/FLINK-6300 > Project: Flink > Issue Type: Bug > Components: Docker >Affects Versions: 2.0.0, 1.1.4 > Environment: all >Reporter: kathleen sharp >Assignee: Patrick Lucas >Priority: Minor > > When running the task manager and job manager docker images the process with > PID1 is a bash script. > There is a problem in using bash as the PID1 process in a docker > container as docker sends SIGTERM, but bash doesn't send this to its > child processes. > This means for example that if a container was ever killed and a child > process had a file open then the file may get corrupted. > It's covered in more detail in a blog post here: > https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/ > From the mailing list (Nico): > "Some background: > Although docker-entrypoint.sh uses "exec" to run succeeding bash scripts for > jobmanager.sh and taskmanager.sh, respectively, and thus replaces itself with > these scripts, they do not seem to use exec themselves for foreground > processes and thus may run into the problem you described. > I may be wrong, but I did not find any other fallback to handle this in the > current code base." > Potentially useful information: > Docker 1.13 added an init flag: > "You can use the --init flag to indicate that an init process should be used > as the PID 1 in the container. Specifying an init process ensures the usual > responsibilities of an init system, such as reaping zombie processes, are > performed inside the created container." > from: > https://docs.docker.com/engine/reference/run/#restart-policies---restart > perhaps the fix could be just to update readme for these images to specify to > use this flag. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Closed] (FLINK-6308) Task managers are not attaching to job manager on macos
[ https://issues.apache.org/jira/browse/FLINK-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Lucas closed FLINK-6308. Resolution: Fixed Fix Version/s: 1.2.1 `docker-compose.yml` is already fixed in master, and [~iemejia] has a branch to remove the bluemix docker compose file and related references. > Task managers are not attaching to job manager on macos > --- > > Key: FLINK-6308 > URL: https://issues.apache.org/jira/browse/FLINK-6308 > Project: Flink > Issue Type: Bug > Components: Docker >Reporter: Mateusz Zakarczemny > Fix For: 1.2.1 > > > I'm using flink 1.2.0. On macOS task managers were not joining the cluster. I was > able to fix that by removing quotes from docker-compose: > {code} > diff --git a/flink-contrib/docker-flink/docker-compose-bluemix.yml > b/flink-contrib/docker-flink/docker-compose-bluemix.yml > index b667a0d89a..55c161766a 100644 > --- a/flink-contrib/docker-flink/docker-compose-bluemix.yml > +++ b/flink-contrib/docker-flink/docker-compose-bluemix.yml > @@ -23,11 +23,11 @@ services: >jobmanager: > #image example: registry.eu-gb.bluemix.net/markussaidhiho/flink > image: ${IMAGENAME} > -container_name: "jobmanager" > +container_name: jobmanager > expose: > - - "6123" > + - 6123 > ports: > - - "48081:8081" > + - 48081:8081 > command: jobmanager >taskmanager: > @@ -41,4 +41,4 @@ services: > links: >- "jobmanager:jobmanager" > environment: > - - JOB_MANAGER_RPC_ADDRESS="jobmanager" > + - JOB_MANAGER_RPC_ADDRESS=jobmanager > diff --git a/flink-contrib/docker-flink/docker-compose.yml > b/flink-contrib/docker-flink/docker-compose.yml > index 6a1335360d..0a644681f3 100644 > --- a/flink-contrib/docker-flink/docker-compose.yml > +++ b/flink-contrib/docker-flink/docker-compose.yml > @@ -27,7 +27,7 @@ services: >- "48081:8081" > command: jobmanager > environment: > - - JOB_MANAGER_RPC_ADDRESS="jobmanager" > + - JOB_MANAGER_RPC_ADDRESS=jobmanager >taskmanager: > image: flink > @@ -40,4 +40,4 @@ services: > links: >- 
"jobmanager:jobmanager" > environment: > - - JOB_MANAGER_RPC_ADDRESS="jobmanager" > + - JOB_MANAGER_RPC_ADDRESS=jobmanager > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
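For reference, a minimal compose fragment in the corrected (unquoted) form distilled from the diff above — a sketch, not the exact repository file. The pitfall is that in list-style {{environment}} entries the quotes become part of the value, so {{JOB_MANAGER_RPC_ADDRESS="jobmanager"}} makes the taskmanager try to resolve the hostname `"jobmanager"`, quotes included:

```yaml
services:
  jobmanager:
    image: flink
    ports:
      - "48081:8081"   # quoting a port mapping is fine (and avoids YAML misparses)
    command: jobmanager
    environment:
      # unquoted: the value must be exactly `jobmanager`
      - JOB_MANAGER_RPC_ADDRESS=jobmanager
  taskmanager:
    image: flink
    command: taskmanager
    links:
      - "jobmanager:jobmanager"
    environment:
      - JOB_MANAGER_RPC_ADDRESS=jobmanager
```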
[jira] [Created] (FLINK-6194) More broken links in docs
Patrick Lucas created FLINK-6194: Summary: More broken links in docs Key: FLINK-6194 URL: https://issues.apache.org/jira/browse/FLINK-6194 Project: Flink Issue Type: Bug Reporter: Patrick Lucas Assignee: Patrick Lucas My script noticed a few broken links that made it into the docs. I'll fix them up. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Closed] (FLINK-5801) Queryable State from Scala job/client fails with key of type Long
[ https://issues.apache.org/jira/browse/FLINK-5801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Lucas closed FLINK-5801. Resolution: Not A Bug > Queryable State from Scala job/client fails with key of type Long > - > > Key: FLINK-5801 > URL: https://issues.apache.org/jira/browse/FLINK-5801 > Project: Flink > Issue Type: Bug > Components: Queryable State >Affects Versions: 1.2.0 > Environment: Flink 1.2.0 > Scala 2.10 >Reporter: Patrick Lucas >Assignee: Ufuk Celebi > Attachments: OrderFulfillment.scala, OrderFulfillmentStateQuery.scala > > > While working on a demo Flink job, to try out Queryable State, I exposed some > state of type Long -> custom class via the Query server. However, the query > server returned an exception when I tried to send a query: > {noformat} > Exception in thread "main" java.lang.RuntimeException: Failed to query state > backend for query 0. Caused by: java.io.IOException: Unable to deserialize > key and namespace. This indicates a mismatch in the key/namespace serializers > used by the KvState instance and this access. 
> at > org.apache.flink.runtime.query.netty.message.KvStateRequestSerializer.deserializeKeyAndNamespace(KvStateRequestSerializer.java:392) > at > org.apache.flink.runtime.state.heap.AbstractHeapState.getSerializedValue(AbstractHeapState.java:130) > at > org.apache.flink.runtime.query.netty.KvStateServerHandler$AsyncKvStateQueryTask.run(KvStateServerHandler.java:220) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.EOFException > at > org.apache.flink.runtime.util.DataInputDeserializer.readLong(DataInputDeserializer.java:217) > at > org.apache.flink.api.common.typeutils.base.LongSerializer.deserialize(LongSerializer.java:69) > at > org.apache.flink.api.common.typeutils.base.LongSerializer.deserialize(LongSerializer.java:27) > at > org.apache.flink.runtime.query.netty.message.KvStateRequestSerializer.deserializeKeyAndNamespace(KvStateRequestSerializer.java:379) > ... 7 more > at > org.apache.flink.runtime.query.netty.KvStateServerHandler$AsyncKvStateQueryTask.run(KvStateServerHandler.java:257) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > I banged my head against this for a while, then per [~jgrier]'s suggestion I > tried simply changing the key from Long to String (modifying the two > {{keyBy}} calls and the {{keySerializer}} {{TypeHint}} in the attached code) > and it started working perfectly. 
> cc [~uce] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (FLINK-5634) Flink should not always redirect stdout to a file.
[ https://issues.apache.org/jira/browse/FLINK-5634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Lucas resolved FLINK-5634. -- Resolution: Duplicate Closing as essentially a dupe of FLINK-4326. > Flink should not always redirect stdout to a file. > -- > > Key: FLINK-5634 > URL: https://issues.apache.org/jira/browse/FLINK-5634 > Project: Flink > Issue Type: Bug > Components: Docker >Affects Versions: 1.2.0 >Reporter: Jamie Grier >Assignee: Jamie Grier > > Flink always redirects stdout to a file. While often convenient this isn't > always what people want. The most obvious case of this is a Docker > deployment. > It should be possible to have Flink log to stdout. > Here is a PR for this: https://github.com/apache/flink/pull/3204 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (FLINK-3026) Publish the flink docker container to the docker registry
[ https://issues.apache.org/jira/browse/FLINK-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15923956#comment-15923956 ] Patrick Lucas commented on FLINK-3026: -- Personally I like the idea of having as little code as possible in the docker-entrypoint.sh script. I think it should be designed to basically work in all Flink versions going forward. I suppose that would mean that we should have some scripts on the PATH named 'jobmanager' and 'taskmanager' so that docker-entrypoint.sh can effectively just be: {code} #!/bin/bash -e # Drop root privs code here exec "$@" {code} And absolutely these builds will be automated—see my note in README.md about generating the .travis.yml config. I just built some images as a proof of concept and to think about how tagging will work. We should include a script like the generate-stackbrew-library.sh one that many official Dockerfile repos include to generate the file that is referenced by the [docker-library/official-images|https://github.com/docker-library/official-images/] repo to actually build and tag all the right permutations. And regarding this repo living in my personal account, I consider that just a temporary arrangement until we can get these included in docker-library. You're welcome to clone the repo if you want to experiment with kicking off builds. Finally, could you expound a little more on your opinion about the relationship between a Dockerfile included in apache/flink and these official images? I'm not sure if I see the benefit of trying to keep it around if we have official, publicly available, well-documented reference Dockerfiles and images available. The one argument I see is building a Docker image from your work tree if you're a developer, but in that case I think it's a lot less important to maintain it very actively. It could even just be a Dockerfile that bases itself on the official image and then replaces the Flink release within with your work tree! 
> Publish the flink docker container to the docker registry > - > > Key: FLINK-3026 > URL: https://issues.apache.org/jira/browse/FLINK-3026 > Project: Flink > Issue Type: Task > Components: Build System, Docker >Reporter: Omer Katz >Assignee: Patrick Lucas > Labels: Deployment, Docker > > There's a dockerfile that can be used to build a docker container already in > the repository. It'd be awesome to just be able to pull it instead of > building it ourselves. > The dockerfile can be found at > https://github.com/apache/flink/tree/master/flink-contrib/docker-flink > It also doesn't point to the latest version of Flink which I fixed in > https://github.com/apache/flink/pull/1366 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
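The proposed entrypoint can be exercised without the real wrapper scripts. A hypothetical sketch: {{entrypoint}} stands in for the minimal docker-entrypoint.sh above (privilege-dropping elided), and because it ends in `exec`, whatever command name is passed — `jobmanager`, `taskmanager`, or anything else docker-library style — replaces the shell and becomes PID 1.

```shell
#!/bin/sh -e
# Hypothetical stand-in for the minimal docker-entrypoint.sh sketched in the
# comment: drop root privileges (elided here), then exec the requested
# command so it replaces this shell as PID 1. Wrapper scripts named
# `jobmanager` and `taskmanager` would live on the PATH, keeping this
# script version-agnostic.
entrypoint() {
  # (root-privilege drop, e.g. via gosu/su-exec, elided)
  exec "$@"
}
```

A wrapper script itself might be as small as {{exec "$FLINK_HOME/bin/jobmanager.sh" start-foreground "$@"}} (path and flag assumed from the thread, not verified against any release).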
[jira] [Comment Edited] (FLINK-3026) Publish the flink docker container to the docker registry
[ https://issues.apache.org/jira/browse/FLINK-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15906494#comment-15906494 ] Patrick Lucas edited comment on FLINK-3026 at 3/12/17 11:31 AM:

I've pushed some of the images built from the Dockerfiles in that repo @[aafb1e4b|https://github.com/patricklucas/docker-flink/tree/aafb1e4bc4a1d6d1fe706cbe457376385e3f8d02] (after which were some changes by [~iemejia] that I want to discuss) to the Docker Hub repo [plucas/flink|https://hub.docker.com/r/plucas/flink/tags/], along with the relevant tags.

Specifically, I pushed:
* 1.1.4-hadoop27-scala_2.11-alpine
* 1.2.0-hadoop26-scala_2.10-alpine
* 1.2.0-hadoop26-scala_2.11-alpine
* 1.2.0-hadoop27-scala_2.10-alpine
* 1.2.0-hadoop27-scala_2.11-alpine

to demonstrate various tag permutations, e.g.:
* latest-alpine -> 1.2.0-hadoop27-scala_2.11-alpine
* 1.1-alpine -> 1.1.4-hadoop27-scala_2.11-alpine
* 1.2-hadoop26-alpine -> 1.2.0-hadoop26-scala_2.11-alpine
* 1.2-scala_2.10-alpine -> 1.2.0-hadoop27-scala_2.10-alpine
[jira] [Commented] (FLINK-3026) Publish the flink docker container to the docker registry
[ https://issues.apache.org/jira/browse/FLINK-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15906494#comment-15906494 ] Patrick Lucas commented on FLINK-3026: --

I've pushed some of the images built from the Dockerfiles in that repo @[aafb1e4b|https://github.com/patricklucas/docker-flink/commit/aafb1e4bc4a1d6d1fe706cbe457376385e3f8d02] (after which were some changes by [~iemejia] that I want to discuss) to the Docker Hub repo [plucas/flink|https://hub.docker.com/r/plucas/flink/tags/], along with the relevant tags.

Specifically, I pushed:
* 1.1.4-hadoop27-scala_2.11-alpine
* 1.2.0-hadoop26-scala_2.10-alpine
* 1.2.0-hadoop26-scala_2.11-alpine
* 1.2.0-hadoop27-scala_2.10-alpine
* 1.2.0-hadoop27-scala_2.11-alpine

to demonstrate various tag permutations, e.g.:
* latest-alpine -> 1.2.0-hadoop27-scala_2.11-alpine
* 1.1-alpine -> 1.1.4-hadoop27-scala_2.11-alpine
* 1.2-hadoop26-alpine -> 1.2.0-hadoop26-scala_2.11-alpine
* 1.2-scala_2.10-alpine -> 1.2.0-hadoop27-scala_2.10-alpine
[jira] [Created] (FLINK-6021) Downloads page references "Hadoop 1 version" which isn't an option
Patrick Lucas created FLINK-6021: Summary: Downloads page references "Hadoop 1 version" which isn't an option Key: FLINK-6021 URL: https://issues.apache.org/jira/browse/FLINK-6021 Project: Flink Issue Type: Bug Components: Project Website Reporter: Patrick Lucas

The downloads page says:
{quote}
Apache Flink® 1.2.0 is our latest stable release. You don’t have to install Hadoop to use Flink, but if you plan to use Flink with data stored in Hadoop, pick the version matching your installed Hadoop version. If you don’t want to do this, pick the Hadoop 1 version.
{quote}
But Hadoop 1 appears to no longer be an available alternative.
[jira] [Commented] (FLINK-3026) Publish the flink docker container to the docker registry
[ https://issues.apache.org/jira/browse/FLINK-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903317#comment-15903317 ] Patrick Lucas commented on FLINK-3026: --

Yes, I gave you contributor permissions. This repo is really just a stop-gap until we find a permanent home for these scripts, such as in docker-library.

I'm fairly convinced that these should be versioned independently of Flink (in a separate repo), as they are updated as a consequence of Flink releases and are not themselves part of a given release. Moreover, the scripts herein may relate to many different "stable" branches of Flink.

And I don't really see how it would be feasible to make the contents of this repo somehow "downstream" from the Docker support included in apache/flink as I think they serve different purposes. In apache/flink, having a Dockerfile to package up the current tree would be useful for development, and as a baseline if a user wanted to create their own images from scratch. The docker-flink repo meanwhile encodes the generation of all Dockerfile variants for all "stable" versions of Flink.

Finally, I think a tenet of this work (Docker support) going forward should be that the Dockerfile templates really should not change much over time. Instead, we should try to make any changes in functionality in Flink itself, and keep the Dockerfiles (and docker-entrypoint.sh scripts) as simple as possible. If we're doing things right, this code should very rarely change, except when we need to generate Dockerfiles for a new Flink release.
[jira] [Commented] (FLINK-3026) Publish the flink docker container to the docker registry
[ https://issues.apache.org/jira/browse/FLINK-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903162#comment-15903162 ] Patrick Lucas commented on FLINK-3026: -- The new [docker-flink|https://github.com/patricklucas/docker-flink/] repo is up.
[jira] [Commented] (FLINK-3026) Publish the flink docker container to the docker registry
[ https://issues.apache.org/jira/browse/FLINK-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902792#comment-15902792 ] Patrick Lucas commented on FLINK-3026: --

Following from discussion on [#3494|https://github.com/apache/flink/pull/3494], it's apparent that more work is required to provide official Flink Docker images than simply improving the Dockerfile in the Flink repo.

I explored the official Docker Hub repos and looked at how the first five Apache projects I saw, [httpd|https://hub.docker.com/_/httpd/], [tomcat|https://hub.docker.com/_/tomcat/], [cassandra|https://hub.docker.com/_/cassandra/], [maven|https://hub.docker.com/_/maven/], and [solr|https://hub.docker.com/_/solr/], have their Dockerfiles hosted. Common between all of them is that they all have a dedicated git repo to host the Dockerfiles, and none are hosted by the Apache project. Three ([httpd|https://github.com/docker-library/httpd], [tomcat|https://github.com/docker-library/tomcat], and [cassandra|https://github.com/docker-library/cassandra]) are hosted within the [docker-library|https://github.com/docker-library] GitHub org, one ([solr|https://github.com/docker-solr/docker-solr]) has a dedicated GitHub org ([docker-solr|https://github.com/docker-solr]), and one ([maven|https://github.com/carlossg/docker-maven]) is hosted within a Maven contributor and ASF member's GitHub account ([carlossg|https://github.com/carlossg]).

Adding Flink to the docker-library repo is thus the route most in line with these projects, though there may be some implications I'm not aware of.

In the meantime, I will create a repo under [my own GitHub account|https://github.com/patricklucas] using the Apache license called docker-flink and grant [~iemejia] commit access so we can work on the Dockerfiles and supporting scripts. Does this sound like a good starting point [~iemejia] & [~jgrier]?
[jira] [Updated] (FLINK-5997) Docker build tools should inline GPG key to verify HTTP Flink release download
[ https://issues.apache.org/jira/browse/FLINK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Lucas updated FLINK-5997: - Description: See examples [here|https://github.com/docker-solr/docker-solr/blob/8afee8bf17a7a3d1fdc453643abfdb5cf2e822d0/6.4/Dockerfile] and [here|https://github.com/docker-library/official-images/blob/8b10b36d19a8cfba56bf4352c8ae98d3b6697eb3/README.md#init]. (was: See title)
[jira] [Created] (FLINK-5997) Docker build tools should inline GPG key to verify HTTP Flink release download
Patrick Lucas created FLINK-5997: Summary: Docker build tools should inline GPG key to verify HTTP Flink release download Key: FLINK-5997 URL: https://issues.apache.org/jira/browse/FLINK-5997 Project: Flink Issue Type: Improvement Components: Docker Reporter: Patrick Lucas Assignee: Patrick Lucas See title
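The pattern referenced in the examples above can be sketched as a Dockerfile fragment. Everything here — the download URL, the key fingerprint, and the keyserver — is a placeholder; the real values would come from the Flink release's KEYS file, not from this sketch:

```dockerfile
# Hypothetical fragment: verify an HTTP-downloaded Flink release
# against an inlined GPG key before unpacking it.
ENV FLINK_URL=http://archive.example.org/flink/flink-1.2.0.tgz \
    GPG_KEY=0000000000000000000000000000000000000000

RUN wget -O flink.tgz "$FLINK_URL" \
 && wget -O flink.tgz.asc "$FLINK_URL.asc" \
 && gpg --keyserver hkps://keyserver.example.org --recv-keys "$GPG_KEY" \
 && gpg --batch --verify flink.tgz.asc flink.tgz \
 && tar -xzf flink.tgz \
 && rm flink.tgz flink.tgz.asc
```

The point of the pattern is that only the signature travels over plain HTTP alongside the tarball; trust is anchored in the key baked into the image.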
[jira] [Assigned] (FLINK-5635) Improve Docker tooling to make it easier to build images and launch Flink via Docker tools
[ https://issues.apache.org/jira/browse/FLINK-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Lucas reassigned FLINK-5635: Assignee: Patrick Lucas (was: Jamie Grier)

> Improve Docker tooling to make it easier to build images and launch Flink via Docker tools
> Key: FLINK-5635 URL: https://issues.apache.org/jira/browse/FLINK-5635 Project: Flink Issue Type: Improvement Components: Docker Affects Versions: 1.2.0 Reporter: Jamie Grier Assignee: Patrick Lucas
>
> This is a bit of a catch-all ticket for general improvements to the Flink on Docker experience. Things to improve:
> - Make it possible to build a Docker image from your own flink-dist directory as well as official releases.
> - Make it possible to override the image name so a user can more easily publish these images to their Docker repository.
> - Provide scripts that show how to properly run on Docker Swarm or similar environments with overlay networking (Kubernetes) without using host networking.
> - Log to stdout rather than to files.
> - Work properly with docker-compose for local deployment as well as production deployments (Swarm/k8s).
[jira] [Assigned] (FLINK-4326) Flink start-up scripts should optionally start services on the foreground
[ https://issues.apache.org/jira/browse/FLINK-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Lucas reassigned FLINK-4326: Assignee: Greg Hogan

> Flink start-up scripts should optionally start services on the foreground
> Key: FLINK-4326 URL: https://issues.apache.org/jira/browse/FLINK-4326 Project: Flink Issue Type: Improvement Components: Startup Shell Scripts Affects Versions: 1.0.3 Reporter: Elias Levy Assignee: Greg Hogan
>
> This has previously been mentioned in the mailing list, but has not been addressed. Flink start-up scripts start the job and task managers in the background. This makes it difficult to integrate Flink with most process supervisory tools and init systems, including Docker. One can get around this via hacking the scripts or manually starting the right classes via Java, but it is a brittle solution.
>
> In addition to starting the daemons in the foreground, the start-up scripts should use exec instead of running the commands, so as to avoid forks. Many supervisory tools assume the PID of the process to be monitored is that of the process it first executes, and fork chains make it difficult for the supervisor to figure out what process to monitor. Specifically, jobmanager.sh and taskmanager.sh should exec flink-daemon.sh, and flink-daemon.sh should exec java.
[jira] [Resolved] (FLINK-4326) Flink start-up scripts should optionally start services on the foreground
[ https://issues.apache.org/jira/browse/FLINK-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Lucas resolved FLINK-4326. -- Resolution: Fixed
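The exec recommendation in FLINK-4326 can be demonstrated outside of Flink: because exec replaces the calling process in place, the PID a supervisor tracks is the PID of the final service. A minimal POSIX sketch — the inline wrapper here stands in for flink-daemon.sh and is purely illustrative, not Flink's actual script:

```python
import subprocess
import sys

# A "wrapper" that execs its target, as jobmanager.sh / flink-daemon.sh
# should: os.execvp replaces the wrapper process in place, so the PID
# the supervisor sees is the PID of the final service.
WRAPPER = (
    "import os, sys; "
    "os.execvp(sys.executable, [sys.executable, '-c', 'import os; print(os.getpid())'])"
)

proc = subprocess.Popen(
    [sys.executable, "-c", WRAPPER],
    stdout=subprocess.PIPE, text=True,
)
out, _ = proc.communicate()

# exec preserves the PID: the service reports the same PID the
# supervisor (here, Popen) is tracking -- no fork chain in between.
assert int(out) == proc.pid
```

Had the wrapper spawned a child instead of exec-ing, the printed PID would differ from `proc.pid`, which is exactly the fork-chain problem the ticket describes.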
[jira] [Assigned] (FLINK-5843) Website/docs missing Cache-Control HTTP header, can serve stale data
[ https://issues.apache.org/jira/browse/FLINK-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Lucas reassigned FLINK-5843: Assignee: Patrick Lucas Priority: Minor (was: Major)

> Website/docs missing Cache-Control HTTP header, can serve stale data
> Key: FLINK-5843 URL: https://issues.apache.org/jira/browse/FLINK-5843 Project: Flink Issue Type: Bug Components: Project Website Reporter: Patrick Lucas Assignee: Patrick Lucas Priority: Minor
>
> When Flink 1.2.0 was released, I found that the [Flink downloads page|https://flink.apache.org/downloads.html] was out-of-date until I forced my browser to refresh the page. Upon investigation, I found that the principal pages of the website are served with only the following headers that relate to caching: Date, Last-Modified, and ETag.
>
> Since there is no Cache-Control header (or the older Expires or Pragma headers), browsers are left to their own heuristics as to how long to cache this content, which varies browser to browser. In some browsers, this heuristic is 10% of the difference between the Date and Last-Modified headers. I take this to mean that, if the content were last modified 90 days ago, and I last accessed it 5 days ago, my browser will serve a cached response for a further 3.5 days (10% * (90 days - 5 days) = 8.5 days; 5 days have elapsed, leaving 3.5 days).
>
> I'm not sure who at the ASF we should talk to about this, but I recommend we add the following header to any responses served from the Flink project website or official documentation website\[1]:
> {code}Cache-Control: max-age=0, must-revalidate{code}
> (Note this will only make browsers revalidate their caches; if the ETag of the cached content matches what the server still has, the server will return 304 Not Modified and omit the actual content.)
>
> \[1] Both the website hosted at flink.apache.org and the documentation hosted at ci.apache.org are affected.
[jira] [Created] (FLINK-5843) Website/docs missing Cache-Control HTTP header, can serve stale data
Patrick Lucas created FLINK-5843: Summary: Website/docs missing Cache-Control HTTP header, can serve stale data Key: FLINK-5843 URL: https://issues.apache.org/jira/browse/FLINK-5843 Project: Flink Issue Type: Bug Components: Project Website Reporter: Patrick Lucas
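The heuristic arithmetic described in FLINK-5843 can be checked with a few lines of code. A minimal sketch — the `heuristic_remaining` helper is hypothetical, and the 10% fraction mirrors the browser heuristic cited in the ticket rather than any particular browser's implementation:

```python
from datetime import timedelta

# Without Cache-Control/Expires, some browsers treat a response as
# fresh for a fraction (here 10%) of (Date - Last-Modified), measured
# at the time the response was fetched into the cache.
HEURISTIC_FRACTION = 0.10

def heuristic_remaining(modified_ago: timedelta, cached_ago: timedelta) -> timedelta:
    """How much longer a cached copy is served without revalidation."""
    # Age of the content at fetch time, times the heuristic fraction.
    freshness_lifetime = (modified_ago - cached_ago) * HEURISTIC_FRACTION
    remaining = freshness_lifetime - cached_ago
    return max(remaining, timedelta(0))

# Last modified 90 days ago, fetched and cached 5 days ago:
# lifetime = 10% * 85 days = 8.5 days; 5 days elapsed leaves 3.5 days.
print(heuristic_remaining(timedelta(days=90), timedelta(days=5)))
```

This reproduces the ticket's arithmetic, and also shows why `Cache-Control: max-age=0, must-revalidate` helps: it replaces the guessed lifetime with an explicit zero.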
[jira] [Assigned] (FLINK-5842) Wrong 'since' version for ElasticSearch 5.x connector
[ https://issues.apache.org/jira/browse/FLINK-5842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Lucas reassigned FLINK-5842: Assignee: Patrick Lucas

> Wrong 'since' version for ElasticSearch 5.x connector
> Key: FLINK-5842 URL: https://issues.apache.org/jira/browse/FLINK-5842 Project: Flink Issue Type: Bug Components: Documentation, Streaming Connectors Affects Versions: 1.3.0 Reporter: Dawid Wysakowicz Assignee: Patrick Lucas
>
> The documentation claims that ElasticSearch 5.x is supported since Flink 1.2.0 which is not true, as the support was merged after 1.2.0.
[jira] [Commented] (FLINK-5751) 404 in documentation
[ https://issues.apache.org/jira/browse/FLINK-5751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868981#comment-15868981 ] Patrick Lucas commented on FLINK-5751: --

Opened a PR fixing as many broken links as I could find using this method. The script I used:

{code}
#!/bin/bash -x

# Don't abort on any non-zero exit code
#set +e

# Crawl the docs, ignoring robots.txt, storing nothing locally
wget --spider -r -nd -nv -e robots=off -p -o spider.log http://localhost/flink-docs

# Abort for anything other than 0 and 4 ("Network failure")
status=$?
if [ $status -ne 0 ] && [ $status -ne 4 ]; then
    exit $status
fi

# Fail the build if any broken links are found
broken_links_str=$(grep -e 'Found [[:digit:]]\+ broken link' spider.log)
echo -e "\e[1;31m$broken_links_str\e[0m"
if [ -n "$broken_links_str" ]; then
    exit 1
fi
{code}

> 404 in documentation
> Key: FLINK-5751 URL: https://issues.apache.org/jira/browse/FLINK-5751 Project: Flink Issue Type: Bug Components: Documentation Reporter: Colin Breame Priority: Trivial
>
> This page: https://ci.apache.org/projects/flink/flink-docs-release-1.2/quickstart/setup_quickstart.html
> Contains a link with title "Flink on Windows" with URL:
> - https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/flink_on_windows
> This gives a 404. It should be:
> - https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/flink_on_windows.html
[jira] [Commented] (FLINK-5751) 404 in documentation
[ https://issues.apache.org/jira/browse/FLINK-5751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868901#comment-15868901 ] Patrick Lucas commented on FLINK-5751: --

Down to 15 now:

{noformat}
http://localhost/flink-docs/dev/stream/checkpointing
http://localhost/flink-docs/dev/api_concepts
http://localhost/flink-docs/dev/execution_configuration
http://localhost/flink-docs/dev/parallel
http://localhost/flink-docs/dev/restart_strategies
http://localhost/flink-docs/dev/stream/state
http://localhost/flink-docs/quickstart/run_example_quickstart
http://localhost/dev/api_concepts
http://localhost/dev/batch/index.html
http://localhost/flink-docs/quickstart/scala_api_quickstart
http://localhost/flink-docs/dev/datastream_api
http://localhost/flink-docs/quickstart/java_api_quickstart
http://localhost/flink-docs/dev/custom_serializers
http://localhost/flink-docs/setup/config
http://localhost/flink-docs/api/java/org/apache/flink/table/api/Table.html
{noformat}
[jira] [Created] (FLINK-5801) Queryable State from Scala job/client fails with key of type Long
Patrick Lucas created FLINK-5801: Summary: Queryable State from Scala job/client fails with key of type Long Key: FLINK-5801 URL: https://issues.apache.org/jira/browse/FLINK-5801 Project: Flink Issue Type: Bug Components: Queryable State Affects Versions: 1.2.0 Environment: Flink 1.2.0 Scala 2.10 Reporter: Patrick Lucas Attachments: OrderFulfillment.scala, OrderFulfillmentStateQuery.scala

While working on a demo Flink job, to try out Queryable State, I exposed some state of type Long -> custom class via the Query server. However, the query server returned an exception when I tried to send a query:

{noformat}
Exception in thread "main" java.lang.RuntimeException: Failed to query state backend for query 0.
Caused by: java.io.IOException: Unable to deserialize key and namespace. This indicates a mismatch in the key/namespace serializers used by the KvState instance and this access.
	at org.apache.flink.runtime.query.netty.message.KvStateRequestSerializer.deserializeKeyAndNamespace(KvStateRequestSerializer.java:392)
	at org.apache.flink.runtime.state.heap.AbstractHeapState.getSerializedValue(AbstractHeapState.java:130)
	at org.apache.flink.runtime.query.netty.KvStateServerHandler$AsyncKvStateQueryTask.run(KvStateServerHandler.java:220)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
	at org.apache.flink.runtime.util.DataInputDeserializer.readLong(DataInputDeserializer.java:217)
	at org.apache.flink.api.common.typeutils.base.LongSerializer.deserialize(LongSerializer.java:69)
	at org.apache.flink.api.common.typeutils.base.LongSerializer.deserialize(LongSerializer.java:27)
	at org.apache.flink.runtime.query.netty.message.KvStateRequestSerializer.deserializeKeyAndNamespace(KvStateRequestSerializer.java:379)
	... 7 more
	at org.apache.flink.runtime.query.netty.KvStateServerHandler$AsyncKvStateQueryTask.run(KvStateServerHandler.java:257)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
{noformat}

I banged my head against this for a while, then per [~jgrier]'s suggestion I tried simply changing the key from Long to String (modifying the two {{keyBy}} calls and the {{keySerializer}} {{TypeHint}} in the attached code) and it started working perfectly.

cc [~uce]
[jira] [Updated] (FLINK-5801) Queryable State from Scala job/client fails with key of type Long
[ https://issues.apache.org/jira/browse/FLINK-5801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Lucas updated FLINK-5801: - Attachment: OrderFulfillmentStateQuery.scala OrderFulfillment.scala
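The EOFException in FLINK-5801 is characteristic of a key-serializer mismatch: the server-side LongSerializer expects exactly 8 big-endian bytes for the key, and a buffer written by any shorter encoding fails mid-read. An illustrative sketch of that failure mode — this mimics the byte-level behavior, not Flink's actual classes:

```python
import struct

def deserialize_long_key(buf: bytes) -> int:
    # Like Flink's LongSerializer, read an 8-byte big-endian long; a
    # buffer produced by a different serializer may simply be too short.
    if len(buf) < 8:
        raise EOFError("buffer exhausted while reading long key")
    return struct.unpack(">q", buf[:8])[0]

# Key written correctly as a long: round-trips fine.
ok = deserialize_long_key(struct.pack(">q", 42))

# Key written by a mismatched serializer (here, a 4-byte int encoding):
try:
    deserialize_long_key(struct.pack(">i", 42))
except EOFError as exc:
    mismatch = str(exc)
```

Switching the key type to String, as in the workaround above, sidesteps the mismatch because both sides then agree on a length-prefixed encoding.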
[jira] [Commented] (FLINK-5751) 404 in documentation
[ https://issues.apache.org/jira/browse/FLINK-5751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860369#comment-15860369 ] Patrick Lucas commented on FLINK-5751: --

I've identified the links as these using wget. It doesn't give the page they were found on (web crawling, pool of URLs and all that) but it's pretty easy to grep for them:

{noformat}
http://localhost/flink-docs/dev/api_concepts
http://localhost/flink-docs/dev/execution_configuration
http://localhost/flink-docs/internals/state_backends.html
http://localhost/flink-docs/dev/parallel
http://localhost/flink-docs/dev/restart_strategies
http://localhost/flink-docs/dev/state.html
http://localhost/flink-docs/dev/stream/state
http://localhost/flink-docs/quickstart/run_example_quickstart
http://localhost/dev/api_concepts
http://localhost/dev/batch/index.html
http://localhost/flink-docs/quickstart/scala_api_quickstart
http://localhost/flink-docs/dev/datastream_api
http://localhost/flink-docs/quickstart/java_api_quickstart
http://localhost/flink-docs/setup/flink_on_windows
http://localhost/flink-docs/dev/custom_serializers
http://localhost/flink-docs/setup/config
http://localhost/flink-docs/api/java/org/apache/flink/table/api/Table.html
http://localhost/flink-docs/dev/state
{noformat}
[jira] [Commented] (FLINK-5751) 404 in documentation
[ https://issues.apache.org/jira/browse/FLINK-5751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860222#comment-15860222 ]

Patrick Lucas commented on FLINK-5751:
--------------------------------------

I've actually run into quite a few broken links in the docs in the past few days. For example, there are a number of links to {{/dev/api_concepts}} that are missing {{.html}}. (In the repo, run [1].)

Many of the pages contain tables, encoded in the markdown files as raw HTML, which have many broken links as well (e.g. [here|https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/batch/index.html#dataset-transformations]), usually because they are hard-coded to start with {{/}} instead of being prefixed with {{ site.baseurl }} like the non-HTML links.

I think I might put together a script to try to find broken links in the docs, as there seem to be many of them.

[1] {code}git grep -e api_concepts[^\.]{code}

P.S. I'll give a dollar to anyone who can tell me how to put that {{git grep}} snippet on the same line as other text in JIRA...
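The {{git grep}} invocation above catches one class of broken link. A tiny standalone checker in the same spirit might look like the sketch below; the class name, regex, and sample links are illustrative assumptions, not the actual script discussed in the comment:

```java
import java.util.List;
import java.util.regex.Pattern;

public class DocLinkCheck {
    // Flags doc links that reference api_concepts without a ".html" suffix,
    // in the spirit of `git grep -e api_concepts[^\.]` above (the negative
    // lookahead also catches a bare path at end-of-string).
    static final Pattern MISSING_EXT = Pattern.compile("api_concepts(?!\\.html)");

    static boolean isBroken(String link) {
        return MISSING_EXT.matcher(link).find();
    }

    public static void main(String[] args) {
        List<String> links = List.of(
                "{{ site.baseurl }}/dev/api_concepts.html", // OK: has .html
                "/dev/api_concepts",                        // broken: bare path
                "/dev/api_concepts#anchor");                // broken: anchor, no .html
        // Print only the offending links.
        links.stream().filter(DocLinkCheck::isBroken).forEach(System.out::println);
    }
}
```

A real checker would walk the rendered site and issue HTTP requests; this only demonstrates the pattern-matching side.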
[jira] [Commented] (FLINK-1588) Load flink configuration also from classloader
[ https://issues.apache.org/jira/browse/FLINK-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15856540#comment-15856540 ]

Patrick Lucas commented on FLINK-1588:
--------------------------------------

My gut feeling is to avoid adding a new way to configure Flink if no one is actively asking for it. I've already gotten what I wanted out of this ticket (a better understanding of this part of the codebase), so unless you think this actually is a feature we want now, I would just as soon close it.

> Load flink configuration also from classloader
> ----------------------------------------------
>
>                 Key: FLINK-1588
>                 URL: https://issues.apache.org/jira/browse/FLINK-1588
>             Project: Flink
>          Issue Type: New Feature
>          Components: Local Runtime
>            Reporter: Robert Metzger
>            Assignee: Patrick Lucas
>
> The GlobalConfiguration object should also check if it finds the flink-config.yaml in the classpath and load it from there.
> This allows users to inject configuration files in local "standalone" or embedded environments.
[jira] [Commented] (FLINK-1588) Load flink configuration also from classloader
[ https://issues.apache.org/jira/browse/FLINK-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15854230#comment-15854230 ]

Patrick Lucas commented on FLINK-1588:
--------------------------------------

I have a [first-pass branch|https://github.com/patricklucas/flink/commits/FLINK-1588_load_config_from_classpath] for this, but I have a few questions about what we actually want here:

# Is the goal to search the directories on the classpath for a file named "flink-config.yaml" (as my branch does), or to load a resource using getClassLoader().getResource("/flink-config.yaml") (which requires knowing the exact resource path)?
# Do we want to load config from the classpath only as a fallback when the config dir or config file doesn't exist, or do we want to somehow merge the configs?
# There's a [version of loadConfiguration with no params|https://github.com/patricklucas/flink/commit/deb10d6a065817569fb5659a7a4091e5984d5e11#diff-0a9e260480028baad50a56024fc3e84aR75] that currently returns an empty config object. Do we want to also search the classpath here? I also notice that it does not inject "dynamic properties" like the other overloaded method does, but since this one appears to be chiefly for testing, perhaps that isn't a problem.
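For the first question, the two approaches differ mainly in who enumerates the classpath: the branch walks directories itself, while the class-loader route delegates the search. A minimal sketch of the latter follows; the class and method names are made up for illustration and this is not Flink's actual GlobalConfiguration API:

```java
import java.io.InputStream;

public class ClasspathConfigLoader {
    // Asks the class loader to search every classpath entry for the named
    // resource; returns null when no entry contains it. Note: no leading "/"
    // here -- ClassLoader.getResourceAsStream takes names relative to the
    // classpath root (a leading slash is only for Class.getResource).
    static InputStream find(String name) {
        return ClasspathConfigLoader.class.getClassLoader().getResourceAsStream(name);
    }

    public static void main(String[] args) {
        // A .class resource that exists on every JVM's classpath:
        System.out.println(find("java/lang/String.class") != null);
        // flink-config.yaml is absent here, so a loader built on this lookup
        // would fall back to the configDir-based path (fallback vs. merge is
        // exactly question 2 above):
        System.out.println(find("flink-config.yaml") == null);
    }
}
```

The delegation model also answers part of question 1: {{getResourceAsStream}} needs the exact resource name, but not which classpath entry holds it.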
[jira] [Assigned] (FLINK-5153) Allow setting custom application tags for Flink on YARN
[ https://issues.apache.org/jira/browse/FLINK-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Lucas reassigned FLINK-5153:
------------------------------------

    Assignee: Patrick Lucas

> Allow setting custom application tags for Flink on YARN
> -------------------------------------------------------
>
>                 Key: FLINK-5153
>                 URL: https://issues.apache.org/jira/browse/FLINK-5153
>             Project: Flink
>          Issue Type: Improvement
>          Components: YARN
>            Reporter: Robert Metzger
>            Assignee: Patrick Lucas
>
> https://issues.apache.org/jira/browse/YARN-1399 added support in YARN to tag applications.
> We should introduce a configuration variable in Flink allowing users to specify a comma-separated list of tags they want to assign to their Flink on YARN applications.
[jira] [Assigned] (FLINK-1588) Load flink configuration also from classloader
[ https://issues.apache.org/jira/browse/FLINK-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Lucas reassigned FLINK-1588:
------------------------------------

    Assignee: Patrick Lucas
[jira] [Commented] (FLINK-1588) Load flink configuration also from classloader
[ https://issues.apache.org/jira/browse/FLINK-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852224#comment-15852224 ]

Patrick Lucas commented on FLINK-1588:
--------------------------------------

Should it search for and apply the configuration found on the classpath only when configDir is unset or flink-config.yaml can't be found in configDir? Or should it load configuration from both, preferring the values found in configDir/flink-config.yaml?