Re: long lived standalone job session cluster in kubernetes
Great, my team is eager to get started. I'm curious what progress has been made so far?

-H

> On Feb 26, 2019, at 14:43, Chunhui Shi wrote:
>
> Hi Heath and Till, thanks for offering help on reviewing this feature. I just
> reassigned the JIRAs to myself after offline discussion with Jin. Let us work
> together to get Kubernetes integrated natively with Flink. Thanks.
>
>> On Fri, Feb 15, 2019 at 12:19 AM Till Rohrmann wrote:
>> Alright, I'll get back to you once the PRs are open. Thanks a lot for your
>> help :-)
>>
>> Cheers,
>> Till
>>
>>> On Thu, Feb 14, 2019 at 5:45 PM Heath Albritton wrote:
>>> My team and I are keen to help out with testing and review as soon as there
>>> is a pull request.
>>>
>>> -H
>>>
>>>> On Feb 11, 2019, at 00:26, Till Rohrmann wrote:
>>>>
>>>> Hi Heath,
>>>>
>>>> I just learned that people from Alibaba already made some good progress
>>>> with FLINK-9953. I'm currently talking to them in order to see how we can
>>>> merge this contribution into Flink as fast as possible. Since I'm quite
>>>> busy due to the upcoming release, I hope that other community members will
>>>> help out with the reviewing once the PRs are opened.
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>>> On Fri, Feb 8, 2019 at 8:50 PM Heath Albritton wrote:
>>>>> Has any progress been made on this? There are a number of folks in
>>>>> the community looking to help out.
>>>>>
>>>>> -H
>>>>>
>>>>> On Wed, Dec 5, 2018 at 10:00 AM Till Rohrmann wrote:
>>>>> >
>>>>> > Hi Derek,
>>>>> >
>>>>> > there is this issue [1] which tracks the active Kubernetes integration.
>>>>> > Jin Sun already started implementing some parts of it. There should
>>>>> > also be some PRs open for it. Please check them out.
>>>>> >
>>>>> > [1] https://issues.apache.org/jira/browse/FLINK-9953
>>>>> >
>>>>> > Cheers,
>>>>> > Till
>>>>> >
>>>>> > On Wed, Dec 5, 2018 at 6:39 PM Derek VerLee wrote:
>>>>> >>
>>>>> >> Sounds good.
>>>>> >>
>>>>> >> Is someone working on this automation today?
>>>>> >>
>>>>> >> If not, although my time is tight, I may be able to work on a PR for
>>>>> >> getting us started down the path of Kubernetes native cluster mode.
>>>>> >>
>>>>> >> On 12/4/18 5:35 AM, Till Rohrmann wrote:
>>>>> >>
>>>>> >> Hi Derek,
>>>>> >>
>>>>> >> what I would recommend is to trigger the cancel-with-savepoint
>>>>> >> command [1]. This will create a savepoint and terminate the job
>>>>> >> execution. Next you simply need to respawn the job cluster, which you
>>>>> >> provide with the savepoint to resume from.
>>>>> >>
>>>>> >> [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#cancel-job-with-savepoint
>>>>> >>
>>>>> >> Cheers,
>>>>> >> Till
>>>>> >>
>>>>> >> On Tue, Dec 4, 2018 at 10:30 AM Andrey Zagrebin wrote:
>>>>> >>>
>>>>> >>> Hi Derek,
>>>>> >>>
>>>>> >>> I think your automation steps look good.
>>>>> >>> Recreating deployments should not take long,
>>>>> >>> and as you mention, this way you can avoid unpredictable old/new
>>>>> >>> version collisions.
>>>>> >>>
>>>>> >>> Best,
>>>>> >>> Andrey
>>>>> >>>
>>>>> >>> > On 4 Dec 2018, at 10:22, Dawid Wysakowicz wrote:
>>>>> >>> >
>>>>> >>> > Hi Derek,
>>>>> >>> >
>>>>> >>> > I am not an expert in kubernetes, so I will cc Till, who should be
>>>>> >>> > able to help you more.
Re: long lived standalone job session cluster in kubernetes
My team and I are keen to help out with testing and review as soon as there is a pull request.

-H

> On Feb 11, 2019, at 00:26, Till Rohrmann wrote:
>
> Hi Heath,
>
> I just learned that people from Alibaba already made some good progress with
> FLINK-9953. I'm currently talking to them in order to see how we can merge
> this contribution into Flink as fast as possible. Since I'm quite busy due to
> the upcoming release, I hope that other community members will help out with
> the reviewing once the PRs are opened.
>
> Cheers,
> Till
>
>> On Fri, Feb 8, 2019 at 8:50 PM Heath Albritton wrote:
>> Has any progress been made on this? There are a number of folks in
>> the community looking to help out.
>>
>> -H
>>
>> On Wed, Dec 5, 2018 at 10:00 AM Till Rohrmann wrote:
>> >
>> > Hi Derek,
>> >
>> > there is this issue [1] which tracks the active Kubernetes integration.
>> > Jin Sun already started implementing some parts of it. There should also
>> > be some PRs open for it. Please check them out.
>> >
>> > [1] https://issues.apache.org/jira/browse/FLINK-9953
>> >
>> > Cheers,
>> > Till
>> >
>> > On Wed, Dec 5, 2018 at 6:39 PM Derek VerLee wrote:
>> >>
>> >> Sounds good.
>> >>
>> >> Is someone working on this automation today?
>> >>
>> >> If not, although my time is tight, I may be able to work on a PR for
>> >> getting us started down the path of Kubernetes native cluster mode.
>> >>
>> >> On 12/4/18 5:35 AM, Till Rohrmann wrote:
>> >>
>> >> Hi Derek,
>> >>
>> >> what I would recommend is to trigger the cancel-with-savepoint
>> >> command [1]. This will create a savepoint and terminate the job
>> >> execution. Next you simply need to respawn the job cluster, which you
>> >> provide with the savepoint to resume from.
>> >>
>> >> [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#cancel-job-with-savepoint
>> >>
>> >> Cheers,
>> >> Till
>> >>
>> >> On Tue, Dec 4, 2018 at 10:30 AM Andrey Zagrebin wrote:
>> >>>
>> >>> Hi Derek,
>> >>>
>> >>> I think your automation steps look good.
>> >>> Recreating deployments should not take long,
>> >>> and as you mention, this way you can avoid unpredictable old/new version
>> >>> collisions.
>> >>>
>> >>> Best,
>> >>> Andrey
>> >>>
>> >>> > On 4 Dec 2018, at 10:22, Dawid Wysakowicz wrote:
>> >>> >
>> >>> > Hi Derek,
>> >>> >
>> >>> > I am not an expert in kubernetes, so I will cc Till, who should be able
>> >>> > to help you more.
>> >>> >
>> >>> > As for the automation of a similar process I would recommend having a
>> >>> > look at dA platform [1], which is built on top of kubernetes.
>> >>> >
>> >>> > Best,
>> >>> >
>> >>> > Dawid
>> >>> >
>> >>> > [1] https://data-artisans.com/platform-overview
>> >>> >
>> >>> > On 30/11/2018 02:10, Derek VerLee wrote:
>> >>> >>
>> >>> >> I'm looking at the job cluster mode; it looks great and I am
>> >>> >> considering migrating our jobs off our "legacy" session cluster and
>> >>> >> into Kubernetes.
>> >>> >>
>> >>> >> I do need to ask some questions because I haven't found a lot of
>> >>> >> details in the documentation about how it works yet, and I gave up
>> >>> >> following the DI around in the code after a while.
>> >>> >>
>> >>> >> Let's say I have a deployment for the job "leader" in HA with ZK, and
>> >>> >> another deployment for the taskmanagers.
>> >>> >>
>> >>> >> I want to upgrade the code or configuration and start from a
>> >>> >> savepoint, in an automated way.
>> >>> >>
>> >>> >> Best I can figure, I cannot just update the deployment resources in
>> >>> >> kubernetes and allow the containers to restart in an arbitrary order.
>> >>> >>
>> >>> >> Instead, I expect sequencing is important, something along the lines
>> >>> >> of this:
>> >>> >>
>> >>> >> 1. issue savepoint command on leader
>> >>> >> 2. wait for savepoint
>> >>> >> 3. destroy all leader and taskmanager containers
>> >>> >> 4. deploy new leader, with savepoint url
>> >>> >> 5. deploy new taskmanagers
>> >>> >>
>> >>> >> For example, I imagine old taskmanagers (with an old version of my
>> >>> >> job) attaching to the new leader and causing a problem.
>> >>> >>
>> >>> >> Does that sound right, or am I overthinking it?
>> >>> >>
>> >>> >> If not, has anyone tried implementing any automation for this yet?
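The five-step sequence discussed above can be sketched as a small script. This is only a sketch: the deployment names, savepoint target, and job id are placeholders, and it defaults to a dry run that prints each command instead of executing it.

```shell
#!/usr/bin/env bash
# Sketch of the savepoint-then-redeploy sequence from this thread.
# All names (deployments, bucket path, job id) are illustrative placeholders.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"   # default: only print the commands
JOB_ID="${JOB_ID:-00000000000000000000000000000000}"
SAVEPOINT_DIR="${SAVEPOINT_DIR:-gs://my-bucket/savepoints}"

run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

# Steps 1+2: trigger cancel-with-savepoint; the CLI blocks until the
# savepoint completes, so "wait for savepoint" is covered here.
run flink cancel -s "$SAVEPOINT_DIR" "$JOB_ID"

# Step 3: tear down the old leader and taskmanagers together, so stale
# taskmanagers cannot attach to the new leader.
run kubectl delete deployment flink-jobmanager flink-taskmanager

# Step 4: deploy the new leader, whose manifest passes the savepoint path
# (e.g. via --fromSavepoint) to the job cluster entrypoint...
run kubectl apply -f jobmanager-deployment.yaml

# Step 5: ...then the new taskmanagers.
run kubectl apply -f taskmanager-deployment.yaml
```

Running it with DRY_RUN=0 would execute the commands for real; in practice you would also want to capture the concrete savepoint path printed by `flink cancel -s` and substitute it into the new leader's manifest.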
Re: long lived standalone job session cluster in kubernetes
Has any progress been made on this? There are a number of folks in the community looking to help out.

-H

On Wed, Dec 5, 2018 at 10:00 AM Till Rohrmann wrote:
>
> Hi Derek,
>
> there is this issue [1] which tracks the active Kubernetes integration. Jin
> Sun already started implementing some parts of it. There should also be some
> PRs open for it. Please check them out.
>
> [1] https://issues.apache.org/jira/browse/FLINK-9953
>
> Cheers,
> Till
>
> On Wed, Dec 5, 2018 at 6:39 PM Derek VerLee wrote:
>>
>> Sounds good.
>>
>> Is someone working on this automation today?
>>
>> If not, although my time is tight, I may be able to work on a PR for getting
>> us started down the path of Kubernetes native cluster mode.
>>
>> On 12/4/18 5:35 AM, Till Rohrmann wrote:
>>
>> Hi Derek,
>>
>> what I would recommend is to trigger the cancel-with-savepoint
>> command [1]. This will create a savepoint and terminate the job execution.
>> Next you simply need to respawn the job cluster, which you provide with the
>> savepoint to resume from.
>>
>> [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#cancel-job-with-savepoint
>>
>> Cheers,
>> Till
>>
>> On Tue, Dec 4, 2018 at 10:30 AM Andrey Zagrebin wrote:
>>>
>>> Hi Derek,
>>>
>>> I think your automation steps look good.
>>> Recreating deployments should not take long,
>>> and as you mention, this way you can avoid unpredictable old/new version
>>> collisions.
>>>
>>> Best,
>>> Andrey
>>>
>>> > On 4 Dec 2018, at 10:22, Dawid Wysakowicz wrote:
>>> >
>>> > Hi Derek,
>>> >
>>> > I am not an expert in kubernetes, so I will cc Till, who should be able
>>> > to help you more.
>>> >
>>> > As for the automation of a similar process I would recommend having a
>>> > look at dA platform [1], which is built on top of kubernetes.
>>> >
>>> > Best,
>>> >
>>> > Dawid
>>> >
>>> > [1] https://data-artisans.com/platform-overview
>>> >
>>> > On 30/11/2018 02:10, Derek VerLee wrote:
>>> >>
>>> >> I'm looking at the job cluster mode; it looks great and I am
>>> >> considering migrating our jobs off our "legacy" session cluster and
>>> >> into Kubernetes.
>>> >>
>>> >> I do need to ask some questions because I haven't found a lot of
>>> >> details in the documentation about how it works yet, and I gave up
>>> >> following the DI around in the code after a while.
>>> >>
>>> >> Let's say I have a deployment for the job "leader" in HA with ZK, and
>>> >> another deployment for the taskmanagers.
>>> >>
>>> >> I want to upgrade the code or configuration and start from a
>>> >> savepoint, in an automated way.
>>> >>
>>> >> Best I can figure, I cannot just update the deployment resources in
>>> >> kubernetes and allow the containers to restart in an arbitrary order.
>>> >>
>>> >> Instead, I expect sequencing is important, something along the lines
>>> >> of this:
>>> >>
>>> >> 1. issue savepoint command on leader
>>> >> 2. wait for savepoint
>>> >> 3. destroy all leader and taskmanager containers
>>> >> 4. deploy new leader, with savepoint url
>>> >> 5. deploy new taskmanagers
>>> >>
>>> >> For example, I imagine old taskmanagers (with an old version of my
>>> >> job) attaching to the new leader and causing a problem.
>>> >>
>>> >> Does that sound right, or am I overthinking it?
>>> >>
>>> >> If not, has anyone tried implementing any automation for this yet?
Intermittent issue with GCS storage
Howdy folks,

I'm attempting to get Flink running in a Kubernetes cluster, with the ultimate goal of using GCS for checkpoints and savepoints. I've used the helm chart to deploy and followed this guide, modified for 1.6.0:

https://data-artisans.com/blog/getting-started-with-da-platform-on-google-kubernetes-engine

I've built a container putting these:

flink-shaded-hadoop2-uber-1.6.0.jar
gcs-connector-hadoop2-latest.jar

in /opt/flink/lib.

I've been running WordCount.jar to test, using the input and output flags pointing at a GCS bucket. I've verified that my two jars show up in the classpath in the logs, but when I run the job it throws the following errors:

flink-flink-jobmanager-9766f9b4c-kfkk5: Caused by: org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Could not find a file system implementation for scheme 'gs'. The scheme is not directly supported by Flink and no Hadoop file system to support this scheme could be loaded.
flink-flink-jobmanager-9766f9b4c-kfkk5:   at org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:403)
flink-flink-jobmanager-9766f9b4c-kfkk5:   at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:318)
flink-flink-jobmanager-9766f9b4c-kfkk5:   at org.apache.flink.core.fs.Path.getFileSystem(Path.java:298)
flink-flink-jobmanager-9766f9b4c-kfkk5:   at org.apache.flink.api.common.io.FileOutputFormat.initializeGlobal(FileOutputFormat.java:275)
flink-flink-jobmanager-9766f9b4c-kfkk5:   at org.apache.flink.runtime.jobgraph.OutputFormatVertex.initializeOnMaster(OutputFormatVertex.java:89)
flink-flink-jobmanager-9766f9b4c-kfkk5:   at org.apache.flink.runtime.executiongraph.ExecutionGraphBuilder.buildGraph(ExecutionGraphBuilder.java:216)

I've been wrestling with this a fair bit. Eventually I built a container with the core-site.xml and the GCS key in the /opt/flink/etc-hadoop directory and then set HADOOP_CONF_DIR to point there.
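For reference, a minimal core-site.xml that wires the gcs-connector to the gs:// scheme typically looks like the sketch below (written via heredoc for the container build); the keyfile path is an assumption matching the /opt/flink/etc-hadoop layout described above.

```shell
# Sketch: write a minimal core-site.xml for the gcs-connector.
# CONF_DIR and the keyfile path are assumptions; in the image this would be
# /opt/flink/etc-hadoop, matching the setup described in this message.
CONF_DIR="${CONF_DIR:-/tmp/etc-hadoop}"
mkdir -p "$CONF_DIR"
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<configuration>
  <!-- Register the GCS connector classes for the gs:// scheme -->
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  </property>
  <property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
  </property>
  <!-- Service-account auth; the keyfile path is an assumption -->
  <property>
    <name>google.cloud.auth.service.account.enable</name>
    <value>true</value>
  </property>
  <property>
    <name>google.cloud.auth.service.account.json.keyfile</name>
    <value>/opt/flink/etc-hadoop/key.json</value>
  </property>
</configuration>
EOF
export HADOOP_CONF_DIR="$CONF_DIR"
```

The "Could not find a file system implementation for scheme 'gs'" error is exactly what surfaces when these fs.gs.* registrations are not visible to the process, so confirming each launch path actually loads this file is worthwhile.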
I've discovered that if I run the container in standalone mode using the start-cluster.sh script, it works just fine; I can replicate this both in Kubernetes and locally using Docker. If I start the job manager and the task manager individually using their respective scripts, I get the aforementioned error.

Oddly, I see the same split when running locally: with the start-cluster.sh script my wordcount test works just fine, but if I start the job manager and task manager processes using their scripts, I can read the input file from GCS yet get a 403 when trying to write the output.

I've no idea how to proceed with troubleshooting this further, as I'm a newbie to Flink. Some direction would be helpful.

Cheers,

Heath Albritton
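One low-tech way to narrow down why the two launch paths behave differently is to print the Hadoop-related environment each path actually sees, and to probe the bucket permissions directly. A sketch, where the bucket name is a placeholder and the gsutil calls assume the Cloud SDK is installed:

```shell
# Print the Hadoop-related env as seen by the current shell. If this output
# differs between the shell that runs start-cluster.sh and the shells that run
# jobmanager.sh / taskmanager.sh, the GCS config and credentials they pick up
# will differ too.
check_env() {
  echo "HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-unset}"
  echo "HADOOP_CLASSPATH=${HADOOP_CLASSPATH:-unset}"
}
check_env

# Then probe the bucket directly (bucket name is a placeholder):
#   gsutil ls gs://my-bucket/                 # verifies read access
#   gsutil cp /tmp/probe.txt gs://my-bucket/  # a 403 here means the active
#                                             # credentials lack write permission
```

Since reads succeed but writes 403 only on one launch path, the two paths may be loading different credentials (e.g. falling back to instance-default credentials when the keyfile in core-site.xml is not found); comparing this output for each path would be a reasonable first step.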