Re: long lived standalone job session cluster in kubernetes

2019-02-27 Thread Heath Albritton
Great, my team is eager to get started.  I’m curious what progress has been
made so far?

-H

> On Feb 26, 2019, at 14:43, Chunhui Shi  wrote:
> 
> Hi Heath and Till, thanks for offering help on reviewing this feature. I just 
> reassigned the JIRAs to myself after offline discussion with Jin. Let us work 
> together to get kubernetes integrated natively with flink. Thanks.
> 
>> On Fri, Feb 15, 2019 at 12:19 AM Till Rohrmann  wrote:
>> Alright, I'll get back to you once the PRs are open. Thanks a lot for your 
>> help :-)
>> 
>> Cheers,
>> Till
>> 
>>> On Thu, Feb 14, 2019 at 5:45 PM Heath Albritton  wrote:
>>> My team and I are keen to help out with testing and review as soon as there 
>>> is a pull request.
>>> 
>>> -H
>>> 
>>>> On Feb 11, 2019, at 00:26, Till Rohrmann  wrote:
>>>> 
>>>> Hi Heath,
>>>> 
>>>> I just learned that people from Alibaba already made some good progress 
>>>> with FLINK-9953. I'm currently talking to them in order to see how we can 
>>>> merge this contribution into Flink as fast as possible. Since I'm quite 
>>>> busy due to the upcoming release I hope that other community members will 
>>>> help out with the reviewing once the PRs are opened.
>>>> 
>>>> Cheers,
>>>> Till
>>>> 
>>>>> On Fri, Feb 8, 2019 at 8:50 PM Heath Albritton  wrote:
>>>>> Has any progress been made on this?  There are a number of folks in
>>>>> the community looking to help out.
>>>>> 
>>>>> 
>>>>> -H
>>>>> 
>>>>> On Wed, Dec 5, 2018 at 10:00 AM Till Rohrmann  
>>>>> wrote:
>>>>> >
>>>>> > Hi Derek,
>>>>> >
>>>>> > there is this issue [1] which tracks the active Kubernetes integration. 
>>>>> > Jin Sun already started implementing some parts of it. There should 
>>>>> > also be some PRs open for it. Please check them out.
>>>>> >
>>>>> > [1] https://issues.apache.org/jira/browse/FLINK-9953
>>>>> >
>>>>> > Cheers,
>>>>> > Till
>>>>> >
>>>>> > On Wed, Dec 5, 2018 at 6:39 PM Derek VerLee  
>>>>> > wrote:
>>>>> >>
>>>>> >> Sounds good.
>>>>> >>
>>>>> >> Is someone working on this automation today?
>>>>> >>
>>>>> >> If not, although my time is tight, I may be able to work on a PR for 
>>>>> >> getting us started down the path Kubernetes native cluster mode.
>>>>> >>
>>>>> >>
>>>>> >> On 12/4/18 5:35 AM, Till Rohrmann wrote:
>>>>> >>
>>>>> >> Hi Derek,
>>>>> >>
>>>>> >> what I would recommend to use is to trigger the cancel with savepoint 
>>>>> >> command [1]. This will create a savepoint and terminate the job 
>>>>> >> execution. Next you simply need to respawn the job cluster which you 
>>>>> >> provide with the savepoint to resume from.
>>>>> >>
>>>>> >> [1] 
>>>>> >> https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#cancel-job-with-savepoint
>>>>> >>
>>>>> >> Cheers,
>>>>> >> Till
>>>>> >>
>>>>> >> On Tue, Dec 4, 2018 at 10:30 AM Andrey Zagrebin 
>>>>> >>  wrote:
>>>>> >>>
>>>>> >>> Hi Derek,
>>>>> >>>
>>>>> >>> I think your automation steps look good.
>>>>> >>> Recreating deployments should not take long
>>>>> >>> and as you mention, this way you can avoid unpredictable old/new 
>>>>> >>> version collisions.
>>>>> >>>
>>>>> >>> Best,
>>>>> >>> Andrey
>>>>> >>>
>>>>> >>> > On 4 Dec 2018, at 10:22, Dawid Wysakowicz  
>>>>> >>> > wrote:
>>>>> >>> >
>>>>> >>> > Hi Derek,
>>>>> >>> >
>>>>> >>> > I am not an expert in kubernetes, so I will cc Till, who should be able
>>>>> >>> > to help you more.
>>>>> >>> >
>>>>> >>> > As for the automation for similar process I would recommend having a
>>>>> >>> > look at dA platform[1] which is built on top of kubernetes.
>>>>> >>> >
>>>>> >>> > Best,
>>>>> >>> >
>>>>> >>> > Dawid
>>>>> >>> >
>>>>> >>> > [1] https://data-artisans.com/platform-overview

Re: long lived standalone job session cluster in kubernetes

2019-02-14 Thread Heath Albritton
My team and I are keen to help out with testing and review as soon as there is 
a pull request.

-H

> On Feb 11, 2019, at 00:26, Till Rohrmann  wrote:
> 
> Hi Heath,
> 
> I just learned that people from Alibaba already made some good progress with 
> FLINK-9953. I'm currently talking to them in order to see how we can merge 
> this contribution into Flink as fast as possible. Since I'm quite busy due to 
> the upcoming release I hope that other community members will help out with 
> the reviewing once the PRs are opened.
> 
> Cheers,
> Till
> 
>> On Fri, Feb 8, 2019 at 8:50 PM Heath Albritton  wrote:
>> Has any progress been made on this?  There are a number of folks in
>> the community looking to help out.
>> 
>> 
>> -H
>> 
>> On Wed, Dec 5, 2018 at 10:00 AM Till Rohrmann  wrote:
>> >
>> > Hi Derek,
>> >
>> > there is this issue [1] which tracks the active Kubernetes integration. 
>> > Jin Sun already started implementing some parts of it. There should also 
>> > be some PRs open for it. Please check them out.
>> >
>> > [1] https://issues.apache.org/jira/browse/FLINK-9953
>> >
>> > Cheers,
>> > Till
>> >
>> > On Wed, Dec 5, 2018 at 6:39 PM Derek VerLee  wrote:
>> >>
>> >> Sounds good.
>> >>
>> >> Is someone working on this automation today?
>> >>
>> >> If not, although my time is tight, I may be able to work on a PR for 
>> >> getting us started down the path Kubernetes native cluster mode.
>> >>
>> >>
>> >> On 12/4/18 5:35 AM, Till Rohrmann wrote:
>> >>
>> >> Hi Derek,
>> >>
>> >> what I would recommend to use is to trigger the cancel with savepoint 
>> >> command [1]. This will create a savepoint and terminate the job 
>> >> execution. Next you simply need to respawn the job cluster which you 
>> >> provide with the savepoint to resume from.
>> >>
>> >> [1] 
>> >> https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#cancel-job-with-savepoint
>> >>
>> >> Cheers,
>> >> Till
>> >>
>> >> On Tue, Dec 4, 2018 at 10:30 AM Andrey Zagrebin 
>> >>  wrote:
>> >>>
>> >>> Hi Derek,
>> >>>
>> >>> I think your automation steps look good.
>> >>> Recreating deployments should not take long
>> >>> and as you mention, this way you can avoid unpredictable old/new version 
>> >>> collisions.
>> >>>
>> >>> Best,
>> >>> Andrey
>> >>>
>> >>> > On 4 Dec 2018, at 10:22, Dawid Wysakowicz  
>> >>> > wrote:
>> >>> >
>> >>> > Hi Derek,
>> >>> >
>> >>> > I am not an expert in kubernetes, so I will cc Till, who should be able
>> >>> > to help you more.
>> >>> >
>> >>> > As for the automation for similar process I would recommend having a
>> >>> > look at dA platform[1] which is built on top of kubernetes.
>> >>> >
>> >>> > Best,
>> >>> >
>> >>> > Dawid
>> >>> >
>> >>> > [1] https://data-artisans.com/platform-overview
>> >>> >
>> >>> > On 30/11/2018 02:10, Derek VerLee wrote:
>> >>> >>
>> >>> >> I'm looking at the job cluster mode, it looks great and I am
>> >>> >> considering migrating our jobs off our "legacy" session cluster and
>> >>> >> into Kubernetes.
>> >>> >>
>> >>> >> I do need to ask some questions because I haven't found a lot of
>> >>> >> details in the documentation about how it works yet, and I gave up
>> >>> >> following the DI around in the code after a while.
>> >>> >>
>> >>> >> Let's say I have a deployment for the job "leader" in HA with ZK, and
>> >>> >> another deployment for the taskmanagers.
>> >>> >>
>> >>> >> I want to upgrade the code or configuration and start from a
>> >>> >> savepoint, in an automated way.
>> >>> >>
>> >>> >> Best I can figure, I can not just update the deployment resources in
>> >>> >> kubernetes and allow the containers to restart in an arbitrary order.
>> >>> >>
>> >>> >> Instead, I expect sequencing is important, something along the lines
>> >>> >> of this:
>> >>> >>
>> >>> >> 1. issue savepoint command on leader
>> >>> >> 2. wait for savepoint
>> >>> >> 3. destroy all leader and taskmanager containers
>> >>> >> 4. deploy new leader, with savepoint url
>> >>> >> 5. deploy new taskmanagers
>> >>> >>
>> >>> >>
>> >>> >> For example, I imagine old taskmanagers (with an old version of my
>> >>> >> job) attaching to the new leader and causing a problem.
>> >>> >>
>> >>> >> Does that sound right, or am I overthinking it?
>> >>> >>
>> >>> >> If not, has anyone tried implementing any automation for this yet?
>> >>> >>
>> >>> >
>> >>>

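Derek's five steps, combined with Till's cancel-with-savepoint suggestion, can be sketched as a small script. This is a dry-run illustration only: the deployment names, job ID, and savepoint bucket below are assumptions rather than values from the thread, each command is echoed instead of executed so the ordering stays visible, and a real version would need to poll for savepoint completion before tearing anything down.

```shell
# Dry-run sketch of the upgrade sequence discussed in the thread above.
# All names here (flink-jobmanager, flink-taskmanager, the job ID, and
# the savepoint location) are placeholders.
JOB_ID="placeholder-job-id"
SAVEPOINT_DIR="gs://my-bucket/savepoints"

run() { echo "+ $*"; }  # print each command instead of running it

# Steps 1-2: trigger a savepoint and cancel the job in one operation
# ("cancel with savepoint"); wait for it to complete before proceeding.
run flink cancel -s "$SAVEPOINT_DIR" "$JOB_ID"

# Step 3: tear down both deployments so no old-version taskmanager can
# attach to the new leader.
run kubectl delete deployment flink-jobmanager flink-taskmanager

# Step 4: deploy the new leader; its manifest passes the savepoint path
# to the job so execution resumes from that state.
run kubectl apply -f jobmanager-deployment.yaml

# Step 5: deploy the new taskmanagers last.
run kubectl apply -f taskmanager-deployment.yaml
```

The key design point, as noted in the thread, is ordering: the old taskmanagers must be gone before the new leader comes up, so an old job version can never attach to it.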


Intermittent issue with GCS storage

2018-09-25 Thread Heath Albritton
Howdy folks,

I'm attempting to get Flink running in a Kubernetes cluster with the
ultimate goal of using GCS for checkpoints and savepoints.  I've used
the helm chart to deploy and followed this guide, modified for 1.6.0:

https://data-artisans.com/blog/getting-started-with-da-platform-on-google-kubernetes-engine

I've built a container that adds these jars:
flink-shaded-hadoop2-uber-1.6.0.jar
gcs-connector-hadoop2-latest.jar
to /opt/flink/lib.

I've been running WordCount.jar to test, using the input and output
flags pointing at a GCS bucket.  I've verified that my two jars show
up in the classpath in the logs, but when I run the job it throws the
following errors:

flink-flink-jobmanager-9766f9b4c-kfkk5: Caused by:
org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Could
not find a file system implementation for scheme 'gs'. The scheme is
not directly supported by Flink and no Hadoop file system to support
this scheme could be loaded.
flink-flink-jobmanager-9766f9b4c-kfkk5: at
org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:403)
flink-flink-jobmanager-9766f9b4c-kfkk5: at
org.apache.flink.core.fs.FileSystem.get(FileSystem.java:318)
flink-flink-jobmanager-9766f9b4c-kfkk5: at
org.apache.flink.core.fs.Path.getFileSystem(Path.java:298)
flink-flink-jobmanager-9766f9b4c-kfkk5: at
org.apache.flink.api.common.io.FileOutputFormat.initializeGlobal(FileOutputFormat.java:275)
flink-flink-jobmanager-9766f9b4c-kfkk5: at
org.apache.flink.runtime.jobgraph.OutputFormatVertex.initializeOnMaster(OutputFormatVertex.java:89)
flink-flink-jobmanager-9766f9b4c-kfkk5: at
org.apache.flink.runtime.executiongraph.ExecutionGraphBuilder.buildGraph(ExecutionGraphBuilder.java:216)

I've been wrestling with this a fair bit.  Eventually I built a
container with the core-site.xml and the GCS key in the
/opt/flink/etc-hadoop directory and then set HADOOP_CONF_DIR to point
there.

I've discovered that if I run the container in standalone mode using
the start-cluster.sh script, it works just fine. I can replicate this
in Kubernetes and locally using Docker.

If I start the job manager and the task manager individually using
their respective scripts, I get the aforementioned error. Oddly, the
same split shows up locally: with the start-cluster.sh script my
WordCount test works just fine, but when I start the job manager and
task manager processes using their individual scripts, I can read the
input file from GCS yet get a 403 when trying to write the output.

I've no idea how to proceed with troubleshooting this further as I'm a
newbie to flink.  Some direction would be helpful.


Cheers,

Heath Albritton