[jira] [Commented] (FLINK-30562) CEP Operator misses patterns on SideOutputs and parallelism >1 since 1.15.x+

2023-01-05 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17655070#comment-17655070
 ] 

Thomas Wozniakowski commented on FLINK-30562:
-

 [^flink-asf-30562-clean.zip] 

I've produced a (relatively) simple project here that reproduces the problem. 
Please let me know if you have any questions.

> CEP Operator misses patterns on SideOutputs and parallelism >1 since 1.15.x+
> 
>
> Key: FLINK-30562
> URL: https://issues.apache.org/jira/browse/FLINK-30562
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream, Library / CEP
>Affects Versions: 1.16.0, 1.15.3
> Environment: Problem observed in:
> Production:
> Dockerised Flink cluster running in AWS Fargate, sourced from AWS Kinesis and 
> sinking to AWS SQS
> Local:
> Completely local MiniCluster based test with no external sinks or sources
>Reporter: Thomas Wozniakowski
>Priority: Major
> Attachments: flink-asf-30562-clean.zip
>
>
> (Apologies for the speculative and somewhat vague ticket, but I wanted to 
> raise this while I am investigating to see if anyone has suggestions to help 
> me narrow down the problem.)
> We are encountering an issue where our streaming Flink job has stopped 
> working correctly since Flink 1.15.3. This problem is also present on Flink 
> 1.16.0. The Keyed CEP operators that our job uses are no longer emitting 
> Patterns reliably, but critically *this is only happening when parallelism is 
> set to a value greater than 1*. 
> Our local build tests were previously set up using in-JVM `MiniCluster` 
> instances, or dockerised Flink clusters all set with a parallelism of 1, so 
> this problem was not caught and it caused an outage when we upgraded the 
> cluster version in production.
> Observing the job using the Flink console in production, I can see that 
> events are *arriving* into the Keyed CEP operators, but no Pattern events are 
> being emitted out of any of the operators. Furthermore, all the reported 
> Watermark values are zero, though I don't know if that is a red herring, as 
> Watermark reporting seems to have changed since 1.14.x.
> I am currently attempting to create a stripped down version of our streaming 
> job to demonstrate the problem, but this is quite tricky to set up. In the 
> meantime I would appreciate any hints that could point me in the right 
> direction.
> I have isolated the problem to the Keyed CEP operator by removing our real 
> sinks and sources from the failing test. I am still seeing the erroneous 
> behaviour when setting up a job as:
> # Events are read from a list using `env.fromCollection( ... )`
> # CEP operator processes events
> # Output is captured in another list for assertions
> My best guess at the moment is something to do with Watermark emission? There 
> seem to have been changes related to watermark alignment; perhaps this has 
> caused some kind of regression in the CEP library? To reiterate, *this 
> problem only occurs with parallelism of 2 or more. Setting the parallelism to 
> 1 immediately fixes the issue*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-30562) CEP Operator misses patterns on SideOutputs and parallelism >1 since 1.15.x+

2023-01-05 Thread Thomas Wozniakowski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-30562:

Attachment: flink-asf-30562-clean.zip

> CEP Operator misses patterns on SideOutputs and parallelism >1 since 1.15.x+
> 
>
> Key: FLINK-30562
> URL: https://issues.apache.org/jira/browse/FLINK-30562
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream, Library / CEP
>Affects Versions: 1.16.0, 1.15.3
> Environment: Problem observed in:
> Production:
> Dockerised Flink cluster running in AWS Fargate, sourced from AWS Kinesis and 
> sinking to AWS SQS
> Local:
> Completely local MiniCluster based test with no external sinks or sources
>Reporter: Thomas Wozniakowski
>Priority: Major
> Attachments: flink-asf-30562-clean.zip
>
>
> (Apologies for the speculative and somewhat vague ticket, but I wanted to 
> raise this while I am investigating to see if anyone has suggestions to help 
> me narrow down the problem.)
> We are encountering an issue where our streaming Flink job has stopped 
> working correctly since Flink 1.15.3. This problem is also present on Flink 
> 1.16.0. The Keyed CEP operators that our job uses are no longer emitting 
> Patterns reliably, but critically *this is only happening when parallelism is 
> set to a value greater than 1*. 
> Our local build tests were previously set up using in-JVM `MiniCluster` 
> instances, or dockerised Flink clusters all set with a parallelism of 1, so 
> this problem was not caught and it caused an outage when we upgraded the 
> cluster version in production.
> Observing the job using the Flink console in production, I can see that 
> events are *arriving* into the Keyed CEP operators, but no Pattern events are 
> being emitted out of any of the operators. Furthermore, all the reported 
> Watermark values are zero, though I don't know if that is a red herring, as 
> Watermark reporting seems to have changed since 1.14.x.
> I am currently attempting to create a stripped down version of our streaming 
> job to demonstrate the problem, but this is quite tricky to set up. In the 
> meantime I would appreciate any hints that could point me in the right 
> direction.
> I have isolated the problem to the Keyed CEP operator by removing our real 
> sinks and sources from the failing test. I am still seeing the erroneous 
> behaviour when setting up a job as:
> # Events are read from a list using `env.fromCollection( ... )`
> # CEP operator processes events
> # Output is captured in another list for assertions
> My best guess at the moment is something to do with Watermark emission? There 
> seem to have been changes related to watermark alignment; perhaps this has 
> caused some kind of regression in the CEP library? To reiterate, *this 
> problem only occurs with parallelism of 2 or more. Setting the parallelism to 
> 1 immediately fixes the issue*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-30562) CEP Operator misses patterns on SideOutputs and parallelism >1 since 1.15.x+

2023-01-05 Thread Thomas Wozniakowski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-30562:

Component/s: API / DataStream

> CEP Operator misses patterns on SideOutputs and parallelism >1 since 1.15.x+
> 
>
> Key: FLINK-30562
> URL: https://issues.apache.org/jira/browse/FLINK-30562
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream, Library / CEP
>Affects Versions: 1.16.0, 1.15.3
> Environment: Problem observed in:
> Production:
> Dockerised Flink cluster running in AWS Fargate, sourced from AWS Kinesis and 
> sinking to AWS SQS
> Local:
> Completely local MiniCluster based test with no external sinks or sources
>Reporter: Thomas Wozniakowski
>Priority: Major
>
> (Apologies for the speculative and somewhat vague ticket, but I wanted to 
> raise this while I am investigating to see if anyone has suggestions to help 
> me narrow down the problem.)
> We are encountering an issue where our streaming Flink job has stopped 
> working correctly since Flink 1.15.3. This problem is also present on Flink 
> 1.16.0. The Keyed CEP operators that our job uses are no longer emitting 
> Patterns reliably, but critically *this is only happening when parallelism is 
> set to a value greater than 1*. 
> Our local build tests were previously set up using in-JVM `MiniCluster` 
> instances, or dockerised Flink clusters all set with a parallelism of 1, so 
> this problem was not caught and it caused an outage when we upgraded the 
> cluster version in production.
> Observing the job using the Flink console in production, I can see that 
> events are *arriving* into the Keyed CEP operators, but no Pattern events are 
> being emitted out of any of the operators. Furthermore, all the reported 
> Watermark values are zero, though I don't know if that is a red herring, as 
> Watermark reporting seems to have changed since 1.14.x.
> I am currently attempting to create a stripped down version of our streaming 
> job to demonstrate the problem, but this is quite tricky to set up. In the 
> meantime I would appreciate any hints that could point me in the right 
> direction.
> I have isolated the problem to the Keyed CEP operator by removing our real 
> sinks and sources from the failing test. I am still seeing the erroneous 
> behaviour when setting up a job as:
> # Events are read from a list using `env.fromCollection( ... )`
> # CEP operator processes events
> # Output is captured in another list for assertions
> My best guess at the moment is something to do with Watermark emission? There 
> seem to have been changes related to watermark alignment; perhaps this has 
> caused some kind of regression in the CEP library? To reiterate, *this 
> problem only occurs with parallelism of 2 or more. Setting the parallelism to 
> 1 immediately fixes the issue*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-30562) CEP Operator misses patterns on SideOutputs and parallelism >1 since 1.15.x+

2023-01-05 Thread Thomas Wozniakowski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-30562:

Summary: CEP Operator misses patterns on SideOutputs and parallelism >1 
since 1.15.x+  (was: Patterns are not emitted with parallelism >1 since 1.15.x+)

> CEP Operator misses patterns on SideOutputs and parallelism >1 since 1.15.x+
> 
>
> Key: FLINK-30562
> URL: https://issues.apache.org/jira/browse/FLINK-30562
> Project: Flink
>  Issue Type: Bug
>  Components: Library / CEP
>Affects Versions: 1.16.0, 1.15.3
> Environment: Problem observed in:
> Production:
> Dockerised Flink cluster running in AWS Fargate, sourced from AWS Kinesis and 
> sinking to AWS SQS
> Local:
> Completely local MiniCluster based test with no external sinks or sources
>Reporter: Thomas Wozniakowski
>Priority: Major
>
> (Apologies for the speculative and somewhat vague ticket, but I wanted to 
> raise this while I am investigating to see if anyone has suggestions to help 
> me narrow down the problem.)
> We are encountering an issue where our streaming Flink job has stopped 
> working correctly since Flink 1.15.3. This problem is also present on Flink 
> 1.16.0. The Keyed CEP operators that our job uses are no longer emitting 
> Patterns reliably, but critically *this is only happening when parallelism is 
> set to a value greater than 1*. 
> Our local build tests were previously set up using in-JVM `MiniCluster` 
> instances, or dockerised Flink clusters all set with a parallelism of 1, so 
> this problem was not caught and it caused an outage when we upgraded the 
> cluster version in production.
> Observing the job using the Flink console in production, I can see that 
> events are *arriving* into the Keyed CEP operators, but no Pattern events are 
> being emitted out of any of the operators. Furthermore, all the reported 
> Watermark values are zero, though I don't know if that is a red herring, as 
> Watermark reporting seems to have changed since 1.14.x.
> I am currently attempting to create a stripped down version of our streaming 
> job to demonstrate the problem, but this is quite tricky to set up. In the 
> meantime I would appreciate any hints that could point me in the right 
> direction.
> I have isolated the problem to the Keyed CEP operator by removing our real 
> sinks and sources from the failing test. I am still seeing the erroneous 
> behaviour when setting up a job as:
> # Events are read from a list using `env.fromCollection( ... )`
> # CEP operator processes events
> # Output is captured in another list for assertions
> My best guess at the moment is something to do with Watermark emission? There 
> seem to have been changes related to watermark alignment; perhaps this has 
> caused some kind of regression in the CEP library? To reiterate, *this 
> problem only occurs with parallelism of 2 or more. Setting the parallelism to 
> 1 immediately fixes the issue*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-30562) Patterns are not emitted with parallelism >1 since 1.15.x+

2023-01-05 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17655011#comment-17655011
 ] 

Thomas Wozniakowski commented on FLINK-30562:
-

Hi [~bgeng777]

I've made some progress in narrowing down the problem. I am still working on 
producing a reproducible code snippet I can share, but the problem is 
definitely related to *Side Outputs*.

For context, we use Side Outputs to route events to different CEP operators 
depending on a Customer ID value (different customers are interested in 
different CEP sequences). We previously used the {{.split()}} operator before 
it was deprecated.

We set up the side outputs with a call like this (I have dramatically 
simplified the code but the problem is still occurring with the code in this 
form):

{code:java}
streamWithSideOutputs = stream.process(new BrandedSideOutputFunction());

// Where the side output function ...

public static class BrandedSideOutputFunction extends
        ProcessFunction<PlatformEvent, PlatformEvent> {

    private final OutputTag<PlatformEvent> outputTag =
            new OutputTag<>("RED_BRAND", TypeInformation.of(PlatformEvent.class));

    @Override
    public void processElement(PlatformEvent value, Context ctx,
            Collector<PlatformEvent> out) {
        // Duplicate every event: once to the side output, once to the main output.
        ctx.output(outputTag, value);
        out.collect(value);
    }
}
{code}

You'll note that this simplified side output function only ever emits to a 
single, hard-coded side output. The real code is more complex, but as I say, the 
problem still occurs with the code as written above.

With this {{.process(...)}} call upstream of the CEP operators, and the 
{{parallelism}} set to a value greater than 1, the Patterns will fail to be 
detected roughly a third of the time. Note that this happens whether I connect 
the CEP operator to the *main* {{DataStream}} or to a side output via 
{{.getSideOutput(tag)}}.
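
For reference, here is a hedged sketch of the downstream wiring in the failing 
case (the key selector {{PlatformEvent::getCustomerId}} and the {{pattern}} 
variable are placeholders, not our real job code):

{code:java}
// Sketch only; the key selector and "pattern" below are placeholders.
SingleOutputStreamOperator<PlatformEvent> streamWithSideOutputs =
        stream.process(new BrandedSideOutputFunction());

// Connecting CEP either to streamWithSideOutputs itself or to the side output
// misses matches whenever parallelism > 1:
DataStream<PlatformEvent> redBrand = streamWithSideOutputs.getSideOutput(
        new OutputTag<>("RED_BRAND", TypeInformation.of(PlatformEvent.class)));

PatternStream<PlatformEvent> matches =
        CEP.pattern(redBrand.keyBy(PlatformEvent::getCustomerId), pattern);
{code}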

If the {{parallelism}} is set to 1, or if I remove the side-output generating 
{{.process(...)}} call and connect the CEP operator directly to the existing 
{{DataStream}}, the Patterns will be detected 100% of the time.

There seems to be something up with the interaction between side outputs, 
parallelism and the CEP operator in Flink 1.15.0+.

I will keep working on a shareable project that reproduces this problem, but 
hopefully this gives you something to go on?


> Patterns are not emitted with parallelism >1 since 1.15.x+
> --
>
> Key: FLINK-30562
> URL: https://issues.apache.org/jira/browse/FLINK-30562
> Project: Flink
>  Issue Type: Bug
>  Components: Library / CEP
>Affects Versions: 1.16.0, 1.15.3
> Environment: Problem observed in:
> Production:
> Dockerised Flink cluster running in AWS Fargate, sourced from AWS Kinesis and 
> sinking to AWS SQS
> Local:
> Completely local MiniCluster based test with no external sinks or sources
>Reporter: Thomas Wozniakowski
>Priority: Major
>
> (Apologies for the speculative and somewhat vague ticket, but I wanted to 
> raise this while I am investigating to see if anyone has suggestions to help 
> me narrow down the problem.)
> We are encountering an issue where our streaming Flink job has stopped 
> working correctly since Flink 1.15.3. This problem is also present on Flink 
> 1.16.0. The Keyed CEP operators that our job uses are no longer emitting 
> Patterns reliably, but critically *this is only happening when parallelism is 
> set to a value greater than 1*. 
> Our local build tests were previously set up using in-JVM `MiniCluster` 
> instances, or dockerised Flink clusters all set with a parallelism of 1, so 
> this problem was not caught and it caused an outage when we upgraded the 
> cluster version in production.
> Observing the job using the Flink console in production, I can see that 
> events are *arriving* into the Keyed CEP operators, but no Pattern events are 
> being emitted out of any of the operators. Furthermore, all the reported 
> Watermark values are zero, though I don't know if that is a red herring, as 
> Watermark reporting seems to have changed since 1.14.x.
> I am currently attempting to create a stripped down version of our streaming 
> job to demonstrate the problem, but this is quite tricky to set up. In the 
> meantime I would appreciate any hints that could point me in the right 
> direction.
> I have isolated the problem to the Keyed CEP operator by removing our real 
> sinks and sources from the failing test. I am still seeing the erroneous 
> behaviour when setting up a job as:
> # Events are read from a list using `env.fromCollection( ... )`
> # CEP operator processes events
> # Output is captured in another list for assertions
> My best guess at the moment is something to do with Watermark emission? There 
> seem to have been changes related to watermark alignment, 

[jira] [Commented] (FLINK-30562) Patterns are not emitted with parallelism >1 since 1.15.x+

2023-01-04 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17654511#comment-17654511
 ] 

Thomas Wozniakowski commented on FLINK-30562:
-

Hi [~bgeng777], thanks for the quick response. Your demo is roughly the same as 
the one I'm trying to set up to reproduce the issue in a compact way. I will 
use it as guidance to see if I can put something useful together.

My experiments are showing:

*Flink version 1.14.3, parallelism: any*
CEP operators produce the expected output

*Flink versions 1.15.x+, parallelism: 1*
CEP operators produce the expected output

*Flink versions 1.15.x+, parallelism: 2+*
CEP operators produce no output at all

It's worth noting that we did not change any code related to our CEP usage 
between these tests; we simply updated the library versions.

We are using more pattern constraints than exist in your test file, so I'm 
wondering if it might be related to one of those. For example, we use 
{{.within(...)}} and {{.times(...)}} on most of our Pattern definitions.

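For illustration, the general shape of those Pattern definitions is roughly as 
below (a sketch only: the pattern name, the count, the window and the assumption 
that {{ScanEvent}} extends {{PlatformEvent}} are examples, not our production 
rules):

{code:java}
// Sketch of the shape only; names, the count and the time window are examples.
Pattern<PlatformEvent, ?> pattern = Pattern
        .<PlatformEvent>begin("scans", AfterMatchSkipStrategy.skipPastLastEvent())
        .times(20)
        .subtype(ScanEvent.class)
        .within(Time.minutes(30));
{code}
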
> Patterns are not emitted with parallelism >1 since 1.15.x+
> --
>
> Key: FLINK-30562
> URL: https://issues.apache.org/jira/browse/FLINK-30562
> Project: Flink
>  Issue Type: Bug
>  Components: Library / CEP
>Affects Versions: 1.16.0, 1.15.3
> Environment: Problem observed in:
> Production:
> Dockerised Flink cluster running in AWS Fargate, sourced from AWS Kinesis and 
> sinking to AWS SQS
> Local:
> Completely local MiniCluster based test with no external sinks or sources
>Reporter: Thomas Wozniakowski
>Priority: Major
>
> (Apologies for the speculative and somewhat vague ticket, but I wanted to 
> raise this while I am investigating to see if anyone has suggestions to help 
> me narrow down the problem.)
> We are encountering an issue where our streaming Flink job has stopped 
> working correctly since Flink 1.15.3. This problem is also present on Flink 
> 1.16.0. The Keyed CEP operators that our job uses are no longer emitting 
> Patterns reliably, but critically *this is only happening when parallelism is 
> set to a value greater than 1*. 
> Our local build tests were previously set up using in-JVM `MiniCluster` 
> instances, or dockerised Flink clusters all set with a parallelism of 1, so 
> this problem was not caught and it caused an outage when we upgraded the 
> cluster version in production.
> Observing the job using the Flink console in production, I can see that 
> events are *arriving* into the Keyed CEP operators, but no Pattern events are 
> being emitted out of any of the operators. Furthermore, all the reported 
> Watermark values are zero, though I don't know if that is a red herring, as 
> Watermark reporting seems to have changed since 1.14.x.
> I am currently attempting to create a stripped down version of our streaming 
> job to demonstrate the problem, but this is quite tricky to set up. In the 
> meantime I would appreciate any hints that could point me in the right 
> direction.
> I have isolated the problem to the Keyed CEP operator by removing our real 
> sinks and sources from the failing test. I am still seeing the erroneous 
> behaviour when setting up a job as:
> # Events are read from a list using `env.fromCollection( ... )`
> # CEP operator processes events
> # Output is captured in another list for assertions
> My best guess at the moment is something to do with Watermark emission? There 
> seem to have been changes related to watermark alignment; perhaps this has 
> caused some kind of regression in the CEP library? To reiterate, *this 
> problem only occurs with parallelism of 2 or more. Setting the parallelism to 
> 1 immediately fixes the issue*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-30562) Patterns are not emitted with parallelism >1 since 1.15.x+

2023-01-04 Thread Thomas Wozniakowski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-30562:

Description: 
(Apologies for the speculative and somewhat vague ticket, but I wanted to raise 
this while I am investigating to see if anyone has suggestions to help me 
narrow down the problem.)

We are encountering an issue where our streaming Flink job has stopped working 
correctly since Flink 1.15.3. This problem is also present on Flink 1.16.0. The 
Keyed CEP operators that our job uses are no longer emitting Patterns reliably, 
but critically *this is only happening when parallelism is set to a value 
greater than 1*. 

Our local build tests were previously set up using in-JVM `MiniCluster` 
instances, or dockerised Flink clusters all set with a parallelism of 1, so 
this problem was not caught and it caused an outage when we upgraded the 
cluster version in production.

Observing the job using the Flink console in production, I can see that events 
are *arriving* into the Keyed CEP operators, but no Pattern events are being 
emitted out of any of the operators. Furthermore, all the reported Watermark 
values are zero, though I don't know if that is a red herring, as Watermark 
reporting seems to have changed since 1.14.x.

I am currently attempting to create a stripped down version of our streaming 
job to demonstrate the problem, but this is quite tricky to set up. In the 
meantime I would appreciate any hints that could point me in the right 
direction.

I have isolated the problem to the Keyed CEP operator by removing our real 
sinks and sources from the failing test. I am still seeing the erroneous 
behaviour when setting up a job as:

# Events are read from a list using `env.fromCollection( ... )`
# CEP operator processes events
# Output is captured in another list for assertions

My best guess at the moment is something to do with Watermark emission? There 
seem to have been changes related to watermark alignment; perhaps this has 
caused some kind of regression in the CEP library? To reiterate, *this problem 
only occurs with parallelism of 2 or more. Setting the parallelism to 1 
immediately fixes the issue*

  was:
(Apologies for the speculative and somewhat vague ticket, but I wanted to raise 
this while I am investigating to see if anyone has suggestions to help me 
narrow down the problem.)

We are encountering an issue where our streaming Flink job has stopped working 
correctly since Flink 1.15.3. This problem is also present on Flink 1.16.0. The 
Keyed CEP operators that our job uses are no longer emitting Patterns reliably, 
but critically *this is only happening when parallelism is set to a value 
greater than 1*. 

Our local build tests were previously set up using in-JVM `MiniCluster` 
instances, or dockerised Flink clusters all set with a parallelism of 1, so 
this problem was not caught and it caused an outage when we upgraded the 
cluster version in production.

Observing the job using the Flink console in production, I can see that events 
are *arriving* into the Keyed CEP operators, but no Pattern events are being 
emitted out of any of the operators. Furthermore, all the reported Watermark 
values are zero, though I don't know if that is a red herring, as Watermark 
reporting seems to have changed since 1.14.x.

I am currently attempting to create a stripped down version of our streaming 
job to demonstrate the problem, but this is quite tricky to set up. In the 
meantime I would appreciate any hints that could point me in the right 
direction.

I have isolated the problem to the Keyed CEP operator by removing our real 
sinks and sources from the failing test. I am still seeing the erroneous 
behaviour when setting up a job as:

# Events are read from a list using `env.fromCollection( ... )`
# CEP operator processes events
# Output is captured in another list for assertions

My best guess at the moment is something to do with Watermark emission? There 
seem to have been changes related to watermark alignment; perhaps this has 
caused some kind of regression in the CEP library?


> Patterns are not emitted with parallelism >1 since 1.15.x+
> --
>
> Key: FLINK-30562
> URL: https://issues.apache.org/jira/browse/FLINK-30562
> Project: Flink
>  Issue Type: Bug
>  Components: Library / CEP
>Affects Versions: 1.16.0, 1.15.3
> Environment: Problem observed in:
> Production:
> Dockerised Flink cluster running in AWS Fargate, sourced from AWS Kinesis and 
> sinking to AWS SQS
> Local:
> Completely local MiniCluster based test with no external sinks or sources
>Reporter: Thomas Wozniakowski
>Priority: Major
>
> (Apologies for the speculative and somewhat vague ticket, but I wanted to 
> raise this while I am 

[jira] [Created] (FLINK-30562) Patterns are not emitted with parallelism >1 since 1.15.x+

2023-01-04 Thread Thomas Wozniakowski (Jira)
Thomas Wozniakowski created FLINK-30562:
---

 Summary: Patterns are not emitted with parallelism >1 since 1.15.x+
 Key: FLINK-30562
 URL: https://issues.apache.org/jira/browse/FLINK-30562
 Project: Flink
  Issue Type: Bug
  Components: Library / CEP
Affects Versions: 1.15.3, 1.16.0
 Environment: Problem observed in:

Production:
Dockerised Flink cluster running in AWS Fargate, sourced from AWS Kinesis and 
sinking to AWS SQS

Local:
Completely local MiniCluster based test with no external sinks or sources
Reporter: Thomas Wozniakowski


(Apologies for the speculative and somewhat vague ticket, but I wanted to raise 
this while I am investigating to see if anyone has suggestions to help me 
narrow down the problem.)

We are encountering an issue where our streaming Flink job has stopped working 
correctly since Flink 1.15.3. This problem is also present on Flink 1.16.0. The 
Keyed CEP operators that our job uses are no longer emitting Patterns reliably, 
but critically *this is only happening when parallelism is set to a value 
greater than 1*. 

Our local build tests were previously set up using in-JVM `MiniCluster` 
instances, or dockerised Flink clusters all set with a parallelism of 1, so 
this problem was not caught and it caused an outage when we upgraded the 
cluster version in production.

Observing the job using the Flink console in production, I can see that events 
are *arriving* into the Keyed CEP operators, but no Pattern events are being 
emitted out of any of the operators. Furthermore, all the reported Watermark 
values are zero, though I don't know if that is a red herring, as Watermark 
reporting seems to have changed since 1.14.x.

I am currently attempting to create a stripped down version of our streaming 
job to demonstrate the problem, but this is quite tricky to set up. In the 
meantime I would appreciate any hints that could point me in the right 
direction.

I have isolated the problem to the Keyed CEP operator by removing our real 
sinks and sources from the failing test. I am still seeing the erroneous 
behaviour when setting up a job as:

# Events are read from a list using `env.fromCollection( ... )`
# CEP operator processes events
# Output is captured in another list for assertions

My best guess at the moment is something to do with Watermark emission? There 
seem to have been changes related to watermark alignment; perhaps this has 
caused some kind of regression in the CEP library?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)

2020-12-11 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247999#comment-17247999
 ] 

Thomas Wozniakowski commented on FLINK-19970:
-

[~dwysakowicz] I've emailed you over the code + JAR. Please give me a shout 
here or over email if you need anything else from me.

> State leak in CEP Operators (expired events/keys not removed from state)
> 
>
> Key: FLINK-19970
> URL: https://issues.apache.org/jira/browse/FLINK-19970
> Project: Flink
>  Issue Type: Bug
>  Components: Library / CEP
>Affects Versions: 1.11.2
> Environment: Flink 1.11.2 run using the official docker containers in 
> AWS ECS Fargate.
> 1 Job Manager, 1 Taskmanager with 2vCPUs and 8GB memory
>Reporter: Thomas Wozniakowski
>Priority: Critical
> Attachments: image-2020-11-04-11-35-12-126.png, screenshot-1.png, 
> screenshot-2.png, screenshot-3.png
>
>
> We have been observing instability in our production environment recently, 
> seemingly related to state backends. We ended up building a load testing 
> environment to isolate factors and have discovered that the CEP library 
> appears to have some serious problems with state expiry.
> h2. Job Topology
> Source: Kinesis (standard connector) -> keyBy() and forward to...
> CEP: Array of simple Keyed CEP Pattern operators (details below) -> forward 
> output to...
> Sink: SQS (custom connector)
> The CEP Patterns in the test look like this:
> {code:java}
> Pattern.begin(SCANS_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent())
> .times(20)
> .subtype(ScanEvent.class)
> .within(Duration.minutes(30));
> {code}
> h2. Taskmanager Config
> {code:java}
> taskmanager.numberOfTaskSlots: $numberOfTaskSlots
> taskmanager.data.port: 6121
> taskmanager.rpc.port: 6122
> taskmanager.exit-on-fatal-akka-error: true
> taskmanager.memory.process.size: $memoryProcessSize
> taskmanager.memory.jvm-metaspace.size: 256m
> taskmanager.memory.managed.size: 0m
> jobmanager.rpc.port: 6123
> blob.server.port: 6130
> rest.port: 8081
> web.submit.enable: true
> fs.s3a.connection.maximum: 50
> fs.s3a.threads.max: 50
> akka.framesize: 250m
> akka.watch.threshold: 14
> state.checkpoints.dir: s3://$savepointBucketName/checkpoints
> state.savepoints.dir: s3://$savepointBucketName/savepoints
> state.backend: filesystem
> state.backend.async: true
> s3.access-key: $s3AccessKey
> s3.secret-key: $s3SecretKey
> {code}
> (the substitutions are controlled by terraform).
> h2. Tests
> h4. Test 1 (No key rotation)
> 8192 actors (different keys) emitting 1 Scan Event every 10 minutes 
> indefinitely. Actors (keys) never rotate in or out.
> h4. Test 2 (Constant key rotation)
> 8192 actors that produce 2 Scan events 10 minutes apart, then retire and 
> never emit again. The setup creates new actors (keys) as soon as one finishes 
> so we always have 8192. This test basically constantly rotates the key space.
> h2. Results
> For both tests, the state size (checkpoint size) grows unbounded and linearly 
> well past the 30-minute threshold that should have caused old keys or events 
> to be discarded from the state. In the chart below, the left (steep) half is 
> the 24 hours we ran Test 1; the right (shallow) half is Test 2. My 
> understanding is that the checkpoint size should level off after ~45 minutes 
> or so and then stay constant.
> !image-2020-11-04-11-35-12-126.png! 
> Could someone please assist us with this? Unless we have dramatically 
> misunderstood how the CEP library is supposed to function, this seems like a 
> pretty severe bug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)

2020-12-08 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17245902#comment-17245902
 ] 

Thomas Wozniakowski commented on FLINK-19970:
-

Ok, I am working on this now. It's going to take me a while to cut down 
everything to get it into one self-contained JAR but I'll send it over as soon 
as I'm done.

> State leak in CEP Operators (expired events/keys not removed from state)
> 
>
> Key: FLINK-19970
> URL: https://issues.apache.org/jira/browse/FLINK-19970
> Project: Flink
>  Issue Type: Bug
>  Components: Library / CEP
>Affects Versions: 1.11.2
> Environment: Flink 1.11.2 run using the official docker containers in 
> AWS ECS Fargate.
> 1 Job Manager, 1 Taskmanager with 2vCPUs and 8GB memory
>Reporter: Thomas Wozniakowski
>Priority: Critical
> Attachments: image-2020-11-04-11-35-12-126.png, screenshot-1.png, 
> screenshot-2.png, screenshot-3.png
>
>
> We have been observing instability in our production environment recently, 
> seemingly related to state backends. We ended up building a load testing 
> environment to isolate factors and have discovered that the CEP library 
> appears to have some serious problems with state expiry.
> h2. Job Topology
> Source: Kinesis (standard connector) -> keyBy() and forward to...
> CEP: Array of simple Keyed CEP Pattern operators (details below) -> forward 
> output to...
> Sink: SQS (custom connector)
> The CEP Patterns in the test look like this:
> {code:java}
> Pattern.begin(SCANS_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent())
> .times(20)
> .subtype(ScanEvent.class)
> .within(Duration.minutes(30));
> {code}
> h2. Taskmanager Config
> {code:java}
> taskmanager.numberOfTaskSlots: $numberOfTaskSlots
> taskmanager.data.port: 6121
> taskmanager.rpc.port: 6122
> taskmanager.exit-on-fatal-akka-error: true
> taskmanager.memory.process.size: $memoryProcessSize
> taskmanager.memory.jvm-metaspace.size: 256m
> taskmanager.memory.managed.size: 0m
> jobmanager.rpc.port: 6123
> blob.server.port: 6130
> rest.port: 8081
> web.submit.enable: true
> fs.s3a.connection.maximum: 50
> fs.s3a.threads.max: 50
> akka.framesize: 250m
> akka.watch.threshold: 14
> state.checkpoints.dir: s3://$savepointBucketName/checkpoints
> state.savepoints.dir: s3://$savepointBucketName/savepoints
> state.backend: filesystem
> state.backend.async: true
> s3.access-key: $s3AccessKey
> s3.secret-key: $s3SecretKey
> {code}
> (the substitutions are controlled by terraform).
> h2. Tests
> h4. Test 1 (No key rotation)
> 8192 actors (different keys) emitting 1 Scan Event every 10 minutes 
> indefinitely. Actors (keys) never rotate in or out.
> h4. Test 2 (Constant key rotation)
> 8192 actors that produce 2 Scan events 10 minutes apart, then retire and 
> never emit again. The setup creates new actors (keys) as soon as one finishes 
> so we always have 8192. This test basically constantly rotates the key space.
> h2. Results
> For both tests, the state size (checkpoint size) grows unbounded and linearly 
> well past the 30-minute threshold that should have caused old keys or events 
> to be discarded from the state. In the chart below, the left (steep) half is 
> the 24 hours we ran Test 1; the right (shallow) half is Test 2. My 
> understanding is that the checkpoint size should level off after ~45 minutes 
> or so and then stay constant.
> !image-2020-11-04-11-35-12-126.png! 
> Could someone please assist us with this? Unless we have dramatically 
> misunderstood how the CEP library is supposed to function, this seems like a 
> pretty severe bug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)

2020-12-07 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17245327#comment-17245327
 ] 

Thomas Wozniakowski commented on FLINK-19970:
-

Ok, I've built another version of our app, this time using every artefact built 
from your branch (all parts of Flink). The bug is still visible (the state still 
grows unbounded).

I'm going to try and get approval internally to send you a cut down version of 
our app that exhibits this behaviour. Do you just need a job JAR that you can 
start on a cluster and see the effect?

Also, it would be great if there were a slightly more confidential way I could 
send the code to you; it's not hyper-sensitive or anything, but I'd rather not 
post it on a public JIRA ticket.

> State leak in CEP Operators (expired events/keys not removed from state)
> 
>
> Key: FLINK-19970
> URL: https://issues.apache.org/jira/browse/FLINK-19970
> Project: Flink
>  Issue Type: Bug
>  Components: Library / CEP
>Affects Versions: 1.11.2
> Environment: Flink 1.11.2 run using the official docker containers in 
> AWS ECS Fargate.
> 1 Job Manager, 1 Taskmanager with 2vCPUs and 8GB memory
>Reporter: Thomas Wozniakowski
>Priority: Critical
> Attachments: image-2020-11-04-11-35-12-126.png, screenshot-1.png, 
> screenshot-2.png, screenshot-3.png
>
>
> We have been observing instability in our production environment recently, 
> seemingly related to state backends. We ended up building a load testing 
> environment to isolate factors and have discovered that the CEP library 
> appears to have some serious problems with state expiry.
> h2. Job Topology
> Source: Kinesis (standard connector) -> keyBy() and forward to...
> CEP: Array of simple Keyed CEP Pattern operators (details below) -> forward 
> output to...
> Sink: SQS (custom connector)
> The CEP Patterns in the test look like this:
> {code:java}
> Pattern.begin(SCANS_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent())
> .times(20)
> .subtype(ScanEvent.class)
> .within(Duration.minutes(30));
> {code}
> h2. Taskmanager Config
> {code:java}
> taskmanager.numberOfTaskSlots: $numberOfTaskSlots
> taskmanager.data.port: 6121
> taskmanager.rpc.port: 6122
> taskmanager.exit-on-fatal-akka-error: true
> taskmanager.memory.process.size: $memoryProcessSize
> taskmanager.memory.jvm-metaspace.size: 256m
> taskmanager.memory.managed.size: 0m
> jobmanager.rpc.port: 6123
> blob.server.port: 6130
> rest.port: 8081
> web.submit.enable: true
> fs.s3a.connection.maximum: 50
> fs.s3a.threads.max: 50
> akka.framesize: 250m
> akka.watch.threshold: 14
> state.checkpoints.dir: s3://$savepointBucketName/checkpoints
> state.savepoints.dir: s3://$savepointBucketName/savepoints
> state.backend: filesystem
> state.backend.async: true
> s3.access-key: $s3AccessKey
> s3.secret-key: $s3SecretKey
> {code}
> (the substitutions are controlled by terraform).
> h2. Tests
> h4. Test 1 (No key rotation)
> 8192 actors (different keys) emitting 1 Scan Event every 10 minutes 
> indefinitely. Actors (keys) never rotate in or out.
> h4. Test 2 (Constant key rotation)
> 8192 actors that produce 2 Scan events 10 minutes apart, then retire and 
> never emit again. The setup creates new actors (keys) as soon as one finishes 
> so we always have 8192. This test basically constantly rotates the key space.
> h2. Results
> For both tests, the state size (checkpoint size) grows unbounded and linearly 
> well past the 30-minute threshold that should have caused old keys or events 
> to be discarded from the state. In the chart below, the left (steep) half is 
> the 24 hours we ran Test 1; the right (shallow) half is Test 2. My 
> understanding is that the checkpoint size should level off after ~45 minutes 
> or so and then stay constant.
> !image-2020-11-04-11-35-12-126.png! 
> Could someone please assist us with this? Unless we have dramatically 
> misunderstood how the CEP library is supposed to function, this seems like a 
> pretty severe bug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)

2020-12-04 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244094#comment-17244094
 ] 

Thomas Wozniakowski commented on FLINK-19970:
-

Ah sorry, I mixed up the `key` and `id` value when I was reading your code. 

This is a screenshot from IntelliJ of the imported libraries from my test JAR:

 !screenshot-3.png! 

I pushed your branch to our internal Artifactory under the snapshot version. 
I'm going to try pushing again with a more specific version name to make sure 
I'm not somehow pulling in snapshot versions from somewhere else, but getting 
this far was already extremely painful because of getting the Maven build to 
work nicely with our Artifactory.

Would it be possible to stick a temporary log line somewhere in an init 
function for one of the CEP operators that I could look out for to confirm 100% 
that it's running with the right branch version on my remote test cluster?
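
(In the meantime, as a fallback from our side, something like the sketch below in 
the {{open()}} of one of our own rich functions should tell us which flink-cep 
jar the task managers actually loaded; {{LOG}} is just whatever SLF4J logger is 
to hand.)

{code:java}
// Sketch only: log the provenance of the flink-cep classes from the task managers.
@Override
public void open(Configuration parameters) throws Exception {
    Class<?> cepClass = org.apache.flink.cep.CEP.class;
    LOG.info("flink-cep loaded from {} (Implementation-Version: {})",
            cepClass.getProtectionDomain().getCodeSource().getLocation(),
            cepClass.getPackage().getImplementationVersion());
}
{code}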

> State leak in CEP Operators (expired events/keys not removed from state)
> 
>
> Key: FLINK-19970
> URL: https://issues.apache.org/jira/browse/FLINK-19970
> Project: Flink
>  Issue Type: Bug
>  Components: Library / CEP
>Affects Versions: 1.11.2
> Environment: Flink 1.11.2 run using the official docker containers in 
> AWS ECS Fargate.
> 1 Job Manager, 1 Taskmanager with 2vCPUs and 8GB memory
>Reporter: Thomas Wozniakowski
>Priority: Critical
> Attachments: image-2020-11-04-11-35-12-126.png, screenshot-1.png, 
> screenshot-2.png, screenshot-3.png
>
>
> We have been observing instability in our production environment recently, 
> seemingly related to state backends. We ended up building a load testing 
> environment to isolate factors and have discovered that the CEP library 
> appears to have some serious problems with state expiry.
> h2. Job Topology
> Source: Kinesis (standard connector) -> keyBy() and forward to...
> CEP: Array of simple Keyed CEP Pattern operators (details below) -> forward 
> output to...
> Sink: SQS (custom connector)
> The CEP Patterns in the test look like this:
> {code:java}
> Pattern.begin(SCANS_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent())
> .times(20)
> .subtype(ScanEvent.class)
> .within(Duration.minutes(30));
> {code}
> h2. Taskmanager Config
> {code:java}
> taskmanager.numberOfTaskSlots: $numberOfTaskSlots
> taskmanager.data.port: 6121
> taskmanager.rpc.port: 6122
> taskmanager.exit-on-fatal-akka-error: true
> taskmanager.memory.process.size: $memoryProcessSize
> taskmanager.memory.jvm-metaspace.size: 256m
> taskmanager.memory.managed.size: 0m
> jobmanager.rpc.port: 6123
> blob.server.port: 6130
> rest.port: 8081
> web.submit.enable: true
> fs.s3a.connection.maximum: 50
> fs.s3a.threads.max: 50
> akka.framesize: 250m
> akka.watch.threshold: 14
> state.checkpoints.dir: s3://$savepointBucketName/checkpoints
> state.savepoints.dir: s3://$savepointBucketName/savepoints
> state.backend: filesystem
> state.backend.async: true
> s3.access-key: $s3AccessKey
> s3.secret-key: $s3SecretKey
> {code}
> (the substitutions are controlled by terraform).
> h2. Tests
> h4. Test 1 (No key rotation)
> 8192 actors (different keys) emitting 1 Scan Event every 10 minutes 
> indefinitely. Actors (keys) never rotate in or out.
> h4. Test 2 (Constant key rotation)
> 8192 actors that produce 2 Scan events 10 minutes apart, then retire and 
> never emit again. The setup creates new actors (keys) as soon as one finishes 
> so we always have 8192. This test basically constantly rotates the key space.
> h2. Results
> For both tests, the state size (checkpoint size) grows unbounded and linearly 
> well past the 30-minute threshold that should have caused old keys or events 
> to be discarded from the state. In the chart below, the left (steep) half is 
> the 24 hours we ran Test 1; the right (shallow) half is Test 2. My 
> understanding is that the checkpoint size should level off after ~45 minutes 
> or so and then stay constant.
> !image-2020-11-04-11-35-12-126.png! 
> Could someone please assist us with this? Unless we have dramatically 
> misunderstood how the CEP library is supposed to function, this seems like a 
> pretty severe bug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)

2020-12-04 Thread Thomas Wozniakowski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-19970:

Attachment: screenshot-3.png

> State leak in CEP Operators (expired events/keys not removed from state)
> 
>
> Key: FLINK-19970
> URL: https://issues.apache.org/jira/browse/FLINK-19970
> Project: Flink
>  Issue Type: Bug
>  Components: Library / CEP
>Affects Versions: 1.11.2
> Environment: Flink 1.11.2 run using the official docker containers in 
> AWS ECS Fargate.
> 1 Job Manager, 1 Taskmanager with 2vCPUs and 8GB memory
>Reporter: Thomas Wozniakowski
>Priority: Critical
> Attachments: image-2020-11-04-11-35-12-126.png, screenshot-1.png, 
> screenshot-2.png, screenshot-3.png
>
>
> We have been observing instability in our production environment recently, 
> seemingly related to state backends. We ended up building a load testing 
> environment to isolate factors and have discovered that the CEP library 
> appears to have some serious problems with state expiry.
> h2. Job Topology
> Source: Kinesis (standard connector) -> keyBy() and forward to...
> CEP: Array of simple Keyed CEP Pattern operators (details below) -> forward 
> output to...
> Sink: SQS (custom connector)
> The CEP Patterns in the test look like this:
> {code:java}
> Pattern.begin(SCANS_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent())
> .times(20)
> .subtype(ScanEvent.class)
> .within(Duration.minutes(30));
> {code}
> h2. Taskmanager Config
> {code:java}
> taskmanager.numberOfTaskSlots: $numberOfTaskSlots
> taskmanager.data.port: 6121
> taskmanager.rpc.port: 6122
> taskmanager.exit-on-fatal-akka-error: true
> taskmanager.memory.process.size: $memoryProcessSize
> taskmanager.memory.jvm-metaspace.size: 256m
> taskmanager.memory.managed.size: 0m
> jobmanager.rpc.port: 6123
> blob.server.port: 6130
> rest.port: 8081
> web.submit.enable: true
> fs.s3a.connection.maximum: 50
> fs.s3a.threads.max: 50
> akka.framesize: 250m
> akka.watch.threshold: 14
> state.checkpoints.dir: s3://$savepointBucketName/checkpoints
> state.savepoints.dir: s3://$savepointBucketName/savepoints
> state.backend: filesystem
> state.backend.async: true
> s3.access-key: $s3AccessKey
> s3.secret-key: $s3SecretKey
> {code}
> (the substitutions are controlled by terraform).
> h2. Tests
> h4. Test 1 (No key rotation)
> 8192 actors (different keys) emitting 1 Scan Event every 10 minutes 
> indefinitely. Actors (keys) never rotate in or out.
> h4. Test 2 (Constant key rotation)
> 8192 actors that produce 2 Scan events 10 minutes apart, then retire and 
> never emit again. The setup creates new actors (keys) as soon as one finishes 
> so we always have 8192. This test basically constantly rotates the key space.
> h2. Results
> For both tests, the state size (checkpoint size) grows unbounded and linearly 
> well past the 30-minute threshold that should have caused old keys or events 
> to be discarded from the state. In the chart below, the left (steep) half is 
> the 24 hours we ran Test 1; the right (shallow) half is Test 2. My 
> understanding is that the checkpoint size should level off after ~45 minutes 
> or so and then stay constant.
> !image-2020-11-04-11-35-12-126.png! 
> Could someone please assist us with this? Unless we have dramatically 
> misunderstood how the CEP library is supposed to function, this seems like a 
> pretty severe bug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)

2020-12-04 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244083#comment-17244083
 ] 

Thomas Wozniakowski commented on FLINK-19970:
-

Hey [~dwysakowicz]

Am I correct in saying that your scenario above matches my scenario #2 from the 
original description? That is to say, the "constant key rotation" scenario, 
where keys only have one or two events before they go dormant and should be 
cleaned up?

The test I ran above was scenario #1, the "no key rotation" one: just the same 
keys emitting events over and over forever. The equivalent in your test would be 
to hold the `id` value in your generated events constant.

I will rerun with the branch version on scenario #2 (to match your test setup) 
and see how it behaves.

> State leak in CEP Operators (expired events/keys not removed from state)
> 
>
> Key: FLINK-19970
> URL: https://issues.apache.org/jira/browse/FLINK-19970
> Project: Flink
>  Issue Type: Bug
>  Components: Library / CEP
>Affects Versions: 1.11.2
> Environment: Flink 1.11.2 run using the official docker containers in 
> AWS ECS Fargate.
> 1 Job Manager, 1 Taskmanager with 2vCPUs and 8GB memory
>Reporter: Thomas Wozniakowski
>Priority: Critical
> Attachments: image-2020-11-04-11-35-12-126.png, screenshot-1.png, 
> screenshot-2.png
>
>
> We have been observing instability in our production environment recently, 
> seemingly related to state backends. We ended up building a load testing 
> environment to isolate factors and have discovered that the CEP library 
> appears to have some serious problems with state expiry.
> h2. Job Topology
> Source: Kinesis (standard connector) -> keyBy() and forward to...
> CEP: Array of simple Keyed CEP Pattern operators (details below) -> forward 
> output to...
> Sink: SQS (custom connector)
> The CEP Patterns in the test look like this:
> {code:java}
> Pattern.begin(SCANS_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent())
> .times(20)
> .subtype(ScanEvent.class)
> .within(Duration.minutes(30));
> {code}
> h2. Taskmanager Config
> {code:java}
> taskmanager.numberOfTaskSlots: $numberOfTaskSlots
> taskmanager.data.port: 6121
> taskmanager.rpc.port: 6122
> taskmanager.exit-on-fatal-akka-error: true
> taskmanager.memory.process.size: $memoryProcessSize
> taskmanager.memory.jvm-metaspace.size: 256m
> taskmanager.memory.managed.size: 0m
> jobmanager.rpc.port: 6123
> blob.server.port: 6130
> rest.port: 8081
> web.submit.enable: true
> fs.s3a.connection.maximum: 50
> fs.s3a.threads.max: 50
> akka.framesize: 250m
> akka.watch.threshold: 14
> state.checkpoints.dir: s3://$savepointBucketName/checkpoints
> state.savepoints.dir: s3://$savepointBucketName/savepoints
> state.backend: filesystem
> state.backend.async: true
> s3.access-key: $s3AccessKey
> s3.secret-key: $s3SecretKey
> {code}
> (the substitutions are controlled by terraform).
> h2. Tests
> h4. Test 1 (No key rotation)
> 8192 actors (different keys) emitting 1 Scan Event every 10 minutes 
> indefinitely. Actors (keys) never rotate in or out.
> h4. Test 2 (Constant key rotation)
> 8192 actors that produce 2 Scan events 10 minutes apart, then retire and 
> never emit again. The setup creates new actors (keys) as soon as one finishes 
> so we always have 8192. This test basically constantly rotates the key space.
> h2. Results
> For both tests, the state size (checkpoint size) grows unbounded and linearly 
> well past the 30-minute threshold that should have caused old keys or events 
> to be discarded from the state. In the chart below, the left (steep) half is 
> the 24 hours we ran Test 1; the right (shallow) half is Test 2. My 
> understanding is that the checkpoint size should level off after ~45 minutes 
> or so and then stay constant.
> !image-2020-11-04-11-35-12-126.png! 
> Could someone please assist us with this? Unless we have dramatically 
> misunderstood how the CEP library is supposed to function, this seems like a 
> pretty severe bug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)

2020-12-04 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244047#comment-17244047
 ] 

Thomas Wozniakowski commented on FLINK-19970:
-

Hey [~dwysakowicz],

Just to confirm: it's only the `flink-cep` module that needs to be replaced in 
my job JAR? No other Flink libraries need to be swapped out, and it's OK to run 
the actual cluster on the release version?

> State leak in CEP Operators (expired events/keys not removed from state)
> 
>
> Key: FLINK-19970
> URL: https://issues.apache.org/jira/browse/FLINK-19970
> Project: Flink
>  Issue Type: Bug
>  Components: Library / CEP
>Affects Versions: 1.11.2
> Environment: Flink 1.11.2 run using the official docker containers in 
> AWS ECS Fargate.
> 1 Job Manager, 1 Taskmanager with 2vCPUs and 8GB memory
>Reporter: Thomas Wozniakowski
>Priority: Critical
> Attachments: image-2020-11-04-11-35-12-126.png, screenshot-1.png, 
> screenshot-2.png
>
>
> We have been observing instability in our production environment recently, 
> seemingly related to state backends. We ended up building a load testing 
> environment to isolate factors and have discovered that the CEP library 
> appears to have some serious problems with state expiry.
> h2. Job Topology
> Source: Kinesis (standard connector) -> keyBy() and forward to...
> CEP: Array of simple Keyed CEP Pattern operators (details below) -> forward 
> output to...
> Sink: SQS (custom connector)
> The CEP Patterns in the test look like this:
> {code:java}
> Pattern.begin(SCANS_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent())
> .times(20)
> .subtype(ScanEvent.class)
> .within(Duration.minutes(30));
> {code}
> h2. Taskmanager Config
> {code:java}
> taskmanager.numberOfTaskSlots: $numberOfTaskSlots
> taskmanager.data.port: 6121
> taskmanager.rpc.port: 6122
> taskmanager.exit-on-fatal-akka-error: true
> taskmanager.memory.process.size: $memoryProcessSize
> taskmanager.memory.jvm-metaspace.size: 256m
> taskmanager.memory.managed.size: 0m
> jobmanager.rpc.port: 6123
> blob.server.port: 6130
> rest.port: 8081
> web.submit.enable: true
> fs.s3a.connection.maximum: 50
> fs.s3a.threads.max: 50
> akka.framesize: 250m
> akka.watch.threshold: 14
> state.checkpoints.dir: s3://$savepointBucketName/checkpoints
> state.savepoints.dir: s3://$savepointBucketName/savepoints
> state.backend: filesystem
> state.backend.async: true
> s3.access-key: $s3AccessKey
> s3.secret-key: $s3SecretKey
> {code}
> (the substitutions are controlled by terraform).
> h2. Tests
> h4. Test 1 (No key rotation)
> 8192 actors (different keys) emitting 1 Scan Event every 10 minutes 
> indefinitely. Actors (keys) never rotate in or out.
> h4. Test 2 (Constant key rotation)
> 8192 actors that produce 2 Scan events 10 minutes apart, then retire and 
> never emit again. The setup creates new actors (keys) as soon as one finishes 
> so we always have 8192. This test basically constantly rotates the key space.
> h2. Results
> For both tests, the state size (checkpoint size) grows unbounded and linearly 
> well past the 30-minute threshold that should have caused old keys or events 
> to be discarded from the state. In the chart below, the left (steep) half is 
> the 24 hours we ran Test 1; the right (shallow) half is Test 2. My 
> understanding is that the checkpoint size should level off after ~45 minutes 
> or so and then stay constant.
> !image-2020-11-04-11-35-12-126.png! 
> Could someone please assist us with this? Unless we have dramatically 
> misunderstood how the CEP library is supposed to function, this seems like a 
> pretty severe bug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)

2020-12-04 Thread Thomas Wozniakowski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-19970:

Attachment: screenshot-2.png




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)

2020-12-04 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243915#comment-17243915
 ] 

Thomas Wozniakowski commented on FLINK-19970:
-

Results over a longer period, state still growing linearly:

 !screenshot-2.png! 




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)

2020-12-03 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243327#comment-17243327
 ] 

Thomas Wozniakowski commented on FLINK-19970:
-

 !screenshot-1.png! 

Unfortunately it doesn't appear to have stopped the state leak. I haven't let 
the test run for the full period, but you can see the size continuing to grow 
past the 30-45 minute mark where it should start discarding events. I'll keep it 
running for the full 24 hours to be representative, but I think we might need to 
keep digging.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)

2020-12-03 Thread Thomas Wozniakowski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-19970:

Attachment: screenshot-1.png




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)

2020-12-03 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243202#comment-17243202
 ] 

Thomas Wozniakowski commented on FLINK-19970:
-

OK, our load test is running now after a surprisingly painful setup process 
importing the branch code. I'm re-running scenario #1 above and will post the 
results in a few hours (it should be clear by then whether the problem is fixed).




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)

2020-12-02 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242478#comment-17242478
 ] 

Thomas Wozniakowski commented on FLINK-19970:
-

For some reason I can't compile your branch; I get:


{code}
[ERROR] COMPILATION ERROR : 
[INFO] -
[ERROR] 
/home/jamalarm/src/open/flink/flink-libraries/flink-cep/src/test/java/org/apache/flink/cep/nfa/NFAITCase.java:[39,19]
 package javafx.util does not exist
[INFO] 1 error
[INFO] -
{code}

Any ideas?
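
One guess from my side while digging: {{javafx.util}} only ships with JDK 
distributions that bundle JavaFX, so the test can't compile on a plain OpenJDK 
build. Assuming the import at line 39 is {{javafx.util.Pair}} (I haven't 
confirmed what the branch actually uses it for), a local workaround would be to 
swap it for Flink's own tuple type, roughly:

{code:java}
// Hypothetical local workaround -- assumes the branch imports javafx.util.Pair.
//
// Before (does not compile on JDKs without the JavaFX modules):
//   import javafx.util.Pair;
//   Pair<String, Integer> entry = new Pair<>("start", 1);
//
// After: use Flink's bundled Tuple2 instead, which is always on the test classpath.
import org.apache.flink.api.java.tuple.Tuple2;

class PairReplacementExample {
    Tuple2<String, Integer> entry = Tuple2.of("start", 1);
}
{code}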





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)

2020-12-02 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242459#comment-17242459
 ] 

Thomas Wozniakowski commented on FLINK-19970:
-

Do I need to rebuild our Job JAR using the CEP library from your branch? Or is 
this a TaskManager-side fix?




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)

2020-12-02 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242433#comment-17242433
 ] 

Thomas Wozniakowski commented on FLINK-19970:
-

Hey [~dwysakowicz] - great to hear you have a candidate fix!

In order for us to test it I would need to package your branch as a docker 
container. Our load testing environment runs exclusively on containers and it 
would be... a profound headache to try and make it run any other way.

We can publish it to an internal ECR repo on our side, that shouldn't be too 
hard. Is there a relatively straightforward way to check your branch out and 
package it up as an image locally?

Thanks




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)

2020-12-01 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241445#comment-17241445
 ] 

Thomas Wozniakowski commented on FLINK-19970:
-

Hi [~wind_ljy],

No, we didn't identify a fix. We're not really familiar with the Flink codebase 
and our team is pretty small, so our plan was to wait until [~dwysakowicz] was 
finished with the 1.12.0 release and had some time to look at the issue.

We would obviously be very grateful if you were able to spare some time to dig 
into this. Our production system is effectively a ticking time bomb at the 
moment with this issue.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19293) RocksDB last_checkpoint.state_size grows endlessly until savepoint/restore

2020-11-23 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237395#comment-17237395
 ] 

Thomas Wozniakowski commented on FLINK-19293:
-

Hi [~AHeise]

Sorry, I meant to update this ticket but forgot. We spent some time isolating 
the behaviour and realised this has nothing to do with RocksDB. It's a bug in 
the CEP library where the state grows endlessly; I raised another ticket about 
it here:

https://issues.apache.org/jira/browse/FLINK-19970

Do you want me to go ahead and close this one? I'm not sure there's work to do 
here.

> RocksDB last_checkpoint.state_size grows endlessly until savepoint/restore
> --
>
> Key: FLINK-19293
> URL: https://issues.apache.org/jira/browse/FLINK-19293
> Project: Flink
>  Issue Type: Bug
>  Components: Library / CEP, Runtime / Checkpointing, Runtime / State 
> Backends
>Affects Versions: 1.10.1
>Reporter: Thomas Wozniakowski
>Priority: Major
> Attachments: Screenshot 2020-09-18 at 13.58.30.png
>
>
> Hi Guys,
> I am seeing some strange behaviour that may be a bug, or may just be intended.
> We are running a Flink job on a 1.10.1 cluster with 1 JobManager and 
> 2 TaskManagers, parallelism 4. The job itself is simple:
> # Source: kinesis connector reading from a single shard stream
> # CEP: ~25 CEP Keyed Pattern operators watching the event stream for 
> different kinds of behaviour. They all have ".withinSeconds()" applied. 
> Nothing is set up to grow endlessly.
> # Sink: Single operator writing messages to SQS (custom code)
> We are seeing the checkpoint size grow constantly until the job is restarted 
> using a savepoint/restore. The size continues to grow past the point that the 
> ".withinSeconds()" limits should cause old data to be discarded. The 
> growth is also out of proportion to the general platform growth (which is 
> actually trending down at the moment due to COVID).
> I've attached a snapshot from our monitoring dashboard below. You can see the 
> huge drops in state_size on a savepoint/restore.
> Our state configuration is as follows:
> Backend: RocksDB
> Mode: EXACTLY_ONCE
> Max Concurrent: 1
> Externalised Checkpoints: RETAIN_ON_CANCELLATION
> Async: TRUE
> Incremental: TRUE
> TTL Compaction Filter enabled: TRUE
> We are worried that the CEP library may be leaking state somewhere, leaving 
> some objects not cleaned up. Unfortunately I can't share one of these 
> checkpoints with the community due to the sensitive nature of the data 
> contained within, but if anyone has any suggestions for how I could analyse 
> the checkpoints to look for leaks, please let me know.
> Thanks in advance for the help



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)

2020-11-16 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232913#comment-17232913
 ] 

Thomas Wozniakowski commented on FLINK-19970:
-

Hey [~dwysakowicz] - thanks for the update. Please let us know if there's 
anything we can do to help.

Happy to test out a branch version of Flink in our load test environment when 
it is available. The only issue is that we use the official Docker images, so 
I'd need some way to build a Docker image from your branch.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)

2020-11-06 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17227329#comment-17227329
 ] 

Thomas Wozniakowski commented on FLINK-19970:
-

[~dwysakowicz] Please give me a shout if there's any more diagnostic info I can 
attach to assist with debugging this. This one is blocking us quite severely so 
I'm more than happy to help any way I can.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)

2020-11-04 Thread Thomas Wozniakowski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-19970:

Description: 
We have been observing instability in our production environment recently, 
seemingly related to state backends. We ended up building a load testing 
environment to isolate factors and have discovered that the CEP library appears 
to have some serious problems with state expiry.

h2. Job Topology

Source: Kinesis (standard connector) -> keyBy() and forward to...
CEP: Array of simple Keyed CEP Pattern operators (details below) -> forward 
output to...
Sink: SQS (custom connector)

The CEP Patterns in the test look like this:

{code:java}
Pattern.begin(SCANS_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent())
.times(20)
.subtype(ScanEvent.class)
.within(Duration.minutes(30));
{code}

h2. Taskmanager Config

{code:java}
taskmanager.numberOfTaskSlots: $numberOfTaskSlots
taskmanager.data.port: 6121
taskmanager.rpc.port: 6122
taskmanager.exit-on-fatal-akka-error: true
taskmanager.memory.process.size: $memoryProcessSize
taskmanager.memory.jvm-metaspace.size: 256m
taskmanager.memory.managed.size: 0m
jobmanager.rpc.port: 6123
blob.server.port: 6130
rest.port: 8081
web.submit.enable: true
fs.s3a.connection.maximum: 50
fs.s3a.threads.max: 50
akka.framesize: 250m
akka.watch.threshold: 14

state.checkpoints.dir: s3://$savepointBucketName/checkpoints
state.savepoints.dir: s3://$savepointBucketName/savepoints
state.backend: filesystem
state.backend.async: true

s3.access-key: $s3AccessKey
s3.secret-key: $s3SecretKey
{code}

(the substitutions are controlled by terraform).

h2. Tests

h4. Test 1 (No key rotation)
8192 actors (different keys) emitting 1 Scan Event every 10 minutes 
indefinitely. Actors (keys) never rotate in or out.

h4. Test 2 (Constant key rotation)
8192 actors that produce 2 Scan events 10 minutes apart, then retire and never 
emit again. The setup creates new actors (keys) as soon as one finishes so we 
always have 8192. This test basically constantly rotates the key space.

h2. Results

For both tests, the state size (checkpoint size) grows unbounded and linearly, 
well past the 30-minute threshold that should have caused old keys or events to 
be discarded from the state. In the chart below, the left (steep) half is the 24 
hours we ran Test 1, the right (shallow) half is Test 2. My understanding is 
that the checkpoint size should level off after ~45 minutes or so and then stay 
constant.

!image-2020-11-04-11-35-12-126.png! 

Could someone please assist us with this? Unless we have dramatically 
misunderstood how the CEP library is supposed to function, this seems like a 
pretty severe bug.

  was:
We have been observing instability in our production environment recently, 
seemingly related to state backends. We ended up building a load testing 
environment to isolate factors and have discovered that the CEP library appears 
to have some serious problems with state expiry.

h2. Job Topology

Source: Kinesis (standard connector) -> keyBy() and forward to...
CEP: Array of simple Keyed CEP Pattern operators (details below) -> forward 
output to...
Sink: SQS (custom connector)

The CEP Patterns in the test look like this:

{code:java}
Pattern.begin(SCANS_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent())
.times(20)
.subtype(ScanEvent.class)
.within(Duration.minutes(30));
{code}

h2. Taskmanager Config

{code:java}
taskmanager.numberOfTaskSlots: $numberOfTaskSlots
taskmanager.data.port: 6121
taskmanager.rpc.port: 6122
taskmanager.exit-on-fatal-akka-error: true
taskmanager.memory.process.size: $memoryProcessSize
taskmanager.memory.jvm-metaspace.size: 256m
taskmanager.memory.managed.size: 0m
jobmanager.rpc.port: 6123
blob.server.port: 6130
rest.port: 8081
web.submit.enable: true
fs.s3a.connection.maximum: 50
fs.s3a.threads.max: 50
akka.framesize: 250m
akka.watch.threshold: 14

state.checkpoints.dir: s3://$savepointBucketName/checkpoints
state.savepoints.dir: s3://$savepointBucketName/savepoints
state.backend: filesystem
state.backend.async: true

s3.access-key: $s3AccessKey
s3.secret-key: $s3SecretKey
{code}

(the substitutions are controlled by terraform).

h2. Tests

h4. Test 1 (No key rotation)
8192 actors (different keys) emitting 1 Scan Event every 10 minutes 
indefinitely. Actors (keys) never rotate in or out.

h4. Test 2 (Constant key rotation)
8192 actors that produce 2 Scan events 10 minutes apart, then retire and never 
emit again. The setup creates new actors (keys) as soon as one finishes so we 
always have 8192. This test basically constantly rotates the key space.

h2. Results

For both tests, the state size (checkpoint size) grows unbounded and linearly, 
well past the 30-minute threshold that should have caused old keys or events to 
be discarded from the state. In the chart below, the left (steep) half is the 24 
hours we ran Test 1, the right 

[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)

2020-11-04 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226022#comment-17226022
 ] 

Thomas Wozniakowski commented on FLINK-19970:
-

Hi [~dwysakowicz],

Sorry for not including that info in the original post.

We are using event time with a custom watermarking strategy based on an average 
of the last 10 events' timestamps plus a constant buffer of 15 minutes.

The watermarking strategy is working just fine. Test #2 is actually still 
running and I can see the Low Watermark of the CEP operators is 1604492129185 
(15 minutes ago), as expected.

Note that this setup also produces matches just fine (when the frequency of 
event emission is increased). If the watermarks weren't being correctly 
assigned, we would never see matches coming out the other end of the CEP 
operators, right?
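
For reference, the shape of the strategy is roughly the following. This is a 
simplified sketch, not our production code, and it assumes a 
{{ScanEvent#getTimestamp()}} accessor returning the event-time millis:

{code:java}
import java.time.Duration;
import java.util.ArrayDeque;

import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkOutput;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

/** Sketch: watermark = average of the last 10 event timestamps minus a 15 minute buffer. */
public class AveragedWatermarks implements WatermarkGenerator<ScanEvent> {

    private static final int WINDOW_SIZE = 10;
    private static final long BUFFER_MS = Duration.ofMinutes(15).toMillis();

    private final ArrayDeque<Long> recentTimestamps = new ArrayDeque<>();

    @Override
    public void onEvent(ScanEvent event, long eventTimestamp, WatermarkOutput output) {
        // Keep a rolling window of the last 10 event timestamps.
        recentTimestamps.addLast(eventTimestamp);
        if (recentTimestamps.size() > WINDOW_SIZE) {
            recentTimestamps.removeFirst();
        }
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        if (recentTimestamps.isEmpty()) {
            return;
        }
        long average = (long) recentTimestamps.stream()
                .mapToLong(Long::longValue)
                .average()
                .getAsDouble();
        // Hold the watermark a constant 15 minutes behind the rolling average.
        output.emitWatermark(new Watermark(average - BUFFER_MS));
    }

    public static WatermarkStrategy<ScanEvent> strategy() {
        return WatermarkStrategy
                .<ScanEvent>forGenerator(ctx -> new AveragedWatermarks())
                .withTimestampAssigner((event, recordTimestamp) -> event.getTimestamp());
    }
}
{code}

That 15 minute buffer is also why I'd expect the checkpoint size to flatten 
around the ~45 minute mark rather than at exactly 30 minutes: 30 minutes of 
{{within()}} plus the watermark lag.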




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)

2020-11-04 Thread Thomas Wozniakowski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-19970:

Description: 
We have been observing instability in our production environment recently, 
seemingly related to state backends. We ended up building a load testing 
environment to isolate factors and have discovered that the CEP library appears 
to have some serious problems with state expiry.

h2. Job Topology

Source: Kinesis (standard connector) -> keyBy() and forward to...
CEP: Array of simple Keyed CEP Pattern operators (details below) -> forward 
output to...
Sink: SQS (custom connector)

The CEP Patterns in the test look like this:

{code:java}
Pattern.begin(SCANS_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent())
.times(20)
.subtype(ScanEvent.class)
.within(Duration.minutes(30));
{code}

h2. Taskmanager Config

{code:java}
taskmanager.numberOfTaskSlots: $numberOfTaskSlots
taskmanager.data.port: 6121
taskmanager.rpc.port: 6122
taskmanager.exit-on-fatal-akka-error: true
taskmanager.memory.process.size: $memoryProcessSize
taskmanager.memory.jvm-metaspace.size: 256m
taskmanager.memory.managed.size: 0m
jobmanager.rpc.port: 6123
blob.server.port: 6130
rest.port: 8081
web.submit.enable: true
fs.s3a.connection.maximum: 50
fs.s3a.threads.max: 50
akka.framesize: 250m
akka.watch.threshold: 14

state.checkpoints.dir: s3://$savepointBucketName/checkpoints
state.savepoints.dir: s3://$savepointBucketName/savepoints
state.backend: filesystem
state.backend.async: true

s3.access-key: $s3AccessKey
s3.secret-key: $s3SecretKey
{code}

(the substitutions are controlled by terraform).

h2. Tests

h4. Test 1 (No key rotation)
8192 actors (different keys) emitting 1 Scan Event every 10 minutes 
indefinitely. Actors (keys) never rotate in or out.

h4. Test 2 (Constant key rotation)
8192 actors that produce 2 Scan events 10 minutes apart, then retire and never 
emit again. The setup creates new actors (keys) as soon as one finishes so we 
always have 8192. This test basically constantly rotates the key space.

h2. Results

For both tests, the state size (checkpoint size) grows unbounded and linearly, 
well past the 30-minute threshold that should have caused old keys or events to 
be discarded from the state. In the chart below, the left (steep) half is the 24 
hours we ran Test 1, the right (shallow) half is Test 2. 

!image-2020-11-04-11-35-12-126.png! 

Could someone please assist us with this? Unless we have dramatically 
misunderstood how the CEP library is supposed to function, this seems like a 
pretty severe bug.

  was:
We have been observing instability in our production environment recently, 
seemingly related to state backends. We ended up building a load testing 
environment to isolate factors and have discovered that the CEP library appears 
to have some serious problems with state expiry.

h2. Job Topology

Source: Kinesis (standard connector) -> keyBy() and forward to...
CEP: Array of simple Keyed CEP Pattern operators (details below) -> forward 
output to...
Sink: SQS (custom connector)

The CEP Patterns in the test look like this:

{code:java}
Pattern.begin(SCANS_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent())
        .times(20)
        .subtype(ScanEvent.class)
        .within(Duration.minutes(30));
{code}

h2. Taskmanager Config

{code:java}
taskmanager.numberOfTaskSlots: $numberOfTaskSlots
taskmanager.data.port: 6121
taskmanager.rpc.port: 6122
taskmanager.exit-on-fatal-akka-error: true
taskmanager.memory.process.size: $memoryProcessSize
taskmanager.memory.jvm-metaspace.size: 256m
taskmanager.memory.managed.size: 0m
jobmanager.rpc.port: 6123
blob.server.port: 6130
rest.port: 8081
web.submit.enable: true
fs.s3a.connection.maximum: 50
fs.s3a.threads.max: 50
akka.framesize: 250m
akka.watch.threshold: 14

state.checkpoints.dir: s3://$savepointBucketName/checkpoints
state.savepoints.dir: s3://$savepointBucketName/savepoints
state.backend: filesystem
state.backend.async: true

s3.access-key: $s3AccessKey
s3.secret-key: $s3SecretKey
{code}

(the substitutions are controlled by terraform).

h2. Tests

h4. Test 1 (No key rotation)
8192 actors (different keys) emitting 1 Scan Event every 10 minutes 
indefinitely. Actors (keys) never rotate in or out.

h4. Test 2 (Constant key rotation)
8192 actors that produce 2 Scan events 10 minutes apart, then retire and never 
emit again. The setup creates new actors (keys) as soon as one finishes so we 
always have 8192. This test basically constantly rotates the key space.

h2. Results

For both tests, the state size (checkpoint size) grows linearly and without 
bound, well past the 30-minute threshold that should have caused old keys or 
events to be discarded from the state. In the chart below, the left (steep) half 
is the 24 hours we ran Test 1; the right (shallow) half is Test 2. 

!image-2020-11-04-11-35-12-126.png|thumbnail! 

Could someone please assist us 

[jira] [Updated] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)

2020-11-04 Thread Thomas Wozniakowski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-19970:

Description: 
We have been observing instability in our production environment recently, 
seemingly related to state backends. We ended up building a load testing 
environment to isolate factors and have discovered that the CEP library appears 
to have some serious problems with state expiry.

h2. Job Topology

Source: Kinesis (standard connector) -> keyBy() and forward to...
CEP: Array of simple Keyed CEP Pattern operators (details below) -> forward 
output to...
Sink: SQS (custom connector)

The CEP Patterns in the test look like this:

{code:java}
Pattern.begin(SCANS_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent())
        .times(20)
        .subtype(ScanEvent.class)
        .within(Duration.minutes(30));
{code}

h2. Taskmanager Config

{code:java}
taskmanager.numberOfTaskSlots: $numberOfTaskSlots
taskmanager.data.port: 6121
taskmanager.rpc.port: 6122
taskmanager.exit-on-fatal-akka-error: true
taskmanager.memory.process.size: $memoryProcessSize
taskmanager.memory.jvm-metaspace.size: 256m
taskmanager.memory.managed.size: 0m
jobmanager.rpc.port: 6123
blob.server.port: 6130
rest.port: 8081
web.submit.enable: true
fs.s3a.connection.maximum: 50
fs.s3a.threads.max: 50
akka.framesize: 250m
akka.watch.threshold: 14

state.checkpoints.dir: s3://$savepointBucketName/checkpoints
state.savepoints.dir: s3://$savepointBucketName/savepoints
state.backend: filesystem
state.backend.async: true

s3.access-key: $s3AccessKey
s3.secret-key: $s3SecretKey
{code}

(the substitutions are controlled by terraform).

h2. Tests

h4. Test 1 (No key rotation)
8192 actors (different keys) emitting 1 Scan Event every 10 minutes 
indefinitely. Actors (keys) never rotate in or out.

h4. Test 2 (Constant key rotation)
8192 actors that produce 2 Scan events 10 minutes apart, then retire and never 
emit again. The setup creates new actors (keys) as soon as one finishes so we 
always have 8192. This test basically constantly rotates the key space.

h2. Results

For both tests, the state size (checkpoint size) grows linearly and without 
bound, well past the 30-minute threshold that should have caused old keys or 
events to be discarded from the state. In the chart below, the left (steep) half 
is the 24 hours we ran Test 1; the right (shallow) half is Test 2. 

!image-2020-11-04-11-35-12-126.png|thumbnail! 

Could someone please assist us with this? Unless we have dramatically 
misunderstood how the CEP library is supposed to function this seems like a 
pretty severe bug.

  was:
We have been observing instability in our production environment recently, 
seemingly related to state backends. We ended up building a load testing 
environment to isolate factors and have discovered that the CEP library appears 
to have some serious problems with state expiry.

h2. Job Topology

Source: Kinesis (standard connector) -> keyBy() and forward to...
CEP: Array of simple Keyed CEP Pattern operators (details below) -> forward 
output to...
Sink: SQS (custom connector)

The CEP Patterns in the test look like this:

{code:java}
Pattern.begin(SCANS_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent())
        .times(20)
        .subtype(ScanEvent.class)
        .within(Duration.minutes(30));
{code}

h2. Taskmanager Config

{code:java}
taskmanager.numberOfTaskSlots: $numberOfTaskSlots
taskmanager.data.port: 6121
taskmanager.rpc.port: 6122
taskmanager.exit-on-fatal-akka-error: true
taskmanager.memory.process.size: $memoryProcessSize
taskmanager.memory.jvm-metaspace.size: 256m
taskmanager.memory.managed.size: 0m
jobmanager.rpc.port: 6123
blob.server.port: 6130
rest.port: 8081
web.submit.enable: true
fs.s3a.connection.maximum: 50
fs.s3a.threads.max: 50
akka.framesize: 250m
akka.watch.threshold: 14

state.checkpoints.dir: s3://$savepointBucketName/checkpoints
state.savepoints.dir: s3://$savepointBucketName/savepoints
state.backend: filesystem
state.backend.async: true

s3.access-key: $s3AccessKey
s3.secret-key: $s3SecretKey
{code}

(the substitutions are controlled by terraform).

h2. Tests

h4. Test 1 (No key rotation)
8192 actors (different keys) emitting 1 Scan Event every 10 minutes 
indefinitely. Actors (keys) never rotate in or out.

h4. Test 2 (Constant key rotation)
8192 actors that produce 2 Scan events 10 minutes apart, then retire and never 
emit again. The setup creates new actors (keys) as soon as one finishes so we 
always have 8192. This test basically constantly rotates the key space.

h2. Results

For both tests, the state size (checkpoint size) grows linearly and without 
bound, well past the 30-minute threshold that should have caused old keys or 
events to be discarded from the state. In the chart below, the left (steep) half 
is the 24 hours we ran Test 1; the right (shallow) half is Test 2. 

 !image-2020-11-04-11-35-12-126.png|thumbnail! 

Could someone please 

[jira] [Created] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)

2020-11-04 Thread Thomas Wozniakowski (Jira)
Thomas Wozniakowski created FLINK-19970:
---

 Summary: State leak in CEP Operators (expired events/keys not 
removed from state)
 Key: FLINK-19970
 URL: https://issues.apache.org/jira/browse/FLINK-19970
 Project: Flink
  Issue Type: Bug
  Components: Library / CEP
Affects Versions: 1.11.2
 Environment: Flink 1.11.2 run using the official docker containers in 
AWS ECS Fargate.

1 Job Manager, 1 Taskmanager with 2vCPUs and 8GB memory
Reporter: Thomas Wozniakowski
 Attachments: image-2020-11-04-11-35-12-126.png

We have been observing instability in our production environment recently, 
seemingly related to state backends. We ended up building a load testing 
environment to isolate factors and have discovered that the CEP library appears 
to have some serious problems with state expiry.

h2. Job Topology

Source: Kinesis (standard connector) -> keyBy() and forward to...
CEP: Array of simple Keyed CEP Pattern operators (details below) -> forward 
output to...
Sink: SQS (custom connector)

The CEP Patterns in the test look like this:

{code:java}
Pattern.begin(SCANS_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent())
        .times(20)
        .subtype(ScanEvent.class)
        .within(Duration.minutes(30));
{code}

h2. Taskmanager Config

{code:java}
taskmanager.numberOfTaskSlots: $numberOfTaskSlots
taskmanager.data.port: 6121
taskmanager.rpc.port: 6122
taskmanager.exit-on-fatal-akka-error: true
taskmanager.memory.process.size: $memoryProcessSize
taskmanager.memory.jvm-metaspace.size: 256m
taskmanager.memory.managed.size: 0m
jobmanager.rpc.port: 6123
blob.server.port: 6130
rest.port: 8081
web.submit.enable: true
fs.s3a.connection.maximum: 50
fs.s3a.threads.max: 50
akka.framesize: 250m
akka.watch.threshold: 14

state.checkpoints.dir: s3://$savepointBucketName/checkpoints
state.savepoints.dir: s3://$savepointBucketName/savepoints
state.backend: filesystem
state.backend.async: true

s3.access-key: $s3AccessKey
s3.secret-key: $s3SecretKey
{code}

(the substitutions are controlled by terraform).

h2. Tests

h4. Test 1 (No key rotation)
8192 actors (different keys) emitting 1 Scan Event every 10 minutes 
indefinitely. Actors (keys) never rotate in or out.

h4. Test 2 (Constant key rotation)
8192 actors that produce 2 Scan events 10 minutes apart, then retire and never 
emit again. The setup creates new actors (keys) as soon as one finishes so we 
always have 8192. This test basically constantly rotates the key space.

h2. Results

For both tests, the state size (checkpoint size) grows linearly and without 
bound, well past the 30-minute threshold that should have caused old keys or 
events to be discarded from the state. In the chart below, the left (steep) half 
is the 24 hours we ran Test 1; the right (shallow) half is Test 2. 

 !image-2020-11-04-11-35-12-126.png|thumbnail! 

Could someone please assist us with this? Unless we have dramatically 
misunderstood how the CEP library is supposed to function this seems like a 
pretty severe bug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-19293) RocksDB last_checkpoint.state_size grows endlessly until savepoint/restore

2020-09-18 Thread Thomas Wozniakowski (Jira)
Thomas Wozniakowski created FLINK-19293:
---

 Summary: RocksDB last_checkpoint.state_size grows endlessly until 
savepoint/restore
 Key: FLINK-19293
 URL: https://issues.apache.org/jira/browse/FLINK-19293
 Project: Flink
  Issue Type: Bug
  Components: Library / CEP, Runtime / Checkpointing
Affects Versions: 1.10.1
Reporter: Thomas Wozniakowski
 Attachments: Screenshot 2020-09-18 at 13.58.30.png

Hi Guys,

I am seeing some strange behaviour that may be a bug, or may simply be working as intended.

We are running a Flink job on a 1.10.1 cluster with 1 JobManager and 2 
TaskManagers, parallelism 4. The job itself is simple:

# Source: Kinesis connector reading from a single-shard stream
# CEP: ~25 Keyed CEP Pattern operators watching the event stream for different 
kinds of behaviour. They all have a {{.within(...)}} time bound applied. Nothing is 
set up to grow endlessly.
# Sink: Single operator writing messages to SQS (custom code)

We are seeing the checkpoint size grow constantly until the job is restarted 
using a savepoint/restore. The size continues to grow past the point at which the 
{{.within(...)}} limits should cause old data to be discarded. The growth 
is also out of proportion to the general platform growth (which is actually 
trending down at the moment due to COVID).

I've attached a snapshot from our monitoring dashboard below. You can see the 
huge drops in state_size on a savepoint/restore.

Our state configuration is as follows:

Backend: RocksDB
Mode: EXACTLY_ONCE
Max Concurrent: 1
Externalised Checkpoints: RETAIN_ON_CANCELLATION
Async: TRUE
Incremental: TRUE
TTL Compaction Filter enabled: TRUE
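
(For completeness, a rough sketch of how this configuration would look expressed in DataStream API code for a 1.10-era job; the checkpoint interval and S3 path below are placeholders, not our real values.)

{code:java}
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {

    public static StreamExecutionEnvironment configure() throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // RocksDB backend with incremental checkpoints (second constructor argument)
        env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/checkpoints", true));

        // EXACTLY_ONCE, at most one concurrent checkpoint, retained on cancellation
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);
        CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        checkpointConfig.setMaxConcurrentCheckpoints(1);
        checkpointConfig.enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // The TTL compaction filter toggle is a flink-conf.yaml option
        // (state.backend.rocksdb.ttl.compaction.filter.enabled), not code.
        return env;
    }
}
{code}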

We are worried that the CEP library may be leaking state somewhere, leaving 
some objects not cleaned up. Unfortunately I can't share one of these 
checkpoints with the community due to the sensitive nature of the data 
contained within, but if anyone has any suggestions for how I could analyse the 
checkpoints to look for leaks, please let me know.

Thanks in advance for the help



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16588) Add Disk Space metrics to TaskManagers

2020-03-23 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17064889#comment-17064889
 ] 

Thomas Wozniakowski commented on FLINK-16588:
-

I think that's a fair enough assessment. I'll bug Amazon to see if I can get 
them to add the disk space metric to Fargate externally.

> Add Disk Space metrics to TaskManagers
> --
>
> Key: FLINK-16588
> URL: https://issues.apache.org/jira/browse/FLINK-16588
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Metrics
>Affects Versions: 1.10.0
>Reporter: Thomas Wozniakowski
>Priority: Minor
>
> Hi,
> We have recently switched to the RocksDB state backend. We are scraping 
> Taskmanager metrics from the REST endpoints to watch for memory and CPU 
> issues, but we currently have no good way to get the remaining disk space, so 
> we have no way of knowing when RocksDB is going to run out of space for state 
> storage.
> How plausible is it to add something like a {{State.FreeStorageBytes}} metric 
> or something similar to the standard TaskManager metrics set?
> Thanks,



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16588) Add Disk Space metrics to TaskManagers

2020-03-20 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063342#comment-17063342
 ] 

Thomas Wozniakowski commented on FLINK-16588:
-

So we're running them in AWS Fargate. The containers in Fargate are configured 
on your behalf and you can't change the local storage. You also can't access 
the {{docker}} commands to get information out of them. AWS does not expose any 
metrics about remaining disk space from the outside, so unfortunately this 
information will have to come from within...

> Add Disk Space metrics to TaskManagers
> --
>
> Key: FLINK-16588
> URL: https://issues.apache.org/jira/browse/FLINK-16588
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Metrics
>Affects Versions: 1.10.0
>Reporter: Thomas Wozniakowski
>Priority: Minor
>
> Hi,
> We have recently switched to the RocksDB state backend. We are scraping 
> Taskmanager metrics from the REST endpoints to watch for memory and CPU 
> issues, but we currently have no good way to get the remaining disk space, so 
> we have no way of knowing when RocksDB is going to run out of space for state 
> storage.
> How plausible is it to add something like a {{State.FreeStorageBytes}} metric 
> or something similar to the standard TaskManager metrics set?
> Thanks,



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16588) Add Disk Space metrics to TaskManagers

2020-03-20 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063201#comment-17063201
 ] 

Thomas Wozniakowski commented on FLINK-16588:
-

Hey [~gjy],

I take your point, but as a user of Flink I try to stay as close as possible to 
the vanilla versions (best documented, best supported, etc.). For us, that means 
using the official Flink docker images. On Amazon, AWS does not provide any way 
to observe the remaining disk space from OUTSIDE a container, so that only 
leaves us one option: monitor from inside. From our perspective that can be 
achieved in two ways:

# Fork the official docker image and add something like Nagios to it
# Upgrade Flink so it's included in the official docker image by default

We obviously prefer the second option because then we're not maintaining our 
own image :)
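
Purely for illustration (not a proposal for the final metric API), something like the following inside the job would expose a free-disk-space gauge; the metric name and the path being measured are assumptions:

{code:java}
import java.io.File;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Gauge;

// Pass-through operator whose only job is to host a gauge reporting the free
// bytes on the volume that RocksDB writes to.
public class FreeDiskSpaceReporter<T> extends RichMapFunction<T, T> {

    @Override
    public void open(Configuration parameters) {
        getRuntimeContext()
                .getMetricGroup()
                .gauge("freeStorageBytes", new Gauge<Long>() {
                    @Override
                    public Long getValue() {
                        return new File("/tmp").getUsableSpace();
                    }
                });
    }

    @Override
    public T map(T value) {
        return value;
    }
}
{code}

Having Flink report this out of the box (option 2) would obviously be nicer than every user bolting something like the above onto their jobs.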

> Add Disk Space metrics to TaskManagers
> --
>
> Key: FLINK-16588
> URL: https://issues.apache.org/jira/browse/FLINK-16588
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Metrics
>Affects Versions: 1.10.0
>Reporter: Thomas Wozniakowski
>Priority: Minor
>
> Hi,
> We have recently switched to the RocksDB state backend. We are scraping 
> Taskmanager metrics from the REST endpoints to watch for memory and CPU 
> issues, but we currently have no good way to get the remaining disk space, so 
> we have no way of knowing when RocksDB is going to run out of space for state 
> storage.
> How plausible is it to add something like a {{State.FreeStorageBytes}} metric 
> or something similar to the standard TaskManager metrics set?
> Thanks,



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16588) Add Disk Space metrics to TaskManagers

2020-03-13 Thread Thomas Wozniakowski (Jira)
Thomas Wozniakowski created FLINK-16588:
---

 Summary: Add Disk Space metrics to TaskManagers
 Key: FLINK-16588
 URL: https://issues.apache.org/jira/browse/FLINK-16588
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Metrics
Affects Versions: 1.10.0
Reporter: Thomas Wozniakowski


Hi,

We have recently switched to the RocksDB state backend. We are scraping 
Taskmanager metrics from the REST endpoints to watch for memory and CPU issues, 
but we currently have no good way to get the remaining disk space, so we have 
no way of knowing when RocksDB is going to run out of space for state storage.

How plausible is it to add something like a {{State.FreeStorageBytes}} metric 
or something similar to the standard TaskManager metrics set?

Thanks,



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission

2020-03-04 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051086#comment-17051086
 ] 

Thomas Wozniakowski commented on FLINK-16142:
-

Hey [~xintongsong], we set the metaspace to 256m and that seemed to do the 
trick.
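
For anyone who hits the same symptom, the setting we bumped is the one below; 256m worked for our job, but the right value will depend on the deployment.

{code}
taskmanager.memory.jvm-metaspace.size: 256m
{code}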

> Memory Leak causes Metaspace OOM error on repeated job submission
> -
>
> Key: FLINK-16142
> URL: https://issues.apache.org/jira/browse/FLINK-16142
> Project: Flink
>  Issue Type: Bug
>  Components: Client / Job Submission
>Affects Versions: 1.10.0
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Fix For: 1.10.1, 1.11.0
>
> Attachments: Leak-GC-root.png, java_pid1.hprof, java_pid1.hprof
>
>
> Hi Guys,
> We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our 
> use-case exactly (RocksDB state backend running in a containerised cluster). 
> Unfortunately, it seems like there is a memory leak somewhere in the job 
> submission logic. We are getting this error:
> {code:java}
> 2020-02-18 10:22:10,020 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME 
> switched from RUNNING to FAILED.
> java.lang.OutOfMemoryError: Metaspace
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at 
> org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.(AwsSdkMetrics.java:359)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:611)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:606)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1534)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.listShards(KinesisProxy.java:439)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardsOfStream(KinesisProxy.java:389)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardList(KinesisProxy.java:279)
> at 
> org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.discoverNewShardsToSubscribe(KinesisDataFetcher.java:686)
> at 
> org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.run(FlinkKinesisConsumer.java:287)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
> {code}
> (The only change in the above text is the OPERATOR_NAME text where I removed 
> some of the internal specifics of our system).
> This will reliably happen on a fresh cluster after submitting and cancelling 
> our job 3 times.
> We are using the presto-s3 plugin, the CEP library and the Kinesis connector.
> Please let me know what other diagnostics would be useful.
> Tom



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission

2020-02-26 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045467#comment-17045467
 ] 

Thomas Wozniakowski commented on FLINK-16142:
-

Yeah, increasing the metaspace seems to resolve the problems. I think you might 
be right. Class unloading doesn't seem to happen as aggressively as it is needed, 
i.e. if you want to load some classes and there's no space, it doesn't trigger a 
stop-the-world style mega-GC that would free up the space.

> Memory Leak causes Metaspace OOM error on repeated job submission
> -
>
> Key: FLINK-16142
> URL: https://issues.apache.org/jira/browse/FLINK-16142
> Project: Flink
>  Issue Type: Bug
>  Components: Client / Job Submission
>Affects Versions: 1.10.0
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Fix For: 1.10.1, 1.11.0
>
> Attachments: Leak-GC-root.png, java_pid1.hprof, java_pid1.hprof
>
>
> Hi Guys,
> We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our 
> use-case exactly (RocksDB state backend running in a containerised cluster). 
> Unfortunately, it seems like there is a memory leak somewhere in the job 
> submission logic. We are getting this error:
> {code:java}
> 2020-02-18 10:22:10,020 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME 
> switched from RUNNING to FAILED.
> java.lang.OutOfMemoryError: Metaspace
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at 
> org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.(AwsSdkMetrics.java:359)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:611)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:606)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1534)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.listShards(KinesisProxy.java:439)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardsOfStream(KinesisProxy.java:389)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardList(KinesisProxy.java:279)
> at 
> org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.discoverNewShardsToSubscribe(KinesisDataFetcher.java:686)
> at 
> org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.run(FlinkKinesisConsumer.java:287)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
> {code}
> (The only change in the above text is the OPERATOR_NAME text where I removed 
> some of the internal specifics of our system).
> This will reliably happen on a fresh cluster after submitting and cancelling 
> our job 3 times.
> We are using the presto-s3 plugin, the CEP library and the Kinesis connector.
> Please let me know what other diagnostics would be useful.
> Tom



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission

2020-02-24 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043658#comment-17043658
 ] 

Thomas Wozniakowski edited comment on FLINK-16142 at 2/24/20 4:36 PM:
--

Hey [~sewen] - we've applied the fix on our build by excluding:

{code:groovy}
exclude 'com/amazonaws/jmx/SdkMBeanRegistrySupport*'
exclude 'com/masabi/pattern/shaded/com/amazonaws/jmx/SdkMBeanRegistrySupport*'
exclude 'org/apache/flink/kinesis/shaded/com/amazonaws/jmx/SdkMBeanRegistrySupport*'
{code}

The results have been interesting. It certainly seems to _help_ with the problem, 
but we're still seeing OOM errors in our builds where jobs are rapidly started 
and stopped. Looking at the TaskManager metrics, I can see that the classes 
actually *are* being unloaded now (after 10 runs, 60,000 loaded, 52,000 
unloaded), but the unloading seems to happen on a bit of a delay, as if something 
stays alive and hangs onto the classes for a few seconds after the job exits.

I've attached another heap dump, could you help us track down what might be 
causing this final wrinkle? Feels like we're close here!

Edit: I somewhat stupidly gave the second heap dump the same name. It's the one 
that's uploaded more recently, sorry about that!


was (Author: jamalarm):
Hey [~sewen] - we've applied the fix on our build by excluding:

{code:groovy}
exclude 'com/amazonaws/jmx/SdkMBeanRegistrySupport*'
exclude 'com/masabi/pattern/shaded/com/amazonaws/jmx/SdkMBeanRegistrySupport*'
exclude 'org/apache/flink/kinesis/shaded/com/amazonaws/jmx/SdkMBeanRegistrySupport*'
{code}

The results have been interesting. It certainly seems to _help_ with the problem, 
but we're still seeing OOM errors in our builds where jobs are rapidly started 
and stopped. Looking at the TaskManager metrics, I can see that the classes 
actually *are* being unloaded now (after 10 runs, 60,000 loaded, 52,000 
unloaded), but the unloading seems to happen on a bit of a delay, as if something 
stays alive and hangs onto the classes for a few seconds after the job exits.

I've attached another heap dump, could you help us track down what might be 
causing this final wrinkle? Feels like we're close here!

> Memory Leak causes Metaspace OOM error on repeated job submission
> -
>
> Key: FLINK-16142
> URL: https://issues.apache.org/jira/browse/FLINK-16142
> Project: Flink
>  Issue Type: Bug
>  Components: Client / Job Submission
>Affects Versions: 1.10.0
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Fix For: 1.10.1, 1.11.0
>
> Attachments: Leak-GC-root.png, java_pid1.hprof, java_pid1.hprof
>
>
> Hi Guys,
> We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our 
> use-case exactly (RocksDB state backend running in a containerised cluster). 
> Unfortunately, it seems like there is a memory leak somewhere in the job 
> submission logic. We are getting this error:
> {code:java}
> 2020-02-18 10:22:10,020 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME 
> switched from RUNNING to FAILED.
> java.lang.OutOfMemoryError: Metaspace
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at 
> org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.(AwsSdkMetrics.java:359)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652)
> at 
> 

[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission

2020-02-24 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043658#comment-17043658
 ] 

Thomas Wozniakowski commented on FLINK-16142:
-

Hey [~sewen] - we've applied the fix on our build by excluding:

{code:groovy}
exclude 'com/amazonaws/jmx/SdkMBeanRegistrySupport*'
exclude 'com/masabi/pattern/shaded/com/amazonaws/jmx/SdkMBeanRegistrySupport*'
exclude 'org/apache/flink/kinesis/shaded/com/amazonaws/jmx/SdkMBeanRegistrySupport*'
{code}

The results have been interesting. It certainly seems to _help_ with the problem, 
but we're still seeing OOM errors in our builds where jobs are rapidly started 
and stopped. Looking at the TaskManager metrics, I can see that the classes 
actually *are* being unloaded now (after 10 runs, 60,000 loaded, 52,000 
unloaded), but the unloading seems to happen on a bit of a delay, as if something 
stays alive and hangs onto the classes for a few seconds after the job exits.

I've attached another heap dump, could you help us track down what might be 
causing this final wrinkle? Feels like we're close here!

> Memory Leak causes Metaspace OOM error on repeated job submission
> -
>
> Key: FLINK-16142
> URL: https://issues.apache.org/jira/browse/FLINK-16142
> Project: Flink
>  Issue Type: Bug
>  Components: Client / Job Submission
>Affects Versions: 1.10.0
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Fix For: 1.10.1, 1.11.0
>
> Attachments: Leak-GC-root.png, java_pid1.hprof, java_pid1.hprof
>
>
> Hi Guys,
> We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our 
> use-case exactly (RocksDB state backend running in a containerised cluster). 
> Unfortunately, it seems like there is a memory leak somewhere in the job 
> submission logic. We are getting this error:
> {code:java}
> 2020-02-18 10:22:10,020 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME 
> switched from RUNNING to FAILED.
> java.lang.OutOfMemoryError: Metaspace
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at 
> org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.(AwsSdkMetrics.java:359)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:611)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:606)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1534)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.listShards(KinesisProxy.java:439)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardsOfStream(KinesisProxy.java:389)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardList(KinesisProxy.java:279)
> at 
> org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.discoverNewShardsToSubscribe(KinesisDataFetcher.java:686)
> at 
> org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.run(FlinkKinesisConsumer.java:287)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
> at 
> 

[jira] [Updated] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission

2020-02-24 Thread Thomas Wozniakowski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-16142:

Attachment: java_pid1.hprof

> Memory Leak causes Metaspace OOM error on repeated job submission
> -
>
> Key: FLINK-16142
> URL: https://issues.apache.org/jira/browse/FLINK-16142
> Project: Flink
>  Issue Type: Bug
>  Components: Client / Job Submission
>Affects Versions: 1.10.0
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Fix For: 1.10.1, 1.11.0
>
> Attachments: Leak-GC-root.png, java_pid1.hprof, java_pid1.hprof
>
>
> Hi Guys,
> We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our 
> use-case exactly (RocksDB state backend running in a containerised cluster). 
> Unfortunately, it seems like there is a memory leak somewhere in the job 
> submission logic. We are getting this error:
> {code:java}
> 2020-02-18 10:22:10,020 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME 
> switched from RUNNING to FAILED.
> java.lang.OutOfMemoryError: Metaspace
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at 
> org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.(AwsSdkMetrics.java:359)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:611)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:606)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1534)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.listShards(KinesisProxy.java:439)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardsOfStream(KinesisProxy.java:389)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardList(KinesisProxy.java:279)
> at 
> org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.discoverNewShardsToSubscribe(KinesisDataFetcher.java:686)
> at 
> org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.run(FlinkKinesisConsumer.java:287)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
> {code}
> (The only change in the above text is the OPERATOR_NAME text where I removed 
> some of the internal specifics of our system).
> This will reliably happen on a fresh cluster after submitting and cancelling 
> our job 3 times.
> We are using the presto-s3 plugin, the CEP library and the Kinesis connector.
> Please let me know what other diagnostics would be useful.
> Tom



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission

2020-02-21 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042027#comment-17042027
 ] 

Thomas Wozniakowski edited comment on FLINK-16142 at 2/21/20 5:14 PM:
--

[~sewen]
I don't have access to the Kinesis Source code as it's a library, but I added 
that line to the SQS sink, as it's also going to be executed on the TaskManager 
alongside the Kinesis source (my test is only running on one taskmanager). 
Unfortunately it did not prevent the OOM error.


was (Author: jamalarm):
[~sewen]
I don't have access to the Kinesis Source code as it's a library, but I added 
that line to the SQS sink, as it's also going to be executed on the TaskManager 
alongside the Kinesis sink (my test is only running on one taskmanager). 
Unfortunately it did not prevent the OOM error.

> Memory Leak causes Metaspace OOM error on repeated job submission
> -
>
> Key: FLINK-16142
> URL: https://issues.apache.org/jira/browse/FLINK-16142
> Project: Flink
>  Issue Type: Bug
>  Components: Client / Job Submission
>Affects Versions: 1.10.0
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Fix For: 1.10.1, 1.11.0
>
> Attachments: Leak-GC-root.png, java_pid1.hprof
>
>
> Hi Guys,
> We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our 
> use-case exactly (RocksDB state backend running in a containerised cluster). 
> Unfortunately, it seems like there is a memory leak somewhere in the job 
> submission logic. We are getting this error:
> {code:java}
> 2020-02-18 10:22:10,020 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME 
> switched from RUNNING to FAILED.
> java.lang.OutOfMemoryError: Metaspace
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at 
> org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.(AwsSdkMetrics.java:359)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:611)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:606)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1534)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.listShards(KinesisProxy.java:439)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardsOfStream(KinesisProxy.java:389)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardList(KinesisProxy.java:279)
> at 
> org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.discoverNewShardsToSubscribe(KinesisDataFetcher.java:686)
> at 
> org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.run(FlinkKinesisConsumer.java:287)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
> {code}
> (The only change in the above text is the OPERATOR_NAME text where I removed 
> some of the internal specifics of our system).
> This will reliably happen on a fresh cluster 

[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission

2020-02-21 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042027#comment-17042027
 ] 

Thomas Wozniakowski commented on FLINK-16142:
-

[~sewen]
I don't have access to the Kinesis Source code as it's a library, but I added 
that line to the SQS sink, as it's also going to be executed on the TaskManager 
alongside the Kinesis sink (my test is only running on one taskmanager). 
Unfortunately it did not prevent the OOM error.

> Memory Leak causes Metaspace OOM error on repeated job submission
> -
>
> Key: FLINK-16142
> URL: https://issues.apache.org/jira/browse/FLINK-16142
> Project: Flink
>  Issue Type: Bug
>  Components: Client / Job Submission
>Affects Versions: 1.10.0
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Fix For: 1.10.1, 1.11.0
>
> Attachments: Leak-GC-root.png, java_pid1.hprof
>
>
> Hi Guys,
> We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our 
> use-case exactly (RocksDB state backend running in a containerised cluster). 
> Unfortunately, it seems like there is a memory leak somewhere in the job 
> submission logic. We are getting this error:
> {code:java}
> 2020-02-18 10:22:10,020 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME 
> switched from RUNNING to FAILED.
> java.lang.OutOfMemoryError: Metaspace
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at 
> org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.(AwsSdkMetrics.java:359)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:611)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:606)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1534)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.listShards(KinesisProxy.java:439)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardsOfStream(KinesisProxy.java:389)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardList(KinesisProxy.java:279)
> at 
> org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.discoverNewShardsToSubscribe(KinesisDataFetcher.java:686)
> at 
> org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.run(FlinkKinesisConsumer.java:287)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
> {code}
> (The only change in the above text is the OPERATOR_NAME text where I removed 
> some of the internal specifics of our system).
> This will reliably happen on a fresh cluster after submitting and cancelling 
> our job 3 times.
> We are using the presto-s3 plugin, the CEP library and the Kinesis connector.
> Please let me know what other diagnostics would be useful.
> Tom



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission

2020-02-21 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041933#comment-17041933
 ] 

Thomas Wozniakowski commented on FLINK-16142:
-

Just in case it is relevant, given we're talking about relocating classes: when 
we initially implemented this system it was intended to run on EMR. Due to the 
truly insane number of JARs AWS puts on the classpath (over 200) I had to 
painstakingly move a few things around in order to prevent clashes. We actually 
don't run on EMR anymore but the relocation is still there. It uses the Gradle 
Shadow plugin, config here:

{code:groovy}
shadowJar {
    archiveName = "pattern-detector-realtime.jar"

    relocate('com.amazonaws', 'com.masabi.pattern.shaded.com.amazonaws') {
        exclude 'com.amazonaws.handlers.*'
        exclude 'com.amazonaws.services.sqs.QueueUrlHandler'
        exclude 'com.amazonaws.services.sqs.internal.SQSRequestHandler'
        exclude 'com.amazonaws.services.sqs.MessageMD5ChecksumHandler'
    }

    exclude 'amazon-kinesis-producer-native-binaries/**'
    exclude 'cacerts/*'
}
{code}

> Memory Leak causes Metaspace OOM error on repeated job submission
> -
>
> Key: FLINK-16142
> URL: https://issues.apache.org/jira/browse/FLINK-16142
> Project: Flink
>  Issue Type: Bug
>  Components: Client / Job Submission
>Affects Versions: 1.10.0
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Fix For: 1.10.1, 1.11.0
>
> Attachments: Leak-GC-root.png, java_pid1.hprof
>
>
> Hi Guys,
> We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our 
> use-case exactly (RocksDB state backend running in a containerised cluster). 
> Unfortunately, it seems like there is a memory leak somewhere in the job 
> submission logic. We are getting this error:
> {code:java}
> 2020-02-18 10:22:10,020 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME 
> switched from RUNNING to FAILED.
> java.lang.OutOfMemoryError: Metaspace
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at 
> org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.(AwsSdkMetrics.java:359)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:611)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:606)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1534)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.listShards(KinesisProxy.java:439)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardsOfStream(KinesisProxy.java:389)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardList(KinesisProxy.java:279)
> at 
> org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.discoverNewShardsToSubscribe(KinesisDataFetcher.java:686)
> at 
> org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.run(FlinkKinesisConsumer.java:287)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
> at 
> 

[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission

2020-02-21 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041919#comment-17041919
 ] 

Thomas Wozniakowski commented on FLINK-16142:
-

[~pnowojski] We load the S3 plugin via this: 
https://github.com/docker-flink/docker-flink/pull/94, which I actually 
contributed myself. In production we currently use a different method, but we're 
replatforming our Flink deployment onto Docker, so this is how we do it now. The 
plugin is therefore being loaded from the {{plugins}} directory properly and not 
being put into {{lib}}.
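
For reference, the plugin-based layout that this approach produces looks roughly 
like the following (paths and jar versions are illustrative, not taken from this 
thread); the point is that the filesystem jar sits in its own subdirectory under 
{{plugins}} with an isolated classloader, rather than on the shared classpath in 
{{lib}}:

{code}
/opt/flink/plugins/s3-fs-presto/flink-s3-fs-presto-1.10.0.jar   (loaded by its own plugin classloader)
/opt/flink/lib/flink-dist_2.12-1.10.0.jar                       (core distribution only, no filesystem jars)
{code}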

> Memory Leak causes Metaspace OOM error on repeated job submission
> -
>
> Key: FLINK-16142
> URL: https://issues.apache.org/jira/browse/FLINK-16142
> Project: Flink
>  Issue Type: Bug
>  Components: Client / Job Submission
>Affects Versions: 1.10.0
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Fix For: 1.10.1, 1.11.0
>
> Attachments: java_pid1.hprof
>
>
> Hi Guys,
> We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our 
> use-case exactly (RocksDB state backend running in a containerised cluster). 
> Unfortunately, it seems like there is a memory leak somewhere in the job 
> submission logic. We are getting this error:
> {code:java}
> 2020-02-18 10:22:10,020 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME 
> switched from RUNNING to FAILED.
> java.lang.OutOfMemoryError: Metaspace
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at 
> org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.<clinit>(AwsSdkMetrics.java:359)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:611)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:606)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1534)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.listShards(KinesisProxy.java:439)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardsOfStream(KinesisProxy.java:389)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardList(KinesisProxy.java:279)
> at 
> org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.discoverNewShardsToSubscribe(KinesisDataFetcher.java:686)
> at 
> org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.run(FlinkKinesisConsumer.java:287)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
> {code}
> (The only change in the above text is the OPERATOR_NAME text where I removed 
> some of the internal specifics of our system).
> This will reliably happen on a fresh cluster after submitting and cancelling 
> our job 3 times.
> We are using the presto-s3 plugin, the CEP library and the Kinesis connector.
> Please let me know what other diagnostics would be useful.
> Tom



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission

2020-02-21 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041816#comment-17041816
 ] 

Thomas Wozniakowski edited comment on FLINK-16142 at 2/21/20 12:48 PM:
---

Hi [~sewen]

I've attached the heap dump. It was actually surprisingly straightforward to 
take it and get it out of the container. Apologies for not getting it done 
sooner.


was (Author: jamalarm):
Hi [~sewen]

I've attached the heap dump. It was actually surprisingly straightforward to 
take it and get it out of the container. Apologies for not getting it done 
sooner.

[^java_pid1.hprof] 

> Memory Leak causes Metaspace OOM error on repeated job submission
> -
>
> Key: FLINK-16142
> URL: https://issues.apache.org/jira/browse/FLINK-16142
> Project: Flink
>  Issue Type: Bug
>  Components: Client / Job Submission
>Affects Versions: 1.10.0
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Fix For: 1.10.1, 1.11.0
>
> Attachments: java_pid1.hprof
>
>
> Hi Guys,
> We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our 
> use-case exactly (RocksDB state backend running in a containerised cluster). 
> Unfortunately, it seems like there is a memory leak somewhere in the job 
> submission logic. We are getting this error:
> {code:java}
> 2020-02-18 10:22:10,020 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME 
> switched from RUNNING to FAILED.
> java.lang.OutOfMemoryError: Metaspace
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at 
> org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.<clinit>(AwsSdkMetrics.java:359)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:611)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:606)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1534)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.listShards(KinesisProxy.java:439)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardsOfStream(KinesisProxy.java:389)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardList(KinesisProxy.java:279)
> at 
> org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.discoverNewShardsToSubscribe(KinesisDataFetcher.java:686)
> at 
> org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.run(FlinkKinesisConsumer.java:287)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
> {code}
> (The only change in the above text is the OPERATOR_NAME text where I removed 
> some of the internal specifics of our system).
> This will reliably happen on a fresh cluster after submitting and cancelling 
> our job 3 times.
> We are using the presto-s3 plugin, the CEP library and the Kinesis connector.
> Please let me know what other diagnostics would be useful.
> Tom



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission

2020-02-21 Thread Thomas Wozniakowski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-16142:

Attachment: java_pid1.hprof

> Memory Leak causes Metaspace OOM error on repeated job submission
> -
>
> Key: FLINK-16142
> URL: https://issues.apache.org/jira/browse/FLINK-16142
> Project: Flink
>  Issue Type: Bug
>  Components: Client / Job Submission
>Affects Versions: 1.10.0
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Fix For: 1.10.1, 1.11.0
>
> Attachments: java_pid1.hprof
>
>
> Hi Guys,
> We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our 
> use-case exactly (RocksDB state backend running in a containerised cluster). 
> Unfortunately, it seems like there is a memory leak somewhere in the job 
> submission logic. We are getting this error:
> {code:java}
> 2020-02-18 10:22:10,020 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME 
> switched from RUNNING to FAILED.
> java.lang.OutOfMemoryError: Metaspace
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at 
> org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.<clinit>(AwsSdkMetrics.java:359)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:611)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:606)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1534)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.listShards(KinesisProxy.java:439)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardsOfStream(KinesisProxy.java:389)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardList(KinesisProxy.java:279)
> at 
> org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.discoverNewShardsToSubscribe(KinesisDataFetcher.java:686)
> at 
> org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.run(FlinkKinesisConsumer.java:287)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
> {code}
> (The only change in the above text is the OPERATOR_NAME text where I removed 
> some of the internal specifics of our system).
> This will reliably happen on a fresh cluster after submitting and cancelling 
> our job 3 times.
> We are using the presto-s3 plugin, the CEP library and the Kinesis connector.
> Please let me know what other diagnostics would be useful.
> Tom



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission

2020-02-21 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041816#comment-17041816
 ] 

Thomas Wozniakowski commented on FLINK-16142:
-

Hi [~sewen]

I've attached the heap dump. It was actually surprisingly straightforward to 
take it and get it out of the container. Apologies for not getting it done 
sooner.

[^java_pid1.hprof] 

> Memory Leak causes Metaspace OOM error on repeated job submission
> -
>
> Key: FLINK-16142
> URL: https://issues.apache.org/jira/browse/FLINK-16142
> Project: Flink
>  Issue Type: Bug
>  Components: Client / Job Submission
>Affects Versions: 1.10.0
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Fix For: 1.10.1, 1.11.0
>
>
> Hi Guys,
> We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our 
> use-case exactly (RocksDB state backend running in a containerised cluster). 
> Unfortunately, it seems like there is a memory leak somewhere in the job 
> submission logic. We are getting this error:
> {code:java}
> 2020-02-18 10:22:10,020 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME 
> switched from RUNNING to FAILED.
> java.lang.OutOfMemoryError: Metaspace
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at 
> org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.<clinit>(AwsSdkMetrics.java:359)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:611)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:606)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1534)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.listShards(KinesisProxy.java:439)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardsOfStream(KinesisProxy.java:389)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardList(KinesisProxy.java:279)
> at 
> org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.discoverNewShardsToSubscribe(KinesisDataFetcher.java:686)
> at 
> org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.run(FlinkKinesisConsumer.java:287)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
> {code}
> (The only change in the above text is the OPERATOR_NAME text where I removed 
> some of the internal specifics of our system).
> This will reliably happen on a fresh cluster after submitting and cancelling 
> our job 3 times.
> We are using the presto-s3 plugin, the CEP library and the Kinesis connector.
> Please let me know what other diagnostics would be useful.
> Tom



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission

2020-02-21 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041683#comment-17041683
 ] 

Thomas Wozniakowski commented on FLINK-16142:
-

Hi [~sewen], here is the first chunk of the logs with all the config parts:

{code}
Starting Task Manager
config file: 
jobmanager.rpc.address: pattern-detector-e2e-jobmanager
jobmanager.rpc.port: 6123
jobmanager.heap.size: 1024m
taskmanager.memory.process.size: 1568m
taskmanager.numberOfTaskSlots: 2
parallelism.default: 1
jobmanager.execution.failover-strategy: region
blob.server.port: 6124
query.server.port: 6125
Starting taskexecutor as a console application on host 1ef836eff98e.
2020-02-21 08:46:50,418 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - 

2020-02-21 08:46:50,422 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   -  
Preconfiguration: 
2020-02-21 08:46:50,423 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - 


TM_RESOURCES_JVM_PARAMS extraction logs:
 - Loading configuration property: jobmanager.rpc.address, 
pattern-detector-e2e-jobmanager
 - Loading configuration property: jobmanager.rpc.port, 6123
 - Loading configuration property: jobmanager.heap.size, 1024m
 - Loading configuration property: taskmanager.memory.process.size, 1568m
 - Loading configuration property: taskmanager.numberOfTaskSlots, 2
 - Loading configuration property: parallelism.default, 1
 - Loading configuration property: jobmanager.execution.failover-strategy, 
region
 - Loading configuration property: blob.server.port, 6124
 - Loading configuration property: query.server.port, 6125
 - The derived from fraction jvm overhead memory (156.800mb (164416719 bytes)) 
is less than its min value 192.000mb (201326592 bytes), min value will be used 
instead
BASH_JAVA_UTILS_EXEC_RESULT:-Xmx536870902 -Xms536870902 
-XX:MaxDirectMemorySize=268435458 -XX:MaxMetaspaceSize=100663296

TM_RESOURCES_DYNAMIC_CONFIGS extraction logs:
 - Loading configuration property: jobmanager.rpc.address, 
pattern-detector-e2e-jobmanager
 - Loading configuration property: jobmanager.rpc.port, 6123
 - Loading configuration property: jobmanager.heap.size, 1024m
 - Loading configuration property: taskmanager.memory.process.size, 1568m
 - Loading configuration property: taskmanager.numberOfTaskSlots, 2
 - Loading configuration property: parallelism.default, 1
 - Loading configuration property: jobmanager.execution.failover-strategy, 
region
 - Loading configuration property: blob.server.port, 6124
 - Loading configuration property: query.server.port, 6125
 - The derived from fraction jvm overhead memory (156.800mb (164416719 bytes)) 
is less than its min value 192.000mb (201326592 bytes), min value will be used 
instead
BASH_JAVA_UTILS_EXEC_RESULT:-D 
taskmanager.memory.framework.off-heap.size=134217728b -D 
taskmanager.memory.network.max=134217730b -D 
taskmanager.memory.network.min=134217730b -D 
taskmanager.memory.framework.heap.size=134217728b -D 
taskmanager.memory.managed.size=536870920b -D taskmanager.cpu.cores=2.0 -D 
taskmanager.memory.task.heap.size=402653174b -D 
taskmanager.memory.task.off-heap.size=0b 

2020-02-21 08:46:50,423 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - 

2020-02-21 08:46:50,424 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   -  Starting 
TaskManager (Version: 1.10.0, Rev:aa4eb8f, Date:07.02.2020 @ 19:18:19 CET)
2020-02-21 08:46:50,425 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   -  OS current 
user: flink
2020-02-21 08:46:50,426 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   -  Current 
Hadoop/Kerberos user: 
2020-02-21 08:46:50,426 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   -  JVM: OpenJDK 
64-Bit Server VM - Oracle Corporation - 1.8/25.242-b08
2020-02-21 08:46:50,426 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   -  Maximum heap 
size: 512 MiBytes
2020-02-21 08:46:50,427 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   -  JAVA_HOME: 
/usr/local/openjdk-8
2020-02-21 08:46:50,427 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   -  No Hadoop 
Dependency available
2020-02-21 08:46:50,428 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   -  JVM Options:
2020-02-21 08:46:50,428 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - -XX:+UseG1GC
2020-02-21 08:46:50,428 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - 
-Xmx536870902
2020-02-21 08:46:50,428 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - 
-Xms536870902
2020-02-21 08:46:50,429 INFO  

[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission

2020-02-20 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041156#comment-17041156
 ] 

Thomas Wozniakowski commented on FLINK-16142:
-

Hey [~sewen], we are using the official Flink Docker containers with no 
explicit JVM overrides; whatever the default is, that's what we're using.

It might not be relevant, but our Job JAR is compiled using the ECJ compiler 
and not the standard JDK compiler. Would that matter?

> Memory Leak causes Metaspace OOM error on repeated job submission
> -
>
> Key: FLINK-16142
> URL: https://issues.apache.org/jira/browse/FLINK-16142
> Project: Flink
>  Issue Type: Bug
>  Components: Client / Job Submission
>Affects Versions: 1.10.0
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Fix For: 1.10.1, 1.11.0
>
>
> Hi Guys,
> We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our 
> use-case exactly (RocksDB state backend running in a containerised cluster). 
> Unfortunately, it seems like there is a memory leak somewhere in the job 
> submission logic. We are getting this error:
> {code:java}
> 2020-02-18 10:22:10,020 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME 
> switched from RUNNING to FAILED.
> java.lang.OutOfMemoryError: Metaspace
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at 
> org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.<clinit>(AwsSdkMetrics.java:359)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:611)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:606)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1534)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.listShards(KinesisProxy.java:439)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardsOfStream(KinesisProxy.java:389)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardList(KinesisProxy.java:279)
> at 
> org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.discoverNewShardsToSubscribe(KinesisDataFetcher.java:686)
> at 
> org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.run(FlinkKinesisConsumer.java:287)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
> {code}
> (The only change in the above text is the OPERATOR_NAME text where I removed 
> some of the internal specifics of our system).
> This will reliably happen on a fresh cluster after submitting and cancelling 
> our job 3 times.
> We are using the presto-s3 plugin, the CEP library and the Kinesis connector.
> Please let me know what other diagnostics would be useful.
> Tom



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission

2020-02-20 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041059#comment-17041059
 ] 

Thomas Wozniakowski commented on FLINK-16142:
-

[~arvid heise] I managed to get hold of those metaspace stats you asked for 
from inside the docker container. For reference, the easiest way I found to 
actually achieve this is to docker exec your way into the container, install 
sdkman and then do "sdk install java 8.0.242-open". This seems to give you a 
jmap command that is compatible with the running JVM.

{code:java}
root@27da7e6b6873:/usr/local/openjdk-8# jmap -clstats 1
Attaching to process ID 1, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.242-b08
finding class loader instances ..done.
computing per loader stat ..done.
please wait.. computing liveness.liveness analysis may be inaccurate ...
class_loader  classes  bytes   parent_loader   alive?  type

<bootstrap>  2466  4237013   null  live  <internal>
0xe0201820  0   0   0xe0170730  dead
java/util/ResourceBundle$RBClassLoader@0x00010007c028
0xe0201a20  18  30389 null  dead
sun/misc/Launcher$ExtClassLoader@0x0001f6b0
0xe0d88ed8  1   1455  0xe0170730  dead
sun/reflect/DelegatingClassLoader@0x00019c70
0xe130bd30  1   864 0xe1027c38  dead
sun/reflect/DelegatingClassLoader@0x00019c70
0xe0c982c0  1   1455  0xe0790330  dead
sun/reflect/DelegatingClassLoader@0x00019c70
0xe201e008  1   1457  0xe1fa86d8  dead
sun/reflect/DelegatingClassLoader@0x00019c70
0xe0c620c8  1   866 0xe0790330  dead
sun/reflect/DelegatingClassLoader@0x00019c70
0xe1ffb1f0  1   866 0xe1fa86d8  dead
sun/reflect/DelegatingClassLoader@0x00019c70
0xe0d60cc0  1   1455  0xe0170730  dead
sun/reflect/DelegatingClassLoader@0x00019c70
0xe0cec4d0  1   864 0xe0790330  dead
sun/reflect/DelegatingClassLoader@0x00019c70
0xe1ff9fe0  1   864 0xe1fa86d8  dead
sun/reflect/DelegatingClassLoader@0x00019c70
0xe1ffafe0  1   1455  0xe1fa86d8  dead
sun/reflect/DelegatingClassLoader@0x00019c70
0xe0cec0e8  1   864 0xe0790330  dead
sun/reflect/DelegatingClassLoader@0x00019c70
0xe130c118  1   864 0xe1027c38  dead
sun/reflect/DelegatingClassLoader@0x00019c70
0xe201ce28  1   864 0xe1fa86d8  dead
sun/reflect/DelegatingClassLoader@0x00019c70
0xe0ced2e0  1   864 0xe0790330  dead
sun/reflect/DelegatingClassLoader@0x00019c70
0xe130c500  1   864 0xe1027c38  dead
sun/reflect/DelegatingClassLoader@0x00019c70
0xe0d272e0  1   1457  null  dead
sun/reflect/DelegatingClassLoader@0x00019c70
0xffe25bd0  1   1455  null  dead
sun/reflect/DelegatingClassLoader@0x00019c70
0xe0cecef8  1   864 0xe0790330  dead
sun/reflect/DelegatingClassLoader@0x00019c70
0xe1ffa3c8  1   864 0xe1fa86d8  dead
sun/reflect/DelegatingClassLoader@0x00019c70
0xe0c98af0  1   1457  0xe0790330  dead
sun/reflect/DelegatingClassLoader@0x00019c70
0xe1027138  1   873 0xe0399898  dead
sun/reflect/DelegatingClassLoader@0x00019c70
0xe12da110  1   1455  0xe1027c38  dead
sun/reflect/DelegatingClassLoader@0x00019c70
0xe130c370  1   864 0xe1027c38  dead
sun/reflect/DelegatingClassLoader@0x00019c70
0xe201ca40  1   864 0xe1fa86d8  dead
sun/reflect/DelegatingClassLoader@0x00019c70
0xe0dda298  1   1456  0xe0d60c60  dead
sun/reflect/DelegatingClassLoader@0x00019c70
0xe12da368  1   1457  0xe1027c38  dead
sun/reflect/DelegatingClassLoader@0x00019c70
0xe0ced088  1   864 0xe0790330  dead
sun/reflect/DelegatingClassLoader@0x00019c70
0xe1ffb5b8  1   1457  0xe1fa86d8  dead
sun/reflect/DelegatingClassLoader@0x00019c70
0xe0c98c80  1   1455  0xe0790330  dead
sun/reflect/DelegatingClassLoader@0x00019c70
0xe201d648  1   1494

[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission

2020-02-20 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041005#comment-17041005
 ] 

Thomas Wozniakowski commented on FLINK-16142:
-

On further investigation, I can see that this issue (of classes not being 
unloaded) actually exists on our 1.9.2 deployment as well; it just doesn't seem 
to cause the OOM error there (presumably because the limit is higher).

Restarting my job against our remote cluster and polling the 
Status.JVM.ClassLoader.ClassesLoaded metric shows the number increasing by 3000 
each time. Classes appear to never be unloaded...
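
As a cross-check from inside the JVM (independent of the Flink metric), something 
like the sketch below could be called from a job's open() method to log Metaspace 
usage and class-loading counters on each submission. It is only a sketch using the 
standard java.lang.management API; the class name is made up and it is not code 
from this job:

{code:java}
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public final class MetaspaceProbe {

    /** Logs current class-loading counters and Metaspace usage to stdout. */
    public static void dump(String tag) {
        ClassLoadingMXBean classLoading = ManagementFactory.getClassLoadingMXBean();
        System.out.printf("[%s] totalLoaded=%d unloaded=%d currentlyLoaded=%d%n",
                tag,
                classLoading.getTotalLoadedClassCount(),
                classLoading.getUnloadedClassCount(),
                classLoading.getLoadedClassCount());

        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                System.out.printf("[%s] Metaspace used=%d bytes, max=%d bytes%n",
                        tag, pool.getUsage().getUsed(), pool.getUsage().getMax());
            }
        }
    }

    private MetaspaceProbe() {
    }
}
{code}

If the used figure keeps climbing across submit/cancel cycles while the unloaded 
counter stays flat, that points the same way as the ClassesLoaded metric above.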

> Memory Leak causes Metaspace OOM error on repeated job submission
> -
>
> Key: FLINK-16142
> URL: https://issues.apache.org/jira/browse/FLINK-16142
> Project: Flink
>  Issue Type: Bug
>  Components: Client / Job Submission
>Affects Versions: 1.10.0
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Fix For: 1.10.1, 1.11.0
>
>
> Hi Guys,
> We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our 
> use-case exactly (RocksDB state backend running in a containerised cluster). 
> Unfortunately, it seems like there is a memory leak somewhere in the job 
> submission logic. We are getting this error:
> {code:java}
> 2020-02-18 10:22:10,020 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME 
> switched from RUNNING to FAILED.
> java.lang.OutOfMemoryError: Metaspace
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at 
> org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.<clinit>(AwsSdkMetrics.java:359)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:611)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:606)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1534)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.listShards(KinesisProxy.java:439)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardsOfStream(KinesisProxy.java:389)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardList(KinesisProxy.java:279)
> at 
> org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.discoverNewShardsToSubscribe(KinesisDataFetcher.java:686)
> at 
> org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.run(FlinkKinesisConsumer.java:287)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
> {code}
> (The only change in the above text is the OPERATOR_NAME text where I removed 
> some of the internal specifics of our system).
> This will reliably happen on a fresh cluster after submitting and cancelling 
> our job 3 times.
> We are using the presto-s3 plugin, the CEP library and the Kinesis connector.
> Please let me know what other diagnostics would be useful.
> Tom



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission

2020-02-20 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040885#comment-17040885
 ] 

Thomas Wozniakowski edited comment on FLINK-16142 at 2/20/20 11:54 AM:
---

Here is a threaddump (just the last one this time, before the OOM):

{code}
THREAD: CardsPerDevice[MTA/HIGH]{3} -> Sink: SQS: 
pattern-detector-e2e-test-signal-queue (1/1) (RUNNABLE) 
CCL:org.apache.flink.util.ChildFirstClassLoader@1a5d4e34
THREAD: CloseableReaperThread (WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: DestroyJavaVM (RUNNABLE) CCL:null
THREAD: Finalizer (WAITING) CCL:null
THREAD: Flink Netty Server (0) Thread 0 (RUNNABLE) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: Flink-MetricRegistry-thread-1 (TIMED_WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: Hashed wheel timer #1 (TIMED_WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: IOManager reader thread #1 (WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: IOManager writer thread #1 (WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: New I/O boss #3 (RUNNABLE) CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: New I/O boss #9 (RUNNABLE) CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: New I/O server boss #12 (RUNNABLE) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: New I/O server boss #6 (RUNNABLE) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: New I/O worker #1 (RUNNABLE) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: New I/O worker #10 (RUNNABLE) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: New I/O worker #11 (RUNNABLE) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: New I/O worker #2 (RUNNABLE) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: New I/O worker #4 (RUNNABLE) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: New I/O worker #5 (RUNNABLE) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: New I/O worker #7 (RUNNABLE) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: New I/O worker #8 (RUNNABLE) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: OutputFlusher for Source: Kinesis: pattern_detector_test_stream -> 
Remove Unwanted Events -> (Quorum Based Timestamps [SERVER_TIME] -> Filter NULL 
keys [deviceId], Quorum Based Timestamps [DEVICE_TIME]) (TIMED_WAITING) 
CCL:org.apache.flink.util.ChildFirstClassLoader@1a5d4e34
THREAD: Reference Handler (WAITING) CCL:null
THREAD: Signal Dispatcher (RUNNABLE) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: Source: Kinesis: pattern_detector_test_stream -> Remove Unwanted Events 
-> (Quorum Based Timestamps [SERVER_TIME] -> Filter NULL keys [deviceId], 
Quorum Based Timestamps [DEVICE_TIME]) (1/1) (RUNNABLE) 
CCL:org.apache.flink.util.ChildFirstClassLoader@1a5d4e34
THREAD: Timer-0 (TIMED_WAITING) CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: Timer-1 (TIMED_WAITING) CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: Timer-5 (WAITING) CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: flink-akka.actor.default-dispatcher-2 (WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: flink-akka.actor.default-dispatcher-3 (WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: flink-akka.actor.default-dispatcher-4 (WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: flink-akka.actor.default-dispatcher-5 (TIMED_WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: flink-akka.remote.default-remote-dispatcher-15 (TIMED_WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: flink-akka.remote.default-remote-dispatcher-6 (WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: flink-metrics-2 (TIMED_WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: flink-metrics-akka.remote.default-remote-dispatcher-3 (WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: flink-metrics-akka.remote.default-remote-dispatcher-4 (TIMED_WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: flink-metrics-scheduler-1 (TIMED_WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: flink-scheduler-1 (TIMED_WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: pool-3-thread-1 (WAITING) CCL:sun.misc.Launcher$AppClassLoader@75b84c92
{code}

I only see two classloaders at play here: 
sun.misc.Launcher$AppClassLoader@75b84c92 and 
org.apache.flink.util.ChildFirstClassLoader@1a5d4e34. I think that looks OK, 
right? The AppClassLoader is just the default Flink one, and the 
ChildFirstClassLoader is the one for the currently running job?

Edit: No luck with IdleConnectionReaper.shutdown(), unfortunately. Still OOM.
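
For anyone following along, the shutdown call mentioned in the edit above would 
look roughly like the sketch below. The shaded package prefix is assumed from the 
stack trace in the issue description, and the sink class shown here is 
illustrative rather than the actual SQS sink from this job:

{code:java}
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

// Illustrative sink only; the real SQS sink from this thread is not reproduced here.
public class SqsSinkWithReaperShutdown extends RichSinkFunction<String> {

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        // create the SQS client here ...
    }

    @Override
    public void invoke(String value, Context context) {
        // send the record to SQS ...
    }

    @Override
    public void close() throws Exception {
        // Attempt to stop the AWS SDK's background idle-connection reaper thread so it
        // cannot pin the job's classloader after cancellation. (Per the edit above,
        // this did not resolve the Metaspace OOM here.)
        org.apache.flink.kinesis.shaded.com.amazonaws.http.IdleConnectionReaper.shutdown();
        super.close();
    }
}
{code}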


was (Author: jamalarm):
Here is a threaddump (just the last one this time, before the OOM):

{code}
THREAD: CardsPerDevice[MTA/HIGH]{3} -> Sink: SQS: 
pattern-detector-e2e-test-signal-queue (1/1) (RUNNABLE) 

[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission

2020-02-20 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040885#comment-17040885
 ] 

Thomas Wozniakowski commented on FLINK-16142:
-

Here is a threaddump (just the last one this time, before the OOM):

{code}
THREAD: CardsPerDevice[MTA/HIGH]{3} -> Sink: SQS: 
pattern-detector-e2e-test-signal-queue (1/1) (RUNNABLE) 
CCL:org.apache.flink.util.ChildFirstClassLoader@1a5d4e34
THREAD: CloseableReaperThread (WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: DestroyJavaVM (RUNNABLE) CCL:null
THREAD: Finalizer (WAITING) CCL:null
THREAD: Flink Netty Server (0) Thread 0 (RUNNABLE) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: Flink-MetricRegistry-thread-1 (TIMED_WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: Hashed wheel timer #1 (TIMED_WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: IOManager reader thread #1 (WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: IOManager writer thread #1 (WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: New I/O boss #3 (RUNNABLE) CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: New I/O boss #9 (RUNNABLE) CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: New I/O server boss #12 (RUNNABLE) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: New I/O server boss #6 (RUNNABLE) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: New I/O worker #1 (RUNNABLE) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: New I/O worker #10 (RUNNABLE) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: New I/O worker #11 (RUNNABLE) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: New I/O worker #2 (RUNNABLE) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: New I/O worker #4 (RUNNABLE) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: New I/O worker #5 (RUNNABLE) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: New I/O worker #7 (RUNNABLE) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: New I/O worker #8 (RUNNABLE) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: OutputFlusher for Source: Kinesis: pattern_detector_test_stream -> 
Remove Unwanted Events -> (Quorum Based Timestamps [SERVER_TIME] -> Filter NULL 
keys [deviceId], Quorum Based Timestamps [DEVICE_TIME]) (TIMED_WAITING) 
CCL:org.apache.flink.util.ChildFirstClassLoader@1a5d4e34
THREAD: Reference Handler (WAITING) CCL:null
THREAD: Signal Dispatcher (RUNNABLE) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: Source: Kinesis: pattern_detector_test_stream -> Remove Unwanted Events 
-> (Quorum Based Timestamps [SERVER_TIME] -> Filter NULL keys [deviceId], 
Quorum Based Timestamps [DEVICE_TIME]) (1/1) (RUNNABLE) 
CCL:org.apache.flink.util.ChildFirstClassLoader@1a5d4e34
THREAD: Timer-0 (TIMED_WAITING) CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: Timer-1 (TIMED_WAITING) CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: Timer-5 (WAITING) CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: flink-akka.actor.default-dispatcher-2 (WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: flink-akka.actor.default-dispatcher-3 (WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: flink-akka.actor.default-dispatcher-4 (WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: flink-akka.actor.default-dispatcher-5 (TIMED_WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: flink-akka.remote.default-remote-dispatcher-15 (TIMED_WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: flink-akka.remote.default-remote-dispatcher-6 (WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: flink-metrics-2 (TIMED_WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: flink-metrics-akka.remote.default-remote-dispatcher-3 (WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: flink-metrics-akka.remote.default-remote-dispatcher-4 (TIMED_WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: flink-metrics-scheduler-1 (TIMED_WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: flink-scheduler-1 (TIMED_WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92
THREAD: pool-3-thread-1 (WAITING) CCL:sun.misc.Launcher$AppClassLoader@75b84c92
{code}

I only see two classloaders at play here: 
sun.misc.Launcher$AppClassLoader@75b84c92 and 
org.apache.flink.util.ChildFirstClassLoader@1a5d4e34. I think that looks OK, 
right? The AppClassLoader is just the default Flink one, and the 
ChildFirstClassLoader is the one for the currently running job?
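
For context, the usual way a per-job {{ChildFirstClassLoader}} ends up leaking is 
that something reachable from the parent classloader (a static cache, a registered 
MBean, a lingering non-daemon thread) keeps a reference to an object or thread 
whose class was defined by the job classloader. A deliberately simplified 
illustration of the static-reference variant (not the specific leak in this 
ticket):

{code:java}
import java.util.ArrayList;
import java.util.List;

// Simplified illustration, not code from Flink or from this job. Imagine this
// class sits in lib/ and is therefore loaded by the parent (AppClassLoader),
// so the static list below survives job cancellation.
public final class ParentScopedRegistry {

    private static final List<Object> REGISTERED = new ArrayList<>();

    // Called from job code with an object whose class was defined by the job's
    // ChildFirstClassLoader. The chain
    //   parent class -> static list -> object -> object's class -> job classloader
    // keeps every class that loader defined alive, so Metaspace grows on each
    // submit/cancel cycle.
    public static void register(Object fromJobClassLoader) {
        REGISTERED.add(fromJobClassLoader);
    }

    private ParentScopedRegistry() {
    }
}
{code}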

> Memory Leak causes Metaspace OOM error on repeated job submission
> -
>
> Key: FLINK-16142
> URL: https://issues.apache.org/jira/browse/FLINK-16142
> Project: Flink
>  Issue Type: Bug
>  Components: Client / Job Submission
>Affects Versions: 1.10.0
> 

[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission

2020-02-19 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040239#comment-17040239
 ] 

Thomas Wozniakowski commented on FLINK-16142:
-

After first run
{code:json}
[
{
"id": "Status.JVM.ClassLoader.ClassesLoaded",
"min": 10385.0,
"max": 10385.0,
"avg": 10385.0,
"sum": 10385.0
},
{
"id": "Status.JVM.ClassLoader.ClassesUnloaded",
"min": 0.0,
"max": 0.0,
"avg": 0.0,
"sum": 0.0
}
]
{code}

After second run
{code:json}
[
{
"id": "Status.JVM.ClassLoader.ClassesLoaded",
"min": 13063.0,
"max": 13063.0,
"avg": 13063.0,
"sum": 13063.0
},
{
"id": "Status.JVM.ClassLoader.ClassesUnloaded",
"min": 67.0,
"max": 67.0,
"avg": 67.0,
"sum": 67.0
}
]
{code}

After third run
{code:json}
[
{
"id": "Status.JVM.ClassLoader.ClassesLoaded",
"min": 15506.0,
"max": 15506.0,
"avg": 15506.0,
"sum": 15506.0
},
{
"id": "Status.JVM.ClassLoader.ClassesUnloaded",
"min": 67.0,
"max": 67.0,
"avg": 67.0,
"sum": 67.0
}
]
{code}

Definitely seems like the classes aren't being unloaded?
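
For reference, the numbers above can be pulled from outside the container with a 
plain HTTP call against the TaskManager metrics endpoint of the standard Flink 
REST API. A minimal sketch (the REST address and TaskManager id are placeholders):

{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public final class ClassLoaderMetricPoller {

    public static void main(String[] args) throws Exception {
        // Placeholders: point these at the real REST address and TaskManager id.
        String restBase = "http://localhost:8081";
        String taskManagerId = "REPLACE_WITH_TASKMANAGER_ID";

        URL url = new URL(restBase + "/taskmanagers/" + taskManagerId
                + "/metrics?get=Status.JVM.ClassLoader.ClassesLoaded,Status.JVM.ClassLoader.ClassesUnloaded");

        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            // Prints the same JSON array shape as the snippets above.
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } finally {
            connection.disconnect();
        }
    }

    private ClassLoaderMetricPoller() {
    }
}
{code}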


> Memory Leak causes Metaspace OOM error on repeated job submission
> -
>
> Key: FLINK-16142
> URL: https://issues.apache.org/jira/browse/FLINK-16142
> Project: Flink
>  Issue Type: Bug
>  Components: Client / Job Submission
>Affects Versions: 1.10.0
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Fix For: 1.10.1, 1.11.0
>
>
> Hi Guys,
> We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our 
> use-case exactly (RocksDB state backend running in a containerised cluster). 
> Unfortunately, it seems like there is a memory leak somewhere in the job 
> submission logic. We are getting this error:
> {code:java}
> 2020-02-18 10:22:10,020 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME 
> switched from RUNNING to FAILED.
> java.lang.OutOfMemoryError: Metaspace
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at 
> org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.<clinit>(AwsSdkMetrics.java:359)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:611)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:606)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1534)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.listShards(KinesisProxy.java:439)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardsOfStream(KinesisProxy.java:389)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardList(KinesisProxy.java:279)
> at 
> org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.discoverNewShardsToSubscribe(KinesisDataFetcher.java:686)
> at 
> 

[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission

2020-02-19 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040163#comment-17040163
 ] 

Thomas Wozniakowski commented on FLINK-16142:
-

OK, I put the following in the open() method of our custom SQS sink (just because 
it's somewhere in the job where it's easy to run arbitrary code):

{code:java}
Thread.getAllStackTraces().keySet().stream()
        .sorted(Comparator.comparing(Thread::getName))
        .forEach(thread ->
                System.out.printf("THREAD: %s (%s)%n",
                        thread.getName(),
                        thread.getState().toString()));
{code}
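
The later dumps in this thread also show each thread's context classloader (the 
CCL field). A variant that prints it could look like the following sketch (not 
necessarily the exact code that produced those dumps; it needs java.util.Comparator 
imported, as above):

{code:java}
Thread.getAllStackTraces().keySet().stream()
        .sorted(Comparator.comparing(Thread::getName))
        .forEach(thread -> System.out.printf("THREAD: %s (%s) CCL:%s%n",
                thread.getName(),
                thread.getState(),
                thread.getContextClassLoader()));
{code}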

The only bits I have changed are the blocks of XXXs which are client-specific 
stuff that I can't post. They're just the names of CEP operators in the job.

First run output:
{code:java}
THREAD: XX -> Sink: SQS: pattern-detector-e2e-test-signal-queue (1/1) 
(RUNNABLE)
THREAD: CloseableReaperThread (WAITING)
THREAD: DestroyJavaVM (RUNNABLE)
THREAD: Finalizer (WAITING)
THREAD: Flink Netty Server (0) Thread 0 (RUNNABLE)
THREAD: Flink-MetricRegistry-thread-1 (TIMED_WAITING)
THREAD: Hashed wheel timer #1 (TIMED_WAITING)
THREAD: IOManager reader thread #1 (WAITING)
THREAD: IOManager writer thread #1 (WAITING)
THREAD: New I/O boss #3 (RUNNABLE)
THREAD: New I/O boss #9 (RUNNABLE)
THREAD: New I/O server boss #12 (RUNNABLE)
THREAD: New I/O server boss #6 (RUNNABLE)
THREAD: New I/O worker #1 (RUNNABLE)
THREAD: New I/O worker #10 (RUNNABLE)
THREAD: New I/O worker #11 (RUNNABLE)
THREAD: New I/O worker #2 (RUNNABLE)
THREAD: New I/O worker #4 (RUNNABLE)
THREAD: New I/O worker #5 (RUNNABLE)
THREAD: New I/O worker #7 (RUNNABLE)
THREAD: New I/O worker #8 (RUNNABLE)
THREAD: OutputFlusher for Source: Kinesis: pattern_detector_test_stream -> 
Remove Unwanted Events -> (Quorum Based Timestamps [SERVER_TIME] -> Filter NULL 
keys [deviceId], Quorum Based Timestamps [DEVICE_TIME]) (TIMED_WAITING)
THREAD: Reference Handler (WAITING)
THREAD: Signal Dispatcher (RUNNABLE)
THREAD: Source: Kinesis: pattern_detector_test_stream -> Remove Unwanted Events 
-> (Quorum Based Timestamps [SERVER_TIME] -> Filter NULL keys [deviceId], 
Quorum Based Timestamps [DEVICE_TIME]) (1/1) (RUNNABLE)
THREAD: Timer-0 (TIMED_WAITING)
THREAD: Timer-1 (TIMED_WAITING)
THREAD: flink-akka.actor.default-dispatcher-2 (TIMED_WAITING)
THREAD: flink-akka.actor.default-dispatcher-3 (WAITING)
THREAD: flink-akka.actor.default-dispatcher-4 (WAITING)
THREAD: flink-akka.actor.default-dispatcher-5 (WAITING)
THREAD: flink-akka.remote.default-remote-dispatcher-6 (TIMED_WAITING)
THREAD: flink-akka.remote.default-remote-dispatcher-7 (WAITING)
THREAD: flink-metrics-2 (TIMED_WAITING)
THREAD: flink-metrics-akka.remote.default-remote-dispatcher-3 (WAITING)
THREAD: flink-metrics-akka.remote.default-remote-dispatcher-4 (TIMED_WAITING)
THREAD: flink-metrics-scheduler-1 (TIMED_WAITING)
THREAD: flink-scheduler-1 (TIMED_WAITING)
THREAD: pool-3-thread-1 (TIMED_WAITING)
{code}

Second run output:
{code:java}
THREAD: XX -> Sink: SQS: pattern-detector-e2e-test-signal-queue (1/1) 
(RUNNABLE)
THREAD: CloseableReaperThread (WAITING)
THREAD: DestroyJavaVM (RUNNABLE)
THREAD: Finalizer (WAITING)
THREAD: Flink Netty Server (0) Thread 0 (RUNNABLE)
THREAD: Flink-MetricRegistry-thread-1 (TIMED_WAITING)
THREAD: Hashed wheel timer #1 (TIMED_WAITING)
THREAD: IOManager reader thread #1 (WAITING)
THREAD: IOManager writer thread #1 (WAITING)
THREAD: New I/O boss #3 (RUNNABLE)
THREAD: New I/O boss #9 (RUNNABLE)
THREAD: New I/O server boss #12 (RUNNABLE)
THREAD: New I/O server boss #6 (RUNNABLE)
THREAD: New I/O worker #1 (RUNNABLE)
THREAD: New I/O worker #10 (RUNNABLE)
THREAD: New I/O worker #11 (RUNNABLE)
THREAD: New I/O worker #2 (RUNNABLE)
THREAD: New I/O worker #4 (RUNNABLE)
THREAD: New I/O worker #5 (RUNNABLE)
THREAD: New I/O worker #7 (RUNNABLE)
THREAD: New I/O worker #8 (RUNNABLE)
THREAD: OutputFlusher for Source: Kinesis: pattern_detector_test_stream -> 
Remove Unwanted Events -> (Quorum Based Timestamps [SERVER_TIME] -> Filter NULL 
keys [deviceId], Quorum Based Timestamps [DEVICE_TIME]) (TIMED_WAITING)
THREAD: Reference Handler (WAITING)
THREAD: Signal Dispatcher (RUNNABLE)
THREAD: Source: Kinesis: pattern_detector_test_stream -> Remove Unwanted Events 
-> (Quorum Based Timestamps [SERVER_TIME] -> Filter NULL keys [deviceId], 
Quorum Based Timestamps [DEVICE_TIME]) (1/1) (RUNNABLE)
THREAD: Timer-0 (TIMED_WAITING)
THREAD: Timer-1 (TIMED_WAITING)
THREAD: flink-akka.actor.default-dispatcher-2 (TIMED_WAITING)
THREAD: flink-akka.actor.default-dispatcher-3 (WAITING)
THREAD: flink-akka.actor.default-dispatcher-4 (WAITING)
THREAD: flink-akka.actor.default-dispatcher-5 (WAITING)
THREAD: flink-akka.remote.default-remote-dispatcher-6 (WAITING)
THREAD: flink-akka.remote.default-remote-dispatcher-7 (TIMED_WAITING)
THREAD: flink-metrics-2 (TIMED_WAITING)
THREAD: flink-metrics-akka.remote.default-remote-dispatcher-3 (TIMED_WAITING)
THREAD: 

[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission

2020-02-19 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040146#comment-17040146
 ] 

Thomas Wozniakowski commented on FLINK-16142:
-

[~kevin.cyj] It would be quite difficult for me to put arbitrary JARs in the 
lib folder, as we're running entirely on the official Docker containers and 
the app is specifically set up to run in Docker (it reads everything from 
environment variables, etc.). I'm going to try to get the main method of the 
JAR to print all the live threads to the logs at the point where the job 
starts; hopefully that will give some insight.

I can post the code of our custom SQS sink, but it's really only about 4 lines.

[~pnowojski] attaching a memory profiler runs into a similar problem: the official 
Flink docker images don't expose a port for attaching to the JVM, so I don't 
think that will be possible. I'm trying to work out whether there are any other 
useful metrics I could monitor via the REST endpoints from outside the container 
to help debug this.
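
In case it helps, this is the sort of external polling I mean (a rough sketch 
only, assuming the standard monitoring REST API is reachable on port 8081 and 
that the task manager id is first obtained from GET /taskmanagers; the metric 
names are the ones I believe are exposed, e.g. Status.JVM.Threads.Count):

{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Hypothetical poller, run from outside the container against the Flink REST API.
public final class MetricsPoller {

    public static void main(String[] args) throws Exception {
        String jobManagerHost = args[0]; // e.g. "localhost"
        String taskManagerId = args[1];  // taken from GET http://<host>:8081/taskmanagers
        String query = "http://" + jobManagerHost + ":8081/taskmanagers/" + taskManagerId
                + "/metrics?get=Status.JVM.Memory.Metaspace.Used,Status.JVM.Threads.Count";
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(query).openStream()))) {
            // The endpoint returns a small JSON array of {"id": ..., "value": ...} pairs.
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
{code}

Running this after each submit/cancel cycle would at least show whether 
Metaspace usage and the thread count keep climbing.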

> Memory Leak causes Metaspace OOM error on repeated job submission
> -
>
> Key: FLINK-16142
> URL: https://issues.apache.org/jira/browse/FLINK-16142
> Project: Flink
>  Issue Type: Bug
>  Components: Client / Job Submission
>Affects Versions: 1.10.0
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Fix For: 1.10.1, 1.11.0
>
>
> Hi Guys,
> We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our 
> use-case exactly (RocksDB state backend running in a containerised cluster). 
> Unfortunately, it seems like there is a memory leak somewhere in the job 
> submission logic. We are getting this error:
> {code:java}
> 2020-02-18 10:22:10,020 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME 
> switched from RUNNING to FAILED.
> java.lang.OutOfMemoryError: Metaspace
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at 
> org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.(AwsSdkMetrics.java:359)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:611)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:606)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1534)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.listShards(KinesisProxy.java:439)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardsOfStream(KinesisProxy.java:389)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardList(KinesisProxy.java:279)
> at 
> org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.discoverNewShardsToSubscribe(KinesisDataFetcher.java:686)
> at 
> org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.run(FlinkKinesisConsumer.java:287)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
> {code}
> (The only change in the above text is the 

[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission

2020-02-18 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039260#comment-17039260
 ] 

Thomas Wozniakowski commented on FLINK-16142:
-

Explicitly calling {{.shutdown()}} on the SQS client in the Sink did not prevent 
the Metaspace OOM error.

I'm trying to monitor what's happening from the outside using metrics. Looking 
at Status.JVM.Threads.Count, I can see that the thread count climbs a little as 
I keep resubmitting the job, but only by a few threads (~90 in total at rest, 
~100 after one submission, ~110 after two; it fails on the third).

Would you expect such a small number of threads to cause this issue?

> Memory Leak causes Metaspace OOM error on repeated job submission
> -
>
> Key: FLINK-16142
> URL: https://issues.apache.org/jira/browse/FLINK-16142
> Project: Flink
>  Issue Type: Bug
>  Components: Client / Job Submission
>Affects Versions: 1.10.0
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Fix For: 1.10.1, 1.11.0
>
>
> Hi Guys,
> We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our 
> use-case exactly (RocksDB state backend running in a containerised cluster). 
> Unfortunately, it seems like there is a memory leak somewhere in the job 
> submission logic. We are getting this error:
> {code:java}
> 2020-02-18 10:22:10,020 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME 
> switched from RUNNING to FAILED.
> java.lang.OutOfMemoryError: Metaspace
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at 
> org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.(AwsSdkMetrics.java:359)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:611)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:606)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1534)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.listShards(KinesisProxy.java:439)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardsOfStream(KinesisProxy.java:389)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardList(KinesisProxy.java:279)
> at 
> org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.discoverNewShardsToSubscribe(KinesisDataFetcher.java:686)
> at 
> org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.run(FlinkKinesisConsumer.java:287)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
> {code}
> (The only change in the above text is the OPERATOR_NAME text where I removed 
> some of the internal specifics of our system).
> This will reliably happen on a fresh cluster after submitting and cancelling 
> our job 3 times.
> We are using the presto-s3 plugin, the CEP library and the Kinesis connector.
> Please let me know what other diagnostics would be useful.
> Tom



--
This message was sent by Atlassian 

[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission

2020-02-18 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039227#comment-17039227
 ] 

Thomas Wozniakowski commented on FLINK-16142:
-

Hi Guys,

Thanks for the speedy response.

It's going to be a little tricky to get at the heap dumps, as our application 
is written specifically to run against Flink inside docker. Do you know of any 
configuration options we could pass in from the outside that would make the 
dockerised Flink dump more useful information to the logs when the Metaspace 
OOM error occurs?

I'll take a look at our connectors. We are currently using:

- Kinesis source, official build from Maven (now that it is published there). 
Prior to 1.10.0 we built it from source internally.
- Custom SQS sink. Basically just dumps events straight onto a queue using the 
AWS SDK (version 1, blocking). This code has not changed since it was 
originally written (for Flink 1.3). 

We do not explicitly shut down the threads that presumably run in the 
background for the SQS SDK. I will have a look and see if there's a way we can 
explicitly close them when the job shuts down.
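
For context, the sink is shaped roughly like the sketch below (illustrative 
only, not our actual code), and the {{close()}} override is where an explicit 
shutdown of the SDK client would go:

{code:java}
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

// Rough sketch of a blocking SQS sink built on the AWS SDK v1.
public class SqsSink extends RichSinkFunction<String> {

    private final String queueUrl;
    private transient AmazonSQS sqs;

    public SqsSink(String queueUrl) {
        this.queueUrl = queueUrl;
    }

    @Override
    public void open(Configuration parameters) {
        sqs = AmazonSQSClientBuilder.defaultClient();
    }

    @Override
    public void invoke(String value, Context context) {
        sqs.sendMessage(queueUrl, value);
    }

    @Override
    public void close() {
        // Release the SDK's background threads when the job shuts down.
        if (sqs != null) {
            sqs.shutdown();
        }
    }
}
{code}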

Any other ideas of places I can look, give me a shout.

Tom

> Memory Leak causes Metaspace OOM error on repeated job submission
> -
>
> Key: FLINK-16142
> URL: https://issues.apache.org/jira/browse/FLINK-16142
> Project: Flink
>  Issue Type: Bug
>  Components: Client / Job Submission
>Affects Versions: 1.10.0
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Fix For: 1.10.1, 1.11.0
>
>
> Hi Guys,
> We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our 
> use-case exactly (RocksDB state backend running in a containerised cluster). 
> Unfortunately, it seems like there is a memory leak somewhere in the job 
> submission logic. We are getting this error:
> {code:java}
> 2020-02-18 10:22:10,020 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME 
> switched from RUNNING to FAILED.
> java.lang.OutOfMemoryError: Metaspace
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at 
> org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.(AwsSdkMetrics.java:359)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:611)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:606)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1534)
> at 
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.listShards(KinesisProxy.java:439)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardsOfStream(KinesisProxy.java:389)
> at 
> org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardList(KinesisProxy.java:279)
> at 
> org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.discoverNewShardsToSubscribe(KinesisDataFetcher.java:686)
> at 
> org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.run(FlinkKinesisConsumer.java:287)
> at 
> 

[jira] [Created] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission

2020-02-18 Thread Thomas Wozniakowski (Jira)
Thomas Wozniakowski created FLINK-16142:
---

 Summary: Memory Leak causes Metaspace OOM error on repeated job 
submission
 Key: FLINK-16142
 URL: https://issues.apache.org/jira/browse/FLINK-16142
 Project: Flink
  Issue Type: Bug
  Components: Client / Job Submission
Affects Versions: 1.10.0
Reporter: Thomas Wozniakowski


Hi Guys,

We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our 
use-case exactly (RocksDB state backend running in a containerised cluster). 
Unfortunately, it seems like there is a memory leak somewhere in the job 
submission logic. We are getting this error:


{code:java}
2020-02-18 10:22:10,020 INFO 
org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME switched 
from RUNNING to FAILED.
java.lang.OutOfMemoryError: Metaspace
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
at 
org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
at 
org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
at 
org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398)
at 
org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.(AwsSdkMetrics.java:359)
at 
org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728)
at 
org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660)
at 
org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652)
at 
org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:611)
at 
org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:606)
at 
org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1534)
at 
org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
at 
org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.listShards(KinesisProxy.java:439)
at 
org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardsOfStream(KinesisProxy.java:389)
at 
org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardList(KinesisProxy.java:279)
at 
org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.discoverNewShardsToSubscribe(KinesisDataFetcher.java:686)
at 
org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.run(FlinkKinesisConsumer.java:287)
at 
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
at 
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
{code}

(The only change in the above text is the OPERATOR_NAME text where I removed 
some of the internal specifics of our system).

This will reliably happen on a fresh cluster after submitting and cancelling 
our job 3 times.

We are using the presto-s3 plugin, the CEP library and the Kinesis connector.

Please let me know what other diagnostics would be useful.

Tom



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14812) Add custom libs to Flink classpath with an environment variable.

2020-01-31 Thread Thomas Wozniakowski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027328#comment-17027328
 ] 

Thomas Wozniakowski commented on FLINK-14812:
-

FYI - I am implementing a slightly inelegant version of this in the official 
docker-flink images via the entry point script: 
https://github.com/docker-flink/docker-flink/pull/94
If this were handled inside Flink itself (which would be preferable), that 
workaround could be removed in the future.

> Add custom libs to Flink classpath with an environment variable.
> 
>
> Key: FLINK-14812
> URL: https://issues.apache.org/jira/browse/FLINK-14812
> Project: Flink
>  Issue Type: New Feature
>  Components: Deployment / Kubernetes, Deployment / Scripts
>Reporter: Eui Heo
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> To use plugin library you need to add it to the flink classpath. The 
> documentation explains to put the jar file in the lib path.
> https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#prometheus-orgapacheflinkmetricsprometheusprometheusreporter
> However, to deploy metric-enabled Flinks on a kubernetes cluster, we have the 
> burden of creating and managing another container image. It would be more 
> efficient to add the classpath using environment variables inside the 
> constructFlinkClassPath function in the config.sh file.
> In particular, it seems inconvenient for me to create separate images to use 
> the jars, even though the /opt/ flink/opt of the stock image already contains 
> them.
> For example, there are metrics libs and file system libs:
> flink-azure-fs-hadoop-1.9.1.jar
> flink-s3-fs-hadoop-1.9.1.jar
> flink-metrics-prometheus-1.9.1.jar
> flink-metrics-influxdb-1.9.1.jar



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-10960) CEP: Job Failure when .times(2) is used

2019-01-03 Thread Thomas Wozniakowski (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski closed FLINK-10960.
---
  Resolution: Workaround
Release Note: It seems this issue was as David described, due to restoring 
into a smaller state machine than had existed before

> CEP: Job Failure when .times(2) is used
> ---
>
> Key: FLINK-10960
> URL: https://issues.apache.org/jira/browse/FLINK-10960
> Project: Flink
>  Issue Type: Bug
>  Components: CEP
>Affects Versions: 1.6.2
>Reporter: Thomas Wozniakowski
>Priority: Critical
>
> Hi Guys,
> Encountered a strange one today. We use the CEP library in a configurable way 
> where we plug a config file into the Flink Job JAR and it programmatically 
> sets up a bunch of CEP operators matching the config file.
> I encountered a strange bug when I was testing with some artificially low 
> numbers in our testing environment today. The CEP code we're using (modified 
> slightly) is:
> {code:java}
> Pattern.begin(EVENT_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent())
> .times(config.getNumberOfUniqueEvents())
> .where(uniquenessCheckOnAlreadyMatchedEvents())
> .within(seconds(config.getWithinSeconds()));
> {code}
> When using the {{numberOfUniqueEvents: 2}}, I started seeing the following 
> error killing the job whenever a match was detected:
> {quote}
> java.lang.RuntimeException: Exception occurred while processing valve output 
> watermark: 
>   at 
> org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:265)
>   at 
> org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:189)
>   at 
> org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.inputWatermark(StatusWatermarkValve.java:111)
>   at 
> org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:184)
>   at 
> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.flink.util.FlinkRuntimeException: State eventSequence:2 
> does not exist in the NFA. NFA has states [Final State $endState$ [
> ]), Normal State eventSequence [
>   StateTransition(TAKE, from eventSequenceto $endState$, with condition),
>   StateTransition(IGNORE, from eventSequenceto eventSequence, with 
> condition),
> ]), Start State eventSequence:0 [
>   StateTransition(TAKE, from eventSequence:0to eventSequence, with 
> condition),
> ])]
>   at org.apache.flink.cep.nfa.NFA.isStartState(NFA.java:144)
>   at org.apache.flink.cep.nfa.NFA.isStateTimedOut(NFA.java:270)
>   at org.apache.flink.cep.nfa.NFA.advanceTime(NFA.java:244)
>   at 
> org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.advanceTime(AbstractKeyedCEPPatternOperator.java:389)
>   at 
> org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.onEventTime(AbstractKeyedCEPPatternOperator.java:293)
>   at 
> org.apache.flink.streaming.api.operators.InternalTimerServiceImpl.advanceWatermark(InternalTimerServiceImpl.java:251)
>   at 
> org.apache.flink.streaming.api.operators.InternalTimeServiceManager.advanceWatermark(InternalTimeServiceManager.java:128)
>   at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:746)
>   at 
> org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:262)
> {quote}
> Changing the config to {{numberOfUniqueEvents: 3}} fixed the problem. 
> Changing it back to 2 brought the problem back. It seems to be specifically 
> related to the value of 2.
> This is not a blocking issue for me because we typically use much higher 
> numbers than this in production anyway, but I figured you guys might want to 
> know about this issue.
> Let me know if you need any more information.
> Tom



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10960) CEP: Job Failure when .times(2) is used

2018-12-12 Thread Thomas Wozniakowski (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-10960:

Priority: Critical  (was: Major)

> CEP: Job Failure when .times(2) is used
> ---
>
> Key: FLINK-10960
> URL: https://issues.apache.org/jira/browse/FLINK-10960
> Project: Flink
>  Issue Type: Bug
>  Components: CEP
>Affects Versions: 1.6.2
>Reporter: Thomas Wozniakowski
>Priority: Critical
>
> Hi Guys,
> Encountered a strange one today. We use the CEP library in a configurable way 
> where we plug a config file into the Flink Job JAR and it programmatically 
> sets up a bunch of CEP operators matching the config file.
> I encountered a strange bug when I was testing with some artificially low 
> numbers in our testing environment today. The CEP code we're using (modified 
> slightly) is:
> {code:java}
> Pattern.begin(EVENT_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent())
> .times(config.getNumberOfUniqueEvents())
> .where(uniquenessCheckOnAlreadyMatchedEvents())
> .within(seconds(config.getWithinSeconds()));
> {code}
> When using the {{numberOfUniqueEvents: 2}}, I started seeing the following 
> error killing the job whenever a match was detected:
> {quote}
> java.lang.RuntimeException: Exception occurred while processing valve output 
> watermark: 
>   at 
> org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:265)
>   at 
> org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:189)
>   at 
> org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.inputWatermark(StatusWatermarkValve.java:111)
>   at 
> org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:184)
>   at 
> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.flink.util.FlinkRuntimeException: State eventSequence:2 
> does not exist in the NFA. NFA has states [Final State $endState$ [
> ]), Normal State eventSequence [
>   StateTransition(TAKE, from eventSequenceto $endState$, with condition),
>   StateTransition(IGNORE, from eventSequenceto eventSequence, with 
> condition),
> ]), Start State eventSequence:0 [
>   StateTransition(TAKE, from eventSequence:0to eventSequence, with 
> condition),
> ])]
>   at org.apache.flink.cep.nfa.NFA.isStartState(NFA.java:144)
>   at org.apache.flink.cep.nfa.NFA.isStateTimedOut(NFA.java:270)
>   at org.apache.flink.cep.nfa.NFA.advanceTime(NFA.java:244)
>   at 
> org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.advanceTime(AbstractKeyedCEPPatternOperator.java:389)
>   at 
> org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.onEventTime(AbstractKeyedCEPPatternOperator.java:293)
>   at 
> org.apache.flink.streaming.api.operators.InternalTimerServiceImpl.advanceWatermark(InternalTimerServiceImpl.java:251)
>   at 
> org.apache.flink.streaming.api.operators.InternalTimeServiceManager.advanceWatermark(InternalTimeServiceManager.java:128)
>   at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:746)
>   at 
> org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:262)
> {quote}
> Changing the config to {{numberOfUniqueEvents: 3}} fixed the problem. 
> Changing it back to 2 brought the problem back. It seems to be specifically 
> related to the value of 2.
> This is not a blocking issue for me because we typically use much higher 
> numbers than this in production anyway, but I figured you guys might want to 
> know about this issue.
> Let me know if you need any more information.
> Tom



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10960) CEP: Job Failure when .times(2) is used

2018-12-12 Thread Thomas Wozniakowski (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718899#comment-16718899
 ] 

Thomas Wozniakowski commented on FLINK-10960:
-

Ok, so a way to reproduce this should be:

1. Set up a job with, say {{.times(5)}}
2. Send through 4 events that satisfy the pattern
3. Stop and savepoint the job
4. Change the config to {{times.(3)}}
5. Restore the job from a savepoint?

And it should blow up because state 4 no longer exists?

I'll try and write an E2E test to reproduce this. If I understand correctly 
this should only happen when restoring a job where the {{.times(n)}} has 
*decreased*, it shouldn't have a problem when it has increased?

> CEP: Job Failure when .times(2) is used
> ---
>
> Key: FLINK-10960
> URL: https://issues.apache.org/jira/browse/FLINK-10960
> Project: Flink
>  Issue Type: Bug
>  Components: CEP
>Affects Versions: 1.6.2
>Reporter: Thomas Wozniakowski
>Priority: Critical
>
> Hi Guys,
> Encountered a strange one today. We use the CEP library in a configurable way 
> where we plug a config file into the Flink Job JAR and it programmatically 
> sets up a bunch of CEP operators matching the config file.
> I encountered a strange bug when I was testing with some artificially low 
> numbers in our testing environment today. The CEP code we're using (modified 
> slightly) is:
> {code:java}
> Pattern.begin(EVENT_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent())
> .times(config.getNumberOfUniqueEvents())
> .where(uniquenessCheckOnAlreadyMatchedEvents())
> .within(seconds(config.getWithinSeconds()));
> {code}
> When using the {{numberOfUniqueEvents: 2}}, I started seeing the following 
> error killing the job whenever a match was detected:
> {quote}
> java.lang.RuntimeException: Exception occurred while processing valve output 
> watermark: 
>   at 
> org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:265)
>   at 
> org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:189)
>   at 
> org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.inputWatermark(StatusWatermarkValve.java:111)
>   at 
> org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:184)
>   at 
> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.flink.util.FlinkRuntimeException: State eventSequence:2 
> does not exist in the NFA. NFA has states [Final State $endState$ [
> ]), Normal State eventSequence [
>   StateTransition(TAKE, from eventSequenceto $endState$, with condition),
>   StateTransition(IGNORE, from eventSequenceto eventSequence, with 
> condition),
> ]), Start State eventSequence:0 [
>   StateTransition(TAKE, from eventSequence:0to eventSequence, with 
> condition),
> ])]
>   at org.apache.flink.cep.nfa.NFA.isStartState(NFA.java:144)
>   at org.apache.flink.cep.nfa.NFA.isStateTimedOut(NFA.java:270)
>   at org.apache.flink.cep.nfa.NFA.advanceTime(NFA.java:244)
>   at 
> org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.advanceTime(AbstractKeyedCEPPatternOperator.java:389)
>   at 
> org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.onEventTime(AbstractKeyedCEPPatternOperator.java:293)
>   at 
> org.apache.flink.streaming.api.operators.InternalTimerServiceImpl.advanceWatermark(InternalTimerServiceImpl.java:251)
>   at 
> org.apache.flink.streaming.api.operators.InternalTimeServiceManager.advanceWatermark(InternalTimeServiceManager.java:128)
>   at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:746)
>   at 
> org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:262)
> {quote}
> Changing the config to {{numberOfUniqueEvents: 3}} fixed the problem. 
> Changing it back to 2 brought the problem back. It seems to be specifically 
> related to the value of 2.
> This is not a blocking issue for me because we typically use much higher 
> numbers than this in production anyway, but I figured you guys might want to 
> know about this issue.
> Let me know if you need any more information.
> Tom



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-10960) CEP: Job Failure when .times(2) is used

2018-12-12 Thread Thomas Wozniakowski (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718899#comment-16718899
 ] 

Thomas Wozniakowski edited comment on FLINK-10960 at 12/12/18 12:11 PM:


Ok, so a way to reproduce this should be:

1. Set up a job with, say {{.times(5)}}
2. Send through 4 events that satisfy the pattern
3. Stop and savepoint the job
4. Change the config to {{.times(3)}}
5. Restore the job from a savepoint?

And it should blow up because state 4 no longer exists?

I'll try and write an E2E test to reproduce this. If I understand correctly 
this should only happen when restoring a job where the {{.times(n)}} has 
*decreased*, it shouldn't have a problem when it has increased?


was (Author: jamalarm):
Ok, so a way to reproduce this should be:

1. Set up a job with, say {{.times(5)}}
2. Send through 4 events that satisfy the pattern
3. Stop and savepoint the job
4. Change the config to {{times.(3)}}
5. Restore the job from a savepoint?

And it should blow up because state 4 no longer exists?

I'll try and write an E2E test to reproduce this. If I understand correctly 
this should only happen when restoring a job where the {{.times(n)}} has 
*decreased*, it shouldn't have a problem when it has increased?

> CEP: Job Failure when .times(2) is used
> ---
>
> Key: FLINK-10960
> URL: https://issues.apache.org/jira/browse/FLINK-10960
> Project: Flink
>  Issue Type: Bug
>  Components: CEP
>Affects Versions: 1.6.2
>Reporter: Thomas Wozniakowski
>Priority: Critical
>
> Hi Guys,
> Encountered a strange one today. We use the CEP library in a configurable way 
> where we plug a config file into the Flink Job JAR and it programmatically 
> sets up a bunch of CEP operators matching the config file.
> I encountered a strange bug when I was testing with some artificially low 
> numbers in our testing environment today. The CEP code we're using (modified 
> slightly) is:
> {code:java}
> Pattern.begin(EVENT_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent())
> .times(config.getNumberOfUniqueEvents())
> .where(uniquenessCheckOnAlreadyMatchedEvents())
> .within(seconds(config.getWithinSeconds()));
> {code}
> When using the {{numberOfUniqueEvents: 2}}, I started seeing the following 
> error killing the job whenever a match was detected:
> {quote}
> java.lang.RuntimeException: Exception occurred while processing valve output 
> watermark: 
>   at 
> org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:265)
>   at 
> org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:189)
>   at 
> org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.inputWatermark(StatusWatermarkValve.java:111)
>   at 
> org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:184)
>   at 
> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.flink.util.FlinkRuntimeException: State eventSequence:2 
> does not exist in the NFA. NFA has states [Final State $endState$ [
> ]), Normal State eventSequence [
>   StateTransition(TAKE, from eventSequenceto $endState$, with condition),
>   StateTransition(IGNORE, from eventSequenceto eventSequence, with 
> condition),
> ]), Start State eventSequence:0 [
>   StateTransition(TAKE, from eventSequence:0to eventSequence, with 
> condition),
> ])]
>   at org.apache.flink.cep.nfa.NFA.isStartState(NFA.java:144)
>   at org.apache.flink.cep.nfa.NFA.isStateTimedOut(NFA.java:270)
>   at org.apache.flink.cep.nfa.NFA.advanceTime(NFA.java:244)
>   at 
> org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.advanceTime(AbstractKeyedCEPPatternOperator.java:389)
>   at 
> org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.onEventTime(AbstractKeyedCEPPatternOperator.java:293)
>   at 
> org.apache.flink.streaming.api.operators.InternalTimerServiceImpl.advanceWatermark(InternalTimerServiceImpl.java:251)
>   at 
> org.apache.flink.streaming.api.operators.InternalTimeServiceManager.advanceWatermark(InternalTimeServiceManager.java:128)
>   at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:746)
>   at 
> 

[jira] [Comment Edited] (FLINK-10960) CEP: Job Failure when .times(2) is used

2018-12-12 Thread Thomas Wozniakowski (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718899#comment-16718899
 ] 

Thomas Wozniakowski edited comment on FLINK-10960 at 12/12/18 12:11 PM:


Ok, so a way to reproduce this should be:

1. Set up a job with, say {{.times(5)}}
2. Send through 4 events that satisfy the pattern
3. Stop and savepoint the job
4. Change the config to {{.times(3)}}
5. Restore the job from a savepoint?

And it should blow up because state 4 no longer exists?

I'll try and write an E2E test to reproduce this. If I understand correctly 
this should only happen when restoring a job where the {{.times( ... )}} has 
*decreased*, it shouldn't have a problem when it has increased?


was (Author: jamalarm):
Ok, so a way to reproduce this should be:

1. Set up a job with, say {{.times(5)}}
2. Send through 4 events that satisfy the pattern
3. Stop and savepoint the job
4. Change the config to {{.times(3)}}
5. Restore the job from a savepoint?

And it should blow up because state 4 no longer exists?

I'll try and write an E2E test to reproduce this. If I understand correctly 
this should only happen when restoring a job where the {{.times(n)}} has 
*decreased*, it shouldn't have a problem when it has increased?

> CEP: Job Failure when .times(2) is used
> ---
>
> Key: FLINK-10960
> URL: https://issues.apache.org/jira/browse/FLINK-10960
> Project: Flink
>  Issue Type: Bug
>  Components: CEP
>Affects Versions: 1.6.2
>Reporter: Thomas Wozniakowski
>Priority: Critical
>
> Hi Guys,
> Encountered a strange one today. We use the CEP library in a configurable way 
> where we plug a config file into the Flink Job JAR and it programmatically 
> sets up a bunch of CEP operators matching the config file.
> I encountered a strange bug when I was testing with some artificially low 
> numbers in our testing environment today. The CEP code we're using (modified 
> slightly) is:
> {code:java}
> Pattern.begin(EVENT_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent())
> .times(config.getNumberOfUniqueEvents())
> .where(uniquenessCheckOnAlreadyMatchedEvents())
> .within(seconds(config.getWithinSeconds()));
> {code}
> When using the {{numberOfUniqueEvents: 2}}, I started seeing the following 
> error killing the job whenever a match was detected:
> {quote}
> java.lang.RuntimeException: Exception occurred while processing valve output 
> watermark: 
>   at 
> org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:265)
>   at 
> org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:189)
>   at 
> org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.inputWatermark(StatusWatermarkValve.java:111)
>   at 
> org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:184)
>   at 
> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.flink.util.FlinkRuntimeException: State eventSequence:2 
> does not exist in the NFA. NFA has states [Final State $endState$ [
> ]), Normal State eventSequence [
>   StateTransition(TAKE, from eventSequenceto $endState$, with condition),
>   StateTransition(IGNORE, from eventSequenceto eventSequence, with 
> condition),
> ]), Start State eventSequence:0 [
>   StateTransition(TAKE, from eventSequence:0to eventSequence, with 
> condition),
> ])]
>   at org.apache.flink.cep.nfa.NFA.isStartState(NFA.java:144)
>   at org.apache.flink.cep.nfa.NFA.isStateTimedOut(NFA.java:270)
>   at org.apache.flink.cep.nfa.NFA.advanceTime(NFA.java:244)
>   at 
> org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.advanceTime(AbstractKeyedCEPPatternOperator.java:389)
>   at 
> org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.onEventTime(AbstractKeyedCEPPatternOperator.java:293)
>   at 
> org.apache.flink.streaming.api.operators.InternalTimerServiceImpl.advanceWatermark(InternalTimerServiceImpl.java:251)
>   at 
> org.apache.flink.streaming.api.operators.InternalTimeServiceManager.advanceWatermark(InternalTimeServiceManager.java:128)
>   at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:746)
>   at 
> 

[jira] [Commented] (FLINK-10960) CEP: Job Failure when .times(2) is used

2018-12-12 Thread Thomas Wozniakowski (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718816#comment-16718816
 ] 

Thomas Wozniakowski commented on FLINK-10960:
-

Hi [~dawidwys],

I'm hoping you might be able to point me in the right direction, we just had 
this same error in production with different parameters. Below is the error I 
got:


{code:text}
java.lang.RuntimeException: Exception occurred while processing valve output 
watermark:
at 
org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:265)
at 
org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:189)
at 
org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.inputStreamStatus(StatusWatermarkValve.java:152)
at 
org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:188)
at 
org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flink.util.FlinkRuntimeException: State 
purchaseSequence:4 does not exist in the NFA. NFA has states [Normal State 
purchaseSequence [
StateTransition(TAKE, from purchaseSequenceto $endState$, with 
condition),
StateTransition(IGNORE, from purchaseSequenceto purchaseSequence, with 
condition),
]), Final State $endState$ [
]), Normal State purchaseSequence:2 [
StateTransition(TAKE, from purchaseSequence:2to purchaseSequence:1, 
with condition),
StateTransition(IGNORE, from purchaseSequence:2to purchaseSequence:2, 
with condition),
]), Start State purchaseSequence:3 [
StateTransition(TAKE, from purchaseSequence:3to purchaseSequence:2, 
with condition),
]), Normal State purchaseSequence:0 [
StateTransition(TAKE, from purchaseSequence:0to purchaseSequence, with 
condition),
StateTransition(IGNORE, from purchaseSequence:0to purchaseSequence:0, 
with condition),
]), Normal State purchaseSequence:1 [
StateTransition(TAKE, from purchaseSequence:1to purchaseSequence:0, 
with condition),
StateTransition(IGNORE, from purchaseSequence:1to purchaseSequence:1, 
with condition),
])]
at org.apache.flink.cep.nfa.NFA.isStartState(NFA.java:144)
at org.apache.flink.cep.nfa.NFA.isStateTimedOut(NFA.java:270)
at org.apache.flink.cep.nfa.NFA.advanceTime(NFA.java:244)
at 
org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.advanceTime(AbstractKeyedCEPPatternOperator.java:389)
at 
org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.onEventTime(AbstractKeyedCEPPatternOperator.java:293)
at 
org.apache.flink.streaming.api.operators.InternalTimerServiceImpl.advanceWatermark(InternalTimerServiceImpl.java:251)
at 
org.apache.flink.streaming.api.operators.InternalTimeServiceManager.advanceWatermark(InternalTimeServiceManager.java:128)
at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:746)
at 
org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:262)
{code}

I think this is something to do with savepoint restores. In this case we were 
making a config change that stopped the job with a savepoint, then started it 
again with slightly different parameters. One of these changed {{.times(8)}} to 
{{.times(5)}} on one of our CEP operators.

Our automated build process has E2E tests for this exact case, so I don't think 
it's a limitation in Flink. I'm a bit at a loss here to work out what the 
problem is. We got the job running by deleting the savepoint and starting the 
job from scratch.

What does that stacktrace suggest to you? I'm running lots of local tests with 
variants of 

1. Stopping job with savepoint
2. Changing config (driving changes in CEP.Pattern()) operators
3. Restarting the job from the savepoint

but I haven't managed to recreate the error locally...

> CEP: Job Failure when .times(2) is used
> ---
>
> Key: FLINK-10960
> URL: https://issues.apache.org/jira/browse/FLINK-10960
> Project: Flink
>  Issue Type: Bug
>  Components: CEP
>Affects Versions: 1.6.2
>Reporter: Thomas Wozniakowski
>Priority: Major
>
> Hi Guys,
> Encountered a strange one today. We use the CEP library in a configurable way 
> where we plug a config file into the Flink Job JAR and it programmatically 
> sets up a bunch of CEP 

[jira] [Comment Edited] (FLINK-10960) CEP: Job Failure when .times(2) is used

2018-12-12 Thread Thomas Wozniakowski (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718816#comment-16718816
 ] 

Thomas Wozniakowski edited comment on FLINK-10960 at 12/12/18 11:23 AM:


Hi [~dawidwys],

I'm hoping you might be able to point me in the right direction, we just had 
this same error in production with different parameters. Below is the error I 
got:


{code:java}
java.lang.RuntimeException: Exception occurred while processing valve output 
watermark:
at 
org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:265)
at 
org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:189)
at 
org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.inputStreamStatus(StatusWatermarkValve.java:152)
at 
org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:188)
at 
org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flink.util.FlinkRuntimeException: State 
purchaseSequence:4 does not exist in the NFA. NFA has states [Normal State 
purchaseSequence [
StateTransition(TAKE, from purchaseSequenceto $endState$, with 
condition),
StateTransition(IGNORE, from purchaseSequenceto purchaseSequence, with 
condition),
]), Final State $endState$ [
]), Normal State purchaseSequence:2 [
StateTransition(TAKE, from purchaseSequence:2to purchaseSequence:1, 
with condition),
StateTransition(IGNORE, from purchaseSequence:2to purchaseSequence:2, 
with condition),
]), Start State purchaseSequence:3 [
StateTransition(TAKE, from purchaseSequence:3to purchaseSequence:2, 
with condition),
]), Normal State purchaseSequence:0 [
StateTransition(TAKE, from purchaseSequence:0to purchaseSequence, with 
condition),
StateTransition(IGNORE, from purchaseSequence:0to purchaseSequence:0, 
with condition),
]), Normal State purchaseSequence:1 [
StateTransition(TAKE, from purchaseSequence:1to purchaseSequence:0, 
with condition),
StateTransition(IGNORE, from purchaseSequence:1to purchaseSequence:1, 
with condition),
])]
at org.apache.flink.cep.nfa.NFA.isStartState(NFA.java:144)
at org.apache.flink.cep.nfa.NFA.isStateTimedOut(NFA.java:270)
at org.apache.flink.cep.nfa.NFA.advanceTime(NFA.java:244)
at 
org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.advanceTime(AbstractKeyedCEPPatternOperator.java:389)
at 
org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.onEventTime(AbstractKeyedCEPPatternOperator.java:293)
at 
org.apache.flink.streaming.api.operators.InternalTimerServiceImpl.advanceWatermark(InternalTimerServiceImpl.java:251)
at 
org.apache.flink.streaming.api.operators.InternalTimeServiceManager.advanceWatermark(InternalTimeServiceManager.java:128)
at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:746)
at 
org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:262)
{code}

I think this is something to do with savepoint restores. In this case we were 
making a config change that stopped the job with a savepoint, then started it 
again with slightly different parameters. One of these changed {{.times(8)}} to 
{{.times(5)}} on one of our CEP operators.

Our automated build process has E2E tests for this exact case, so I don't think 
it's a limitation in Flink. I'm a bit at a loss here to work out what the 
problem is. We got the job running by deleting the savepoint and starting the 
job from scratch.

What does that stacktrace suggest to you? I'm running lots of local tests with 
variants of 

1. Stopping job with savepoint
2. Changing config (driving changes in CEP.Pattern()) operators
3. Restarting the job from the savepoint

but I haven't managed to recreate the error locally...


was (Author: jamalarm):
Hi [~dawidwys],

I'm hoping you might be able to point me in the right direction, we just had 
this same error in production with different parameters. Below is the error I 
got:


{code:text}
java.lang.RuntimeException: Exception occurred while processing valve output 
watermark:
at 
org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:265)
at 

[jira] [Updated] (FLINK-10960) CEP: Job Failure when .times(2) is used

2018-11-21 Thread Thomas Wozniakowski (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-10960:

Description: 
Hi Guys,

Encountered a strange one today. We use the CEP library in a configurable way 
where we plug a config file into the Flink Job JAR and it programmatically sets 
up a bunch of CEP operators matching the config file.

I encountered a strange bug when I was testing with some artificially low 
numbers in our testing environment today. The CEP code we're using (modified 
slightly) is:

{code:java}
Pattern.begin(EVENT_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent())
.times(config.getNumberOfUniqueEvents())
.where(uniquenessCheckOnAlreadyMatchedEvents())
.within(seconds(config.getWithinSeconds()));
{code}

When using the {{ numberOfUniqueEvents: 2 }}, I started seeing the following 
error killing the job whenever a match was detected:

{quote}
java.lang.RuntimeException: Exception occurred while processing valve output 
watermark: 
at 
org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:265)
at 
org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:189)
at 
org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.inputWatermark(StatusWatermarkValve.java:111)
at 
org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:184)
at 
org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flink.util.FlinkRuntimeException: State eventSequence:2 
does not exist in the NFA. NFA has states [Final State $endState$ [
]), Normal State eventSequence [
StateTransition(TAKE, from eventSequenceto $endState$, with condition),
StateTransition(IGNORE, from eventSequenceto purchaseSequence, with 
condition),
]), Start State purchaseSequence:0 [
StateTransition(TAKE, from eventSequence:0to purchaseSequence, with 
condition),
])]
at org.apache.flink.cep.nfa.NFA.isStartState(NFA.java:144)
at org.apache.flink.cep.nfa.NFA.isStateTimedOut(NFA.java:270)
at org.apache.flink.cep.nfa.NFA.advanceTime(NFA.java:244)
at 
org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.advanceTime(AbstractKeyedCEPPatternOperator.java:389)
at 
org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.onEventTime(AbstractKeyedCEPPatternOperator.java:293)
at 
org.apache.flink.streaming.api.operators.InternalTimerServiceImpl.advanceWatermark(InternalTimerServiceImpl.java:251)
at 
org.apache.flink.streaming.api.operators.InternalTimeServiceManager.advanceWatermark(InternalTimeServiceManager.java:128)
at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:746)
at 
org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:262)
{quote}

Changing the config to {{numberOfUniqueEvents: 3}} fixed the problem. Changing 
it back to 2 brought the problem back. It seems to be specifically related to 
the value of 2.

This is not a blocking issue for me because we typically use much higher 
numbers than this in production anyway, but I figured you guys might want to 
know about this issue.

Let me know if you need any more information.

Tom

  was:
Hi Guys,

Encountered a strange one today. We use the CEP library in a configurable way 
where we plug a config file into the Flink Job JAR and it programmatically sets 
up a bunch of CEP operators matching the config file.

I encountered a strange bug when I was testing with some artificially low 
numbers in our testing environment today. The CEP code we're using (modified 
slightly) is:

{{
Pattern.begin(EVENT_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent())
.times(config.getNumberOfUniqueEvents())
.where(uniquenessCheckOnAlreadyMatchedEvents())
.within(seconds(config.getWithinSeconds()));
}}

When using the {{ numberOfUniqueEvents: 2 }}, I started seeing the following 
error killing the job whenever a match was detected:

{quote}
java.lang.RuntimeException: Exception occurred while processing valve output 
watermark: 
at 
org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:265)
at 

[jira] [Updated] (FLINK-10960) CEP: Job Failure when .times(2) is used

2018-11-21 Thread Thomas Wozniakowski (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-10960:

Description: 
Hi Guys,

Encountered a strange one today. We use the CEP library in a configurable way 
where we plug a config file into the Flink Job JAR and it programmatically sets 
up a bunch of CEP operators matching the config file.

I encountered a strange bug when I was testing with some artificially low 
numbers in our testing environment today. The CEP code we're using (modified 
slightly) is:

{code:java}
Pattern.begin(EVENT_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent())
    .times(config.getNumberOfUniqueEvents())
    .where(uniquenessCheckOnAlreadyMatchedEvents())
    .within(seconds(config.getWithinSeconds()));
{code}

When using the {{ numberOfUniqueEvents: 2 }}, I started seeing the following 
error killing the job whenever a match was detected:

{quote}
java.lang.RuntimeException: Exception occurred while processing valve output 
watermark: 
at 
org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:265)
at 
org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:189)
at 
org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.inputWatermark(StatusWatermarkValve.java:111)
at 
org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:184)
at 
org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flink.util.FlinkRuntimeException: State eventSequence:2 
does not exist in the NFA. NFA has states [Final State $endState$ [
]), Normal State eventSequence [
StateTransition(TAKE, from eventSequenceto $endState$, with condition),
StateTransition(IGNORE, from eventSequenceto purchaseSequence, with 
condition),
]), Start State purchaseSequence:0 [
StateTransition(TAKE, from eventSequence:0to purchaseSequence, with 
condition),
])]
at org.apache.flink.cep.nfa.NFA.isStartState(NFA.java:144)
at org.apache.flink.cep.nfa.NFA.isStateTimedOut(NFA.java:270)
at org.apache.flink.cep.nfa.NFA.advanceTime(NFA.java:244)
at 
org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.advanceTime(AbstractKeyedCEPPatternOperator.java:389)
at 
org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.onEventTime(AbstractKeyedCEPPatternOperator.java:293)
at 
org.apache.flink.streaming.api.operators.InternalTimerServiceImpl.advanceWatermark(InternalTimerServiceImpl.java:251)
at 
org.apache.flink.streaming.api.operators.InternalTimeServiceManager.advanceWatermark(InternalTimeServiceManager.java:128)
at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:746)
at 
org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:262)
{quote}

Changing the config to {{numberOfUniqueEvents: 3}} fixed the problem. Changing 
it back to 2 brought the problem back. It seems to be specifically related to 
the value of 2.
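
For completeness, here is a minimal self-contained sketch of the shape of pattern that triggers this. The {{SimpleEvent}} type and the simplified uniqueness condition are hypothetical stand-ins for our config-driven classes, and the imports follow the current CEP package layout, so they may need adjusting for 1.6.x:

{code:java}
import java.util.List;
import java.util.Map;

import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternSelectFunction;
import org.apache.flink.cep.nfa.aftermatch.AfterMatchSkipStrategy;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.IterativeCondition;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class TimesTwoRepro {

    /** Hypothetical event type standing in for the real domain events. */
    public static class SimpleEvent {
        public String id;
        public SimpleEvent() {}
        public SimpleEvent(String id) { this.id = id; }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<SimpleEvent> events = env.fromElements(
                new SimpleEvent("a"), new SimpleEvent("a"), new SimpleEvent("b"));

        // Same shape as the config-driven pattern above: skip past last event,
        // times(2), a uniqueness check against already-matched events, within().
        Pattern<SimpleEvent, SimpleEvent> pattern = Pattern
                .<SimpleEvent>begin("eventSequence", AfterMatchSkipStrategy.skipPastLastEvent())
                .times(2)
                .where(new IterativeCondition<SimpleEvent>() {
                    @Override
                    public boolean filter(SimpleEvent value, Context<SimpleEvent> ctx) throws Exception {
                        // Reject the event if one with the same id was already matched.
                        for (SimpleEvent matched : ctx.getEventsForPattern("eventSequence")) {
                            if (matched.id.equals(value.id)) {
                                return false;
                            }
                        }
                        return true;
                    }
                })
                .within(Time.seconds(30));

        CEP.pattern(events, pattern)
                .select(new PatternSelectFunction<SimpleEvent, String>() {
                    @Override
                    public String select(Map<String, List<SimpleEvent>> match) {
                        return "match: " + match.get("eventSequence");
                    }
                })
                .print();

        env.execute("times(2) repro sketch");
    }
}
{code}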

This is not a blocking issue for me because we typically use much higher 
numbers than this in production anyway, but I figured you guys might want to 
know about this issue.

Let me know if you need any more information.

Tom

  was:
Hi Guys,

Encountered a strange one today. We use the CEP library in a configurable way 
where we plug a config file into the Flink Job JAR and it programmatically sets 
up a bunch of CEP operators matching the config file.

I encountered a strange bug when I was testing with some artificially low 
numbers in our testing environment today. The CEP code we're using (modified 
slightly) is:

{code:java}
Pattern.begin(EVENT_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent())
    .times(config.getNumberOfUniqueEvents())
    .where(uniquenessCheckOnAlreadyMatchedEvents())
    .within(seconds(config.getWithinSeconds()));
{code}

When using the {{ numberOfUniqueEvents: 2 }}, I started seeing the following 
error killing the job whenever a match was detected:

{quote}
java.lang.RuntimeException: Exception occurred while processing valve output 
watermark: 
at 
org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:265)
at 

[jira] [Created] (FLINK-10960) CEP: Job Failure when .times(2) is used

2018-11-21 Thread Thomas Wozniakowski (JIRA)
Thomas Wozniakowski created FLINK-10960:
---

 Summary: CEP: Job Failure when .times(2) is used
 Key: FLINK-10960
 URL: https://issues.apache.org/jira/browse/FLINK-10960
 Project: Flink
  Issue Type: Bug
  Components: CEP
Affects Versions: 1.6.2
Reporter: Thomas Wozniakowski


Hi Guys,

Encountered a strange one today. We use the CEP library in a configurable way 
where we plug a config file into the Flink Job JAR and it programmatically sets 
up a bunch of CEP operators matching the config file.

I encountered a strange bug when I was testing with some artificially low 
numbers in our testing environment today. The CEP code we're using (modified 
slightly) is:

{code:java}
Pattern.begin(EVENT_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent())
    .times(config.getNumberOfUniqueEvents())
    .where(uniquenessCheckOnAlreadyMatchedEvents())
    .within(seconds(config.getWithinSeconds()));
{code}

When using the {{ numberOfUniqueEvents: 2 }}, I started seeing the following 
error killing the job whenever a match was detected:

{quote}
java.lang.RuntimeException: Exception occurred while processing valve output 
watermark: 
at 
org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:265)
at 
org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:189)
at 
org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.inputWatermark(StatusWatermarkValve.java:111)
at 
org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:184)
at 
org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flink.util.FlinkRuntimeException: State eventSequence:2 
does not exist in the NFA. NFA has states [Final State $endState$ [
]), Normal State eventSequence [
StateTransition(TAKE, from eventSequenceto $endState$, with condition),
StateTransition(IGNORE, from eventSequenceto purchaseSequence, with 
condition),
]), Start State purchaseSequence:0 [
StateTransition(TAKE, from eventSequence:0to purchaseSequence, with 
condition),
])]
at org.apache.flink.cep.nfa.NFA.isStartState(NFA.java:144)
at org.apache.flink.cep.nfa.NFA.isStateTimedOut(NFA.java:270)
at org.apache.flink.cep.nfa.NFA.advanceTime(NFA.java:244)
at 
org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.advanceTime(AbstractKeyedCEPPatternOperator.java:389)
at 
org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.onEventTime(AbstractKeyedCEPPatternOperator.java:293)
at 
org.apache.flink.streaming.api.operators.InternalTimerServiceImpl.advanceWatermark(InternalTimerServiceImpl.java:251)
at 
org.apache.flink.streaming.api.operators.InternalTimeServiceManager.advanceWatermark(InternalTimeServiceManager.java:128)
at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:746)
at 
org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:262)
{quote}

Changing the config to {{ numberOfUniqueEvents: 3 }} fixed the problem. 
Changing it back to 2 brought the problem back. It seems to be specifically 
related to the value of 2.

This is not a blocking issue for me because we typically use much higher 
numbers than this in production anyway, but I figured you guys might want to 
know about this issue.

Let me know if you need any more information.

Tom



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10570) State grows unbounded when "within" constraint not applied

2018-11-04 Thread Thomas Wozniakowski (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674424#comment-16674424
 ] 

Thomas Wozniakowski commented on FLINK-10570:
-

Hi [~dawidwys], I notice this is scheduled for inclusion in 1.6.3, but it looks 
like the PR was merged several days before 1.6.2 was released. Just wanted to 
check if this fix might have snuck into 1.6.2 before it went out?

Otherwise we're good to wait on 1.6.3, but it would be super handy if the fix 
was available now :)

> State grows unbounded when "within" constraint not applied
> --
>
> Key: FLINK-10570
> URL: https://issues.apache.org/jira/browse/FLINK-10570
> Project: Flink
>  Issue Type: Bug
>  Components: CEP
>Affects Versions: 1.6.1
>Reporter: Thomas Wozniakowski
>Assignee: Dawid Wysakowicz
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.3, 1.7.0
>
>
> We have been running some failure monitoring using the CEP library. Simple 
> stuff that should probably have been implemented with a window, rather than 
> CEP, but we had already set the project up to use CEP elsewhere and it was 
> trivial to add this.
> We ran the following pattern (on 1.4.2):
> {code:java}
> begin(PURCHASE_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent())
> .subtype(PurchaseEvent.class)
> .times(100)
> {code}
> and then flat selected the responses if the failure ratio was over a certain 
> threshold.
> With 1.6.1, the state size of the CEP operator for this pattern grows 
> unbounded, and eventually destroys the job with an OOM exception. We have 
> many CEP operators in this job but all the rest use a "within" call.
> In 1.4.2, it seems events would be discarded once they were no longer in the 
> 100 most recent, now it seems they are held onto indefinitely. 
> We have a workaround (we're just going to add a "within" call to force the 
> CEP operator to discard old events), but it would be useful if we could have 
> the old behaviour back.
> Please let me know if I can provide any more information.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10570) State grows unbounded when "within" constraint not applied

2018-10-16 Thread Thomas Wozniakowski (JIRA)
Thomas Wozniakowski created FLINK-10570:
---

 Summary: State grows unbounded when "within" constraint not applied
 Key: FLINK-10570
 URL: https://issues.apache.org/jira/browse/FLINK-10570
 Project: Flink
  Issue Type: Bug
  Components: CEP
Affects Versions: 1.6.1
Reporter: Thomas Wozniakowski


We have been running some failure monitoring using the CEP library. Simple 
stuff that should probably have been implemented with a window, rather than 
CEP, but we had already set the project up to use CEP elsewhere and it was 
trivial to add this.

We ran the following pattern (on 1.4.2):

{code:java}
begin(PURCHASE_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent())
.subtype(PurchaseEvent.class)
.times(100)
{code}

and then flat selected the responses if the failure ratio was over a certain 
threshold.

With 1.6.1, the state size of the CEP operator for this pattern grows 
unbounded, and eventually destroys the job with an OOM exception. We have many 
CEP operators in this job but all the rest use a "within" call.

In 1.4.2, it seems events would be discarded once they were no longer in the 
100 most recent, now it seems they are held onto indefinitely. 

We have a workaround (we're just going to add a "within" call to force the CEP 
operator to discard old events), but it would be useful if we could have the 
old behaviour back.

Please let me know if I can provide any more information.
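
For reference, the workaround is just the original pattern with an explicit {{within}} bound so the operator can prune old partial matches. A hedged sketch (the ten-minute value is purely illustrative, and {{Event}}, {{PurchaseEvent}} and {{PURCHASE_SEQUENCE}} are our own types/constants):

{code:java}
Pattern.<Event>begin(PURCHASE_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent())
    .subtype(PurchaseEvent.class)
    .times(100)
    .within(Time.minutes(10)); // illustrative bound; forces old partial matches to be discarded
{code}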



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader (ZK 3.5.3-beta only)

2018-10-04 Thread Thomas Wozniakowski (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16638747#comment-16638747
 ] 

Thomas Wozniakowski commented on FLINK-10475:
-

Sure - I can update the docs. I'll say that it's recommended to use *3.5.4-beta* or *3.4.13*. Sound reasonable?

> Standalone HA - Leader election is not triggered on loss of leader (ZK 
> 3.5.3-beta only)
> ---
>
> Key: FLINK-10475
> URL: https://issues.apache.org/jira/browse/FLINK-10475
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.5.4
>Reporter: Thomas Wozniakowski
>Priority: Minor
> Attachments: t1.log, t2.log, t3.log
>
>
> Hey Guys,
> Just testing the new bugfix release of 1.5.4 (edit: also happens with 1.6.1). 
> Happy to see that the issue of jobgraphs hanging around forever has been 
> resolved in standalone/zookeeper HA mode, but now I'm seeing a different 
> issue.
> It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
> zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
> version. I then proceeded to kill the leading jobmanager to test the failover.
> The remaining jobmanagers never triggered a leader election, and simply got 
> stuck.
> Please give me a shout if I can provide any more useful information
> EDIT
> Jobmanager logs attached below. You can see that I brought up a fresh 
> cluster, one JM was elected leader (no taskmanagers or actual jobs in this 
> case). I then let the cluster sit there for half an hour or so, before 
> killing the leader. The log files are snapshotted maybe half an hour after 
> that. You can see that a second election was never triggered.
> In case it's useful, our zookeeper quorum is running "3.5.3-beta". This setup 
> previously worked with 1.4.3. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader (ZK 3.5.3-beta only)

2018-10-02 Thread Thomas Wozniakowski (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-10475:

Summary: Standalone HA - Leader election is not triggered on loss of leader 
(ZK 3.5.3-beta only)  (was: Standalone HA - Leader election is not triggered on 
loss of leader)

> Standalone HA - Leader election is not triggered on loss of leader (ZK 
> 3.5.3-beta only)
> ---
>
> Key: FLINK-10475
> URL: https://issues.apache.org/jira/browse/FLINK-10475
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.5.4
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Attachments: t1.log, t2.log, t3.log
>
>
> Hey Guys,
> Just testing the new bugfix release of 1.5.4 (edit: also happens with 1.6.1). 
> Happy to see that the issue of jobgraphs hanging around forever has been 
> resolved in standalone/zookeeper HA mode, but now I'm seeing a different 
> issue.
> It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
> zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
> version. I then proceeded to kill the leading jobmanager to test the failover.
> The remaining jobmanagers never triggered a leader election, and simply got 
> stuck.
> Please give me a shout if I can provide any more useful information
> EDIT
> Jobmanager logs attached below. You can see that I brought up a fresh 
> cluster, one JM was elected leader (no taskmanagers or actual jobs in this 
> case). I then let the cluster sit there for half an hour or so, before 
> killing the leader. The log files are snapshotted maybe half an hour after 
> that. You can see that a second election was never triggered.
> In case it's useful, our zookeeper quorum is running "3.5.3-beta". This setup 
> previously worked with 1.4.3. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader (ZK 3.5.3-beta only)

2018-10-02 Thread Thomas Wozniakowski (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-10475:

Priority: Minor  (was: Blocker)

> Standalone HA - Leader election is not triggered on loss of leader (ZK 
> 3.5.3-beta only)
> ---
>
> Key: FLINK-10475
> URL: https://issues.apache.org/jira/browse/FLINK-10475
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.5.4
>Reporter: Thomas Wozniakowski
>Priority: Minor
> Attachments: t1.log, t2.log, t3.log
>
>
> Hey Guys,
> Just testing the new bugfix release of 1.5.4 (edit: also happens with 1.6.1). 
> Happy to see that the issue of jobgraphs hanging around forever has been 
> resolved in standalone/zookeeper HA mode, but now I'm seeing a different 
> issue.
> It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
> zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
> version. I then proceeded to kill the leading jobmanager to test the failover.
> The remaining jobmanagers never triggered a leader election, and simply got 
> stuck.
> Please give me a shout if I can provide any more useful information
> EDIT
> Jobmanager logs attached below. You can see that I brought up a fresh 
> cluster, one JM was elected leader (no taskmanagers or actual jobs in this 
> case). I then let the cluster sit there for half an hour or so, before 
> killing the leader. The log files are snapshotted maybe half an hour after 
> that. You can see that a second election was never triggered.
> In case it's useful, our zookeeper quorum is running "3.5.3-beta". This setup 
> previously worked with 1.4.3. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader

2018-10-02 Thread Thomas Wozniakowski (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16636033#comment-16636033
 ] 

Thomas Wozniakowski commented on FLINK-10475:
-

Aha - so it appears to be the version of Zookeeper. Using *3.5.3-beta* causes 
the silent no-failover, using *3.5.4-beta* works as intended.

Maybe it would be worth adding a client-side check that refuses to start when connecting to a *3.5.3-beta* quorum?
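
Something like the following is what I have in mind. This is purely a sketch: it only compares version strings, how the quorum's version is obtained is out of scope, and the class/method names are hypothetical rather than existing Flink APIs.

{code:java}
public final class ZooKeeperVersionGuard {

    private static final String KNOWN_BAD_VERSION = "3.5.3-beta";

    /** Throws if the reported ZooKeeper server version is the known-broken beta. */
    public static void checkQuorumVersion(String reportedVersion) {
        if (reportedVersion != null && reportedVersion.startsWith(KNOWN_BAD_VERSION)) {
            throw new IllegalStateException(
                    "ZooKeeper " + KNOWN_BAD_VERSION + " silently breaks leader election; "
                            + "use 3.5.4-beta or 3.4.13 instead.");
        }
    }
}
{code}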

> Standalone HA - Leader election is not triggered on loss of leader
> --
>
> Key: FLINK-10475
> URL: https://issues.apache.org/jira/browse/FLINK-10475
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.5.4
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Attachments: t1.log, t2.log, t3.log
>
>
> Hey Guys,
> Just testing the new bugfix release of 1.5.4 (edit: also happens with 1.6.1). 
> Happy to see that the issue of jobgraphs hanging around forever has been 
> resolved in standalone/zookeeper HA mode, but now I'm seeing a different 
> issue.
> It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
> zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
> version. I then proceeded to kill the leading jobmanager to test the failover.
> The remaining jobmanagers never triggered a leader election, and simply got 
> stuck.
> Please give me a shout if I can provide any more useful information
> EDIT
> Jobmanager logs attached below. You can see that I brought up a fresh 
> cluster, one JM was elected leader (no taskmanagers or actual jobs in this 
> case). I then let the cluster sit there for half an hour or so, before 
> killing the leader. The log files are snapshotted maybe half an hour after 
> that. You can see that a second election was never triggered.
> In case it's useful, our zookeeper quorum is running "3.5.3-beta". This setup 
> previously worked with 1.4.3. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader

2018-10-02 Thread Thomas Wozniakowski (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-10475:

Description: 
Hey Guys,

Just testing the new bugfix release of 1.5.4 (edit: also happens with 1.6.1). 
Happy to see that the issue of jobgraphs hanging around forever has been 
resolved in standalone/zookeeper HA mode, but now I'm seeing a different issue.

It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
version. I then proceeded to kill the leading jobmanager to test the failover.

The remaining jobmanagers never triggered a leader election, and simply got 
stuck.

Please give me a shout if I can provide any more useful information

EDIT

Jobmanager logs attached below. You can see that I brought up a fresh cluster, 
one JM was elected leader (no taskmanagers or actual jobs in this case). I then 
let the cluster sit there for half an hour or so, before killing the leader. 
The log files are snapshotted maybe half an hour after that. You can see that a 
second election was never triggered.

In case it's useful, our zookeeper quorum is running "3.5.3-beta". This setup 
previously worked with 1.4.3. 

  was:
Hey Guys,

Just testing the new bugfix release of 1.5.4 (edit: also happens with 1.6.1). 
Happy to see that the issue of jobgraphs hanging around forever has been 
resolved in standalone/zookeeper HA mode, but now I'm seeing a different issue.

It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
version. I then proceeded to kill the leading jobmanager to test the failover.

The remaining jobmanagers never triggered a leader election, and simply got 
stuck.

Please give me a shout if I can provide any more useful information

EDIT

Jobmanager logs attached below. You can see that I brought up a fresh cluster, 
one JM was elected leader (no taskmanagers or actual jobs in this case). I then 
let the cluster sit there for half an hour or so, before killing the leader. 
The log files are snapshotted maybe half an hour after that. You can see that a 
second election was never triggered.

In case it's useful, our zookeeper quorum is running "3.5.4-beta". This setup 
previously worked with 1.4.3. 


> Standalone HA - Leader election is not triggered on loss of leader
> --
>
> Key: FLINK-10475
> URL: https://issues.apache.org/jira/browse/FLINK-10475
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.5.4
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Attachments: t1.log, t2.log, t3.log
>
>
> Hey Guys,
> Just testing the new bugfix release of 1.5.4 (edit: also happens with 1.6.1). 
> Happy to see that the issue of jobgraphs hanging around forever has been 
> resolved in standalone/zookeeper HA mode, but now I'm seeing a different 
> issue.
> It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
> zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
> version. I then proceeded to kill the leading jobmanager to test the failover.
> The remaining jobmanagers never triggered a leader election, and simply got 
> stuck.
> Please give me a shout if I can provide any more useful information
> EDIT
> Jobmanager logs attached below. You can see that I brought up a fresh 
> cluster, one JM was elected leader (no taskmanagers or actual jobs in this 
> case). I then let the cluster sit there for half an hour or so, before 
> killing the leader. The log files are snapshotted maybe half an hour after 
> that. You can see that a second election was never triggered.
> In case it's useful, our zookeeper quorum is running "3.5.3-beta". This setup 
> previously worked with 1.4.3. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader

2018-10-02 Thread Thomas Wozniakowski (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-10475:

Affects Version/s: 1.6.1

> Standalone HA - Leader election is not triggered on loss of leader
> --
>
> Key: FLINK-10475
> URL: https://issues.apache.org/jira/browse/FLINK-10475
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.5.4
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Attachments: t1.log, t2.log, t3.log
>
>
> Hey Guys,
> Just testing the new bugfix release of 1.5.4. Happy to see that the issue of 
> jobgraphs hanging around forever has been resolved in standalone/zookeeper HA 
> mode, but now I'm seeing a different issue.
> It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
> zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
> version. I then proceeded to kill the leading jobmanager to test the failover.
> The remaining jobmanagers never triggered a leader election, and simply got 
> stuck.
> Please give me a shout if I can provide any more useful information
> EDIT
> Jobmanager logs attached below. You can see that I brought up a fresh 
> cluster, one JM was elected leader (no taskmanagers or actual jobs in this 
> case). I then let the cluster sit there for half an hour or so, before 
> killing the leader. The log files are snapshotted maybe half an hour after 
> that. You can see that a second election was never triggered.
> In case it's useful, our zookeeper quorum is running "3.5.4-beta". This setup 
> previously worked with 1.4.3. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader

2018-10-02 Thread Thomas Wozniakowski (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-10475:

Description: 
Hey Guys,

Just testing the new bugfix release of 1.5.4 (edit: also happens with 1.6.1). 
Happy to see that the issue of jobgraphs hanging around forever has been 
resolved in standalone/zookeeper HA mode, but now I'm seeing a different issue.

It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
version. I then proceeded to kill the leading jobmanager to test the failover.

The remaining jobmanagers never triggered a leader election, and simply got 
stuck.

Please give me a shout if I can provide any more useful information

EDIT

Jobmanager logs attached below. You can see that I brought up a fresh cluster, 
one JM was elected leader (no taskmanagers or actual jobs in this case). I then 
let the cluster sit there for half an hour or so, before killing the leader. 
The log files are snapshotted maybe half an hour after that. You can see that a 
second election was never triggered.

In case it's useful, our zookeeper quorum is running "3.5.4-beta". This setup 
previously worked with 1.4.3. 

  was:
Hey Guys,

Just testing the new bugfix release of 1.5.4. Happy to see that the issue of 
jobgraphs hanging around forever has been resolved in standalone/zookeeper HA 
mode, but now I'm seeing a different issue.

It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
version. I then proceeded to kill the leading jobmanager to test the failover.

The remaining jobmanagers never triggered a leader election, and simply got 
stuck.

Please give me a shout if I can provide any more useful information

EDIT

Jobmanager logs attached below. You can see that I brought up a fresh cluster, 
one JM was elected leader (no taskmanagers or actual jobs in this case). I then 
let the cluster sit there for half an hour or so, before killing the leader. 
The log files are snapshotted maybe half an hour after that. You can see that a 
second election was never triggered.

In case it's useful, our zookeeper quorum is running "3.5.4-beta". This setup 
previously worked with 1.4.3. 


> Standalone HA - Leader election is not triggered on loss of leader
> --
>
> Key: FLINK-10475
> URL: https://issues.apache.org/jira/browse/FLINK-10475
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.5.4
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Attachments: t1.log, t2.log, t3.log
>
>
> Hey Guys,
> Just testing the new bugfix release of 1.5.4 (edit: also happens with 1.6.1). 
> Happy to see that the issue of jobgraphs hanging around forever has been 
> resolved in standalone/zookeeper HA mode, but now I'm seeing a different 
> issue.
> It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
> zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
> version. I then proceeded to kill the leading jobmanager to test the failover.
> The remaining jobmanagers never triggered a leader election, and simply got 
> stuck.
> Please give me a shout if I can provide any more useful information
> EDIT
> Jobmanager logs attached below. You can see that I brought up a fresh 
> cluster, one JM was elected leader (no taskmanagers or actual jobs in this 
> case). I then let the cluster sit there for half an hour or so, before 
> killing the leader. The log files are snapshotted maybe half an hour after 
> that. You can see that a second election was never triggered.
> In case it's useful, our zookeeper quorum is running "3.5.4-beta". This setup 
> previously worked with 1.4.3. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader

2018-10-02 Thread Thomas Wozniakowski (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-10475:

Description: 
Hey Guys,

Just testing the new bugfix release of 1.5.4. Happy to see that the issue of 
jobgraphs hanging around forever has been resolved in standalone/zookeeper HA 
mode, but now I'm seeing a different issue.

It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
version. I then proceeded to kill the leading jobmanager to test the failover.

The remaining jobmanagers never triggered a leader election, and simply got 
stuck.

Please give me a shout if I can provide any more useful information

EDIT

Jobmanager logs attached below. You can see that I brought up a fresh cluster, 
one JM was elected leader (no taskmanagers or actual jobs in this case). I then 
let the cluster sit there for half an hour or so, before killing the leader. 
The log files are snapshotted maybe half an hour after that. You can see that a 
second election was never triggered.

In case it's useful, our zookeeper quorum is running "3.5.4-beta". This setup 
previously worked with 1.4.3. 

  was:
Hey Guys,

Just testing the new bugfix release of 1.5.4. Happy to see that the issue of 
jobgraphs hanging around forever has been resolved in standalone/zookeeper HA 
mode, but now I'm seeing a different issue.

It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
version. I then proceeded to kill the leading jobmanager to test the failover.

The remaining jobmanagers never triggered a leader election, and simply got 
stuck.
The logs of the remaining job managers were full of this:

{quote}
2018-10-01 15:35:44,558 ERROR 
org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Could not 
retrieve the redirect address.
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask 
timed out on 
[Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after 
[1 ms]. Sender[null] sent message of type 
"org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at 
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at 
org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at 
org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at 
akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
at 
scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at 
scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at 
scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at 
akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
at java.lang.Thread.run(Thread.java:745)
{quote}

Please give me a shout if I can provide any more useful information

Jobmanager logs attached below. You can see that I brought up a fresh cluster, 
one JM was elected leader (no taskmanagers or actual jobs in this case). I then 
let the cluster sit there for half an hour or so, before killing the leader. 
The log files are snapshotted maybe half an hour after that. You can see that a 
second election was never triggered.

In case it's useful, our zookeeper quorum 

[jira] [Commented] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader

2018-10-02 Thread Thomas Wozniakowski (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16635441#comment-16635441
 ] 

Thomas Wozniakowski commented on FLINK-10475:
-

[~till.rohrmann]
Thanks for the response - I've added the three JM logs in the description, 
together with a bit more info.

> Standalone HA - Leader election is not triggered on loss of leader
> --
>
> Key: FLINK-10475
> URL: https://issues.apache.org/jira/browse/FLINK-10475
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.5.4
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Attachments: t1.log, t2.log, t3.log
>
>
> Hey Guys,
> Just testing the new bugfix release of 1.5.4. Happy to see that the issue of 
> jobgraphs hanging around forever has been resolved in standalone/zookeeper HA 
> mode, but now I'm seeing a different issue.
> It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
> zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
> version. I then proceeded to kill the leading jobmanager to test the failover.
> The remaining jobmanagers never triggered a leader election, and simply got 
> stuck.
> The logs of the remaining job managers were full of this:
> {quote}
> 2018-10-01 15:35:44,558 ERROR 
> org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Could not 
> retrieve the redirect address.
> java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: 
> Ask timed out on 
> [Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after 
> [1 ms]. Sender[null] sent message of type 
> "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
>   at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>   at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>   at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
>   at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
>   at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>   at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>   at 
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
>   at akka.dispatch.OnComplete.internal(Future.scala:258)
>   at akka.dispatch.OnComplete.internal(Future.scala:256)
>   at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
>   at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>   at 
> org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
>   at 
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
>   at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>   at 
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>   at 
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
>   at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
>   at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
>   at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
>   at java.lang.Thread.run(Thread.java:745)
> {quote}
> Please give me a shout if I can provide any more useful information
> Jobmanager logs attached below. You can see that I brought up a fresh 
> cluster, one JM was elected leader (no taskmanagers or actual jobs in this 
> case). I then let the cluster sit there for half an hour or so, before 
> killing the leader. The log files are snapshotted maybe half an hour after 
> that. You can see that a second election was never triggered.
> In case it's useful, our zookeeper quorum is running "3.5.4-beta". This setup 
> previously worked with 1.4.3. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader

2018-10-02 Thread Thomas Wozniakowski (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-10475:

Description: 
Hey Guys,

Just testing the new bugfix release of 1.5.4. Happy to see that the issue of 
jobgraphs hanging around forever has been resolved in standalone/zookeeper HA 
mode, but now I'm seeing a different issue.

It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
version. I then proceeded to kill the leading jobmanager to test the failover.

The remaining jobmanagers never triggered a leader election, and simply got 
stuck.
The logs of the remaining job managers were full of this:

{quote}
2018-10-01 15:35:44,558 ERROR 
org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Could not 
retrieve the redirect address.
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask 
timed out on 
[Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after 
[1 ms]. Sender[null] sent message of type 
"org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at 
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at 
org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at 
org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at 
akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
at 
scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at 
scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at 
scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at 
akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
at java.lang.Thread.run(Thread.java:745)
{quote}

Please give me a shout if I can provide any more useful information

Jobmanager logs attached below. You can see that I brought up a fresh cluster, 
one JM was elected leader (no taskmanagers or actual jobs in this case). I then 
let the cluster sit there for half an hour or so, before killing the leader. 
The log files are snapshotted maybe half an hour after that. You can see that a 
second election was never triggered.

In case it's useful, our zookeeper quorum is running "3.5.4-beta". This setup 
previously worked with 1.4.3. 

  was:
Hey Guys,

Just testing the new bugfix release of 1.5.4. Happy to see that the issue of 
jobgraphs hanging around forever has been resolved in standalone/zookeeper HA 
mode, but now I'm seeing a different issue.

It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
version. I then proceeded to kill the leading jobmanager to test the failover.

The remaining jobmanagers never triggered a leader election, and simply got 
stuck.
The logs of the remaining job managers were full of this:

{quote}
2018-10-01 15:35:44,558 ERROR 
org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Could not 
retrieve the redirect address.
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask 
timed out on 
[Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after 
[1 ms]. Sender[null] sent message of type 

[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader

2018-10-02 Thread Thomas Wozniakowski (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-10475:

Attachment: t1.log
t2.log
t3.log

> Standalone HA - Leader election is not triggered on loss of leader
> --
>
> Key: FLINK-10475
> URL: https://issues.apache.org/jira/browse/FLINK-10475
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.5.4
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Attachments: t1.log, t2.log, t3.log
>
>
> Hey Guys,
> Just testing the new bugfix release of 1.5.4. Happy to see that the issue of 
> jobgraphs hanging around forever has been resolved in standalone/zookeeper HA 
> mode, but now I'm seeing a different issue.
> It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
> zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
> version. I then proceeded to kill the leading jobmanager to test the failover.
> The remaining jobmanagers never triggered a leader election, and simply got 
> stuck.
> The logs of the remaining job managers were full of this:
> {quote}
> 2018-10-01 15:35:44,558 ERROR 
> org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Could not 
> retrieve the redirect address.
> java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: 
> Ask timed out on 
> [Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after 
> [1 ms]. Sender[null] sent message of type 
> "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
>   at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>   at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>   at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
>   at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
>   at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>   at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>   at 
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
>   at akka.dispatch.OnComplete.internal(Future.scala:258)
>   at akka.dispatch.OnComplete.internal(Future.scala:256)
>   at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
>   at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>   at 
> org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
>   at 
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
>   at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>   at 
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>   at 
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
>   at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
>   at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
>   at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
>   at java.lang.Thread.run(Thread.java:745)
> {quote}
> Please give me a shout if I can provide any more useful information



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10184) HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel

2018-10-01 Thread Thomas Wozniakowski (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634203#comment-16634203
 ] 

Thomas Wozniakowski commented on FLINK-10184:
-

[~till.rohrmann] I've now tested the fix on 1.5.4. It seems to have fixed the 
job graph problem, but I'm encountering another blocking issue on HA failover 
(leader election not triggering at all). I've raised FLINK-10475 to track it.

> HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel
> --
>
> Key: FLINK-10184
> URL: https://issues.apache.org/jira/browse/FLINK-10184
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Thomas Wozniakowski
>Priority: Blocker
>
> We have encountered a blocking issue when upgrading our cluster to 1.5.2.
> It appears that, when jobs are cancelled manually (in our case with a 
> savepoint), the JobGraphs are NOT removed from the Zookeeper {{jobgraphs}} 
> node.
> This means that, if you start a job, cancel it, restart it, cancel it, etc. 
> You will end up with many job graphs stored in zookeeper, but none of the 
> corresponding blobs in the Flink HA directory.
> When a HA failover occurs, the newly elected leader retrieves all of those 
> old JobGraph objects from Zookeeper, then goes looking for the corresponding 
> blobs in the HA directory. The blobs are not there so the JobManager explodes 
> and the process dies.
> At this point the cluster has to be fully stopped, the zookeeper jobgraphs 
> cleared out by hand, and all the jobmanagers restarted.
> I can see the following line in the JobManager logs:
> {quote}
> 2018-08-20 16:17:20,776 INFO  
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - 
> Removed job graph 4e9a5a9d70ca99dbd394c35f8dfeda65 from ZooKeeper.
> {quote}
> But looking in Zookeeper the {{4e9a5a9d70ca99dbd394c35f8dfeda65}} job is 
> still very much there.
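
For anyone hitting this, a hedged sketch of the by-hand cleanup described above, using the plain ZooKeeper client. The znode path assumes the default HA layout ({{high-availability.zookeeper.path.root}} of {{/flink}} plus cluster id and {{/jobgraphs}}) and may differ per deployment; run it only while the cluster is fully stopped, as described.

{code:java}
import org.apache.zookeeper.ZooKeeper;

public class StaleJobGraphCleanup {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 30_000, event -> { });
        String jobGraphsPath = "/flink/default/jobgraphs"; // assumed default HA path
        for (String jobId : zk.getChildren(jobGraphsPath, false)) {
            System.out.println("Deleting stale job graph node: " + jobId);
            // Note: if the node still has child (lock) znodes, those must be removed first.
            zk.delete(jobGraphsPath + "/" + jobId, -1); // -1 = any version
        }
        zk.close();
    }
}
{code}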



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader

2018-10-01 Thread Thomas Wozniakowski (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-10475:

Description: 
Hey Guys,

Just testing the new bugfix release of 1.5.4. Happy to see that the issue of 
jobgraphs hanging around forever has been resolved in standalone/zookeeper HA 
mode, but now I'm seeing a different issue.

It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
version. I then proceeded to kill the leading jobmanager to test the failover.

The remaining jobmanagers never triggered a leader election, and simply got 
stuck.
The logs of the remaining job managers were full of this:

{quote}
2018-10-01 15:35:44,558 ERROR 
org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Could not 
retrieve the redirect address.
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask 
timed out on 
[Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after 
[1 ms]. Sender[null] sent message of type 
"org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at 
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at 
org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at 
org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at 
akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
at 
scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at 
scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at 
scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at 
akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
at java.lang.Thread.run(Thread.java:745)
{quote}

  was:
Hey Guys,

Just testing the new bugfix release of 1.5.4. Happy to see that the issue of 
jobgraphs hanging around forever has been resolved in standalone/zookeeper HA 
mode, but now I'm seeing a different issue.

It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
version. I then proceeded to kill the leading jobmanager to test the failover.

The remaining jobmanagers never triggered a leader election, and simply got 
stuck.
The logs of the remaining job managers were full of this:

```
2018-10-01 15:35:44,558 ERROR 
org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Could not 
retrieve the redirect address.
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask 
timed out on 
[Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after 
[1 ms]. Sender[null] sent message of type 
"org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at 
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at 

[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader

2018-10-01 Thread Thomas Wozniakowski (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-10475:

Description: 
Hey Guys,

Just testing the new bugfix release of 1.5.4. Happy to see that the issue of 
jobgraphs hanging around forever has been resolved in standalone/zookeeper HA 
mode, but now I'm seeing a different issue.

It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
version. I then proceeded to kill the leading jobmanager to test the failover.

The remaining jobmanagers never triggered a leader election, and simply got 
stuck.
The logs of the remaining job managers were full of this:

{quote}
2018-10-01 15:35:44,558 ERROR 
org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Could not 
retrieve the redirect address.
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask 
timed out on 
[Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after 
[1 ms]. Sender[null] sent message of type 
"org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at 
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at 
org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at 
org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at 
akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
at 
scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at 
scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at 
scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at 
akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
at java.lang.Thread.run(Thread.java:745)
{quote}

Please give me a shout if I can provide any more useful information.

  was:
Hey Guys,

Just testing the new bugfix release of 1.5.4. Happy to see that the issue of 
jobgraphs hanging around forever has been resolved in standalone/zookeeper HA 
mode, but now I'm seeing a different issue.

It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
version. I then proceeded to kill the leading jobmanager to test the failover.

The remaining jobmanagers never triggered a leader election, and simply got 
stuck.
The logs of the remaining job managers were full of this:

{quote}
2018-10-01 15:35:44,558 ERROR 
org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Could not 
retrieve the redirect address.
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask 
timed out on 
[Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after 
[1 ms]. Sender[null] sent message of type 
"org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at 
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
at 

[jira] [Created] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader

2018-10-01 Thread Thomas Wozniakowski (JIRA)
Thomas Wozniakowski created FLINK-10475:
---

 Summary: Standalone HA - Leader election is not triggered on loss 
of leader
 Key: FLINK-10475
 URL: https://issues.apache.org/jira/browse/FLINK-10475
 Project: Flink
  Issue Type: Bug
Affects Versions: 1.5.4
Reporter: Thomas Wozniakowski


Hey Guys,

Just testing the new bugfix release of 1.5.4. Happy to see that the issue of 
jobgraphs hanging around forever has been resolved in standalone/zookeeper HA 
mode, but now I'm seeing a different issue.

It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
version. I then proceeded to kill the leading jobmanager to test the failover.

The remaining jobmanagers never triggered a leader election, and simply got 
stuck.
The logs of the remaining job managers were full of this:

```
2018-10-01 15:35:44,558 ERROR 
org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Could not 
retrieve the redirect address.
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask 
timed out on 
[Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after 
[1 ms]. Sender[null] sent message of type 
"org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at 
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at 
org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at 
org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at 
akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
at 
scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at 
scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at 
scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at 
akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
at java.lang.Thread.run(Thread.java:745)
```
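
For context, the standalone ZooKeeper HA mode described above is driven by a
handful of {{flink-conf.yaml}} keys. The sketch below shows the usual shape of
such a configuration; the quorum hosts, cluster id and storage path are
placeholders rather than the reporter's actual values:

{quote}
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: cluster_one
high-availability.storageDir: s3://<bucket>/flink/recovery
{quote}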



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10184) HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel

2018-09-20 Thread Thomas Wozniakowski (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16621772#comment-16621772
 ] 

Thomas Wozniakowski commented on FLINK-10184:
-

Hey [~till.rohrmann],

It's kind of non-trivial for me to test the fixes, as our cluster is currently 
running the non-hadoop 1.4.3 build. As far as I can see the only snapshot 
builds available contain hadoop, so I didn't know if the tests would be 
representative. I was waiting on the official release binaries before spending 
time testing.

I can have a go at testing from a local maven build, but I've had significant 
trouble wrestling with maven on the Flink codebase in the past (trying to build 
locally). If you could point me at a branch (say for the 1.5 release) and let 
me know what maven command I should use to build it with no Hadoop and Scala 
2.11, then I would be very grateful. I could then use those binaries for 
testing.

Tom
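
For what it's worth, a Hadoop-free build of the 1.5 release branch should not
need a Scala flag at all, since Scala 2.11 was the default for 1.5.x; a
plausible invocation is sketched below. The {{without-hadoop}} profile name is
an assumption based on the build documentation of that era and may differ on a
given branch:

{quote}
git checkout release-1.5
mvn clean install -DskipTests -Pwithout-hadoop
{quote}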

> HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel
> --
>
> Key: FLINK-10184
> URL: https://issues.apache.org/jira/browse/FLINK-10184
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> We have encountered a blocking issue when upgrading our cluster to 1.5.2.
> It appears that, when jobs are cancelled manually (in our case with a 
> savepoint), the JobGraphs are NOT removed from the Zookeeper {{jobgraphs}} 
> node.
> This means that, if you start a job, cancel it, restart it, cancel it, etc. 
> You will end up with many job graphs stored in zookeeper, but none of the 
> corresponding blobs in the Flink HA directory.
> When a HA failover occurs, the newly elected leader retrieves all of those 
> old JobGraph objects from Zookeeper, then goes looking for the corresponding 
> blobs in the HA directory. The blobs are not there so the JobManager explodes 
> and the process dies.
> At this point the cluster has to be fully stopped, the zookeeper jobgraphs 
> cleared out by hand, and all the jobmanagers restarted.
> I can see the following line in the JobManager logs:
> {quote}
> 2018-08-20 16:17:20,776 INFO  
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - 
> Removed job graph 4e9a5a9d70ca99dbd394c35f8dfeda65 from ZooKeeper.
> {quote}
> But looking in Zookeeper the {{4e9a5a9d70ca99dbd394c35f8dfeda65}} job is 
> still very much there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10184) HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel

2018-09-10 Thread Thomas Wozniakowski (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16609299#comment-16609299
 ] 

Thomas Wozniakowski commented on FLINK-10184:
-

Hey [~till.rohrmann] - this may sound like a silly question, but I'm not 
actually sure what the best way to deploy your fix is...

Should I check out the branch and do a maven build, then deploy the cluster 
using those artifacts? Do I also need to rebuild my job jar using a patched 
version of the Flink dependency?

Apologies - we're just set up to pull binaries from the Apache servers at the 
moment, so this isn't super obvious to me...

Tom

> HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel
> --
>
> Key: FLINK-10184
> URL: https://issues.apache.org/jira/browse/FLINK-10184
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Fix For: 1.6.1, 1.7.0, 1.5.4
>
>
> We have encountered a blocking issue when upgrading our cluster to 1.5.2.
> It appears that, when jobs are cancelled manually (in our case with a 
> savepoint), the JobGraphs are NOT removed from the Zookeeper {{jobgraphs}} 
> node.
> This means that, if you start a job, cancel it, restart it, cancel it, etc. 
> You will end up with many job graphs stored in zookeeper, but none of the 
> corresponding blobs in the Flink HA directory.
> When a HA failover occurs, the newly elected leader retrieves all of those 
> old JobGraph objects from Zookeeper, then goes looking for the corresponding 
> blobs in the HA directory. The blobs are not there so the JobManager explodes 
> and the process dies.
> At this point the cluster has to be fully stopped, the zookeeper jobgraphs 
> cleared out by hand, and all the jobmanagers restarted.
> I can see the following line in the JobManager logs:
> {quote}
> 2018-08-20 16:17:20,776 INFO  
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - 
> Removed job graph 4e9a5a9d70ca99dbd394c35f8dfeda65 from ZooKeeper.
> {quote}
> But looking in Zookeeper the {{4e9a5a9d70ca99dbd394c35f8dfeda65}} job is 
> still very much there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10184) HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel

2018-09-06 Thread Thomas Wozniakowski (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16605882#comment-16605882
 ] 

Thomas Wozniakowski commented on FLINK-10184:
-

Hi [~till.rohrmann]

Happy to help out with testing the fix. I'll keep an eye on the Flink blog for 
the next release (or you can @ me if you need quicker feedback). I'll deploy it 
to our testing environments and re-run my tests. The bug was easy enough to 
reproduce in my experience.

Tom

> HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel
> --
>
> Key: FLINK-10184
> URL: https://issues.apache.org/jira/browse/FLINK-10184
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Fix For: 1.6.1, 1.7.0, 1.5.4
>
>
> We have encountered a blocking issue when upgrading our cluster to 1.5.2.
> It appears that, when jobs are cancelled manually (in our case with a 
> savepoint), the JobGraphs are NOT removed from the Zookeeper {{jobgraphs}} 
> node.
> This means that, if you start a job, cancel it, restart it, cancel it, etc. 
> You will end up with many job graphs stored in zookeeper, but none of the 
> corresponding blobs in the Flink HA directory.
> When a HA failover occurs, the newly elected leader retrieves all of those 
> old JobGraph objects from Zookeeper, then goes looking for the corresponding 
> blobs in the HA directory. The blobs are not there so the JobManager explodes 
> and the process dies.
> At this point the cluster has to be fully stopped, the zookeeper jobgraphs 
> cleared out by hand, and all the jobmanagers restarted.
> I can see the following line in the JobManager logs:
> {quote}
> 2018-08-20 16:17:20,776 INFO  
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - 
> Removed job graph 4e9a5a9d70ca99dbd394c35f8dfeda65 from ZooKeeper.
> {quote}
> But looking in Zookeeper the {{4e9a5a9d70ca99dbd394c35f8dfeda65}} job is 
> still very much there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10184) HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel

2018-09-06 Thread Thomas Wozniakowski (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16605485#comment-16605485
 ] 

Thomas Wozniakowski commented on FLINK-10184:
-

Hi [~till.rohrmann],

Yes. We are running 3/3/3 zookeeper/jobmanager/taskmanagers in standalone mode.

Please let me know if you need any more info. This issue is currently blocking 
us and I'm more than happy to assist as much as I can in fixing it.

Tom

> HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel
> --
>
> Key: FLINK-10184
> URL: https://issues.apache.org/jira/browse/FLINK-10184
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Fix For: 1.6.1, 1.7.0, 1.5.4
>
>
> We have encountered a blocking issue when upgrading our cluster to 1.5.2.
> It appears that, when jobs are cancelled manually (in our case with a 
> savepoint), the JobGraphs are NOT removed from the Zookeeper {{jobgraphs}} 
> node.
> This means that, if you start a job, cancel it, restart it, cancel it, etc. 
> You will end up with many job graphs stored in zookeeper, but none of the 
> corresponding blobs in the Flink HA directory.
> When a HA failover occurs, the newly elected leader retrieves all of those 
> old JobGraph objects from Zookeeper, then goes looking for the corresponding 
> blobs in the HA directory. The blobs are not there so the JobManager explodes 
> and the process dies.
> At this point the cluster has to be fully stopped, the zookeeper jobgraphs 
> cleared out by hand, and all the jobmanagers restarted.
> I can see the following line in the JobManager logs:
> {quote}
> 2018-08-20 16:17:20,776 INFO  
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - 
> Removed job graph 4e9a5a9d70ca99dbd394c35f8dfeda65 from ZooKeeper.
> {quote}
> But looking in Zookeeper the {{4e9a5a9d70ca99dbd394c35f8dfeda65}} job is 
> still very much there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10184) HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel

2018-08-21 Thread Thomas Wozniakowski (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587293#comment-16587293
 ] 

Thomas Wozniakowski commented on FLINK-10184:
-

I'm just combing through the Zookeeper logs to see if there's anything that 
might be helpful. I'm going to dump anything that looks a bit odd here:

{quote}
2018-08-21 10:27:05,657 [myid:160] - INFO  [ProcessThread(sid:160 
cport:-1)::PrepRequestProcessor@880] - Got user-level KeeperException when 
processing sessionid:0x75066362001c type:create cxid:0x6 zxid:0x200fe 
txntype:-1 reqpath:n/a Error 
Path:/flink/cluster_one/leaderlatch/rest_server_lock Error:KeeperErrorCode = 
NoNode for /flink/cluster_one/leaderlatch/rest_server_lock
2018-08-21 10:27:05,938 [myid:160] - INFO  [ProcessThread(sid:160 
cport:-1)::PrepRequestProcessor@880] - Got user-level KeeperException when 
processing sessionid:0x75066362001c type:create cxid:0x24 zxid:0x20104 
txntype:-1 reqpath:n/a Error 
Path:/flink/cluster_one/leaderlatch/resource_manager_lock Error:KeeperErrorCode 
= NoNode for /flink/cluster_one/leaderlatch/resource_manager_lock
2018-08-21 10:27:05,944 [myid:160] - INFO  [ProcessThread(sid:160 
cport:-1)::PrepRequestProcessor@880] - Got user-level KeeperException when 
processing sessionid:0x75066362001c type:create cxid:0x29 zxid:0x20105 
txntype:-1 reqpath:n/a Error 
Path:/flink/cluster_one/leaderlatch/dispatcher_lock Error:KeeperErrorCode = 
NoNode for /flink/cluster_one/leaderlatch/dispatcher_lock
{quote}

{quote}
2018-08-21 10:28:35,032 [myid:160] - INFO  [ProcessThread(sid:160 
cport:-1)::PrepRequestProcessor@880] - Got user-level KeeperException when 
processing sessionid:0x75066362001c type:create cxid:0xde zxid:0x20145 
txntype:-1 reqpath:n/a Error 
Path:/flink/cluster_one/checkpoints/a5d03dfb348783950c006fe8d6e73fc5/061/ada19912-8c78-4f15-b1ef-f0acc5011559
 Error:KeeperErrorCode = NodeExists for 
/flink/cluster_one/checkpoints/a5d03dfb348783950c006fe8d6e73fc5/061/ada19912-8c78-4f15-b1ef-f0acc5011559
2018-08-21 10:28:35,184 [myid:160] - INFO  [ProcessThread(sid:160 
cport:-1)::PrepRequestProcessor@880] - Got user-level KeeperException when 
processing sessionid:0x75066362001c type:create cxid:0xe2 zxid:0x20146 
txntype:-1 reqpath:n/a Error 
Path:/flink/cluster_one/leaderlatch/a5d03dfb348783950c006fe8d6e73fc5/job_manager_lock
 Error:KeeperErrorCode = NoNode for 
/flink/cluster_one/leaderlatch/a5d03dfb348783950c006fe8d6e73fc5/job_manager_lock
{quote}
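
For reference, whether a stale entry survives a cancel can also be checked
directly with {{zkCli.sh}} against the same {{/flink/cluster_one}} chroot seen
in the paths above; the job id is the one quoted in the issue description and
is purely illustrative, and removing nodes by hand should remain a last resort
with the cluster stopped:

{quote}
ls /flink/cluster_one/jobgraphs
rmr /flink/cluster_one/jobgraphs/4e9a5a9d70ca99dbd394c35f8dfeda65
{quote}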



> HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel
> --
>
> Key: FLINK-10184
> URL: https://issues.apache.org/jira/browse/FLINK-10184
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Fix For: 1.6.1, 1.7.0, 1.5.4
>
>
> We have encountered a blocking issue when upgrading our cluster to 1.5.2.
> It appears that, when jobs are cancelled manually (in our case with a 
> savepoint), the JobGraphs are NOT removed from the Zookeeper {{jobgraphs}} 
> node.
> This means that, if you start a job, cancel it, restart it, cancel it, etc. 
> You will end up with many job graphs stored in zookeeper, but none of the 
> corresponding blobs in the Flink HA directory.
> When a HA failover occurs, the newly elected leader retrieves all of those 
> old JobGraph objects from Zookeeper, then goes looking for the corresponding 
> blobs in the HA directory. The blobs are not there so the JobManager explodes 
> and the process dies.
> At this point the cluster has to be fully stopped, the zookeeper jobgraphs 
> cleared out by hand, and all the jobmanagers restarted.
> I can see the following line in the JobManager logs:
> {quote}
> 2018-08-20 16:17:20,776 INFO  
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - 
> Removed job graph 4e9a5a9d70ca99dbd394c35f8dfeda65 from ZooKeeper.
> {quote}
> But looking in Zookeeper the {{4e9a5a9d70ca99dbd394c35f8dfeda65}} job is 
> still very much there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10184) HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel

2018-08-21 Thread Thomas Wozniakowski (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587280#comment-16587280
 ] 

Thomas Wozniakowski commented on FLINK-10184:
-

We also performed a Zookeeper upgrade as part of our cluster upgrade (from 
{{3.5.3-beta}} to {{3.5.4-beta}}).

I have just rerun the tests; the bug is reproducible against both versions of 
Zookeeper, so this does not appear to be the culprit.

> HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel
> --
>
> Key: FLINK-10184
> URL: https://issues.apache.org/jira/browse/FLINK-10184
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Fix For: 1.6.1, 1.7.0, 1.5.4
>
>
> We have encountered a blocking issue when upgrading our cluster to 1.5.2.
> It appears that, when jobs are cancelled manually (in our case with a 
> savepoint), the JobGraphs are NOT removed from the Zookeeper {{jobgraphs}} 
> node.
> This means that, if you start a job, cancel it, restart it, cancel it, etc. 
> You will end up with many job graphs stored in zookeeper, but none of the 
> corresponding blobs in the Flink HA directory.
> When a HA failover occurs, the newly elected leader retrieves all of those 
> old JobGraph objects from Zookeeper, then goes looking for the corresponding 
> blobs in the HA directory. The blobs are not there so the JobManager explodes 
> and the process dies.
> At this point the cluster has to be fully stopped, the zookeeper jobgraphs 
> cleared out by hand, and all the jobmanagers restarted.
> I can see the following line in the JobManager logs:
> {quote}
> 2018-08-20 16:17:20,776 INFO  
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - 
> Removed job graph 4e9a5a9d70ca99dbd394c35f8dfeda65 from ZooKeeper.
> {quote}
> But looking in Zookeeper the {{4e9a5a9d70ca99dbd394c35f8dfeda65}} job is 
> still very much there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10184) HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel

2018-08-21 Thread Thomas Wozniakowski (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-10184:

Affects Version/s: 1.6.0

> HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel
> --
>
> Key: FLINK-10184
> URL: https://issues.apache.org/jira/browse/FLINK-10184
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Thomas Wozniakowski
>Priority: Blocker
>
> We have encountered a blocking issue when upgrading our cluster to 1.5.2.
> It appears that, when jobs are cancelled manually (in our case with a 
> savepoint), the JobGraphs are NOT removed from the Zookeeper {{jobgraphs}} 
> node.
> This means that, if you start a job, cancel it, restart it, cancel it, etc. 
> You will end up with many job graphs stored in zookeeper, but none of the 
> corresponding blobs in the Flink HA directory.
> When a HA failover occurs, the newly elected leader retrieves all of those 
> old JobGraph objects from Zookeeper, then goes looking for the corresponding 
> blobs in the HA directory. The blobs are not there so the JobManager explodes 
> and the process dies.
> At this point the cluster has to be fully stopped, the zookeeper jobgraphs 
> cleared out by hand, and all the jobmanagers restarted.
> I can see the following line in the JobManager logs:
> {quote}
> 2018-08-20 16:17:20,776 INFO  
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - 
> Removed job graph 4e9a5a9d70ca99dbd394c35f8dfeda65 from ZooKeeper.
> {quote}
> But looking in Zookeeper the {{4e9a5a9d70ca99dbd394c35f8dfeda65}} job is 
> still very much there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-10184) HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel

2018-08-21 Thread Thomas Wozniakowski (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587119#comment-16587119
 ] 

Thomas Wozniakowski edited comment on FLINK-10184 at 8/21/18 7:52 AM:
--

Hey [~wcummings],

I'm not 100% sure what is wrong, but I believe a good starting point would be 
{{org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore#removeJobGraph}}
or
{{org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore#releaseAndTryRemove)}}


was (Author: jamalarm):
Hey [~wcummings],

I'm not 100% sure what is wrong, but I believe a good starting point would be 
{{org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore#removeJobGraph}}
or
{{org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore#releaseAndTryRemove(java.lang.String,
 
org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.RemoveCallback)}}

> HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel
> --
>
> Key: FLINK-10184
> URL: https://issues.apache.org/jira/browse/FLINK-10184
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination
>Affects Versions: 1.5.2
>Reporter: Thomas Wozniakowski
>Priority: Blocker
>
> We have encountered a blocking issue when upgrading our cluster to 1.5.2.
> It appears that, when jobs are cancelled manually (in our case with a 
> savepoint), the JobGraphs are NOT removed from the Zookeeper {{jobgraphs}} 
> node.
> This means that, if you start a job, cancel it, restart it, cancel it, etc. 
> You will end up with many job graphs stored in zookeeper, but none of the 
> corresponding blobs in the Flink HA directory.
> When a HA failover occurs, the newly elected leader retrieves all of those 
> old JobGraph objects from Zookeeper, then goes looking for the corresponding 
> blobs in the HA directory. The blobs are not there so the JobManager explodes 
> and the process dies.
> At this point the cluster has to be fully stopped, the zookeeper jobgraphs 
> cleared out by hand, and all the jobmanagers restarted.
> I can see the following line in the JobManager logs:
> {quote}
> 2018-08-20 16:17:20,776 INFO  
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - 
> Removed job graph 4e9a5a9d70ca99dbd394c35f8dfeda65 from ZooKeeper.
> {quote}
> But looking in Zookeeper the {{4e9a5a9d70ca99dbd394c35f8dfeda65}} job is 
> still very much there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10184) HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel

2018-08-21 Thread Thomas Wozniakowski (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587119#comment-16587119
 ] 

Thomas Wozniakowski commented on FLINK-10184:
-

Hey [~wcummings],

I'm not 100% sure what is wrong, but I believe a good starting point would be 
{{org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore#removeJobGraph}}
or
{{org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore#releaseAndTryRemove(java.lang.String,
 
org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.RemoveCallback)}}
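
Separately from those code paths, the symptom itself (jobgraph entries
lingering after a cancel) can be confirmed programmatically. A minimal sketch
using the Curator client, assuming the quorum address shown below and the
{{/flink/cluster_one}} layout seen elsewhere in this thread:

{quote}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

import java.util.List;

public class JobGraphNodeCheck {
    public static void main(String[] args) throws Exception {
        // Quorum address is a placeholder; the path matches the chroot used in this thread.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk-1:2181,zk-2:2181,zk-3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        try {
            // List the job graph handles still registered in ZooKeeper.
            List<String> jobGraphs = client.getChildren().forPath("/flink/cluster_one/jobgraphs");
            // After a successful cancel-with-savepoint the cancelled job id should no longer appear.
            jobGraphs.forEach(System.out::println);
        } finally {
            client.close();
        }
    }
}
{quote}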

> HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel
> --
>
> Key: FLINK-10184
> URL: https://issues.apache.org/jira/browse/FLINK-10184
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination
>Affects Versions: 1.5.2
>Reporter: Thomas Wozniakowski
>Priority: Blocker
>
> We have encountered a blocking issue when upgrading our cluster to 1.5.2.
> It appears that, when jobs are cancelled manually (in our case with a 
> savepoint), the JobGraphs are NOT removed from the Zookeeper {{jobgraphs}} 
> node.
> This means that, if you start a job, cancel it, restart it, cancel it, etc. 
> You will end up with many job graphs stored in zookeeper, but none of the 
> corresponding blobs in the Flink HA directory.
> When a HA failover occurs, the newly elected leader retrieves all of those 
> old JobGraph objects from Zookeeper, then goes looking for the corresponding 
> blobs in the HA directory. The blobs are not there so the JobManager explodes 
> and the process dies.
> At this point the cluster has to be fully stopped, the zookeeper jobgraphs 
> cleared out by hand, and all the jobmanagers restarted.
> I can see the following line in the JobManager logs:
> {quote}
> 2018-08-20 16:17:20,776 INFO  
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - 
> Removed job graph 4e9a5a9d70ca99dbd394c35f8dfeda65 from ZooKeeper.
> {quote}
> But looking in Zookeeper the {{4e9a5a9d70ca99dbd394c35f8dfeda65}} job is 
> still very much there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-10184) HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel

2018-08-21 Thread Thomas Wozniakowski (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587119#comment-16587119
 ] 

Thomas Wozniakowski edited comment on FLINK-10184 at 8/21/18 7:52 AM:
--

Hey [~wcummings],

I'm not 100% sure what is wrong, but I believe a good starting point would be 
{{org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore#removeJobGraph}}
or
{{org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore#releaseAndTryRemove}}


was (Author: jamalarm):
Hey [~wcummings],

I'm not 100% sure what is wrong, but I believe a good starting point would be 
{{org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore#removeJobGraph}}
or
{{org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore#releaseAndTryRemove)}}

> HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel
> --
>
> Key: FLINK-10184
> URL: https://issues.apache.org/jira/browse/FLINK-10184
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination
>Affects Versions: 1.5.2
>Reporter: Thomas Wozniakowski
>Priority: Blocker
>
> We have encountered a blocking issue when upgrading our cluster to 1.5.2.
> It appears that, when jobs are cancelled manually (in our case with a 
> savepoint), the JobGraphs are NOT removed from the Zookeeper {{jobgraphs}} 
> node.
> This means that, if you start a job, cancel it, restart it, cancel it, etc. 
> You will end up with many job graphs stored in zookeeper, but none of the 
> corresponding blobs in the Flink HA directory.
> When a HA failover occurs, the newly elected leader retrieves all of those 
> old JobGraph objects from Zookeeper, then goes looking for the corresponding 
> blobs in the HA directory. The blobs are not there so the JobManager explodes 
> and the process dies.
> At this point the cluster has to be fully stopped, the zookeeper jobgraphs 
> cleared out by hand, and all the jobmanagers restarted.
> I can see the following line in the JobManager logs:
> {quote}
> 2018-08-20 16:17:20,776 INFO  
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - 
> Removed job graph 4e9a5a9d70ca99dbd394c35f8dfeda65 from ZooKeeper.
> {quote}
> But looking in Zookeeper the {{4e9a5a9d70ca99dbd394c35f8dfeda65}} job is 
> still very much there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10184) HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel

2018-08-20 Thread Thomas Wozniakowski (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16586188#comment-16586188
 ] 

Thomas Wozniakowski commented on FLINK-10184:
-

Hi [~elevy],

I don't believe it is the same issue (though it may be related). In that issue, 
the jobs are actually successfully recovered (and just fail due to an absence 
of task slots). In our case, the actual Job Manager immediately dies with logs 
like this:

{quote}
2018-08-20 16:29:04,535 ERROR 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error 
occurred in the cluster entrypoint.
java.lang.RuntimeException: 
org.apache.flink.runtime.client.JobExecutionException: Could not set up 
JobManager
at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
at 
org.apache.flink.util.function.ConsumerWithException.accept(ConsumerWithException.java:40)
at 
org.apache.flink.runtime.dispatcher.Dispatcher.lambda$waitForTerminatingJobManager$29(Dispatcher.java:820)
at 
java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:705)
at 
java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:687)
at 
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
at 
akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not set 
up JobManager
at 
org.apache.flink.runtime.jobmaster.JobManagerRunner.(JobManagerRunner.java:176)
at 
org.apache.flink.runtime.dispatcher.Dispatcher$DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:936)
at 
org.apache.flink.runtime.dispatcher.Dispatcher.createJobManagerRunner(Dispatcher.java:291)
at 
org.apache.flink.runtime.dispatcher.Dispatcher.runJob(Dispatcher.java:281)
at 
org.apache.flink.util.function.ConsumerWithException.accept(ConsumerWithException.java:38)
... 21 more
Caused by: java.lang.Exception: Cannot set up the user code libraries: No such 
file or directory: 
s3://ew1-integration-pattern-nsbucket-18jn-flinkbucket-1his9qugdhp03/flink/cluster_one/blob/job_4e9a5a9d70ca99dbd394c35f8dfeda65/blob_p-fa5168561c98e3005a724cb817a1ec1a0b3bd3eb-03a884a908837dc8b5a387fb502afa2f
at 
org.apache.flink.runtime.jobmaster.JobManagerRunner.(JobManagerRunner.java:134)
... 25 more
Caused by: java.io.FileNotFoundException: No such file or directory: 
s3://ew1-integration-pattern-nsbucket-18jn-flinkbucket-1his9qugdhp03/flink/cluster_one/blob/job_4e9a5a9d70ca99dbd394c35f8dfeda65/blob_p-fa5168561c98e3005a724cb817a1ec1a0b3bd3eb-03a884a908837dc8b5a387fb502afa2f
at 
org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1642)
at 
org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:521)
at 
org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.FileSystem.open(FileSystem.java:786)
at 
org.apache.flink.fs.s3hadoop.shaded.org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.open(HadoopFileSystem.java:119)
at 
org.apache.flink.fs.s3hadoop.shaded.org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.open(HadoopFileSystem.java:36)
at 
org.apache.flink.runtime.blob.FileSystemBlobStore.get(FileSystemBlobStore.java:102)
at 
org.apache.flink.runtime.blob.FileSystemBlobStore.get(FileSystemBlobStore.java:84)
at 
org.apache.flink.runtime.blob.BlobServer.getFileInternal(BlobServer.java:506)
at 
