[jira] [Updated] (FLINK-23353) UDTAGG can't execute in Batch mode

2021-07-12 Thread hayden zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-23353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hayden zhou updated FLINK-23353:

Description: 
{code:java}

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.functions.TableAggregateFunction;
import org.apache.flink.util.Collector;

import static org.apache.flink.table.api.Expressions.$;
import static org.apache.flink.table.api.Expressions.call;
import static org.apache.flink.table.api.Expressions.row;

public class Top2Test {
    public static void main(String[] args) {

        EnvironmentSettings settings = EnvironmentSettings.newInstance().inBatchMode().build();
        TableEnvironment tEnv = TableEnvironment.create(settings);

        Table sourceTable = tEnv.fromValues(
                DataTypes.ROW(
                        DataTypes.FIELD("id", DataTypes.INT()),
                        DataTypes.FIELD("name", DataTypes.STRING()),
                        DataTypes.FIELD("price", DataTypes.INT())
                ),
                row(1, "hayden", 18),
                row(3, "hayden", 19),
                row(4, "hayden", 20),
                row(2, "jaylin", 20)
        );

        tEnv.createTemporaryView("source", sourceTable);

        Table rT = tEnv.from("source")
                .groupBy($("name"))
                .flatAggregate(call(Top2.class, $("price")).as("price", "rank"))
                .select($("name"), $("price"), $("rank"));
        rT.execute().print();
    }

    /** Accumulator holding the two largest prices seen so far. */
    public static class Top2Accumulator {
        public Integer first;
        public Integer second;
    }

    /** Emits the top two prices per group as (price, rank) pairs. */
    public static class Top2 extends TableAggregateFunction<Tuple2<Integer, Integer>, Top2Accumulator> {

        @Override
        public Top2Accumulator createAccumulator() {
            Top2Accumulator acc = new Top2Accumulator();
            acc.first = Integer.MIN_VALUE;
            acc.second = Integer.MIN_VALUE;
            return acc;
        }

        public void accumulate(Top2Accumulator acc, Integer value) {
            if (value > acc.first) {
                acc.second = acc.first;
                acc.first = value;
            } else if (value > acc.second) {
                acc.second = value;
            }
        }

        public void merge(Top2Accumulator acc, Iterable<Top2Accumulator> it) {
            for (Top2Accumulator otherAcc : it) {
                accumulate(acc, otherAcc.first);
                accumulate(acc, otherAcc.second);
            }
        }

        public void emitValue(Top2Accumulator acc, Collector<Tuple2<Integer, Integer>> out) {
            if (acc.first != Integer.MIN_VALUE) {
                out.collect(Tuple2.of(acc.first, 1));
            }
            if (acc.second != Integer.MIN_VALUE) {
                out.collect(Tuple2.of(acc.second, 2));
            }
        }
    }
}

{code}

Got the following errors:

{code:java}
Exception in thread "main" org.apache.flink.table.api.TableException: Cannot 
generate a valid execution plan for the given query: 

LogicalSink(table=[default_catalog.default_database.Unregistered_Collect_Sink_1],
 fields=[name, price, rank])
+- LogicalProject(name=[AS($0, _UTF-16LE'name')], price=[AS($1, 
_UTF-16LE'price')], rank=[AS($2, _UTF-16LE'rank')])
   +- LogicalTableAggregate(group=[{1}], 
tableAggregate=[[flinktest$Top2Test$Top2$4619034833a29d53c136506047509219($2)]])
  +- LogicalUnion(all=[true])
 :- LogicalProject(id=[CAST(1):INTEGER], 
name=[CAST(_UTF-16LE'hayden':VARCHAR(2147483647) CHARACTER SET 
"UTF-16LE"):VARCHAR(2147483647) CHARACTER SET "UTF-16LE"], 
price=[CAST(18):INTEGER])
 :  +- LogicalValues(tuples=[[{ 0 }]])
 :- LogicalProject(id=[CAST(3):INTEGER], 
name=[CAST(_UTF-16LE'hayden':VARCHAR(2147483647) CHARACTER SET 
"UTF-16LE"):VARCHAR(2147483647) CHARACTER SET "UTF-16LE"], 
price=[CAST(19):INTEGER])
 :  +- LogicalValues(tuples=[[{ 0 }]])
 :- LogicalProject(id=[CAST(4):INTEGER], 
name=[CAST(_UTF-16LE'hayden':VARCHAR(2147483647) CHARACTER SET 
"UTF-16LE"):VARCHAR(2147483647) CHARACTER SET "UTF-16LE"], 
price=[CAST(20):INTEGER])
 :  +- LogicalValues(tuples=[[{ 0 }]])
 +- LogicalProject(id=[CAST(2):INTEGER], 
name=[CAST(_UTF-16LE'jaylin':VARCHAR(2147483647) CHARACTER SET 
"UTF-16LE"):VARCHAR(2147483647) CHARACTER SET "UTF-16LE"], 
price=[CAST(20):INTEGER])
+- LogicalValues(tuples=[[{ 0 }]])

This exception indicates that the query uses an unsupported SQL feature.
Please check the documentation for the set of currently supported SQL features.
at 
org.apache.flink.table.planner.plan.optimize.program.FlinkVolcanoProgram.optimize(FlinkVolcanoProgram.scala:72)
at 
org.apache.flink.table.planner.plan.optimize.program.FlinkChainedProgram$$anonfun$optimize$1.apply(FlinkChainedProgram.scala:62)
at 
org.apache.flink.table.planner.plan.optimize.program.FlinkChainedProgram$$anonfun$optimize$1.apply(FlinkChainedProgram.scala:58)
at 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at 

[jira] [Created] (FLINK-23353) UDTAGG can't execute in Batch mode

2021-07-12 Thread hayden zhou (Jira)
hayden zhou created FLINK-23353:
---

 Summary: UDTAGG can't execute in Batch mode
 Key: FLINK-23353
 URL: https://issues.apache.org/jira/browse/FLINK-23353
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Planner
Affects Versions: 1.13.1
Reporter: hayden zhou



{code:java}

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.functions.TableAggregateFunction;
import org.apache.flink.util.Collector;

import static org.apache.flink.table.api.Expressions.$;
import static org.apache.flink.table.api.Expressions.call;
import static org.apache.flink.table.api.Expressions.row;

public class Top2Test {
    public static void main(String[] args) {

        EnvironmentSettings settings = EnvironmentSettings.newInstance().inBatchMode().build();
        TableEnvironment tEnv = TableEnvironment.create(settings);

        Table sourceTable = tEnv.fromValues(
                DataTypes.ROW(
                        DataTypes.FIELD("id", DataTypes.INT()),
                        DataTypes.FIELD("name", DataTypes.STRING()),
                        DataTypes.FIELD("price", DataTypes.INT())
                ),
                row(1, "hayden", 18),
                row(3, "hayden", 19),
                row(4, "hayden", 20),
                row(2, "jaylin", 20)
        );

        tEnv.createTemporaryView("source", sourceTable);

        Table rT = tEnv.from("source")
                .groupBy($("name"))
                .flatAggregate(call(Top2.class, $("price")).as("price", "rank"))
                .select($("name"), $("price"), $("rank"));
        rT.execute().print();
    }

    /** Accumulator holding the two largest prices seen so far. */
    public static class Top2Accumulator {
        public Integer first;
        public Integer second;
    }

    /** Emits the top two prices per group as (price, rank) pairs. */
    public static class Top2 extends TableAggregateFunction<Tuple2<Integer, Integer>, Top2Accumulator> {

        @Override
        public Top2Accumulator createAccumulator() {
            Top2Accumulator acc = new Top2Accumulator();
            acc.first = Integer.MIN_VALUE;
            acc.second = Integer.MIN_VALUE;
            return acc;
        }

        public void accumulate(Top2Accumulator acc, Integer value) {
            if (value > acc.first) {
                acc.second = acc.first;
                acc.first = value;
            } else if (value > acc.second) {
                acc.second = value;
            }
        }

        public void merge(Top2Accumulator acc, Iterable<Top2Accumulator> it) {
            for (Top2Accumulator otherAcc : it) {
                accumulate(acc, otherAcc.first);
                accumulate(acc, otherAcc.second);
            }
        }

        public void emitValue(Top2Accumulator acc, Collector<Tuple2<Integer, Integer>> out) {
            if (acc.first != Integer.MIN_VALUE) {
                out.collect(Tuple2.of(acc.first, 1));
            }
            if (acc.second != Integer.MIN_VALUE) {
                out.collect(Tuple2.of(acc.second, 2));
            }
        }
    }
}

{code}

Got the following errors:
Exception in thread "main" org.apache.flink.table.api.TableException: Cannot 
generate a valid execution plan for the given query: 

LogicalSink(table=[default_catalog.default_database.Unregistered_Collect_Sink_1],
 fields=[name, price, rank])
+- LogicalProject(name=[AS($0, _UTF-16LE'name')], price=[AS($1, 
_UTF-16LE'price')], rank=[AS($2, _UTF-16LE'rank')])
   +- LogicalTableAggregate(group=[{1}], 
tableAggregate=[[flinktest$Top2Test$Top2$4619034833a29d53c136506047509219($2)]])
  +- LogicalUnion(all=[true])
 :- LogicalProject(id=[CAST(1):INTEGER], 
name=[CAST(_UTF-16LE'hayden':VARCHAR(2147483647) CHARACTER SET 
"UTF-16LE"):VARCHAR(2147483647) CHARACTER SET "UTF-16LE"], 
price=[CAST(18):INTEGER])
 :  +- LogicalValues(tuples=[[{ 0 }]])
 :- LogicalProject(id=[CAST(3):INTEGER], 
name=[CAST(_UTF-16LE'hayden':VARCHAR(2147483647) CHARACTER SET 
"UTF-16LE"):VARCHAR(2147483647) CHARACTER SET "UTF-16LE"], 
price=[CAST(19):INTEGER])
 :  +- LogicalValues(tuples=[[{ 0 }]])
 :- LogicalProject(id=[CAST(4):INTEGER], 
name=[CAST(_UTF-16LE'hayden':VARCHAR(2147483647) CHARACTER SET 
"UTF-16LE"):VARCHAR(2147483647) CHARACTER SET "UTF-16LE"], 
price=[CAST(20):INTEGER])
 :  +- LogicalValues(tuples=[[{ 0 }]])
 +- LogicalProject(id=[CAST(2):INTEGER], 
name=[CAST(_UTF-16LE'jaylin':VARCHAR(2147483647) CHARACTER SET 
"UTF-16LE"):VARCHAR(2147483647) CHARACTER SET "UTF-16LE"], 
price=[CAST(20):INTEGER])
+- LogicalValues(tuples=[[{ 0 }]])

This exception indicates that the query uses an unsupported SQL feature.
Please check the documentation for the set of currently supported SQL features.
at 
org.apache.flink.table.planner.plan.optimize.program.FlinkVolcanoProgram.optimize(FlinkVolcanoProgram.scala:72)
at 
org.apache.flink.table.planner.plan.optimize.program.FlinkChainedProgram$$anonfun$optimize$1.apply(FlinkChainedProgram.scala:62)
at 
org.apache.flink.table.planner.plan.optimize.program.FlinkChainedProgram$$anonfun$optimize$1.apply(FlinkChainedProgram.scala:58)
at 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at 
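
A possible workaround, not from the report itself and only a sketch under the assumption that just the two highest prices per name are needed (not arbitrary UDTAGG semantics): express the ranking as a SQL Top-N query with ROW_NUMBER(), which the batch planner supports. Continuing from the main method above, after the "source" view has been registered:

{code:java}
// Hypothetical workaround sketch: rank prices per name with ROW_NUMBER()
// and keep the two highest rows per group, reusing the registered "source" view.
Table top2 = tEnv.sqlQuery(
        "SELECT name, price, rownum AS `rank` FROM ("
            + "  SELECT name, price,"
            + "    ROW_NUMBER() OVER (PARTITION BY name ORDER BY price DESC) AS rownum"
            + "  FROM `source`"
            + ") WHERE rownum <= 2");
top2.execute().print();
{code}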

[jira] [Comment Edited] (FLINK-22054) Using a shared watcher for ConfigMap watching

2021-04-14 Thread hayden zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321916#comment-17321916
 ] 

hayden zhou edited comment on FLINK-22054 at 4/15/21, 5:53 AM:
---

[~trohrmann] [~yittg] I can't submit more than roughly 60 batch jobs. I 
submitted one job every 5 minutes; it looks like the watch connections are exhausted. 
Is the connection not released back to the pool when a batch job is closed?


was (Author: hayden zhou):
[~trohrmann] [~yittg] I can't submit more than roughly 60 batch jobs. I 
submitted one job every 5 minutes; it looks like the watch connections are exhausted. 
Is the connection not closed when a batch job is closed?

> Using a shared watcher for ConfigMap watching
> -
>
> Key: FLINK-22054
> URL: https://issues.apache.org/jira/browse/FLINK-22054
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.12.2, 1.13.0
>Reporter: Yi Tang
>Assignee: Yi Tang
>Priority: Major
>  Labels: k8s-ha, pull-request-available
> Fix For: 1.14.0
>
>
> While using the K8s HA service, the ConfigMap watching is separate for each 
> job. As the number of running jobs increases, this consumes a large number of 
> connections.
> Here we propose to use a shared watcher for each FlinkKubeClient and dispatch 
> events to different listeners. At the same time, we should keep the same 
> semantics as watching separately.
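
To make the proposed dispatch concrete, below is a minimal, library-agnostic sketch of the idea; the class and method names are hypothetical and are not Flink's or fabric8's actual API. One shared watch callback fans events out to per-ConfigMap listeners registered in a map, so each job adds a listener instead of opening its own watch connection:

{code:java}
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical sketch: one shared ConfigMap watch, many per-job listeners. */
public class SharedConfigMapWatcher {

    /** Callback a job registers for "its" ConfigMap (hypothetical interface). */
    public interface ConfigMapListener {
        void onEvent(String action, String configMapName, Map<String, String> data);
    }

    // Listeners keyed by ConfigMap name; all of them share the single watch below.
    private final Map<String, List<ConfigMapListener>> listeners = new ConcurrentHashMap<>();

    public void register(String configMapName, ConfigMapListener listener) {
        listeners.computeIfAbsent(configMapName, k -> new CopyOnWriteArrayList<>()).add(listener);
    }

    public void unregister(String configMapName, ConfigMapListener listener) {
        List<ConfigMapListener> list = listeners.get(configMapName);
        if (list != null) {
            list.remove(listener);
        }
    }

    /**
     * Called by the single underlying watch connection for every event; dispatches
     * only to the listeners of that ConfigMap, which keeps the per-job semantics
     * of watching separately.
     */
    public void dispatch(String action, String configMapName, Map<String, String> data) {
        for (ConfigMapListener l : listeners.getOrDefault(configMapName, List.of())) {
            l.onEvent(action, configMapName, data);
        }
    }
}
{code}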



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-22054) Using a shared watcher for ConfigMap watching

2021-04-14 Thread hayden zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321916#comment-17321916
 ] 

hayden zhou edited comment on FLINK-22054 at 4/15/21, 5:48 AM:
---

[~trohrmann] [~yittg] I can't submit more than roughly 60 batch jobs. I 
submitted one job every 5 minutes; it looks like the watch connections are exhausted. 
Is the connection not closed when a batch job is closed?


was (Author: hayden zhou):
[~trohrmann] [~yittg] I can't submit more than roughly 60 batch jobs. I 
submitted one job every 5 minutes; it looks like the watch connections are exhausted. 
Is the connection not closed when one job is closed?

> Using a shared watcher for ConfigMap watching
> -
>
> Key: FLINK-22054
> URL: https://issues.apache.org/jira/browse/FLINK-22054
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.12.2, 1.13.0
>Reporter: Yi Tang
>Assignee: Yi Tang
>Priority: Major
>  Labels: k8s-ha, pull-request-available
> Fix For: 1.14.0
>
>
> While using the K8s HA service, the ConfigMap watching is separate for each 
> job. As the number of running jobs increases, this consumes a large number of 
> connections.
> Here we propose to use a shared watcher for each FlinkKubeClient and dispatch 
> events to different listeners. At the same time, we should keep the same 
> semantics as watching separately.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-22054) Using a shared watcher for ConfigMap watching

2021-04-14 Thread hayden zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321916#comment-17321916
 ] 

hayden zhou commented on FLINK-22054:
-

[~trohrmann] [~yittg] I can't submit more than roughly 60 batch jobs. I 
submitted one job every 5 minutes; it looks like the watch connections are exhausted. 
Is the connection not closed when one job is closed?

> Using a shared watcher for ConfigMap watching
> -
>
> Key: FLINK-22054
> URL: https://issues.apache.org/jira/browse/FLINK-22054
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.12.2, 1.13.0
>Reporter: Yi Tang
>Assignee: Yi Tang
>Priority: Major
>  Labels: k8s-ha, pull-request-available
> Fix For: 1.14.0
>
>
> While using the K8s HA service, the ConfigMap watching is separate for each 
> job. As the number of running jobs increases, this consumes a large number of 
> connections.
> Here we propose to use a shared watcher for each FlinkKubeClient and dispatch 
> events to different listeners. At the same time, we should keep the same 
> semantics as watching separately.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-22047) Could not find FLINSHED Flink job and can't submit job

2021-04-13 Thread hayden zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320693#comment-17320693
 ] 

hayden zhou edited comment on FLINK-22047 at 4/14/21, 4:30 AM:
---

[~trohrmann] thanks for your reply, hoping to see the fixed version released as 
soon as possible :)(y)(y)


was (Author: hayden zhou):
[~trohrmann] thanks for your reply, hoping to see the fixed version released as 
soon as possible

> Could not find FLINSHED Flink job and can't submit job 
> ---
>
> Key: FLINK-22047
> URL: https://issues.apache.org/jira/browse/FLINK-22047
> Project: Flink
>  Issue Type: Bug
>  Components: Table SQL / API
>Affects Versions: 1.12.2
>Reporter: hayden zhou
>Assignee: Caizhi Weng
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Could not find FLINSHED Flink job, and always can't submit jobs due to 
> insufficient slots



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-22047) Could not find FLINSHED Flink job and can't submit job

2021-04-13 Thread hayden zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320693#comment-17320693
 ] 

hayden zhou commented on FLINK-22047:
-

[~trohrmann] thanks for your reply, hoping to see the fixed version released as 
soon as possible

> Could not find FLINSHED Flink job and can't submit job 
> ---
>
> Key: FLINK-22047
> URL: https://issues.apache.org/jira/browse/FLINK-22047
> Project: Flink
>  Issue Type: Bug
>  Components: Table SQL / API
>Affects Versions: 1.12.2
>Reporter: hayden zhou
>Assignee: Caizhi Weng
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Could not find FLINSHED Flink job, and always can't submit jobs due to 
> insufficient slots



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-22047) Could not find FLINSHED Flink job and can't submit job

2021-04-13 Thread hayden zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320017#comment-17320017
 ] 

hayden zhou commented on FLINK-22047:
-

The number of jobs that can be submitted to a cluster is about 70; once it exceeds 
70, no more jobs can be submitted.

> Could not find FLINSHED Flink job and can't submit job 
> ---
>
> Key: FLINK-22047
> URL: https://issues.apache.org/jira/browse/FLINK-22047
> Project: Flink
>  Issue Type: Bug
>  Components: Table SQL / API
>Affects Versions: 1.12.2
>Reporter: hayden zhou
>Assignee: Caizhi Weng
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Could not find FLINSHED Flink job, and always can't submit jobs due to 
> insufficient slots



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-22047) Could not find FLINSHED Flink job and can't submit job

2021-04-12 Thread hayden zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319469#comment-17319469
 ] 

hayden zhou edited comment on FLINK-22047 at 4/12/21, 2:16 PM:
---

Not being able to submit to a Kubernetes native session mode cluster after several 
days is an inevitable problem. I have to restart the cluster so that I can submit 
jobs again. We can set up a k8s native session mode cluster and submit a job every 
five minutes; then we can reproduce the problem within one day. 
[~trohrmann],[~TsReaper],[~wangyang]


was (Author: hayden zhou):
Not being able to submit to a Kubernetes native session mode cluster after several 
days is an inevitable problem. I have to restart the cluster so that I can submit 
jobs again. You can submit a job every five minutes; then you can see the problem 
within one day. [~trohrmann],[~TsReaper],[~wangyang]

> Could not find FLINSHED Flink job and can't submit job 
> ---
>
> Key: FLINK-22047
> URL: https://issues.apache.org/jira/browse/FLINK-22047
> Project: Flink
>  Issue Type: Bug
>  Components: Table SQL / API
>Affects Versions: 1.12.2
>Reporter: hayden zhou
>Assignee: Caizhi Weng
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Could not find FLINSHED Flink job, and always can't submit jobs due to 
> insufficient slots



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-22047) Could not find FLINSHED Flink job and can't submit job

2021-04-12 Thread hayden zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319467#comment-17319467
 ] 

hayden zhou edited comment on FLINK-22047 at 4/12/21, 2:12 PM:
---

[~TsReaper] I am using Kubernetes native session mode to submit jobs; after about 
two days of normal submissions I can no longer submit jobs. I have downloaded the JM 
logs, but I can't attach them here because the file is bigger than the upload limit.


was (Author: hayden zhou):
[~TsReaper] I am using Kubernetes native session mode to submit jobs; after about 
two days of normal submissions I can no longer submit jobs. I have downloaded the JM 
logs, but I can't attach them here.

> Could not find FLINSHED Flink job and can't submit job 
> ---
>
> Key: FLINK-22047
> URL: https://issues.apache.org/jira/browse/FLINK-22047
> Project: Flink
>  Issue Type: Bug
>  Components: Table SQL / API
>Affects Versions: 1.12.2
>Reporter: hayden zhou
>Assignee: Caizhi Weng
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Could not find FLINSHED Flink job, and always can't submit jobs due to 
> insufficient slots



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-22047) Could not find FLINSHED Flink job and can't submit job

2021-04-12 Thread hayden zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319469#comment-17319469
 ] 

hayden zhou edited comment on FLINK-22047 at 4/12/21, 2:10 PM:
---

Not being able to submit to a Kubernetes native session mode cluster after several 
days is an inevitable problem. I have to restart the cluster so that I can submit 
jobs again. You can submit a job every five minutes; then you can see the problem 
within one day. [~trohrmann],[~TsReaper],[~wangyang]


was (Author: hayden zhou):
Not being able to submit to a Kubernetes native session mode cluster after several 
days is an inevitable problem. I have to restart the cluster so that I can submit 
jobs again. You can submit a job every five minutes; then you can see the problem 
within one day. [~trohrmann][~TsReaper]

> Could not find FLINSHED Flink job and can't submit job 
> ---
>
> Key: FLINK-22047
> URL: https://issues.apache.org/jira/browse/FLINK-22047
> Project: Flink
>  Issue Type: Bug
>  Components: Table SQL / API
>Affects Versions: 1.12.2
>Reporter: hayden zhou
>Assignee: Caizhi Weng
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Could not find FLINSHED Flink job, and always can't submit jobs due to 
> insufficient slots



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-22047) Could not find FLINSHED Flink job and can't submit job

2021-04-12 Thread hayden zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319469#comment-17319469
 ] 

hayden zhou commented on FLINK-22047:
-

Not being able to submit to a Kubernetes native session mode cluster after several 
days is an inevitable problem. I have to restart the cluster so that I can submit 
jobs again. You can submit a job every five minutes; then you can see the problem 
within one day.

> Could not find FLINSHED Flink job and can't submit job 
> ---
>
> Key: FLINK-22047
> URL: https://issues.apache.org/jira/browse/FLINK-22047
> Project: Flink
>  Issue Type: Bug
>  Components: Table SQL / API
>Affects Versions: 1.12.2
>Reporter: hayden zhou
>Assignee: Caizhi Weng
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Could not find FLINSHED Flink job, and always can't submit jobs due to 
> insufficient slots



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-22047) Could not find FLINSHED Flink job and can't submit job

2021-04-12 Thread hayden zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319469#comment-17319469
 ] 

hayden zhou edited comment on FLINK-22047 at 4/12/21, 2:06 PM:
---

Not being able to submit to a Kubernetes native session mode cluster after several 
days is an inevitable problem. I have to restart the cluster so that I can submit 
jobs again. You can submit a job every five minutes; then you can see the problem 
within one day. [~trohrmann][~TsReaper]


was (Author: hayden zhou):
Not being able to submit to a Kubernetes native session mode cluster after several 
days is an inevitable problem. I have to restart the cluster so that I can submit 
jobs again. You can submit a job every five minutes; then you can see the problem 
within one day.

> Could not find FLINSHED Flink job and can't submit job 
> ---
>
> Key: FLINK-22047
> URL: https://issues.apache.org/jira/browse/FLINK-22047
> Project: Flink
>  Issue Type: Bug
>  Components: Table SQL / API
>Affects Versions: 1.12.2
>Reporter: hayden zhou
>Assignee: Caizhi Weng
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Could not find FLINSHED Flink job, and always can't submit jobs due to 
> insufficient slots



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-22047) Could not find FLINSHED Flink job and can't submit job

2021-04-12 Thread hayden zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319467#comment-17319467
 ] 

hayden zhou commented on FLINK-22047:
-

[~TsReaper] I am using Kubernetes native session mode to submit jobs; after about 
two days of normal submissions I can no longer submit jobs. I have downloaded the JM 
logs, but I can't attach them here.

> Could not find FLINSHED Flink job and can't submit job 
> ---
>
> Key: FLINK-22047
> URL: https://issues.apache.org/jira/browse/FLINK-22047
> Project: Flink
>  Issue Type: Bug
>  Components: Table SQL / API
>Affects Versions: 1.12.2
>Reporter: hayden zhou
>Assignee: Caizhi Weng
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Could not find FLINSHED Flink job, and always can't submit jobs due to 
> insufficient slots



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-22047) Could not find FLINSHED Flink job and can't submit job

2021-03-30 Thread hayden zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311476#comment-17311476
 ] 

hayden zhou commented on FLINK-22047:
-

2021-03-30 09:36:52,903 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - 
MultipleInput(readOrder=[0,0,0,0,0,1], 
members=[\nHashJoin(joinType=[LeftOuterJoin], where=[=(hour0, hour)], 
select=[time, hour, appId, pv0, uv, shareCount, likeCount, hour0, 
commentCount], isBroadcast=[true], build=[right])\n:- Calc(select=[time, hour, 
appId, pv0, uv, shareCount, likeCount])\n:  +- 
HashJoin(joinType=[LeftOuterJoin], where=[=(hour0, hour)], select=[time, hour, 
appId, pv0, uv, shareCount, hour0, likeCount], isBroadcast=[true], 
build=[right])\n: :- Calc(select=[time, hour, appId, pv0, uv, 
shareCount])\n: :  +- HashJoin(joinType=[LeftOuterJoin], where=[=(hour0, 
hour)], select=[time, hour, appId, pv0, uv, hour0, shareCount], 
build=[right])\n: : :- Calc(select=[time, hour, appId, pv0, uv])\n: 
: :  +- HashJoin(joinType=[LeftOuterJoin], where=[=(hour0, hour)], 
select=[time, hour, appId, pv0, hour0, uv], isBroadcast=[true], 
build=[right])\n: : : :- Calc(select=[time, hour, appId, pv AS 
pv0])\n: : : :  +- HashJoin(joinType=[LeftOuterJoin], 
where=[=(hour0, hour)], select=[time, hour, appId, hour0, pv], 
build=[right])\n: : : : :- Calc(select=[time, hour, appId])\n:  
   : : : :  +- SortAggregate(isMerge=[true], groupBy=[hour], 
select=[hour, Final_MAX(max$0) AS time, Final_MAX(max$1) AS appId])\n: :
 : : : +- Sort(orderBy=[hour ASC])\n: : : : :   
 +- [#6] Exchange(distribution=[hash[hour]])\n: : : : +- 
HashAggregate(isMerge=[true], groupBy=[hour], select=[hour, 
Final_COUNT(count$0) AS pv])\n: : : :+- [#5] 
Exchange(distribution=[hash[hour]])\n: : : +- [#4] 
Exchange(distribution=[broadcast])\n: : +- 
HashAggregate(isMerge=[true], groupBy=[hour], select=[hour, 
Final_COUNT(count$0) AS shareCount])\n: :+- [#3] 
Exchange(distribution=[hash[hour]])\n: +- [#2] 
Exchange(distribution=[broadcast])\n+- [#1] 
Exchange(distribution=[broadcast])\n]) -> Calc(select=[time, hour, appId, (pv0 
IS NOT NULL CASE CAST(pv0) CASE 0:BIGINT) AS pv, (uv IS NOT NULL CASE CAST(uv) 
CASE 0:BIGINT) AS uv, (shareCount IS NOT NULL CASE CAST(shareCount) CASE 
0:BIGINT) AS shareCount, (likeCount IS NOT NULL CASE CAST(likeCount) CASE 
0:BIGINT) AS likeCount, (commentCount IS NOT NULL CASE CAST(commentCount) CASE 
0:BIGINT) AS commentCount]) -> Sink: Select table sink (1/1) 
(262fca25a1e13d9ca47e636f94f8152c) switched from DEPLOYING to RUNNING.
2021-03-30 09:36:53,256 INFO  
org.apache.flink.streaming.api.operators.collect.CollectSinkOperatorCoordinator 
[] - Received sink socket server address: /172.16.4.131:41380
2021-03-30 09:36:53,346 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - 
MultipleInput(readOrder=[0,0,0,0,0,1], 
members=[\nHashJoin(joinType=[LeftOuterJoin], where=[=(hour0, hour)], 
select=[time, hour, appId, pv0, uv, shareCount, likeCount, hour0, 
commentCount], isBroadcast=[true], build=[right])\n:- Calc(select=[time, hour, 
appId, pv0, uv, shareCount, likeCount])\n:  +- 
HashJoin(joinType=[LeftOuterJoin], where=[=(hour0, hour)], select=[time, hour, 
appId, pv0, uv, shareCount, hour0, likeCount], isBroadcast=[true], 
build=[right])\n: :- Calc(select=[time, hour, appId, pv0, uv, 
shareCount])\n: :  +- HashJoin(joinType=[LeftOuterJoin], where=[=(hour0, 
hour)], select=[time, hour, appId, pv0, uv, hour0, shareCount], 
build=[right])\n: : :- Calc(select=[time, hour, appId, pv0, uv])\n: 
: :  +- HashJoin(joinType=[LeftOuterJoin], where=[=(hour0, hour)], 
select=[time, hour, appId, pv0, hour0, uv], isBroadcast=[true], 
build=[right])\n: : : :- Calc(select=[time, hour, appId, pv AS 
pv0])\n: : : :  +- HashJoin(joinType=[LeftOuterJoin], 
where=[=(hour0, hour)], select=[time, hour, appId, hour0, pv], 
build=[right])\n: : : : :- Calc(select=[time, hour, appId])\n:  
   : : : :  +- SortAggregate(isMerge=[true], groupBy=[hour], 
select=[hour, Final_MAX(max$0) AS time, Final_MAX(max$1) AS appId])\n: :
 : : : +- Sort(orderBy=[hour ASC])\n: : : : :   
 +- [#6] Exchange(distribution=[hash[hour]])\n: : : : +- 
HashAggregate(isMerge=[true], groupBy=[hour], select=[hour, 
Final_COUNT(count$0) AS pv])\n: : : :+- [#5] 
Exchange(distribution=[hash[hour]])\n: : : +- [#4] 
Exchange(distribution=[broadcast])\n: : +- 
HashAggregate(isMerge=[true], groupBy=[hour], select=[hour, 
Final_COUNT(count$0) AS shareCount])\n: :+- [#3] 
Exchange(distribution=[hash[hour]])\n: +- [#2] 

[jira] [Commented] (FLINK-22047) Could not find FLINSHED Flink job and can't submit job

2021-03-30 Thread hayden zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311475#comment-17311475
 ] 

hayden zhou commented on FLINK-22047:
-

Cannot attach the log. !screenshot-1.png!

> Could not find FLINSHED Flink job and can't submit job 
> ---
>
> Key: FLINK-22047
> URL: https://issues.apache.org/jira/browse/FLINK-22047
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.12.2
>Reporter: hayden zhou
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Could not find FLINSHED Flink job, and always can't submit jobs due to 
> insufficient slots



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-22047) Could not find FLINSHED Flink job and can't submit job

2021-03-30 Thread hayden zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hayden zhou updated FLINK-22047:

Attachment: screenshot-1.png

> Could not find FLINSHED Flink job and can't submit job 
> ---
>
> Key: FLINK-22047
> URL: https://issues.apache.org/jira/browse/FLINK-22047
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.12.2
>Reporter: hayden zhou
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Could not find FLINSHED Flink job, and always can't submit jobs due to 
> insufficient slots



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22047) Could not find FLINSHED Flink job and can't submit job

2021-03-30 Thread hayden zhou (Jira)
hayden zhou created FLINK-22047:
---

 Summary: Could not find FLINSHED Flink job and can't submit job 
 Key: FLINK-22047
 URL: https://issues.apache.org/jira/browse/FLINK-22047
 Project: Flink
  Issue Type: Bug
  Components: Deployment / Kubernetes
Affects Versions: 1.12.2
Reporter: hayden zhou


Could not find FLINSHED Flink job, and always can't submit jobs due to insufficient 
slots



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-21840) can't submit flink k8s session job with kubernetes.rest-service.exposed.type=NodePort

2021-03-17 Thread hayden zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303161#comment-17303161
 ] 

hayden zhou commented on FLINK-21840:
-

OK, thanks for your answer.

> can't submit flink k8s session job with 
> kubernetes.rest-service.exposed.type=NodePort
> -
>
> Key: FLINK-21840
> URL: https://issues.apache.org/jira/browse/FLINK-21840
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / REST
>Affects Versions: 1.12.2
> Environment: flink on native kubernetes with session mode
>Reporter: hayden zhou
>Priority: Major
>
> I have created a Flink session cluster with the `kubernetes-session` command and 
> the -Dkubernetes.rest-service.exposed.type=NodePort option, because we don't 
> want to expose the REST service externally.
> When I submit a Flink job with `flink run --target kubernetes-session xxx`, I 
> found that this command automatically uses the Kubernetes ApiServer address 
> as the Node address. But my ApiServer address IP is not among the node IPs of the 
> k8s cluster. Can I specify a node IP explicitly?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16601) Correct the way to get Endpoint address for NodePort rest Service

2021-03-17 Thread hayden zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303160#comment-17303160
 ] 

hayden zhou commented on FLINK-16601:
-

Yes, NodePort support is very useful for self-built k8s clusters.

> Correct the way to get Endpoint address for NodePort rest Service
> -
>
> Key: FLINK-16601
> URL: https://issues.apache.org/jira/browse/FLINK-16601
> Project: Flink
>  Issue Type: Sub-task
>  Components: Deployment / Kubernetes
>Affects Versions: 1.10.0
>Reporter: Canbin Zheng
>Assignee: Canbin Zheng
>Priority: Major
> Fix For: 1.13.0
>
>
> Currently, if one sets the type of the rest-service to {{NodePort}}, then the 
> way to get the Endpoint address is by calling the method of 
> {{KubernetesClient.getMasterUrl().getHost()}}. This solution works fine for 
> the case of the non-managed Kubernetes cluster but not for the managed ones.
> For the managed Kubernetes cluster setups, the Kubernetes masters are 
> deployed in a pool different from the Kubernetes nodes and the master node 
> does not expose a NodePort for the NodePort Service.
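
For illustration only, here is a small hedged sketch (my own, not Flink's actual code; it assumes the fabric8 client's node and NodeAddress accessors) of the difference between taking the master URL host and picking an address from the cluster's node list, which is what a managed setup would need:

{code:java}
import io.fabric8.kubernetes.api.model.NodeAddress;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

import java.util.Optional;

public class NodePortEndpointSketch {

    /** Picks the first InternalIP reported by any node; ExternalIP could be preferred instead. */
    public static Optional<String> firstNodeAddress(KubernetesClient client) {
        return client.nodes().list().getItems().stream()
                .flatMap(node -> node.getStatus().getAddresses().stream())
                .filter(addr -> "InternalIP".equals(addr.getType()))
                .map(NodeAddress::getAddress)
                .findFirst();
    }

    public static void main(String[] args) {
        try (KubernetesClient client = new DefaultKubernetesClient()) {
            // Current approach described above: host of the API server URL.
            System.out.println("Master URL host: " + client.getMasterUrl().getHost());
            // Alternative: an address taken from the node list, reachable via the NodePort.
            System.out.println("Node address:    " + firstNodeAddress(client).orElse("<none>"));
        }
    }
}
{code}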



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21840) can't submit flink k8s session job with kubernetes.rest-service.exposed.type=NodePort

2021-03-17 Thread hayden zhou (Jira)
hayden zhou created FLINK-21840:
---

 Summary: can't submit flink k8s session job with 
kubernetes.rest-service.exposed.type=NodePort
 Key: FLINK-21840
 URL: https://issues.apache.org/jira/browse/FLINK-21840
 Project: Flink
  Issue Type: Bug
  Components: Runtime / REST
Affects Versions: 1.12.2
 Environment: flink on native kubernetes with session mode
Reporter: hayden zhou


I have created a Flink session cluster with the `kubernetes-session` command and the 
-Dkubernetes.rest-service.exposed.type=NodePort option, because we don't want 
to expose the REST service externally.
When I submit a Flink job with `flink run --target kubernetes-session xxx`, I found 
that this command automatically uses the Kubernetes ApiServer address as the 
Node address. But my ApiServer address IP is not among the node IPs of the k8s 
cluster. Can I specify a node IP explicitly?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-21325) NoResourceAvailableException while cancel then resubmit jobs or after running tow days

2021-03-04 Thread hayden zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hayden zhou updated FLINK-21325:

Description: 
I have five streaming jobs and wanted to clear all state in them, so I canceled 
all those jobs and then resubmitted them one by one. As a result, two jobs are in 
RUNNING status, while three jobs are stuck in CREATED status with the error 
"NoResourceAvailableException: Slot request bulk is not fulfillable! Could not 
allocate the required slot within slot request timeout".
I am sure my slots are sufficient.

Also, after about two days of running normally, batch and streaming jobs can no 
longer be submitted.

This problem was fixed by restarting the k8s JM and TM pods.

Below are the error logs:

{code:java}
ava.util.concurrent.CompletionException: 
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
Slot request bulk is not fulfillable! Could not allocate the required slot 
within slot request timeout
 at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
 at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
 at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
 at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
 at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
 at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
 at 
org.apache.flink.runtime.scheduler.SharedSlot.cancelLogicalSlotRequest(SharedSlot.java:195)
 at 
org.apache.flink.runtime.scheduler.SlotSharingExecutionSlotAllocator.cancelLogicalSlotRequest(SlotSharingExecutionSlotAllocator.java:147)
 at 
org.apache.flink.runtime.scheduler.SharingPhysicalSlotRequestBulk.cancel(SharingPhysicalSlotRequestBulk.java:84)
 at 
org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkWithTimestamp.cancel(PhysicalSlotRequestBulkWithTimestamp.java:66)
 at 
org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:87)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:404)
 at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:197)
 at 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
 at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
 at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
 at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
 at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
 at akka.actor.ActorCell.invoke(ActorCell.scala:561)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
 at akka.dispatch.Mailbox.run(Mailbox.scala:225)
 at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
 at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: 
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
Slot request bulk is not fulfillable! Could not allocate the required slot 
within slot request timeout
 at 
org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:84)
 ... 24 more
Caused by: java.util.concurrent.TimeoutException: Timeout has occurred: 30 
ms
 ... 25 more
{code}


  was:
I have five streaming jobs and wanted to clear all state in them, so I canceled 
all those jobs and then resubmitted them one by one. As a result, two jobs are in 
RUNNING status, while three jobs are stuck in CREATED status with the error 
"NoResourceAvailableException: Slot request bulk is not fulfillable! Could not 
allocate the required slot within slot request timeout".
I am sure my slots are sufficient.

This problem was fixed by restarting the k8s JM and TM pods.

Below are the error logs:

{code:java}
ava.util.concurrent.CompletionException: 

[jira] [Commented] (FLINK-21325) NoResourceAvailableException while cancel then resubmit jobs

2021-02-28 Thread hayden zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17292561#comment-17292561
 ] 

hayden zhou commented on FLINK-21325:
-

[~gvauvert] In my case, I found it always reports "Slot request bulk is not 
fulfillable" after almost two days of running normally.

> NoResourceAvailableException while cancel then resubmit  jobs
> -
>
> Key: FLINK-21325
> URL: https://issues.apache.org/jira/browse/FLINK-21325
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes, Runtime / Coordination
> Environment: FLINK 1.12  with 
> [flink-kubernetes_2.11-1.12-SNAPSHOT.jar] in libs directory to fix FLINK 
> restart problem on k8s HA session mode.
>Reporter: hayden zhou
>Priority: Major
> Attachments: clear.log
>
>
> I have five streaming jobs and wanted to clear all state in them, so I canceled 
> all those jobs and then resubmitted them one by one. As a result, two jobs are in 
> RUNNING status, while three jobs are stuck in CREATED status with the error 
> "NoResourceAvailableException: Slot request bulk is not fulfillable! Could 
> not allocate the required slot within slot request timeout".
> I am sure my slots are sufficient.
> This problem was fixed by restarting the k8s JM and TM pods.
> Below are the error logs:
> {code:java}
> ava.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout
>  at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>  at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>  at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
>  at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>  at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
>  at 
> org.apache.flink.runtime.scheduler.SharedSlot.cancelLogicalSlotRequest(SharedSlot.java:195)
>  at 
> org.apache.flink.runtime.scheduler.SlotSharingExecutionSlotAllocator.cancelLogicalSlotRequest(SlotSharingExecutionSlotAllocator.java:147)
>  at 
> org.apache.flink.runtime.scheduler.SharingPhysicalSlotRequestBulk.cancel(SharingPhysicalSlotRequestBulk.java:84)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkWithTimestamp.cancel(PhysicalSlotRequestBulkWithTimestamp.java:66)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:87)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:404)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:197)
>  at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154)
>  at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>  at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>  at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>  at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>  at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>  at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>  at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>  at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>  at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>  at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>  at 
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>  at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>  at 
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout
>  at 
> 

[jira] [Updated] (FLINK-21325) NoResourceAvailableException while cancel then resubmit jobs

2021-02-26 Thread hayden zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hayden zhou updated FLINK-21325:

Attachment: clear.log

> NoResourceAvailableException while cancel then resubmit  jobs
> -
>
> Key: FLINK-21325
> URL: https://issues.apache.org/jira/browse/FLINK-21325
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes, Runtime / Coordination
> Environment: FLINK 1.12  with 
> [flink-kubernetes_2.11-1.12-SNAPSHOT.jar] in libs directory to fix FLINK 
> restart problem on k8s HA session mode.
>Reporter: hayden zhou
>Priority: Major
> Attachments: clear.log
>
>
> I have five streaming jobs and wanted to clear all state in them, so I canceled 
> all those jobs and then resubmitted them one by one. As a result, two jobs are in 
> RUNNING status, while three jobs are stuck in CREATED status with the error 
> "NoResourceAvailableException: Slot request bulk is not fulfillable! Could 
> not allocate the required slot within slot request timeout".
> I am sure my slots are sufficient.
> This problem was fixed by restarting the k8s JM and TM pods.
> Below are the error logs:
> {code:java}
> ava.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout
>  at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>  at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>  at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
>  at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>  at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
>  at 
> org.apache.flink.runtime.scheduler.SharedSlot.cancelLogicalSlotRequest(SharedSlot.java:195)
>  at 
> org.apache.flink.runtime.scheduler.SlotSharingExecutionSlotAllocator.cancelLogicalSlotRequest(SlotSharingExecutionSlotAllocator.java:147)
>  at 
> org.apache.flink.runtime.scheduler.SharingPhysicalSlotRequestBulk.cancel(SharingPhysicalSlotRequestBulk.java:84)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkWithTimestamp.cancel(PhysicalSlotRequestBulkWithTimestamp.java:66)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:87)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:404)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:197)
>  at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154)
>  at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>  at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>  at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>  at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>  at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>  at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>  at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>  at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>  at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>  at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>  at 
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>  at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>  at 
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:84)

[jira] [Commented] (FLINK-21325) NoResourceAvailableException while cancel then resubmit jobs

2021-02-26 Thread hayden zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17291600#comment-17291600
 ] 

hayden zhou commented on FLINK-21325:
-

I have uploaded the full log to this site: 
https://pan.baidu.com/s/1pvPi-Y_1T9T_cmVfDJow4g (secret: fhu5), and I have 
attached the file clear.log here, with the checkpoint logs removed for 
convenience.

> NoResourceAvailableException while cancel then resubmit  jobs
> -
>
> Key: FLINK-21325
> URL: https://issues.apache.org/jira/browse/FLINK-21325
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes, Runtime / Coordination
> Environment: FLINK 1.12  with 
> [flink-kubernetes_2.11-1.12-SNAPSHOT.jar] in libs directory to fix FLINK 
> restart problem on k8s HA session mode.
>Reporter: hayden zhou
>Priority: Major
> Attachments: clear.log
>
>
> I have five streaming jobs and wanted to clear all state in them, so I canceled 
> all those jobs and then resubmitted them one by one. As a result, two jobs are in 
> RUNNING status, while three jobs are stuck in CREATED status with the error 
> "NoResourceAvailableException: Slot request bulk is not fulfillable! Could 
> not allocate the required slot within slot request timeout".
> I am sure my slots are sufficient.
> This problem was fixed by restarting the k8s JM and TM pods.
> Below are the error logs:
> {code:java}
> ava.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout
>  at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>  at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>  at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
>  at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>  at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
>  at 
> org.apache.flink.runtime.scheduler.SharedSlot.cancelLogicalSlotRequest(SharedSlot.java:195)
>  at 
> org.apache.flink.runtime.scheduler.SlotSharingExecutionSlotAllocator.cancelLogicalSlotRequest(SlotSharingExecutionSlotAllocator.java:147)
>  at 
> org.apache.flink.runtime.scheduler.SharingPhysicalSlotRequestBulk.cancel(SharingPhysicalSlotRequestBulk.java:84)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkWithTimestamp.cancel(PhysicalSlotRequestBulkWithTimestamp.java:66)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:87)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:404)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:197)
>  at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154)
>  at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>  at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>  at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>  at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>  at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>  at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>  at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>  at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>  at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>  at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>  at 
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>  at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>  at 
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the 

[jira] [Updated] (FLINK-21325) NoResourceAvailableException while cancel then resubmit jobs

2021-02-26 Thread hayden zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hayden zhou updated FLINK-21325:

Attachment: (was: 
flink--standalonesession-0-mta-flink-jobmanager-56cf5bbf6-2chwc.log)

> NoResourceAvailableException while cancel then resubmit  jobs
> -
>
> Key: FLINK-21325
> URL: https://issues.apache.org/jira/browse/FLINK-21325
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes, Runtime / Coordination
> Environment: FLINK 1.12  with 
> [flink-kubernetes_2.11-1.12-SNAPSHOT.jar] in libs directory to fix FLINK 
> restart problem on k8s HA session mode.
>Reporter: hayden zhou
>Priority: Major
>
> I have five streaming jobs and wanted to clear all state in them, so I canceled 
> all those jobs and then resubmitted them one by one. As a result, two jobs are in 
> RUNNING status, while three jobs are stuck in CREATED status with the error 
> "NoResourceAvailableException: Slot request bulk is not fulfillable! Could 
> not allocate the required slot within slot request timeout".
> I am sure my slots are sufficient.
> This problem was fixed by restarting the k8s JM and TM pods.
> Below are the error logs:
> {code:java}
> ava.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout
>  at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>  at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>  at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
>  at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>  at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
>  at 
> org.apache.flink.runtime.scheduler.SharedSlot.cancelLogicalSlotRequest(SharedSlot.java:195)
>  at 
> org.apache.flink.runtime.scheduler.SlotSharingExecutionSlotAllocator.cancelLogicalSlotRequest(SlotSharingExecutionSlotAllocator.java:147)
>  at 
> org.apache.flink.runtime.scheduler.SharingPhysicalSlotRequestBulk.cancel(SharingPhysicalSlotRequestBulk.java:84)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkWithTimestamp.cancel(PhysicalSlotRequestBulkWithTimestamp.java:66)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:87)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:404)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:197)
>  at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154)
>  at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>  at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>  at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>  at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>  at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>  at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>  at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>  at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>  at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>  at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>  at 
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>  at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>  at 
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout
>  at 
> 

[jira] [Updated] (FLINK-21472) FencingTokenException: Fencing token mismatch

2021-02-23 Thread hayden zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hayden zhou updated FLINK-21472:

Description: 
org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler [] - Unhandled 
exception.
org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token 
mismatch: Ignoring message LocalFencedMessage(8fac01d8e3e3988223a2e5c6e3f04f1e, 
LocalRpcInvocation(requestMultipleJobDetails(Time))) because the fencing token 
8fac01d8e3e3988223a2e5c6e3f04f1e did not match the expected fencing token 
8c37414f464bca76144e6cabc946474b.

  was:

org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler [] - Unhandled 
exception.
org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token 
mismatch: Ignoring message LocalFencedMessage(8fac01d8e3e3988223a2e5c6e3f04f1e, 
LocalRpcInvocation(requestMultipleJobDetails(Time))) because the fencing token 
8fac01d8e3e3988223a2e5c6e3f04f1e did not match the expected fencing token 
8c37414f464bca76144e6cabc946474b.

Summary: FencingTokenException: Fencing token mismatch  (was: 
encingTokenException: Fencing token mismatch)

> FencingTokenException: Fencing token mismatch
> -
>
> Key: FLINK-21472
> URL: https://issues.apache.org/jira/browse/FLINK-21472
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.12.1
>Reporter: hayden zhou
>Priority: Major
> Attachments: 
> flink--standalonesession-0-mta-flink-jobmanager-864d6c8cbb-rmsxw.log
>
>
> org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler [] - Unhandled 
> exception.
> org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token 
> mismatch: Ignoring message 
> LocalFencedMessage(8fac01d8e3e3988223a2e5c6e3f04f1e, 
> LocalRpcInvocation(requestMultipleJobDetails(Time))) because the fencing 
> token 8fac01d8e3e3988223a2e5c6e3f04f1e did not match the expected fencing 
> token 8c37414f464bca76144e6cabc946474b.





[jira] [Created] (FLINK-21472) encingTokenException: Fencing token mismatch

2021-02-23 Thread hayden zhou (Jira)
hayden zhou created FLINK-21472:
---

 Summary: encingTokenException: Fencing token mismatch
 Key: FLINK-21472
 URL: https://issues.apache.org/jira/browse/FLINK-21472
 Project: Flink
  Issue Type: Bug
  Components: Deployment / Kubernetes
Affects Versions: 1.12.1
Reporter: hayden zhou
 Attachments: 
flink--standalonesession-0-mta-flink-jobmanager-864d6c8cbb-rmsxw.log


org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler [] - Unhandled 
exception.
org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token 
mismatch: Ignoring message LocalFencedMessage(8fac01d8e3e3988223a2e5c6e3f04f1e, 
LocalRpcInvocation(requestMultipleJobDetails(Time))) because the fencing token 
8fac01d8e3e3988223a2e5c6e3f04f1e did not match the expected fencing token 
8c37414f464bca76144e6cabc946474b.





[jira] [Commented] (FLINK-21325) NoResourceAvailableException while cancel then resubmit jobs

2021-02-23 Thread hayden zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17289456#comment-17289456
 ] 

hayden zhou commented on FLINK-21325:
-

The JobManager log has just been attached above.

> NoResourceAvailableException while cancel then resubmit  jobs
> -
>
> Key: FLINK-21325
> URL: https://issues.apache.org/jira/browse/FLINK-21325
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes, Runtime / Coordination
> Environment: FLINK 1.12  with 
> [flink-kubernetes_2.11-1.12-SNAPSHOT.jar] in libs directory to fix FLINK 
> restart problem on k8s HA session mode.
>Reporter: hayden zhou
>Priority: Major
> Attachments: 
> flink--standalonesession-0-mta-flink-jobmanager-56cf5bbf6-2chwc.log
>
>
> I have five streaming jobs and wanted to clear all of their state, so I canceled
> all of those jobs and then resubmitted them one by one. As a result, two jobs are
> in RUNNING status, while three jobs are stuck in CREATED status with the error
> "NoResourceAvailableException: Slot request bulk is not fulfillable! Could not
> allocate the required slot within slot request timeout".
> I am sure my slots are sufficient.
> The problem was fixed by restarting the Kubernetes JobManager and TaskManager pods.
> Below are the error logs:
> {code:java}
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout
>  at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>  at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>  at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
>  at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>  at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
>  at 
> org.apache.flink.runtime.scheduler.SharedSlot.cancelLogicalSlotRequest(SharedSlot.java:195)
>  at 
> org.apache.flink.runtime.scheduler.SlotSharingExecutionSlotAllocator.cancelLogicalSlotRequest(SlotSharingExecutionSlotAllocator.java:147)
>  at 
> org.apache.flink.runtime.scheduler.SharingPhysicalSlotRequestBulk.cancel(SharingPhysicalSlotRequestBulk.java:84)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkWithTimestamp.cancel(PhysicalSlotRequestBulkWithTimestamp.java:66)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:87)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:404)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:197)
>  at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154)
>  at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>  at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>  at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>  at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>  at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>  at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>  at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>  at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>  at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>  at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>  at 
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>  at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>  at 
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout
>  at 
> 

[jira] [Updated] (FLINK-21325) NoResourceAvailableException while cancel then resubmit jobs

2021-02-23 Thread hayden zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hayden zhou updated FLINK-21325:

Attachment: 
flink--standalonesession-0-mta-flink-jobmanager-56cf5bbf6-2chwc.log

> NoResourceAvailableException while cancel then resubmit  jobs
> -
>
> Key: FLINK-21325
> URL: https://issues.apache.org/jira/browse/FLINK-21325
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes, Runtime / Coordination
> Environment: FLINK 1.12  with 
> [flink-kubernetes_2.11-1.12-SNAPSHOT.jar] in libs directory to fix FLINK 
> restart problem on k8s HA session mode.
>Reporter: hayden zhou
>Priority: Major
> Attachments: 
> flink--standalonesession-0-mta-flink-jobmanager-56cf5bbf6-2chwc.log
>
>
> I have five streaming jobs and wanted to clear all of their state, so I canceled
> all of those jobs and then resubmitted them one by one. As a result, two jobs are
> in RUNNING status, while three jobs are stuck in CREATED status with the error
> "NoResourceAvailableException: Slot request bulk is not fulfillable! Could not
> allocate the required slot within slot request timeout".
> I am sure my slots are sufficient.
> The problem was fixed by restarting the Kubernetes JobManager and TaskManager pods.
> Below are the error logs:
> {code:java}
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout
>  at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>  at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>  at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
>  at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>  at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
>  at 
> org.apache.flink.runtime.scheduler.SharedSlot.cancelLogicalSlotRequest(SharedSlot.java:195)
>  at 
> org.apache.flink.runtime.scheduler.SlotSharingExecutionSlotAllocator.cancelLogicalSlotRequest(SlotSharingExecutionSlotAllocator.java:147)
>  at 
> org.apache.flink.runtime.scheduler.SharingPhysicalSlotRequestBulk.cancel(SharingPhysicalSlotRequestBulk.java:84)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkWithTimestamp.cancel(PhysicalSlotRequestBulkWithTimestamp.java:66)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:87)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:404)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:197)
>  at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154)
>  at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>  at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>  at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>  at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>  at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>  at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>  at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>  at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>  at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>  at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>  at 
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>  at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>  at 
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout
>  at 
> 

[jira] [Commented] (FLINK-21325) NoResourceAvailableException while cancel then resubmit jobs

2021-02-21 Thread hayden zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17288151#comment-17288151
 ] 

hayden zhou commented on FLINK-21325:
-

This problem always occurs after the cluster has been running for several days.

> NoResourceAvailableException while cancel then resubmit  jobs
> -
>
> Key: FLINK-21325
> URL: https://issues.apache.org/jira/browse/FLINK-21325
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes, Runtime / Coordination
> Environment: FLINK 1.12  with 
> [flink-kubernetes_2.11-1.12-SNAPSHOT.jar] in libs directory to fix FLINK 
> restart problem on k8s HA session mode.
>Reporter: hayden zhou
>Priority: Major
>
> I have five streaming jobs and wanted to clear all of their state, so I canceled
> all of those jobs and then resubmitted them one by one. As a result, two jobs are
> in RUNNING status, while three jobs are stuck in CREATED status with the error
> "NoResourceAvailableException: Slot request bulk is not fulfillable! Could not
> allocate the required slot within slot request timeout".
> I am sure my slots are sufficient.
> The problem was fixed by restarting the Kubernetes JobManager and TaskManager pods.
> Below are the error logs:
> {code:java}
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout
>  at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>  at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>  at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
>  at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>  at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
>  at 
> org.apache.flink.runtime.scheduler.SharedSlot.cancelLogicalSlotRequest(SharedSlot.java:195)
>  at 
> org.apache.flink.runtime.scheduler.SlotSharingExecutionSlotAllocator.cancelLogicalSlotRequest(SlotSharingExecutionSlotAllocator.java:147)
>  at 
> org.apache.flink.runtime.scheduler.SharingPhysicalSlotRequestBulk.cancel(SharingPhysicalSlotRequestBulk.java:84)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkWithTimestamp.cancel(PhysicalSlotRequestBulkWithTimestamp.java:66)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:87)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:404)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:197)
>  at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154)
>  at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>  at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>  at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>  at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>  at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>  at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>  at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>  at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>  at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>  at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>  at 
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>  at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>  at 
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout
>  at 
> 

[jira] [Commented] (FLINK-21325) NoResourceAvailableException while cancel then resubmit jobs

2021-02-21 Thread hayden zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17288150#comment-17288150
 ] 

hayden zhou commented on FLINK-21325:
-

I found some exceptions like the following:

{code:java}
2021-02-22 11:07:05,150 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Source: 
TableSourceScan(table=[[default_catalog, default_database, rowDataTable, 
project=[app_id, ip, ts, action]]], fields=[app_id, ip, ts, action]) -> 
Calc(select=[app_id, ip], where=[((ts SEARCH Sarg[[1613955600..1613959200)]) 
AND (action SEARCH Sarg[_UTF-16LE'PV':VARCHAR(2147483647) CHARACTER SET 
"UTF-16LE"]:VARCHAR(2147483647) CHARACTER SET "UTF-16LE"))]) -> 
Expand(projects=[app_id, ip, $e], projects=[{app_id, ip, 0 AS $e}, {app_id, 
null AS ip, 1 AS $e}]) -> LocalHashAggregate(groupBy=[app_id, ip, $e], 
select=[app_id, ip, $e, Partial_COUNT(*) AS count1$0]) (1/1) 
(c72791dbe1987c48cb4c302bfe422a62) switched from CREATED to SCHEDULED.
2021-02-22 11:07:05,150 INFO  
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve 
slot request, no ResourceManager connected. Adding as pending request 
[SlotRequestId{e93cb76a2cf4a64e5498ecb9be4d93d0}]
2021-02-22 11:07:05,151 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Source: 
TableSourceScan(table=[[default_catalog, default_database, rowDataTable, 
project=[app_id, path, ts, action, uid]]], fields=[app_id, path, ts, action, 
uid]) -> Calc(select=[app_id, uid, path], where=[((ts SEARCH 
Sarg[[1613955600..1613959200)]) AND (action SEARCH 
Sarg[_UTF-16LE'PV':VARCHAR(2147483647) CHARACTER SET 
"UTF-16LE"]:VARCHAR(2147483647) CHARACTER SET "UTF-16LE"))]) -> 
LocalHashAggregate(groupBy=[app_id, uid, path], select=[app_id, uid, path]) 
(1/1) (4b57afd56e9543727b72ad6b86e088d1) switched from CREATED to SCHEDULED.
2021-02-22 11:07:05,151 INFO  
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve 
slot request, no ResourceManager connected. Adding as pending request 
[SlotRequestId{05243bdf664021ad623b0714e97c9f9d}]
2021-02-22 11:07:05,151 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Source: 
TableSourceScan(table=[[default_catalog, default_database, rowDataTable, 
project=[app_id, ts, action, dac, uid]]], fields=[app_id, ts, action, dac, 
uid]) -> Calc(select=[app_id, CAST(dac) AS $f1, uid], where=[((ts SEARCH 
Sarg[[1613955600..1613959200)]) AND (action SEARCH 
Sarg[_UTF-16LE'HB':VARCHAR(2147483647) CHARACTER SET 
"UTF-16LE"]:VARCHAR(2147483647) CHARACTER SET "UTF-16LE"))]) -> 
Expand(projects=[app_id, $f1, uid, $e], projects=[{app_id, $f1, uid, 0 AS $e}, 
{app_id, $f1, null AS uid, 1 AS $e}]) -> LocalHashAggregate(groupBy=[app_id, 
uid, $e], select=[app_id, uid, $e, Partial_SUM($f1) AS sum$0]) (1/1) 
(953ab9777f59696592b246613c519a8e) switched from CREATED to SCHEDULED.
2021-02-22 11:07:05,151 INFO  
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve 
slot request, no ResourceManager connected. Adding as pending request 
[SlotRequestId{e4425f7443819b5db22f0204db200431}]

{code}
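
As an illustration (not part of the original comment): since the log above shows pending slot requests while no ResourceManager is connected, one cheap cross-check is to ask the JobManager's REST API how many TaskManagers and free slots it actually sees. A minimal sketch in plain Java, assuming the REST endpoint is the mta-flink-jobmanager:8081 service name that appears elsewhere in this thread:

{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ClusterOverviewCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical REST address; adjust host/port to your JobManager service.
        URL overview = new URL("http://mta-flink-jobmanager:8081/overview");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(overview.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                // The JSON response includes "taskmanagers", "slots-total" and
                // "slots-available"; all-zero values would mean the ResourceManager
                // has no registered TaskManagers, matching the log messages above.
                System.out.println(line);
            }
        }
    }
}
{code}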


> NoResourceAvailableException while cancel then resubmit  jobs
> -
>
> Key: FLINK-21325
> URL: https://issues.apache.org/jira/browse/FLINK-21325
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes, Runtime / Coordination
> Environment: FLINK 1.12  with 
> [flink-kubernetes_2.11-1.12-SNAPSHOT.jar] in libs directory to fix FLINK 
> restart problem on k8s HA session mode.
>Reporter: hayden zhou
>Priority: Major
>
> I have five streaming jobs and wanted to clear all of their state, so I canceled
> all of those jobs and then resubmitted them one by one. As a result, two jobs are
> in RUNNING status, while three jobs are stuck in CREATED status with the error
> "NoResourceAvailableException: Slot request bulk is not fulfillable! Could not
> allocate the required slot within slot request timeout".
> I am sure my slots are sufficient.
> The problem was fixed by restarting the Kubernetes JobManager and TaskManager pods.
> Below are the error logs:
> {code:java}
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout
>  at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>  at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>  at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
>  at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>  at 
> 

[jira] [Commented] (FLINK-21325) NoResourceAvailableException while cancel then resubmit jobs

2021-02-21 Thread hayden zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17288136#comment-17288136
 ] 

hayden zhou commented on FLINK-21325:
-

How can I get the logs for the specific failed job? [~rmetzger]

> NoResourceAvailableException while cancel then resubmit  jobs
> -
>
> Key: FLINK-21325
> URL: https://issues.apache.org/jira/browse/FLINK-21325
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes, Runtime / Coordination
> Environment: FLINK 1.12  with 
> [flink-kubernetes_2.11-1.12-SNAPSHOT.jar] in libs directory to fix FLINK 
> restart problem on k8s HA session mode.
>Reporter: hayden zhou
>Priority: Major
>
> I have five streaming jobs and wanted to clear all of their state, so I canceled
> all of those jobs and then resubmitted them one by one. As a result, two jobs are
> in RUNNING status, while three jobs are stuck in CREATED status with the error
> "NoResourceAvailableException: Slot request bulk is not fulfillable! Could not
> allocate the required slot within slot request timeout".
> I am sure my slots are sufficient.
> The problem was fixed by restarting the Kubernetes JobManager and TaskManager pods.
> Below are the error logs:
> {code:java}
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout
>  at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>  at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>  at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
>  at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>  at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
>  at 
> org.apache.flink.runtime.scheduler.SharedSlot.cancelLogicalSlotRequest(SharedSlot.java:195)
>  at 
> org.apache.flink.runtime.scheduler.SlotSharingExecutionSlotAllocator.cancelLogicalSlotRequest(SlotSharingExecutionSlotAllocator.java:147)
>  at 
> org.apache.flink.runtime.scheduler.SharingPhysicalSlotRequestBulk.cancel(SharingPhysicalSlotRequestBulk.java:84)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkWithTimestamp.cancel(PhysicalSlotRequestBulkWithTimestamp.java:66)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:87)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:404)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:197)
>  at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154)
>  at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>  at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>  at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>  at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>  at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>  at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>  at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>  at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>  at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>  at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>  at 
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>  at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>  at 
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout
>  at 
> 

[jira] [Updated] (FLINK-21325) NoResourceAvailableException while cancel then resubmit jobs

2021-02-08 Thread hayden zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hayden zhou updated FLINK-21325:

Description: 
 I have five streaming jobs and wanted to clear all of their state, so I canceled
all of those jobs and then resubmitted them one by one. As a result, two jobs are
in RUNNING status, while three jobs are stuck in CREATED status with the error
"NoResourceAvailableException: Slot request bulk is not fulfillable! Could not
allocate the required slot within slot request timeout".
I am sure my slots are sufficient.

The problem was fixed by restarting the Kubernetes JobManager and TaskManager pods.

Below are the error logs:

{code:java}
java.util.concurrent.CompletionException: 
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
Slot request bulk is not fulfillable! Could not allocate the required slot 
within slot request timeout
 at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
 at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
 at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
 at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
 at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
 at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
 at 
org.apache.flink.runtime.scheduler.SharedSlot.cancelLogicalSlotRequest(SharedSlot.java:195)
 at 
org.apache.flink.runtime.scheduler.SlotSharingExecutionSlotAllocator.cancelLogicalSlotRequest(SlotSharingExecutionSlotAllocator.java:147)
 at 
org.apache.flink.runtime.scheduler.SharingPhysicalSlotRequestBulk.cancel(SharingPhysicalSlotRequestBulk.java:84)
 at 
org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkWithTimestamp.cancel(PhysicalSlotRequestBulkWithTimestamp.java:66)
 at 
org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:87)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:404)
 at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:197)
 at 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
 at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
 at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
 at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
 at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
 at akka.actor.ActorCell.invoke(ActorCell.scala:561)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
 at akka.dispatch.Mailbox.run(Mailbox.scala:225)
 at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
 at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: 
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
Slot request bulk is not fulfillable! Could not allocate the required slot 
within slot request timeout
 at 
org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:84)
 ... 24 more
Caused by: java.util.concurrent.TimeoutException: Timeout has occurred: 30 
ms
 ... 25 more
{code}


  was:
 I have five streaming jobs and wanted to clear all of their state, so I canceled
all of those jobs and then resubmitted them one by one. As a result, two jobs are
in RUNNING status, while three jobs are stuck in CREATED status with the error
"NoResourceAvailableException: Slot request bulk is not fulfillable! Could not
allocate the required slot within slot request timeout".

The problem was fixed by restarting the Kubernetes JobManager and TaskManager pods.

Below are the error logs:

{code:java}
java.util.concurrent.CompletionException: 
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
Slot request bulk is not fulfillable! Could not allocate the required 

[jira] [Updated] (FLINK-21325) NoResourceAvailableException while cancel then resubmit jobs

2021-02-08 Thread hayden zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hayden zhou updated FLINK-21325:

Description: 
 I have five streaming jobs and wanted to clear all of their state, so I canceled
all of those jobs and then resubmitted them one by one. As a result, two jobs are
in RUNNING status, while three jobs are stuck in CREATED status with the error
"NoResourceAvailableException: Slot request bulk is not fulfillable! Could not
allocate the required slot within slot request timeout".

The problem was fixed by restarting the Kubernetes JobManager and TaskManager pods.

Below are the error logs:

{code:java}
java.util.concurrent.CompletionException: 
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
Slot request bulk is not fulfillable! Could not allocate the required slot 
within slot request timeout
 at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
 at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
 at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
 at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
 at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
 at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
 at 
org.apache.flink.runtime.scheduler.SharedSlot.cancelLogicalSlotRequest(SharedSlot.java:195)
 at 
org.apache.flink.runtime.scheduler.SlotSharingExecutionSlotAllocator.cancelLogicalSlotRequest(SlotSharingExecutionSlotAllocator.java:147)
 at 
org.apache.flink.runtime.scheduler.SharingPhysicalSlotRequestBulk.cancel(SharingPhysicalSlotRequestBulk.java:84)
 at 
org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkWithTimestamp.cancel(PhysicalSlotRequestBulkWithTimestamp.java:66)
 at 
org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:87)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:404)
 at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:197)
 at 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
 at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
 at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
 at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
 at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
 at akka.actor.ActorCell.invoke(ActorCell.scala:561)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
 at akka.dispatch.Mailbox.run(Mailbox.scala:225)
 at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
 at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: 
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
Slot request bulk is not fulfillable! Could not allocate the required slot 
within slot request timeout
 at 
org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:84)
 ... 24 more
Caused by: java.util.concurrent.TimeoutException: Timeout has occurred: 30 
ms
 ... 25 more
{code}


  was:
 I have five streaming jobs and wanted to clear all of their state, so I canceled
all of those jobs and then resubmitted them one by one. As a result, two jobs are
in RUNNING status, while three jobs are stuck in CREATED status with the error
"NoResourceAvailableException: Slot request bulk is not fulfillable! Could not
allocate the required slot within slot request timeout".

Below are the error logs:

{code:java}
java.util.concurrent.CompletionException: 
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
Slot request bulk is not fulfillable! Could not allocate the required slot 
within slot request timeout
 at 

[jira] [Created] (FLINK-21325) NoResourceAvailableException while cancel then resubmit jobs

2021-02-08 Thread hayden zhou (Jira)
hayden zhou created FLINK-21325:
---

 Summary: NoResourceAvailableException while cancel then resubmit  
jobs
 Key: FLINK-21325
 URL: https://issues.apache.org/jira/browse/FLINK-21325
 Project: Flink
  Issue Type: Bug
 Environment: FLINK 1.12  with 
[flink-kubernetes_2.11-1.12-SNAPSHOT.jar] in libs directory to fix FLINK 
restart problem on k8s HA session mode.
Reporter: hayden zhou


 I have five streaming jobs and wanted to clear all of their state, so I canceled
all of those jobs and then resubmitted them one by one. As a result, two jobs are
in RUNNING status, while three jobs are stuck in CREATED status with the error
"NoResourceAvailableException: Slot request bulk is not fulfillable! Could not
allocate the required slot within slot request timeout".

Below are the error logs:

{code:java}
java.util.concurrent.CompletionException: 
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
Slot request bulk is not fulfillable! Could not allocate the required slot 
within slot request timeout
 at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
 at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
 at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
 at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
 at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
 at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
 at 
org.apache.flink.runtime.scheduler.SharedSlot.cancelLogicalSlotRequest(SharedSlot.java:195)
 at 
org.apache.flink.runtime.scheduler.SlotSharingExecutionSlotAllocator.cancelLogicalSlotRequest(SlotSharingExecutionSlotAllocator.java:147)
 at 
org.apache.flink.runtime.scheduler.SharingPhysicalSlotRequestBulk.cancel(SharingPhysicalSlotRequestBulk.java:84)
 at 
org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkWithTimestamp.cancel(PhysicalSlotRequestBulkWithTimestamp.java:66)
 at 
org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:87)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:404)
 at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:197)
 at 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
 at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
 at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
 at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
 at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
 at akka.actor.ActorCell.invoke(ActorCell.scala:561)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
 at akka.dispatch.Mailbox.run(Mailbox.scala:225)
 at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
 at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: 
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
Slot request bulk is not fulfillable! Could not allocate the required slot 
within slot request timeout
 at 
org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:84)
 ... 24 more
Caused by: java.util.concurrent.TimeoutException: Timeout has occurred: 30 
ms
 ... 25 more
{code}
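
The trace above ends in a TimeoutException from the slot request timeout. As an illustration only (not from the original report), one stopgap while the root cause is investigated is to raise `slot.request.timeout`. A minimal sketch, assuming Flink 1.12+ where StreamExecutionEnvironment.getExecutionEnvironment(Configuration) is available; for a standalone session cluster the same key would normally be set in flink-conf.yaml instead, and the 600000 ms value is just an example:

{code:java}
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SlotRequestTimeoutSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Raise the slot request timeout from the default 300000 ms (5 minutes)
        // to 10 minutes while debugging slot allocation. Example value only.
        conf.setString("slot.request.timeout", "600000");

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);
        // ... build and execute the job as usual ...
    }
}
{code}

This only buys time for slots to become available; it does not explain why the ResourceManager stopped offering slots after the cancel/resubmit cycle.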






[jira] [Updated] (FLINK-20798) Using PVC as high-availability.storageDir could not work

2021-01-12 Thread hayden zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hayden zhou updated FLINK-20798:

Description: 

When deploying standalone Flink on Kubernetes and configuring 
{{high-availability.storageDir}} to point to a mounted PVC directory, the Flink 
web UI cannot be visited normally. It shows "Service temporarily unavailable 
due to an ongoing leader election. Please refresh".

The following are the related logs from the JobManager.

{code}
2020-12-29T06:45:54.177850394Z 2020-12-29 14:45:54,177 DEBUG 
io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector [] - Leader 
election started
 2020-12-29T06:45:54.177855303Z 2020-12-29 14:45:54,177 DEBUG 
io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector [] - 
Attempting to acquire leader lease 'ConfigMapLock: default - 
mta-flink-resourcemanager-leader (6f6479c6-86cc-4d62-84f9-37ff968bd0e5)'...
 2020-12-29T06:45:54.178668055Z 2020-12-29 14:45:54,178 DEBUG 
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] - WebSocket 
successfully opened
 2020-12-29T06:45:54.178895963Z 2020-12-29 14:45:54,178 INFO 
org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - 
Starting DefaultLeaderRetrievalService with 
KubernetesLeaderRetrievalDriver\{configMapName='mta-flink-resourcemanager-leader'}.
 2020-12-29T06:45:54.179327491Z 2020-12-29 14:45:54,179 DEBUG 
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] - 
Connecting websocket ... 
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager@6d303498
 2020-12-29T06:45:54.230081993Z 2020-12-29 14:45:54,229 DEBUG 
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] - WebSocket 
successfully opened
 2020-12-29T06:45:54.230202329Z 2020-12-29 14:45:54,230 INFO 
org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - 
Starting DefaultLeaderRetrievalService with 
KubernetesLeaderRetrievalDriver\{configMapName='mta-flink-dispatcher-leader'}.
 2020-12-29T06:45:54.230219281Z 2020-12-29 14:45:54,229 DEBUG 
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] - WebSocket 
successfully opened
 2020-12-29T06:45:54.230353912Z 2020-12-29 14:45:54,230 INFO 
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - 
Starting DefaultLeaderElectionService with 
KubernetesLeaderElectionDriver\{configMapName='mta-flink-resourcemanager-leader'}.
 2020-12-29T06:45:54.237004177Z 2020-12-29 14:45:54,236 DEBUG 
io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector [] - Leader 
changed from null to 6f6479c6-86cc-4d62-84f9-37ff968bd0e5
 2020-12-29T06:45:54.237024655Z 2020-12-29 14:45:54,236 INFO 
org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - 
New leader elected 6f6479c6-86cc-4d62-84f9-37ff968bd0e5 for 
mta-flink-restserver-leader.
 2020-12-29T06:45:54.237027811Z 2020-12-29 14:45:54,236 DEBUG 
io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector [] - 
Successfully Acquired leader lease 'ConfigMapLock: default - 
mta-flink-restserver-leader (6f6479c6-86cc-4d62-84f9-37ff968bd0e5)'
 2020-12-29T06:45:54.237297376Z 2020-12-29 14:45:54,237 DEBUG 
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - Grant 
leadership to contender 
[http://mta-flink-jobmanager:8081|http://mta-flink-jobmanager:8081/] with 
session ID 9587e13f-322f-4cd5-9fff-b4941462be0f.
 2020-12-29T06:45:54.237353551Z 2020-12-29 14:45:54,237 INFO 
org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint [] - 
[http://mta-flink-jobmanager:8081|http://mta-flink-jobmanager:8081/] was 
granted leadership with leaderSessionID=9587e13f-322f-4cd5-9fff-b4941462be0f
 2020-12-29T06:45:54.237440354Z 2020-12-29 14:45:54,237 DEBUG 
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - 
Confirm leader session ID 9587e13f-322f-4cd5-9fff-b4941462be0f for leader 
[http://mta-flink-jobmanager:8081|http://mta-flink-jobmanager:8081/].
 2020-12-29T06:45:54.254555127Z 2020-12-29 14:45:54,254 DEBUG 
io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector [] - Leader 
changed from null to 6f6479c6-86cc-4d62-84f9-37ff968bd0e5
 2020-12-29T06:45:54.254588299Z 2020-12-29 14:45:54,254 INFO 
org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - 
New leader elected 6f6479c6-86cc-4d62-84f9-37ff968bd0e5 for 
mta-flink-resourcemanager-leader.
 2020-12-29T06:45:54.254628053Z 2020-12-29 14:45:54,254 DEBUG 
io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector [] - 
Successfully Acquired leader lease 'ConfigMapLock: default - 
mta-flink-resourcemanager-leader (6f6479c6-86cc-4d62-84f9-37ff968bd0e5)'
 2020-12-29T06:45:54.254871569Z 2020-12-29 14:45:54,254 DEBUG 
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - Grant 
leadership to contender LeaderContender: StandaloneResourceManager with session 
ID 

[jira] [Created] (FLINK-20938) Implement Flink's own tencent COS filesystem

2021-01-12 Thread hayden zhou (Jira)
hayden zhou created FLINK-20938:
---

 Summary: Implement Flink's own tencent COS filesystem
 Key: FLINK-20938
 URL: https://issues.apache.org/jira/browse/FLINK-20938
 Project: Flink
  Issue Type: New Feature
  Components: FileSystems
Affects Versions: 1.12.0
Reporter: hayden zhou


Tencent's COS is widely used by cloud users in China, and Hadoop has supported 
Tencent COS since 3.3.0. 

This JIRA proposes wrapping CosNFileSystem in Flink (similar to the OSS support) 
so that users can read from and write to COS more easily in Flink. 

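For illustration only (not part of the original request): once such a wrapper exists, usage would presumably mirror the existing OSS support, i.e. jobs address COS through a `cosn://` URI, the scheme used by Hadoop's CosN filesystem. The bucket and path below are made up, and the assumption is that the plugin registers the "cosn" scheme the same way flink-oss-fs-hadoop registers "oss":

{code:java}
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CosReadSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical bucket and path, for illustration only.
        DataStream<String> lines = env.readTextFile("cosn://my-bucket/path/to/input");
        lines.print();

        env.execute("cos-read-sketch");
    }
}
{code}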




[jira] [Closed] (FLINK-20798) Using PVC as high-availability.storageDir could not work

2021-01-02 Thread hayden zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hayden zhou closed FLINK-20798.
---
Resolution: Fixed

> Using PVC as high-availability.storageDir could not work
> 
>
> Key: FLINK-20798
> URL: https://issues.apache.org/jira/browse/FLINK-20798
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.12.0
> Environment: FLINK 1.12.0
>Reporter: hayden zhou
>Priority: Major
> Attachments: flink.log
>
>
> I deployed Flink on Kubernetes using a PVC as the high-availability storageDir. The 
> JobManager logs show that the leader election succeeded, but the web UI keeps 
> reporting that an election is in progress.
> When deploying standalone Flink on Kubernetes and configuring 
> {{high-availability.storageDir}} to point to a mounted PVC directory, the Flink 
> web UI cannot be visited normally. It shows "Service temporarily unavailable 
> due to an ongoing leader election. Please refresh".
>  
> The following are the related logs from the JobManager.
> {code}
> 2020-12-29T06:45:54.177850394Z 2020-12-29 14:45:54,177 DEBUG 
> io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector [] - 
> Leader election started
>  2020-12-29T06:45:54.177855303Z 2020-12-29 14:45:54,177 DEBUG 
> io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector [] - 
> Attempting to acquire leader lease 'ConfigMapLock: default - 
> mta-flink-resourcemanager-leader (6f6479c6-86cc-4d62-84f9-37ff968bd0e5)'...
>  2020-12-29T06:45:54.178668055Z 2020-12-29 14:45:54,178 DEBUG 
> io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] - 
> WebSocket successfully opened
>  2020-12-29T06:45:54.178895963Z 2020-12-29 14:45:54,178 INFO 
> org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - 
> Starting DefaultLeaderRetrievalService with 
> KubernetesLeaderRetrievalDriver\{configMapName='mta-flink-resourcemanager-leader'}.
>  2020-12-29T06:45:54.179327491Z 2020-12-29 14:45:54,179 DEBUG 
> io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] - 
> Connecting websocket ... 
> io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager@6d303498
>  2020-12-29T06:45:54.230081993Z 2020-12-29 14:45:54,229 DEBUG 
> io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] - 
> WebSocket successfully opened
>  2020-12-29T06:45:54.230202329Z 2020-12-29 14:45:54,230 INFO 
> org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - 
> Starting DefaultLeaderRetrievalService with 
> KubernetesLeaderRetrievalDriver\{configMapName='mta-flink-dispatcher-leader'}.
>  2020-12-29T06:45:54.230219281Z 2020-12-29 14:45:54,229 DEBUG 
> io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] - 
> WebSocket successfully opened
>  2020-12-29T06:45:54.230353912Z 2020-12-29 14:45:54,230 INFO 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - 
> Starting DefaultLeaderElectionService with 
> KubernetesLeaderElectionDriver\{configMapName='mta-flink-resourcemanager-leader'}.
>  2020-12-29T06:45:54.237004177Z 2020-12-29 14:45:54,236 DEBUG 
> io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector [] - 
> Leader changed from null to 6f6479c6-86cc-4d62-84f9-37ff968bd0e5
>  2020-12-29T06:45:54.237024655Z 2020-12-29 14:45:54,236 INFO 
> org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - 
> New leader elected 6f6479c6-86cc-4d62-84f9-37ff968bd0e5 for 
> mta-flink-restserver-leader.
>  2020-12-29T06:45:54.237027811Z 2020-12-29 14:45:54,236 DEBUG 
> io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector [] - 
> Successfully Acquired leader lease 'ConfigMapLock: default - 
> mta-flink-restserver-leader (6f6479c6-86cc-4d62-84f9-37ff968bd0e5)'
>  2020-12-29T06:45:54.237297376Z 2020-12-29 14:45:54,237 DEBUG 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - 
> Grant leadership to contender 
> [http://mta-flink-jobmanager:8081|http://mta-flink-jobmanager:8081/] with 
> session ID 9587e13f-322f-4cd5-9fff-b4941462be0f.
>  2020-12-29T06:45:54.237353551Z 2020-12-29 14:45:54,237 INFO 
> org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint [] - 
> [http://mta-flink-jobmanager:8081|http://mta-flink-jobmanager:8081/] was 
> granted leadership with leaderSessionID=9587e13f-322f-4cd5-9fff-b4941462be0f
>  2020-12-29T06:45:54.237440354Z 2020-12-29 14:45:54,237 DEBUG 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - 
> Confirm leader session ID 9587e13f-322f-4cd5-9fff-b4941462be0f for leader 
> [http://mta-flink-jobmanager:8081|http://mta-flink-jobmanager:8081/].
>  2020-12-29T06:45:54.254555127Z 2020-12-29 14:45:54,254 DEBUG 
> io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector [] - 
> Leader changed from null to 

[jira] [Commented] (FLINK-20798) Using PVC as high-availability.storageDir could not work

2020-12-30 Thread hayden zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256416#comment-17256416
 ] 

hayden zhou commented on FLINK-20798:
-

Many thanks to [~fly_in_gis], who patiently helped me debug and find the 
problem. 

> Using PVC as high-availability.storageDir could not work
> 
>
> Key: FLINK-20798
> URL: https://issues.apache.org/jira/browse/FLINK-20798
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.12.0
> Environment: FLINK 1.12.0
>Reporter: hayden zhou
>Priority: Major
> Attachments: flink.log
>
>
> I deployed Flink on Kubernetes using a PVC as the high-availability storageDir. The 
> JobManager logs show that the leader election succeeded, but the web UI keeps 
> reporting that an election is in progress.
> When deploying standalone Flink on Kubernetes and configuring 
> {{high-availability.storageDir}} to point to a mounted PVC directory, the Flink 
> web UI cannot be visited normally. It shows "Service temporarily unavailable 
> due to an ongoing leader election. Please refresh".
>  
> The following are the related logs from the JobManager.
> ```
> 2020-12-29T06:45:54.177850394Z 2020-12-29 14:45:54,177 DEBUG 
> io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector [] - 
> Leader election started
>  2020-12-29T06:45:54.177855303Z 2020-12-29 14:45:54,177 DEBUG 
> io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector [] - 
> Attempting to acquire leader lease 'ConfigMapLock: default - 
> mta-flink-resourcemanager-leader (6f6479c6-86cc-4d62-84f9-37ff968bd0e5)'...
>  2020-12-29T06:45:54.178668055Z 2020-12-29 14:45:54,178 DEBUG 
> io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] - 
> WebSocket successfully opened
>  2020-12-29T06:45:54.178895963Z 2020-12-29 14:45:54,178 INFO 
> org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - 
> Starting DefaultLeaderRetrievalService with 
> KubernetesLeaderRetrievalDriver\{configMapName='mta-flink-resourcemanager-leader'}.
>  2020-12-29T06:45:54.179327491Z 2020-12-29 14:45:54,179 DEBUG 
> io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] - 
> Connecting websocket ... 
> io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager@6d303498
>  2020-12-29T06:45:54.230081993Z 2020-12-29 14:45:54,229 DEBUG 
> io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] - 
> WebSocket successfully opened
>  2020-12-29T06:45:54.230202329Z 2020-12-29 14:45:54,230 INFO 
> org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - 
> Starting DefaultLeaderRetrievalService with 
> KubernetesLeaderRetrievalDriver\{configMapName='mta-flink-dispatcher-leader'}.
>  2020-12-29T06:45:54.230219281Z 2020-12-29 14:45:54,229 DEBUG 
> io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] - 
> WebSocket successfully opened
>  2020-12-29T06:45:54.230353912Z 2020-12-29 14:45:54,230 INFO 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - 
> Starting DefaultLeaderElectionService with 
> KubernetesLeaderElectionDriver\{configMapName='mta-flink-resourcemanager-leader'}.
>  2020-12-29T06:45:54.237004177Z 2020-12-29 14:45:54,236 DEBUG 
> io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector [] - 
> Leader changed from null to 6f6479c6-86cc-4d62-84f9-37ff968bd0e5
>  2020-12-29T06:45:54.237024655Z 2020-12-29 14:45:54,236 INFO 
> org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - 
> New leader elected 6f6479c6-86cc-4d62-84f9-37ff968bd0e5 for 
> mta-flink-restserver-leader.
>  2020-12-29T06:45:54.237027811Z 2020-12-29 14:45:54,236 DEBUG 
> io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector [] - 
> Successfully Acquired leader lease 'ConfigMapLock: default - 
> mta-flink-restserver-leader (6f6479c6-86cc-4d62-84f9-37ff968bd0e5)'
>  2020-12-29T06:45:54.237297376Z 2020-12-29 14:45:54,237 DEBUG 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - 
> Grant leadership to contender 
> http://mta-flink-jobmanager:8081 with 
> session ID 9587e13f-322f-4cd5-9fff-b4941462be0f.
>  2020-12-29T06:45:54.237353551Z 2020-12-29 14:45:54,237 INFO 
> org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint [] - 
> http://mta-flink-jobmanager:8081 was 
> granted leadership with leaderSessionID=9587e13f-322f-4cd5-9fff-b4941462be0f
>  2020-12-29T06:45:54.237440354Z 2020-12-29 14:45:54,237 DEBUG 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - 
> Confirm leader session ID 9587e13f-322f-4cd5-9fff-b4941462be0f for leader 
> http://mta-flink-jobmanager:8081.
>  2020-12-29T06:45:54.254555127Z 2020-12-29 14:45:54,254 DEBUG 
> 

[jira] [Comment Edited] (FLINK-20798) Using PVC as high-availability.storageDir could not work

2020-12-29 Thread hayden zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256280#comment-17256280
 ] 

hayden zhou edited comment on FLINK-20798 at 12/30/20, 4:39 AM:


[^flink.log] this is the log

 

my flink-conf.yaml:

kubernetes.cluster-id: mta-flink
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: file:///opt/flink/nfs/ha
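
For reference, the same three settings can also be assembled programmatically when building a cluster configuration. This is only a hedged sketch: the option keys and values are copied from the snippet above, and the class name {{HaConfigSketch}} is made up for illustration.

{code:java}
import org.apache.flink.configuration.Configuration;

// Illustrative only: mirrors the flink-conf.yaml entries quoted above.
public class HaConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Keys and values copied verbatim from the ConfigMap above.
        conf.setString("kubernetes.cluster-id", "mta-flink");
        conf.setString(
                "high-availability",
                "org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory");
        conf.setString("high-availability.storageDir", "file:///opt/flink/nfs/ha");
        System.out.println(conf);
    }
}
{code}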

 

 

Can I just use FileSystemHaServiceFactory to replace KubernetesHaServicesFactory 
in the ConfigMap?


was (Author: hayden zhou):
[^flink.log] this is the log

> Using PVC as high-availability.storageDir could not work
> 
>
> Key: FLINK-20798
> URL: https://issues.apache.org/jira/browse/FLINK-20798
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.12.0
> Environment: FLINK 1.12.0
>Reporter: hayden zhou
>Priority: Major
> Attachments: flink.log
>
>
> I deployed Flink to K8s using a PVC as the high-availability storageDir. Looking 
> at the JobManager logs, the leader election succeeded, but the web UI keeps 
> showing that an election is in progress.

[jira] [Commented] (FLINK-20798) Using PVC as high-availability.storageDir could not work

2020-12-29 Thread hayden zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256281#comment-17256281
 ] 

hayden zhou commented on FLINK-20798:
-

Can we chat on WeChat? My WeChat account is `zmfk2009`.

> Using PVC as high-availability.storageDir could not work
> 
>
> Key: FLINK-20798
> URL: https://issues.apache.org/jira/browse/FLINK-20798
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.12.0
> Environment: FLINK 1.12.0
>Reporter: hayden zhou
>Priority: Major
> Attachments: flink.log
>
>
> I deployed Flink to K8s using a PVC as the high-availability storageDir. Looking 
> at the JobManager logs, the leader election succeeded, but the web UI keeps 
> showing that an election is in progress.

[jira] [Commented] (FLINK-20798) Using PVC as high-availability.storageDir could not work

2020-12-29 Thread hayden zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256280#comment-17256280
 ] 

hayden zhou commented on FLINK-20798:
-

[^flink.log] this is the log

> Using PVC as high-availability.storageDir could not work
> 
>
> Key: FLINK-20798
> URL: https://issues.apache.org/jira/browse/FLINK-20798
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.12.0
> Environment: FLINK 1.12.0
>Reporter: hayden zhou
>Priority: Major
> Attachments: flink.log
>
>
> I deployed Flink to K8s using a PVC as the high-availability storageDir. Looking 
> at the JobManager logs, the leader election succeeded, but the web UI keeps 
> showing that an election is in progress.

[jira] [Updated] (FLINK-20798) Using PVC as high-availability.storageDir could not work

2020-12-29 Thread hayden zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hayden zhou updated FLINK-20798:

Attachment: flink.log

> Using PVC as high-availability.storageDir could not work
> 
>
> Key: FLINK-20798
> URL: https://issues.apache.org/jira/browse/FLINK-20798
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.12.0
> Environment: FLINK 1.12.0
>Reporter: hayden zhou
>Priority: Major
> Attachments: flink.log
>
>
> I deployed Flink to K8s using a PVC as the high-availability storageDir. Looking 
> at the JobManager logs, the leader election succeeded, but the web UI keeps 
> showing that an election is in progress.

[jira] [Commented] (FLINK-17598) Implement FileSystemHAServices for native K8s setups

2020-12-29 Thread hayden zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256246#comment-17256246
 ] 

hayden zhou commented on FLINK-17598:
-

[~fly_in_gis] I am trying to deploy Flink on K8s in HA mode, using a PV as the 
HA `storageDir`. It seems the leader election is successful,

but I get the error "Service temporarily unavailable due to an ongoing leader 
election. Please refresh." when I submit a job.

> Implement FileSystemHAServices for native K8s setups
> 
>
> Key: FLINK-17598
> URL: https://issues.apache.org/jira/browse/FLINK-17598
> Project: Flink
>  Issue Type: Sub-task
>  Components: Deployment / Kubernetes, Runtime / Coordination
>Reporter: Canbin Zheng
>Priority: Major
>
> At the moment we use ZooKeeper as a distributed coordinator for implementing 
> JobManager high availability services. But in the cloud-native environment, 
> there is a trend that more and more users prefer *Kubernetes* as the 
> underlying scheduler backend and *object storage* as the storage medium; 
> neither of these services requires a ZooKeeper deployment.
> As a result, in K8s setups people have to deploy and maintain their own 
> ZooKeeper clusters just to solve the JobManager SPOF. This ticket proposes to 
> provide a simplified FileSystem HA implementation with the leader election 
> removed, which saves the effort of deploying ZooKeeper.
> To achieve this, we plan to 
> # Introduce a {{FileSystemHaServices}} which implements the 
> {{HighAvailabilityServices}}.
> # Replace Deployment with StatefulSet to ensure *at most one* semantics, 
> preventing potential concurrent access to the underlying FileSystem.
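
To make the second point concrete, here is a toy, JDK-only sketch (not Flink code; the directory and marker-file names are hypothetical) of why at-most-one write access to the shared storage directory matters once leader election is removed: the first instance atomically claims a marker file, and a second instance fails fast instead of touching shared state.

{code:java}
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Toy illustration only, not part of Flink: without leader election, something
// else (for example a single-replica StatefulSet) must guarantee that at most
// one JobManager writes to the shared HA storage directory.
public class SingleWriterGuard {
    public static void main(String[] args) throws IOException {
        Path haDir = Paths.get("/tmp/flink-ha");          // hypothetical storageDir
        Path marker = haDir.resolve("jobmanager.lock");   // hypothetical marker file

        Files.createDirectories(haDir);
        try {
            // Atomic create: fails if another instance has already claimed the directory.
            Files.createFile(marker);
            System.out.println("This instance owns " + haDir);
        } catch (FileAlreadyExistsException e) {
            System.err.println("Another instance already claimed " + haDir + "; exiting.");
            System.exit(1);
        }
    }
}
{code}

Atomic-create guarantees differ across network file systems, which is part of why the proposal relies on StatefulSet scheduling semantics rather than on file locks.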



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-17598) Implement FileSystemHAServices for native K8s setups

2020-12-29 Thread hayden zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256246#comment-17256246
 ] 

hayden zhou edited comment on FLINK-17598 at 12/30/20, 3:07 AM:


[~fly_in_gis] I am trying to deploy Flink on K8s in HA mode, using a PV as the 
HA `storageDir`. It seems the leader election is successful,

but I get the error "Service temporarily unavailable due to an ongoing leader 
election. Please refresh." when I submit a job.

Details are below:

https://stackoverflow.com/questions/65487789/can-flink-on-k8s-use-pv-using-nfs-and-pvc-as-the-high-avalibility-storagedir


was (Author: hayden zhou):
[~fly_in_gis] I am trying to deploy Flink on K8s in HA mode, using a PV as the 
HA `storageDir`. It seems the leader election is successful,

but I get the error "Service temporarily unavailable due to an ongoing leader 
election. Please refresh." when I submit a job.

> Implement FileSystemHAServices for native K8s setups
> 
>
> Key: FLINK-17598
> URL: https://issues.apache.org/jira/browse/FLINK-17598
> Project: Flink
>  Issue Type: Sub-task
>  Components: Deployment / Kubernetes, Runtime / Coordination
>Reporter: Canbin Zheng
>Priority: Major
>
> At the moment we use ZooKeeper as a distributed coordinator for implementing 
> JobManager high availability services. But in the cloud-native environment, 
> there is a trend that more and more users prefer *Kubernetes* as the 
> underlying scheduler backend and *object storage* as the storage medium; 
> neither of these services requires a ZooKeeper deployment.
> As a result, in K8s setups people have to deploy and maintain their own 
> ZooKeeper clusters just to solve the JobManager SPOF. This ticket proposes to 
> provide a simplified FileSystem HA implementation with the leader election 
> removed, which saves the effort of deploying ZooKeeper.
> To achieve this, we plan to 
> # Introduce a {{FileSystemHaServices}} which implements the 
> {{HighAvailabilityServices}}.
> # Replace Deployment with StatefulSet to ensure *at most one* semantics, 
> preventing potential concurrent access to the underlying FileSystem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-20798) Service temporarily unavailable due to an ongoing leader election. Please refresh.

2020-12-28 Thread hayden zhou (Jira)
hayden zhou created FLINK-20798:
---

 Summary: Service temporarily unavailable due to an ongoing leader 
election. Please refresh.
 Key: FLINK-20798
 URL: https://issues.apache.org/jira/browse/FLINK-20798
 Project: Flink
  Issue Type: Bug
  Components: Deployment / Kubernetes
Affects Versions: 1.12.0
 Environment: FLINK 1.12.0
Reporter: hayden zhou


I deployed Flink to K8s using a PVC as the high-availability storageDir. Looking 
at the JobManager logs, the leader election succeeded, but the web UI keeps 
showing that an election is in progress.

 

The following is the JobManager log:

```

2020-12-29T06:45:54.177850394Z 2020-12-29 14:45:54,177 DEBUG 
io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector [] - Leader 
election started
2020-12-29T06:45:54.177855303Z 2020-12-29 14:45:54,177 DEBUG 
io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector [] - 
Attempting to acquire leader lease 'ConfigMapLock: default - 
mta-flink-resourcemanager-leader (6f6479c6-86cc-4d62-84f9-37ff968bd0e5)'...
2020-12-29T06:45:54.178668055Z 2020-12-29 14:45:54,178 DEBUG 
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] - WebSocket 
successfully opened
2020-12-29T06:45:54.178895963Z 2020-12-29 14:45:54,178 INFO 
org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - 
Starting DefaultLeaderRetrievalService with 
KubernetesLeaderRetrievalDriver\{configMapName='mta-flink-resourcemanager-leader'}.
2020-12-29T06:45:54.179327491Z 2020-12-29 14:45:54,179 DEBUG 
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] - 
Connecting websocket ... 
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager@6d303498
2020-12-29T06:45:54.230081993Z 2020-12-29 14:45:54,229 DEBUG 
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] - WebSocket 
successfully opened
2020-12-29T06:45:54.230202329Z 2020-12-29 14:45:54,230 INFO 
org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - 
Starting DefaultLeaderRetrievalService with 
KubernetesLeaderRetrievalDriver\{configMapName='mta-flink-dispatcher-leader'}.
2020-12-29T06:45:54.230219281Z 2020-12-29 14:45:54,229 DEBUG 
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] - WebSocket 
successfully opened
2020-12-29T06:45:54.230353912Z 2020-12-29 14:45:54,230 INFO 
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - 
Starting DefaultLeaderElectionService with 
KubernetesLeaderElectionDriver\{configMapName='mta-flink-resourcemanager-leader'}.
2020-12-29T06:45:54.237004177Z 2020-12-29 14:45:54,236 DEBUG 
io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector [] - Leader 
changed from null to 6f6479c6-86cc-4d62-84f9-37ff968bd0e5
2020-12-29T06:45:54.237024655Z 2020-12-29 14:45:54,236 INFO 
org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - 
New leader elected 6f6479c6-86cc-4d62-84f9-37ff968bd0e5 for 
mta-flink-restserver-leader.
2020-12-29T06:45:54.237027811Z 2020-12-29 14:45:54,236 DEBUG 
io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector [] - 
Successfully Acquired leader lease 'ConfigMapLock: default - 
mta-flink-restserver-leader (6f6479c6-86cc-4d62-84f9-37ff968bd0e5)'
2020-12-29T06:45:54.237297376Z 2020-12-29 14:45:54,237 DEBUG 
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - Grant 
leadership to contender http://mta-flink-jobmanager:8081 with session ID 
9587e13f-322f-4cd5-9fff-b4941462be0f.
2020-12-29T06:45:54.237353551Z 2020-12-29 14:45:54,237 INFO 
org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint [] - 
http://mta-flink-jobmanager:8081 was granted leadership with 
leaderSessionID=9587e13f-322f-4cd5-9fff-b4941462be0f
2020-12-29T06:45:54.237440354Z 2020-12-29 14:45:54,237 DEBUG 
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - 
Confirm leader session ID 9587e13f-322f-4cd5-9fff-b4941462be0f for leader 
http://mta-flink-jobmanager:8081.
2020-12-29T06:45:54.254555127Z 2020-12-29 14:45:54,254 DEBUG 
io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector [] - Leader 
changed from null to 6f6479c6-86cc-4d62-84f9-37ff968bd0e5
2020-12-29T06:45:54.254588299Z 2020-12-29 14:45:54,254 INFO 
org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - 
New leader elected 6f6479c6-86cc-4d62-84f9-37ff968bd0e5 for 
mta-flink-resourcemanager-leader.
2020-12-29T06:45:54.254628053Z 2020-12-29 14:45:54,254 DEBUG 
io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector [] - 
Successfully Acquired leader lease 'ConfigMapLock: default - 
mta-flink-resourcemanager-leader (6f6479c6-86cc-4d62-84f9-37ff968bd0e5)'
2020-12-29T06:45:54.254871569Z 2020-12-29 14:45:54,254 DEBUG 
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - Grant 
leadership to contender LeaderContender: StandaloneResourceManager with session 
ID b1730dc6-0f94-49f4-b519-56917f3027b7.
2020-12-29T06:45:54.256608291Z 

[jira] [Created] (FLINK-20797) can flink on k8s use pv using NFS and pvc as the high availability storagedir

2020-12-28 Thread hayden zhou (Jira)
hayden zhou created FLINK-20797:
---

 Summary: can flink on k8s use pv using NFS and pvc as the high 
availability storagedir
 Key: FLINK-20797
 URL: https://issues.apache.org/jira/browse/FLINK-20797
 Project: Flink
  Issue Type: New Feature
  Components: Client / Job Submission
 Environment: FLINK 1.12.0

 
Reporter: hayden zhou


I want to deploy Flink on K8s in HA mode without deploying an HDFS cluster. I 
have an NFS server, so I created a PV that uses NFS as the backend storage and 
a PVC for the deployment to mount.

this is my FLINK configMap:

```
kubernetes.cluster-id: mta-flink
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: file:///opt/flink/nfs/ha
```

and this is my jobmanager yaml file:

```
volumeMounts:
  - name: flink-config-volume
    mountPath: /opt/flink/conf
  - name: flink-nfs
    mountPath: /opt/flink/nfs
securityContext:
  runAsUser:   # refers to user _flink_ from official flink image, change if necessary
  #fsGroup: 
volumes:
  - name: flink-config-volume
    configMap:
      name: mta-flink-config
      items:
        - key: flink-conf.yaml
          path: flink-conf.yaml
        - key: log4j-console.properties
          path: log4j-console.properties
  - name: flink-nfs
    persistentVolumeClaim:
      claimName: mta-flink-nfs-pvc
```

It can be deployed successfully, but if I browse the jobmanager:8081 website, I 
get the result below:

```

{"errors": ["Service temporarily unavailable due to an ongoing leader election. 
Please refresh."]}

```

 

Can a PVC be used as `high-availability.storageDir`? If it can be used, how can 
I fix this error?
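
One quick way to confirm the symptom outside the browser is to query the REST endpoint directly and inspect the status code and body. Below is a minimal JDK-only sketch; the host and port are taken from the deployment above, {{/overview}} is just one example path, and the 503 status is what the REST server typically answers while no leader has been confirmed.

{code:java}
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Illustrative check of the Flink REST endpoint; prints the HTTP status and body.
public class RestLeaderCheck {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://mta-flink-jobmanager:8081/overview");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(5_000);
        conn.setReadTimeout(5_000);

        int status = conn.getResponseCode();
        InputStream stream = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
        StringBuilder body = new StringBuilder();
        if (stream != null) {
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(stream, StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line);
                }
            }
        }
        // While the election is unresolved, the body contains the
        // "Service temporarily unavailable due to an ongoing leader election" error.
        System.out.println(status + " -> " + body);
    }
}
{code}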



--
This message was sent by Atlassian Jira
(v8.3.4#803005)