[jira] [Closed] (FLINK-32461) manage union operator state increase very large in Jobmanager

2023-07-01 Thread wgcn (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wgcn closed FLINK-32461.

Release Note: it's a usage problem
  Resolution: Fixed

> manage  union operator state increase very large in Jobmanager 
> ---
>
> Key: FLINK-32461
> URL: https://issues.apache.org/jira/browse/FLINK-32461
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.17.1
>Reporter: wgcn
>Priority: Major
> Attachments: image-2023-06-28-16-24-11-538.png, screenshot-1.png
>
>
> This issue doesn't usually occur, but it happens during busy nights when the
> machines are more active. The "managed operator state" grows significantly, and
> I found that the number of operator union state objects is 128, the same as the
> parallelism. Shouldn't the union state only need to be loaded once?
>  !screenshot-1.png! 
>  !image-2023-06-28-16-24-11-538.png! 
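The resolution above ("it's a usage problem") is consistent with how union list state is redistributed: on restore, every parallel subtask receives the complete union of all subtasks' entries, so with parallelism 128 the restored job references 128 full copies of the state. A minimal sketch of declaring union state, assuming a hypothetical UnionStateExample function rather than the actual operator from this job:

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;

// Hypothetical function, only to illustrate union-list-state redistribution.
public class UnionStateExample extends RichMapFunction<Long, Long>
        implements CheckpointedFunction {

    private transient ListState<Long> unionState;

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        // Union redistribution: on restore, every parallel subtask is handed the
        // complete union of what all subtasks snapshotted. With parallelism 128,
        // the restored job therefore holds 128 copies of the full state, which is
        // consistent with the 128 union state objects observed in the report above.
        unionState = context.getOperatorStateStore().getUnionListState(
                new ListStateDescriptor<>("kafka-offsets", Long.class));
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // Each subtask writes only its own entry; the fan-out happens on restore.
        unionState.clear();
        unionState.add((long) getRuntimeContext().getIndexOfThisSubtask());
    }

    @Override
    public Long map(Long value) {
        return value;
    }
}
```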



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] (FLINK-32461) manage union operator state increase very large in Jobmanager

2023-06-30 Thread wgcn (Jira)


[ https://issues.apache.org/jira/browse/FLINK-32461 ]


wgcn deleted comment on FLINK-32461:
--

was (Author: 1026688210):
This issue is related to https://issues.apache.org/jira/browse/FLINK-21436.
Can we reopen it and continue the discussion?


> manage  union operator state increase very large in Jobmanager 
> ---
>
> Key: FLINK-32461
> URL: https://issues.apache.org/jira/browse/FLINK-32461
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.17.1
>Reporter: wgcn
>Priority: Major
> Attachments: image-2023-06-28-16-24-11-538.png, screenshot-1.png
>
>
> This issue doesn't usually occur, but it happens during busy nights when the
> machines are more active. The "managed operator state" grows significantly, and
> I found that the number of operator union state objects is 128, the same as the
> parallelism. Shouldn't the union state only need to be loaded once?
>  !screenshot-1.png! 
>  !image-2023-06-28-16-24-11-538.png! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-32461) manage union operator state increase very large in Jobmanager

2023-06-30 Thread wgcn (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738928#comment-17738928
 ] 

wgcn commented on FLINK-32461:
--

This issue is related to https://issues.apache.org/jira/browse/FLINK-21436.
Can we reopen it and continue the discussion?


> manage  union operator state increase very large in Jobmanager 
> ---
>
> Key: FLINK-32461
> URL: https://issues.apache.org/jira/browse/FLINK-32461
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.17.1
>Reporter: wgcn
>Priority: Major
> Attachments: image-2023-06-28-16-24-11-538.png, screenshot-1.png
>
>
> This issue doesn't usually occur, but it happens during busy nights when the
> machines are more active. The "managed operator state" grows significantly, and
> I found that the number of operator union state objects is 128, the same as the
> parallelism. Shouldn't the union state only need to be loaded once?
>  !screenshot-1.png! 
>  !image-2023-06-28-16-24-11-538.png! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32461) manage union operator state increase very large in Jobmanager

2023-06-29 Thread wgcn (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wgcn updated FLINK-32461:
-
Summary: manage  union operator state increase very large in Jobmanager   
(was: manage operator state increase very large )

> manage  union operator state increase very large in Jobmanager 
> ---
>
> Key: FLINK-32461
> URL: https://issues.apache.org/jira/browse/FLINK-32461
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.17.1
>Reporter: wgcn
>Priority: Major
> Attachments: image-2023-06-28-16-24-11-538.png, screenshot-1.png
>
>
> This issue doesn't usually occur, but it happens during busy nights when the
> machines are more active. The "managed operator state" grows significantly, and
> I found that the number of operator union state objects is 128, the same as the
> parallelism. Shouldn't the union state only need to be loaded once?
>  !screenshot-1.png! 
>  !image-2023-06-28-16-24-11-538.png! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] (FLINK-32461) manage operator state increase very large

2023-06-29 Thread wgcn (Jira)


[ https://issues.apache.org/jira/browse/FLINK-32461 ]


wgcn deleted comment on FLINK-32461:
--

was (Author: 1026688210):
>> This issue doesn't usually occur, but it happens during busy nights when the
>> machines are more active. The "managed operator state" grows significantly, and
>> I can see that it mostly stores Kafka offsets.

The task only has 2 topics, and I haven't figured out why the state is so large.

> manage operator state increase very large 
> --
>
> Key: FLINK-32461
> URL: https://issues.apache.org/jira/browse/FLINK-32461
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.17.1
>Reporter: wgcn
>Priority: Major
> Attachments: image-2023-06-28-16-24-11-538.png, screenshot-1.png
>
>
> This issue doesn't usually occur, but it happens during busy nights when the
> machines are more active. The "managed operator state" grows significantly, and
> I found that the number of operator union state objects is 128, the same as the
> parallelism. Shouldn't the union state only need to be loaded once?
>  !screenshot-1.png! 
>  !image-2023-06-28-16-24-11-538.png! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32461) manage operator state increase very large

2023-06-29 Thread wgcn (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wgcn updated FLINK-32461:
-
Description: 
This issue doesn't usually occur, but it happens during busy nights when the
machines are more active. The "managed operator state" grows significantly, and I
found that the number of operator union state objects is 128, the same as the
parallelism. Shouldn't the union state only need to be loaded once?

 !screenshot-1.png! 
 !image-2023-06-28-16-24-11-538.png! 


  was:
This issue doesn't usually occur, but it happens during busy nights when the
machines are more active. The "managed operator state" grows significantly, and I
found that the number of operator union state objects is 128, the same as the
parallelism. Shouldn't the union state only need to be loaded once?
 !screenshot-1.png! 
 !image-2023-06-28-16-24-11-538.png! 



> manage operator state increase very large 
> --
>
> Key: FLINK-32461
> URL: https://issues.apache.org/jira/browse/FLINK-32461
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.17.1
>Reporter: wgcn
>Priority: Major
> Attachments: image-2023-06-28-16-24-11-538.png, screenshot-1.png
>
>
> This issue doesn't usually occur, but it happens during busy nights when the
> machines are more active. The "managed operator state" grows significantly, and
> I found that the number of operator union state objects is 128, the same as the
> parallelism. Shouldn't the union state only need to be loaded once?
>  !screenshot-1.png! 
>  !image-2023-06-28-16-24-11-538.png! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32461) manage operator state increase very large

2023-06-29 Thread wgcn (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wgcn updated FLINK-32461:
-
Description: 
This issue doesn't usually occur, but it happens during busy nights when the
machines are more active. The "managed operator state" grows significantly, and I
found that the number of operator union state objects is 128, the same as the
parallelism. Shouldn't the union state only need to be loaded once?
 !screenshot-1.png! 
 !image-2023-06-28-16-24-11-538.png! 


  was:
This issue doesn't usually occur, but it happens during busy nights when the
machines are more active. The "managed operator state" grows significantly, and I
can see that it mostly stores Kafka offsets.
 !screenshot-1.png! 
 !image-2023-06-28-16-24-11-538.png! 



> manage operator state increase very large 
> --
>
> Key: FLINK-32461
> URL: https://issues.apache.org/jira/browse/FLINK-32461
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.17.1
>Reporter: wgcn
>Priority: Major
> Attachments: image-2023-06-28-16-24-11-538.png, screenshot-1.png
>
>
> This issue doesn't usually occur, but it happens during busy nights when the
> machines are more active. The "managed operator state" grows significantly, and
> I found that the number of operator union state objects is 128, the same as the
> parallelism. Shouldn't the union state only need to be loaded once?
>  !screenshot-1.png! 
>  !image-2023-06-28-16-24-11-538.png! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32461) manage operator state increase very large

2023-06-28 Thread wgcn (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wgcn updated FLINK-32461:
-
Description: 
This issue doesn't usually occur, but it happens during busy nights when the
machines are more active. The "managed operator state" grows significantly, and I
can see that it mostly stores Kafka offsets.
 !screenshot-1.png! 
 !image-2023-06-28-16-24-11-538.png! 


  was:

This issue doesn't usually occur, but it happens during busy nights when the
machines are more active. The "managed operator state" grows significantly, and I
can see that it mostly stores Kafka offsets.
 !image-2023-06-28-16-24-11-538.png! 
 !screenshot-1.png! 


> manage operator state increase very large 
> --
>
> Key: FLINK-32461
> URL: https://issues.apache.org/jira/browse/FLINK-32461
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.17.1
>Reporter: wgcn
>Priority: Major
> Attachments: image-2023-06-28-16-24-11-538.png, screenshot-1.png
>
>
> This issue doesn't usually occur, but it happens during busy nights when the
> machines are more active. The "managed operator state" grows significantly, and
> I can see that it mostly stores Kafka offsets.
>  !screenshot-1.png! 
>  !image-2023-06-28-16-24-11-538.png! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32461) manage operator state increase very large

2023-06-28 Thread wgcn (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wgcn updated FLINK-32461:
-
 Attachment: image-2023-06-28-16-24-11-538.png
Description: 

This issue doesn't usually occur, but it happens during busy nights when the
machines are more active. The "managed operator state" grows significantly, and I
can see that it mostly stores Kafka offsets.
 !image-2023-06-28-16-24-11-538.png! 
 !screenshot-1.png! 

  was:
 !screenshot-1.png! 
This issue doesn't usually occur, but it happens during busy nights when the
machines are more active. The "managed operator state" grows significantly, and I
can see that it mostly stores Kafka offsets.


> manage operator state increase very large 
> --
>
> Key: FLINK-32461
> URL: https://issues.apache.org/jira/browse/FLINK-32461
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.17.1
>Reporter: wgcn
>Priority: Major
> Attachments: image-2023-06-28-16-24-11-538.png, screenshot-1.png
>
>
> This issue doesn't usually occur, but it happens during busy nights when the
> machines are more active. The "managed operator state" grows significantly, and
> I can see that it mostly stores Kafka offsets.
>  !image-2023-06-28-16-24-11-538.png! 
>  !screenshot-1.png! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32461) manage operator state increase very large

2023-06-28 Thread wgcn (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wgcn updated FLINK-32461:
-
Attachment: (was: image-2023-06-28-15-57-52-615.png)

> manage operator state increase very large 
> --
>
> Key: FLINK-32461
> URL: https://issues.apache.org/jira/browse/FLINK-32461
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.17.1
>Reporter: wgcn
>Priority: Major
> Attachments: screenshot-1.png
>
>
>  !screenshot-1.png! 
> This issue doesn't usually occur, but it happens during busy nights when the
> machines are more active. The "managed operator state" grows significantly, and
> I can see that it mostly stores Kafka offsets.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-32461) manage operator state increase very large

2023-06-28 Thread wgcn (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738010#comment-17738010
 ] 

wgcn commented on FLINK-32461:
--

>> This issue doesn't usually occur, but it happens during busy nights when the
>> machines are more active. The "managed operator state" grows significantly, and
>> I can see that it mostly stores Kafka offsets.

The task only has 2 topics, and I haven't figured out why the state is so large.

> manage operator state increase very large 
> --
>
> Key: FLINK-32461
> URL: https://issues.apache.org/jira/browse/FLINK-32461
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.17.1
>Reporter: wgcn
>Priority: Major
> Attachments: image-2023-06-28-15-57-52-615.png, screenshot-1.png
>
>
>  !screenshot-1.png! 
> This issue doesn't usually occur, but it happens during busy nights when the
> machines are more active. The "managed operator state" grows significantly, and
> I can see that it mostly stores Kafka offsets.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32461) manage operator state increase very large

2023-06-28 Thread wgcn (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wgcn updated FLINK-32461:
-
Attachment: screenshot-1.png

> manage operator state increase very large 
> --
>
> Key: FLINK-32461
> URL: https://issues.apache.org/jira/browse/FLINK-32461
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.17.1
>Reporter: wgcn
>Priority: Major
> Attachments: image-2023-06-28-15-57-52-615.png, screenshot-1.png
>
>
>  !image-2023-06-28-15-57-39-557.png! 
> This issue doesn't usually occur, but it happens during busy nights when the
> machines are more active. The "managed operator state" grows significantly, and
> I can see that it mostly stores Kafka offsets.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32461) manage operator state increase very large

2023-06-28 Thread wgcn (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wgcn updated FLINK-32461:
-
Description: 
 !screenshot-1.png! 
This issue doesn't usually occur, but it happens during busy nights when the
machines are more active. The "managed operator state" grows significantly, and I
can see that it mostly stores Kafka offsets.

  was:
 !image-2023-06-28-15-57-39-557.png! 
This issue doesn't usually occur, but it happens during busy nights when the
machines are more active. The "managed operator state" grows significantly, and I
can see that it mostly stores Kafka offsets.


> manage operator state increase very large 
> --
>
> Key: FLINK-32461
> URL: https://issues.apache.org/jira/browse/FLINK-32461
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.17.1
>Reporter: wgcn
>Priority: Major
> Attachments: image-2023-06-28-15-57-52-615.png, screenshot-1.png
>
>
>  !screenshot-1.png! 
> This issue doesn't usually occur, but it happens during busy nights when the
> machines are more active. The "managed operator state" grows significantly, and
> I can see that it mostly stores Kafka offsets.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32461) manage operator state increase very large

2023-06-28 Thread wgcn (Jira)
wgcn created FLINK-32461:


 Summary: manage operator state increase very large 
 Key: FLINK-32461
 URL: https://issues.apache.org/jira/browse/FLINK-32461
 Project: Flink
  Issue Type: Bug
Affects Versions: 1.17.1
Reporter: wgcn
 Attachments: image-2023-06-28-15-57-52-615.png

 !image-2023-06-28-15-57-39-557.png! 
This issue doesn't usually occur, but it happens during busy nights when the
machines are more active. The "managed operator state" grows significantly, and I
can see that it mostly stores Kafka offsets.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-32319) flink can't the partition of network after restart

2023-06-18 Thread wgcn (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733968#comment-17733968
 ] 

wgcn commented on FLINK-32319:
--

It works after I increase "taskmanager.memory.network", but I'm not sure why this
is happening, as the Flink task was functioning normally when it was initially
started. After some time, there is a chance that this issue occurs upon restart,
which we had not encountered in Flink 1.12. I have calculated the number of
floating buffers, the buffer size, and the number of buffers for each channel,
and 600 MB should be enough. Is this issue due to a new mechanism causing usage
problems, or is it an unexpected issue?
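A rough back-of-the-envelope version of that calculation, using the values quoted elsewhere in this thread (parallelism 128, 128 KB segments, 8 buffers per channel, 800 floating buffers per gate, 2 slots per TaskManager). The class name and the simplified formula are assumptions for illustration; it ignores output-side result-partition buffers and is not Flink's exact accounting:

```java
// Hypothetical helper; a simplified estimate, not Flink's exact buffer accounting.
public class NetworkBufferEstimate {
    public static void main(String[] args) {
        int parallelism = 128;        // channels per all-to-all input gate (parallelism.default)
        int buffersPerChannel = 8;    // taskmanager.network.memory.buffers-per-channel
        int floatingPerGate = 800;    // taskmanager.network.memory.floating-buffers-per-gate
        int segmentSizeKb = 128;      // taskmanager.memory.segment-size
        int slotsPerTaskManager = 2;  // taskmanager.numberOfTaskSlots
        int gatesPerSlot = 1;         // assumption: one shuffle input per task

        long buffersPerGate = (long) parallelism * buffersPerChannel + floatingPerGate;
        long totalBuffers = buffersPerGate * gatesPerSlot * slotsPerTaskManager;
        long totalMb = totalBuffers * segmentSizeKb / 1024;

        // (128 * 8 + 800) * 2 = 3648 buffers, roughly 456 MB on the input side alone,
        // which is already close to the 600 MB cap before output buffers are counted.
        System.out.printf("~%d buffers, ~%d MB per TaskManager (input side only)%n",
                totalBuffers, totalMb);
    }
}
```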

> flink can't the partition of network after restart
> --
>
> Key: FLINK-32319
> URL: https://issues.apache.org/jira/browse/FLINK-32319
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.17.1
> Environment: centos 7.
> jdk 8.
> flink1.17.1 application mode on yarn 
> flink configuration :
> ```
> $internal.application.program-argssql2
> $internal.deployment.config-dir   /data/home/flink/wgcn/flink-1.17.1/conf
> $internal.yarn.log-config-file
> /data/home/flink/wgcn/flink-1.17.1/conf/log4j.properties
> akka.ask.timeout  100s
> blob.server.port  15402
> classloader.check-leaked-classloader  false
> classloader.resolve-order parent-first
> env.java.opts.taskmanager -XX:+UseG1GC -XX:MaxGCPauseMillis=1000 
> execution.attachedtrue
> execution.checkpointing.aligned-checkpoint-timeout10 min
> execution.checkpointing.externalized-checkpoint-retention 
> RETAIN_ON_CANCELLATION
> execution.checkpointing.interval  10 min
> execution.checkpointing.min-pause 10 min
> execution.savepoint-restore-mode  NO_CLAIM
> execution.savepoint.ignore-unclaimed-statefalse
> execution.shutdown-on-attached-exit   false
> execution.target  embedded
> high-availability zookeeper
> high-availability.cluster-id  application_1684133071014_7202676
> high-availability.storageDir  hdfs:///user/flink/recovery
> high-availability.zookeeper.path.root /flink
> high-availability.zookeeper.quorumx
> internal.cluster.execution-mode   NORMAL
> internal.io.tmpdirs.use-local-default true
> io.tmp.dirs   
> /data1/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data2/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data3/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data4/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data5/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data6/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data7/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data8/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data9/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data10/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data11/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data12/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676
> jobmanager.execution.failover-strategyregion
> jobmanager.memory.heap.size   9261023232b
> jobmanager.memory.jvm-metaspace.size  268435456b
> jobmanager.memory.jvm-overhead.max1073741824b
> jobmanager.memory.jvm-overhead.min1073741824b
> jobmanager.memory.off-heap.size   134217728b
> jobmanager.memory.process.size10240m
> jobmanager.rpc.address
> jobmanager.rpc.port   31332
> metrics.reporter.promgateway.deleteOnShutdown true
> metrics.reporter.promgateway.factory.class
> org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporterFactory
> metrics.reporter.promgateway.hostUrl  :9091
> metrics.reporter.promgateway.interval 60 SECONDS
> metrics.reporter.promgateway.jobName  join_phase3_v7
> metrics.reporter.promgateway.randomJobNameSuffix  false
> parallelism.default   128
> pipeline.classpaths   
> pipeline.jars 
> file:/data2/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676/container_e16_1684133071014_7202676_01_02/frauddetection-0.1.jar
> rest.address  
> rest.bind-address x
> rest.bind-port5-50500
> rest.flamegraph.enabled   true
> restart-strategy.failure-rate.delay   10 s
> restart-strategy.failure-rate.failure-rate-interval   1 min
> restart-strategy.failure-rate.max-failures-per-interval   6
> restart-strategy.type exponential-delay
> state.backend.typefilesystem
> state.checkpoints.dir hdfs://xx/user/flink/checkpoints-data/wgcn
> state.checkpoints.num-retained3
> taskmanager.memory.managed.fraction   0
> 

[jira] [Comment Edited] (FLINK-32319) flink can't the partition of network after restart

2023-06-14 Thread wgcn (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17732799#comment-17732799
 ] 

wgcn edited comment on FLINK-32319 at 6/14/23 11:26 PM:


Hi [~Wencong Liu], I increased taskmanager.network.request-backoff.max from
1 to 2 and 3, but this problem keeps occurring. Is this related to the
config "-Dtaskmanager.memory.network.max=600 MB"? I calculated that the network
buffer memory required by the task manager is not very large, so I decreased it.


was (Author: 1026688210):
Hi [~Wencong Liu], I increased taskmanager.network.request-backoff.max to 2
and 3, but this problem keeps occurring. Is this related to the config
"-Dtaskmanager.memory.network.max=600 MB"? I calculated that the network buffer
memory required by the task manager is not very large, so I decreased it.

> flink can't the partition of network after restart
> --
>
> Key: FLINK-32319
> URL: https://issues.apache.org/jira/browse/FLINK-32319
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.17.1
> Environment: centos 7.
> jdk 8.
> flink1.17.1 application mode on yarn 
> flink configuration :
> ```
> $internal.application.program-argssql2
> $internal.deployment.config-dir   /data/home/flink/wgcn/flink-1.17.1/conf
> $internal.yarn.log-config-file
> /data/home/flink/wgcn/flink-1.17.1/conf/log4j.properties
> akka.ask.timeout  100s
> blob.server.port  15402
> classloader.check-leaked-classloader  false
> classloader.resolve-order parent-first
> env.java.opts.taskmanager -XX:+UseG1GC -XX:MaxGCPauseMillis=1000 
> execution.attachedtrue
> execution.checkpointing.aligned-checkpoint-timeout10 min
> execution.checkpointing.externalized-checkpoint-retention 
> RETAIN_ON_CANCELLATION
> execution.checkpointing.interval  10 min
> execution.checkpointing.min-pause 10 min
> execution.savepoint-restore-mode  NO_CLAIM
> execution.savepoint.ignore-unclaimed-statefalse
> execution.shutdown-on-attached-exit   false
> execution.target  embedded
> high-availability zookeeper
> high-availability.cluster-id  application_1684133071014_7202676
> high-availability.storageDir  hdfs:///user/flink/recovery
> high-availability.zookeeper.path.root /flink
> high-availability.zookeeper.quorumx
> internal.cluster.execution-mode   NORMAL
> internal.io.tmpdirs.use-local-default true
> io.tmp.dirs   
> /data1/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data2/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data3/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data4/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data5/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data6/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data7/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data8/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data9/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data10/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data11/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data12/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676
> jobmanager.execution.failover-strategyregion
> jobmanager.memory.heap.size   9261023232b
> jobmanager.memory.jvm-metaspace.size  268435456b
> jobmanager.memory.jvm-overhead.max1073741824b
> jobmanager.memory.jvm-overhead.min1073741824b
> jobmanager.memory.off-heap.size   134217728b
> jobmanager.memory.process.size10240m
> jobmanager.rpc.address
> jobmanager.rpc.port   31332
> metrics.reporter.promgateway.deleteOnShutdown true
> metrics.reporter.promgateway.factory.class
> org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporterFactory
> metrics.reporter.promgateway.hostUrl  :9091
> metrics.reporter.promgateway.interval 60 SECONDS
> metrics.reporter.promgateway.jobName  join_phase3_v7
> metrics.reporter.promgateway.randomJobNameSuffix  false
> parallelism.default   128
> pipeline.classpaths   
> pipeline.jars 
> file:/data2/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676/container_e16_1684133071014_7202676_01_02/frauddetection-0.1.jar
> rest.address  
> rest.bind-address x
> rest.bind-port5-50500
> rest.flamegraph.enabled   true
> restart-strategy.failure-rate.delay   10 s
> restart-strategy.failure-rate.failure-rate-interval   1 min
> restart-strategy.failure-rate.max-failures-per-interval   6
> restart-strategy.type exponential-delay
> state.backend.type

[jira] [Commented] (FLINK-32319) flink can't the partition of network after restart

2023-06-14 Thread wgcn (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17732799#comment-17732799
 ] 

wgcn commented on FLINK-32319:
--

Hi [~Wencong Liu], I increased taskmanager.network.request-backoff.max to 2
and 3, but this problem keeps occurring. Is this related to the config
"-Dtaskmanager.memory.network.max=600 MB"? I calculated that the network buffer
memory required by the task manager is not very large, so I decreased it.

> flink can't the partition of network after restart
> --
>
> Key: FLINK-32319
> URL: https://issues.apache.org/jira/browse/FLINK-32319
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.17.1
> Environment: centos 7.
> jdk 8.
> flink1.17.1 application mode on yarn 
> flink configuration :
> ```
> $internal.application.program-argssql2
> $internal.deployment.config-dir   /data/home/flink/wgcn/flink-1.17.1/conf
> $internal.yarn.log-config-file
> /data/home/flink/wgcn/flink-1.17.1/conf/log4j.properties
> akka.ask.timeout  100s
> blob.server.port  15402
> classloader.check-leaked-classloader  false
> classloader.resolve-order parent-first
> env.java.opts.taskmanager -XX:+UseG1GC -XX:MaxGCPauseMillis=1000 
> execution.attachedtrue
> execution.checkpointing.aligned-checkpoint-timeout10 min
> execution.checkpointing.externalized-checkpoint-retention 
> RETAIN_ON_CANCELLATION
> execution.checkpointing.interval  10 min
> execution.checkpointing.min-pause 10 min
> execution.savepoint-restore-mode  NO_CLAIM
> execution.savepoint.ignore-unclaimed-statefalse
> execution.shutdown-on-attached-exit   false
> execution.target  embedded
> high-availability zookeeper
> high-availability.cluster-id  application_1684133071014_7202676
> high-availability.storageDir  hdfs:///user/flink/recovery
> high-availability.zookeeper.path.root /flink
> high-availability.zookeeper.quorumx
> internal.cluster.execution-mode   NORMAL
> internal.io.tmpdirs.use-local-default true
> io.tmp.dirs   
> /data1/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data2/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data3/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data4/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data5/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data6/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data7/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data8/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data9/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data10/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data11/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data12/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676
> jobmanager.execution.failover-strategyregion
> jobmanager.memory.heap.size   9261023232b
> jobmanager.memory.jvm-metaspace.size  268435456b
> jobmanager.memory.jvm-overhead.max1073741824b
> jobmanager.memory.jvm-overhead.min1073741824b
> jobmanager.memory.off-heap.size   134217728b
> jobmanager.memory.process.size10240m
> jobmanager.rpc.address
> jobmanager.rpc.port   31332
> metrics.reporter.promgateway.deleteOnShutdown true
> metrics.reporter.promgateway.factory.class
> org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporterFactory
> metrics.reporter.promgateway.hostUrl  :9091
> metrics.reporter.promgateway.interval 60 SECONDS
> metrics.reporter.promgateway.jobName  join_phase3_v7
> metrics.reporter.promgateway.randomJobNameSuffix  false
> parallelism.default   128
> pipeline.classpaths   
> pipeline.jars 
> file:/data2/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676/container_e16_1684133071014_7202676_01_02/frauddetection-0.1.jar
> rest.address  
> rest.bind-address x
> rest.bind-port5-50500
> rest.flamegraph.enabled   true
> restart-strategy.failure-rate.delay   10 s
> restart-strategy.failure-rate.failure-rate-interval   1 min
> restart-strategy.failure-rate.max-failures-per-interval   6
> restart-strategy.type exponential-delay
> state.backend.typefilesystem
> state.checkpoints.dir hdfs://xx/user/flink/checkpoints-data/wgcn
> state.checkpoints.num-retained3
> taskmanager.memory.managed.fraction   0
> taskmanager.memory.network.max600mb
> taskmanager.memory.process.size   10240m
> taskmanager.memory.segment-size   128kb
> taskmanager.network.memory.buffers-per-channel8
> 

[jira] [Commented] (FLINK-32319) flink can't the partition of network after restart

2023-06-12 Thread wgcn (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17731827#comment-17731827
 ] 

wgcn commented on FLINK-32319:
--

Hi [~Wencong Liu], thanks for your response. I will try the config, but I have a
question. I only roughly looked at the meaning of this config: why does it need
to be set so large? Is it related to parallel reading? We are using Flink 1.12 in
our production environment, and I had never paid attention to this config before.
Is this because some mechanisms were added after version 1.12?

> flink can't the partition of network after restart
> --
>
> Key: FLINK-32319
> URL: https://issues.apache.org/jira/browse/FLINK-32319
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.17.1
> Environment: centos 7.
> jdk 8.
> flink1.17.1 application mode on yarn 
> flink configuration :
> ```
> $internal.application.program-argssql2
> $internal.deployment.config-dir   /data/home/flink/wgcn/flink-1.17.1/conf
> $internal.yarn.log-config-file
> /data/home/flink/wgcn/flink-1.17.1/conf/log4j.properties
> akka.ask.timeout  100s
> blob.server.port  15402
> classloader.check-leaked-classloader  false
> classloader.resolve-order parent-first
> env.java.opts.taskmanager -XX:+UseG1GC -XX:MaxGCPauseMillis=1000 
> execution.attachedtrue
> execution.checkpointing.aligned-checkpoint-timeout10 min
> execution.checkpointing.externalized-checkpoint-retention 
> RETAIN_ON_CANCELLATION
> execution.checkpointing.interval  10 min
> execution.checkpointing.min-pause 10 min
> execution.savepoint-restore-mode  NO_CLAIM
> execution.savepoint.ignore-unclaimed-statefalse
> execution.shutdown-on-attached-exit   false
> execution.target  embedded
> high-availability zookeeper
> high-availability.cluster-id  application_1684133071014_7202676
> high-availability.storageDir  hdfs:///user/flink/recovery
> high-availability.zookeeper.path.root /flink
> high-availability.zookeeper.quorumx
> internal.cluster.execution-mode   NORMAL
> internal.io.tmpdirs.use-local-default true
> io.tmp.dirs   
> /data1/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data2/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data3/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data4/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data5/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data6/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data7/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data8/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data9/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data10/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data11/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data12/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676
> jobmanager.execution.failover-strategyregion
> jobmanager.memory.heap.size   9261023232b
> jobmanager.memory.jvm-metaspace.size  268435456b
> jobmanager.memory.jvm-overhead.max1073741824b
> jobmanager.memory.jvm-overhead.min1073741824b
> jobmanager.memory.off-heap.size   134217728b
> jobmanager.memory.process.size10240m
> jobmanager.rpc.address
> jobmanager.rpc.port   31332
> metrics.reporter.promgateway.deleteOnShutdown true
> metrics.reporter.promgateway.factory.class
> org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporterFactory
> metrics.reporter.promgateway.hostUrl  :9091
> metrics.reporter.promgateway.interval 60 SECONDS
> metrics.reporter.promgateway.jobName  join_phase3_v7
> metrics.reporter.promgateway.randomJobNameSuffix  false
> parallelism.default   128
> pipeline.classpaths   
> pipeline.jars 
> file:/data2/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676/container_e16_1684133071014_7202676_01_02/frauddetection-0.1.jar
> rest.address  
> rest.bind-address x
> rest.bind-port5-50500
> rest.flamegraph.enabled   true
> restart-strategy.failure-rate.delay   10 s
> restart-strategy.failure-rate.failure-rate-interval   1 min
> restart-strategy.failure-rate.max-failures-per-interval   6
> restart-strategy.type exponential-delay
> state.backend.typefilesystem
> state.checkpoints.dir hdfs://xx/user/flink/checkpoints-data/wgcn
> state.checkpoints.num-retained3
> taskmanager.memory.managed.fraction   0
> taskmanager.memory.network.max600mb
> taskmanager.memory.process.size   10240m
> 

[jira] [Updated] (FLINK-32319) flink can't the partition of network after restart

2023-06-12 Thread wgcn (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wgcn updated FLINK-32319:
-
Environment: 
centos 7.
jdk 8.
flink1.17.1 application mode on yarn 

flink configuration :

```

$internal.application.program-args  sql2
$internal.deployment.config-dir /data/home/flink/wgcn/flink-1.17.1/conf
$internal.yarn.log-config-file  
/data/home/flink/wgcn/flink-1.17.1/conf/log4j.properties
akka.ask.timeout100s
blob.server.port15402
classloader.check-leaked-classloaderfalse
classloader.resolve-order   parent-first
env.java.opts.taskmanager   -XX:+UseG1GC -XX:MaxGCPauseMillis=1000 
execution.attached  true
execution.checkpointing.aligned-checkpoint-timeout  10 min
execution.checkpointing.externalized-checkpoint-retention   
RETAIN_ON_CANCELLATION
execution.checkpointing.interval10 min
execution.checkpointing.min-pause   10 min
execution.savepoint-restore-modeNO_CLAIM
execution.savepoint.ignore-unclaimed-state  false
execution.shutdown-on-attached-exit false
execution.targetembedded
high-availability   zookeeper
high-availability.cluster-idapplication_1684133071014_7202676
high-availability.storageDirhdfs:///user/flink/recovery
high-availability.zookeeper.path.root   /flink
high-availability.zookeeper.quorum  x
internal.cluster.execution-mode NORMAL
internal.io.tmpdirs.use-local-default   true
io.tmp.dirs 
/data1/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data2/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data3/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data4/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data5/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data6/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data7/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data8/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data9/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data10/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data11/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data12/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676
jobmanager.execution.failover-strategy  region
jobmanager.memory.heap.size 9261023232b
jobmanager.memory.jvm-metaspace.size268435456b
jobmanager.memory.jvm-overhead.max  1073741824b
jobmanager.memory.jvm-overhead.min  1073741824b
jobmanager.memory.off-heap.size 134217728b
jobmanager.memory.process.size  10240m
jobmanager.rpc.address  
jobmanager.rpc.port 31332
metrics.reporter.promgateway.deleteOnShutdown   true
metrics.reporter.promgateway.factory.class  
org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporterFactory
metrics.reporter.promgateway.hostUrl:9091
metrics.reporter.promgateway.interval   60 SECONDS
metrics.reporter.promgateway.jobNamejoin_phase3_v7
metrics.reporter.promgateway.randomJobNameSuffixfalse
parallelism.default 128
pipeline.classpaths 
pipeline.jars   
file:/data2/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676/container_e16_1684133071014_7202676_01_02/frauddetection-0.1.jar
rest.address
rest.bind-address   x
rest.bind-port  5-50500
rest.flamegraph.enabled true
restart-strategy.failure-rate.delay 10 s
restart-strategy.failure-rate.failure-rate-interval 1 min
restart-strategy.failure-rate.max-failures-per-interval 6
restart-strategy.type   exponential-delay
state.backend.type  filesystem
state.checkpoints.dir   hdfs://xx/user/flink/checkpoints-data/wgcn
state.checkpoints.num-retained  3
taskmanager.memory.managed.fraction 0
taskmanager.memory.network.max  600mb
taskmanager.memory.process.size 10240m
taskmanager.memory.segment-size 128kb
taskmanager.network.memory.buffers-per-channel  8
taskmanager.network.memory.floating-buffers-per-gate800
taskmanager.numberOfTaskSlots   2
web.port0
web.tmpdir  /tmp/flink-web-1b87445e-2761-4f16-97a1-8d4fc6fa8534
yarn.application-attempt-failures-validity-interval 6
yarn.application-attempts   3
yarn.application.name   join_phase3_v7
yarn.heartbeat.container-request-interval   700
```

  was:
centos 7.
jdk 8.
flink1.17.1 application mode on yarn 


> flink can't the partition of network after restart
> --
>
> Key: FLINK-32319
> URL: https://issues.apache.org/jira/browse/FLINK-32319
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.17.1
> Environment: centos 7.
> jdk 8.
> flink1.17.1 application mode on 

[jira] [Updated] (FLINK-32319) flink can't the partition of network after restart

2023-06-12 Thread wgcn (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wgcn updated FLINK-32319:
-
   Attachment: image-2023-06-13-07-14-48-958.png
  Component/s: Runtime / Network
Affects Version/s: 1.17.1
  Description: 
Flink can't request the network partition after restart, which prevents the job
from restoring.
 !image-2023-06-13-07-14-48-958.png! 
  Environment: 
centos 7.
jdk 8.
flink1.17.1 application mode on yarn 

> flink can't the partition of network after restart
> --
>
> Key: FLINK-32319
> URL: https://issues.apache.org/jira/browse/FLINK-32319
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.17.1
> Environment: centos 7.
> jdk 8.
> flink1.17.1 application mode on yarn 
>Reporter: wgcn
>Priority: Major
> Attachments: image-2023-06-13-07-14-48-958.png
>
>
> Flink can't request the network partition after restart, which prevents the
> job from restoring.
>  !image-2023-06-13-07-14-48-958.png! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32319) flink can't the partition of network after restart

2023-06-12 Thread wgcn (Jira)
wgcn created FLINK-32319:


 Summary: flink can't the partition of network after restart
 Key: FLINK-32319
 URL: https://issues.apache.org/jira/browse/FLINK-32319
 Project: Flink
  Issue Type: Bug
Reporter: wgcn






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-30720) KafkaChangelogTableITCase.testKafkaDebeziumChangelogSource failed due to a topic already exist when creating it

2023-02-26 Thread wgcn (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693657#comment-17693657
 ] 

wgcn commented on FLINK-30720:
--

Hi [~mapohl], does this exception occur in every test?
Could you tell me the steps that cause the exception? I want to test it in my
local environment.

> KafkaChangelogTableITCase.testKafkaDebeziumChangelogSource failed due to a 
> topic already exist when creating it
> ---
>
> Key: FLINK-30720
> URL: https://issues.apache.org/jira/browse/FLINK-30720
> Project: Flink
>  Issue Type: Sub-task
>  Components: Connectors / Kafka
>Affects Versions: 1.17.0
>Reporter: Matthias Pohl
>Priority: Critical
>  Labels: test-stability
>
> We experienced a build failure in 
> {{KafkaChangelogTableITCase.testKafkaDebeziumChangelogSource}} due to an 
> already existing topic:
> {code}
> Jan 17 14:15:33 [ERROR] 
> org.apache.flink.streaming.connectors.kafka.table.KafkaChangelogTableITCase.testKafkaDebeziumChangelogSource
>   Time elapsed: 14.771 s  <<< ERROR!
> Jan 17 14:15:33 java.lang.IllegalStateException: Fail to create topic 
> [changelog_topic partitions: 1 replication factor: 1].
> Jan 17 14:15:33   at 
> org.apache.flink.streaming.connectors.kafka.table.KafkaTableTestBase.createTestTopic(KafkaTableTestBase.java:143)
> Jan 17 14:15:33   at 
> org.apache.flink.streaming.connectors.kafka.table.KafkaChangelogTableITCase.testKafkaDebeziumChangelogSource(KafkaChangelogTableITCase.java:60)
> Jan 17 14:15:33   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
> Jan 17 14:15:33   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> Jan 17 14:15:33   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Jan 17 14:15:33   at java.lang.reflect.Method.invoke(Method.java:498)
> [...]
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=44972=logs=aa18c3f6-13b8-5f58-86bb-c1cffb239496=502fb6c0-30a2-5e49-c5c2-a00fa3acb203=38188



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-20138) Flink Job can not recover due to timeout of requiring slots when flink jobmanager restarted

2020-11-14 Thread wgcn (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232000#comment-17232000
 ] 

wgcn edited comment on FLINK-20138 at 11/14/20, 12:17 PM:
--

[~trohrmann], the log 'Connecting to ResourceManager x' does not appear
in the JobManager log.
[Jobmaster.connectToResourceManager|https://github.com/apache/flink/blob/c9d2c9098d725a2d39e860bde414ecb0c5d6a233/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L931-L936]:
we guess that the change event of the ResourceManager latch node in ZooKeeper did
not inform the JobMaster in a bad network environment.
I found that [FLINK-10052|https://issues.apache.org/jira/browse/FLINK-10052]
makes an improvement on Curator; will it be finished in the next version?


was (Author: 1026688210):
[~trohrmann], the log 'Connecting to ResourceManager x' does not appear
in the JobManager log.
[Jobmaster.connectToResourceManager|https://github.com/apache/flink/blob/c9d2c9098d725a2d39e860bde414ecb0c5d6a233/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L931-L936]:
we guess that the change event of the ResourceManager latch node in ZooKeeper did
not inform the JobMaster in a bad network environment.
I found that [FLINK-10052|https://issues.apache.org/jira/browse/FLINK-10052]
makes an improvement on ZooKeeper; will it be finished in the next version?

> Flink Job can not recover due to  timeout of requiring slots when flink 
> jobmanager restarted
> 
>
> Key: FLINK-20138
> URL: https://issues.apache.org/jira/browse/FLINK-20138
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Table SQL / Runtime
> Environment: flink : 1.9.2
> hadoop :2.7.2
> jdk:1.8
>Reporter: wgcn
>Priority: Major
> Attachments: 2820F7EE-85F9-441D-95D5-8163FB6267DF.png, 
> jobmanager.log, zk_resource_address_info.png
>
>
> Our Flink jobs run on YARN per-job mode. We stopped some NodeManager machines,
> and the AMs on those machines restarted on other NodeManagers. We found that
> some jobs could not recover due to a timeout while requesting slots.
> *SlotPoolImpl never connected to the ResourceManager*
> ```
> 2020-11-09 16:31:31,794   INFO 
> flink-akka.actor.default-dispatcher-16 
> (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369)
>  - Cannot serve slot request, no ResourceManager connected. Adding as pending 
> request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]
> ```
> *1. We did not find a log line of the YarnResourceManager requesting containers
> in the attached JobManager log.
> 2. The ZooKeeper node is also shown in the attachment.*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-20138) Flink Job can not recover due to timeout of requiring slots when flink jobmanager restarted

2020-11-14 Thread wgcn (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232000#comment-17232000
 ] 

wgcn edited comment on FLINK-20138 at 11/14/20, 12:13 PM:
--

[~trohrmann], the log 'Connecting to ResourceManager x' does not appear
in the JobManager log.
[Jobmaster.connectToResourceManager|https://github.com/apache/flink/blob/c9d2c9098d725a2d39e860bde414ecb0c5d6a233/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L931-L936]:
we guess that the change event of the ResourceManager latch node in ZooKeeper did
not inform the JobMaster in a bad network environment.
I found that [FLINK-10052|https://issues.apache.org/jira/browse/FLINK-10052]
makes an improvement on ZooKeeper; will it be finished in the next version?


was (Author: 1026688210):
[~trohrmann], the log 'Connecting to ResourceManager x' does not appear
in the JobManager log.
[Jobmaster.connectToResourceManager|https://github.com/apache/flink/blob/c9d2c9098d725a2d39e860bde414ecb0c5d6a233/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L931-L936]:
we guess that the change event of the ResourceManager latch node in ZooKeeper did
not inform the JobMaster in a bad network environment.
I found that [FLINK-10052|https://issues.apache.org/jira/browse/FLINK-10052]
makes an improvement on ZooKeeper; will it be finished in the latest version?

> Flink Job can not recover due to  timeout of requiring slots when flink 
> jobmanager restarted
> 
>
> Key: FLINK-20138
> URL: https://issues.apache.org/jira/browse/FLINK-20138
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Table SQL / Runtime
> Environment: flink : 1.9.2
> hadoop :2.7.2
> jdk:1.8
>Reporter: wgcn
>Priority: Major
> Attachments: 2820F7EE-85F9-441D-95D5-8163FB6267DF.png, 
> jobmanager.log, zk_resource_address_info.png
>
>
> Our Flink jobs run on YARN per-job mode. We stopped some NodeManager machines,
> and the AMs on those machines restarted on other NodeManagers. We found that
> some jobs could not recover due to a timeout while requesting slots.
> *SlotPoolImpl never connected to the ResourceManager*
> ```
> 2020-11-09 16:31:31,794   INFO 
> flink-akka.actor.default-dispatcher-16 
> (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369)
>  - Cannot serve slot request, no ResourceManager connected. Adding as pending 
> request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]
> ```
> *1. We did not find a log line of the YarnResourceManager requesting containers
> in the attached JobManager log.
> 2. The ZooKeeper node is also shown in the attachment.*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-20138) Flink Job can not recover due to timeout of requiring slots when flink jobmanager restarted

2020-11-14 Thread wgcn (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232000#comment-17232000
 ] 

wgcn edited comment on FLINK-20138 at 11/14/20, 12:11 PM:
--

[~trohrmann], the log 'Connecting to ResourceManager x' does not appear
in the JobManager log.
[Jobmaster.connectToResourceManager|https://github.com/apache/flink/blob/c9d2c9098d725a2d39e860bde414ecb0c5d6a233/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L931-L936]:
we guess that the change event of the ResourceManager latch node in ZooKeeper did
not inform the JobMaster in a bad network environment.
I found that [FLINK-10052|https://issues.apache.org/jira/browse/FLINK-10052]
makes an improvement on ZooKeeper; will it be finished in the latest version?


was (Author: 1026688210):
[~trohrmann], the log 'Connecting to ResourceManager x' does not appear
in the JobManager log.
[link title|https://github.com/apache/flink/blob/c9d2c9098d725a2d39e860bde414ecb0c5d6a233/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L931-L936]:
we guess that the change event of the ResourceManager latch node in ZooKeeper did
not inform the JobMaster in a bad network environment.
I found that [FLINK-10052|https://issues.apache.org/jira/browse/FLINK-10052]
makes an improvement on ZooKeeper; will it be finished in the latest version?

> Flink Job can not recover due to  timeout of requiring slots when flink 
> jobmanager restarted
> 
>
> Key: FLINK-20138
> URL: https://issues.apache.org/jira/browse/FLINK-20138
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Table SQL / Runtime
> Environment: flink : 1.9.2
> hadoop :2.7.2
> jdk:1.8
>Reporter: wgcn
>Priority: Major
> Attachments: 2820F7EE-85F9-441D-95D5-8163FB6267DF.png, 
> jobmanager.log, zk_resource_address_info.png
>
>
> Our Flink jobs run on YARN per-job mode. We stopped some NodeManager machines,
> and the AMs on those machines restarted on other NodeManagers. We found that
> some jobs could not recover due to a timeout while requesting slots.
> *SlotPoolImpl never connected to the ResourceManager*
> ```
> 2020-11-09 16:31:31,794   INFO 
> flink-akka.actor.default-dispatcher-16 
> (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369)
>  - Cannot serve slot request, no ResourceManager connected. Adding as pending 
> request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]
> ```
> *1. We did not find a log line of the YarnResourceManager requesting containers
> in the attached JobManager log.
> 2. The ZooKeeper node is also shown in the attachment.*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-20138) Flink Job can not recover due to timeout of requiring slots when flink jobmanager restarted

2020-11-14 Thread wgcn (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232000#comment-17232000
 ] 

wgcn commented on FLINK-20138:
--

[~trohrmann], the log 'Connecting to ResourceManager x' does not appear
in the JobManager log.
[link title|https://github.com/apache/flink/blob/c9d2c9098d725a2d39e860bde414ecb0c5d6a233/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L931-L936]:
we guess that the change event of the ResourceManager latch node in ZooKeeper did
not inform the JobMaster in a bad network environment.
I found that [FLINK-10052|https://issues.apache.org/jira/browse/FLINK-10052]
makes an improvement on ZooKeeper; will it be finished in the latest version?

> Flink Job can not recover due to  timeout of requiring slots when flink 
> jobmanager restarted
> 
>
> Key: FLINK-20138
> URL: https://issues.apache.org/jira/browse/FLINK-20138
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Table SQL / Runtime
> Environment: flink : 1.9.2
> hadoop :2.7.2
> jdk:1.8
>Reporter: wgcn
>Priority: Major
> Attachments: 2820F7EE-85F9-441D-95D5-8163FB6267DF.png, 
> jobmanager.log, zk_resource_address_info.png
>
>
> Our Flink jobs run on YARN per-job mode. We stopped some NodeManager machines,
> and the AMs on those machines restarted on other NodeManagers. We found that
> some jobs could not recover due to a timeout while requesting slots.
> *SlotPoolImpl never connected to the ResourceManager*
> ```
> 2020-11-09 16:31:31,794   INFO 
> flink-akka.actor.default-dispatcher-16 
> (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369)
>  - Cannot serve slot request, no ResourceManager connected. Adding as pending 
> request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]
> ```
> *1. We did not find a log line of the YarnResourceManager requesting containers
> in the attached JobManager log.
> 2. The ZooKeeper node is also shown in the attachment.*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-20138) Flink Job can not recover due to timeout of requiring slots when flink jobmanager restarted

2020-11-14 Thread wgcn (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231966#comment-17231966
 ] 

wgcn commented on FLINK-20138:
--

[~trohrmann], I uploaded the resource_manager_lock in the attachment. The
address on the resource_manager_lock node is correct and is occupied by the
current JobManager. The problem is difficult to reproduce; I did not hit the
problem again by trying to kill the other AMs on YARN.

> Flink Job can not recover due to  timeout of requiring slots when flink 
> jobmanager restarted
> 
>
> Key: FLINK-20138
> URL: https://issues.apache.org/jira/browse/FLINK-20138
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Table SQL / Runtime
> Environment: flink : 1.9.2
> hadoop :2.7.2
> jdk:1.8
>Reporter: wgcn
>Priority: Major
> Attachments: 2820F7EE-85F9-441D-95D5-8163FB6267DF.png, 
> jobmanager.log, zk_resource_address_info.png
>
>
> our flink jobs run on Yarn Perjob Mode. We stoped some nodemanger  machines  
> ,and   AMs of  the  machines  restarted at other nodemanager.  We found  some 
> jobs  can not recover due to  timeout of requiring slots.
> *SlotPoolImp always did not connect ResourceManager *
> ```
> 2020-11-09 16:31:31,794   INFO 
> flink-akka.actor.default-dispatcher-16 
> (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369)
>  - Cannot serve slot request, no ResourceManager connected. Adding as pending 
> request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]
> ```
> *1.We did not find  the log of YarnResourceManager requesting container   at 
> the jobmanager log of attachment. 
> 2.The node  of Zookeeper is also  showed at attachment .*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-20138) Flink Job can not recover due to timeout of requiring slots when flink jobmanager restarted

2020-11-14 Thread wgcn (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wgcn updated FLINK-20138:
-
Attachment: zk_resource_address_info.png

> Flink Job can not recover due to  timeout of requiring slots when flink 
> jobmanager restarted
> 
>
> Key: FLINK-20138
> URL: https://issues.apache.org/jira/browse/FLINK-20138
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Table SQL / Runtime
> Environment: flink : 1.9.2
> hadoop :2.7.2
> jdk:1.8
>Reporter: wgcn
>Priority: Major
> Attachments: 2820F7EE-85F9-441D-95D5-8163FB6267DF.png, 
> jobmanager.log, zk_resource_address_info.png
>
>
> our flink jobs run on Yarn Perjob Mode. We stoped some nodemanger  machines  
> ,and   AMs of  the  machines  restarted at other nodemanager.  We found  some 
> jobs  can not recover due to  timeout of requiring slots.
> *SlotPoolImp always did not connect ResourceManager *
> ```
> 2020-11-09 16:31:31,794   INFO 
> flink-akka.actor.default-dispatcher-16 
> (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369)
>  - Cannot serve slot request, no ResourceManager connected. Adding as pending 
> request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]
> ```
> *1.We did not find  the log of YarnResourceManager requesting container   at 
> the jobmanager log of attachment. 
> 2.The node  of Zookeeper is also  showed at attachment .*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager

2020-11-12 Thread wgcn (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wgcn closed FLINK-18715.

Resolution: Won't Do

> add cpu usage metric of  jobmanager/taskmanager  
> -
>
> Key: FLINK-18715
> URL: https://issues.apache.org/jira/browse/FLINK-18715
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Metrics
>Affects Versions: 1.11.1
>Reporter: wgcn
>Priority: Major
> Fix For: 1.12.0, 1.11.3
>
>
> flink process add cpu usage metric,   user can determine that their job  is  
> io bound /cpu bound  ,so that they can increase/decrese cpu core in the 
> container (k8s,yarn). If it's nessary 
> .  you can assign  it  to me  ,I come up with a idea  calculating cpu usage 
> ratio using   ManagementFactory.getRuntimeMXBean().getUptime() and  
> ManagementFactory.getOperatingSystemMXBean().getProcessCpuTime  over a period 
> of time .   it can get a value in single cpu core environment. and user can 
> use the value  to calculate cpu usage ratio by dividing  num of container's 
> cpu core.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-20138) Flink Job can not recover due to timeout of requiring slots when flink jobmanager restarted

2020-11-12 Thread wgcn (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231130#comment-17231130
 ] 

wgcn commented on FLINK-20138:
--

hi~ [~trohrmann]  please have a look at this issue

> Flink Job can not recover due to  timeout of requiring slots when flink 
> jobmanager restarted
> 
>
> Key: FLINK-20138
> URL: https://issues.apache.org/jira/browse/FLINK-20138
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Table SQL / Runtime
> Environment: flink : 1.9.2
> hadoop :2.7.2
> jdk:1.8
>Reporter: wgcn
>Priority: Major
> Attachments: 2820F7EE-85F9-441D-95D5-8163FB6267DF.png, jobmanager.log
>
>
> our flink jobs run on Yarn Perjob Mode. We stoped some nodemanger  machines  
> ,and   AMs of  the  machines  restarted at other nodemanager.  We found  some 
> jobs  can not recover due to  timeout of requiring slots.
> *SlotPoolImp always did not connect ResourceManager *
> ```
> 2020-11-09 16:31:31,794   INFO 
> flink-akka.actor.default-dispatcher-16 
> (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369)
>  - Cannot serve slot request, no ResourceManager connected. Adding as pending 
> request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]
> ```
> *1.We did not find  the log of YarnResourceManager requesting container   at 
> the jobmanager log of attachment. 
> 2.The node  of Zookeeper is also  showed at attachment .*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-20138) Flink Job can not recover due to timeout of requiring slots when flink jobmanager restarted

2020-11-12 Thread wgcn (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wgcn updated FLINK-20138:
-
Description: 
Our Flink jobs run in YARN per-job mode. We stopped some NodeManager machines, 
and the AMs on those machines restarted on other NodeManagers. We found that some 
jobs could not recover due to a timeout while requesting slots.

*SlotPoolImpl never connected to the ResourceManager*
```
+_
2020-11-09 16:31:31,794   INFO 
flink-akka.actor.default-dispatcher-16 
(org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369)
 - Cannot serve slot request, no ResourceManager connected. Adding as pending 
request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]
_+
```

*1. We did not find any log of the YarnResourceManager requesting containers in 
the attached jobmanager log. 
2. The ZooKeeper node is also shown in the attachment.*



  was:
our flink jobs run on Yarn Perjob Mode. We stoped some nodemanger  machines  
,and   AMs of  the  machines  restarted at other nodemanager.  We found  some 
jobs  can not recover due to  timeout of requiring slots.

SlotPoolImp always did not connect ResourceManager 
```

2020-11-09 16:31:31,794   INFO 
flink-akka.actor.default-dispatcher-16 
(org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369)
 - Cannot serve slot request, no ResourceManager connected. Adding as pending 
request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]

```

1.We did not find  the log of YarnResourceManager requesting container   at the 
jobmanager log of attachment. 
2.The node  of Zookeeper is also  showed at attachment .




> Flink Job can not recover due to  timeout of requiring slots when flink 
> jobmanager restarted
> 
>
> Key: FLINK-20138
> URL: https://issues.apache.org/jira/browse/FLINK-20138
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Table SQL / Runtime
> Environment: flink : 1.9.2
> hadoop :2.7.2
> jdk:1.8
>Reporter: wgcn
>Priority: Major
> Attachments: 2820F7EE-85F9-441D-95D5-8163FB6267DF.png, jobmanager.log
>
>
> our flink jobs run on Yarn Perjob Mode. We stoped some nodemanger  machines  
> ,and   AMs of  the  machines  restarted at other nodemanager.  We found  some 
> jobs  can not recover due to  timeout of requiring slots.
> *SlotPoolImp always did not connect ResourceManager *
> ```
> +_
> 2020-11-09 16:31:31,794   INFO 
> flink-akka.actor.default-dispatcher-16 
> (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369)
>  - Cannot serve slot request, no ResourceManager connected. Adding as pending 
> request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]
> _+
> ```
> *1.We did not find  the log of YarnResourceManager requesting container   at 
> the jobmanager log of attachment. 
> 2.The node  of Zookeeper is also  showed at attachment .*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-20138) Flink Job can not recover due to timeout of requiring slots when flink jobmanager restarted

2020-11-12 Thread wgcn (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wgcn updated FLINK-20138:
-
Description: 
our flink jobs run on Yarn Perjob Mode. We stoped some nodemanger  machines  
,and   AMs of  the  machines  restarted at other nodemanager.  We found  some 
jobs  can not recover due to  timeout of requiring slots.

*SlotPoolImp always did not connect ResourceManager *
```

2020-11-09 16:31:31,794   INFO 
flink-akka.actor.default-dispatcher-16 
(org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369)
 - Cannot serve slot request, no ResourceManager connected. Adding as pending 
request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]

```

*1.We did not find  the log of YarnResourceManager requesting container   at 
the jobmanager log of attachment. 
2.The node  of Zookeeper is also  showed at attachment .*



  was:
our flink jobs run on Yarn Perjob Mode. We stoped some nodemanger  machines  
,and   AMs of  the  machines  restarted at other nodemanager.  We found  some 
jobs  can not recover due to  timeout of requiring slots.

*SlotPoolImp always did not connect ResourceManager *
```
+_
2020-11-09 16:31:31,794   INFO 
flink-akka.actor.default-dispatcher-16 
(org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369)
 - Cannot serve slot request, no ResourceManager connected. Adding as pending 
request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]
_+
```

*1.We did not find  the log of YarnResourceManager requesting container   at 
the jobmanager log of attachment. 
2.The node  of Zookeeper is also  showed at attachment .*




> Flink Job can not recover due to  timeout of requiring slots when flink 
> jobmanager restarted
> 
>
> Key: FLINK-20138
> URL: https://issues.apache.org/jira/browse/FLINK-20138
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Table SQL / Runtime
> Environment: flink : 1.9.2
> hadoop :2.7.2
> jdk:1.8
>Reporter: wgcn
>Priority: Major
> Attachments: 2820F7EE-85F9-441D-95D5-8163FB6267DF.png, jobmanager.log
>
>
> our flink jobs run on Yarn Perjob Mode. We stoped some nodemanger  machines  
> ,and   AMs of  the  machines  restarted at other nodemanager.  We found  some 
> jobs  can not recover due to  timeout of requiring slots.
> *SlotPoolImp always did not connect ResourceManager *
> ```
> 2020-11-09 16:31:31,794   INFO 
> flink-akka.actor.default-dispatcher-16 
> (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369)
>  - Cannot serve slot request, no ResourceManager connected. Adding as pending 
> request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]
> ```
> *1.We did not find  the log of YarnResourceManager requesting container   at 
> the jobmanager log of attachment. 
> 2.The node  of Zookeeper is also  showed at attachment .*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-20138) Flink Job can not recover due to timeout of requiring slots when flink jobmanager restarted

2020-11-12 Thread wgcn (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wgcn updated FLINK-20138:
-
Description: 
our flink jobs run on Yarn Perjob Mode. We stoped some nodemanger  machines  
,and   AMs of  the  machines  restarted at other nodemanager.  We found  some 
jobs  can not recover due to  timeout of requiring slots.

SlotPoolImp always did not connect ResourceManager 
```

2020-11-09 16:31:31,794   INFO 
flink-akka.actor.default-dispatcher-16 
(org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369)
 - Cannot serve slot request, no ResourceManager connected. Adding as pending 
request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]

```

1.We did not find  the log of YarnResourceManager requesting container   at the 
jobmanager log of attachment. 
2.The node  of Zookeeper is also  showed at attachment .



  was:
our flink jobs run on Yarn Perjob Mode. We stoped some nodemanger  machines  
,and   AMs of  the  machines  restarted at other nodemanager.  We found  some 
jobs  can not recover due to  timeout of requiring slots.

SlotPoolImp always did not connect ResourceManager 
```
2020-11-09 16:31:31,794   INFO 
flink-akka.actor.default-dispatcher-16 
(org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369)
 - Cannot serve slot request, no ResourceManager connected. Adding as pending 
request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]
```

1.We did not find  the log of YarnResourceManager requesting container   at the 
jobmanager log of attachment. 
2.The node  of Zookeeper is also  showed at attachment .




> Flink Job can not recover due to  timeout of requiring slots when flink 
> jobmanager restarted
> 
>
> Key: FLINK-20138
> URL: https://issues.apache.org/jira/browse/FLINK-20138
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Table SQL / Runtime
> Environment: flink : 1.9.2
> hadoop :2.7.2
> jdk:1.8
>Reporter: wgcn
>Priority: Major
> Attachments: 2820F7EE-85F9-441D-95D5-8163FB6267DF.png, jobmanager.log
>
>
> our flink jobs run on Yarn Perjob Mode. We stoped some nodemanger  machines  
> ,and   AMs of  the  machines  restarted at other nodemanager.  We found  some 
> jobs  can not recover due to  timeout of requiring slots.
> SlotPoolImp always did not connect ResourceManager 
> ```
> 2020-11-09 16:31:31,794   INFO 
> flink-akka.actor.default-dispatcher-16 
> (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369)
>  - Cannot serve slot request, no ResourceManager connected. Adding as pending 
> request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]
> ```
> 1.We did not find  the log of YarnResourceManager requesting container   at 
> the jobmanager log of attachment. 
> 2.The node  of Zookeeper is also  showed at attachment .



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-20138) Flink Job can not recover due to timeout of requiring slots when flink jobmanager restarted

2020-11-12 Thread wgcn (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wgcn updated FLINK-20138:
-
Attachment: jobmanager.log

> Flink Job can not recover due to  timeout of requiring slots when flink 
> jobmanager restarted
> 
>
> Key: FLINK-20138
> URL: https://issues.apache.org/jira/browse/FLINK-20138
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Table SQL / Runtime
> Environment: flink : 1.9.2
> hadoop :2.7.2
> jdk:1.8
>Reporter: wgcn
>Priority: Major
> Attachments: 2820F7EE-85F9-441D-95D5-8163FB6267DF.png, jobmanager.log
>
>
> our flink jobs run on Yarn Perjob Mode. We stoped some nodemanger  machines  
> ,and   AMs of  the  machines  restarted at other nodemanager.  We found  some 
> jobs  can not recover due to  timeout of requiring slots.
> SlotPoolImp always did not connect ResourceManager 
> ```
> 2020-11-09 16:31:31,794   INFO 
> flink-akka.actor.default-dispatcher-16 
> (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369)
>  - Cannot serve slot request, no ResourceManager connected. Adding as pending 
> request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]
> ```
> 1.We did not find  the log of YarnResourceManager requesting container   at 
> the jobmanager log of attachment. 
> 2.The node  of Zookeeper is also  showed at attachment .



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-20138) Flink Job can not recover due to timeout of requiring slots when flink jobmanager restarted

2020-11-12 Thread wgcn (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wgcn updated FLINK-20138:
-
Attachment: 2820F7EE-85F9-441D-95D5-8163FB6267DF.png

> Flink Job can not recover due to  timeout of requiring slots when flink 
> jobmanager restarted
> 
>
> Key: FLINK-20138
> URL: https://issues.apache.org/jira/browse/FLINK-20138
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Table SQL / Runtime
> Environment: flink : 1.9.2
> hadoop :2.7.2
> jdk:1.8
>Reporter: wgcn
>Priority: Major
> Attachments: 2820F7EE-85F9-441D-95D5-8163FB6267DF.png
>
>
> our flink jobs run on Yarn Perjob Mode. We stoped some nodemanger  machines  
> ,and   AMs of  the  machines  restarted at other nodemanager.  We found  some 
> jobs  can not recover due to  timeout of requiring slots.
> SlotPoolImp always did not connect ResourceManager 
> ```
> 2020-11-09 16:31:31,794   INFO 
> flink-akka.actor.default-dispatcher-16 
> (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369)
>  - Cannot serve slot request, no ResourceManager connected. Adding as pending 
> request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]
> ```
> 1.We did not find  the log of YarnResourceManager requesting container   at 
> the jobmanager log of attachment. 
> 2.The node  of Zookeeper is also  showed at attachment .



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-20138) Flink Job can not recover due to timeout of requiring slots when flink jobmanager restarted

2020-11-12 Thread wgcn (Jira)
wgcn created FLINK-20138:


 Summary: Flink Job can not recover due to  timeout of requiring 
slots when flink jobmanager restarted
 Key: FLINK-20138
 URL: https://issues.apache.org/jira/browse/FLINK-20138
 Project: Flink
  Issue Type: Bug
  Components: Deployment / YARN, Table SQL / Runtime
 Environment: flink : 1.9.2
hadoop :2.7.2
jdk:1.8
Reporter: wgcn
 Attachments: 2820F7EE-85F9-441D-95D5-8163FB6267DF.png

our flink jobs run on Yarn Perjob Mode. We stoped some nodemanger  machines  
,and   AMs of  the  machines  restarted at other nodemanager.  We found  some 
jobs  can not recover due to  timeout of requiring slots.

SlotPoolImp always did not connect ResourceManager 
```
2020-11-09 16:31:31,794   INFO 
flink-akka.actor.default-dispatcher-16 
(org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369)
 - Cannot serve slot request, no ResourceManager connected. Adding as pending 
request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]
```

1.We did not find  the log of YarnResourceManager requesting container   at the 
jobmanager log of attachment. 
2.The node  of Zookeeper is also  showed at attachment .





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19943) Support sink parallelism configuration to ElasticSearch connector

2020-11-03 Thread wgcn (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17225333#comment-17225333
 ] 

wgcn commented on FLINK-19943:
--

I want to have a try. Please assign it to me, [~lzljs3620320].

> Support sink parallelism configuration to ElasticSearch connector
> -
>
> Key: FLINK-19943
> URL: https://issues.apache.org/jira/browse/FLINK-19943
> Project: Flink
>  Issue Type: Sub-task
>  Components: Connectors / ElasticSearch
>Reporter: CloseRiver
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager

2020-07-28 Thread wgcn (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166396#comment-17166396
 ] 

wgcn commented on FLINK-18715:
--

[~trohrmann] it's not suitable for deployment scenarios where we don't have CPU 
isolation.

> add cpu usage metric of  jobmanager/taskmanager  
> -
>
> Key: FLINK-18715
> URL: https://issues.apache.org/jira/browse/FLINK-18715
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Metrics
>Affects Versions: 1.11.1
>Reporter: wgcn
>Priority: Major
> Fix For: 1.12.0, 1.11.2
>
>
> flink process add cpu usage metric,   user can determine that their job  is  
> io bound /cpu bound  ,so that they can increase/decrese cpu core in the 
> container (k8s,yarn). If it's nessary 
> .  you can assign  it  to me  ,I come up with a idea  calculating cpu usage 
> ratio using   ManagementFactory.getRuntimeMXBean().getUptime() and  
> ManagementFactory.getOperatingSystemMXBean().getProcessCpuTime  over a period 
> of time .   it can get a value in single cpu core environment. and user can 
> use the value  to calculate cpu usage ratio by dividing  num of container's 
> cpu core.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager

2020-07-28 Thread wgcn (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166353#comment-17166353
 ] 

wgcn edited comment on FLINK-18715 at 7/28/20, 11:30 AM:
-

[~chesnay] it indeed has a lot of system resource metrics; we talked about CPU 
occupation in a single Flink process.


was (Author: 1026688210):
[~chesnay]   it indeed has a lot of system resources  , we talked about cpu 
occupation in single flink process

> add cpu usage metric of  jobmanager/taskmanager  
> -
>
> Key: FLINK-18715
> URL: https://issues.apache.org/jira/browse/FLINK-18715
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Metrics
>Affects Versions: 1.11.1
>Reporter: wgcn
>Priority: Major
> Fix For: 1.12.0, 1.11.2
>
>
> flink process add cpu usage metric,   user can determine that their job  is  
> io bound /cpu bound  ,so that they can increase/decrese cpu core in the 
> container (k8s,yarn). If it's nessary 
> .  you can assign  it  to me  ,I come up with a idea  calculating cpu usage 
> ratio using   ManagementFactory.getRuntimeMXBean().getUptime() and  
> ManagementFactory.getOperatingSystemMXBean().getProcessCpuTime  over a period 
> of time .   it can get a value in single cpu core environment. and user can 
> use the value  to calculate cpu usage ratio by dividing  num of container's 
> cpu core.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager

2020-07-28 Thread wgcn (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166353#comment-17166353
 ] 

wgcn edited comment on FLINK-18715 at 7/28/20, 11:27 AM:
-

[~chesnay]   it indeed has a lot of system resources  , we talked about cpu 
occupation in single flink process


was (Author: 1026688210):
[~chesnay]   it indeed has a lot of system resources  , we talked able cpu 
occupation in single flink process

> add cpu usage metric of  jobmanager/taskmanager  
> -
>
> Key: FLINK-18715
> URL: https://issues.apache.org/jira/browse/FLINK-18715
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Metrics
>Affects Versions: 1.11.1
>Reporter: wgcn
>Priority: Major
> Fix For: 1.12.0, 1.11.2
>
>
> flink process add cpu usage metric,   user can determine that their job  is  
> io bound /cpu bound  ,so that they can increase/decrese cpu core in the 
> container (k8s,yarn). If it's nessary 
> .  you can assign  it  to me  ,I come up with a idea  calculating cpu usage 
> ratio using   ManagementFactory.getRuntimeMXBean().getUptime() and  
> ManagementFactory.getOperatingSystemMXBean().getProcessCpuTime  over a period 
> of time .   it can get a value in single cpu core environment. and user can 
> use the value  to calculate cpu usage ratio by dividing  num of container's 
> cpu core.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager

2020-07-28 Thread wgcn (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166353#comment-17166353
 ] 

wgcn commented on FLINK-18715:
--

[~chesnay]   it indeed has a lot of system resources  , we talked able cpu 
occupation in single flink process

> add cpu usage metric of  jobmanager/taskmanager  
> -
>
> Key: FLINK-18715
> URL: https://issues.apache.org/jira/browse/FLINK-18715
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Metrics
>Affects Versions: 1.11.1
>Reporter: wgcn
>Priority: Major
> Fix For: 1.12.0, 1.11.2
>
>
> flink process add cpu usage metric,   user can determine that their job  is  
> io bound /cpu bound  ,so that they can increase/decrese cpu core in the 
> container (k8s,yarn). If it's nessary 
> .  you can assign  it  to me  ,I come up with a idea  calculating cpu usage 
> ratio using   ManagementFactory.getRuntimeMXBean().getUptime() and  
> ManagementFactory.getOperatingSystemMXBean().getProcessCpuTime  over a period 
> of time .   it can get a value in single cpu core environment. and user can 
> use the value  to calculate cpu usage ratio by dividing  num of container's 
> cpu core.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager

2020-07-28 Thread wgcn (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166213#comment-17166213
 ] 

wgcn commented on FLINK-18715:
--

Thanks for your concern [~trohrmann]. We use yarn as the resource management 
system, which provides CPU isolation via cgroups, but we cannot get the CPU ratio 
metric of the containers. So we add a metric in the Flink process, and we can 
adjust the vcores according to the metric. Maybe the improvement is suitable for 
yarn :)

> add cpu usage metric of  jobmanager/taskmanager  
> -
>
> Key: FLINK-18715
> URL: https://issues.apache.org/jira/browse/FLINK-18715
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Metrics
>Affects Versions: 1.11.1
>Reporter: wgcn
>Priority: Major
> Fix For: 1.12.0, 1.11.2
>
>
> flink process add cpu usage metric,   user can determine that their job  is  
> io bound /cpu bound  ,so that they can increase/decrese cpu core in the 
> container (k8s,yarn). If it's nessary 
> .  you can assign  it  to me  ,I come up with a idea  calculating cpu usage 
> ratio using   ManagementFactory.getRuntimeMXBean().getUptime() and  
> ManagementFactory.getOperatingSystemMXBean().getProcessCpuTime  over a period 
> of time .   it can get a value in single cpu core environment. and user can 
> use the value  to calculate cpu usage ratio by dividing  num of container's 
> cpu core.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager

2020-07-25 Thread wgcn (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wgcn updated FLINK-18715:
-
Description: 
Add a CPU usage metric to the Flink process, so that users can determine whether 
their job is IO bound or CPU bound and increase/decrease the CPU cores of the 
container (k8s, yarn) accordingly. If it's necessary, you can assign it to me. 
I have an idea: calculate the CPU usage ratio using 
ManagementFactory.getRuntimeMXBean().getUptime() and 
ManagementFactory.getOperatingSystemMXBean().getProcessCpuTime() over a period of 
time. This yields the usage in terms of a single CPU core, and the user can derive 
the usage ratio by dividing by the number of the container's CPU cores.


  was:
flink process add cpu usage metric,   user can determine that their job  is  io 
bound /cpu bound  ,so that they can increase/decrese cpu core in the container 
(k8s,yarn). If it's necessary 
.  you can assign  it  to me 


> add cpu usage metric of  jobmanager/taskmanager  
> -
>
> Key: FLINK-18715
> URL: https://issues.apache.org/jira/browse/FLINK-18715
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Metrics
>Affects Versions: 1.11.1
>Reporter: wgcn
>Priority: Major
> Fix For: 1.12.0, 1.11.2
>
>
> flink process add cpu usage metric,   user can determine that their job  is  
> io bound /cpu bound  ,so that they can increase/decrese cpu core in the 
> container (k8s,yarn). If it's nessary 
> .  you can assign  it  to me  ,I come up with a idea  calculating cpu usage 
> ratio using   ManagementFactory.getRuntimeMXBean().getUptime() and  
> ManagementFactory.getOperatingSystemMXBean().getProcessCpuTime  over a period 
> of time .   it can get a value in single cpu core environment. and user can 
> use the value  to calculate cpu usage ratio by dividing  num of container's 
> cpu core.
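
A minimal sketch of the sampling idea described in this issue, assuming the JVM 
exposes com.sun.management.OperatingSystemMXBean (true on HotSpot); the sampling 
period and the container vcore count are hypothetical example values.

```
import java.lang.management.ManagementFactory;

// Sketch of the proposed metric: sample process CPU time and JVM uptime twice,
// take the deltas, and normalize by the number of container vcores.
public class ProcessCpuUsageSampler {

    public static void main(String[] args) throws InterruptedException {
        com.sun.management.OperatingSystemMXBean os =
                (com.sun.management.OperatingSystemMXBean)
                        ManagementFactory.getOperatingSystemMXBean();

        long cpuTimeBefore = os.getProcessCpuTime();                           // nanoseconds
        long uptimeBefore = ManagementFactory.getRuntimeMXBean().getUptime();  // milliseconds

        Thread.sleep(5_000); // hypothetical sampling period

        long cpuTimeAfter = os.getProcessCpuTime();
        long uptimeAfter = ManagementFactory.getRuntimeMXBean().getUptime();

        double cpuNanos = cpuTimeAfter - cpuTimeBefore;
        double wallNanos = (uptimeAfter - uptimeBefore) * 1_000_000.0;

        // "Cores busy": 1.0 means one fully busy core over the sampling window.
        double coresBusy = cpuNanos / wallNanos;

        int containerVcores = 2; // hypothetical container vcore count
        double usageRatio = coresBusy / containerVcores;

        System.out.printf("cores busy: %.2f, usage ratio of %d vcores: %.2f%n",
                coresBusy, containerVcores, usageRatio);
    }
}
```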



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager

2020-07-25 Thread wgcn (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wgcn updated FLINK-18715:
-
Description: 
flink process add cpu usage metric,   user can determine that their job  is  io 
bound /cpu bound  ,so that they can increase/decrese cpu core in the container 
(k8s,yarn). If it's necessary 
.  you can assign  it  to me 

  was:
flink process add cpu usage metric,   user can determine that their job  is  io 
bound /cpu bound  ,so that they can increase/decrese cpu core in the container 
(k8s,yarn). If it's nessary 
.  you can assign  it  to me 


> add cpu usage metric of  jobmanager/taskmanager  
> -
>
> Key: FLINK-18715
> URL: https://issues.apache.org/jira/browse/FLINK-18715
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Metrics
>Affects Versions: 1.11.1
>Reporter: wgcn
>Priority: Major
> Fix For: 1.12.0, 1.11.2
>
>
> flink process add cpu usage metric,   user can determine that their job  is  
> io bound /cpu bound  ,so that they can increase/decrese cpu core in the 
> container (k8s,yarn). If it's necessary 
> .  you can assign  it  to me 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager

2020-07-25 Thread wgcn (Jira)
wgcn created FLINK-18715:


 Summary: add cpu usage metric of  jobmanager/taskmanager  
 Key: FLINK-18715
 URL: https://issues.apache.org/jira/browse/FLINK-18715
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Metrics
Affects Versions: 1.11.1
Reporter: wgcn
 Fix For: 1.12.0, 1.11.2


flink process add cpu usage metric,   user can determine that their job  is  io 
bound /cpu bound  ,so that they can increase/decrese cpu core in the container 
(k8s,yarn). If it's nessary 
.  you can assign  it  to me 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16360) connector on hive 2.0.1 don't support type conversion from STRING to VARCHAR

2020-03-01 Thread wgcn (Jira)
wgcn created FLINK-16360:


 Summary:  connector on hive 2.0.1 don't  support type conversion 
from STRING to VARCHAR
 Key: FLINK-16360
 URL: https://issues.apache.org/jira/browse/FLINK-16360
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Hive
Affects Versions: 1.10.0
 Environment: os:centos
java: 1.8.0_92

flink :1.10.0

hadoop: 2.7.2

hive:2.0.1

 
Reporter: wgcn
 Attachments: exceptionstack

 It threw an exception when we queried Hive 2.0.1 with Flink 1.10.0.

 Exception stack:

org.apache.flink.runtime.JobException: Recovery is suppressed by 
FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=50, 
backoffTimeMS=1)
 at 
org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:110)
 at 
org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:76)
 at 
org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:192)
 at 
org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:186)
 at 
org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:180)
 at 
org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:484)
 at 
org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:380)
 at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:279)
 at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:194)
 at 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
 at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
 at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
 at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
 at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
 at akka.actor.ActorCell.invoke(ActorCell.scala:561)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
 at akka.dispatch.Mailbox.run(Mailbox.scala:225)
 at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
 at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.io.IOException: java.lang.reflect.InvocationTargetException
 at 
org.apache.flink.orc.shim.OrcShimV200.createRecordReader(OrcShimV200.java:76)
 at 
org.apache.flink.orc.shim.OrcShimV200.createRecordReader(OrcShimV200.java:123)
 at org.apache.flink.orc.OrcSplitReader.(OrcSplitReader.java:73)
 at 
org.apache.flink.orc.OrcColumnarRowSplitReader.(OrcColumnarRowSplitReader.java:55)
 at 
org.apache.flink.orc.OrcSplitReaderUtil.genPartColumnarRowReader(OrcSplitReaderUtil.java:96)
 at 
org.apache.flink.connectors.hive.read.HiveVectorizedOrcSplitReader.(HiveVectorizedOrcSplitReader.java:65)
 at 
org.apache.flink.connectors.hive.read.HiveTableInputFormat.open(HiveTableInputFormat.java:117)
 at 
org.apache.flink.connectors.hive.read.HiveTableInputFormat.open(HiveTableInputFormat.java:56)
 at 
org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction.run(InputFormatSourceFunction.java:85)
 at 
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
 at 
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
 at 
org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:196)
Caused by: java.lang.reflect.InvocationTargetException
 at sun.reflect.GeneratedMethodAccessor37.invoke(Unknown Source)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at 
org.apache.commons.lang3.reflect.MethodUtils.invokeExactMethod(MethodUtils.java:204)
 at 

[jira] [Commented] (FLINK-15350) develop JDBC catalogs to connect to relational databases

2020-01-18 Thread wgcn (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018800#comment-17018800
 ] 

wgcn commented on FLINK-15350:
--

Hi~ [~phoenixjiangnan], I want to know in which scenarios the JDBC catalog will be 
applied in stream/batch mode.

> develop JDBC catalogs to connect to relational databases
> 
>
> Key: FLINK-15350
> URL: https://issues.apache.org/jira/browse/FLINK-15350
> Project: Flink
>  Issue Type: New Feature
>  Components: Connectors / JDBC
>Reporter: Bowen Li
>Assignee: Bowen Li
>Priority: Major
>
> introduce AbastractJDBCCatalog and a set of JDBC catalog implementations to 
> connect Flink to all relational databases.
> Class hierarchy:
> {code:java}
> Catalog API 
> |
> AbstractJDBCCatalog
> |
> PostgresJDBCCatalog, MySqlJDBCCatalog, OracleJDBCCatalog, ...
>  
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15625) flink sql multiple statements syntatic validation supports

2020-01-18 Thread wgcn (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018773#comment-17018773
 ] 

wgcn commented on FLINK-15625:
--

[~jark] thanks for your reply. I will keep an eye on 
[FLINK-12828|https://issues.apache.org/jira/browse/FLINK-12828] and 
[FLINK-12845|https://issues.apache.org/jira/browse/FLINK-12845].

> flink sql multiple statements syntatic validation supports
> --
>
> Key: FLINK-15625
> URL: https://issues.apache.org/jira/browse/FLINK-15625
> Project: Flink
>  Issue Type: Improvement
>  Components: Table SQL / API, Table SQL / Planner
>Reporter: jackylau
>Priority: Major
> Fix For: 1.11.0
>
>
> we konw that blink(blink first commits ) parser and calcite parser all 
> support multiple statements now  and supports multiple statement syntatic 
> validation by calcite, which validates sql statements one by one, and it will 
> not validate the previous tablenames and others. and we only know the sql 
> syntatic error when we submit the flink applications. 
> I think it is eagerly need for users. we hope the flink community to support 
> it 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15625) flink sql multiple statements syntatic validation supports

2020-01-17 Thread wgcn (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017909#comment-17017909
 ] 

wgcn commented on FLINK-15625:
--

hi [~lzljs3620320], we can support a SQL text which contains multiple statements 
(such as 'insert into', 'select', DDL, SQL comments, etc.) in streaming mode, and 
then execute the statements in the order in which they appear.
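
As a rough illustration of the ordering idea above (not an existing Flink 
feature), the sketch below splits a SQL text into individual statements and skips 
full-line '--' comments, so the statements could then be validated and executed 
one by one, e.g. via TableEnvironment#sqlUpdate in Flink 1.10; quoted semicolons 
and block comments are deliberately not handled.

```
import java.util.ArrayList;
import java.util.List;

// Naive splitter for a SQL text containing multiple statements separated by ';'.
// This only sketches the "execute in source order" idea, not a real SQL parser.
public class MultiStatementSplitter {

    public static List<String> split(String sqlText) {
        List<String> statements = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : sqlText.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("--")) {
                continue; // skip blank lines and single-line SQL comments
            }
            current.append(line).append('\n');
            if (trimmed.endsWith(";")) {
                String stmt = current.toString().trim();
                statements.add(stmt.substring(0, stmt.length() - 1)); // drop trailing ';'
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            statements.add(current.toString().trim()); // last statement without ';'
        }
        return statements;
    }

    public static void main(String[] args) {
        String sqlText =
                "-- hypothetical example script\n"
              + "CREATE TABLE sink_table (id INT);\n"
              + "INSERT INTO sink_table SELECT id FROM source_table;";
        // Statements come back in source order and can be validated/executed one by one.
        split(sqlText).forEach(System.out::println);
    }
}
```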

> flink sql multiple statements syntatic validation supports
> --
>
> Key: FLINK-15625
> URL: https://issues.apache.org/jira/browse/FLINK-15625
> Project: Flink
>  Issue Type: Improvement
>  Components: Table SQL / Legacy Planner
>Reporter: jackylau
>Priority: Major
> Fix For: 1.10.0
>
>
> we konw that blink(blink first commits ) parser and calcite parser all 
> support multiple statements now  and supports multiple statement syntatic 
> validation by calcite, which validates sql statements one by one, and it will 
> not validate the previous tablenames and others. and we only know the sql 
> syntatic error when we submit the flink applications. 
> I think it is eagerly need for users. we hope the flink community to support 
> it 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15625) flink sql multiple statements syntatic validation supports

2020-01-16 Thread wgcn (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017738#comment-17017738
 ] 

wgcn commented on FLINK-15625:
--

It's indeed a good improvement, and it does not seem complex to realize. I would 
like to try it.

> flink sql multiple statements syntatic validation supports
> --
>
> Key: FLINK-15625
> URL: https://issues.apache.org/jira/browse/FLINK-15625
> Project: Flink
>  Issue Type: Improvement
>  Components: Table SQL / Legacy Planner
>Reporter: jackylau
>Priority: Major
> Fix For: 1.10.0
>
>
> we konw that blink(blink first commits ) parser and calcite parser all 
> support multiple statements now  and supports multiple statement syntatic 
> validation by calcite, which validates sql statements one by one, and it will 
> not validate the previous tablenames and others. and we only know the sql 
> syntatic error when we submit the flink applications. 
> I think it is eagerly need for users. we hope the flink community to support 
> it 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-12728) taskmanager container can't launch on nodemanager machine because of kerberos

2019-07-19 Thread wgcn (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wgcn closed FLINK-12728.

Resolution: Resolved

*in core-site.xml*

<property>
  <name>hadoop.proxyuser.yarn.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.yarn.groups</name>
  <value>*</value>
</property>

*in yarn-site.xml*

<property>
  <name>yarn.resourcemanager.proxy-user-privileges.enabled</name>
  <value>true</value>
</property>


>   taskmanager  container  can't  launch  on nodemanager machine because of 
> kerberos
> ---
>
> Key: FLINK-12728
> URL: https://issues.apache.org/jira/browse/FLINK-12728
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Affects Versions: 1.7.2
> Environment: linux 
> jdk8
> hadoop 2.7.2
> flink 1.7.2
>Reporter: wgcn
>Priority: Major
> Attachments: AM.log, NM.log
>
>
>     job can't restart when flink  job  has been running for a long time and 
> then taskmanager restarting   ,i find log in AM   that  AM  request 
> containers  taskmanager  all the time .      the  log in NodeManager show 
> that  the new requested containers can't  downloading file from hdfs  because 
> of kerberos . I  configed the keytab config that
> security.kerberos.login.use-ticket-cache: false
>  security.kerberos.login.keytab: /data/sysdir/knit/user/.flink.keytab
>  security.kerberos.login.principal: 
>  at  flink-client machine  and  keytab  is exist.  
> I showed the logs at AM and NodeManager below.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-12728) taskmanager container can't launch on nodemanager machine because of kerberos

2019-07-19 Thread wgcn (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888705#comment-16888705
 ] 

wgcn commented on FLINK-12728:
--

hi Tao Yang, Andrey Zagrebin, it does work after setting 
yarn.resourcemanager.proxy-user-privileges.enabled=true, 
hadoop.proxyuser.yarn.hosts=* and hadoop.proxyuser.yarn.groups=*; it may be useful 
for other users if this is recorded in the flink documentation.

>   taskmanager  container  can't  launch  on nodemanager machine because of 
> kerberos
> ---
>
> Key: FLINK-12728
> URL: https://issues.apache.org/jira/browse/FLINK-12728
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Affects Versions: 1.7.2
> Environment: linux 
> jdk8
> hadoop 2.7.2
> flink 1.7.2
>Reporter: wgcn
>Priority: Major
> Attachments: AM.log, NM.log
>
>
>     job can't restart when flink  job  has been running for a long time and 
> then taskmanager restarting   ,i find log in AM   that  AM  request 
> containers  taskmanager  all the time .      the  log in NodeManager show 
> that  the new requested containers can't  downloading file from hdfs  because 
> of kerberos . I  configed the keytab config that
> security.kerberos.login.use-ticket-cache: false
>  security.kerberos.login.keytab: /data/sysdir/knit/user/.flink.keytab
>  security.kerberos.login.principal: 
>  at  flink-client machine  and  keytab  is exist.  
> I showed the logs at AM and NodeManager below.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-12728) taskmanager container can't launch on nodemanager machine because of kerberos

2019-07-10 Thread wgcn (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881895#comment-16881895
 ] 

wgcn commented on FLINK-12728:
--

hi Andrey Zagrebin, the issue did happen again after yarn restarted; we are 
looking for the reason. We will report back once we have resolved it. Thanks for 
your concern.

>   taskmanager  container  can't  launch  on nodemanager machine because of 
> kerberos
> ---
>
> Key: FLINK-12728
> URL: https://issues.apache.org/jira/browse/FLINK-12728
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Affects Versions: 1.7.2
> Environment: linux 
> jdk8
> hadoop 2.7.2
> flink 1.7.2
>Reporter: wgcn
>Priority: Major
> Attachments: AM.log, NM.log
>
>
>     job can't restart when flink  job  has been running for a long time and 
> then taskmanager restarting   ,i find log in AM   that  AM  request 
> containers  taskmanager  all the time .      the  log in NodeManager show 
> that  the new requested containers can't  downloading file from hdfs  because 
> of kerberos . I  configed the keytab config that
> security.kerberos.login.use-ticket-cache: false
>  security.kerberos.login.keytab: /data/sysdir/knit/user/.flink.keytab
>  security.kerberos.login.principal: 
>  at  flink-client machine  and  keytab  is exist.  
> I showed the logs at AM and NodeManager below.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12728) taskmanager container can't launch on nodemanager machine because of kerberos

2019-06-05 Thread wgcn (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wgcn updated FLINK-12728:
-
Description: 
    The job can't restart after the Flink job has been running for a long time and 
the taskmanager restarts. I find in the AM log that the AM requests taskmanager 
containers all the time. The NodeManager log shows that the newly requested 
containers can't download files from hdfs because of kerberos. I configured the 
keytab settings

security.kerberos.login.use-ticket-cache: false
 security.kerberos.login.keytab: /data/sysdir/knit/user/.flink.keytab
 security.kerberos.login.principal: 

 on the flink-client machine, and the keytab exists.

I show the logs of the AM and NodeManager below.

 

 

 

 

  was:
    job can't restart when flink  job  has been running for a long time and 
then taskmanager restarting   ,i find log in AM   that  AM  request containers  
taskmanager  all the time .      the  log in NodeManager show that  the new 
requested containers can't  downloading file from hdfs  because of kerberos . I 
 configed the keytab config that

security.kerberos.login.use-ticket-cache: false
 security.kerberos.login.keytab: /data/sysdir/knit/user/.flink.keytab
 security.kerberos.login.principal: 
[flink/client-docker-201-53.hadoop.lq@HADOOP.LQ2. 
|mailto:flink/client-docker-201-53.hadoop.lq@HADOOP.LQ2.]

 at  flink-client machine  and  keytab  is exist.  

I showed the logs at AM and NodeManager below.

 

 

 

 


>   taskmanager  container  can't  launch  on nodemanager machine because of 
> kerberos
> ---
>
> Key: FLINK-12728
> URL: https://issues.apache.org/jira/browse/FLINK-12728
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Affects Versions: 1.7.2
> Environment: linux 
> jdk8
> hadoop 2.7.2
> flink 1.7.2
>Reporter: wgcn
>Priority: Major
> Attachments: AM.log, NM.log
>
>
>     job can't restart when flink  job  has been running for a long time and 
> then taskmanager restarting   ,i find log in AM   that  AM  request 
> containers  taskmanager  all the time .      the  log in NodeManager show 
> that  the new requested containers can't  downloading file from hdfs  because 
> of kerberos . I  configed the keytab config that
> security.kerberos.login.use-ticket-cache: false
>  security.kerberos.login.keytab: /data/sysdir/knit/user/.flink.keytab
>  security.kerberos.login.principal: 
>  at  flink-client machine  and  keytab  is exist.  
> I showed the logs at AM and NodeManager below.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12728) taskmanager container can't launch on nodemanager machine because of kerberos

2019-06-05 Thread wgcn (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856676#comment-16856676
 ] 

wgcn commented on FLINK-12728:
--

thanks, Tao Yang. It is indeed set to false. We will try it.

>   taskmanager  container  can't  launch  on nodemanager machine because of 
> kerberos
> ---
>
> Key: FLINK-12728
> URL: https://issues.apache.org/jira/browse/FLINK-12728
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Affects Versions: 1.7.2
> Environment: linux 
> jdk8
> hadoop 2.7.2
> flink 1.7.2
>Reporter: wgcn
>Priority: Major
> Attachments: AM.log, NM.log
>
>
>     job can't restart when flink  job  has been running for a long time and 
> then taskmanager restarting   ,i find log in AM   that  AM  request 
> containers  taskmanager  all the time .      the  log in NodeManager show 
> that  the new requested containers can't  downloading file from hdfs  because 
> of kerberos . I  configed the keytab config that
> security.kerberos.login.use-ticket-cache: false
>  security.kerberos.login.keytab: /data/sysdir/knit/user/.flink.keytab
>  security.kerberos.login.principal: 
> [flink/client-docker-201-53.hadoop.lq@HADOOP.LQ2. 
> |mailto:flink/client-docker-201-53.hadoop.lq@HADOOP.LQ2.]
>  at  flink-client machine  and  keytab  is exist.  
> I showed the logs at AM and NodeManager below.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12728) taskmanager container can't launch on nodemanager machine because of kerberos

2019-06-04 Thread wgcn (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wgcn updated FLINK-12728:
-
Description: 
    job can't restart when flink  job  has been running for a long time and 
then taskmanager restarting   ,i find log in AM   that  AM  request containers  
taskmanager  all the time .      the  log in NodeManager show that  the new 
requested containers can't  downloading file from hdfs  because of kerberos . I 
 configed the keytab config that

security.kerberos.login.use-ticket-cache: false
 security.kerberos.login.keytab: /data/sysdir/knit/user/.flink.keytab
 security.kerberos.login.principal: 
[flink/client-docker-201-53.hadoop.lq@HADOOP.LQ2. 
|mailto:flink/client-docker-201-53.hadoop.lq@HADOOP.LQ2.]

 at  flink-client machine  and  keytab  is exist.  

I showed the logs at AM and NodeManager below.

 

 

 

 

  was:
    job can't restart when flink  job  has been running for a long time and 
then taskmanager restarting   ,i find log in AM   that  AM  request containers  
taskmanager  all the time . log in NodeManager show that  the new requested 
containers can't  downloading file from hdfs  because of kerberos . I  configed 
the keytab config that

security.kerberos.login.use-ticket-cache: false
 security.kerberos.login.keytab: /data/sysdir/knit/user/.flink.keytab
 security.kerberos.login.principal: 
[flink/client-docker-201-53.hadoop.lq@HADOOP.LQ2. 
|mailto:flink/client-docker-201-53.hadoop.lq@HADOOP.LQ2.]

 at  flink-client machine  and  keytab  is exist.  

I showed the logs at AM and NodeManager below.

 

 

 

 


>   taskmanager  container  can't  launch  on nodemanager machine because of 
> kerberos
> ---
>
> Key: FLINK-12728
> URL: https://issues.apache.org/jira/browse/FLINK-12728
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Affects Versions: 1.7.2
> Environment: linux 
> jdk8
> hadoop 2.7.2
> flink 1.7.2
>Reporter: wgcn
>Priority: Major
> Attachments: AM.log, NM.log
>
>
>     job can't restart when flink  job  has been running for a long time and 
> then taskmanager restarting   ,i find log in AM   that  AM  request 
> containers  taskmanager  all the time .      the  log in NodeManager show 
> that  the new requested containers can't  downloading file from hdfs  because 
> of kerberos . I  configed the keytab config that
> security.kerberos.login.use-ticket-cache: false
>  security.kerberos.login.keytab: /data/sysdir/knit/user/.flink.keytab
>  security.kerberos.login.principal: 
> [flink/client-docker-201-53.hadoop.lq@HADOOP.LQ2. 
> |mailto:flink/client-docker-201-53.hadoop.lq@HADOOP.LQ2.]
>  at  flink-client machine  and  keytab  is exist.  
> I showed the logs at AM and NodeManager below.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12728) taskmanager container can't launch on nodemanager machine because of kerberos

2019-06-04 Thread wgcn (JIRA)
wgcn created FLINK-12728:


 Summary:   taskmanager  container  can't  launch  on nodemanager 
machine because of kerberos
 Key: FLINK-12728
 URL: https://issues.apache.org/jira/browse/FLINK-12728
 Project: Flink
  Issue Type: Bug
  Components: Deployment / YARN
Affects Versions: 1.7.2
 Environment: linux 

jdk8

hadoop 2.7.2

flink 1.7.2
Reporter: wgcn
 Attachments: AM.log, NM.log

    job can't restart when flink  job  has been running for a long time and 
then taskmanager restarting   ,i find log in AM   that  AM  request containers  
taskmanager  all the time . log in NodeManager show that  the new requested 
containers can't  downloading file from hdfs  because of kerberos . I  configed 
the keytab config that

security.kerberos.login.use-ticket-cache: false
 security.kerberos.login.keytab: /data/sysdir/knit/user/.flink.keytab
 security.kerberos.login.principal: 
[flink/client-docker-201-53.hadoop.lq@HADOOP.LQ2. 
|mailto:flink/client-docker-201-53.hadoop.lq@HADOOP.LQ2.]

 at  flink-client machine  and  keytab  is exist.  

I showed the logs at AM and NodeManager below.

 

 

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10884) Flink on yarn TM container will be killed by nodemanager because of the exceeded physical memory.

2018-11-28 Thread wgcn (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701574#comment-16701574
 ] 

wgcn commented on FLINK-10884:
--

https://github.com/apache/flink/pull/7185

> Flink on yarn  TM container will be killed by nodemanager because of  the 
> exceeded  physical memory.
> 
>
> Key: FLINK-10884
> URL: https://issues.apache.org/jira/browse/FLINK-10884
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, Core
>Affects Versions: 1.5.5, 1.6.2, 1.7.0
> Environment: version  : 1.6.2 
> module : flink on yarn
> centos  jdk1.8
> hadoop 2.7
>Reporter: wgcn
>Assignee: wgcn
>Priority: Major
>  Labels: pull-request-available, yarn
>
> TM container will be killed by nodemanager because of the exceeded physical
> memory. I found in the launch context for launching the TM container that
> "container memory = heap memory + offHeapSizeMB" in the class
> org.apache.flink.runtime.clusterframework.ContaineredTaskManagerParameters
> (lines 160 to 166). I set a safety margin for the whole container memory
> usage. For example, if the container limit is 3g of memory, the sum
> "heap memory + offHeapSizeMB" is set to 2.4g to prevent the container from
> being killed. Do we have a ready-made solution, or can I commit my solution?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10884) Flink on yarn TM container will be killed by nodemanager because of the exceeded physical memory.

2018-11-25 Thread wgcn (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698537#comment-16698537
 ] 

wgcn commented on FLINK-10884:
--

Thanks for your suggestion :D

> Flink on yarn  TM container will be killed by nodemanager because of  the 
> exceeded  physical memory.
> 
>
> Key: FLINK-10884
> URL: https://issues.apache.org/jira/browse/FLINK-10884
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, Core
>Affects Versions: 1.5.5, 1.6.2, 1.7.0
> Environment: version  : 1.6.2 
> module : flink on yarn
> centos  jdk1.8
> hadoop 2.7
>Reporter: wgcn
>Assignee: wgcn
>Priority: Major
>  Labels: yarn
>
> TM container will be killed by nodemanager because of the exceeded physical
> memory. I found in the launch context for launching the TM container that
> "container memory = heap memory + offHeapSizeMB" in the class
> org.apache.flink.runtime.clusterframework.ContaineredTaskManagerParameters
> (lines 160 to 166). I set a safety margin for the whole container memory
> usage. For example, if the container limit is 3g of memory, the sum
> "heap memory + offHeapSizeMB" is set to 2.4g to prevent the container from
> being killed. Do we have a ready-made solution, or can I commit my solution?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10884) Flink on yarn TM container will be killed by nodemanager because of the exceeded physical memory.

2018-11-24 Thread wgcn (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698043#comment-16698043
 ] 

wgcn commented on FLINK-10884:
--

I think the memory assigned to the JVM in a container contains more than just
"heap memory" and "off-heap memory". Do you mean that offHeapSizeMB should be
assigned to "(containerMemoryMB - heapSizeMB) * (1 - cutoffFactor)"?
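
For illustration, a minimal sketch of that proposed split, assuming a
hypothetical cutoffFactor value; this is not the actual
ContaineredTaskManagerParameters code:

    // Sketch of the proposed split: keep part of the non-heap budget as
    // headroom instead of handing all of it to MaxDirectMemorySize.
    public class ContainerMemorySketch {

        /** Returns {heapSizeMB, offHeapSizeMB} for the proposed formula. */
        static long[] split(long containerMemoryMB, long heapSizeMB, double cutoffFactor) {
            long offHeapSizeMB =
                    (long) ((containerMemoryMB - heapSizeMB) * (1 - cutoffFactor));
            return new long[] {heapSizeMB, offHeapSizeMB};
        }

        public static void main(String[] args) {
            // Example: 3 GB container, 2 GB heap, 20% cutoff on the remainder.
            long[] sizes = split(3072, 2048, 0.2);
            System.out.println("-Xmx" + sizes[0] + "m -XX:MaxDirectMemorySize=" + sizes[1] + "m");
            // heap + off-heap = 2048 + 819 = 2867 MB, leaving ~205 MB of headroom
            // below the 3072 MB container limit.
        }
    }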

> Flink on yarn  TM container will be killed by nodemanager because of  the 
> exceeded  physical memory.
> 
>
> Key: FLINK-10884
> URL: https://issues.apache.org/jira/browse/FLINK-10884
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, Core
>Affects Versions: 1.5.5, 1.6.2, 1.7.0
> Environment: version  : 1.6.2 
> module : flink on yarn
> centos  jdk1.8
> hadoop 2.7
>Reporter: wgcn
>Assignee: wgcn
>Priority: Major
>  Labels: yarn
>
> TM container will be killed by nodemanager because of the exceeded physical
> memory. I found in the launch context for launching the TM container that
> "container memory = heap memory + offHeapSizeMB" in the class
> org.apache.flink.runtime.clusterframework.ContaineredTaskManagerParameters
> (lines 160 to 166). I set a safety margin for the whole container memory
> usage. For example, if the container limit is 3g of memory, the sum
> "heap memory + offHeapSizeMB" is set to 2.4g to prevent the container from
> being killed. Do we have a ready-made solution, or can I commit my solution?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10884) Flink on yarn TM container will be killed by nodemanager because of the exceeded physical memory.

2018-11-20 Thread wgcn (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693129#comment-16693129
 ] 

wgcn commented on FLINK-10884:
--

I found a unit test showing that, by default, the off-heap memory is set to
what the network buffers require.
[https://github.com/apache/flink/blob/f629b05a2cdc2c07f4a19456cf5b3e5fdd6ff607/flink-runtime/src/test/java/org/apache/flink/runtime/clusterframework/ContaineredTaskManagerParametersTest.java#L37]

I guess the author of the code did that deliberately; I don't know why.

 

> Flink on yarn  TM container will be killed by nodemanager because of  the 
> exceeded  physical memory.
> 
>
> Key: FLINK-10884
> URL: https://issues.apache.org/jira/browse/FLINK-10884
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, Core
>Affects Versions: 1.5.5, 1.6.2, 1.7.0
> Environment: version  : 1.6.2 
> module : flink on yarn
> centos  jdk1.8
> hadoop 2.7
>Reporter: wgcn
>Assignee: wgcn
>Priority: Major
>  Labels: yarn
>
> TM container will be killed by nodemanager because of the exceeded physical
> memory. I found in the launch context for launching the TM container that
> "container memory = heap memory + offHeapSizeMB" in the class
> org.apache.flink.runtime.clusterframework.ContaineredTaskManagerParameters
> (lines 160 to 166). I set a safety margin for the whole container memory
> usage. For example, if the container limit is 3g of memory, the sum
> "heap memory + offHeapSizeMB" is set to 2.4g to prevent the container from
> being killed. Do we have a ready-made solution, or can I commit my solution?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10884) Flink on yarn TM container will be killed by nodemanager because of the exceeded physical memory.

2018-11-17 Thread wgcn (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690502#comment-16690502
 ] 

wgcn commented on FLINK-10884:
--

Cool. It's my first experience in the community. I will fix it. Thank you.

> Flink on yarn  TM container will be killed by nodemanager because of  the 
> exceeded  physical memory.
> 
>
> Key: FLINK-10884
> URL: https://issues.apache.org/jira/browse/FLINK-10884
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, Core
>Affects Versions: 1.5.5, 1.6.2, 1.7.0
> Environment: version  : 1.6.2 
> module : flink on yarn
> centos  jdk1.8
> hadoop 2.7
>Reporter: wgcn
>Priority: Major
>  Labels: yarn
>
> TM container will be killed by nodemanager because of the exceeded physical
> memory. I found in the launch context for launching the TM container that
> "container memory = heap memory + offHeapSizeMB" in the class
> org.apache.flink.runtime.clusterframework.ContaineredTaskManagerParameters
> (lines 160 to 166). I set a safety margin for the whole container memory
> usage. For example, if the container limit is 3g of memory, the sum
> "heap memory + offHeapSizeMB" is set to 2.4g to prevent the container from
> being killed. Do we have a ready-made solution, or can I commit my solution?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10884) Flink on yarn TM container will be killed by nodemanager because of the exceeded physical memory.

2018-11-15 Thread wgcn (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689067#comment-16689067
 ] 

wgcn commented on FLINK-10884:
--

Thanks for taking a look. I found the configuration option, but because of the
code design "offHeapSizeMB = containerMemoryMB - heapSizeMB", the option doesn't
help with the exceeded physical memory: "heap memory" + "MaxDirectMemorySize"
is still equal to the container memory when the ApplicationMaster creates the
TM's launching script. I don't know why the code is designed like this.
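
To make the effect concrete, here is a rough sketch of the launch parameters
under that design; the native-overhead figure is an assumption for
illustration, and this is not the actual ApplicationMaster code:

    // Why the process can exceed its container limit when -Xmx and
    // -XX:MaxDirectMemorySize together already add up to the full container size.
    public class LaunchContextSketch {
        public static void main(String[] args) {
            long containerMemoryMB = 3072;                       // YARN container limit
            long heapSizeMB = 2048;                              // -Xmx
            long offHeapSizeMB = containerMemoryMB - heapSizeMB; // current design

            System.out.println("-Xmx" + heapSizeMB + "m -XX:MaxDirectMemorySize="
                    + offHeapSizeMB + "m");

            // Assumed extra native memory (metaspace, thread stacks, JIT, GC
            // structures) covered by neither -Xmx nor MaxDirectMemorySize.
            long assumedNativeOverheadMB = 300;
            long worstCaseRssMB = heapSizeMB + offHeapSizeMB + assumedNativeOverheadMB;
            System.out.println("worst-case RSS ~ " + worstCaseRssMB
                    + " MB > container limit " + containerMemoryMB
                    + " MB -> NodeManager may kill the container");
        }
    }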

 

> Flink on yarn  TM container will be killed by nodemanager because of  the 
> exceeded  physical memory.
> 
>
> Key: FLINK-10884
> URL: https://issues.apache.org/jira/browse/FLINK-10884
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, Core
>Affects Versions: 1.6.2
> Environment: version  : 1.6.2 
> module : flink on yarn
> centos  jdk1.8
> hadoop 2.7
>Reporter: wgcn
>Priority: Major
>  Labels: yarn
>
> TM container will be killed by nodemanager because of the exceeded physical
> memory. I found in the launch context for launching the TM container that
> "container memory = heap memory + offHeapSizeMB" in the class
> org.apache.flink.runtime.clusterframework.ContaineredTaskManagerParameters
> (lines 160 to 166). I set a safety margin for the whole container memory
> usage. For example, if the container limit is 3g of memory, the sum
> "heap memory + offHeapSizeMB" is set to 2.4g to prevent the container from
> being killed. Do we have a ready-made solution, or can I commit my solution?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10884) Flink on yarn TM container will be killed by nodemanager because of the exceeded physical memory.

2018-11-14 Thread wgcn (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wgcn updated FLINK-10884:
-
Labels: yarn  (was: )

> Flink on yarn  TM container will be killed by nodemanager because of  the 
> exceeded  physical memory.
> 
>
> Key: FLINK-10884
> URL: https://issues.apache.org/jira/browse/FLINK-10884
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, Core
>Affects Versions: 1.6.2
> Environment: version  : 1.6.2 
> module : flink on yarn
> centos  jdk1.8
> hadoop 2.7
>Reporter: wgcn
>Priority: Major
>  Labels: yarn
>
> TM container will be killed by nodemanager because of the exceeded physical
> memory. I found in the launch context for launching the TM container that
> "container memory = heap memory + offHeapSizeMB" in the class
> org.apache.flink.runtime.clusterframework.ContaineredTaskManagerParameters
> (lines 160 to 166). I set a safety margin for the whole container memory
> usage. For example, if the container limit is 3g of memory, the sum
> "heap memory + offHeapSizeMB" is set to 2.4g to prevent the container from
> being killed. Do we have a ready-made solution, or can I commit my solution?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10884) Flink on yarn TM container will be killed by nodemanager because of the exceeded physical memory.

2018-11-14 Thread wgcn (JIRA)
wgcn created FLINK-10884:


 Summary: Flink on yarn  TM container will be killed by nodemanager 
because of  the exceeded  physical memory.
 Key: FLINK-10884
 URL: https://issues.apache.org/jira/browse/FLINK-10884
 Project: Flink
  Issue Type: Bug
  Components: Cluster Management, Core
Affects Versions: 1.6.2
 Environment: version  : 1.6.2 

module : flink on yarn

centos  jdk1.8

hadoop 2.7
Reporter: wgcn


TM container will be killed by nodemanager because of the exceeded physical
memory. I found in the launch context for launching the TM container that
"container memory = heap memory + offHeapSizeMB" in the class
org.apache.flink.runtime.clusterframework.ContaineredTaskManagerParameters
(lines 160 to 166). I set a safety margin for the whole container memory
usage. For example, if the container limit is 3g of memory, the sum
"heap memory + offHeapSizeMB" is set to 2.4g to prevent the container from
being killed. Do we have a ready-made solution, or can I commit my solution?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-5770) Flink yarn session stop in non-detached model

2018-10-30 Thread wgcn (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16668448#comment-16668448
 ] 

wgcn commented on FLINK-5770:
-

Your client may have shut down; you can run the session in detached mode with
the -d argument (e.g. ./yarn-session.sh -n 2 -d).

> Flink yarn session stop in non-detached model
> -
>
> Key: FLINK-5770
> URL: https://issues.apache.org/jira/browse/FLINK-5770
> Project: Flink
>  Issue Type: Bug
>  Components: Client
>Affects Versions: 1.2.0
> Environment: 1. the cluster contains 4 nodes;
> 2. every node has 380 GB of memory, and the CPU has 40 cores;
> 3. the OS is CentOS 7.2;
>Reporter: zhangrucong1982
>Priority: Major
>
> 1. I use a recent version of Flink, and use Flink in security mode without
> HA. The configurations in flink-conf.yaml are:
> security.kerberos.login.keytab: 
> /home/demo/flink/release/flink-1.2.2/keytab/huawei1.keytab
> security.kerberos.login.principal: huawei1
> security.kerberos.login.contexts: Client,KafkaClient
> 2. Then I use the command ./yarn-session.sh -n 2 to start the cluster with
> two taskmanagers.
> 3. But about 4 hours later, the session shuts down by itself. The error
> stack is the following:
> 2017-02-07 19:27:30,841 WARN  akka.remote.ReliableDeliverySupervisor  
>   - Association with remote system 
> [akka.tcp://flink@9-96-101-251:38650] has failed, address is now gated for 
> [5000] ms. Reason: [Disassociated]
> 2017-02-07 19:27:42,804 WARN  org.apache.flink.yarn.cli.FlinkYarnSessionCli   
>   - Exception while running the interactive command line interface
> java.lang.RuntimeException: Unable to get ClusterClient status from 
> Application Client
> at 
> org.apache.flink.yarn.YarnClusterClient.getClusterStatus(YarnClusterClient.java:248)
> at 
> org.apache.flink.yarn.cli.FlinkYarnSessionCli.runInteractiveCli(FlinkYarnSessionCli.java:410)
> at 
> org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:663)
> at 
> org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:476)
> at 
> org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:473)
> at 
> org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
> at 
> org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:473)
> Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: 
> Could not retrieve the leader gateway
> at 
> org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:142)
> at 
> org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:691)
> at 
> org.apache.flink.yarn.YarnClusterClient.getClusterStatus(YarnClusterClient.java:243)
> ... 10 more
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after 
> [1 milliseconds]
> at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
> at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:190)
> at scala.concurrent.Await.result(package.scala)
> at 
> org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:140)
> ... 12 more
> 4. The detailed log can be seen at the following:
> https://docs.google.com/document/d/1mbxrCy6mHHFxcxPv8f7CCA3BI1QVGPeNiHxUQhuZP0o/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)