[jira] [Created] (FLINK-15372) Use shorter config keys for FLIP-49 total memory config options

2019-12-23 Thread Xintong Song (Jira)
Xintong Song created FLINK-15372:


 Summary: Use shorter config keys for FLIP-49 total memory config 
options
 Key: FLINK-15372
 URL: https://issues.apache.org/jira/browse/FLINK-15372
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Configuration
Reporter: Xintong Song
 Fix For: 1.10.0


We propose to use shorter keys for the total Flink / process memory config options, 
to make them less clumsy without losing expressiveness.

To be specific, we propose to:
* Change the config option key "taskmanager.memory.total-flink.size" to 
"taskmanager.memory.flink.size"
* Change the config option key "taskmanager.memory.total-process.size" to 
"taskmanager.memory.process.size"
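
For illustration, the rename in {{flink-conf.yaml}} would look like this (the sizes are made-up example values, not recommended settings):

```yaml
# Before (current FLIP-49 keys):
# taskmanager.memory.total-flink.size: 1536m
# taskmanager.memory.total-process.size: 1728m

# After (proposed shorter keys):
taskmanager.memory.flink.size: 1536m
taskmanager.memory.process.size: 1728m
```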

Detailed discussion can be found in this [ML 
thread|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Some-feedback-after-trying-out-the-new-FLIP-49-memory-configurations-td36129.html].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15367) Handle backwards compatibility of "taskmanager.heap.size" differently for standalone / active setups

2019-12-23 Thread Xintong Song (Jira)
Xintong Song created FLINK-15367:


 Summary: Handle backwards compatibility of "taskmanager.heap.size" 
differently for standalone / active setups
 Key: FLINK-15367
 URL: https://issues.apache.org/jira/browse/FLINK-15367
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Configuration
Reporter: Xintong Song
 Fix For: 1.10.0


Previously, "taskmanager.heap.size" was interpreted differently when calculating TM 
memory sizes on standalone / active setups. To fully align with the previous 
behavior, we need to map this deprecated key to 
"taskmanager.memory.flink.size" for standalone setups and 
"taskmanager.memory.process.size" for active setups.
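
The intended mapping can be sketched as follows (a minimal illustration; the class and method names are assumptions, not Flink's actual code):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: map the deprecated key to a different new key depending on whether
// this is a standalone or an active (e.g. Yarn) setup.
public class LegacyHeapKeyMapping {
    static final String LEGACY_KEY = "taskmanager.heap.size";

    /** Returns the effective config with the deprecated key resolved per setup. */
    static Map<String, String> resolve(Map<String, String> conf, boolean standalone) {
        Map<String, String> resolved = new HashMap<>(conf);
        String legacy = resolved.remove(LEGACY_KEY);
        if (legacy != null) {
            String newKey = standalone
                    ? "taskmanager.memory.flink.size"    // standalone: total Flink memory
                    : "taskmanager.memory.process.size"; // active: total process memory
            resolved.putIfAbsent(newKey, legacy);        // an explicit new key still wins
        }
        return resolved;
    }
}
```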

Detailed discussion can be found in this [ML 
thread|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Some-feedback-after-trying-out-the-new-FLIP-49-memory-configurations-td36129.html].





[jira] [Created] (FLINK-15371) Change FLIP-49 memory configurations to use the new memory type config options

2019-12-23 Thread Xintong Song (Jira)
Xintong Song created FLINK-15371:


 Summary: Change FLIP-49 memory configurations to use the new 
memory type config options
 Key: FLINK-15371
 URL: https://issues.apache.org/jira/browse/FLINK-15371
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Configuration
Reporter: Xintong Song
 Fix For: 1.10.0


FLIP-49 memory configurations can leverage the new strongly typed ConfigOption, 
to make validation automatic and avoid breaking the options later.
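
The benefit can be sketched generically (this is an illustration of the idea, not Flink's actual ConfigOption API; all names here are made up): a typed option parses and validates its value once, where the option is read, instead of wherever the raw string happens to be used.

```java
import java.util.Map;
import java.util.function.Function;

// Sketch of a strongly typed config option: the parser both converts and
// validates, so every read site gets a checked, typed value automatically.
public class TypedOption<T> {
    final String key;
    final T defaultValue;
    final Function<String, T> parser; // throws on malformed input

    TypedOption(String key, T defaultValue, Function<String, T> parser) {
        this.key = key;
        this.defaultValue = defaultValue;
        this.parser = parser;
    }

    T get(Map<String, String> conf) {
        String raw = conf.get(key);
        return raw == null ? defaultValue : parser.apply(raw);
    }
}
```

A memory option would then carry a memory-size parser instead of being read as a raw string and parsed ad hoc at each use site.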

Detailed discussion can be found in this [ML 
thread|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Some-feedback-after-trying-out-the-new-FLIP-49-memory-configurations-td36129.html].





[jira] [Created] (FLINK-15403) 'State Migration end-to-end test from 1.6' is unstable on travis.

2019-12-26 Thread Xintong Song (Jira)
Xintong Song created FLINK-15403:


 Summary: 'State Migration end-to-end test from 1.6' is unstable on 
travis.
 Key: FLINK-15403
 URL: https://issues.apache.org/jira/browse/FLINK-15403
 Project: Flink
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.10.0
Reporter: Xintong Song
 Fix For: 1.10.0


https://api.travis-ci.org/v3/job/629576631/log.txt

The test case fails because the log contains the following error message.
{code}
2019-12-26 09:19:35,537 ERROR 
org.apache.flink.streaming.runtime.tasks.StreamTask   - Received 
CancelTaskException while we are not canceled. This is a bug and should be 
reported
org.apache.flink.runtime.execution.CancelTaskException: Consumed partition 
PipelinedSubpartitionView(index: 0) of ResultPartition 
3886657fb8cc980139fac67e32d6e380@8cfcbe851fe3bb3fa00e9afc370bd963 has been 
released.
at 
org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.getNextBuffer(LocalInputChannel.java:190)
at 
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.waitAndGetNextData(SingleInputGate.java:509)
at 
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:487)
at 
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.pollNext(SingleInputGate.java:475)
at 
org.apache.flink.runtime.taskmanager.InputGateWithMetrics.pollNext(InputGateWithMetrics.java:75)
at 
org.apache.flink.streaming.runtime.io.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:125)
at 
org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:133)
at 
org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:69)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:311)
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:187)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:488)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:470)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:702)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:527)
at java.lang.Thread.run(Thread.java:748)
{code}






[jira] [Created] (FLINK-15382) Flink failed generating python config docs

2019-12-24 Thread Xintong Song (Jira)
Xintong Song created FLINK-15382:


 Summary: Flink failed generating python config docs 
 Key: FLINK-15382
 URL: https://issues.apache.org/jira/browse/FLINK-15382
 Project: Flink
  Issue Type: Bug
  Components: API / Python, Runtime / Configuration
Reporter: Xintong Song


When generating config option docs with the command suggested by 
{{flink-docs/README.md}}, the generated 
{{docs/_includes/generated/python_configuration.html}} does not contain any 
config options, even though there are 4 options in {{PythonOptions}}. 

I encountered this problem at commit 
{{545534e43ed37f518fe59b6ddd8ed56ae82a234b}} on the master branch.

Command used to generate doc:
{code:bash}mvn package -Dgenerate-config-docs -pl flink-docs -am -nsu 
-DskipTests{code}






[jira] [Commented] (FLINK-15382) Flink failed generating python config docs

2019-12-24 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002771#comment-17002771
 ] 

Xintong Song commented on FLINK-15382:
--

cc [~sunjincheng121] 
Since other components do not have the same problem, I guess this is probably a 
Python-related problem. Could you take a look?
Also, this issue doesn't seem to be a 1.10 blocker to me, but I'm not 
completely sure about that. Please update the priority and fix version if you 
feel it's necessary.

> Flink failed generating python config docs 
> ---
>
> Key: FLINK-15382
> URL: https://issues.apache.org/jira/browse/FLINK-15382
> Project: Flink
>  Issue Type: Bug
>  Components: API / Python, Runtime / Configuration
>Reporter: Xintong Song
>Priority: Major
>
> When generating config option docs with the command suggested by 
> {{flink-docs/README.md}}, the generated 
> {{docs/_includes/generated/python_configuration.html}} does not contain any 
> config options despite that there are 4 options in {{PythonOptions}}. 
> I encountered this problem at the commit 
> {{545534e43ed37f518fe59b6ddd8ed56ae82a234b}} on master branch.
> Command used to generate doc:
> {code:bash}mvn package -Dgenerate-config-docs -pl flink-docs -am -nsu 
> -DskipTests{code}





[jira] [Created] (FLINK-15369) MiniCluster uses fixed network / managed memory sizes by default

2019-12-23 Thread Xintong Song (Jira)
Xintong Song created FLINK-15369:


 Summary: MiniCluster uses fixed network / managed memory sizes by default
 Key: FLINK-15369
 URL: https://issues.apache.org/jira/browse/FLINK-15369
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Configuration
Reporter: Xintong Song
 Fix For: 1.10.0


Currently, MiniCluster may allocate off-heap memory (managed & network) 
according to the JVM free heap size and the configured off-heap fractions. This 
can lead to unnecessarily large off-heap memory usage and unpredictable / 
hard-to-understand behaviors.

We believe fixed values for managed / network memory would be enough for such a 
setup, which runs Flink as a library.
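
The proposed behavior can be sketched like this (an illustration only; the class name and the fixed default values are hypothetical, not Flink's actual ones):

```java
// Sketch: for MiniCluster, fall back to a fixed default instead of deriving
// the size from the (unpredictable) JVM free heap and a fraction.
public class MiniClusterMemory {
    static final long FIXED_NETWORK_BYTES = 64L << 20;  // hypothetical 64 mb default
    static final long FIXED_MANAGED_BYTES = 128L << 20; // hypothetical 128 mb default

    /** Respects an explicit setting; otherwise returns the fixed default. */
    static long networkMemory(Long configured) {
        return configured != null ? configured : FIXED_NETWORK_BYTES;
    }

    static long managedMemory(Long configured) {
        return configured != null ? configured : FIXED_MANAGED_BYTES;
    }
}
```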

Detailed discussion can be found in this [ML 
thread|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Some-feedback-after-trying-out-the-new-FLIP-49-memory-configurations-td36129.html].





[jira] [Created] (FLINK-15375) Improve MemorySize to print / parse with better readability.

2019-12-23 Thread Xintong Song (Jira)
Xintong Song created FLINK-15375:


 Summary: Improve MemorySize to print / parse with better 
readability.
 Key: FLINK-15375
 URL: https://issues.apache.org/jira/browse/FLINK-15375
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Configuration
Reporter: Xintong Song
 Fix For: 1.10.0


* Print MemorySize with a proper unit rather than a tremendous number of bytes.
* Parse memory sizes from plain numbers instead of {{parse(xxx + "m")}}
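
The printing improvement could look like the following (a sketch of one possible behavior, assuming the value should only be promoted to a larger unit when it stays exact; this is not Flink's actual MemorySize implementation):

```java
// Sketch: render a byte count with the largest unit that keeps it exact,
// e.g. 1073741824 -> "1 gb", but 1000 stays "1000 bytes".
public class MemorySizeFormat {
    private static final String[] UNITS = {"bytes", "kb", "mb", "gb", "tb"};

    static String format(long bytes) {
        int unit = 0;
        long value = bytes;
        // Move to a larger unit only while the value divides evenly by 1024.
        while (unit < UNITS.length - 1 && value != 0 && value % 1024 == 0) {
            value /= 1024;
            unit++;
        }
        return value + " " + UNITS[unit];
    }
}
```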

Detailed discussion can be found in this [ML 
thread|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Some-feedback-after-trying-out-the-new-FLIP-49-memory-configurations-td36129.html].





[jira] [Comment Edited] (FLINK-15388) The assigned slot bae00218c818157649eb9e3c533b86af_32 was removed.

2019-12-26 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003934#comment-17003934
 ] 

Xintong Song edited comment on FLINK-15388 at 12/27/19 7:13 AM:


One thing drew my attention: there seem to be quite a few error messages like 
"Exception occurred in REST handler: Job 9bf1a8b3b40ddccb5aa258f150a750b1 not 
found". This indicates that something monitoring other jobs is accessing the 
wrong REST server address.

I printed out the times and counts of such error messages, and found that the 
time points with many of these errors quite closely match the time points with 
high Prometheus scrape durations.
 !屏幕快照 2019-12-27 下午3.05.36.png! 
This might be what affects the heartbeats, because the REST server needs 
to access the RPC main thread.

I would suggest first finding out where the REST queries come from and trying to 
eliminate them, then checking whether the problem still exists.


was (Author: xintongsong):
One thing draw my attention, it seems there are quite some error messages like 
"Exception occurred in REST handler: Job 9bf1a8b3b40ddccb5aa258f150a750b1 not 
found". This indicates something that monitoring other jobs are accessing the 
wrong rest server address.

I tried to print out the time and amount of such error message, and find that 
the timepoints with lots of such error messages quite match the timepoints when 
there are high prometheus scrape duration.

This might be the reason that affects the heartbeats, because rest server need 
to access the rpc main thread.
 !屏幕快照 2019-12-27 下午3.05.36.png! 
I would suggest to first find out where the rest queries come from and try to 
eliminate them, see if the problem still exist after that.

> The assigned slot bae00218c818157649eb9e3c533b86af_32 was removed.
> --
>
> Key: FLINK-15388
> URL: https://issues.apache.org/jira/browse/FLINK-15388
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Affects Versions: 1.8.0
> Environment: model : standalone,not yarn
> version :  flink 1.8.0
> configration : 
> jobmanager.heap.size: 4096m
> taskmanager.heap.size: 144gb
> taskmanager.numberOfTaskSlots: 48
> taskmanager.memory.fraction: 0.7
> taskmanager.memory.off-heap: false
> parallelism.default: 1
>  
>Reporter: hiliuxg
>Priority: Major
> Attachments: 236log.7z, 236log.7z, metrics.png, metrics.png, 屏幕快照 
> 2019-12-27 下午3.05.36.png
>
>
> the taskmanager's slot was removed , there was not full gc or oom , what's 
> the problem ? the error bellow
> {code:java}
> org.apache.flink.util.FlinkException: The assigned slot 
> bae00218c818157649eb9e3c533b86af_32 was removed.
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:893)
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:863)
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1058)
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:385)
>  at 
> org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:847)
>  at 
> org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener$1.run(ResourceManager.java:1161)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)
>  at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
>  at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
>  at 
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>  at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>  at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>  at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>  at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>  at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>  at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>  at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}





[jira] [Commented] (FLINK-15388) The assigned slot bae00218c818157649eb9e3c533b86af_32 was removed.

2019-12-26 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003877#comment-17003877
 ] 

Xintong Song commented on FLINK-15388:
--

If you look at the log file, there are quite a few activities before the timeout 
happens, which indicates that the task manager is probably not blocked. Also, the 
job operators do not operate on the task manager's RPC main thread, and thus should 
not block the heartbeats.

From what you described, it still sounds like a network problem to me. How do 
you know the network is ok?

As for the configuration, we would usually avoid having such large containers, but 
I'm not sure whether that is relevant to this heartbeat timeout problem.

> The assigned slot bae00218c818157649eb9e3c533b86af_32 was removed.
> --
>
> Key: FLINK-15388
> URL: https://issues.apache.org/jira/browse/FLINK-15388
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Affects Versions: 1.8.0
> Environment: model : standalone,not yarn
> version :  flink 1.8.0
> configration : 
> jobmanager.heap.size: 4096m
> taskmanager.heap.size: 144gb
> taskmanager.numberOfTaskSlots: 48
> taskmanager.memory.fraction: 0.7
> taskmanager.memory.off-heap: false
> parallelism.default: 1
>  
>Reporter: hiliuxg
>Priority: Major
> Attachments: metrics.png, metrics.png
>
>
> the taskmanager's slot was removed , there was not full gc or oom , what's 
> the problem ? the error bellow
> {code:java}
> org.apache.flink.util.FlinkException: The assigned slot 
> bae00218c818157649eb9e3c533b86af_32 was removed.
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:893)
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:863)
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1058)
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:385)
>  at 
> org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:847)
>  at 
> org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener$1.run(ResourceManager.java:1161)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)
>  at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
>  at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
>  at 
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>  at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>  at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>  at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>  at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>  at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>  at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>  at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}





[jira] [Comment Edited] (FLINK-15388) The assigned slot bae00218c818157649eb9e3c533b86af_32 was removed.

2019-12-26 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003934#comment-17003934
 ] 

Xintong Song edited comment on FLINK-15388 at 12/27/19 7:12 AM:


One thing drew my attention: there seem to be quite a few error messages like 
"Exception occurred in REST handler: Job 9bf1a8b3b40ddccb5aa258f150a750b1 not 
found". This indicates that something monitoring other jobs is accessing the 
wrong REST server address.

I printed out the times and counts of such error messages, and found that the 
time points with many of these errors quite closely match the time points with 
high Prometheus scrape durations.

This might be what affects the heartbeats, because the REST server needs 
to access the RPC main thread.
 !屏幕快照 2019-12-27 下午3.05.36.png! 
I would suggest first finding out where the REST queries come from and trying to 
eliminate them, then checking whether the problem still exists.


was (Author: xintongsong):
One thing draw my attention, it seems there are quite some error messages like 
"Exception occurred in REST handler: Job 9bf1a8b3b40ddccb5aa258f150a750b1 not 
found". This indicates something that monitoring other jobs are accessing the 
wrong rest server address.

I tried to print out the time and amount of such error message, and find that 
the timepoints with lots of such error messages quite match the timepoints when 
there are high prometheus scrape duration. This might be the reason that 
affects the heartbeats, because rest server need to access the rpc main thread.

I would suggest to first find out where the rest queries come from and try to 
eliminate them, see if the problem still exist after that.

> The assigned slot bae00218c818157649eb9e3c533b86af_32 was removed.
> --
>
> Key: FLINK-15388
> URL: https://issues.apache.org/jira/browse/FLINK-15388
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Affects Versions: 1.8.0
> Environment: model : standalone,not yarn
> version :  flink 1.8.0
> configration : 
> jobmanager.heap.size: 4096m
> taskmanager.heap.size: 144gb
> taskmanager.numberOfTaskSlots: 48
> taskmanager.memory.fraction: 0.7
> taskmanager.memory.off-heap: false
> parallelism.default: 1
>  
>Reporter: hiliuxg
>Priority: Major
> Attachments: 236log.7z, 236log.7z, metrics.png, metrics.png, 屏幕快照 
> 2019-12-27 下午3.05.36.png
>
>
> the taskmanager's slot was removed , there was not full gc or oom , what's 
> the problem ? the error bellow
> {code:java}
> org.apache.flink.util.FlinkException: The assigned slot 
> bae00218c818157649eb9e3c533b86af_32 was removed.
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:893)
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:863)
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1058)
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:385)
>  at 
> org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:847)
>  at 
> org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener$1.run(ResourceManager.java:1161)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)
>  at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
>  at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
>  at 
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>  at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>  at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>  at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>  at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>  at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>  at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>  at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}





[jira] [Commented] (FLINK-15388) The assigned slot bae00218c818157649eb9e3c533b86af_32 was removed.

2019-12-26 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003934#comment-17003934
 ] 

Xintong Song commented on FLINK-15388:
--

One thing drew my attention: there seem to be quite a few error messages like 
"Exception occurred in REST handler: Job 9bf1a8b3b40ddccb5aa258f150a750b1 not 
found". This indicates that something monitoring other jobs is accessing the 
wrong REST server address.

I printed out the times and counts of such error messages, and found that the 
time points with many of these errors quite closely match the time points with 
high Prometheus scrape durations. This might be what affects the heartbeats, 
because the REST server needs to access the RPC main thread.

I would suggest first finding out where the REST queries come from and trying to 
eliminate them, then checking whether the problem still exists.

> The assigned slot bae00218c818157649eb9e3c533b86af_32 was removed.
> --
>
> Key: FLINK-15388
> URL: https://issues.apache.org/jira/browse/FLINK-15388
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Affects Versions: 1.8.0
> Environment: model : standalone,not yarn
> version :  flink 1.8.0
> configration : 
> jobmanager.heap.size: 4096m
> taskmanager.heap.size: 144gb
> taskmanager.numberOfTaskSlots: 48
> taskmanager.memory.fraction: 0.7
> taskmanager.memory.off-heap: false
> parallelism.default: 1
>  
>Reporter: hiliuxg
>Priority: Major
> Attachments: 236log.7z, 236log.7z, metrics.png, metrics.png
>
>
> the taskmanager's slot was removed , there was not full gc or oom , what's 
> the problem ? the error bellow
> {code:java}
> org.apache.flink.util.FlinkException: The assigned slot 
> bae00218c818157649eb9e3c533b86af_32 was removed.
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:893)
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:863)
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1058)
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:385)
>  at 
> org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:847)
>  at 
> org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener$1.run(ResourceManager.java:1161)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)
>  at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
>  at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
>  at 
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>  at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>  at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>  at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>  at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>  at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>  at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>  at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}





[jira] [Commented] (FLINK-15403) 'State Migration end-to-end test from 1.6' is unstable on travis.

2019-12-26 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003828#comment-17003828
 ] 

Xintong Song commented on FLINK-15403:
--

My bad. I restarted the failed Travis stage, and it seems the restarted stage 
has overwritten the log at the same URL.

> 'State Migration end-to-end test from 1.6' is unstable on travis.
> -
>
> Key: FLINK-15403
> URL: https://issues.apache.org/jira/browse/FLINK-15403
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.10.0
>Reporter: Xintong Song
>Priority: Critical
>  Labels: testability
> Fix For: 1.10.0
>
>
> https://api.travis-ci.org/v3/job/629576631/log.txt
> The test case fails because the log contains the following error message.
> {code}
> 2019-12-26 09:19:35,537 ERROR 
> org.apache.flink.streaming.runtime.tasks.StreamTask   - Received 
> CancelTaskException while we are not canceled. This is a bug and should be 
> reported
> org.apache.flink.runtime.execution.CancelTaskException: Consumed partition 
> PipelinedSubpartitionView(index: 0) of ResultPartition 
> 3886657fb8cc980139fac67e32d6e380@8cfcbe851fe3bb3fa00e9afc370bd963 has been 
> released.
>   at 
> org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.getNextBuffer(LocalInputChannel.java:190)
>   at 
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.waitAndGetNextData(SingleInputGate.java:509)
>   at 
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:487)
>   at 
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.pollNext(SingleInputGate.java:475)
>   at 
> org.apache.flink.runtime.taskmanager.InputGateWithMetrics.pollNext(InputGateWithMetrics.java:75)
>   at 
> org.apache.flink.streaming.runtime.io.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:125)
>   at 
> org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:133)
>   at 
> org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:69)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:311)
>   at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:187)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:488)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:470)
>   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:702)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:527)
>   at java.lang.Thread.run(Thread.java:748)
> {code}





[jira] [Commented] (FLINK-15388) The assigned slot bae00218c818157649eb9e3c533b86af_32 was removed.

2019-12-24 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003067#comment-17003067
 ] 

Xintong Song commented on FLINK-15388:
--

Could you share the complete log file?

> The assigned slot bae00218c818157649eb9e3c533b86af_32 was removed.
> --
>
> Key: FLINK-15388
> URL: https://issues.apache.org/jira/browse/FLINK-15388
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Affects Versions: 1.8.0
> Environment: model : standalone,not yarn
> version :  flink 1.8.0
> configration : 
> jobmanager.heap.size: 4096m
> taskmanager.heap.size: 144gb
> taskmanager.numberOfTaskSlots: 48
> taskmanager.memory.fraction: 0.7
> taskmanager.memory.off-heap: false
> parallelism.default: 1
>  
>Reporter: hiliuxg
>Priority: Major
>
> the taskmanager's slot was removed , there was not full gc or oom , what's 
> the problem ? the error bellow
> {code:java}
> org.apache.flink.util.FlinkException: The assigned slot 
> bae00218c818157649eb9e3c533b86af_32 was removed.
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:893)
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:863)
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1058)
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:385)
>  at 
> org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:847)
>  at 
> org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener$1.run(ResourceManager.java:1161)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)
>  at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
>  at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
>  at 
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>  at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>  at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>  at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>  at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>  at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>  at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>  at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}





[jira] [Commented] (FLINK-15388) The assigned slot bae00218c818157649eb9e3c533b86af_32 was removed.

2019-12-24 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003057#comment-17003057
 ] 

Xintong Song commented on FLINK-15388:
--

Hi [~hiliuxg],
The exception "slot was removed" usually indicates that the corresponding task 
manager failed. We need to look into the task manager log to find out why it 
failed.



[jira] [Commented] (FLINK-15388) The assigned slot bae00218c818157649eb9e3c533b86af_32 was removed.

2019-12-25 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003178#comment-17003178
 ] 

Xintong Song commented on FLINK-15388:
--

I'm glad to help.
I'm closing this JIRA ticket since it is not a bug in Flink.



[jira] [Commented] (FLINK-15388) The assigned slot bae00218c818157649eb9e3c533b86af_32 was removed.

2019-12-24 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003113#comment-17003113
 ] 

Xintong Song commented on FLINK-15388:
--

BTW, you have configured a very large TM heap size ("taskmanager.heap.size: 
144gb"). It is likely that TM GC pauses take too long. But again, further 
information is needed to confirm the problem.
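To verify whether long GC pauses are the cause, one option is to enable JVM GC logging on the task managers. A minimal sketch for {{flink-conf.yaml}}, assuming a JDK 8 runtime (the log file path is illustrative):

```yaml
# Pass JDK 8 style GC logging flags to the task manager JVM.
# The log path /tmp/flink-tm-gc.log is an example, not a convention.
env.java.opts.taskmanager: -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/tmp/flink-tm-gc.log
```

Pauses in the GC log longer than the configured heartbeat timeout would point to GC as the culprit.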



[jira] [Closed] (FLINK-15388) The assigned slot bae00218c818157649eb9e3c533b86af_32 was removed.

2019-12-25 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song closed FLINK-15388.

Resolution: Invalid



[jira] [Commented] (FLINK-15382) Flink failed generating python config docs

2019-12-24 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003099#comment-17003099
 ] 

Xintong Song commented on FLINK-15382:
--

[~dian.fu]
There's only a flink-python jar of version 1.9-SNAPSHOT in my local repository.
Do we need any special argument to build the flink-python module when building 
Flink?

> Flink failed generating python config docs 
> ---
>
> Key: FLINK-15382
> URL: https://issues.apache.org/jira/browse/FLINK-15382
> Project: Flink
>  Issue Type: Bug
>  Components: API / Python, Runtime / Configuration
>Reporter: Xintong Song
>Priority: Major
>
> When generating config option docs with the command suggested by 
> {{flink-docs/README.md}}, the generated 
> {{docs/_includes/generated/python_configuration.html}} does not contain any 
> config options despite that there are 4 options in {{PythonOptions}}. 
> I encountered this problem at the commit 
> {{545534e43ed37f518fe59b6ddd8ed56ae82a234b}} on master branch.
> Command used to generate doc:
> {code:bash}mvn package -Dgenerate-config-docs -pl flink-docs -am -nsu 
> -DskipTests{code}





[jira] [Commented] (FLINK-15388) The assigned slot bae00218c818157649eb9e3c533b86af_32 was removed.

2019-12-24 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003111#comment-17003111
 ] 

Xintong Song commented on FLINK-15388:
--

Could you attach the complete JM and the problematic TM 
(bae00218c818157649eb9e3c533b86af) log files?
From the error messages you posted, the slot is removed because of a task 
manager heartbeat timeout.
There are usually several reasons that may cause the heartbeat timeout:
* TM failure.
* Network problems between the RM and TM.
* Either the RM or TM suffers severe GC pauses and cannot respond to heartbeats.
It's hard to tell which is the cause in your case without further information.
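If transient GC pauses or network hiccups turn out to be the cause, the heartbeat tolerance can be relaxed via the heartbeat config options. A minimal {{flink-conf.yaml}} sketch (the values are illustrative, not recommendations):

```yaml
# Interval at which heartbeat requests are sent, in milliseconds (default 10000).
heartbeat.interval: 10000
# Time without a heartbeat response before a peer is considered dead (default 50000).
heartbeat.timeout: 120000
```

A larger timeout makes the cluster more tolerant of pauses, at the cost of slower detection of genuinely failed task managers.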



[jira] [Commented] (FLINK-15382) Flink failed generating python config docs

2019-12-24 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003120#comment-17003120
 ] 

Xintong Song commented on FLINK-15382:
--

[~dian.fu]
I do not have flink-python of 1.11-SNAPSHOT in my local repository, but I do 
find flink-python_2.11 of 1.11-SNAPSHOT there.
Also, if the dependency jar file were missing, the documentation build should 
fail instead of succeeding with an empty generated table.



[jira] [Commented] (FLINK-15388) The assigned slot bae00218c818157649eb9e3c533b86af_32 was removed.

2019-12-25 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003151#comment-17003151
 ] 

Xintong Song commented on FLINK-15388:
--

It seems there was some network problem at the time of this failure.

I found the following error message in your taskmanager log.
{code}
2019-12-25 07:52:35.957 [flink-scheduler-1] WARN  
org.apache.flink.runtime.taskexecutor.TaskExecutor  - Could not send heartbeat 
to target with id 2981310fa47a49b417e51ed799ff77d4.
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask 
timed out on [Actor[akka://flink/user/taskmanager_0#1008850236]] after [1 
ms]. Sender[null] sent message of type 
"org.apache.flink.runtime.rpc.messages.CallAsync".
at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
 ~[na:1.8.0_211]
at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
 ~[na:1.8.0_211]
at 
java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:647) 
~[na:1.8.0_211]
at 
java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:632)
 ~[na:1.8.0_211]
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) 
~[na:1.8.0_211]
at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
 ~[na:1.8.0_211]
at 
org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:815)
 ~[flink-dist_2.11-1.8.0.jar:1.8.0]
at akka.dispatch.OnComplete.internal(Future.scala:258) 
~[flink-dist_2.11-1.8.0.jar:1.8.0]
at akka.dispatch.OnComplete.internal(Future.scala:256) 
~[flink-dist_2.11-1.8.0.jar:1.8.0]
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186) 
~[flink-dist_2.11-1.8.0.jar:1.8.0]
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183) 
~[flink-dist_2.11-1.8.0.jar:1.8.0]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36) 
~[flink-dist_2.11-1.8.0.jar:1.8.0]
at 
org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)
 ~[flink-dist_2.11-1.8.0.jar:1.8.0]
at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44) 
~[flink-dist_2.11-1.8.0.jar:1.8.0]
at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252) 
~[flink-dist_2.11-1.8.0.jar:1.8.0]
at 
akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603) 
~[flink-dist_2.11-1.8.0.jar:1.8.0]
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126) 
~[flink-dist_2.11-1.8.0.jar:1.8.0]
at 
scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
 ~[flink-dist_2.11-1.8.0.jar:1.8.0]
at 
scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109) 
~[flink-dist_2.11-1.8.0.jar:1.8.0]
at 
scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599) 
~[flink-dist_2.11-1.8.0.jar:1.8.0]
at 
akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
 ~[flink-dist_2.11-1.8.0.jar:1.8.0]
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
 ~[flink-dist_2.11-1.8.0.jar:1.8.0]
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
 ~[flink-dist_2.11-1.8.0.jar:1.8.0]
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
 ~[flink-dist_2.11-1.8.0.jar:1.8.0]
at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_211]
Caused by: akka.pattern.AskTimeoutException: Ask timed out on 
[Actor[akka://flink/user/taskmanager_0#1008850236]] after [1 ms]. 
Sender[null] sent message of type 
"org.apache.flink.runtime.rpc.messages.CallAsync".
at 
akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604) 
~[flink-dist_2.11-1.8.0.jar:1.8.0]
... 9 common frames omitted
{code}

And in the jobmanager log, around the same time range, there is the following 
error message with the same taskmanager id.
{code}
2019-12-25 07:51:48.640 [flink-akka.actor.default-dispatcher-24] WARN  
akka.remote.ReliableDeliverySupervisor 
flink-akka.remote.default-remote-dispatcher-17 - Association with remote system 
[akka.tcp://flink@10.1.209.158:27065] has failed, address is now gated for [50] 
ms. Reason: [Association failed with [akka.tcp://flink@10.1.209.158:27065]] 
Caused by: [Connection refused: /10.1.209.158:27065]
2019-12-25 07:51:55.743 [flink-scheduler-1] ERROR 
o.a.f.r.r.handler.taskmanager.TaskManagerStdoutFileHandler  - Failed to 
transfer file from TaskExecutor bae00218c818157649eb9e3c533b86af.
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask 
timed out on 

[jira] [Commented] (FLINK-15388) The assigned slot bae00218c818157649eb9e3c533b86af_32 was removed.

2019-12-25 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003156#comment-17003156
 ] 

Xintong Song commented on FLINK-15388:
--

FYI, you can attach log files directly to this JIRA page by clicking "More -> 
Attach Files" at the top of the page.
That is probably friendlier to community participants who do not speak Chinese 
than a Baidu file-sharing link.



[jira] [Commented] (FLINK-15388) The assigned slot bae00218c818157649eb9e3c533b86af_32 was removed.

2019-12-25 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003424#comment-17003424
 ] 

Xintong Song commented on FLINK-15388:
--

Flink does not support assigning tasks to specific machines.
An effort has been made to avoid allocating tasks intensively on some of the 
machines (see FLINK-12122), but it only applies to Flink 1.9 and above.



[jira] [Commented] (FLINK-15388) The assigned slot bae00218c818157649eb9e3c533b86af_32 was removed.

2019-12-25 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003416#comment-17003416
 ] 

Xintong Song commented on FLINK-15388:
--

Well, that's also possible. 

> The assigned slot bae00218c818157649eb9e3c533b86af_32 was removed.
> --
>
> Key: FLINK-15388
> URL: https://issues.apache.org/jira/browse/FLINK-15388
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Affects Versions: 1.8.0
> Environment: mode: standalone, not YARN
> version: flink 1.8.0
> configuration: 
> jobmanager.heap.size: 4096m
> taskmanager.heap.size: 144gb
> taskmanager.numberOfTaskSlots: 48
> taskmanager.memory.fraction: 0.7
> taskmanager.memory.off-heap: false
> parallelism.default: 1
>  
>Reporter: hiliuxg
>Priority: Major
> Attachments: metrics.png
>
>
> The TaskManager's slot was removed. There was no full GC or OOM. What is the 
> problem? The error is below:
> {code:java}
> org.apache.flink.util.FlinkException: The assigned slot 
> bae00218c818157649eb9e3c533b86af_32 was removed.
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:893)
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:863)
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1058)
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:385)
>  at 
> org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:847)
>  at 
> org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener$1.run(ResourceManager.java:1161)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)
>  at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
>  at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
>  at 
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>  at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>  at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>  at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>  at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>  at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>  at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>  at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15437) Start session with property of "-Dtaskmanager.memory.process.size" not work

2019-12-30 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005248#comment-17005248
 ] 

Xintong Song commented on FLINK-15437:
--

[~xiaojin.wy]
I think the problem is that you have '-n 20' in your command, an option that 
no longer exists.
Once the client sees an unknown option, it stops parsing the remaining 
options. Thus the '-D' memory option is not applied.
Could you verify whether removing '-n 20' fixes the problem?
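The stop-at-first-unknown-option behavior can be sketched as follows. This is 
a minimal illustration, not the actual FlinkYarnSessionCli parsing code; the 
helper name and the set of "known" flags are made up:

```python
# Hypothetical sketch: the CLI stops consuming options at the first token it
# does not recognize, so everything after an unknown option (including any
# '-D...' settings) is silently dropped.
def parse_until_unknown(argv, known_flags):
    parsed = {}
    i = 0
    while i < len(argv):
        tok = argv[i]
        if tok in known_flags:
            parsed[tok] = argv[i + 1]  # flags here take exactly one value
            i += 2
        elif tok.startswith("-D"):
            key, _, val = tok[2:].partition("=")
            parsed.setdefault("-D", {})[key] = val
            i += 1
        else:
            break  # unknown option: stop parsing, remaining args are ignored
    return parsed

argv = ["-jm", "1024", "-n", "20", "-s", "10",
        "-Dtaskmanager.memory.process.size=1024m"]
result = parse_until_unknown(argv, {"-jm", "-s"})
# '-n' is unknown, so parsing stops before '-s' and the '-D' memory option.
```

With '-n 20' removed from `argv`, parsing would continue through '-s' and 
pick up the '-D' memory setting.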

> Start session with property of "-Dtaskmanager.memory.process.size" not work
> ---
>
> Key: FLINK-15437
> URL: https://issues.apache.org/jira/browse/FLINK-15437
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core
>Affects Versions: 1.10.0
>Reporter: xiaojin.wy
>Priority: Major
> Fix For: 1.10.0
>
>
> *The environment:*
> The Yarn session cmd is as below, and the flink-conf.yaml does not have the 
> property "taskmanager.memory.process.size":
> export HADOOP_CLASSPATH=`hadoop classpath`;export 
> HADOOP_CONF_DIR=/dump/1/jenkins/workspace/Stream-Spark-3.4/env/hadoop_conf_dirs/blinktest2;
>  export BLINK_HOME=/dump/1/jenkins/workspace/test/blink_daily; 
> $BLINK_HOME/bin/yarn-session.sh -d -qu root.default -nm 'Session Cluster of 
> daily_regression_stream_spark_1.10' -jm 1024 -n 20 -s 10 
> -Dtaskmanager.memory.process.size=1024m
> *After executing the cmd above, there is an exception like this:*
> 2019-12-30 17:54:57,992 INFO  org.apache.hadoop.yarn.client.RMProxy   
>   - Connecting to ResourceManager at 
> z05c07224.sqa.zth.tbsite.net/11.163.188.36:8050
> 2019-12-30 17:54:58,182 ERROR org.apache.flink.yarn.cli.FlinkYarnSessionCli   
>   - Error while running the Flink session.
> org.apache.flink.configuration.IllegalConfigurationException: Either Task 
> Heap Memory size (taskmanager.memory.task.heap.size) and Managed Memory size 
> (taskmanager.memory.managed.size), or Total Flink Memory size 
> (taskmanager.memory.flink.size), or Total Process Memory size 
> (taskmanager.memory.process.size) need to be configured explicitly.
>   at 
> org.apache.flink.runtime.clusterframework.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:145)
>   at 
> org.apache.flink.client.deployment.AbstractClusterClientFactory.getClusterSpecification(AbstractClusterClientFactory.java:44)
>   at 
> org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:557)
>   at 
> org.apache.flink.yarn.cli.FlinkYarnSessionCli.lambda$main$5(FlinkYarnSessionCli.java:803)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
>   at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>   at 
> org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:803)
> 
>  The program finished with the following exception:
> org.apache.flink.configuration.IllegalConfigurationException: Either Task 
> Heap Memory size (taskmanager.memory.task.heap.size) and Managed Memory size 
> (taskmanager.memory.managed.size), or Total Flink Memory size 
> (taskmanager.memory.flink.size), or Total Process Memory size 
> (taskmanager.memory.process.size) need to be configured explicitly.
>   at 
> org.apache.flink.runtime.clusterframework.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:145)
>   at 
> org.apache.flink.client.deployment.AbstractClusterClientFactory.getClusterSpecification(AbstractClusterClientFactory.java:44)
>   at 
> org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:557)
>   at 
> org.apache.flink.yarn.cli.FlinkYarnSessionCli.lambda$main$5(FlinkYarnSessionCli.java:803)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
>   at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>   at 
> org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:803)
> *The flink-conf.yaml is :*
> jobmanager.rpc.address: localhost
> jobmanager.rpc.port: 6123
> jobmanager.heap.size: 1024m
> taskmanager.memory.total-process.size: 1024m
> taskmanager.numberOfTaskSlots: 1
> parallelism.default: 1
> jobmanager.execution.failover-strategy: region



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (FLINK-14572) BlobsCleanupITCase failed on Travis

2020-01-05 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song reopened FLINK-14572:
--

This problem appears again in FLINK-15480.

> BlobsCleanupITCase failed on Travis
> ---
>
> Key: FLINK-14572
> URL: https://issues.apache.org/jira/browse/FLINK-14572
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.10.0
>Reporter: Gary Yao
>Assignee: Yun Gao
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.10.0
>
>
> {noformat}
> java.lang.AssertionError: 
> Expected: is 
>  but: was 
>   at 
> org.apache.flink.runtime.jobmanager.BlobsCleanupITCase.testBlobServerCleanup(BlobsCleanupITCase.java:220)
>   at 
> org.apache.flink.runtime.jobmanager.BlobsCleanupITCase.testBlobServerCleanupFinishedJob(BlobsCleanupITCase.java:133)
> {noformat}
> https://api.travis-ci.com/v3/job/250445874/log.txt



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-14572) BlobsCleanupITCase failed on Travis

2020-01-05 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17008494#comment-17008494
 ] 

Xintong Song edited comment on FLINK-14572 at 1/6/20 2:36 AM:
--

[~gaoyunhaii] 
This problem appears again in FLINK-15480.


was (Author: xintongsong):
This problem appears again in FLINK-15480.

> BlobsCleanupITCase failed on Travis
> ---
>
> Key: FLINK-14572
> URL: https://issues.apache.org/jira/browse/FLINK-14572
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.10.0
>Reporter: Gary Yao
>Assignee: Yun Gao
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.10.0
>
>
> {noformat}
> java.lang.AssertionError: 
> Expected: is 
>  but: was 
>   at 
> org.apache.flink.runtime.jobmanager.BlobsCleanupITCase.testBlobServerCleanup(BlobsCleanupITCase.java:220)
>   at 
> org.apache.flink.runtime.jobmanager.BlobsCleanupITCase.testBlobServerCleanupFinishedJob(BlobsCleanupITCase.java:133)
> {noformat}
> https://api.travis-ci.com/v3/job/250445874/log.txt



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-15480) BlobsCleanupITCase is unstable on travis

2020-01-05 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song closed FLINK-15480.

Resolution: Duplicate

> BlobsCleanupITCase is unstable on travis
> 
>
> Key: FLINK-15480
> URL: https://issues.apache.org/jira/browse/FLINK-15480
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: Xintong Song
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.10.0
>
>
> {code:java}
> 03:52:11.256 [ERROR] 
> testBlobServerCleanupFinishedJob(org.apache.flink.runtime.jobmanager.BlobsCleanupITCase)
>   Time elapsed: 298.556 s  <<< FAILURE!
> java.lang.AssertionError: 
> Expected: is 
>  but: was 
>   at 
> org.apache.flink.runtime.jobmanager.BlobsCleanupITCase.testBlobServerCleanup(BlobsCleanupITCase.java:220)
>   at 
> org.apache.flink.runtime.jobmanager.BlobsCleanupITCase.testBlobServerCleanupFinishedJob(BlobsCleanupITCase.java:133)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15480) BlobsCleanupITCase is unstable on travis

2020-01-05 Thread Xintong Song (Jira)
Xintong Song created FLINK-15480:


 Summary: BlobsCleanupITCase is unstable on travis
 Key: FLINK-15480
 URL: https://issues.apache.org/jira/browse/FLINK-15480
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.10.0
Reporter: Xintong Song
 Fix For: 1.10.0


https://api.travis-ci.com/v3/job/272321636/log.txt
{code}
03:52:11.256 [ERROR] 
testBlobServerCleanupFinishedJob(org.apache.flink.runtime.jobmanager.BlobsCleanupITCase)
  Time elapsed: 298.556 s  <<< FAILURE!
java.lang.AssertionError: 

Expected: is 
 but: was 
at 
org.apache.flink.runtime.jobmanager.BlobsCleanupITCase.testBlobServerCleanup(BlobsCleanupITCase.java:220)
at 
org.apache.flink.runtime.jobmanager.BlobsCleanupITCase.testBlobServerCleanupFinishedJob(BlobsCleanupITCase.java:133)
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-15480) BlobsCleanupITCase is unstable on travis

2020-01-05 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song updated FLINK-15480:
-
Description: 
{code:java}
03:52:11.256 [ERROR] 
testBlobServerCleanupFinishedJob(org.apache.flink.runtime.jobmanager.BlobsCleanupITCase)
  Time elapsed: 298.556 s  <<< FAILURE!
java.lang.AssertionError: 

Expected: is 
 but: was 
at 
org.apache.flink.runtime.jobmanager.BlobsCleanupITCase.testBlobServerCleanup(BlobsCleanupITCase.java:220)
at 
org.apache.flink.runtime.jobmanager.BlobsCleanupITCase.testBlobServerCleanupFinishedJob(BlobsCleanupITCase.java:133)
{code}

  was:
https://api.travis-ci.com/v3/job/272321636/log.txt
{code}
03:52:11.256 [ERROR] 
testBlobServerCleanupFinishedJob(org.apache.flink.runtime.jobmanager.BlobsCleanupITCase)
  Time elapsed: 298.556 s  <<< FAILURE!
java.lang.AssertionError: 

Expected: is 
 but: was 
at 
org.apache.flink.runtime.jobmanager.BlobsCleanupITCase.testBlobServerCleanup(BlobsCleanupITCase.java:220)
at 
org.apache.flink.runtime.jobmanager.BlobsCleanupITCase.testBlobServerCleanupFinishedJob(BlobsCleanupITCase.java:133)
{code}



> BlobsCleanupITCase is unstable on travis
> 
>
> Key: FLINK-15480
> URL: https://issues.apache.org/jira/browse/FLINK-15480
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: Xintong Song
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.10.0
>
>
> {code:java}
> 03:52:11.256 [ERROR] 
> testBlobServerCleanupFinishedJob(org.apache.flink.runtime.jobmanager.BlobsCleanupITCase)
>   Time elapsed: 298.556 s  <<< FAILURE!
> java.lang.AssertionError: 
> Expected: is 
>  but: was 
>   at 
> org.apache.flink.runtime.jobmanager.BlobsCleanupITCase.testBlobServerCleanup(BlobsCleanupITCase.java:220)
>   at 
> org.apache.flink.runtime.jobmanager.BlobsCleanupITCase.testBlobServerCleanupFinishedJob(BlobsCleanupITCase.java:133)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15103) Performance regression on 3.12.2019 in various benchmarks

2020-01-06 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17008727#comment-17008727
 ] 

Xintong Song commented on FLINK-15103:
--

Thanks for spending time investigating the regression, [~pnowojski].

I agree with you that this should not be a real operator performance 
regression, but rather some sort of change in initialization cost.

+1 for closing this issue.

> Performance regression on 3.12.2019 in various benchmarks
> -
>
> Key: FLINK-15103
> URL: https://issues.apache.org/jira/browse/FLINK-15103
> Project: Flink
>  Issue Type: Bug
>  Components: Benchmarks
>Reporter: Piotr Nowojski
>Assignee: Piotr Nowojski
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>
> Various benchmarks show a performance regression that happened on December 
> 3rd:
> [arrayKeyBy (probably the most easily 
> visible)|http://codespeed.dak8s.net:8000/timeline/#/?exe=1=arrayKeyBy=2=200=off=on=on]
>  
> [tupleKeyBy|http://codespeed.dak8s.net:8000/timeline/#/?exe=1=tupleKeyBy=2=200=off=on=on]
>  
> [twoInputMapSink|http://codespeed.dak8s.net:8000/timeline/#/?exe=1=twoInputMapSink=2=200=off=on=on]
>  [globalWindow (small 
> one)|http://codespeed.dak8s.net:8000/timeline/#/?exe=1=globalWindow=2=200=off=on=on]
>  and possible others.
> Probably somewhere between those commits: -8403fd4- 2d67ee0..60b3f2f



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-15488) Cannot start a taskmanger if using logback

2020-01-06 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009331#comment-17009331
 ] 

Xintong Song edited comment on FLINK-15488 at 1/7/20 3:13 AM:
--

Thanks for reporting this issue, [~dwysakowicz]. This is indeed a design defect.

I think the simplest solution might be to have {{BashJavaUtils}} output the 
result with a specific key or pattern, so that {{taskmanager.sh}} can find 
the result among other potential outputs.

Or we could write the results into a temporary file. This is safer but a bit 
more complicated: we would need unique per-TM file paths to avoid potential 
conflicts between multiple TMs, and would also have to handle the file 
clean-up.

I think the first approach might be good enough. WDYT? [~azagrebin] 
[~dwysakowicz] [~gjy]


was (Author: xintongsong):
Thanks for reporting this issue, [~dwysakowicz]. This is indeed a design defect.

I think the simplest solution might be to have {{BashJavaUtils}} output the 
result with a specific key or pattern, so that {{taskmanager.sh}} can find 
the result among other potential outputs.

Or we could write the results into a temporary file. This is safer but a bit 
more complicated: we would need unique per-TM file paths to avoid potential 
conflicts between multiple TMs, and would also have to handle the file 
clean-up.

I think the first approach might be good enough. WDYT? 
[~azagrebin][~dwysakowicz][~gjy]

> Cannot start a taskmanger if using logback
> --
>
> Key: FLINK-15488
> URL: https://issues.apache.org/jira/browse/FLINK-15488
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core, Deployment / Scripts
>Affects Versions: 1.10.0
>Reporter: Dawid Wysakowicz
>Priority: Blocker
> Fix For: 1.10.0
>
>
> When using logback it is not possible to start the taskmanager using the 
> {{taskmanager.sh}} script. The same problem (probably) occurs when using 
> slf4j logging into the console.
> The problem is that when calculating the memory configuration with the 
> {{BashJavaUtils}} class, the result is returned through standard output. If 
> something is logged to the console, it may result in undefined behavior 
> such as:
> {code}
> Error: Could not find or load main class 13:51:23.961
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15488) Cannot start a taskmanger if using logback

2020-01-06 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009331#comment-17009331
 ] 

Xintong Song commented on FLINK-15488:
--

Thanks for reporting this issue, [~dwysakowicz]. This is indeed a design defect.

I think the simplest solution might be to have {{BashJavaUtils}} output the 
result with a specific key or pattern, so that {{taskmanager.sh}} can find 
the result among other potential outputs.

Or we could write the results into a temporary file. This is safer but a bit 
more complicated: we would need unique per-TM file paths to avoid potential 
conflicts between multiple TMs, and would also have to handle the file 
clean-up.

I think the first approach might be good enough. WDYT? 
[~azagrebin][~dwysakowicz][~gjy]
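A minimal sketch of the first approach (the marker string and the JVM 
argument value here are hypothetical, for illustration only; this is not 
Flink's actual implementation):

```shell
# Simulate BashJavaUtils output: logging may pollute stdout, but the real
# result is printed on a line with a unique marker prefix.
output="$(printf '%s\n' \
  '13:51:23.961 INFO  some logback line written to stdout' \
  'BASH_JAVA_UTILS_EXEC_RESULT:-Xmx1024m')"

# The launcher script extracts only the marked line, ignoring log noise.
jvm_args="$(printf '%s\n' "$output" | sed -n 's/^BASH_JAVA_UTILS_EXEC_RESULT://p')"
printf '%s\n' "$jvm_args"
```

Without the marker, the first stray log line would be treated as the result, 
which is exactly the "Could not find or load main class 13:51:23.961" failure 
reported above.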

> Cannot start a taskmanger if using logback
> --
>
> Key: FLINK-15488
> URL: https://issues.apache.org/jira/browse/FLINK-15488
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core, Deployment / Scripts
>Affects Versions: 1.10.0
>Reporter: Dawid Wysakowicz
>Priority: Blocker
> Fix For: 1.10.0
>
>
> When using logback it is not possible to start the taskmanager using the 
> {{taskmanager.sh}} script. The same problem (probably) occurs when using 
> slf4j logging into the console.
> The problem is that when calculating the memory configuration with the 
> {{BashJavaUtils}} class, the result is returned through standard output. If 
> something is logged to the console, it may result in undefined behavior 
> such as:
> {code}
> Error: Could not find or load main class 13:51:23.961
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-13554) ResourceManager should have a timeout on starting new TaskExecutors.

2020-01-08 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010451#comment-17010451
 ] 

Xintong Song commented on FLINK-13554:
--

We have confirmed that the release-1.10 blocker FLINK-15456 is actually caused 
by the problem described in this ticket.
Since this problem was not introduced in 1.10, I believe it should not be a 
blocker. But how to fix the problem, and whether it needs to be fixed in 
1.10, still need to be discussed.
I'm setting this ticket to release-1.10 critical for now, to avoid it being 
overlooked before a decision is made.

> ResourceManager should have a timeout on starting new TaskExecutors.
> 
>
> Key: FLINK-13554
> URL: https://issues.apache.org/jira/browse/FLINK-13554
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0
>Reporter: Xintong Song
>Priority: Critical
> Fix For: 1.10.0
>
>
> Recently, we encountered a case where one TaskExecutor got stuck during 
> launch on Yarn (without failing), so that the job could not recover from 
> continuous failovers.
> The TaskExecutor got stuck because of a problem in our environment. It got 
> stuck somewhere after the ResourceManager had started it, while the 
> ResourceManager was waiting for it to be brought up and register. Later, 
> when the slot request timed out, the job failed over and requested slots 
> from the ResourceManager again; the ResourceManager still saw a TaskExecutor 
> (the stuck one) being started and would not request a new container from 
> Yarn. Therefore, the job could not recover from the failure.
> I think that to avoid such an unrecoverable state, the ResourceManager needs 
> to have a timeout on starting new TaskExecutors. If starting a TaskExecutor 
> takes too long, it should just fail the TaskExecutor and start a new one.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-13554) ResourceManager should have a timeout on starting new TaskExecutors.

2020-01-08 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010460#comment-17010460
 ] 

Xintong Song commented on FLINK-13554:
--

IMO, a clean solution would be for the RM to monitor a timeout for starting 
new TMs. But this approach involves introducing config options for the 
timeout, monitoring the timeout asynchronously, and properly un-monitoring on 
TM registration, which may not be suitable to add after the feature freeze.

Also, it does not seem to be a common case. We have not seen any reports of 
this bug from users. We only ran into this problem (both this ticket and 
FLINK-15456) when testing the stability of Flink with ChaosMonkey 
intentionally breaking the network connections.

Therefore, I'm in favor of not fixing this problem in release 1.10.0.
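The monitor-and-cancel lifecycle described above can be sketched roughly as 
follows. This is a hypothetical illustration, not Flink's actual 
ResourceManager code; all names are made up:

```python
import threading

class StartupTimeoutMonitor:
    """Fails a pending TaskExecutor if it does not register within the timeout."""

    def __init__(self, timeout_seconds, on_timeout):
        self.timeout_seconds = timeout_seconds
        self.on_timeout = on_timeout  # callback: fail the TM, request a new one
        self.pending = {}

    def tm_requested(self, tm_id):
        # Monitor asynchronously: fire on_timeout if the TM never registers.
        timer = threading.Timer(self.timeout_seconds, self._expire, args=(tm_id,))
        self.pending[tm_id] = timer
        timer.start()

    def tm_registered(self, tm_id):
        # Un-monitor on registration so a healthy TM is not failed.
        timer = self.pending.pop(tm_id, None)
        if timer is not None:
            timer.cancel()

    def _expire(self, tm_id):
        # Only fire if the TM is still pending (it never registered).
        if self.pending.pop(tm_id, None) is not None:
            self.on_timeout(tm_id)
```

The bookkeeping here (per-TM timers, cancellation on registration, plus a 
user-facing timeout option) is exactly the surface area that makes this 
change awkward to slip in after the feature freeze.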

> ResourceManager should have a timeout on starting new TaskExecutors.
> 
>
> Key: FLINK-13554
> URL: https://issues.apache.org/jira/browse/FLINK-13554
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0
>Reporter: Xintong Song
>Priority: Critical
> Fix For: 1.10.0
>
>
> Recently, we encountered a case where one TaskExecutor got stuck during 
> launch on Yarn (without failing), so that the job could not recover from 
> continuous failovers.
> The TaskExecutor got stuck because of a problem in our environment. It got 
> stuck somewhere after the ResourceManager had started it, while the 
> ResourceManager was waiting for it to be brought up and register. Later, 
> when the slot request timed out, the job failed over and requested slots 
> from the ResourceManager again; the ResourceManager still saw a TaskExecutor 
> (the stuck one) being started and would not request a new container from 
> Yarn. Therefore, the job could not recover from the failure.
> I think that to avoid such an unrecoverable state, the ResourceManager needs 
> to have a timeout on starting new TaskExecutors. If starting a TaskExecutor 
> takes too long, it should just fail the TaskExecutor and start a new one.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-13554) ResourceManager should have a timeout on starting new TaskExecutors.

2020-01-08 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010451#comment-17010451
 ] 

Xintong Song edited comment on FLINK-13554 at 1/8/20 8:08 AM:
--

We have confirmed that the release-1.10 blocker FLINK-15456 is actually caused 
by the problem described in this ticket.
Since this problem was not introduced in 1.10, I believe it should not be a 
blocker. But how to fix the problem, and whether it needs to be fixed in 
1.10, still need to be discussed.
I'm setting this ticket to release-1.10 critical for now, to avoid it being 
overlooked before a decision is made.
cc [~gjy] [~liyu] [~zhuzh] [~chesnay] [~trohrmann] [~karmagyz]


was (Author: xintongsong):
We have confirmed that the release-1.10 blocker FLINK-15456 is actually caused 
by the problem described in this ticket.
Since this problem was not introduced in 1.10, I believe it should not be a 
blocker. But how to fix the problem, and whether it needs to be fixed in 
1.10, still need to be discussed.
I'm setting this ticket to release-1.10 critical for now, to avoid it being 
overlooked before a decision is made.

> ResourceManager should have a timeout on starting new TaskExecutors.
> 
>
> Key: FLINK-13554
> URL: https://issues.apache.org/jira/browse/FLINK-13554
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0
>Reporter: Xintong Song
>Priority: Critical
> Fix For: 1.10.0
>
>
> Recently, we encountered a case where one TaskExecutor got stuck during 
> launch on Yarn (without failing), so that the job could not recover from 
> continuous failovers.
> The TaskExecutor got stuck because of a problem in our environment. It got 
> stuck somewhere after the ResourceManager had started it, while the 
> ResourceManager was waiting for it to be brought up and register. Later, 
> when the slot request timed out, the job failed over and requested slots 
> from the ResourceManager again; the ResourceManager still saw a TaskExecutor 
> (the stuck one) being started and would not request a new container from 
> Yarn. Therefore, the job could not recover from the failure.
> I think that to avoid such an unrecoverable state, the ResourceManager needs 
> to have a timeout on starting new TaskExecutors. If starting a TaskExecutor 
> takes too long, it should just fail the TaskExecutor and start a new one.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15456) Job keeps failing on slot allocation timeout due to RM not allocating new TMs for slot requests

2020-01-07 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010440#comment-17010440
 ] 

Xintong Song commented on FLINK-15456:
--

Thanks [~zhuzh] for looking into the problem. I agree with you that this should 
be the same problem as FLINK-13554.
I'm closing this ticket as a duplicate. Let's keep the discussion of how to 
fix this issue in FLINK-13554.

> Job keeps failing on slot allocation timeout due to RM not allocating new TMs 
> for slot requests
> ---
>
> Key: FLINK-15456
> URL: https://issues.apache.org/jira/browse/FLINK-15456
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: Zhu Zhu
>Priority: Blocker
> Fix For: 1.10.0
>
> Attachments: jm.log, jm_part.log, jm_part2.log, tm_container_07.log
>
>
> As in the attached JM log, the job tried to start 30 TMs but only 29 were 
> registered. So the job fails because it is unable to acquire all 30 needed 
> slots in time.
> And when the failover happens and tasks are re-scheduled, the RM will not 
> ask for new TMs even if it cannot fulfill the slot requests. So the job 
> will keep failing on slot allocation timeout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-15456) Job keeps failing on slot allocation timeout due to RM not allocating new TMs for slot requests

2020-01-07 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song closed FLINK-15456.

Resolution: Duplicate

> Job keeps failing on slot allocation timeout due to RM not allocating new TMs 
> for slot requests
> ---
>
> Key: FLINK-15456
> URL: https://issues.apache.org/jira/browse/FLINK-15456
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: Zhu Zhu
>Priority: Blocker
> Fix For: 1.10.0
>
> Attachments: jm.log, jm_part.log, jm_part2.log, tm_container_07.log
>
>
> As in the attached JM log, the job tried to start 30 TMs but only 29 were 
> registered. So the job fails because it is unable to acquire all 30 needed 
> slots in time.
> And when the failover happens and tasks are re-scheduled, the RM will not 
> ask for new TMs even if it cannot fulfill the slot requests. So the job 
> will keep failing on slot allocation timeout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-13554) ResourceManager should have a timeout on starting new TaskExecutors.

2020-01-07 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song updated FLINK-13554:
-
Fix Version/s: 1.10.0

> ResourceManager should have a timeout on starting new TaskExecutors.
> 
>
> Key: FLINK-13554
> URL: https://issues.apache.org/jira/browse/FLINK-13554
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0
>Reporter: Xintong Song
>Priority: Major
> Fix For: 1.10.0
>
>
> Recently, we encountered a case where one TaskExecutor got stuck during 
> launching on Yarn (without failing), so the job could not recover from 
> continuous failovers.
> The TaskExecutor got stuck, due to a problem in our environment, somewhere 
> after the ResourceManager had started it and was waiting for it to be 
> brought up and register. Later, when the slot requests timed out, the job 
> failed over and requested slots from the ResourceManager again, but the 
> ResourceManager still saw a TaskExecutor (the stuck one) being started and 
> did not request a new container from Yarn. Therefore, the job could not 
> recover from the failure.
> I think that to avoid such an unrecoverable state, the ResourceManager needs 
> a timeout on starting new TaskExecutors. If starting a TaskExecutor takes 
> too long, the ResourceManager should fail that TaskExecutor and start a new 
> one.
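The proposed timeout could be sketched roughly as follows (a toy sketch only; all class and method names here are assumptions for illustration, not the actual ResourceManager code):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy sketch of a startup timeout for pending TaskExecutors (names assumed). */
public class StartupTimeoutSketch {

    private final long timeoutMs;
    private final Map<String, Long> pendingStartTimes = new HashMap<>();

    StartupTimeoutSketch(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Called when the ResourceManager launches a container for a TaskExecutor. */
    void onTaskExecutorStarted(String containerId, long nowMs) {
        pendingStartTimes.put(containerId, nowMs);
    }

    /** Called when the TaskExecutor successfully registers; it is no longer pending. */
    void onTaskExecutorRegistered(String containerId) {
        pendingStartTimes.remove(containerId);
    }

    /** Returns pending containers that exceeded the startup timeout and should be replaced. */
    Set<String> checkTimeouts(long nowMs) {
        Set<String> timedOut = new HashSet<>();
        for (Map.Entry<String, Long> entry : pendingStartTimes.entrySet()) {
            if (nowMs - entry.getValue() > timeoutMs) {
                timedOut.add(entry.getKey());
            }
        }
        // Timed-out containers are dropped from the pending set, so new
        // containers would be requested for the unfulfilled slot requests.
        pendingStartTimes.keySet().removeAll(timedOut);
        return timedOut;
    }
}
```

The key point is that a stuck container eventually leaves the "being started" set, so the ResourceManager no longer counts it against pending slot requests.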





[jira] [Updated] (FLINK-13554) ResourceManager should have a timeout on starting new TaskExecutors.

2020-01-07 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song updated FLINK-13554:
-
Priority: Critical  (was: Major)

> ResourceManager should have a timeout on starting new TaskExecutors.
> 
>
> Key: FLINK-13554
> URL: https://issues.apache.org/jira/browse/FLINK-13554
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0
>Reporter: Xintong Song
>Priority: Critical
> Fix For: 1.10.0
>
>
> Recently, we encountered a case where one TaskExecutor got stuck during 
> launching on Yarn (without failing), so the job could not recover from 
> continuous failovers.
> The TaskExecutor got stuck, due to a problem in our environment, somewhere 
> after the ResourceManager had started it and was waiting for it to be 
> brought up and register. Later, when the slot requests timed out, the job 
> failed over and requested slots from the ResourceManager again, but the 
> ResourceManager still saw a TaskExecutor (the stuck one) being started and 
> did not request a new container from Yarn. Therefore, the job could not 
> recover from the failure.
> I think that to avoid such an unrecoverable state, the ResourceManager needs 
> a timeout on starting new TaskExecutors. If starting a TaskExecutor takes 
> too long, the ResourceManager should fail that TaskExecutor and start a new 
> one.





[jira] [Comment Edited] (FLINK-15488) Cannot start a taskmanger if using logback

2020-01-07 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010266#comment-17010266
 ] 

Xintong Song edited comment on FLINK-15488 at 1/8/20 2:17 AM:
--

I agree with [~azagrebin] and [~trohrmann] that the logs generated during 
{{BashJavaUtils}} should be preserved. Those logs contain information about 
how memory sizes are calculated from the configuration, as well as hints on 
how to fix an improper configuration, which are important to users. I also 
agree that if the logs are configured to go to stdout, we should forward 
them to the stdout of {{taskmanager.sh}} as well.

The example mentioned by [~dwysakowicz], 
{{OptimizerPlanEnvironment#getPipeline}}, inspires another approach: if 
separating the logs from the {{BashJavaUtils}} calculation results with a 
regex is too fragile, maybe we can separate them inside {{BashJavaUtils}} by 
overwriting stdout.

To be specific, we can redirect stdout to a {{ByteArrayOutputStream}} in 
{{BashJavaUtils#main}}, and make {{getTmResourceDynamicConfigs}} and 
{{getTmResourceJvmParams}} return their results as strings. Then we can 
guarantee that the calculation result is always written as the first or last 
line (it is already always the last line, but we can make that an explicit 
contract). If the result is not generated (say, because an exception is 
thrown during the calculation), we can output an empty line as a placeholder 
and return a non-zero exit code, so the shell script can tell that the 
result is invalid by checking the exit code. A 
{{try-catch(Throwable)-finally}} can be used to make sure the captured 
stdout is always written out.
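The approach could look roughly like this (a hedged sketch; everything except the JDK classes is an assumption for illustration, not the actual {{BashJavaUtils}} code):

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.util.function.Supplier;

/** Sketch of capturing stdout so logs cannot interleave with the result line. */
public class StdoutCaptureSketch {

    /**
     * Runs the calculation with stdout redirected into a buffer.
     * Returns {capturedLogs, resultLine, exitCode}; the result line is empty
     * and the exit code non-zero if the calculation throws.
     */
    static Object[] runCaptured(Supplier<String> calculation) {
        PrintStream originalOut = System.out;
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        String result = "";
        int exitCode = 0;
        try {
            System.setOut(new PrintStream(buffer, true));
            result = calculation.get();
        } catch (Throwable t) {
            exitCode = 1; // the shell script detects an invalid result via the exit code
        } finally {
            System.setOut(originalOut); // always restore stdout
        }
        return new Object[] {buffer.toString(), result, exitCode};
    }

    public static void main(String[] args) {
        Object[] r = runCaptured(() -> {
            // Stand-in for a logger that is configured to write to stdout.
            System.out.println("INFO  derived JVM heap size: 1024m");
            return "-Xmx1073741824 -Xms1073741824"; // stand-in for the JVM params result
        });
        System.out.print(r[0]);   // forward the captured logs first
        System.out.println(r[1]); // the result is always the last line (empty on failure)
        System.exit((Integer) r[2]);
    }
}
```

The script side would then check the exit code and, only if it is zero, take the last line of the output as the result.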


was (Author: xintongsong):
I agree with [~azagrebin] and [~trohrmann] that the logs generated during 
{{BashJavaUtils}} should be preserved. Those logs contains information about 
how memory sizes are calculated from the configurations, and hints of how to 
fix an improper configuration, which are important to the users. I also agree 
that if the logs are configured to be outputted to stdout, we should also 
forward them to stdout of {{taskmanager.sh}}.

The example mentioned by [~dwysakowicz], 
{{OptimizerPlanEnvironment#getPipeline}}, inspires me with another approach. If 
separating logs and {{BashJavaUtils}} calculation results with regex is too 
fragile, maybe we can separate them inside {{BashJavaUtils}}, by overwriting 
the stdout.

To be specific, we can overwrite stdout to a {{ByteArrayOutputStream}}, and 
make {{getTmResourceDynamicConfigs}} and {{getTmResourceJvmParams}} return 
string type results. Then we can make sure the calculation result is always 
outputted in the first or last line (it is already always the last line now but 
we can make it an explicit contract). If the result is not generated (say an 
exception is thrown during the calculation), we can use an empty line as a 
placeholder, and non-zero exit code should be returned so the shell script can 
learn that the result is invalid by checking the exit code. A 
{{try-catch(Throwable)-finally}} could be used to make sure the cached stdout 
will be outputted.

> Cannot start a taskmanger if using logback
> --
>
> Key: FLINK-15488
> URL: https://issues.apache.org/jira/browse/FLINK-15488
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core, Deployment / Scripts
>Affects Versions: 1.10.0
>Reporter: Dawid Wysakowicz
>Priority: Blocker
> Fix For: 1.10.0
>
>
> When using logback it is not possible to start the taskmanager using the 
> {{taskmanager.sh}} script. The same problem (probably) occurs when using an 
> slf4j binding that logs to the console.
> The problem is that when the memory configuration is calculated with the 
> {{BashJavaUtils}} class, the result is returned through standard output. If 
> something is logged to the console, it may result in undefined behavior, 
> e.g.:
> {code}
> Error: Could not find or load main class 13:51:23.961
> {code}





[jira] [Commented] (FLINK-15488) Cannot start a taskmanger if using logback

2020-01-07 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010266#comment-17010266
 ] 

Xintong Song commented on FLINK-15488:
--

I agree with [~azagrebin] and [~trohrmann] that the logs generated during 
{{BashJavaUtils}} should be preserved. Those logs contain information about 
how memory sizes are calculated from the configuration, as well as hints on 
how to fix an improper configuration, which are important to users. I also 
agree that if the logs are configured to go to stdout, we should forward 
them to the stdout of {{taskmanager.sh}} as well.

The example mentioned by [~dwysakowicz], 
{{OptimizerPlanEnvironment#getPipeline}}, inspires another approach: if 
separating the logs from the {{BashJavaUtils}} calculation results with a 
regex is too fragile, maybe we can separate them inside {{BashJavaUtils}} by 
overwriting stdout.

To be specific, we can redirect stdout to a {{ByteArrayOutputStream}}, and 
make {{getTmResourceDynamicConfigs}} and {{getTmResourceJvmParams}} return 
their results as strings. Then we can guarantee that the calculation result 
is always written as the first or last line (it is already always the last 
line, but we can make that an explicit contract). If the result is not 
generated (say, because an exception is thrown during the calculation), we 
can output an empty line as a placeholder and return a non-zero exit code, 
so the shell script can tell that the result is invalid by checking the 
exit code. A {{try-catch(Throwable)-finally}} can be used to make sure the 
captured stdout is always written out.

> Cannot start a taskmanger if using logback
> --
>
> Key: FLINK-15488
> URL: https://issues.apache.org/jira/browse/FLINK-15488
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core, Deployment / Scripts
>Affects Versions: 1.10.0
>Reporter: Dawid Wysakowicz
>Priority: Blocker
> Fix For: 1.10.0
>
>
> When using logback it is not possible to start the taskmanager using the 
> {{taskmanager.sh}} script. The same problem (probably) occurs when using an 
> slf4j binding that logs to the console.
> The problem is that when the memory configuration is calculated with the 
> {{BashJavaUtils}} class, the result is returned through standard output. If 
> something is logged to the console, it may result in undefined behavior, 
> e.g.:
> {code}
> Error: Could not find or load main class 13:51:23.961
> {code}





[jira] [Commented] (FLINK-15448) Log host informations for TaskManager failures.

2020-01-07 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010296#comment-17010296
 ] 

Xintong Song commented on FLINK-15448:
--

[~trohrmann]
If we are touching the distributed component IDs, I would suggest making it a 
FLIP, or at least an ML discussion thread with a design doc. I can see several 
issues that need to be discussed, and there could be more:
* What information exactly do we want to include in the IDs?
* What information should be used to identify the machine where a component 
is running, especially in containerized environments like Kubernetes?
* Could the new "ID mechanism" be used to solve the problem of pending slot 
matching? Currently, the resource-profile-based matching is fragile and has 
already caused us many problems.
* The ongoing FLIP-56, where we are attempting to remove the SlotID, should 
also be taken into consideration.

> Log host informations for TaskManager failures.
> ---
>
> Key: FLINK-15448
> URL: https://issues.apache.org/jira/browse/FLINK-15448
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.9.1
>Reporter: Victor Wong
>Assignee: Victor Wong
>Priority: Minor
>
> With Flink on Yarn, sometimes we ran into an exception like this:
> {code:java}
> java.util.concurrent.TimeoutException: The heartbeat of TaskManager with id 
> container_  timed out.
> {code}
> We'd like to log into the host of the lost TaskManager for more details, 
> but we have to search the earlier logs for the host information, which is 
> a little time-consuming.
> Maybe we can add more descriptive information to the ResourceID of Yarn 
> containers, e.g. "container_xxx@host_name:port_number".
> Here's the demo:
> {code:java}
> class ResourceID {
>   final String resourceId;
>   final String details;
>   public ResourceID(String resourceId) {
> this.resourceId = resourceId;
> this.details = resourceId;
>   }
>   public ResourceID(String resourceId, String details) {
> this.resourceId = resourceId;
> this.details = details;
>   }
>   public String toString() {
> return details;
>   } 
> }
> // in flink-yarn
> private void startTaskExecutorInContainer(Container container) {
>   final String containerIdStr = container.getId().toString();
>   final String containerDetail = container.getId() + "@" + 
> container.getNodeId();  
>   final ResourceID resourceId = new ResourceID(containerIdStr, 
> containerDetail);
>   ...
> }
> {code}





[jira] [Commented] (FLINK-15499) No debug log describes the host of a TM before any task is deployed to it in YARN mode

2020-01-07 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010277#comment-17010277
 ] 

Xintong Song commented on FLINK-15499:
--

[~zhuzh]
True, the host name is not logged ATM.
I agree with you that it would be helpful to log the host information. I 
actually think this log should be at INFO level instead of DEBUG level, 
because it is used very frequently in troubleshooting, and one line per TM 
should not be too verbose.
Could you please assign this ticket to me? I can provide a fix.
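For illustration, the proposed INFO line could carry the host roughly like this (a sketch only; the class name, method name, and message wording are assumptions, not the actual Flink code):

```java
/** Illustrative sketch of a TM startup log line that includes the host. */
public class TmStartupLogSketch {

    /** Builds the proposed startup message, with the host name included. */
    static String startupMessage(String containerId, String host, int numSlots) {
        return String.format(
                "TaskExecutor %s will be started on %s with %d slots.",
                containerId, host, numSlots);
    }

    public static void main(String[] args) {
        // In the real code this would go through the INFO logger when the
        // ResourceManager launches the container; values here are made up.
        System.out.println(startupMessage("container_0001", "worker-node-3", 4));
    }
}
```

With the host in this single line, one can jump directly from the JM log to the logs of the problematic machine.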

> No debug log describes the host of a TM before any task is deployed to it  in 
> YARN mode 
> 
>
> Key: FLINK-15499
> URL: https://issues.apache.org/jira/browse/FLINK-15499
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.10.0
>Reporter: Zhu Zhu
>Priority: Major
>
> When troubleshooting FLINK-15456, I noticed a TM hanging during startup and 
> unable to register with the RM. However, there was no debug log saying which 
> host the TM was located on, so I could hardly find the logs of the 
> problematic TM.
> I think we should print the host name when starting a TM, i.e. in the log 
> line "TaskExecutor container_ will be started ...".
> This would make it possible for us to troubleshoot similar problems (not 
> only cases where a TM hangs while starting, but also cases where a TM exits 
> while starting).





[jira] [Closed] (FLINK-15925) TaskExecutors don't work out-of-the-box on Windows

2020-03-10 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song closed FLINK-15925.

Resolution: Won't Fix

According to the [ML 
discussion|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Extend-or-maintain-quot-shell-quot-script-support-for-Windows-tc37868.html],
 the Windows scripts will no longer be maintained.

> TaskExecutors don't work out-of-the-box on Windows
> --
>
> Key: FLINK-15925
> URL: https://issues.apache.org/jira/browse/FLINK-15925
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Configuration
>Affects Versions: 1.10.0
>Reporter: Chesnay Schepler
>Assignee: Xintong Song
>Priority: Major
>
> {code}
> org.apache.flink.configuration.IllegalConfigurationException: Failed to 
> create TaskExecutorResourceSpec
>   at 
> org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:72)
>  ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
>   at 
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner.startTaskManager(TaskManagerRunner.java:356)
>  ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
>   at 
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner.<init>(TaskManagerRunner.java:152)
>  ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
>   at 
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManager(TaskManagerRunner.java:308)
>  ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
>   at 
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$runTaskManagerSecurely$2(TaskManagerRunner.java:322)
>  ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
>   at 
> org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
>  ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
>   at 
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerSecurely(TaskManagerRunner.java:321)
>  [flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
>   at 
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner.main(TaskManagerRunner.java:287)
>  [flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
> Caused by: org.apache.flink.configuration.IllegalConfigurationException: The 
> required configuration option Key: 'taskmanager.cpu.cores' , default: null 
> (fallback keys: []) is not set
>   at 
> org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.checkConfigOptionIsSet(TaskExecutorResourceUtils.java:90)
>  ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
>   at 
> org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.lambda$checkTaskExecutorResourceConfigSet$0(TaskExecutorResourceUtils.java:84)
>  ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
>   at java.util.Arrays$ArrayList.forEach(Arrays.java:4390) ~[?:?]
>   at 
> org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.checkTaskExecutorResourceConfigSet(TaskExecutorResourceUtils.java:84)
>  ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
>   at 
> org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:70)
>  ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
>   ... 7 more
> {code}





[jira] [Closed] (FLINK-16563) CommandLineParser should fail with explicit error message when parsing un-recognized arguments.

2020-03-12 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song closed FLINK-16563.

Resolution: Not A Problem

> CommandLineParser should fail with explicit error message when parsing 
> un-recognized arguments.
> ---
>
> Key: FLINK-16563
> URL: https://issues.apache.org/jira/browse/FLINK-16563
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration
>Affects Versions: 1.10.0
>Reporter: Xintong Song
>Priority: Major
> Fix For: 1.11.0
>
>
> Currently, {{CommandLineParser}} will stop parsing silently if it meets an 
> unrecognized option, leaving the remaining tokens as "args" rather than 
> "options".
> This sometimes leads to problems due to the absence of subsequent options, 
> and the error messages do not point to the true root cause. 
> [Example|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-1-10-container-memory-configuration-with-Mesos-td33594.html]
>  reported in the user ML.
> I've checked, and it seems that the "args" generated by 
> {{CommandLineParser}} are not really used anywhere. Therefore, I propose to 
> make the parser fail fast with an explicit error message at unrecognized 
> tokens.
> The proposed changes are basically as follows:
>  * In {{CommandLineParser#parse}}, call {{DefaultParser#parse}} with the 
> argument {{stopAtNonOption}} set to {{false}}.
>  * Remove {{args}} from {{ClusterConfiguration}} and its sub-classes.
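To illustrate the difference {{stopAtNonOption}} makes, here is a toy parser mimicking the commons-cli semantics (a sketch only, not commons-cli or the actual {{CommandLineParser}}; it handles just "--key value" pairs):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Toy illustration of the stopAtNonOption semantics: silent stopping hides
 * every option that follows the first unrecognized token.
 */
public class StopAtNonOptionSketch {

    static Map<String, String> parse(
            List<String> recognized, String[] tokens, boolean stopAtNonOption) {
        Map<String, String> options = new HashMap<>();
        for (int i = 0; i + 1 < tokens.length; i += 2) {
            String key = tokens[i];
            if (!key.startsWith("--") || !recognized.contains(key.substring(2))) {
                if (stopAtNonOption) {
                    // Silently stop: everything from here on becomes trailing
                    // "args", including options the caller actually needed.
                    return options;
                }
                // Fail fast with an explicit error message instead.
                throw new IllegalArgumentException("Unrecognized option: " + key);
            }
            options.put(key.substring(2), tokens[i + 1]);
        }
        return options;
    }
}
```

With silent stopping, a recognized option placed after an unrecognized one is simply lost, which is exactly the class of problem reported in the user ML.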





[jira] [Updated] (FLINK-16440) Extend SlotManager metrics and status for dynamic slot allocation.

2020-03-09 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song updated FLINK-16440:
-
Description: 
* Create a slotManagerMetricGroup in resourceManagerMetricGroup, pass it into 
SM and register slot related metrics there.
** This allows registering different metrics for different SM implementations.
** For backwards compatibility, the slotManagerMetricGroup should have the same 
path as the resourceManagerMetricGroup.
* Extend ResourceOverview and TaskManagerInfo to contain TM total / free / 
allocated resources.
** Need to add methods to SM for getting TM resource status.
** For SlotManagerImpl,
*** The existing methods for getting number of registered / free slots need no 
changes.
*** TM resource status can be computed from TaskExecutorProcessSpec, slot 
profiles and number of free slots.

  was:
* Create a slotManagerMetricGroup in resourceManagerMetricGroup, pass it into 
SM and register slot related metrics there.
 * This allows registering different metrics for different SM implementation.
 * For backwards compatibility, the slotManagerMetricGroup should have the same 
path as the resourceManagerMetricGroup.


 * Extend ResourceOverview and TaskManagerInfo to contain TM total / free / 
allocated resources.
 * Need to add methods to SM for getting TM resource status.
 * For SlotManagerImpl,
 * The existing methods for getting number of registered / free slots need no 
changes.
 * TM resource status can be computed from TaskExecutorProcessSpec, slot 
profiles and number of free slots.


> Extend SlotManager metrics and status for dynamic slot allocation.
> --
>
> Key: FLINK-16440
> URL: https://issues.apache.org/jira/browse/FLINK-16440
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Metrics
>Reporter: Xintong Song
>Priority: Major
> Fix For: 1.11.0
>
>
> * Create a slotManagerMetricGroup in resourceManagerMetricGroup, pass it into 
> SM and register slot related metrics there.
> ** This allows registering different metrics for different SM implementations.
> ** For backwards compatibility, the slotManagerMetricGroup should have the 
> same path as the resourceManagerMetricGroup.
> * Extend ResourceOverview and TaskManagerInfo to contain TM total / free / 
> allocated resources.
> ** Need to add methods to SM for getting TM resource status.
> ** For SlotManagerImpl,
> *** The existing methods for getting number of registered / free slots need 
> no changes.
> *** TM resource status can be computed from TaskExecutorProcessSpec, slot 
> profiles and number of free slots.
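The last computation above can be sketched as simple arithmetic (a hedged sketch; the class and method names are assumptions, not the actual SlotManagerImpl code):

```java
import java.util.List;

/** Toy sketch: derive TM allocated/free memory from the total spec and slot profiles. */
public class TmResourceStatusSketch {

    /**
     * Returns {totalMb, allocatedMb, freeMb}: the allocated amount is the sum
     * of the allocated slot profiles, and free is total minus allocated.
     */
    static double[] memoryStatusMb(double totalMb, List<Double> allocatedSlotMb) {
        double allocated = 0;
        for (double slot : allocatedSlotMb) {
            allocated += slot;
        }
        return new double[] {totalMb, allocated, totalMb - allocated};
    }
}
```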





[jira] [Updated] (FLINK-16547) Correct the order to write temporary files in YarnClusterDescriptor#startAppMaster

2020-03-11 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song updated FLINK-16547:
-
Summary: Correct the order to write temporary files in 
YarnClusterDescriptor#startAppMaster  (was: Corrent the order to write 
temporary files in YarnClusterDescriptor#startAppMaster)

> Correct the order to write temporary files in 
> YarnClusterDescriptor#startAppMaster
> --
>
> Key: FLINK-16547
> URL: https://issues.apache.org/jira/browse/FLINK-16547
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Reporter: Canbin Zheng
>Priority: Minor
> Fix For: 1.11.0
>
>
> Currently, in {{YarnClusterDescriptor#startAppMaster}}, we first write out 
> and upload the Flink Configuration file, and only then write out the 
> JobGraph file and set its name into the Flink Configuration object. This 
> later setting is not written into the Configuration file, so it does not 
> take effect on the cluster side.
> Since the client side names the JobGraph file with the default value of the 
> FileJobGraphRetriever.JOB_GRAPH_FILE_PATH option, the cluster side can 
> still succeed in retrieving that file.
> This ticket proposes to write out the JobGraph file before the Configuration 
> file, to ensure that the setting of FileJobGraphRetriever.JOB_GRAPH_FILE_PATH 
> is delivered to the cluster side.
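The ordering dependency can be illustrated with a toy sketch ({{Properties}} standing in for Flink's Configuration; the key name and method names are assumptions for illustration):

```java
import java.util.Properties;

/** Toy sketch of why the config file must be written after the job-graph path is set. */
public class WriteOrderSketch {

    /** Buggy order: the uploaded config snapshot misses the path set afterwards. */
    static Properties uploadConfigThenJobGraph(Properties conf) {
        Properties uploaded = (Properties) conf.clone(); // config file written out first
        conf.setProperty("internal.jobgraph-path", "job.graph"); // set too late to be shipped
        return uploaded;
    }

    /** Fixed order: record the job-graph path before the config file is written out. */
    static Properties uploadJobGraphThenConfig(Properties conf) {
        conf.setProperty("internal.jobgraph-path", "job.graph");
        return (Properties) conf.clone(); // snapshot now contains the path
    }
}
```

In the buggy order the cluster side only works by coincidence, because the default file name happens to match; the fixed order makes the delivered configuration self-consistent.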





[jira] [Commented] (FLINK-16563) CommandLineParser should fail with explicit error message when parsing un-recognized arguments.

2020-03-12 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17057679#comment-17057679
 ] 

Xintong Song commented on FLINK-16563:
--

[~tison],

Correct me if I'm wrong, but I don't see how the "args" generated by 
{{CommandLineParser}} are exposed to the user.

I guess what you are talking about are the user's main args on the client 
side, while what I'm talking about are the {{CommandLineParser}} args on the 
cluster side.

I'm not sure whether the user's main args are passed to the 
{{ClusterEntrypoint}}. If they are, maybe we should change that as well, 
since they are not used on the cluster side anyway.

> CommandLineParser should fail with explicit error message when parsing 
> un-recognized arguments.
> ---
>
> Key: FLINK-16563
> URL: https://issues.apache.org/jira/browse/FLINK-16563
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration
>Affects Versions: 1.10.0
>Reporter: Xintong Song
>Priority: Major
> Fix For: 1.11.0
>
>
> Currently, {{CommandLineParser}} will stop parsing silently if it meets an 
> unrecognized option, leaving the remaining tokens as "args" rather than 
> "options".
> This sometimes leads to problems due to the absence of subsequent options, 
> and the error messages do not point to the true root cause. 
> [Example|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-1-10-container-memory-configuration-with-Mesos-td33594.html]
>  reported in the user ML.
> I've checked, and it seems that the "args" generated by 
> {{CommandLineParser}} are not really used anywhere. Therefore, I propose to 
> make the parser fail fast with an explicit error message at unrecognized 
> tokens.
> The proposed changes are basically as follows:
>  * In {{CommandLineParser#parse}}, call {{DefaultParser#parse}} with the 
> argument {{stopAtNonOption}} set to {{false}}.
>  * Remove {{args}} from {{ClusterConfiguration}} and its sub-classes.





[jira] [Updated] (FLINK-16563) CommandLineParser should fail with explicit error message when parsing un-recognized arguments.

2020-03-12 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song updated FLINK-16563:
-
Description: 
Currently, {{CommandLineParser}} will stop parsing silently if it meets an 
unrecognized option, leaving the remaining tokens to "args" rather than 
"options".

This sometimes lead to problems due to absence of subsequence options, and the 
error messages do not point to the true root cause. 
[Example|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-1-10-container-memory-configuration-with-Mesos-td33594.html]
 reported in the user ML.

I've checked and it seems that the "args" generated by {{CommandLineParser}} is 
not really used anywhere. Therefore, I propose to make the parser fail fast 
with explicit error message at unrecognized tokens.

The proposed changes are basically as follows:
 * In {{CommandLineParser#parse}}, call {{DefaultParser#parse}} with the 
argument  {{stopAtNonOption}} set to {{false}}.
 * Remove {{args}} from {{ClusterConfiguration}} and its sub-classes.

  was:
Currently, {{CommandLineParser}} will stop parsing silently if it meets an 
unrecognized option, leaving the remaining tokens to "args" rather than 
"options".

This sometimes lead to problems due to absence of subsequence options, and the 
error messages do not point to the true root cause. 
[Example|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-1-10-container-memory-configuration-with-Mesos-td33594.html]
 reported in the user ML.

I've checked and it seems that the "args" generated by {{CommandLineParser }}is 
not really used anywhere. Therefore, I propose to make the parser fail fast 
with explicit error message at unrecognized tokens.

The proposed changes are basically as follows:
 * In {{CommandLineParser#parse}}, call {{DefaultParser#parse}} with the 
argument  {{stopAtNonOption}} set to {{false}}.
 * Remove args from {{ClusterConfiguration}} and its sub-classes.


> CommandLineParser should fail with explicit error message when parsing 
> un-recognized arguments.
> ---
>
> Key: FLINK-16563
> URL: https://issues.apache.org/jira/browse/FLINK-16563
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration
>Affects Versions: 1.10.0
>Reporter: Xintong Song
>Priority: Major
> Fix For: 1.11.0
>
>
> Currently, {{CommandLineParser}} will stop parsing silently if it meets an 
> unrecognized option, leaving the remaining tokens as "args" rather than 
> "options".
> This sometimes leads to problems due to the absence of subsequent options, 
> and the error messages do not point to the true root cause. 
> [Example|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-1-10-container-memory-configuration-with-Mesos-td33594.html]
>  reported in the user ML.
> I've checked, and it seems that the "args" generated by 
> {{CommandLineParser}} are not really used anywhere. Therefore, I propose to 
> make the parser fail fast with an explicit error message at unrecognized 
> tokens.
> The proposed changes are basically as follows:
>  * In {{CommandLineParser#parse}}, call {{DefaultParser#parse}} with the 
> argument {{stopAtNonOption}} set to {{false}}.
>  * Remove {{args}} from {{ClusterConfiguration}} and its sub-classes.





[jira] [Updated] (FLINK-16563) CommandLineParser should fail with explicit error message when parsing un-recognized arguments.

2020-03-12 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song updated FLINK-16563:
-
Description: 
Currently, {{CommandLineParser}} will stop parsing silently if it meets an 
unrecognized option, leaving the remaining tokens to "args" rather than 
"options".

This sometimes leads to problems due to the absence of subsequent options, and 
the error messages do not point to the true root cause. 
[Example|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-1-10-container-memory-configuration-with-Mesos-td33594.html]
 reported in the user ML.

I've checked and it seems that the "args" generated by {{CommandLineParser }}is 
not really used anywhere. Therefore, I propose to make the parser fail fast 
with explicit error message at unrecognized tokens.

The proposed changes are basically as follows:
 * In {{CommandLineParser#parse}}, call {{DefaultParser#parse}} with the 
argument  {{stopAtNonOption}} set to {{false}}.
 * Remove args from {{ClusterConfiguration}} and its sub-classes.

  was:
Currently, {{CommandLineParser}} will stop parsing silently if it meets an 
unrecognized option, leaving the remaining tokens to "args" rather than 
"options".

This sometimes lead to problems due to absence of subsequence options, and the 
error messages do not point to the true root cause. 
[Example|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-1-10-container-memory-configuration-with-Mesos-td33594.html]]
 reported in the user ML.

I've checked and it seems that the "args" generated by {{CommandLineParser }}is 
not really used anywhere. Therefore, I propose to make the parser fail fast 
with explicit error message at unrecognized tokens.

The proposed changes are basically as follows:
 * In {{CommandLineParser#parse}}, call {{DefaultParser#parse}} with the 
argument  {{stopAtNonOption}} set to {{false}}.
 * Remove args from {{ClusterConfiguration}} and its sub-classes.


> CommandLineParser should fail with explicit error message when parsing 
> un-recognized arguments.
> ---
>
> Key: FLINK-16563
> URL: https://issues.apache.org/jira/browse/FLINK-16563
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration
>Affects Versions: 1.10.0
>Reporter: Xintong Song
>Priority: Major
> Fix For: 1.11.0
>
>
> Currently, {{CommandLineParser}} will stop parsing silently if it meets an 
> unrecognized option, leaving the remaining tokens to "args" rather than 
> "options".
> This sometimes leads to problems due to the absence of subsequent options, 
> and the error messages do not point to the true root cause. 
> [Example|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-1-10-container-memory-configuration-with-Mesos-td33594.html]
>  reported in the user ML.
> I've checked and it seems that the "args" generated by {{CommandLineParser 
> }}is not really used anywhere. Therefore, I propose to make the parser fail 
> fast with explicit error message at unrecognized tokens.
> The proposed changes are basically as follows:
>  * In {{CommandLineParser#parse}}, call {{DefaultParser#parse}} with the 
> argument  {{stopAtNonOption}} set to {{false}}.
>  * Remove args from {{ClusterConfiguration}} and its sub-classes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-16563) CommandLineParser should fail with explicit error message when parsing un-recognized arguments.

2020-03-12 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song updated FLINK-16563:
-
Summary: CommandLineParser should fail with explicit error message when 
parsing un-recognized arguments.  (was: CommandLineParser should fail with 
explicit error message when parsing recognized arguments.)

> CommandLineParser should fail with explicit error message when parsing 
> un-recognized arguments.
> ---
>
> Key: FLINK-16563
> URL: https://issues.apache.org/jira/browse/FLINK-16563
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration
>Affects Versions: 1.10.0
>Reporter: Xintong Song
>Priority: Major
> Fix For: 1.11.0
>
>
> Currently, {{CommandLineParser}} will stop parsing silently if it meets an 
> unrecognized option, leaving the remaining tokens to "args" rather than 
> "options".
> This sometimes leads to problems due to the absence of subsequent options, and 
> the error messages do not point to the true root cause. 
> [Example|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-1-10-container-memory-configuration-with-Mesos-td33594.html]
>  reported in the user ML.
> I've checked and it seems that the "args" generated by {{CommandLineParser}} 
> is not really used anywhere. Therefore, I propose to make the parser fail 
> fast with an explicit error message at unrecognized tokens.
> The proposed changes are basically as follows:
>  * In {{CommandLineParser#parse}}, call {{DefaultParser#parse}} with the 
> argument  {{stopAtNonOption}} set to {{false}}.
>  * Remove args from {{ClusterConfiguration}} and its sub-classes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16563) CommandLineParser should fail with explicit error message when parsing recognized arguments.

2020-03-12 Thread Xintong Song (Jira)
Xintong Song created FLINK-16563:


 Summary: CommandLineParser should fail with explicit error message 
when parsing recognized arguments.
 Key: FLINK-16563
 URL: https://issues.apache.org/jira/browse/FLINK-16563
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Configuration
Affects Versions: 1.10.0
Reporter: Xintong Song
 Fix For: 1.11.0


Currently, {{CommandLineParser}} will stop parsing silently if it meets an 
unrecognized option, leaving the remaining tokens to "args" rather than 
"options".

This sometimes leads to problems due to the absence of subsequent options, and the 
error messages do not point to the true root cause. See this 
[example|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-1-10-container-memory-configuration-with-Mesos-td33594.html]
 reported in the user ML.

I've checked and it seems that the "args" generated by {{CommandLineParser}} is 
not really used anywhere. Therefore, I propose to make the parser fail fast 
with an explicit error message at unrecognized tokens.

The proposed changes are basically as follows:
 * In {{CommandLineParser#parse}}, call {{DefaultParser#parse}} with the 
argument  {{stopAtNonOption}} set to {{false}}.
 * Remove args from {{ClusterConfiguration}} and its sub-classes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16563) CommandLineParser should fail with explicit error message when parsing un-recognized arguments.

2020-03-12 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17057686#comment-17057686
 ] 

Xintong Song commented on FLINK-16563:
--

[~tison],
Sorry, my bad. I was searching for usages of "args" in the testing scope.
You are right, the user program args are indeed used for generating the job 
graph in Job Clusters.

> CommandLineParser should fail with explicit error message when parsing 
> un-recognized arguments.
> ---
>
> Key: FLINK-16563
> URL: https://issues.apache.org/jira/browse/FLINK-16563
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration
>Affects Versions: 1.10.0
>Reporter: Xintong Song
>Priority: Major
> Fix For: 1.11.0
>
>
> Currently, {{CommandLineParser}} will stop parsing silently if it meets an 
> unrecognized option, leaving the remaining tokens to "args" rather than 
> "options".
> This sometimes leads to problems due to the absence of subsequent options, and 
> the error messages do not point to the true root cause. 
> [Example|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-1-10-container-memory-configuration-with-Mesos-td33594.html]
>  reported in the user ML.
> I've checked and it seems that the "args" generated by {{CommandLineParser}} 
> is not really used anywhere. Therefore, I propose to make the parser fail 
> fast with an explicit error message at unrecognized tokens.
> The proposed changes are basically as follows:
>  * In {{CommandLineParser#parse}}, call {{DefaultParser#parse}} with the 
> argument  {{stopAtNonOption}} set to {{false}}.
>  * Remove {{args}} from {{ClusterConfiguration}} and its sub-classes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-16467) MemorySizeTest#testToHumanReadableString() is not portable

2020-03-08 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song updated FLINK-16467:
-
Fix Version/s: 1.11.0
   1.10.1

> MemorySizeTest#testToHumanReadableString() is not portable
> --
>
> Key: FLINK-16467
> URL: https://issues.apache.org/jira/browse/FLINK-16467
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core
>Affects Versions: 1.10.0
> Environment: $ mvn -v
> Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 
> 2018-06-17T20:33:14+02:00)
> Maven home: /usr/local/apache-maven-3.5.4
> Java version: 1.8.0_242, vendor: Oracle Corporation, runtime: 
> /usr/local/openjdk8/jre
> Default locale: de_DE, platform encoding: UTF-8
> OS name: "freebsd", version: "12.1-stable", arch: "amd64", family: "unix"
>Reporter: Michael Osipov
>Assignee: Xintong Song
>Priority: Major
> Fix For: 1.10.1, 1.11.0
>
>
> Running this test from master gives me:
> {noformat}
> [ERROR] Failures:
> [ERROR]   MemorySizeTest.testToHumanReadableString:242
> Expected: is "1.001kb (1025 bytes)"
>  but: was "1,001kb (1025 bytes)"
> [INFO]
> {noformat}
> The test is not locale-portable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16489) In Flink on YARN mode, when the RM restarts the AM, the job does not restart from the checkpoint saved by the failed AM

2020-03-08 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17054641#comment-17054641
 ] 

Xintong Song commented on FLINK-16489:
--

Hi [~liangji],

Thanks for reporting this bug.

Could you please translate this ticket into English? The Apache Flink community 
is international, and there are lots of contributors who do not speak Chinese. 
If you have problems reporting the issue in English, you can post in Chinese to 
the user-zh mailing list ([mailto:user...@flink.apache.org]).

Moreover, it would be good to provide more information about your problems, 
e.g., logs, your Flink version, and the description of your environment. 
Otherwise, it would be hard for people to help you.

> In Flink on YARN mode, when the RM restarts the AM, the job does not restart from the checkpoint saved by the failed AM
> -
>
> Key: FLINK-16489
> URL: https://issues.apache.org/jira/browse/FLINK-16489
> Project: Flink
>  Issue Type: Bug
>Reporter: liangji
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16467) MemorySizeTest#testToHumanReadableString() is not portable

2020-03-08 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17054644#comment-17054644
 ] 

Xintong Song commented on FLINK-16467:
--

[~michael-o] Thanks for reporting this issue.

I think this is probably due to a locale issue. We use {{String.format}} for 
generating the human-readable string for {{MemorySize}}, and I found this 
[post|https://stackoverflow.com/questions/5236056/force-point-as-decimal-separator-in-java]
 that describes the problem and how to fix it.
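A minimal sketch of the pitfall and the fix described in that post. The exact format string used by {{MemorySize}} is an assumption here, but the mechanism is the same: pass an explicit locale to {{String.format}} instead of relying on the JVM default:

```java
import java.util.Locale;

public class LocalePortableFormat {

    public static void main(String[] args) {
        double kb = 1025 / 1024.0;

        // With a German locale, %f uses ',' as the decimal separator,
        // which is what broke the test under de_DE.
        String germanStyle = String.format(Locale.GERMANY, "%.3fkb", kb);

        // Locale.ROOT always uses '.', regardless of the environment.
        String portable = String.format(Locale.ROOT, "%.3fkb", kb);

        System.out.println(germanStyle + " vs " + portable); // 1,001kb vs 1.001kb
    }
}
```

Using {{Locale.ROOT}} (or any fixed locale) makes the formatted output, and hence the test, portable across environments.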

> MemorySizeTest#testToHumanReadableString() is not portable
> --
>
> Key: FLINK-16467
> URL: https://issues.apache.org/jira/browse/FLINK-16467
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core
>Affects Versions: 1.10.0
> Environment: $ mvn -v
> Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 
> 2018-06-17T20:33:14+02:00)
> Maven home: /usr/local/apache-maven-3.5.4
> Java version: 1.8.0_242, vendor: Oracle Corporation, runtime: 
> /usr/local/openjdk8/jre
> Default locale: de_DE, platform encoding: UTF-8
> OS name: "freebsd", version: "12.1-stable", arch: "amd64", family: "unix"
>Reporter: Michael Osipov
>Priority: Major
>
> Running this test from master gives me:
> {noformat}
> [ERROR] Failures:
> [ERROR]   MemorySizeTest.testToHumanReadableString:242
> Expected: is "1.001kb (1025 bytes)"
>  but: was "1,001kb (1025 bytes)"
> [INFO]
> {noformat}
> The test is not locale-portable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16467) MemorySizeTest#testToHumanReadableString() is not portable

2020-03-08 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17054658#comment-17054658
 ] 

Xintong Song commented on FLINK-16467:
--

[~zhuzh], thanks for assigning me; I've opened a PR. Please take a look at your 
convenience.

[~michael-o], could you please also help confirm whether the PR fixes the problem 
in your environment? Thanks.

> MemorySizeTest#testToHumanReadableString() is not portable
> --
>
> Key: FLINK-16467
> URL: https://issues.apache.org/jira/browse/FLINK-16467
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core
>Affects Versions: 1.10.0
> Environment: $ mvn -v
> Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 
> 2018-06-17T20:33:14+02:00)
> Maven home: /usr/local/apache-maven-3.5.4
> Java version: 1.8.0_242, vendor: Oracle Corporation, runtime: 
> /usr/local/openjdk8/jre
> Default locale: de_DE, platform encoding: UTF-8
> OS name: "freebsd", version: "12.1-stable", arch: "amd64", family: "unix"
>Reporter: Michael Osipov
>Assignee: Xintong Song
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.1, 1.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Running this test from master gives me:
> {noformat}
> [ERROR] Failures:
> [ERROR]   MemorySizeTest.testToHumanReadableString:242
> Expected: is "1.001kb (1025 bytes)"
>  but: was "1,001kb (1025 bytes)"
> [INFO]
> {noformat}
> The test is not locale-portable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16467) MemorySizeTest#testToHumanReadableString() is not portable

2020-03-09 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17054802#comment-17054802
 ] 

Xintong Song commented on FLINK-16467:
--

Thanks for the input, [~michael-o].

> MemorySizeTest#testToHumanReadableString() is not portable
> --
>
> Key: FLINK-16467
> URL: https://issues.apache.org/jira/browse/FLINK-16467
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core
>Affects Versions: 1.10.0
> Environment: $ mvn -v
> Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 
> 2018-06-17T20:33:14+02:00)
> Maven home: /usr/local/apache-maven-3.5.4
> Java version: 1.8.0_242, vendor: Oracle Corporation, runtime: 
> /usr/local/openjdk8/jre
> Default locale: de_DE, platform encoding: UTF-8
> OS name: "freebsd", version: "12.1-stable", arch: "amd64", family: "unix"
>Reporter: Michael Osipov
>Assignee: Xintong Song
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.1, 1.11.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Running this test from master gives me:
> {noformat}
> [ERROR] Failures:
> [ERROR]   MemorySizeTest.testToHumanReadableString:242
> Expected: is "1.001kb (1025 bytes)"
>  but: was "1,001kb (1025 bytes)"
> [INFO]
> {noformat}
> The test is not locale-portable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-17061) Unset process/flink memory size from configuration once dynamic worker resource is activated.

2020-04-08 Thread Xintong Song (Jira)
Xintong Song created FLINK-17061:


 Summary: Unset process/flink memory size from configuration once 
dynamic worker resource is activated.
 Key: FLINK-17061
 URL: https://issues.apache.org/jira/browse/FLINK-17061
 Project: Flink
  Issue Type: Task
  Components: Runtime / Configuration, Runtime / Coordination
Affects Versions: 1.11.0
Reporter: Xintong Song


With FLINK-14106, memory of a TaskExecutor is decided in two steps on active 
resource managers.
- {{SlotManager}} decides {{WorkerResourceSpec}}, including memory used by 
Flink tasks: task heap, task off-heap, network and managed memory.
- {{ResourceManager}} derives {{TaskExecutorProcessSpec}} from 
{{WorkerResourceSpec}} and the configuration, deciding sizes of memory used by 
Flink framework and JVM: framework heap, framework off-heap, jvm metaspace and 
jvm overhead.

This works fine for now, because both {{WorkerResourceSpec}} and 
{{TaskExecutorProcessSpec}} are derived from the same configuration. However, 
it might cause problems if we later have new {{SlotManager}} implementations 
that decide {{WorkerResourceSpec}} dynamically. In such cases, the 
process/flink memory sizes in the configuration should be ignored, or they may 
easily lead to configuration conflicts.
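A stdlib sketch of the proposed behavior, using a plain key/value map in place of Flink's {{Configuration}} class; the helper name and the boolean flag are illustrative, while the two config keys are the real total-memory options:

```java
import java.util.HashMap;
import java.util.Map;

public class UnsetTotalMemorySizes {

    // Hypothetical helper: when a SlotManager decides worker resources
    // dynamically, drop the statically configured total memory sizes so
    // they cannot conflict with the derived TaskExecutorProcessSpec.
    static Map<String, String> sanitize(Map<String, String> config, boolean dynamicWorkerResources) {
        Map<String, String> result = new HashMap<>(config);
        if (dynamicWorkerResources) {
            result.remove("taskmanager.memory.process.size");
            result.remove("taskmanager.memory.flink.size");
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("taskmanager.memory.process.size", "4g");
        conf.put("taskmanager.memory.managed.fraction", "0.4");
        // Only the non-total options survive once dynamic resources are active.
        System.out.println(sanitize(conf, true));
    }
}
```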



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16733) Refactor YarnClusterDescriptor

2020-04-15 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084513#comment-17084513
 ] 

Xintong Song commented on FLINK-16733:
--

[~rongr]

I do not have a concrete plan.

At high level, my idea is similar to what [~tison] has described.
* We will have some decorators, each dedicated to preparing a 
specific component / functionality of Flink, e.g., high availability, security, 
resource management, etc.
* To prepare the component / functionality, the decorators need to take various 
actions: upload files, write configurations, set environment variables, add 
class path, etc.
* Dedicated utils will be introduced for the decorating actions. 
`YarnClusterDescriptor` should be responsible for preparing these utils, 
passing them to the decorators, and assembling them to create the container 
launch context.

I have already started working on this, not full time though. My progress so 
far:
* Introduced ClasspathBuilder for building the class path, dealing with the 
ordering of user/framework class paths.
* Introduced ShipFileUtils for uploading files (still migrating tests)

> Refactor YarnClusterDescriptor
> --
>
> Key: FLINK-16733
> URL: https://issues.apache.org/jira/browse/FLINK-16733
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Reporter: Xintong Song
>Assignee: Xintong Song
>Priority: Minor
>
> Currently, YarnClusterDescriptor is not in a good shape. It has 1600+ lines 
> of codes, of which the method {{startAppMaster}} alone has 400+ codes, 
> leading to poor maintainability.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-17257) AbstractYarnClusterTest does not compile with Hadoop 2.10

2020-04-20 Thread Xintong Song (Jira)
Xintong Song created FLINK-17257:


 Summary: AbstractYarnClusterTest does not compile with Hadoop 2.10
 Key: FLINK-17257
 URL: https://issues.apache.org/jira/browse/FLINK-17257
 Project: Flink
  Issue Type: Bug
  Components: Deployment / YARN, Tests
Affects Versions: 1.9.3, 1.10.1, 1.11.0
Reporter: Xintong Song
 Fix For: 1.11.0, 1.10.2, 1.9.4


In {{AbstractYarnClusterTest}}, we create {{ApplicationReport}} with the static 
method {{ApplicationReport.newInstance}}, which is annotated as private and 
unstable. This method is no longer compatible in Hadoop 2.10.

As a workaround, we can create {{ApplicationReport}} with its default 
constructor and set only the fields that we need.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17257) AbstractYarnClusterTest does not compile with Hadoop 2.10

2020-04-20 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song updated FLINK-17257:
-
Affects Version/s: (was: 1.10.1)
   (was: 1.9.3)

> AbstractYarnClusterTest does not compile with Hadoop 2.10
> -
>
> Key: FLINK-17257
> URL: https://issues.apache.org/jira/browse/FLINK-17257
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Tests
>Affects Versions: 1.11.0
>Reporter: Xintong Song
>Assignee: Xintong Song
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> In {{AbstractYarnClusterTest}}, we create {{ApplicationReport}} with the 
> static method {{ApplicationReport.newInstance}}, which is annotated as 
> private and unstable. This method is no longer compatible in Hadoop 2.10.
> As a workaround, we can create {{ApplicationReport}} with its default 
> constructor and set only the fields that we need.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17257) AbstractYarnClusterTest does not compile with Hadoop 2.10

2020-04-20 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song updated FLINK-17257:
-
Fix Version/s: (was: 1.9.4)
   (was: 1.10.2)

> AbstractYarnClusterTest does not compile with Hadoop 2.10
> -
>
> Key: FLINK-17257
> URL: https://issues.apache.org/jira/browse/FLINK-17257
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Tests
>Affects Versions: 1.9.3, 1.10.1, 1.11.0
>Reporter: Xintong Song
>Assignee: Xintong Song
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> In {{AbstractYarnClusterTest}}, we create {{ApplicationReport}} with the 
> static method {{ApplicationReport.newInstance}}, which is annotated as 
> private and unstable. This method is no longer compatible in Hadoop 2.10.
> As a workaround, we can create {{ApplicationReport}} with its default 
> constructor and set only the fields that we need.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15648) Support to configure limit for CPU requirement

2020-04-20 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087666#comment-17087666
 ] 

Xintong Song commented on FLINK-15648:
--

[~felixzheng], sorry to jump into the discussion late.

I agree with you that it is a good thing to leverage the Kubernetes feature of 
container request/limit cpu. However, I'm not entirely sure about the proposed 
approach that introduces configuration options for the absolute cpu limit for 
JM/TM containers.

It might be OK with the current Flink resource management, where we have only 
two kinds of containers. But it may not work well with FLINK-14106, where we no 
longer assume that all the TMs have the same amount of resources. With the 
efforts towards fine-grained resource management, we might eventually have a 
SlotManager plugin that dynamically decides TM resources based on the workload. 
In such cases, we may not have a single `kubernetes.taskmanager.limit.cpu` 
value that works well for all TMs with different resources, and the 
SlotManager plugin should neither be aware of the specific underlying resource 
manager (K8s, Yarn or Mesos) nor decide the Kubernetes-specific 'limit.cpu'.

Alternatively, we might have a configuration option 
'kubernetes.container.cpu-limit-ratio' or so, that calculates the 'limit.cpu' 
from the 'request.cpu'. E.g., if the requested cpu is 2.0 and the ratio is 1.5, 
then the limit would be 3.0. I think this gives you practically the same 
controllability as the proposed approach, while working naturally with dynamic 
TM resources.

WDYT?
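The ratio-based derivation above can be stated in a few lines. Note that 'kubernetes.container.cpu-limit-ratio' is the option name proposed in this comment, not an existing Flink option:

```java
public class CpuLimitRatio {

    // Derives the Kubernetes cpu limit from the requested cpu and a
    // configured ratio (the proposed 'kubernetes.container.cpu-limit-ratio').
    static double cpuLimit(double requestCpu, double limitRatio) {
        if (limitRatio < 1.0) {
            throw new IllegalArgumentException(
                    "ratio must be >= 1.0 so that limit >= request");
        }
        return requestCpu * limitRatio;
    }

    public static void main(String[] args) {
        // Example from the discussion: request 2.0 cpu with ratio 1.5.
        System.out.println(cpuLimit(2.0, 1.5)); // 3.0
    }
}
```

Because the limit scales with each worker's request, the same ratio remains meaningful even when different TMs request different amounts of cpu.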

> Support to configure limit for CPU requirement
> --
>
> Key: FLINK-15648
> URL: https://issues.apache.org/jira/browse/FLINK-15648
> Project: Flink
>  Issue Type: Sub-task
>  Components: Deployment / Kubernetes
>Reporter: Canbin Zheng
>Assignee: Canbin Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The current branch uses kubernetes.xx.cpu to configure the request and limit 
> resource requirements for a Container. It's an important improvement to 
> separate these two configurations: we can use kubernetes.xx.request.cpu and 
> kubernetes.xx.limit.cpu to specify the request and limit resource 
> requirements.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-16953) TableEnvHiveConnectorTest is unstable on travis.

2020-04-02 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song updated FLINK-16953:
-
Description: 
[https://api.travis-ci.org/v3/job/670405441/log.txt]


{code:java}
[INFO] Running org.apache.flink.connectors.hive.TableEnvHiveConnectorTest
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 12.67 s 
<<< FAILURE! - in org.apache.flink.connectors.hive.TableEnvHiveConnectorTest
[ERROR] org.apache.flink.connectors.hive.TableEnvHiveConnectorTest  Time 
elapsed: 12.669 s  <<< ERROR!
java.lang.IllegalStateException: Failed to create HiveServer :Failed to get 
metastore connection
Caused by: java.lang.RuntimeException: Failed to get metastore connection
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
Caused by: java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
Caused by: java.lang.reflect.InvocationTargetException
Caused by: org.apache.hadoop.hive.metastore.api.MetaException: 
Could not connect to meta store using any of the URIs provided. Most recent 
failure: org.apache.thrift.transport.TTransportException: 
java.net.ConnectException: Connection refused (Connection refused)
at org.apache.thrift.transport.TSocket.open(TSocket.java:226)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:480)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:247)
at 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:70)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1706)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:83)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:133)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
at 
org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3600)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3652)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3632)
at 
org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3894)
at 
org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:248)
at 
org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231)
at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:388)
at org.apache.hadoop.hive.ql.metadata.Hive.create(Hive.java:332)
at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:312)
at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:288)
at org.apache.hive.service.server.HiveServer2.init(HiveServer2.java:166)
at 
com.klarna.hiverunner.HiveServerContainer.init(HiveServerContainer.java:84)
at 
com.klarna.hiverunner.builder.HiveShellBase.start(HiveShellBase.java:165)
at 
org.apache.flink.connectors.hive.FlinkStandaloneHiveRunner.createHiveServerContainer(FlinkStandaloneHiveRunner.java:217)
at 
org.apache.flink.connectors.hive.FlinkStandaloneHiveRunner.access$600(FlinkStandaloneHiveRunner.java:92)
at 
org.apache.flink.connectors.hive.FlinkStandaloneHiveRunner$2.before(FlinkStandaloneHiveRunner.java:131)
at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:46)
at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
at 

[jira] [Created] (FLINK-16953) TableEnvHiveConnectorTest is unstable on travis.

2020-04-02 Thread Xintong Song (Jira)
Xintong Song created FLINK-16953:


 Summary: TableEnvHiveConnectorTest is unstable on travis.
 Key: FLINK-16953
 URL: https://issues.apache.org/jira/browse/FLINK-16953
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Hive
Affects Versions: 1.11.0
Reporter: Xintong Song


[https://api.travis-ci.org/v3/job/670405441/log.txt]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16874) Respect the dynamic options when calculating memory options in taskmanager.sh

2020-03-30 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071479#comment-17071479
 ] 

Xintong Song commented on FLINK-16874:
--

Thanks for reporting this issue, [~karmagyz].

I would not consider this a bug, because the memory configurations are read 
before {{TaskManagerRunner}} is launched. However, I agree with you that it 
would be good to support memory configuration via dynamic properties for 
standalone clusters. IIUC, this does not require major changes. The only thing 
we need to do is to forward the user arguments to {{BashJavaUtils}}.

> Respect the dynamic options when calculating memory options in taskmanager.sh
> -
>
> Key: FLINK-16874
> URL: https://issues.apache.org/jira/browse/FLINK-16874
> Project: Flink
>  Issue Type: Improvement
>Affects Versions: 1.10.0
>Reporter: Yangze Guo
>Priority: Major
> Fix For: 1.10.1, 1.11.0
>
>
> Since FLINK-9821, the taskmanager.sh will pass user-defined dynamic options 
> to the TaskManagerRunner. However, in FLINK-13983, we calculate the 
> memory-related configuration only according to the FLINK_CONF_DIR. We then 
> append the calculation result as dynamic options to the TM, the user-defined 
> dynamic options would be overridden and ignored.
> BashJavaUtils already supports loading dynamic options; we just need to 
> pass them in the bash script.
> cc [~xintongsong] [~azagrebin]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-16874) Respect the dynamic options when calculating memory options in taskmanager.sh

2020-03-30 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song updated FLINK-16874:
-
Priority: Major  (was: Critical)

> Respect the dynamic options when calculating memory options in taskmanager.sh
> -
>
> Key: FLINK-16874
> URL: https://issues.apache.org/jira/browse/FLINK-16874
> Project: Flink
>  Issue Type: Improvement
>Affects Versions: 1.10.0
>Reporter: Yangze Guo
>Priority: Major
> Fix For: 1.10.1, 1.11.0
>
>
> Since FLINK-9821, the taskmanager.sh will pass user-defined dynamic options 
> to the TaskManagerRunner. However, in FLINK-13983, we calculate the 
> memory-related configuration only according to the FLINK_CONF_DIR. We then 
> append the calculation result as dynamic options to the TM, the user-defined 
> dynamic options would be overridden and ignored.
> BashJavaUtils already supports loading dynamic options; we just need to 
> pass them in the bash script.
> cc [~xintongsong] [~azagrebin]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16798) Logs from BashJavaUtils are not properly preserved and passed into TM logs.

2020-03-26 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067455#comment-17067455
 ] 

Xintong Song commented on FLINK-16798:
--

cc [~karmagyz] [~chesnay]

> Logs from BashJavaUtils are not properly preserved and passed into TM logs.
> ---
>
> Key: FLINK-16798
> URL: https://issues.apache.org/jira/browse/FLINK-16798
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Scripts
>Affects Versions: 1.11.0
>Reporter: Xintong Song
>Priority: Major
> Fix For: 1.11.0
>
>
> With FLINK-15519, in the TM start-up scripts, we captured the logs from 
> {{BashJavaUtils}} and passed them into the TM JVM process via an environment 
> variable. These logs are merged with the other TM logs, written to the same 
> places respecting the user's log configuration.
> This effort was broken in FLINK-15727, where the outputs from 
> {{BashJavaUtils}} are thrown away, except for the resulting JVM parameters 
> and dynamic configurations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16798) Logs from BashJavaUtils are not properly preserved and passed into TM logs.

2020-03-26 Thread Xintong Song (Jira)
Xintong Song created FLINK-16798:


 Summary: Logs from BashJavaUtils are not properly preserved and 
passed into TM logs.
 Key: FLINK-16798
 URL: https://issues.apache.org/jira/browse/FLINK-16798
 Project: Flink
  Issue Type: Bug
  Components: Deployment / Scripts
Affects Versions: 1.11.0
Reporter: Xintong Song
 Fix For: 1.11.0


With FLINK-15519, in the TM start-up scripts, we captured the logs from 
{{BashJavaUtils}} and passed them into the TM JVM process via an environment 
variable. These logs are merged with the other TM logs, written to the same 
places respecting the user's log configuration.

This effort was broken in FLINK-15727, where the outputs from {{BashJavaUtils}} 
are thrown away, except for the resulting JVM parameters and dynamic 
configurations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17403) Fix invalid classpath in BashJavaUtilsITCase

2020-04-27 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093301#comment-17093301
 ] 

Xintong Song commented on FLINK-17403:
--

BTW, would you like to work on this? [~Paul Lin]

> Fix invalid classpath in BashJavaUtilsITCase
> 
>
> Key: FLINK-17403
> URL: https://issues.apache.org/jira/browse/FLINK-17403
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.10.0, 1.11.0
>Reporter: Paul Lin
>Priority: Major
> Fix For: 1.11.0, 1.10.2
>
>
> runBashJavaUtilsCmd.sh locates flink-dist.jar via `find` with the pattern 
> `flink-dist*.jar`, but it doesn't filter out the flink-dist*-sources.jar built 
> by maven or the flink-dist jar in the original bin directory, so it might 
> return 3 jars, which might break the commands that depend on it.
> For instance, the result of `find` can be:
> ```
> project_dir/flink-dist/src/test/bin/../../../target/flink-dist_2.11-1.10.0-sources.jar
> project_dir/flink-dist/src/test/bin/../../../target/flink-1.10.0-bin/flink-1.10.0/lib/flink-dist_2.11-1.10.0.jar
> project_dir/flink-dist/src/test/bin/../../../target/flink-dist_2.11-1.10.0.jar
> ```
> Moreover, there's a redundant `}` in the command, which seems to be 
> accidentally masked by the multi-line result returned by `find`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17403) Fix invalid classpath in BashJavaUtilsITCase

2020-04-27 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093297#comment-17093297
 ] 

Xintong Song commented on FLINK-17403:
--

[~aljoscha]
 Thanks for pulling me in.

[~Paul Lin]
 Thanks for reporting this issue. It seems to me this is a valid problem. 

As for the solution, I think we should probably also consider the possibility 
that there are jar files with other suffixes under {{$FLINK_TARGET_DIR}}, e.g., 
{{flink-dist*-test.jar}}. Of course we could further refine the patterns for 
the {{find}} command. However, I'm in favor of an alternative approach: properly 
add all jar files that match the pattern to the classpath. That saves us from 
depending closely on the specific naming pattern of the flink-dist jar file. 
WDYT?

> Fix invalid classpath in BashJavaUtilsITCase
> 
>
> Key: FLINK-17403
> URL: https://issues.apache.org/jira/browse/FLINK-17403
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.10.0
>Reporter: Paul Lin
>Priority: Major
>
> runBashJavaUtilsCmd.sh locates flink-dist.jar via `find` with the pattern 
> `flink-dist*.jar`, but it doesn't filter out the flink-dist*-sources.jar built 
> by maven or the flink-dist jar in the original bin directory, so it might 
> return 3 jars, which might break the commands that depend on it.
> For instance, the result of `find` can be:
> ```
> project_dir/flink-dist/src/test/bin/../../../target/flink-dist_2.11-1.10.0-sources.jar
> project_dir/flink-dist/src/test/bin/../../../target/flink-1.10.0-bin/flink-1.10.0/lib/flink-dist_2.11-1.10.0.jar
> project_dir/flink-dist/src/test/bin/../../../target/flink-dist_2.11-1.10.0.jar
> ```
> Moreover, there's a redundant `}` in the command, which seems to be 
> accidentally masked by the multi-line result returned by `find`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17403) Fix invalid classpath in BashJavaUtilsITCase

2020-04-27 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093313#comment-17093313
 ] 

Xintong Song commented on FLINK-17403:
--

bq. How about restricting the search scope to `-maxdepth 1` under `target` 
directory, and joining the jars found with `:` before adding them to the 
classpath?

Sounds good to me.

[~aljoscha], could you help assign this ticket to [~Paul Lin]. Thanks.
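The agreed-upon fix can be sketched roughly as follows. This is a hedged illustration of the idea discussed above, not the actual runBashJavaUtilsCmd.sh change; the directory and jar names are made up for the example.

```shell
# Hypothetical sketch of the discussed fix: search only one level deep under
# the target directory, exclude the sources jar, and join all matches with ':'
# before putting them on the classpath. All paths here are made-up examples.
TARGET_DIR=$(mktemp -d)
touch "$TARGET_DIR/flink-dist_2.11-1.10.0.jar" \
      "$TARGET_DIR/flink-dist_2.11-1.10.0-sources.jar"
CLASSPATH=$(find "$TARGET_DIR" -maxdepth 1 -name 'flink-dist*.jar' \
            ! -name '*-sources.jar' | paste -sd ':' -)
echo "$CLASSPATH"
```

Joining with `:` keeps the command working even when `find` returns several jars, instead of silently producing a multi-line classpath.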

> Fix invalid classpath in BashJavaUtilsITCase
> 
>
> Key: FLINK-17403
> URL: https://issues.apache.org/jira/browse/FLINK-17403
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.10.0, 1.11.0
>Reporter: Paul Lin
>Priority: Major
> Fix For: 1.11.0, 1.10.2
>
>
> runBashJavaUtilsCmd.sh locates flink-dist.jar via `find` with the pattern 
> `flink-dist*.jar`, but it doesn't filter out the flink-dist*-sources.jar built 
> by maven or the flink-dist jar in the original bin directory, so it might 
> return 3 jars, which might break the commands that depend on it.
> For instance, the result of `find` can be:
> ```
> project_dir/flink-dist/src/test/bin/../../../target/flink-dist_2.11-1.10.0-sources.jar
> project_dir/flink-dist/src/test/bin/../../../target/flink-1.10.0-bin/flink-1.10.0/lib/flink-dist_2.11-1.10.0.jar
> project_dir/flink-dist/src/test/bin/../../../target/flink-dist_2.11-1.10.0.jar
> ```
> Moreover, there's a redundant `}` in the command, which seems to be 
> accidentally masked by the multi-line result returned by `find`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17403) Fix invalid classpath in BashJavaUtilsITCase

2020-04-27 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song updated FLINK-17403:
-
Fix Version/s: 1.10.2
   1.11.0

> Fix invalid classpath in BashJavaUtilsITCase
> 
>
> Key: FLINK-17403
> URL: https://issues.apache.org/jira/browse/FLINK-17403
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.10.0, 1.11.0
>Reporter: Paul Lin
>Priority: Major
> Fix For: 1.11.0, 1.10.2
>
>
> runBashJavaUtilsCmd.sh locates flink-dist.jar via `find` with the pattern 
> `flink-dist*.jar`, but it doesn't filter out the flink-dist*-sources.jar built 
> by maven or the flink-dist jar in the original bin directory, so it might 
> return 3 jars, which might break the commands that depend on it.
> For instance, the result of `find` can be:
> ```
> project_dir/flink-dist/src/test/bin/../../../target/flink-dist_2.11-1.10.0-sources.jar
> project_dir/flink-dist/src/test/bin/../../../target/flink-1.10.0-bin/flink-1.10.0/lib/flink-dist_2.11-1.10.0.jar
> project_dir/flink-dist/src/test/bin/../../../target/flink-dist_2.11-1.10.0.jar
> ```
> Moreover, there's a redundant `}` in the command, which seems to be 
> accidentally masked by the multi-line result returned by `find`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17403) Fix invalid classpath in BashJavaUtilsITCase

2020-04-27 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song updated FLINK-17403:
-
Affects Version/s: 1.11.0

> Fix invalid classpath in BashJavaUtilsITCase
> 
>
> Key: FLINK-17403
> URL: https://issues.apache.org/jira/browse/FLINK-17403
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.10.0, 1.11.0
>Reporter: Paul Lin
>Priority: Major
>
> runBashJavaUtilsCmd.sh locates flink-dist.jar via `find` with the pattern 
> `flink-dist*.jar`, but it doesn't filter out the flink-dist*-sources.jar built 
> by maven or the flink-dist jar in the original bin directory, so it might 
> return 3 jars, which might break the commands that depend on it.
> For instance, the result of `find` can be:
> ```
> project_dir/flink-dist/src/test/bin/../../../target/flink-dist_2.11-1.10.0-sources.jar
> project_dir/flink-dist/src/test/bin/../../../target/flink-1.10.0-bin/flink-1.10.0/lib/flink-dist_2.11-1.10.0.jar
> project_dir/flink-dist/src/test/bin/../../../target/flink-dist_2.11-1.10.0.jar
> ```
> Moreover, there's a redundant `}` in the command, which seems to be 
> accidentally masked by the multi-line result returned by `find`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17061) Unset process/flink memory size from configuration once dynamic worker resource is activated.

2020-04-24 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091463#comment-17091463
 ] 

Xintong Song commented on FLINK-17061:
--

FYI, this ticket is opened according to the discussion in the following PR.
https://github.com/apache/flink/pull/11353#discussion_r404639691

> Unset process/flink memory size from configuration once dynamic worker 
> resource is activated.
> -
>
> Key: FLINK-17061
> URL: https://issues.apache.org/jira/browse/FLINK-17061
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Configuration, Runtime / Coordination
>Affects Versions: 1.11.0
>Reporter: Xintong Song
>Priority: Major
>
> With FLINK-14106, memory of a TaskExecutor is decided in two steps on active 
> resource managers.
> - {{SlotManager}} decides {{WorkerResourceSpec}}, including memory used by 
> Flink tasks: task heap, task off-heap, network and managed memory.
> - {{ResourceManager}} derives {{TaskExecutorProcessSpec}} from 
> {{WorkerResourceSpec}} and the configuration, deciding sizes of memory used 
> by Flink framework and JVM: framework heap, framework off-heap, jvm metaspace 
> and jvm overhead.
> This works fine for now, because both {{WorkerResourceSpec}} and 
> {{TaskExecutorProcessSpec}} are derived from the same configurations. 
> However, it might cause problems if we later have new {{SlotManager}} 
> implementations that decide {{WorkerResourceSpec}} dynamically. In such 
> cases, the process/flink sizes in configuration should be ignored, or it may 
> easily lead to configuration conflicts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17061) Unset process/flink memory size from configuration once dynamic worker resource is activated.

2020-04-24 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091460#comment-17091460
 ] 

Xintong Song commented on FLINK-17061:
--

As mentioned in the description, this is not a problem for now. It only becomes 
relevant once we have a slot manager implementation that dynamically decides TM 
resources according to the job's resource requirements.

Problems could arise not only when the requested worker resource is larger than 
the configured flink/process memory size, but also when it is much smaller than 
that.
With FLIP-49, we have the following equation.
{code}
processMemorySize = workerResourceSpec.totalMemorySize + frameworkMemorySize + 
jvmMetaspace + jvmOverhead
{code}

{{frameworkMemorySize}} and {{jvmMetaspace}} always have absolute values 
(explicitly configured or default), and {{jvmOverhead}} always has an absolute 
min-max range.

If {{processMemorySize}} is also explicitly configured, we might run into a conflict when:
{code}
processMemorySize > workerResourceSpec.totalMemorySize + frameworkMemorySize + 
jvmMetaspace + jvmOverhead.max
{code}
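With made-up sizes (in MB), the conflict condition above can be checked like this. The numbers are purely illustrative and are not Flink defaults.

```shell
# Illustrative check of the conflict condition from the comment above,
# using made-up sizes in MB (not Flink defaults).
worker_total=2048      # workerResourceSpec.totalMemorySize
framework=128          # frameworkMemorySize (absolute value)
metaspace=96           # jvmMetaspace (absolute value)
overhead_max=512       # jvmOverhead.max (absolute upper bound)
process_configured=3072

derived_max=$((worker_total + framework + metaspace + overhead_max))
if [ "$process_configured" -gt "$derived_max" ]; then
  RESULT="conflict: configured process size $process_configured > derived max $derived_max"
else
  RESULT="ok"
fi
echo "$RESULT"
```

Here the explicitly configured process size (3072) exceeds the largest process size derivable from the dynamically decided worker resource (2784), so the two settings conflict.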

> Unset process/flink memory size from configuration once dynamic worker 
> resource is activated.
> -
>
> Key: FLINK-17061
> URL: https://issues.apache.org/jira/browse/FLINK-17061
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Configuration, Runtime / Coordination
>Affects Versions: 1.11.0
>Reporter: Xintong Song
>Priority: Major
>
> With FLINK-14106, memory of a TaskExecutor is decided in two steps on active 
> resource managers.
> - {{SlotManager}} decides {{WorkerResourceSpec}}, including memory used by 
> Flink tasks: task heap, task off-heap, network and managed memory.
> - {{ResourceManager}} derives {{TaskExecutorProcessSpec}} from 
> {{WorkerResourceSpec}} and the configuration, deciding sizes of memory used 
> by Flink framework and JVM: framework heap, framework off-heap, jvm metaspace 
> and jvm overhead.
> This works fine for now, because both {{WorkerResourceSpec}} and 
> {{TaskExecutorProcessSpec}} are derived from the same configurations. 
> However, it might cause problems if we later have new {{SlotManager}} 
> implementations that decide {{WorkerResourceSpec}} dynamically. In such 
> cases, the process/flink sizes in configuration should be ignored, or it may 
> easily lead to configuration conflicts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17061) Unset process/flink memory size from configuration once dynamic worker resource is activated.

2020-04-24 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091442#comment-17091442
 ] 

Xintong Song commented on FLINK-17061:
--

An example could be the slot manager requests a worker with more memory than 
the total flink/process memory size in configuration.



> Unset process/flink memory size from configuration once dynamic worker 
> resource is activated.
> -
>
> Key: FLINK-17061
> URL: https://issues.apache.org/jira/browse/FLINK-17061
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Configuration, Runtime / Coordination
>Affects Versions: 1.11.0
>Reporter: Xintong Song
>Priority: Major
>
> With FLINK-14106, memory of a TaskExecutor is decided in two steps on active 
> resource managers.
> - {{SlotManager}} decides {{WorkerResourceSpec}}, including memory used by 
> Flink tasks: task heap, task off-heap, network and managed memory.
> - {{ResourceManager}} derives {{TaskExecutorProcessSpec}} from 
> {{WorkerResourceSpec}} and the configuration, deciding sizes of memory used 
> by Flink framework and JVM: framework heap, framework off-heap, jvm metaspace 
> and jvm overhead.
> This works fine for now, because both {{WorkerResourceSpec}} and 
> {{TaskExecutorProcessSpec}} are derived from the same configurations. 
> However, it might cause problems if we later have new {{SlotManager}} 
> implementations that decide {{WorkerResourceSpec}} dynamically. In such 
> cases, the process/flink sizes in configuration should be ignored, or it may 
> easily lead to configuration conflicts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure

2020-04-23 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091096#comment-17091096
 ] 

Xintong Song commented on FLINK-17273:
--

+1 for revisiting the boundary between {{ResourceManager}} and its deployment 
specific implementations.
I think this would help deduplicate the worker lifecycle control flow across 
deployments. The RM implementations should only handle the minimum set of 
deployment-specific API/behavior differences.

> Fix not calling ResourceManager#closeTaskManagerConnection in 
> KubernetesResourceManager in case of registered TaskExecutor failure
> --
>
> Key: FLINK-17273
> URL: https://issues.apache.org/jira/browse/FLINK-17273
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes, Runtime / Coordination
>Affects Versions: 1.10.0, 1.10.1
>Reporter: Canbin Zheng
>Assignee: Canbin Zheng
>Priority: Major
> Fix For: 1.11.0
>
>
> At the moment, the {{KubernetesResourceManager}} does not call the method of 
> {{ResourceManager#closeTaskManagerConnection}} once it detects that a 
> currently registered task executor has failed. This ticket proposes to fix 
> this problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-17390) Container resource cannot be mapped on Hadoop 2.10+

2020-04-26 Thread Xintong Song (Jira)
Xintong Song created FLINK-17390:


 Summary: Container resource cannot be mapped on Hadoop 2.10+
 Key: FLINK-17390
 URL: https://issues.apache.org/jira/browse/FLINK-17390
 Project: Flink
  Issue Type: Bug
  Components: Deployment / YARN
Affects Versions: 1.11.0
Reporter: Xintong Song
 Fix For: 1.11.0


In FLINK-16438, we introduced {{WorkerSpecContainerResourceAdapter}} for 
mapping Yarn container {{Resource}} to Flink {{WorkerResourceSpec}}. Inside 
this class, we use {{Resource}} for hash map keys and set elements, assuming 
that {{Resource}} instances that describe the same set of resources have the 
same hash code.

This assumption is not always true. {{Resource}} is an abstract class and may 
have different implementations. In Hadoop 2.10+, {{LightWeightResource}}, a new 
implementation of {{Resource}}, is introduced for {{Resource}} instances created 
by {{Resource.newInstance}} on the AM side, and it overrides the {{hashCode}} 
method. That means a {{Resource}} generated on the AM side may have a different 
hash code than an equal {{Resource}} returned from Yarn.

To solve this problem, we may introduce an {{InternalResource}} as an inner 
class of {{WorkerSpecContainerResourceAdapter}}, whose {{hashCode}} depends 
only on the fields needed by Flink (currently memory and vcores). 
{{WorkerSpecContainerResourceAdapter}} should then use only {{InternalResource}} 
for internal state management, and convert the {{Resource}} instances passed 
into and returned from it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17390) Container resource cannot be mapped on Hadoop 2.10+

2020-04-26 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17092679#comment-17092679
 ] 

Xintong Song commented on FLINK-17390:
--

cc [~trohrmann]

> Container resource cannot be mapped on Hadoop 2.10+
> ---
>
> Key: FLINK-17390
> URL: https://issues.apache.org/jira/browse/FLINK-17390
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Affects Versions: 1.11.0
>Reporter: Xintong Song
>Priority: Blocker
> Fix For: 1.11.0
>
>
> In FLINK-16438, we introduced {{WorkerSpecContainerResourceAdapter}} for 
> mapping Yarn container {{Resource}} to Flink {{WorkerResourceSpec}}. Inside 
> this class, we use {{Resource}} for hash map keys and set elements, assuming 
> that {{Resource}} instances that describe the same set of resources have the 
> same hash code.
> This assumption is not always true. {{Resource}} is an abstract class and may 
> have different implementations. In Hadoop 2.10+, {{LightWeightResource}}, a 
> new implementation of {{Resource}}, is introduced for {{Resource}} instances 
> created by {{Resource.newInstance}} on the AM side, and it overrides the 
> {{hashCode}} method. That means a {{Resource}} generated on the AM side may 
> have a different hash code than an equal {{Resource}} returned from Yarn.
> To solve this problem, we may introduce an {{InternalResource}} as an inner 
> class of {{WorkerSpecContainerResourceAdapter}}, whose {{hashCode}} depends 
> only on the fields needed by Flink (currently memory and vcores). 
> {{WorkerSpecContainerResourceAdapter}} should then use only 
> {{InternalResource}} for internal state management, and convert the 
> {{Resource}} instances passed into and returned from it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure

2020-04-21 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089219#comment-17089219
 ] 

Xintong Song commented on FLINK-17273:
--

I think this is a valid issue. +1 for fixing it.
IIUC, we can call {{ResourceManager#closeTaskManagerConnection}} in 
{{KubernetesResourceManager#internalStopPod}}?

> Fix not calling ResourceManager#closeTaskManagerConnection in 
> KubernetesResourceManager in case of registered TaskExecutor failure
> --
>
> Key: FLINK-17273
> URL: https://issues.apache.org/jira/browse/FLINK-17273
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes, Runtime / Coordination
>Affects Versions: 1.10.0, 1.10.1
>Reporter: Canbin Zheng
>Assignee: Canbin Zheng
>Priority: Major
> Fix For: 1.11.0
>
>
> At the moment, the {{KubernetesResourceManager}} does not call the method of 
> {{ResourceManager#closeTaskManagerConnection}} once it detects that a 
> currently registered task executor has failed. This ticket proposes to fix 
> this problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16733) Refactor YarnClusterDescriptor

2020-05-05 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099679#comment-17099679
 ] 

Xintong Song commented on FLINK-16733:
--

[~kkl0u],
No worries. Please continue with your PR for FLINK-17515.
My work on this ticket has been stalled already. There's no reason to block 
other efforts.
I'll rebase onto your changes once they're merged. I just had a glance at your 
PR, and from what I see there shouldn't be much problem rebasing my code onto 
it.

> Refactor YarnClusterDescriptor
> --
>
> Key: FLINK-16733
> URL: https://issues.apache.org/jira/browse/FLINK-16733
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Reporter: Xintong Song
>Assignee: Xintong Song
>Priority: Minor
>
> Currently, YarnClusterDescriptor is not in good shape. It has 1600+ lines of 
> code, of which the method {{startAppMaster}} alone accounts for 400+ lines, 
> leading to poor maintainability.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-17422) Create user document for the external resource framework and the GPU plugin..

2020-04-27 Thread Xintong Song (Jira)
Xintong Song created FLINK-17422:


 Summary: Create user document for the external resource framework 
and the GPU plugin..
 Key: FLINK-17422
 URL: https://issues.apache.org/jira/browse/FLINK-17422
 Project: Flink
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 1.11.0
Reporter: Xintong Song
 Fix For: 1.11.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15527) can not control the number of container on yarn single job module

2020-04-28 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094248#comment-17094248
 ] 

Xintong Song commented on FLINK-15527:
--

Hi all,
I believe this issue should be solved by FLINK-16605, which is very close to 
being finished. It would be appreciated if you could help confirm that. If 
there are no objections, I'll close this ticket.
Thanks.

> can not control the number of container on yarn single job module
> -
>
> Key: FLINK-15527
> URL: https://issues.apache.org/jira/browse/FLINK-15527
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Affects Versions: 1.10.0
>Reporter: chenchencc
>Priority: Major
>  Labels: usability
> Fix For: 1.11.0
>
> Attachments: flink-conf.yaml, image-2020-01-09-14-30-46-973.png, 
> yarn_application.png
>
>
> When running a single job on YARN, many containers are launched even though parallelism is set to 4.
> *scripts:*
> ./bin/flink run -m yarn-cluster -ys 3 -p 4 -yjm 1024m -ytm 4096m -yqu bi -c 
> com.cc.test.HiveTest2 ./cc_jars/hive-1.0-SNAPSHOT.jar 11.txt test61 6
> _note_: 1.9.1 had the CLI parameter -yn to control the number of containers; 
> it was removed in 1.10.
> *result:*
> the number of containers is 500+
>  
> *code use:*
> the code queries the table and saves the result to HDFS as text
>  
> the table's storage size is 200g+
>  
>  
>  
>  
> *code:*
> com.cc.test.HiveTest2
> {code:java}
> public static void main(String[] args) throws Exception {
>     EnvironmentSettings settings = EnvironmentSettings.newInstance()
>             .useBlinkPlanner().inStreamingMode().build();
>     StreamExecutionEnvironment settings2 = StreamExecutionEnvironment.getExecutionEnvironment();
>     settings2.setParallelism(Integer.valueOf(args[2]));
>     StreamTableEnvironment tableEnv = StreamTableEnvironment.create(settings2, settings);
>     String name = "myhive";
>     String defaultDatabase = "test";
>     String hiveConfDir = "/etc/hive/conf";
>     String version = "1.2.1"; // or 1.2.1 2.3.4
>     HiveCatalog hive = new HiveCatalog(name, defaultDatabase, hiveConfDir, version);
>     tableEnv.registerCatalog("myhive", hive);
>     tableEnv.useCatalog("myhive");
>     tableEnv.listTables();
>     Table table = tableEnv.sqlQuery(
>             "select id from orderparent_test2 where id = 'A21204170176'");
>     tableEnv.toAppendStream(table, Row.class).print();
>     tableEnv.toAppendStream(table, Row.class)
>             .writeAsText("hdfs:///user/chenchao1/" + args[0], FileSystem.WriteMode.OVERWRITE);
>     tableEnv.execute(args[1]);
> }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17465) Update Chinese user documentation for job manager memory model

2020-04-29 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096053#comment-17096053
 ] 

Xintong Song commented on FLINK-17465:
--

[~azagrebin] I can work on this ticket. Please assign it to me.

> Update Chinese user documentation for job manager memory model
> --
>
> Key: FLINK-17465
> URL: https://issues.apache.org/jira/browse/FLINK-17465
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Andrey Zagrebin
>Priority: Major
> Fix For: 1.11.0
>
>
> This is a follow-up for FLINK-16946.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17493) Possible direct memory leak in cassandra sink

2020-05-17 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17109803#comment-17109803
 ] 

Xintong Song commented on FLINK-17493:
--

[~nobleyd],

That sounds like memory leak to me.

What I'm not sure is whether the leak comes from Cassandra sink. Could you try 
whether [FLINK-16408|https://issues.apache.org/jira/browse/FLINK-16408] fixes 
your problem?

> Possible direct memory leak in cassandra sink
> -
>
> Key: FLINK-17493
> URL: https://issues.apache.org/jira/browse/FLINK-17493
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Cassandra
>Affects Versions: 1.9.0, 1.10.0
>Reporter: nobleyd
>Priority: Major
> Attachments: image-2020-05-14-21-58-59-152.png
>
>
> # The Cassandra sink uses direct memory.
>  # Start a standalone cluster (1 machine) for the test.
>  # After the cluster has started, check the Flink web UI and record the task 
> manager's memory info, specifically the direct memory part.
>  # Start a job that reads from Kafka and writes to Cassandra using the 
> Cassandra sink; you can see the direct memory count in the 'Outside JVM' 
> part go up.
>  # Stop the job; the direct memory count does not decrease (use 'jmap 
> -histo:live pid' to force the task manager to GC).
>  # Repeat several times; the direct memory count keeps growing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16267) Flink uses more memory than taskmanager.memory.process.size in Kubernetes

2020-05-17 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17109802#comment-17109802
 ] 

Xintong Song commented on FLINK-16267:
--

[~trystan]

Sorry for the confusion.

As [~azagrebin] explained, `taskmanager.memory.process.size` is used for 
deciding the size of the containers to request, as well as the JVM parameters 
(e.g., -Xmx) and the sizes of memory pools that are fully controlled by Flink 
(e.g., the network buffer pool). Flink tries its best to control its memory 
usage; however, not all memory usage is controllable.
{quote}As a user I would like to be able to restrict a job to a certain known 
amount of memory. You can do this with a regular JVM application, why not Flink?
{quote}
I think this is not very accurate. Generally, for a Java application, there are 
three categories of memory.
 * JVM heap memory. This is memory virtualized by the JVM, and its size can be 
capped with the parameter "-Xmx".
 * Off-heap memory controlled by the JVM. This is native OS memory outside the 
JVM heap, but its usage is still limited by the JVM. It includes direct memory 
and metaspace, with the corresponding JVM parameters "-XX:MaxDirectMemorySize" 
and "-XX:MaxMetaspaceSize" limiting their sizes.
 * Off-heap memory not controlled by the JVM. This is native OS memory outside 
the JVM heap whose usage is not limited by the JVM.

The third category cannot be restricted by the Java application unless we do 
bookkeeping for every memory allocation/deallocation. Flink does such 
bookkeeping for its managed memory, but cannot do so for user code / libraries 
and JVM overhead (e.g., thread stacks).
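As an illustration, the first two categories map to JVM launch flags, while the third has none. The sizes below are arbitrary examples, not recommended Flink settings.

```shell
# The first two memory categories can be capped with JVM launch flags; the
# third (native memory used by user code / JVM overhead) has no such flag.
# Sizes are arbitrary examples, not recommended Flink settings.
HEAP_LIMIT="-Xmx1g"                            # category 1: JVM heap
DIRECT_LIMIT="-XX:MaxDirectMemorySize=256m"    # category 2: direct memory
METASPACE_LIMIT="-XX:MaxMetaspaceSize=128m"    # category 2: metaspace
JVM_ARGS="$HEAP_LIMIT $DIRECT_LIMIT $METASPACE_LIMIT"
echo "$JVM_ARGS"
```

Even with all three flags set, a process can exceed the container limit through category-three allocations, which is why the process size is only a best-effort bound.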

> Flink uses more memory than taskmanager.memory.process.size in Kubernetes
> -
>
> Key: FLINK-16267
> URL: https://issues.apache.org/jira/browse/FLINK-16267
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Affects Versions: 1.10.0
>Reporter: ChangZhuo Chen (陳昌倬)
>Priority: Major
> Attachments: flink-conf_1.10.0.yaml, flink-conf_1.9.1.yaml, 
> oomkilled_taskmanager.log
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This issue is from 
> [https://stackoverflow.com/questions/60336764/flink-uses-more-memory-than-taskmanager-memory-process-size-in-kubernetes]
> h1. Description
>  * In Flink 1.10.0, we try to use `taskmanager.memory.process.size` to limit 
> the resource used by taskmanager to ensure they are not killed by Kubernetes. 
> However, we still get lots of taskmanager `OOMKilled`. The setup is in the 
> following section.
>  * The taskmanager log is in attachment [^oomkilled_taskmanager.log].
> h2. Kubernete
>  * The Kubernetes setup is the same as described in 
> [https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/deployment/kubernetes.html].
>  * The following is resource configuration for taskmanager deployment in 
> Kubernetes:
> {{resources:}}
>  {{  requests:}}
>  {{    cpu: 1000m}}
>  {{    memory: 4096Mi}}
>  {{  limits:}}
>  {{    cpu: 1000m}}
>  {{    memory: 4096Mi}}
> h2. Flink Docker
>  * The Flink docker is built by the following Docker file.
> {{FROM flink:1.10-scala_2.11}}
> RUN mkdir -p /opt/flink/plugins/s3 &&
> ln -s /opt/flink/opt/flink-s3-fs-presto-1.10.0.jar /opt/flink/plugins/s3/
>  {{RUN ln -s /opt/flink/opt/flink-metrics-prometheus-1.10.0.jar 
> /opt/flink/lib/}}
> h2. Flink Configuration
>  * The following are all memory related configurations in `flink-conf.yaml` 
> in 1.10.0:
> {{jobmanager.heap.size: 820m}}
>  {{taskmanager.memory.jvm-metaspace.size: 128m}}
>  {{taskmanager.memory.process.size: 4096m}}
>  * We use RocksDB and we don't set `state.backend.rocksdb.memory.managed` in 
> `flink-conf.yaml`.
>  ** Use S3 as checkpoint storage.
>  * The code uses DateStream API
>  ** input/output are both Kafka.
> h2. Project Dependencies
>  * The following are our dependencies:
> val flinkVersion = "1.10.0"
>
> libraryDependencies += "com.squareup.okhttp3" % "okhttp" % "4.2.2"
> libraryDependencies += "com.typesafe" % "config" % "1.4.0"
> libraryDependencies += "joda-time" % "joda-time" % "2.10.5"
> libraryDependencies += "org.apache.flink" %% "flink-connector-kafka" % flinkVersion
> libraryDependencies += "org.apache.flink" % "flink-metrics-dropwizard" % flinkVersion
> libraryDependencies += "org.apache.flink" %% "flink-scala" % flinkVersion % "provided"
> libraryDependencies += "org.apache.flink" %% "flink-statebackend-rocksdb" % flinkVersion % "provided"
> libraryDependencies += "org.apache.flink" %% "flink-streaming-scala" % flinkVersion % "provided"
> libraryDependencies += "org.json4s" %% "json4s-jackson" % "3.6.7"
> libraryDependencies += "org.log4s" %% "log4s" % "1.8.2"
> libraryDependencies += 

[jira] [Created] (FLINK-17551) Documentation of stable releases are actually built on top of snapshot code bases.

2020-05-06 Thread Xintong Song (Jira)
Xintong Song created FLINK-17551:


 Summary: Documentation of stable releases are actually built on 
top of snapshot code bases.
 Key: FLINK-17551
 URL: https://issues.apache.org/jira/browse/FLINK-17551
 Project: Flink
  Issue Type: Bug
  Components: Project Website
Affects Versions: 1.10.0
Reporter: Xintong Song


When browsing Flink's documentation on the project website, we can choose from 
both the latest snapshot version and the stable release versions. However, it 
seems the documentation of a stable release version is actually built on top of 
the snapshot version of the release branch.

E.g., currently the latest stable release is 1.10.0, but the documentation 
described as "Flink 1.10 (Latest stable release)" is actually built with 
1.10-SNAPSHOT. As a consequence, users might be confused when they use release 
1.10.0 but see documentation changes meant for 1.10.1.

[This 
comment|https://github.com/apache/flink/pull/11300#issuecomment-624776199] 
shows one of such confusions. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17551) Documentation of stable releases are actually built on top of snapshot code bases.

2020-05-07 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101511#comment-17101511
 ] 

Xintong Song commented on FLINK-17551:
--

[~chesnay], thanks for the explanation.

I see your point. Indeed, maintaining documentation for each released version 
is not trivial. The cherry-picks can be troublesome, given that documentation 
fixes are not always independent of production code changes.

I don't have any good idea to solve this, either. So I'll just close this 
ticket.

> Documentation of stable releases are actually built on top of snapshot code 
> bases.
> --
>
> Key: FLINK-17551
> URL: https://issues.apache.org/jira/browse/FLINK-17551
> Project: Flink
>  Issue Type: Bug
>  Components: Project Website
>Affects Versions: 1.10.0
>Reporter: Xintong Song
>Priority: Major
>
> When browsing Flink's documentation on the project website, we can choose 
> from both the latest snapshot version and the stable release versions. 
> However, it seems the documentation of stable release version is actually 
> built on top of the snapshot version of the release branch.
> E.g., currently the latest stable release is 1.10.0, but the documentation 
> described as "Flink 1.10 (Latest stable release)" is actually built with 
> 1.10-SNAPSHOT. As a consequence, users might be confused when they use 
> release 1.10.0 and some latest documentation changes meant for 1.10.1.
> [This 
> comment|https://github.com/apache/flink/pull/11300#issuecomment-624776199] 
> shows one of such confusions. 





[jira] [Closed] (FLINK-17551) Documentation of stable releases are actually built on top of snapshot code bases.

2020-05-07 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song closed FLINK-17551.

Resolution: Won't Fix

> Documentation of stable releases are actually built on top of snapshot code 
> bases.
> --
>
> Key: FLINK-17551
> URL: https://issues.apache.org/jira/browse/FLINK-17551
> Project: Flink
>  Issue Type: Bug
>  Components: Project Website
>Affects Versions: 1.10.0
>Reporter: Xintong Song
>Priority: Major
>
> When browsing Flink's documentation on the project website, we can choose 
> from both the latest snapshot version and the stable release versions. 
> However, it seems the documentation of stable release version is actually 
> built on top of the snapshot version of the release branch.
> E.g., currently the latest stable release is 1.10.0, but the documentation 
> described as "Flink 1.10 (Latest stable release)" is actually built with 
> 1.10-SNAPSHOT. As a consequence, users might be confused when they use 
> release 1.10.0 and some latest documentation changes meant for 1.10.1.
> [This 
> comment|https://github.com/apache/flink/pull/11300#issuecomment-624776199] 
> shows one of such confusions. 





[jira] [Commented] (FLINK-15154) Change Flink binding addresses in local mode

2020-05-07 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101562#comment-17101562
 ] 

Xintong Song commented on FLINK-15154:
--

Updates:

After FLINK-15911, there are still two ports that Flink binds to whose binding 
addresses are not configurable:
- Blob server
- Metrics query RPC service

I've opened a PR that tries to fix this, with the following changes:
- Make the blob server respect the configuration option 'jobmanager.bind-host' 
(introduced by FLINK-15911).
- Make the metrics query RPC service use an Akka local actor system in local 
execution mode, to avoid unnecessary port binding.
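For anyone trying to lock a local setup down to loopback in the meantime, the relevant options would go into `flink-conf.yaml` roughly as follows. This is a sketch only: exact availability depends on the Flink version, and listing `taskmanager.bind-host` here is my assumption based on FLINK-15911 rather than something confirmed in this thread:

```yaml
# Bind JobManager-side endpoints (and, with the PR above, the blob server) to loopback
jobmanager.bind-host: 127.0.0.1
# Bind TaskManager-side endpoints to loopback
taskmanager.bind-host: 127.0.0.1
# REST endpoint binding address (already configurable, as noted in the report)
rest.bind-address: 127.0.0.1
```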

> Change Flink binding addresses in local mode
> 
>
> Key: FLINK-15154
> URL: https://issues.apache.org/jira/browse/FLINK-15154
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.9.1
> Environment: ```
> $ uname -a
> Linux xxx 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64 
> GNU/Linux
> ```
>Reporter: Andrea Cardaci
>Assignee: Xintong Song
>Priority: Minor
>  Labels: pull-request-available, usability
>
> Flink (or some of its services) listens on three random TCP ports
> during the local[1] execution, e.g., 39951, 41009 and 42849.
> [1]: 
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/local_execution.html#local-environment
> The sockets listen on `0.0.0.0`, and since I need to run some
> long-running tests on an Internet-facing machine I was wondering how
> to make them listen on `localhost` instead, or if there is anything
> else I can do to improve security in this scenario.
> Here's what I tried (with little luck):
> ```
> Configuration config = new Configuration();
> config.setString("taskmanager.host", "127.0.0.1");
> config.setString("rest.bind-address", "127.0.0.1"); // OK
> config.setString("jobmanager.rpc.address", "127.0.0.1");
> StreamExecutionEnvironment env = 
> StreamExecutionEnvironment.createLocalEnvironment(StreamExecutionEnvironment.getDefaultLocalParallelism(),
>  config);
> ```
> Only the `rest.bind-address` configuration actually changes the
> binding address of one of those ports. Are there other parameters that
> I'm not aware of or this is not the right approach in local mode?





[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling

2020-05-11 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17104186#comment-17104186
 ] 

Xintong Song commented on FLINK-17560:
--

[~josson],
The slot is marked FREE by the same code you linked. The same method is called 
again when all the remaining tasks are canceled and removed.
Take a look at {{TaskSlotTable#removeTask}}. The logic in this method causes 
{{TaskSlotTable#freeSlot}} to be called again once all tasks are removed. From 
its call hierarchy you can see that this method is called when a task is 
finished/failed/canceled.
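The lifecycle described above can be modeled roughly as follows. This is a simplified illustrative sketch, not Flink's actual `TaskSlotTable` code; class and method names here are my own:

```python
class SlotTable:
    """Toy model of a task slot table with deferred slot release."""

    def __init__(self):
        self.slots = {}  # slot_id -> {"state": str, "tasks": set of task ids}

    def add_task(self, slot_id, task_id):
        slot = self.slots.setdefault(slot_id, {"state": "ALLOCATED", "tasks": set()})
        slot["tasks"].add(task_id)

    def free_slot(self, slot_id):
        # Mirrors the behavior discussed above: the slot is only marked FREE
        # once no tasks remain; otherwise it waits in a RELEASING state.
        slot = self.slots[slot_id]
        slot["state"] = "RELEASING" if slot["tasks"] else "FREE"

    def remove_task(self, slot_id, task_id):
        # Called when a task finishes / fails / is cancelled; re-attempts the
        # release once the slot is empty, like TaskSlotTable#removeTask.
        slot = self.slots[slot_id]
        slot["tasks"].discard(task_id)
        if slot["state"] == "RELEASING" and not slot["tasks"]:
            self.free_slot(slot_id)

table = SlotTable()
table.add_task("slot-1", "task-a")
table.add_task("slot-1", "task-b")
table.free_slot("slot-1")              # tasks still running -> stays RELEASING
table.remove_task("slot-1", "task-a")
table.remove_task("slot-1", "task-b")  # last task removed -> slot becomes FREE
print(table.slots["slot-1"]["state"])  # prints FREE
```

The point of the model: a slot that still reports CANCELLED tasks in the slot report is simply one where `remove_task` has not yet fired for every task.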

> No Slots available exception in Apache Flink Job Manager while Scheduling
> -
>
> Key: FLINK-17560
> URL: https://issues.apache.org/jira/browse/FLINK-17560
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.8.3
> Environment: Flink verson 1.8.3
> Session cluster
>Reporter: josson paul kalapparambath
>Priority: Major
>
> Set up
> --
> Flink verson 1.8.3
> Zookeeper HA cluster
> 1 ResourceManager/Dispatcher (Same Node)
> 1 TaskManager
> 4 pipelines running with various parallelism's
> Issue
> --
> Occasionally when the Job Manager gets restarted we noticed that all the 
> pipelines are not getting scheduled. The error that is reported by the Job 
> Manager is 'not enough slots are available'. This should not be the case 
> because the task manager was deployed with sufficient slots for the number of 
> pipelines/parallelism we have.
> We further noticed that the slot report sent by the taskmanager contains slots 
> filled with old CANCELLED job IDs. I am not sure why the task manager still 
> holds the details of the old jobs. A thread dump on the task manager confirms 
> that old pipelines are not running.
> I am aware of https://issues.apache.org/jira/browse/FLINK-12865. But this is 
> not the issue happening in this case.





[jira] [Commented] (FLINK-17493) Possible direct memory leak in cassandra sink

2020-05-13 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17106814#comment-17106814
 ] 

Xintong Song commented on FLINK-17493:
--

FYI, there's an improvement in the upcoming 1.11 release that might solve your 
problems.
https://issues.apache.org/jira/browse/FLINK-16408

With this feature, we use a separate class loader for executing user code 
(i.e., the code for one specific job), and the class loader's lifecycle ends 
when the job is terminated.

> Possible direct memory leak in cassandra sink
> -
>
> Key: FLINK-17493
> URL: https://issues.apache.org/jira/browse/FLINK-17493
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Cassandra
>Affects Versions: 1.9.3, 1.10.0
>Reporter: nobleyd
>Priority: Major
>
> # The Cassandra sink uses direct memory.
>  # Start a standalone cluster (1 machine) for testing.
>  # After the cluster has started, check the Flink web UI and record the task 
> manager's memory info, specifically the direct memory part.
>  # Start a job which reads from Kafka and writes to Cassandra using the 
> Cassandra sink, and you can see the direct memory count in the 'Outside JVM' 
> part go up.
>  # Stop the job; the direct memory count does not decrease (using 'jmap 
> -histo:live pid' to make the task manager GC).
>  # Repeat several times; the direct memory count keeps growing.





[jira] [Created] (FLINK-17677) FLINK_LOG_PREFIX recommended in docs is not always available

2020-05-13 Thread Xintong Song (Jira)
Xintong Song created FLINK-17677:


 Summary: FLINK_LOG_PREFIX recommended in docs is not always 
available
 Key: FLINK-17677
 URL: https://issues.apache.org/jira/browse/FLINK-17677
 Project: Flink
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.10.1, 1.9.3, 1.11.0
Reporter: Xintong Song
 Fix For: 1.11.0, 1.10.2, 1.9.4


The [Application Profiling & 
Debugging|https://ci.apache.org/projects/flink/flink-docs-master/monitoring/application_profiling.html]
 documentation recommends using the script variable {{FLINK_LOG_PREFIX}} for 
defining log file paths. However, this variable is only available in standalone 
mode. This is a bit misleading for users of other deployments (see this 
[thread|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-Memory-analyze-on-AWS-EMR-td35036.html]).

I propose to replace {{FLINK_LOG_PREFIX}} with a general representation 
{{}}, and to add a separate section discussing how to set the log 
path (e.g., use {{FLINK_LOG_PREFIX}} with standalone deployments and 
{{}} with Yarn deployments).





[jira] [Commented] (FLINK-17493) Possible direct memory leak in cassandra sink

2020-05-13 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17106809#comment-17106809
 ] 

Xintong Song commented on FLINK-17493:
--

Hi [~destynova],

I'm not entirely sure whether your problem is the same as the others'.

The {{NoHostAvailableException}} you encountered seems to be unrelated. The 
error stack does not suggest anything related to memory, and the exception 
has not been reported by [~nobleyd] or [~monika.h].

The observation that off-heap / direct memory grows every time the job is 
restarted might be caused by the same problem as the others.

From what you described, it does not feel like a memory leak problem to me. 
If there were indeed a memory leak, the memory footprint should keep growing as 
the job restarts, and eventually you should run into a metaspace / direct OOM. 
According to your description, the memory footprint increases initially but 
eventually becomes stable. This might be because the JVM does not release the 
memory until the limits are reached and a full GC is triggered. To verify this, 
you can try to manually trigger a full GC and see if the off-heap / direct 
memory footprint falls.

To create a heap dump, you have to log in to the TM host machine / container / 
pod and execute jmap / jcmd directly on the JVM process. 
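For reference, a sketch of the commands involved (`<tm-pid>` is a placeholder for the TaskManager's JVM process id; run them as the user owning that process):

```shell
# Trigger a full GC; 'jmap -histo:live' forces one as a side effect
jcmd <tm-pid> GC.run
jmap -histo:live <tm-pid> | head -n 30

# Capture a heap dump of live objects for offline analysis (e.g. in MAT)
jmap -dump:live,format=b,file=/tmp/taskmanager.hprof <tm-pid>
```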

> Possible direct memory leak in cassandra sink
> -
>
> Key: FLINK-17493
> URL: https://issues.apache.org/jira/browse/FLINK-17493
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Cassandra
>Affects Versions: 1.9.3, 1.10.0
>Reporter: nobleyd
>Priority: Major
>
> # The Cassandra sink uses direct memory.
>  # Start a standalone cluster (1 machine) for testing.
>  # After the cluster has started, check the Flink web UI and record the task 
> manager's memory info, specifically the direct memory part.
>  # Start a job which reads from Kafka and writes to Cassandra using the 
> Cassandra sink, and you can see the direct memory count in the 'Outside JVM' 
> part go up.
>  # Stop the job; the direct memory count does not decrease (using 'jmap 
> -histo:live pid' to make the task manager GC).
>  # Repeat several times; the direct memory count keeps growing.





[jira] [Updated] (FLINK-14106) Make SlotManager pluggable

2020-05-14 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song updated FLINK-14106:
-
Fix Version/s: (was: 1.11.0)
   1.12.0

> Make SlotManager pluggable
> --
>
> Key: FLINK-14106
> URL: https://issues.apache.org/jira/browse/FLINK-14106
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0
>Reporter: Xintong Song
>Assignee: Xintong Song
>Priority: Major
> Fix For: 1.12.0
>
>
> As we are enabling fine-grained resource management in 1.10, we can have 
> various resource scheduling strategies. Such strategies generally have to make 
> the following three decisions:
>  * When to launch new / release existing TMs? (How many TMs?)
>  * What and how many resources should TMs be started with?
>  * How to allocate between slot requests and TM resources?
> We may want to make the above decisions differently in different scenarios 
> (active/reactive mode, per-job/session mode, etc.). Therefore, we propose to 
> make the scheduling strategies pluggable.
> We propose the following changes:
>  * Make SlotManager an interface, and implement it differently for different 
> strategies.
>  * Modify the ResourceManager-SlotManager interfaces to cover all three 
> decisions mentioned above in SlotManager. In particular, SlotManager needs to 
> allocate TM resources instead of slot resources from ResourceActions.
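The three decisions in the quoted proposal could map onto a pluggable interface along these lines. This is an illustrative sketch only; the class and method names are hypothetical and not part of the actual proposal:

```python
from abc import ABC, abstractmethod

class SlotManagerStrategy(ABC):
    """Pluggable policy covering the three scheduling decisions above."""

    @abstractmethod
    def required_task_managers(self, pending_requests):
        """When to launch new / release existing TMs (how many)."""

    @abstractmethod
    def task_manager_spec(self):
        """What and how many resources TMs should be started with."""

    @abstractmethod
    def match(self, slot_request, available_tms):
        """How to allocate between slot requests and TM resources."""

class OneSlotPerRequestStrategy(SlotManagerStrategy):
    # Naive policy: one fixed-size, single-slot TM per pending request,
    # with first-fit matching of requests to available TMs.
    def required_task_managers(self, pending_requests):
        return len(pending_requests)

    def task_manager_spec(self):
        return {"cpu": 1.0, "memory_mib": 4096, "slots": 1}

    def match(self, slot_request, available_tms):
        return available_tms[0] if available_tms else None

strategy = OneSlotPerRequestStrategy()
print(strategy.required_task_managers(["req-1", "req-2"]))  # prints 2
```

An active-mode strategy would implement `required_task_managers` against the cluster framework's quota, while a reactive/standalone one could simply return 0 and rely on externally started TMs; the ResourceManager only talks to the interface.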





[jira] [Updated] (FLINK-14187) FLIP-56 Dynamic Slot Allocation

2020-05-14 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song updated FLINK-14187:
-
Fix Version/s: (was: 1.11.0)
   1.12.0

> FLIP-56 Dynamic Slot Allocation
> ---
>
> Key: FLINK-14187
> URL: https://issues.apache.org/jira/browse/FLINK-14187
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0
>Reporter: Xintong Song
>Assignee: Xintong Song
>Priority: Major
>  Labels: Umbrella
> Fix For: 1.12.0
>
>
> This is the umbrella issue for 'FLIP-56: Dynamic Slot Allocation'.




