[jira] [Closed] (FLINK-2629) Refactor YARN ApplicationMasterActor to use the async AM<-->RM client
[ https://issues.apache.org/jira/browse/FLINK-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maximilian Michels closed FLINK-2629. - Resolution: Duplicate > Refactor YARN ApplicationMasterActor to use the async AM<-->RM client > -- > > Key: FLINK-2629 > URL: https://issues.apache.org/jira/browse/FLINK-2629 > Project: Flink > Issue Type: Improvement > Components: YARN Client >Affects Versions: 0.10.0 >Reporter: Robert Metzger > > YARN also offers an async client for the AM <--> RM communication: > https://hadoop.apache.org/docs/r2.6.0/api/org/apache/hadoop/yarn/client/api/async/AMRMClientAsync.html > Currently, container failures can block the entire JobManager actor systems > for a few seconds because of blocking AM <---> RM calls. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2936] Fix ClassCastException for Event-...
Github user aljoscha commented on the pull request: https://github.com/apache/flink/pull/1448#issuecomment-163918357 @StephanEwen do we want to fix it like this now, or do we want to integrate it into the StreamRecordSerializer? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2936) ClassCastException when using EventTimeSourceFunction in non-EventTime program
[ https://issues.apache.org/jira/browse/FLINK-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052655#comment-15052655 ] ASF GitHub Bot commented on FLINK-2936: --- Github user aljoscha commented on the pull request: https://github.com/apache/flink/pull/1448#issuecomment-163918357 @StephanEwen do we want to fix it like this now, or do we want to integrate it into the StreamRecordSerializer? > ClassCastException when using EventTimeSourceFunction in non-EventTime program > -- > > Key: FLINK-2936 > URL: https://issues.apache.org/jira/browse/FLINK-2936 > Project: Flink > Issue Type: Improvement > Components: Streaming >Affects Versions: 0.10.0 >Reporter: Fabian Hueske >Assignee: Aljoscha Krettek > Fix For: 1.0.0 > > > Using an {{EventTimeSourceFunction}} in a DataStream programs that does not > operate with {{TimeCharacteristic.EventTime}} leads to a > {{ClassCastException}} when the first {{Watermark}} is emitted: > {code} > Caused by: java.lang.ClassCastException: > org.apache.flink.streaming.api.watermark.Watermark cannot be cast to > org.apache.flink.streaming.runtime.streamrecord.StreamRecord > {code} > This exception is not very helpful for users that simply for got to set the > correct {{TimeCharacteristic}} and should be improved. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2769) Web dashboard port not configurable on client side
[ https://issues.apache.org/jira/browse/FLINK-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052551#comment-15052551 ] ASF GitHub Bot commented on FLINK-2769: --- Github user uce commented on the pull request: https://github.com/apache/flink/pull/1449#issuecomment-163893598 Thanks for the reviews Robert and Sachin! I've addressed Robert's comment. If there are no objections, I would like to merge this. > Web dashboard port not configurable on client side > -- > > Key: FLINK-2769 > URL: https://issues.apache.org/jira/browse/FLINK-2769 > Project: Flink > Issue Type: Bug > Components: Webfrontend >Affects Versions: 0.10.0 >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi >Priority: Critical > Fix For: 0.10.0 > > > The new web dashboard port configuration only affects the server side, but > not the client side. > To reproduce set in {{flink-conf.yaml}}: > {code} > jobmanager.web.port: 9091 > jobmanager.new-web-frontend: true > {code} > Run > {code} > $ bin/start-cluster.sh > {code} > You can browse to {{http://localhost:9091}}, but you won't see any data from > the job manager. Requests to http://localhost:9091/overview etc. work as > expected, but the client side JavaScript is running requests against > {{http://localhost:8081}}. > The problem is that > {{flink-runtime-web/web-dashboard/app/scripts/index.coffee}} hard codes: > {code} > .value 'flinkConfig', { > jobServer: 'http://localhost:8081' > } > {code} which is picked up by all generated scripts. > This needs to be configurable for both standalone mode but especially for > recovery mode. This was a major pain point for me today, because I was > working on the dashboard and thought that my changes were wrong. > In general, we need to add more tests to the new dashboard, which should soon > replace the old one imo. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2769] [runtime-web] Add configurable jo...
Github user uce commented on the pull request: https://github.com/apache/flink/pull/1449#issuecomment-163893598 Thanks for the reviews Robert and Sachin! I've addressed Robert's comment. If there are no objections, I would like to merge this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-2769] [runtime-web] Add configurable jo...
Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/1449#issuecomment-163953942 I think its good to merge, +1 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2769) Web dashboard port not configurable on client side
[ https://issues.apache.org/jira/browse/FLINK-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052853#comment-15052853 ] ASF GitHub Bot commented on FLINK-2769: --- Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/1449#issuecomment-163953942 I think its good to merge, +1 > Web dashboard port not configurable on client side > -- > > Key: FLINK-2769 > URL: https://issues.apache.org/jira/browse/FLINK-2769 > Project: Flink > Issue Type: Bug > Components: Webfrontend >Affects Versions: 0.10.0 >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi >Priority: Critical > Fix For: 0.10.0 > > > The new web dashboard port configuration only affects the server side, but > not the client side. > To reproduce set in {{flink-conf.yaml}}: > {code} > jobmanager.web.port: 9091 > jobmanager.new-web-frontend: true > {code} > Run > {code} > $ bin/start-cluster.sh > {code} > You can browse to {{http://localhost:9091}}, but you won't see any data from > the job manager. Requests to http://localhost:9091/overview etc. work as > expected, but the client side JavaScript is running requests against > {{http://localhost:8081}}. > The problem is that > {{flink-runtime-web/web-dashboard/app/scripts/index.coffee}} hard codes: > {code} > .value 'flinkConfig', { > jobServer: 'http://localhost:8081' > } > {code} which is picked up by all generated scripts. > This needs to be configurable for both standalone mode but especially for > recovery mode. This was a major pain point for me today, because I was > working on the dashboard and thought that my changes were wrong. > In general, we need to add more tests to the new dashboard, which should soon > replace the old one imo. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2936) ClassCastException when using EventTimeSourceFunction in non-EventTime program
[ https://issues.apache.org/jira/browse/FLINK-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052782#comment-15052782 ] ASF GitHub Bot commented on FLINK-2936: --- Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/1448#issuecomment-163940693 Probably okay to do it like this. How hard is it to pass the "timestamps" flag into the constructor of the SourceOperator, rather than through the runtime context? > ClassCastException when using EventTimeSourceFunction in non-EventTime program > -- > > Key: FLINK-2936 > URL: https://issues.apache.org/jira/browse/FLINK-2936 > Project: Flink > Issue Type: Improvement > Components: Streaming >Affects Versions: 0.10.0 >Reporter: Fabian Hueske >Assignee: Aljoscha Krettek > Fix For: 1.0.0 > > > Using an {{EventTimeSourceFunction}} in a DataStream programs that does not > operate with {{TimeCharacteristic.EventTime}} leads to a > {{ClassCastException}} when the first {{Watermark}} is emitted: > {code} > Caused by: java.lang.ClassCastException: > org.apache.flink.streaming.api.watermark.Watermark cannot be cast to > org.apache.flink.streaming.runtime.streamrecord.StreamRecord > {code} > This exception is not very helpful for users that simply for got to set the > correct {{TimeCharacteristic}} and should be improved. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2936] Fix ClassCastException for Event-...
Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/1448#issuecomment-163940693 Probably okay to do it like this. How hard is it to pass the "timestamps" flag into the constructor of the SourceOperator, rather than through the runtime context? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-2936] Fix ClassCastException for Event-...
Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/1448 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-3121] Emit Final Watermark in Kafka Sou...
Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/1447 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-3134) Make YarnJobManager's allocate call asynchronous
[ https://issues.apache.org/jira/browse/FLINK-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052874#comment-15052874 ] ASF GitHub Bot commented on FLINK-3134: --- GitHub user mxm opened a pull request: https://github.com/apache/flink/pull/1450 [FLINK-3134][yarn] asynchronous YarnJobManager heartbeats - use AMRMClientAsync instead of AMRMClient - handle allocation and startup of containers in callbacks - remove YarnHeartbeat message The AMRMClientAsync uses one thread to communicate with the resource manager and an additional thread to execute the callbacks. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mxm/flink yart-heartbeat-async Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/1450.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1450 commit 8ec6d492795801ffc7c76f37a3a3200bbf85f81c Author: Maximilian MichelsDate: 2015-12-08T13:35:04Z [FLINK-3134][yarn] asynchronous YarnJobManager heartbeats - use AMRMClientAsync instead of AMRMClient - handle allocation and startup of containers in callbacks - remove YarnHeartbeat message The AMRMClientAsync uses one thread to communicate with the resource manager and an additional thread to execute the callbacks. > Make YarnJobManager's allocate call asynchronous > > > Key: FLINK-3134 > URL: https://issues.apache.org/jira/browse/FLINK-3134 > Project: Flink > Issue Type: Bug > Components: YARN Client >Affects Versions: 0.10.0, 1.0.0, 0.10.1 >Reporter: Maximilian Michels >Assignee: Maximilian Michels > Fix For: 1.0.0 > > > The {{allocate()}} call is used in the {{YarnJobManager}} to send a heartbeat > to the YARN resource manager. This call may block the JobManager actor system > for arbitrary time, e.g. if retry handlers are set up within the call to > allocate. > I propose to use the {{AMRMClientAsync}}'s callback methods to send > heartbeats and update the container information. The API is available for our > supported Hadoop versions (2.3.0 and above). > https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/yarn/client/api/async/AMRMClientAsync.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3121) Watermark forwarding does not work for sources not producing any data
[ https://issues.apache.org/jira/browse/FLINK-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052989#comment-15052989 ] ASF GitHub Bot commented on FLINK-3121: --- Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/1447 > Watermark forwarding does not work for sources not producing any data > - > > Key: FLINK-3121 > URL: https://issues.apache.org/jira/browse/FLINK-3121 > Project: Flink > Issue Type: Bug >Reporter: Robert Metzger >Assignee: Aljoscha Krettek >Priority: Blocker > > This mailing list discussion explains the issue in detail: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Custom-TimestampExtractor-and-FlinkKafkaConsumer082-td3488.html > As a workaround for now, the Kafka source can emit a final watermark for > sources not producing any data. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2936) ClassCastException when using EventTimeSourceFunction in non-EventTime program
[ https://issues.apache.org/jira/browse/FLINK-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052990#comment-15052990 ] ASF GitHub Bot commented on FLINK-2936: --- Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/1448 > ClassCastException when using EventTimeSourceFunction in non-EventTime program > -- > > Key: FLINK-2936 > URL: https://issues.apache.org/jira/browse/FLINK-2936 > Project: Flink > Issue Type: Improvement > Components: Streaming >Affects Versions: 0.10.0 >Reporter: Fabian Hueske >Assignee: Aljoscha Krettek > Fix For: 1.0.0 > > > Using an {{EventTimeSourceFunction}} in a DataStream programs that does not > operate with {{TimeCharacteristic.EventTime}} leads to a > {{ClassCastException}} when the first {{Watermark}} is emitted: > {code} > Caused by: java.lang.ClassCastException: > org.apache.flink.streaming.api.watermark.Watermark cannot be cast to > org.apache.flink.streaming.runtime.streamrecord.StreamRecord > {code} > This exception is not very helpful for users that simply for got to set the > correct {{TimeCharacteristic}} and should be improved. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-3160) Aggregate operator statistics by TaskManager
Greg Hogan created FLINK-3160: - Summary: Aggregate operator statistics by TaskManager Key: FLINK-3160 URL: https://issues.apache.org/jira/browse/FLINK-3160 Project: Flink Issue Type: Improvement Components: Webfrontend Affects Versions: 1.0.0 Reporter: Greg Hogan The web client job info page presents a table of the following per task statistics: start time, end time, duration, bytes received, records received, bytes sent, records sent, attempt, host, status. Flink supports clusters with thousands of slots and a job setting a high parallelism renders this job info page unwieldy and difficult to analyze in real-time. It would be helpful to optionally or automatically aggregate statistics by TaskManager. These rows could then be expanded to reveal the current per task statistics. Start time, end time, duration, and attempt are not applicable to a TaskManager since new tasks for repeated attempts may be started. Bytes received, records received, bytes sent, and records sent are summed. Any throughput metrics can be averaged over the total task time or time window. Status could reference the number of running tasks on the TaskManager or an idle state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-3121) Watermark forwarding does not work for sources not producing any data
[ https://issues.apache.org/jira/browse/FLINK-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aljoscha Krettek resolved FLINK-3121. - Resolution: Fixed Fix Version/s: 1.0.0 Fixed in https://github.com/apache/flink/commit/6bd5714d2a045e581b1a761830d010598f803de7 > Watermark forwarding does not work for sources not producing any data > - > > Key: FLINK-3121 > URL: https://issues.apache.org/jira/browse/FLINK-3121 > Project: Flink > Issue Type: Bug >Reporter: Robert Metzger >Assignee: Aljoscha Krettek >Priority: Blocker > Fix For: 1.0.0 > > > This mailing list discussion explains the issue in detail: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Custom-TimestampExtractor-and-FlinkKafkaConsumer082-td3488.html > As a workaround for now, the Kafka source can emit a final watermark for > sources not producing any data. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-2936) ClassCastException when using EventTimeSourceFunction in non-EventTime program
[ https://issues.apache.org/jira/browse/FLINK-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aljoscha Krettek resolved FLINK-2936. - Resolution: Fixed Fixed in https://github.com/apache/flink/commit/4b648870b4673c5a9c4d80f185e7de679967098e > ClassCastException when using EventTimeSourceFunction in non-EventTime program > -- > > Key: FLINK-2936 > URL: https://issues.apache.org/jira/browse/FLINK-2936 > Project: Flink > Issue Type: Improvement > Components: Streaming >Affects Versions: 0.10.0 >Reporter: Fabian Hueske >Assignee: Aljoscha Krettek > Fix For: 1.0.0 > > > Using an {{EventTimeSourceFunction}} in a DataStream programs that does not > operate with {{TimeCharacteristic.EventTime}} leads to a > {{ClassCastException}} when the first {{Watermark}} is emitted: > {code} > Caused by: java.lang.ClassCastException: > org.apache.flink.streaming.api.watermark.Watermark cannot be cast to > org.apache.flink.streaming.runtime.streamrecord.StreamRecord > {code} > This exception is not very helpful for users that simply for got to set the > correct {{TimeCharacteristic}} and should be improved. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-3161) Externalize cluster start-up and tear-down when available
Greg Hogan created FLINK-3161: - Summary: Externalize cluster start-up and tear-down when available Key: FLINK-3161 URL: https://issues.apache.org/jira/browse/FLINK-3161 Project: Flink Issue Type: Improvement Components: Start-Stop Scripts Affects Versions: 1.0.0 Reporter: Greg Hogan Assignee: Greg Hogan Priority: Minor I have been using pdsh, pdcp, and rpdcp to both distribute compiled Flink and to start and stop the TaskManagers. The current shell script initializes TaskManagers one-at-a-time. This is trivial to background but would be unthrottled. >From pdsh's archived homepage: "uses a sliding window of threads to execute >remote commands, conserving socket resources while allowing some connections >to timeout if needed". What other tools could be supported when available? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-3134][yarn] asynchronous YarnJobManager...
GitHub user mxm opened a pull request: https://github.com/apache/flink/pull/1450 [FLINK-3134][yarn] asynchronous YarnJobManager heartbeats - use AMRMClientAsync instead of AMRMClient - handle allocation and startup of containers in callbacks - remove YarnHeartbeat message The AMRMClientAsync uses one thread to communicate with the resource manager and an additional thread to execute the callbacks. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mxm/flink yart-heartbeat-async Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/1450.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1450 commit 8ec6d492795801ffc7c76f37a3a3200bbf85f81c Author: Maximilian MichelsDate: 2015-12-08T13:35:04Z [FLINK-3134][yarn] asynchronous YarnJobManager heartbeats - use AMRMClientAsync instead of AMRMClient - handle allocation and startup of containers in callbacks - remove YarnHeartbeat message The AMRMClientAsync uses one thread to communicate with the resource manager and an additional thread to execute the callbacks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-7) [GitHub] Enable Range Partitioner
[ https://issues.apache.org/jira/browse/FLINK-7?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052409#comment-15052409 ] ASF GitHub Bot commented on FLINK-7: Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/1255#issuecomment-163873769 Thanks for the update. Looks good. Will try it once more on a cluster. @uce, @StephanEwen , we inject a `DataExchangeMode.BATCH` for range partitioning (`RangePartitionRewriter`, line 168). IIRC, there are some implication wrt. to iterations. Will that work? > [GitHub] Enable Range Partitioner > - > > Key: FLINK-7 > URL: https://issues.apache.org/jira/browse/FLINK-7 > Project: Flink > Issue Type: Sub-task > Components: Distributed Runtime >Reporter: GitHub Import >Assignee: Chengxiang Li > Fix For: pre-apache > > > The range partitioner is currently disabled. We need to implement the > following aspects: > 1) Distribution information, if available, must be propagated back together > with the ordering property. > 2) A generic bucket lookup structure (currently specific to PactRecord). > Tests to re-enable after fixing this issue: > - TeraSortITCase > - GlobalSortingITCase > - GlobalSortingMixedOrderITCase > Imported from GitHub > Url: https://github.com/stratosphere/stratosphere/issues/7 > Created by: [StephanEwen|https://github.com/StephanEwen] > Labels: core, enhancement, optimizer, > Milestone: Release 0.4 > Assignee: [fhueske|https://github.com/fhueske] > Created at: Fri Apr 26 13:48:24 CEST 2013 > State: open -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-7] [Runtime] Enable Range Partitioner.
Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/1255#issuecomment-163873769 Thanks for the update. Looks good. Will try it once more on a cluster. @uce, @StephanEwen , we inject a `DataExchangeMode.BATCH` for range partitioning (`RangePartitionRewriter`, line 168). IIRC, there are some implication wrt. to iterations. Will that work? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Created] (FLINK-3162) Configure number of TaskManager slots as ratio of available processors
Greg Hogan created FLINK-3162: - Summary: Configure number of TaskManager slots as ratio of available processors Key: FLINK-3162 URL: https://issues.apache.org/jira/browse/FLINK-3162 Project: Flink Issue Type: Improvement Components: Distributed Runtime Affects Versions: 1.0.0 Reporter: Greg Hogan Priority: Minor The number of TaskManager slots is currently only configurable by explicitly setting {{taskmanager.numberOfTaskSlots}}. Make this configurable by a ratio of the number of available processors (for example, "2", for hyperthreading). This can work in the same way as {{taskmanager.memory.size}} and {{taskmanager.memory.fraction}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-3164) Spread out scheduling strategy
Greg Hogan created FLINK-3164: - Summary: Spread out scheduling strategy Key: FLINK-3164 URL: https://issues.apache.org/jira/browse/FLINK-3164 Project: Flink Issue Type: Improvement Components: Distributed Runtime, Java API, Scala API Affects Versions: 1.0.0 Reporter: Greg Hogan The size of a Flink cluster is bounded by the amount of memory allocated for network buffers. The all-to-all distribution of data during a network shuffle means that doubling the number of TaskManager slots quadruples the required number of network buffers. A Flink job can be configured to execute operators with lower parallelism which reduces the number of network buffers used across the cluster. Since the Flink scheduler clusters tasks the number of network buffers to be configured cannot be reduced. For example, if each TaskManager has 32 slots and the cluster has 32 TaskManagers the maximum parallelism can be set to 1024. If the preceding operator has a parallelism of 32 then the TaskManager fan-out is between 1*1024 (tasks evenly distributed) and 32*1024 (executed on a single TaskManager). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-3163) Configure Flink for NUMA systems
Greg Hogan created FLINK-3163: - Summary: Configure Flink for NUMA systems Key: FLINK-3163 URL: https://issues.apache.org/jira/browse/FLINK-3163 Project: Flink Issue Type: Improvement Components: Start-Stop Scripts Affects Versions: 1.0.0 Reporter: Greg Hogan Assignee: Greg Hogan On NUMA systems Flink can be pinned to a single physical processor ("node") using {{numactl --membind=$node --cpunodebind=$node }}. Commonly available NUMA systems include the largest AWS and Google Compute instances. For example, on an AWS c4.8xlarge system with 36 hyperthreads the user could configure a single TaskManager with 36 slots or have Flink create two TaskManagers bound to each of the NUMA nodes, each with 18 slots. There may be some extra overhead in transferring network buffers between TaskManagers on the same system, though the fraction of data shuffled in this manner decreases with the size of the cluster. The performance improvement from only accessing local memory looks to be significant though difficult to benchmark. The JobManagers may fit into NUMA nodes rather than requiring full systems. -- This message was sent by Atlassian JIRA (v6.3.4#6332)