[jira] [Commented] (FLINK-10320) Introduce JobMaster schedule micro-benchmark

2021-02-12 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283914#comment-17283914
 ] 

Zhu Zhu commented on FLINK-10320:
-

I think it is a duplicate and we can close it, since FLINK-20612 is done. 
Regarding the benchmarks for RPC requests, I prefer to open a separate ticket 
for that when needed.
cc [~tison]

> Introduce JobMaster schedule micro-benchmark
> 
>
> Key: FLINK-10320
> URL: https://issues.apache.org/jira/browse/FLINK-10320
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination, Tests
>Reporter: Zili Chen
>Priority: Major
>
> Based on the {{org.apache.flink.streaming.runtime.io.benchmark}} package and the 
> repo [flink-benchmarks|https://github.com/dataArtisans/flink-benchmarks], I 
> propose to introduce another micro-benchmark which focuses on {{JobMaster}} 
> scheduling performance.
> h3. Target
> Benchmark how long it takes from {{JobMaster}} startup (receiving the 
> {{JobGraph}} and initializing) until all tasks are RUNNING. Technically we use 
> a bounded stream and the TM finishes tasks as soon as they arrive, so the 
> interval we actually measure ends when all tasks are FINISHED.
> h3. Case
> 1. JobGraph that covers EAGER + PIPELINED edges
> 2. JobGraph that covers LAZY_FROM_SOURCES + PIPELINED edges
> 3. JobGraph that covers LAZY_FROM_SOURCES + BLOCKING edges
> ps: maybe also benchmark the case where the source reads from an {{InputSplit}}?
> h3. Implement
> Based on the flink-benchmarks repo, we ultimately run the benchmark using JMH, 
> so the whole test suite is split across two repos. The testing environment 
> could be located in the main repo, maybe under 
> flink-runtime/src/test/java/org/apache/flink/runtime/jobmaster/benchmark.
> To measure the performance of {{JobMaster}} scheduling, we need to simulate 
> an environment that:
> 1. has a real {{JobMaster}}
> 2. has a mock/testing {{ResourceManager}} that has infinite resources and 
> reacts immediately.
> 3. has a (or many?) mock/testing {{TaskExecutor}} that deploys and finishes 
> tasks immediately.
> [~trohrm...@apache.org] [~GJL] [~pnowojski] could you please review this 
> proposal to help clarify the goal and concrete details? Thanks in advance.
> Any suggestions are welcome.
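The measured interval described in the Target section (from receiving the JobGraph to all tasks FINISHED, with task execution itself being a no-op) can be sketched in plain Java. This is a hypothetical illustration, not Flink code; the class and method names are invented for the sketch.

```java
import java.util.concurrent.CompletableFuture;
import java.util.stream.IntStream;

/**
 * Hypothetical sketch of the measured interval: from "JobGraph received"
 * to "all tasks FINISHED". Because each simulated task completes
 * immediately, only the scheduling/bookkeeping overhead remains.
 */
public class ScheduleBenchmarkSketch {

    /** Stand-in for a TaskExecutor that finishes every task immediately. */
    static CompletableFuture<Void> deployAndFinish(int taskIndex) {
        // A real TM would deploy and run the task; here it completes right
        // away, mirroring "TM finishes tasks as soon as they arrive".
        return CompletableFuture.completedFuture(null);
    }

    public static void main(String[] args) {
        int parallelism = 10_000;
        long start = System.nanoTime();

        CompletableFuture<?>[] tasks = IntStream.range(0, parallelism)
                .mapToObj(ScheduleBenchmarkSketch::deployAndFinish)
                .toArray(CompletableFuture[]::new);
        CompletableFuture.allOf(tasks).join(); // "all tasks FINISHED"

        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("scheduled " + parallelism
                + " no-op tasks in " + millis + " ms");
    }
}
```

In the real proposal, JMH would drive this interval as the benchmark body instead of a hand-timed `main`.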



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-10320) Introduce JobMaster schedule micro-benchmark

2021-02-12 Thread Zhilong Hong (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283767#comment-17283767
 ] 

Zhilong Hong commented on FLINK-10320:
--

Thanks for the reminder, [~chesnay] and [~pnowojski].

I think the current scheduler benchmark has covered goal #1 (i.e. "How fast 
the job switches to RUNNING, or in other words, could we start the job faster"). 
It involves several procedures with high computational complexity. In fact, 
after the optimization we made in FLINK-21110, the main bottleneck in deploying 
a job lies in computing pipelined regions. We will try to optimize that in the 
future.

For goal #2 (i.e. "the throughput at which the JobMaster handles RPC requests"), 
we are still thinking about it. The first concern that comes to mind is that the 
RPC we mock locally differs from the real situation. First, we cannot simulate 
network connection latency. I think this may greatly impact RPC performance if 
the communication reaches the maximum bandwidth (in the worst scenario). Second, 
the threading model is totally different. Currently the future executor in the 
JobMaster has a thread pool that uses all the CPU cores on the machine. If we 
start threads to simulate TaskExecutors on the same machine, the mocked TMs may 
impact the JobMaster's performance. For example, {{Execution#submitTask}} runs 
on the future executor, while {{TaskExecutor#submitTask}} runs on the main 
thread of the TaskExecutor.



[jira] [Commented] (FLINK-10320) Introduce JobMaster schedule micro-benchmark

2021-02-12 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283648#comment-17283648
 ] 

Piotr Nowojski commented on FLINK-10320:


CC [~Thesharing] [~zhuzh]?



[jira] [Commented] (FLINK-10320) Introduce JobMaster schedule micro-benchmark

2021-02-12 Thread Chesnay Schepler (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283593#comment-17283593
 ] 

Chesnay Schepler commented on FLINK-10320:
--

Is this a duplicate of FLINK-20612?



[jira] [Commented] (FLINK-10320) Introduce JobMaster schedule micro-benchmark

2018-09-21 Thread Piotr Nowojski (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623481#comment-16623481
 ] 

Piotr Nowojski commented on FLINK-10320:


[~till.rohrmann] might have a good point. [~Tison], could you provide profiler 
logs for both the JobManager and the TaskManager during this 10,000-parallelism 
scheduling issue? Maybe we could even narrow down the problematic component and 
write benchmarks targeted specifically at it instead of at the whole JobManager?




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10320) Introduce JobMaster schedule micro-benchmark

2018-09-20 Thread Till Rohrmann (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16622747#comment-16622747
 ] 

Till Rohrmann commented on FLINK-10320:
---

I'm also not 100 percent sure that a scheduling benchmark will help us a lot. 
Most problems we have seen with scheduling of large-scale topologies were the 
serialization overhead of the deployment messages and the state handles (e.g. 
when using union state). This would not be tested when using mock 
{{TaskExecutorGateways}}. Thus, my concern is that, without knowing exactly 
where the bottleneck is, we might develop a benchmark for the wrong part of 
Flink.

On the other hand, more test coverage and benchmarks against performance 
regressions are in general a good idea (given that the maintenance effort is 
manageable).



[jira] [Commented] (FLINK-10320) Introduce JobMaster schedule micro-benchmark

2018-09-20 Thread Piotr Nowojski (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16622088#comment-16622088
 ] 

Piotr Nowojski commented on FLINK-10320:


Ok, thank you for providing the motivation :)

You could try to set up the benchmark as you described (with some dummy 
{{TaskManagers}}). Another possible way to provide such benchmarks would be to 
just spam a JobMaster with fake, pre-made RPC calls, without setting up any 
TaskManager. I'm not familiar enough with the JobMaster code to predict which 
approach would be easier/better. Regardless of the approach, keep the following 
things in mind while developing such a benchmark:
 * The benchmark should actually stress the thing that you want to test. Use a 
code profiler to make sure that 90+% of CPU time is spent in relevant 
JobMaster code and not in TaskManager/ResourceManager mocks or other irrelevant 
components.
 * Use/re-use production code as much as possible. If you are forced to write 
thousands of lines of code to set up a benchmark, something is wrong (usually 
it means the code you are trying to benchmark is untestable as well) and it 
will be difficult to maintain such a benchmark.
 * Can't we re-use the code from unit/integration tests? Even if there is no 
TestingTaskExecutor, I'm guessing there must be other places that test the 
thing you want to benchmark. That's often at least a good place to start: 
convert existing unit/integration tests into benchmarks.
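The warmup-then-measure discipline implied above (profile, make sure the hot path dominates, let the JIT settle before timing) is what JMH automates. As a purely illustrative sketch of that pattern, with invented names and a dummy workload standing in for the JobMaster scheduling path:

```java
/**
 * Hand-rolled warmup/measure loop illustrating the pattern JMH automates.
 * Hypothetical sketch only; the real benchmark should use JMH itself,
 * which also defeats dead-code elimination and handles forking.
 */
public class MiniHarness {

    /**
     * Runs warmup iterations (discarded) followed by measured iterations,
     * returning the average time per call in nanoseconds.
     */
    static double measure(Runnable benchmark, int warmup, int iterations) {
        for (int i = 0; i < warmup; i++) {
            benchmark.run(); // let the JIT compile the hot path first
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            benchmark.run();
        }
        return (System.nanoTime() - start) / (double) iterations;
    }

    public static void main(String[] args) {
        // Dummy workload; in the real benchmark this would be the
        // JobMaster scheduling path, not the mocks around it.
        double nsPerOp = measure(() -> {
            long acc = 0;
            for (int i = 0; i < 1_000; i++) acc += i;
            if (acc < 0) throw new IllegalStateException("unreachable");
        }, 1_000, 10_000);
        System.out.println("avg ns/op: " + nsPerOp);
    }
}
```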



[jira] [Commented] (FLINK-10320) Introduce JobMaster schedule micro-benchmark

2018-09-19 Thread JIRA


[ 
https://issues.apache.org/jira/browse/FLINK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16620511#comment-16620511
 ] 

陈梓立 commented on FLINK-10320:
-

We have cases where our customers set the parallelism to over 10,000 because of 
the amount of data to process. In such cases, the {{JobMaster}} can even become 
unavailable because it is busy handling RPC requests or doing GC. This is the 
original motivation to introduce a benchmark aimed at preventing regressions in 
the scheduling module.



[jira] [Commented] (FLINK-10320) Introduce JobMaster schedule micro-benchmark

2018-09-14 Thread Piotr Nowojski (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614412#comment-16614412
 ] 

Piotr Nowojski commented on FLINK-10320:


I meant that for streaming use cases, or for batch programs that need more than 
a couple of minutes to complete, scheduling time shouldn't matter. That's why I 
was asking whether you have observed performance problems in task scheduling. 
If not, then there is also no need for benchmarks. I'm raising this point 
because it seems to me this would be a difficult benchmark to write properly.



[jira] [Commented] (FLINK-10320) Introduce JobMaster schedule micro-benchmark

2018-09-13 Thread JIRA


[ 
https://issues.apache.org/jira/browse/FLINK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613148#comment-16613148
 ] 

陈梓立 commented on FLINK-10320:
-

On reconsideration, when it comes to scheduling there are two main topics we 
are concerned with.
 # Correctness: the {{JobMaster}} is able to schedule the {{ExecutionGraph}}, 
switch it to RUNNING and optionally FINISHED, while tolerating failures. This 
is out of scope for this thread.
 # Performance: the {{JobMaster}}'s ability to react to RPC requests during 
resource requests, task deployment, and execution graph state maintenance.

For the performance part, there are two targets we are interested in.
 # How fast the job switches to RUNNING, or in other words, could we start the 
job faster (FLINK-10038).
 # The throughput at which the {{JobMaster}} handles RPC requests.

The latter target would be another thread, discussing how to monitor/measure 
the RPC service; here we focus on the former.

By offering slots as soon as slot requests arrive and finishing tasks 
immediately, we eliminate the influence of time spent in components other than 
the JM, so we measure for a given parallelism and {{JobGraph}}. NOTE THAT with 
the draft above I am not aiming to provide a score showing that a change 
improves scheduling performance, but to provide a regression sentinel that 
alerts if a change causes a scheduling regression.

I'd like to explore and deliver a credible SCHEDULE benchmark. [~pnowojski], I 
notice you said "Is it time critical thing? I guess that at least in most 
cases/scenarios/setups no." Could you share what you consider critical about 
SCHEDULE? FYI, apart from correctness issues, we have seen the {{JobMaster}} 
become unavailable because RPC requests crashed it.

Looking forward to your reply :)
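The "regression sentinel" idea mentioned above (alert on a relative slowdown rather than certify an absolute score) could be as simple as a tolerance check against a recorded baseline. A hypothetical sketch; the class name, the tolerance value, and the baseline source are all invented for illustration:

```java
/**
 * Hypothetical regression sentinel: flags a benchmark run whose time is
 * worse than a recorded baseline by more than a relative tolerance,
 * instead of trying to certify absolute scheduling performance.
 */
public class RegressionSentinel {

    /** Returns true if currentMillis exceeds the baseline by more than tolerance. */
    static boolean isRegression(double baselineMillis, double currentMillis,
                                double tolerance) {
        // e.g. tolerance = 0.10 allows up to a 10% slowdown before alerting
        return currentMillis > baselineMillis * (1.0 + tolerance);
    }

    public static void main(String[] args) {
        // 5% slower: within the 10% tolerance, no alert
        System.out.println(isRegression(100.0, 105.0, 0.10)); // false
        // 20% slower: exceeds the tolerance, alert
        System.out.println(isRegression(100.0, 120.0, 0.10)); // true
    }
}
```

In practice the baseline would come from the previous CI run's benchmark results; a small tolerance absorbs run-to-run noise.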



[jira] [Commented] (FLINK-10320) Introduce JobMaster schedule micro-benchmark

2018-09-12 Thread JIRA


[ 
https://issues.apache.org/jira/browse/FLINK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612408#comment-16612408
 ] 

陈梓立 commented on FLINK-10320:
-

[~pnowojski] thanks for your quick reply! Seems my response is a bit late 
(laugh).

The most important benchmark of the {{JobMaster}} while scheduling, IMO, is how 
fast it reacts to RPC requests, which include slot offering, task deployment, 
task state updates, fault tolerance, and so on.

But measuring them separately seems unreasonable since they rely on each other, 
so we take the whole scheduling process as one benchmark. And you're right that 
its wall-clock time might not be the most expressive metric. Knowing little 
about JMH, I see that the existing benchmarks always measure the latency or ops 
of a given benchmark function; ops looks like how many times the whole function 
executes within the given time. I'd appreciate it if the benchmark framework 
provides other, more precise metrics.

For the implementation part, during an early attempt I also found that these 
mock {{TaskExecutor}}s cannot be implemented that directly. Rather than just 
providing a {{TaskExecutorGateway}}, we definitely have to simulate finishing 
the task, so a {{TaskExecutor}} is required. There is no {{TestingTaskExecutor}} 
yet, and to control how tasks finish we would have to override some methods of 
a real {{TaskExecutor}}, and also simulate callback actions like heartbeats or 
task actions. It would be reasonable to put the TM/action threads into a thread 
pool (different from the JM's) so that we don't crash the local machine.
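The shape of such a mock, with its own bounded thread pool so the simulated TMs do not steal CPU from the JobMaster under measurement, could look roughly like the following. This is a hypothetical sketch: the interface and names are invented and are not Flink's actual {{TaskExecutorGateway}}.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Hypothetical mock of a TaskExecutor-style gateway: it acknowledges task
 * submission and reports FINISHED from a dedicated, bounded thread pool,
 * kept separate from the (simulated) JobMaster's executor so the mock
 * does not compete for CPU with the code under measurement.
 */
public class ImmediatelyFinishingTaskExecutor {

    private final ExecutorService tmPool;

    ImmediatelyFinishingTaskExecutor(int threads) {
        // Dedicated pool for "TM-side" work, distinct from the JM's pool
        this.tmPool = Executors.newFixedThreadPool(threads);
    }

    /** Accepts a "task" and reports it as FINISHED right away. */
    CompletableFuture<String> submitTask(String taskId) {
        return CompletableFuture.supplyAsync(() -> taskId + ":FINISHED", tmPool);
    }

    void shutdown() {
        tmPool.shutdown();
    }

    public static void main(String[] args) {
        ImmediatelyFinishingTaskExecutor tm = new ImmediatelyFinishingTaskExecutor(2);
        System.out.println(tm.submitTask("task-1").join());
        tm.shutdown();
    }
}
```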



[jira] [Commented] (FLINK-10320) Introduce JobMaster schedule micro-benchmark

2018-09-12 Thread Piotr Nowojski (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611879#comment-16611879
 ] 

Piotr Nowojski commented on FLINK-10320:


[~Tison] thanks for the proposal. Is scheduling time a problem? Is it a 
time-critical thing? I guess that at least in most cases/scenarios/setups, no. 
Did we have some performance regression there? I'm asking because more 
benchmarks mean more code to maintain and more time to execute them.

My main concern regarding the implementation is those mock/testing 
{{TaskExecutor}}s. It would be best to either reuse some existing testing code 
or just set up real ones, but if we want to benchmark scheduling for setups 
with hundreds of TMs, they have to be super quick/efficient; otherwise we would 
overload the machine executing the benchmark.
