Hi,

1. 
https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/parallel.html

Roughly speaking, the number of executing threads will be equal to the number of 
input data streams multiplied by the parallelism.
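
For example, 10 independent input streams with parallelism 2 would give on the 
order of 20 source threads, plus threads for any downstream operators that are 
not chained to the sources.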

2. 
Yes, you could dynamically create more data streams at job startup, before the 
job graph is submitted.
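
A minimal sketch of what I mean, assuming N is read from a file while the job 
graph is being built (the file name, DeviceSource and Event type are 
hypothetical placeholders for your own code):

import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// N is only known at startup; the job graph is fixed once execute() is
// called, so "dynamic" here means at graph construction time, not mid-flight.
int n = Integer.parseInt(Files.readAllLines(Paths.get("devices.txt")).get(0).trim());

for (int i = 0; i < n; i++) {
    // DeviceSource and Event stand in for your own source and record type
    DataStream<Event> stream = env.addSource(new DeviceSource(i));
    stream.print();
}

env.execute("per-device-streams");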

3.
Running 10000 independent data streams on a small cluster (a couple of nodes) 
will definitely be an issue: even with parallelism set to 1, each stream brings 
its own operator instances, so there would be quite a lot of unnecessary 
threads.

It would be much better to treat your data as a single input stream with 
multiple partitions. You could assign partitions to source instances based on 
the parallelism (see the sketch after the list below). For example, with 
parallelism 6:
- source 0 could get partitions 0, 6, 12, 18
- source 1 could get partitions 1, 7, …
…
- source 5 could get partitions 5, 11, …
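
A minimal sketch of that round-robin assignment, assuming a 
RichParallelSourceFunction (the Event type, numPartitions and the actual 
reading logic are placeholders):

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext;

public class PartitionedSource extends RichParallelSourceFunction<Event> {

    private final int numPartitions;
    private volatile boolean running = true;

    public PartitionedSource(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    @Override
    public void run(SourceContext<Event> ctx) throws Exception {
        // Each parallel instance picks every parallelism-th partition,
        // starting at its own subtask index.
        int subtask = getRuntimeContext().getIndexOfThisSubtask();
        int parallelism = getRuntimeContext().getNumberOfParallelSubtasks();

        List<Integer> myPartitions = new ArrayList<>();
        for (int p = subtask; p < numPartitions; p += parallelism) {
            myPartitions.add(p); // e.g. subtask 0 of 6 gets 0, 6, 12, 18, ...
        }

        while (running) {
            for (int partition : myPartitions) {
                // poll 'partition' and emit records via ctx.collect(...)
                // (actual reading logic omitted in this sketch)
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}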

Piotrek

> On 9 Nov 2017, at 10:18, Shailesh Jain <shailesh.j...@stellapps.com> wrote:
> 
> Hi,
> 
> I'm trying to understand the runtime aspect of Flink when dealing with 
> multiple data streams and multiple operators per data stream.
> 
> Use case: N data streams in a single flink job (each data stream representing 
> 1 device - with different time latencies), and each of these data streams 
> gets split into two streams, of which one goes into a bunch of CEP operators, 
> and one into a process function.
> 
> Questions:
> 1. At runtime, will the engine create one thread per data stream? Or one 
> thread per operator?
> 2. Is it possible to dynamically create a data stream at runtime when the job 
> starts? (i.e. if N is read from a file when the job starts and corresponding 
> N streams need to be created)
> 3. Are there any specific performance impacts when a large number of streams 
> (N ~ 10000) are created, as opposed to N partitions within a single stream?
> 
> Are there any internal (design) documents which can help understanding the 
> implementation details? Any references to the source will also be really 
> helpful.
> 
> Thanks in advance.
> 
> Shailesh
> 
> 
