Re: Correlation between data streams/operators and threads

Shailesh Jain Thu, 09 Nov 2017 10:16:53 -0800

On 1. - is it tied specifically to the number of source operators or to the
number of Datastream objects created. I mean does the answer change if I
read all the data from a single Kafka topic, get a Datastream of all
events, and the apply N filters to create N individual streams?


On 3. - the problem with partitions is that watermarks cannot be different
per partition, and since in this use case, each stream is from a device,
the latency could be different (but order will be correct almost always)
and there are high chances of loosing out on events on operators like
Patterns which work with windows. Any ideas for workarounds here?


Thanks,
Shailesh

On 09-Nov-2017 8:48 PM, "Piotr Nowojski" <pi...@data-artisans.com> wrote:

Hi,

1.
https://ci.apache.org/projects/flink/flink-docs-
release-1.3/dev/parallel.html

Number of threads executing would be roughly speaking equal to of the
number of input data streams multiplied by the parallelism.

2.
Yes, you could dynamically create more data streams at the job startup.

3.
Running 10000 independent data streams on a small cluster (couple of nodes)
will definitely be an issue, since even with parallelism set to 1, there
would be quite a lot of unnecessary threads.

It would be much better to treat your data as a single data input stream
with multiple partitions. You could assign partitions between source
instances based on parallelism. For example with parallelism 6:
- source 0 could get partitions 0, 6, 12, 18
- source 1, could get partitions 1, 7, …
…
- source 5, could get partitions 5, 11, ...

Piotrek

On 9 Nov 2017, at 10:18, Shailesh Jain <shailesh.j...@stellapps.com> wrote:

Hi,

I'm trying to understand the runtime aspect of Flink when dealing with
multiple data streams and multiple operators per data stream.

Use case: N data streams in a single flink job (each data stream
representing 1 device - with different time latencies), and each of these
data streams gets split into two streams, of which one goes into a bunch of
CEP operators, and one into a process function.

Questions:
1. At runtime, will the engine create one thread per data stream? Or one
thread per operator?
2. Is it possible to dynamically create a data stream at runtime when the
job starts? (i.e. if N is read from a file when the job starts and
corresponding N streams need to be created)
3. Are there any specific performance impacts when a large number of
streams (N ~ 10000) are created, as opposed to N partitions within a single
stream?

Are there any internal (design) documents which can help understanding the
implementation details? Any references to the source will also be really
helpful.

Thanks in advance.

Shailesh

Re: Correlation between data streams/operators and threads

Reply via email to