Hi,
I'll try to answer your questions separately. First, a general remark:
although Flink has the DataSet API for batch processing and the DataStream
API for stream processing, there is only one underlying streaming execution
engine that is used for both. Now, regarding the questions:

1) What do you mean by "parallel into 2 streams"? That could influence my
answer, but I'll give a general one: Flink does not give any guarantees
about the ordering of elements in a DataStream or in a DataSet. Merging or
unioning two streams/data sets simply means that downstream operations see
all elements of both inputs, but the order in which they see them is
arbitrary. In particular, we don't keep any buffers based on time or size.
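
To make this concrete, here is a minimal sketch (untested, names made up)
of unioning two streams; the merged stream simply sees elements from both
inputs in whatever order they happen to arrive:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // two independent inputs; in your case these could be the two key-partitioned streams
    DataStream<String> first = env.fromElements("a-1", "a-2", "a-3");
    DataStream<String> second = env.fromElements("b-1", "b-2", "b-3");

    // union() just forwards elements from both inputs as they arrive;
    // there is no buffering and no ordering guarantee across the inputs
    first.union(second).print();

    env.execute("union example");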

2) The elements that flow through the topology are not tracked
individually; each operation just receives elements, updates its state and
sends elements to the downstream operations. In essence this means that
the elements themselves don't hold on to any resources, except if they
alter some state kept in an operation. If you have a stateless pipeline
that only has filters/maps/flatMaps, then the amount of required resources
is very low.
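
For example, a purely stateless pipeline would look roughly like this
(just a sketch; depending on your Flink/Java version you may need anonymous
function classes or .returns(...) hints instead of lambdas):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    env.fromElements(1, 2, 3, 4, 5)
        // stateless: each element is inspected, transformed and immediately forwarded
        .filter(x -> x % 2 == 0)
        .map(x -> x * 10)
        .print();

    env.execute("stateless pipeline");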

For a finite data set, elements are also streamed through the topology.
Only if you use operations that require grouping or sorting (such as
groupBy/reduce and join) will elements be buffered in memory or on disk
before they are processed.
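
A rough (untested) illustration of where that buffering comes in with the
DataSet API:

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    DataSet<Tuple2<String, Integer>> data = env.fromElements(
        new Tuple2<>("a", 1), new Tuple2<>("b", 2), new Tuple2<>("a", 3));

    // the filter is streamed element by element, nothing is buffered here
    DataSet<Tuple2<String, Integer>> filtered = data.filter(t -> t.f1 > 0);

    // groupBy/sum needs to see all elements of a group, so Flink will
    // buffer in memory (and spill to disk if necessary) before emitting results
    filtered.groupBy(0).sum(1).print();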

To answer your last question: if you only do stateless
transformations/filters, then you are fine using either API and the
performance should be similar.
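
For a transform-only job like the one you describe, either API boils down
to something like this (DataSet variant; just a sketch, the paths are made
up):

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // hypothetical paths, replace with your own input/output
    DataSet<String> input = env.readTextFile("hdfs:///path/to/input");

    input
        .map(line -> line.toUpperCase())      // your per-record transformation
        .writeAsText("hdfs:///path/to/output");

    env.execute("transform-only job");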

Cheers,
Aljoscha

On Sun, 24 Apr 2016 at 15:54 Konstantin Kulagin <kkula...@gmail.com> wrote:

> Hi guys,
>
> I have a somewhat general question, in order to get more understanding of
> stream vs finite data transformation. More specifically, I am trying to
> understand the lifecycle of 'entities' during processing.
>
> 1) For example, in the case of streams: suppose we start with some key-value
> source and parallel it into 2 streams by key. Each stream modifies the entry's
> values, let's say adds some fields. And we want to merge it back later. How
> does that happen?
> Will the merging point keep some finite buffer of entries? Based on time
> or size?
>
> I understand that probably the right solution in this case would be to have
> one stream and achieve more performance by increasing parallelism, but
> what if I have 2 sources from the beginning?
>
>
> 2) Also I assume that in the case of streaming each entry is considered
> 'processed' once it passes the whole chain and is emitted into some sink,
> so after that it will not consume resources. Basically similar to what
> Storm is doing.
> But in the case of finite data (data sets): how much data will the system
> keep in memory? The whole set?
>
> I probably have an example of a dataset vs stream 'mix': I need to
> *transform* a big but finite chunk of data. I don't really need to do any
> 'joins', grouping or something like that, so I never need to store the
> whole dataset in memory/storage. What would my choice be in this case?
>
> Thanks!
> Konstantin
>
>
>
