Hi Taher,

Stream processing and batch processing are very different. By its nature,
batch processing operates on bounded data sets, for example memory-based
sorting and joins, so it has to wait for the relevant data to arrive before
computing. This does not mean the data is concentrated on a single node;
the computation is still distributed. Flink applies dedicated optimizations
to the execution plans of batch jobs. For storing large data sets it uses a
custom managed-memory mechanism (you can gain extra memory by enabling
off-heap memory). Of course, the data is still kept in memory as far as
possible; it spills to disk when memory runs short.
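For reference, a minimal sketch of the relevant settings in flink-conf.yaml
(the exact key names and defaults vary by Flink version; the values below are
illustrative assumptions, not recommendations):

```yaml
# Fraction of free memory reserved for Flink's managed memory pool,
# which batch operators use for sorting, hashing, and joins.
taskmanager.memory.fraction: 0.7

# Allocate managed memory off-heap instead of on the JVM heap.
taskmanager.memory.off-heap: true

# Whether to pre-allocate the managed memory segments at startup.
taskmanager.memory.preallocate: false
```

With settings like these, batch operators request memory segments from the
managed pool and transparently spill intermediate results to disk when the
pool is exhausted.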

Regarding fault tolerance, the current checkpoint mechanism applies only to
stream processing. Batch jobs achieve fault tolerance by replaying the
complete data set and re-executing. If a TaskManager fails, Flink removes
it from the cluster and the tasks running on it fail, but the consequence
of a task failure differs between stream and batch jobs: for stream
processing it triggers a restart of the entire job, whereas for batch
processing it may trigger only a partial restart.
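The restart behaviour above is governed by the configured restart strategy;
a minimal sketch in flink-conf.yaml (key names and defaults depend on the
Flink version; the attempt count and delay below are illustrative
assumptions):

```yaml
# Retry a failed job a fixed number of times, waiting between attempts.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```

For a DataSet job there is no checkpointed state to restore, so a restart
simply re-reads the input and re-executes the affected tasks.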

Thanks, vino.

Taher Koitawala <taher.koitaw...@gslab.com> wrote on Wed, Sep 12, 2018 at 1:50 AM:

> Furthermore, how does Flink deal with TaskManagers dying when it is using
> the DataSet API? Is checkpointing done on a DataSet too, or does the whole
> dataset have to be re-read?
>
> Regards,
> Taher Koitawala
> GS Lab Pune
> +91 8407979163
>
> On Tue, Sep 11, 2018 at 11:18 PM, Taher Koitawala <
> taher.koitaw...@gslab.com> wrote:
>
>> Hi All,
>>          Just like Spark does Flink read a dataset and keep it in memory
>> and keep applying transformations? Or all records read by Flink async
>> parallel reads? Furthermore, how does Flink deal with
>>
>> Regards,
>> Taher Koitawala
>> GS Lab Pune
>> +91 8407979163
>>
>
>
