Hi, Flink's operators are designed to work in memory for as long as possible and spill to disk once the memory budget is exceeded. Moreover, Flink aims to run programs in a pipelined fashion, so that multiple operators can process data at the same time. This behavior can make it tricky to analyze the runtime behavior and progress of operators.
It would be interesting to have a look at the execution plan for the program that you are running. The plan can be obtained from the ExecutionEnvironment by calling env.getExecutionPlan() instead of env.execute(). I would also like to know how you track the progress of the program. Are you looking at the record counts displayed in the WebUI?

Best, Fabian

2017-12-05 22:03 GMT+01:00 Garrett Barton <garrett.bar...@gmail.com>:
> I have been moving some old MR and Hive workflows into Flink because I'm
> enjoying the APIs, and the ease of development is wonderful. Things have
> largely worked great until I tried to really scale some of the jobs
> recently.
>
> I have, for example, one ETL job that reads in about 12B records at a time
> and does a sort, some simple transformations, validation, a re-partition,
> and then output to a Hive table.
> When I built it with the sample set, ~200M, it worked great; it took maybe
> a minute and blew through it.
>
> What I have observed is that there is some kind of saturation reached
> depending on the number of slots, the number of nodes, and the overall size
> of data to move. When I run the 12B set, the first 1B go through in under
> 1 minute, really really fast. But there is an extremely sharp drop-off
> after that; the next 1B might take 15 minutes, and if I wait for the next
> 1B, it's well over an hour.
>
> What I can't find is any obvious indicator or thing to look at;
> everything just grinds to a halt. I don't think the job would ever
> actually complete.
>
> Is there something in the design of Flink in batch mode that is perhaps
> memory bound? Adding more nodes/tasks does not fix it, just gets me a
> little further along. I'm already running around ~1,400 slots at this
> point. I'd postulate needing 10,000+ to potentially make the job run, but
> that's too much of my cluster gone, and I have yet to get Flink to be
> stable past 1,500.
>
> Any ideas on where to look, or what to debug?
> The GUI is also very cumbersome to use at this slot count, so other
> measurement ideas are welcome too!
>
> Thank you all.
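
For reference, obtaining the plan as suggested above looks roughly like this in the Java DataSet API (a minimal sketch; the tiny in-memory source and the map step are placeholders standing in for the real job):

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.DiscardingOutputFormat;

public class PlanDump {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Placeholder pipeline standing in for the real ETL job.
        DataSet<String> data = env.fromElements("a", "b", "c");
        data.map(new MapFunction<String, String>() {
            @Override
            public String map(String value) {
                return value.toUpperCase();
            }
        }).output(new DiscardingOutputFormat<String>());

        // Instead of env.execute(), print the plan as a JSON string.
        // Note: getExecutionPlan() does not run the job.
        System.out.println(env.getExecutionPlan());
    }
}
```

The printed JSON can be inspected to see how the sort, re-partition, and sink stages are chained and where pipeline breakers sit.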
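
To make the discussion concrete, the shape of the described ETL job (sort, simple transformations, validation, re-partition, output) might look like the sketch below. This is an assumption about the job's structure, not the actual code; the input, the filter predicate, and the CSV sink are placeholders (the real job reads ~12B records and writes to a Hive table):

```java
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.operators.Order;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class EtlShape {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Placeholder input; the real job reads ~12B records.
        DataSet<Tuple2<String, Long>> records = env.fromElements(
                Tuple2.of("k1", 1L), Tuple2.of("k2", 2L));

        records
                .partitionByHash(0)                 // the re-partition step
                .sortPartition(0, Order.ASCENDING)  // sort within each partition
                .filter(new FilterFunction<Tuple2<String, Long>>() {
                    @Override
                    public boolean filter(Tuple2<String, Long> r) {
                        return r.f1 >= 0;           // stand-in for validation
                    }
                })
                .writeAsCsv("/tmp/etl-out");        // stand-in for the Hive sink

        env.execute("etl-shape-sketch");
    }
}
```

Both the hash re-partition and the per-partition sort are memory-intensive, spilling stages, so they are natural suspects when throughput degrades as the input grows.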