Hi,

Flink's operators are designed to work in memory as long as possible and
spill to disk once the memory budget is exceeded.
Moreover, Flink aims to run programs in a pipelined fashion, such that
multiple operators can process data at the same time.
This behavior can make it a bit tricky to analyze the runtime behavior and
progress of operators.

It would be interesting to have a look at the execution plan for the
program that you are running.
The plan can be obtained from the ExecutionEnvironment by calling
env.getExecutionPlan() instead of env.execute().
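
For reference, a minimal sketch of how that looks (the pipeline below is just a placeholder standing in for your ETL job, not your actual program):

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.DiscardingOutputFormat;

public class PlanDump {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Placeholder pipeline: build your job as usual ...
        env.fromElements(1L, 2L, 3L)
           .map(new MapFunction<Long, Long>() {
               @Override
               public Long map(Long value) {
                   return value * 2;
               }
           })
           .output(new DiscardingOutputFormat<Long>());

        // ... but instead of env.execute(), print the JSON execution plan.
        // The plan can be visualized at https://flink.apache.org/visualizer/
        System.out.println(env.getExecutionPlan());
    }
}
```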

I would also like to know how you track the progress of the program.
Are you looking at the record counts displayed in the WebUI?

Best,
Fabian



2017-12-05 22:03 GMT+01:00 Garrett Barton <garrett.bar...@gmail.com>:

> I have been moving some old MR and Hive workflows into Flink because I'm
> enjoying the APIs, and the ease of development is wonderful.  Things have
> largely worked great until I tried to really scale some of the jobs
> recently.
>
> I have, for example, one ETL job that reads in about 12B records at a time
> and does a sort, some simple transformations, validation, a re-partition,
> and then output to a Hive table.
> When I built it with the sample set, ~200M, it worked great: it took maybe
> a minute and blew through it.
>
> What I have observed is that some kind of saturation is reached depending
> on the number of slots, the number of nodes, and the overall size of data
> to move.
> When I run the 12B set, the first 1B go through in under 1 minute, really
> really fast.  But there's an extremely sharp drop-off after that: the next
> 1B might take 15 minutes, and if I wait for the 1B after that, it's well
> over an hour.
>
> What I can't find is any obvious indicators or things to look at;
> everything just grinds to a halt, and I don't think the job would ever
> actually complete.
>
> Is there something in the design of Flink in batch mode that is perhaps
> memory bound?  Adding more nodes/tasks does not fix it, just gets me a
> little further along.  I'm already running around ~1,400 slots at this
> point; I'd postulate needing 10,000+ to potentially make the job run, but
> that's too much of my cluster gone, and I have yet to get Flink stable
> past 1,500.
>
> Any ideas on where to look, or what to debug?  The GUI is also very
> cumbersome to use at this slot count, so other measurement ideas are
> welcome too!
>
> Thank you all.
>