Hi, Flink's operators are designed to work in memory for as long as possible and spill to disk once the memory budget is exceeded. Moreover, Flink aims to run programs in a pipelined fashion, so that multiple operators can process data at the same time. This behavior can make it tricky to analyze the runtime behavior and progress of operators.
It would be interesting to have a look at the execution plan for the program that you are running. The plan can be obtained from the ExecutionEnvironment by calling env.getExecutionPlan() instead of env.execute(). I would also like to know how you track the progress of the program. Are you looking at the record counts displayed in the WebUI?

Best, Fabian

2017-12-05 22:03 GMT+01:00 Garrett Barton <garrett.bar...@gmail.com>:
> I have been moving some old MR and Hive workflows into Flink because I'm
> enjoying the APIs, and the ease of development is wonderful. Things have
> largely worked great until I tried to really scale some of the jobs
> recently.
>
> I have, for example, one ETL job that reads in about 12B records at a time
> and does a sort, some simple transformations, validation, a re-partition,
> and then output to a Hive table.
> When I built it with the sample set, ~200M, it worked great; it took maybe
> a minute and blew through it.
>
> What I have observed is that there is some kind of saturation reached
> depending on the number of slots, the number of nodes, and the overall size
> of data to move. When I run the 12B set, the first 1B go through in under
> 1 minute, really really fast. But there is an extremely sharp drop-off
> after that; the next 1B might take 15 minutes, and if I wait for the next
> 1B, it's well over an hour.
>
> What I can't find is any obvious indicator or thing to look at;
> everything just grinds to a halt. I don't think the job would ever
> actually complete.
>
> Is there something in the design of Flink in batch mode that is perhaps
> memory bound? Adding more nodes/tasks does not fix it, just gets me a
> little further along. I'm already running around ~1,400 slots at this
> point. I'd postulate needing 10,000+ to potentially make the job run, but
> that's too much of my cluster gone, and I have yet to get Flink to be
> stable past 1,500.
>
> Any ideas on where to look, or what to debug?
> The GUI is also very cumbersome to use at this slot count, so other
> measurement ideas are welcome too!
>
> Thank you all.
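
For reference, obtaining the plan as suggested above looks roughly like this in the Java DataSet API (a minimal sketch; the tiny in-memory source and the map step are placeholders standing in for the real job):

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.DiscardingOutputFormat;

public class PlanDump {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Placeholder pipeline standing in for the real ETL job.
        DataSet<String> data = env.fromElements("a", "b", "c");
        data.map(new MapFunction<String, String>() {
            @Override
            public String map(String value) {
                return value.toUpperCase();
            }
        }).output(new DiscardingOutputFormat<String>());

        // Instead of env.execute(), print the plan as a JSON string.
        // Note: getExecutionPlan() does not run the job.
        System.out.println(env.getExecutionPlan());
    }
}
```

The printed JSON can be inspected to see how the sort, re-partition, and sink stages are chained and where pipeline breakers sit.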
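
To make the discussion concrete, the shape of the described ETL job (sort, simple transformations, validation, re-partition, output) might look like the sketch below. This is an assumption about the job's structure, not the actual code; the input, the filter predicate, and the CSV sink are placeholders (the real job reads ~12B records and writes to a Hive table):

```java
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.operators.Order;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class EtlShape {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Placeholder input; the real job reads ~12B records.
        DataSet<Tuple2<String, Long>> records = env.fromElements(
                Tuple2.of("k1", 1L), Tuple2.of("k2", 2L));

        records
                .partitionByHash(0)                 // the re-partition step
                .sortPartition(0, Order.ASCENDING)  // sort within each partition
                .filter(new FilterFunction<Tuple2<String, Long>>() {
                    @Override
                    public boolean filter(Tuple2<String, Long> r) {
                        return r.f1 >= 0;           // stand-in for validation
                    }
                })
                .writeAsCsv("/tmp/etl-out");        // stand-in for the Hive sink

        env.execute("etl-shape-sketch");
    }
}
```

Both the hash re-partition and the per-partition sort are memory-intensive, spilling stages, so they are natural suspects when throughput degrades as the input grows.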