I did more digging and finally understand what goes wrong.

I create a yarn-session with 50 slots. Then I run my job, which (because
my HBase table has hundreds of regions) has a lot of input splits. Because
I did not specify a parallelism, the job runs with parallelism 50. As a
consequence, the second job I start in the same yarn-session is faced with
0 available task slots and fails with this exception:

08/23/2016 09:58:52 Job execution switched to status FAILING.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
Not enough free slots available to run the job. You can decrease the
operator parallelism or increase the number of slots per TaskManager in
the configuration. Task to schedule: ......
Resources available to scheduler: Number of instances=5, total number of
slots=50, available slots=0

So my conclusion for now is: if you want to run batch jobs in a
yarn-session, then you MUST specify the parallelism for all steps;
otherwise the first job fills the yarn-session completely and you cannot
run multiple jobs in parallel.
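To be concrete, this is roughly what I mean by "specify the parallelism
for all steps" (a minimal sketch; the pipeline itself is just a
placeholder, only the setParallelism calls matter):

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class CappedParallelismJob {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            // Cap the whole job at 10 slots so a 50-slot yarn-session keeps 40 free.
            env.setParallelism(10);

            DataSet<String> lines = env.fromElements("a", "b", "c"); // placeholder source

            lines.map(new MapFunction<String, String>() {
                    @Override
                    public String map(String value) {
                        return value.toUpperCase();
                    }
                })
                // Per-operator override for a single step that needs even fewer slots.
                .setParallelism(5)
                .print();
        }
    }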
Is this conclusion correct?

Niels Basjes

On Fri, Aug 19, 2016 at 3:18 PM, Robert Metzger <rmetz...@apache.org> wrote:

> Hi Niels,
>
> In Flink, you don't need one task per file, since splits are assigned
> lazily to the reading tasks.
> What exactly is the error you are getting when trying to read that many
> input splits? (Is it on the JobManager?)
>
> Regards,
> Robert
>
> On Thu, Aug 18, 2016 at 1:56 PM, Niels Basjes <ni...@basjes.nl> wrote:
>
>> Hi,
>>
>> I'm working on a batch process using Flink and I ran into an
>> interesting problem: the number of input splits in my job is really
>> large.
>>
>> I currently have an HBase input (with more than 1000 regions), and in
>> the past I have worked with MapReduce jobs reading 2000+ files.
>>
>> The problem I have is that if I run such a job in a "small"
>> yarn-session (i.e. fewer than 1000 task slots), I get a fatal error
>> indicating that there are not enough resources.
>> For a continuous streaming job this makes sense, yet for a batch job
>> (like mine) this is an undesirable error.
>>
>> For my HBase situation I currently have a workaround: I override the
>> createInputSplits method of the TableInputFormat and thus control the
>> input splits that are created.
>>
>> What is the correct way to solve this (no, my cluster is NOT big
>> enough to run that many parallel tasks)?
>>
>> --
>> Best regards / Met vriendelijke groeten,
>>
>> Niels Basjes

--
Best regards / Met vriendelijke groeten,

Niels Basjes
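For reference, a rough sketch of the workaround described in the quoted
message above: subclass the HBase TableInputFormat and override
createInputSplits to merge the one-split-per-region result into fewer,
wider splits. The class name and merge strategy here are illustrative,
and the TableInputSplit constructor used below is package-private in some
Flink versions (hence the package declaration), so verify it against the
flink-hbase version you actually use:

    // Sketch under assumptions: TableInputSplit's constructor is reachable
    // from this package; check its exact signature in your Flink version.
    package org.apache.flink.addons.hbase;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.flink.api.java.tuple.Tuple;

    public abstract class CoarseTableInputFormat<T extends Tuple> extends TableInputFormat<T> {

        private final int maxSplits;

        protected CoarseTableInputFormat(int maxSplits) {
            this.maxSplits = maxSplits;
        }

        @Override
        public TableInputSplit[] createInputSplits(int minNumSplits) throws IOException {
            // The default implementation creates roughly one split per region.
            TableInputSplit[] perRegion = super.createInputSplits(minNumSplits);
            if (perRegion.length <= maxSplits) {
                return perRegion;
            }
            // Merge runs of adjacent region splits into one wider scan range.
            // This assumes the splits come back sorted by row key.
            int regionsPerSplit = (perRegion.length + maxSplits - 1) / maxSplits;
            List<TableInputSplit> merged = new ArrayList<>();
            for (int i = 0; i < perRegion.length; i += regionsPerSplit) {
                TableInputSplit first = perRegion[i];
                TableInputSplit last =
                    perRegion[Math.min(i + regionsPerSplit, perRegion.length) - 1];
                merged.add(new TableInputSplit(
                    merged.size(),          // new split number
                    first.getHostnames(),   // locality hint from the first region
                    first.getTableName(),
                    first.getStartRow(),    // start of the first region in the run
                    last.getEndRow()));     // end of the last region in the run
            }
            return merged.toArray(new TableInputSplit[merged.size()]);
        }
    }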