Hi,

I have a Flink streaming job that reads from Kafka and performs an aggregation in a window. It ran fine for a while, but when the number of events in a window crossed a certain limit, the YARN containers failed with OutOfMemory errors. The job was running with 10 GB containers.
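For context, the job is shaped roughly like the sketch below. This is not the real code — the topic name, broker address, group id, parsing logic, and window size are all placeholders — but checkpointing is enabled the same way, which is why I expected the window contents to be recoverable:

```java
// Rough sketch of the failing job: Kafka source -> keyed time window -> aggregation.
// All names, addresses, and the window size are placeholders, not the real values.
import java.util.Properties;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class WindowedAggregation {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Periodic checkpoints every 60s; window state is part of these snapshots.
        env.enableCheckpointing(60_000);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
        props.setProperty("group.id", "window-agg");              // placeholder

        env.addSource(new FlinkKafkaConsumer09<>("events", new SimpleStringSchema(), props))
            .map(new MapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public Tuple2<String, Integer> map(String line) {
                    // Placeholder parsing: first comma-separated field is the key.
                    return new Tuple2<>(line.split(",")[0], 1);
                }
            })
            .keyBy(0)                      // key by the first tuple field
            .timeWindow(Time.minutes(10))  // the window that buffers the events
            .sum(1)                        // stand-in for the real aggregation
            .print();

        env.execute("windowed-aggregation");
    }
}
```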
We have about 64 GB of memory on the machine, and I now want to restart the job with a 20 GB container (we ran some tests, and 20 GB should be enough to accommodate all the elements in the window). Is there a way to restart the job from the last checkpoint? When I resubmit the job, it starts from the last committed Kafka offsets, but the events that were held in the window at the time of checkpointing seem to get lost. Is there a way to recover the events that were buffered within the window and checkpointed before the failure?

Thanks,
Prabhu

--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Failed-job-restart-flink-on-yarn-tp7764.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.
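P.S. For concreteness, the restart flow I'm hoping for looks something like the sketch below, using Flink's savepoint CLI. The job id, savepoint path, jar name, and YARN options are placeholders; I realize a savepoint can only be triggered while the job is still running, and my question is what the equivalent is after a container OOM, when only the periodic checkpoint exists:

```shell
# While the job is still running: trigger a savepoint.
# The command prints the savepoint path on success.
flink savepoint <jobId>

# Resubmit on YARN with larger containers, resuming from that savepoint
# (-ytm is the TaskManager container memory in MB; 20480 MB = 20 GB,
#  -yn is a placeholder container count, myJob.jar a placeholder jar name):
flink run -m yarn-cluster \
  -yn 4 -ytm 20480 \
  -s hdfs:///savepoints/savepoint-xxxx \
  myJob.jar
```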