lonerzzz commented on issue #11042: FLINK-15744 Some TaskManager Task 
exceptions are logged as info
URL: https://github.com/apache/flink/pull/11042#issuecomment-586757018
 
 
   @zentol @aljoscha After reading issue #5399, it does not seem that any firm 
position was taken there. The argument for logging JobManager output at the 
info level assumes that the job is able to recover, which is not true in all 
cases. I have encountered two situations in which recovery either does not 
occur or occurs slowly:
   
   1) Job submission failure - there are many errors from which a submission 
will not recover without manual intervention. Because these errors are forced 
to the info level, a JobManager that regularly receives job submissions must 
always be run with info level logging, or the errors will not be visible.
   2) Rebalancing errors - when the number of task slots is close to the 
number of tasks, a transient infrastructure error can leave jobs stuck 
awaiting deployment and rebalancing for very long periods of time. While 
recovery may happen, it can take a while, and a warning would at least allow 
operations staff to take manual action to correct things rather than finding 
out later that a job in a pipeline is not processing because it is awaiting 
resources.
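
   To make the visibility concern in the two cases above concrete, here is a 
minimal sketch of how a logger threshold interacts with the level at which a 
failure is reported. It is an illustration only: it uses `java.util.logging` 
from the JDK rather than Flink's actual logging setup, and the class name, 
logger name, and messages are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class LogLevelVisibility {

    // Collects every record that passes the logger's level threshold.
    static final class CollectingHandler extends Handler {
        final List<String> messages = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            messages.add(record.getLevel() + ": " + record.getMessage());
        }

        @Override
        public void flush() {}

        @Override
        public void close() {}
    }

    // Hypothetical stand-in for a component that reports the same failure
    // once at INFO and once at WARNING. Returns what an operator would see
    // when the logger runs with the given threshold.
    static List<String> visibleMessages(Level threshold) {
        Logger logger = Logger.getLogger("demo.taskfailure." + threshold.getName());
        logger.setUseParentHandlers(false); // keep output out of the console handler
        logger.setLevel(threshold);

        CollectingHandler handler = new CollectingHandler();
        handler.setLevel(Level.ALL);
        logger.addHandler(handler);

        Exception failure = new IllegalStateException("no slots available");
        logger.log(Level.INFO, "task failed, logged at info", failure);
        logger.log(Level.WARNING, "task failed, logged at warning", failure);

        logger.removeHandler(handler);
        return handler.messages;
    }

    public static void main(String[] args) {
        System.out.println("threshold WARNING: " + visibleMessages(Level.WARNING));
        System.out.println("threshold INFO:    " + visibleMessages(Level.INFO));
    }
}
```

   With the threshold at `WARNING`, only the record logged at warning is 
collected; the same exception logged at info is dropped, so the threshold 
would have to be lowered to `INFO` for it to appear.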

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
