Daryn Sharp commented on YARN-3055:

The renew at job submission isn't the problem.  It's actually very desirable.  
Years back, a job submitted with bad tokens - that was destined to fail - would 
be launched anyway.  The tasks failed to connect, IPC-level retries occurred, 
then higher-level retries occurred, and YARN generally caught all exceptions 
and retried.  Tasks were retried, perhaps the app attempt was retried, etc.  In 
the end, a job that _clearly was going to fail_ might tie up cluster resources 
for 20+ minutes.  Why was it launched when a failed renew could have prevented 
the problem?  Not to mention the renewer was hardcoded to assume the expiration 
interval was 24h...  So much for being able to stress test the renewer with <1m 

The potential DoS problem arises when a token has reached its end-of-life 
expiration.  Let's say the token can be renewed twice.  The third and 
subsequent renews return the same expiration.
# t1 = submit + renew
# t2 = t1 + renew
# t3 = t2
# t4 = t2

The renew timers fire at 90% of the delta between now and the next expiration.  
So as end-of-life expiration approaches, the timer fires with increasing 
frequency.  50 threads doing that virtually non-stop would not be pretty.  The 
solution is to stop renewing when the next expiration equals the last 
expiration.  That can be addressed in another JIRA that's not a blocker 
because, as long as tokens aren't renewed forever, it's a rare situation.
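
To make the convergence concrete, here is a rough Java sketch.  It is not the 
actual DelegationTokenRenewer code: the class, the method names, and the 
millisecond values are all made up for illustration, and it assumes the 
simplified model from the list above, where each renew extends the previous 
expiration by one renew interval up to a hard maximum lifetime.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simulation; names and numbers are assumptions, not YARN code. */
final class RenewalSchedule {

    /** Delay until the next renew attempt: 90% of the time left until expiration. */
    static long nextDelayMs(long nowMs, long expirationMs) {
        return Math.max(0L, (expirationMs - nowMs) * 9 / 10);
    }

    /** Renewing is only useful while it still moves the expiration forward. */
    static boolean shouldKeepRenewing(long previousExpirationMs, long newExpirationMs) {
        return newExpirationMs > previousExpirationMs;
    }

    /**
     * Simulates timer firings for a token whose expiration is capped at
     * maxLifetimeMs. Returns the simulated times at which the timer fired.
     */
    static List<Long> simulate(long renewIntervalMs, long maxLifetimeMs,
                               boolean stopAtEndOfLife, int maxFirings) {
        List<Long> firings = new ArrayList<>();
        long now = 0L;
        // t1 = submit + renew (the renew performed at job submission)
        long expiration = Math.min(renewIntervalMs, maxLifetimeMs);
        while (firings.size() < maxFirings) {
            now += nextDelayMs(now, expiration);
            firings.add(now);
            // Simplified model: previous expiration + renew interval, capped at end of life.
            long renewed = Math.min(expiration + renewIntervalMs, maxLifetimeMs);
            if (stopAtEndOfLife && !shouldKeepRenewing(expiration, renewed)) {
                break; // the next expiration equals the last one: stop rescheduling
            }
            expiration = renewed;
        }
        return firings;
    }

    public static void main(String[] args) {
        System.out.println("guarded:   " + simulate(1000L, 2000L, true, 10));
        System.out.println("unguarded: " + simulate(1000L, 2000L, false, 10));
    }
}
```

With a renew interval of 1000 and a maximum lifetime of 2000, the guarded run 
fires twice and stops.  The unguarded run sees its delays shrink 900, 990, 99, 
9, 1, 0, and from then on it fires back to back until the firing cap is hit.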

> The token is not renewed properly if it's shared by jobs (oozie) in 
> DelegationTokenRenewer
> ------------------------------------------------------------------------------------------
>                 Key: YARN-3055
>                 URL: https://issues.apache.org/jira/browse/YARN-3055
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: security
>            Reporter: Yi Liu
>            Assignee: Yi Liu
>            Priority: Blocker
>         Attachments: YARN-3055.001.patch, YARN-3055.002.patch, YARN-3055.patch
> After YARN-2964, there is only one timer to renew a token that is shared by 
> jobs. 
> In {{removeApplicationFromRenewal}}, when removing a token that is still 
> shared by other jobs, we do not cancel the token. 
> In that case we should also not cancel the _timerTask_ or remove the token 
> from {{allTokens}}. Otherwise, already-submitted applications that share 
> the token will no longer have it renewed, and for newly submitted 
> applications that share it, the token will be renewed immediately.
> For example, say we have 3 applications, app1, app2, and app3, that share 
> token1. Consider the following scenario:
> *1).* app1 is submitted first, then app2, and then app3. In this case, 
> there is only one renewal timer for token1, and it is scheduled when app1 
> is submitted.
> *2).* app1 finishes, and the renewal timer is cancelled. token1 will not 
> be renewed any more, but app2 and app3 still use it, so there is a problem.
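
The behaviour the description asks for amounts to reference counting the 
shared token.  Below is a minimal Java sketch of that idea.  The class 
{{SharedTokenRegistry}} and its methods are hypothetical and are not taken 
from any of the attached patches; it only tracks which applications still 
reference a token and reports when the single timer may be scheduled or 
cancelled.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical bookkeeping sketch; not the DelegationTokenRenewer implementation. */
final class SharedTokenRegistry {

    // token id -> applications that still reference the token
    private final Map<String, Set<String>> appsByToken = new HashMap<>();

    /** Returns true only for the first reference, i.e. when the one renewal timer should be scheduled. */
    boolean addApplication(String tokenId, String appId) {
        Set<String> apps = appsByToken.computeIfAbsent(tokenId, k -> new HashSet<>());
        boolean firstReference = apps.isEmpty();
        apps.add(appId);
        return firstReference;
    }

    /**
     * Returns true only when the last referencing application is removed,
     * i.e. when it is safe to cancel the timer and stop tracking the token.
     */
    boolean removeApplication(String tokenId, String appId) {
        Set<String> apps = appsByToken.get(tokenId);
        if (apps == null || !apps.remove(appId)) {
            return false; // unknown token or app: nothing to cancel
        }
        if (!apps.isEmpty()) {
            return false; // still shared: keep the timer and keep tracking the token
        }
        appsByToken.remove(tokenId);
        return true;
    }

    /** True while at least one application still references the token. */
    boolean isTracked(String tokenId) {
        return appsByToken.containsKey(tokenId);
    }
}
```

In the app1/app2/app3 scenario, only the submission of app1 reports that a 
timer should be scheduled, and only the removal of the last of the three 
applications reports that it may be cancelled.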
