[ 
https://issues.apache.org/jira/browse/YARN-547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-547:
-----------------------------------

    Attachment: yarn-547-20130416.patch

I have added separate test cases for the public and private localizers. With a 
small modification they should fail against the earlier implementation.
Both test cases follow the same steps:
* First, I set up the required ResourceLocalizationService, the other 
dispatchers and the directory handlers.
* Container-1 requests resource-1. The request is handled and the resource 
download starts (the resource state and semaphore count are verified).
* Container-2 then requests the same resource while it is still being 
downloaded. The request is rejected because the lock on the resource cannot be 
acquired.
* The resource-1 download fails; the resource transitions to the 
LOCALIZATION_FAILED state and the semaphore lock is released.
* Container-3 then requests the same resource. This request is rejected 
because, although it can acquire the lock, the resource is no longer in the 
DOWNLOADING state.
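
The sequence above can be sketched in a few lines of standalone Java. This is 
only an illustration of the guard the tests exercise, not the code in the 
patch: the class LocalizerGuardSketch, the Resource holder and the methods 
tryStartDownload and failDownload are hypothetical names, and a plain 
java.util.concurrent.Semaphore stands in for the lock on the resource.

```java
import java.util.concurrent.Semaphore;

public class LocalizerGuardSketch {
    // Hypothetical stand-in for the states of a localized resource.
    enum State { DOWNLOADING, LOCALIZED, LOCALIZATION_FAILED }

    // Hypothetical stand-in for a tracked resource: a state plus a one-permit lock.
    static class Resource {
        volatile State state = State.DOWNLOADING;
        final Semaphore lock = new Semaphore(1);
    }

    // A download may start only if the lock is free AND the state is still DOWNLOADING.
    static boolean tryStartDownload(Resource r) {
        if (!r.lock.tryAcquire()) {
            return false; // another localizer holds the lock
        }
        if (r.state != State.DOWNLOADING) {
            r.lock.release(); // lock was free, but the state rules out a download
            return false;
        }
        return true; // caller owns the lock until the download finishes
    }

    // Marks the download as failed and releases the lock held by the downloader.
    static void failDownload(Resource r) {
        r.state = State.LOCALIZATION_FAILED;
        r.lock.release();
    }

    public static void main(String[] args) {
        Resource r = new Resource();
        boolean c1 = tryStartDownload(r); // container-1: accepted
        boolean c2 = tryStartDownload(r); // container-2: lock is taken
        failDownload(r);                  // download fails, lock released
        boolean c3 = tryStartDownload(r); // container-3: lock free, wrong state
        System.out.println(c1 + " " + c2 + " " + c3
            + " permits=" + r.lock.availablePermits());
        // prints "true false false permits=1"
    }
}
```

Checking the permit count at the end mirrors the semaphore verification in the 
test steps: a rejected request must not leave the lock held.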

                
> Race condition in Public / Private Localizer may result into resource getting 
> downloaded again
> ----------------------------------------------------------------------------------------------
>
>                 Key: YARN-547
>                 URL: https://issues.apache.org/jira/browse/YARN-547
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Omkar Vinit Joshi
>            Assignee: Omkar Vinit Joshi
>         Attachments: yarn-547-20130411.1.patch, yarn-547-20130411.patch, 
> yarn-547-20130412.patch, yarn-547-20130415.patch, yarn-547-20130416.patch
>
>
> Public Localizer :
> At present, when multiple containers request a localized resource:
> * If the resource is not present, it is first created and resource 
> localization starts (the LocalizedResource is in the DOWNLOADING state).
> * If multiple ResourceRequestEvents arrive while it is in this state, 
> ResourceLocalizationEvents are sent for all of them.
> Most of the time this does not result in a duplicate resource download, but 
> there is a race condition. Inside ResourceLocalization (for public 
> download) all requests are added to a local attempts map. When a new 
> request comes in, this map is checked before a new download is started for 
> it. The request for the current download is in the map, so if a request for 
> the same resource comes in, it is rejected (i.e. the resource is already 
> being downloaded). However, once the current download completes, the request 
> is removed from this local map. If a LocalizerRequestEvent comes in after 
> this removal, it is not present in the local map and the resource is 
> downloaded again.
> PrivateLocalizer :
> A different but similar race condition is present here.
> * Inside the findNextResource method call, each LocalizerRunner tries to 
> grab a lock on the LocalizedResource. If the lock is not acquired, it keeps 
> trying until the resource state changes to LOCALIZED. The lock is released 
> by the LocalizerRunner when the download completes.
> * If another ContainerLocalizer grabs the lock on the resource before the 
> LocalizedResource state changes to LOCALIZED, the resource is downloaded 
> again.
> In both places the root cause is that all the threads try to acquire the 
> lock on the resource, but the current state of the LocalizedResource is not 
> taken into consideration.
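
The public localizer race described above can be replayed deterministically in 
a small standalone Java sketch. Everything here is hypothetical and simplified 
for illustration: PendingMapRaceSketch, Tracker, onRequest and 
onDownloadComplete are made-up names, a Set stands in for the local attempts 
map, and the events are delivered sequentially in the problematic order rather 
than from real threads.

```java
import java.util.HashSet;
import java.util.Set;

public class PendingMapRaceSketch {
    // Hypothetical stand-in for the relevant resource states.
    enum State { DOWNLOADING, LOCALIZED }

    // Hypothetical tracker for a single resource.
    static class Tracker {
        final Set<String> pending = new HashSet<>(); // in-flight download attempts
        final boolean checkState;                    // true = also consult the state
        State state = State.DOWNLOADING;
        int downloadsStarted = 0;

        Tracker(boolean checkState) {
            this.checkState = checkState;
        }

        // Handles a localization request for the given path.
        void onRequest(String path) {
            if (checkState && state != State.DOWNLOADING) {
                return; // resource no longer needs a download
            }
            if (pending.add(path)) {
                downloadsStarted++; // not in flight, so a download is started
            }
        }

        // Download finished: the attempt is dropped and the state advances.
        void onDownloadComplete(String path) {
            pending.remove(path);
            state = State.LOCALIZED;
        }
    }

    // Replays: request, duplicate request, completion, late request.
    static int replay(boolean checkState) {
        Tracker t = new Tracker(checkState);
        t.onRequest("resource-1");          // starts the download
        t.onRequest("resource-1");          // duplicate while in flight: ignored
        t.onDownloadComplete("resource-1"); // attempt removed from the set
        t.onRequest("resource-1");          // late event after the removal
        return t.downloadsStarted;
    }

    public static void main(String[] args) {
        System.out.println("attempts only:    " + replay(false)); // 2 downloads
        System.out.println("attempts + state: " + replay(true));  // 1 download
    }
}
```

With only the attempts check, the late request starts a second download; once 
the resource state is consulted as well, it is dropped.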

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
