[ https://issues.apache.org/jira/browse/YARN-547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14083015#comment-14083015 ]
Kannan Rajah commented on YARN-547:
-----------------------------------

Just realized that the FetchResourceTransition adds the container to the reference list, so we need to retain that transition. Omkar's patch from April 11 added a duplicate transition that just updates the reference list, but he later reverted that change on April 13. I didn't understand how this specific change impacts parallelism.

> Race condition in Public / Private Localizer may result into resource getting downloaded again
> ----------------------------------------------------------------------------------------------
>
>                 Key: YARN-547
>                 URL: https://issues.apache.org/jira/browse/YARN-547
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Omkar Vinit Joshi
>            Assignee: Omkar Vinit Joshi
>             Fix For: 2.1.0-beta
>
>         Attachments: yarn-547-20130411.1.patch, yarn-547-20130411.patch, yarn-547-20130412.patch, yarn-547-20130415.patch, yarn-547-20130416.1.patch, yarn-547-20130416.patch, yarn-547-20130418.patch
>
>
> Public Localizer :
> At present, when multiple containers try to request a localized resource:
> * If the resource is not present, then it is first created and Resource Localization starts (LocalizedResource is in DOWNLOADING state).
> * Now if multiple ResourceRequestEvents arrive in this state, then ResourceLocalizationEvents are sent for all of them.
> Most of the time this does not result in a duplicate resource download, but a race condition is present there. Inside ResourceLocalization (for public download) all the requests are added to a local attempts map. If a new request comes in, it is first checked against this map before a new download starts for it. For the current download, the request will be in the map. Now if a request for the same resource comes in, it will be rejected (i.e. the resource is already being downloaded). However, if the current download completes, the request will be removed from this local map.
> Now after this removal, if the LocalizerRequestEvent comes in, then as it is no longer present in the local map the resource will be downloaded again.
> PrivateLocalizer :
> Here a different but similar race condition is present.
> * Inside the findNextResource method call, each LocalizerRunner tries to grab a lock on LocalizerResource. If the lock is not acquired, it will keep trying until the resource state changes to LOCALIZED. This lock will be released by the LocalizerRunner when the download completes.
> * Now if another ContainerLocalizer tries to grab the lock on a resource before the LocalizedResource state changes to LOCALIZED, the resource will be downloaded again.
> In both places the root cause is that all the threads try to acquire the lock on the resource, but the current state of the LocalizedResource is not taken into consideration.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
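
To illustrate the root cause stated in the description (the lock is acquired without consulting the resource state), here is a rough, simplified sketch. It is not the YARN code and not the attached patch: SketchResource, fetchIgnoringState, fetchCheckingState and LocalizerRaceSketch are hypothetical names, and the sketch assumes the state is updated while the lock is still held. The two "runners" are executed one after the other so that the interleaving is deterministic.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for a localized resource; not the actual YARN class. */
class SketchResource {
    enum State { DOWNLOADING, LOCALIZED }

    private final ReentrantLock lock = new ReentrantLock();
    private volatile State state = State.DOWNLOADING;
    final AtomicInteger downloads = new AtomicInteger();

    /** Racy variant: holding the lock is treated as permission to download. */
    boolean fetchIgnoringState() {
        if (!lock.tryLock()) {
            return false; // another runner holds the lock; caller would retry
        }
        try {
            download();
            return true;
        } finally {
            lock.unlock();
        }
    }

    /** State-aware variant: re-check the state once the lock is held. */
    boolean fetchCheckingState() {
        if (!lock.tryLock()) {
            return false;
        }
        try {
            if (state == State.LOCALIZED) {
                return false; // already localized; nothing to download
            }
            download();
            return true;
        } finally {
            lock.unlock();
        }
    }

    /** Simulated download: count it and mark the resource as localized. */
    private void download() {
        downloads.incrementAndGet();
        state = State.LOCALIZED; // assumption: updated before the lock is released
    }
}

class LocalizerRaceSketch {
    /** Two runners ask for the same resource, the second after the first finished. */
    static int downloadsAfterTwoRunners(boolean checkState) {
        SketchResource resource = new SketchResource();
        for (int runner = 0; runner < 2; runner++) {
            if (checkState) {
                resource.fetchCheckingState();
            } else {
                resource.fetchIgnoringState();
            }
        }
        return resource.downloads.get();
    }

    public static void main(String[] args) {
        System.out.println("ignoring state: " + downloadsAfterTwoRunners(false));
        System.out.println("checking state: " + downloadsAfterTwoRunners(true));
    }
}
```

In this sketch the second runner downloads again only in the variant that ignores the state, which is the behaviour the description attributes to both localizers.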