[ https://issues.apache.org/jira/browse/YARN-744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825989#comment-13825989 ]
Bikas Saha commented on YARN-744: --------------------------------- Better name? {code} + AllocateResponseLock res = responseMap.get(applicationAttemptId); {code} reuse throwApplicationAttemptDoesNotExistInCacheException() in registerApplicationMaster()? use InvalidApplicationMasterRequestException or a new specific exception instead of generic RPCUtil.throwRemoteException()? {code} + private void throwApplicationAttemptDoesNotExistInCacheException( + ApplicationAttemptId appAttemptId) throws YarnException { + String message = "Application doesn't exist in cache " + + appAttemptId; + LOG.error(message); + throw RPCUtil.getRemoteException(message); + } {code} The new logic is not the same as the old one. If the app is no longer in the cache then it would send a resync response. Now it will send a regular response instead of a resync response. {code} - // before returning response, verify in sync - AllocateResponse oldResponse = - responseMap.put(appAttemptId, allocateResponse); - if (oldResponse == null) { - // appAttempt got unregistered, remove it back out - responseMap.remove(appAttemptId); - String message = "App Attempt removed from the cache during allocate" - + appAttemptId; - LOG.error(message); - return resync; - } - + res.setAllocateResponse(allocateResponse); return allocateResponse; {code} > Race condition in ApplicationMasterService.allocate .. It might process same > allocate request twice resulting in additional containers getting allocated. > --------------------------------------------------------------------------------------------------------------------------------------------------------- > > Key: YARN-744 > URL: https://issues.apache.org/jira/browse/YARN-744 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager > Reporter: Bikas Saha > Assignee: Omkar Vinit Joshi > Priority: Minor > Attachments: MAPREDUCE-3899-branch-0.23.patch, > YARN-744-20130711.1.patch, YARN-744-20130715.1.patch, > YARN-744-20130726.1.patch, YARN-744.1.patch, YARN-744.patch > > > Looks like the lock taken in this is broken. It takes a lock on lastResponse > object and then puts a new lastResponse object into the map. At this point a > new thread entering this function will get a new lastResponse object and will > be able to take its lock and enter the critical section. Presumably we want > to limit one response per app attempt. So the lock could be taken on the > ApplicationAttemptId key of the response map object. -- This message was sent by Atlassian JIRA (v6.1#6144)