[jira] [Commented] (YARN-9826) Blocked threads at EntityGroupFSTimelineStore#getCachedStore
[ https://issues.apache.org/jira/browse/YARN-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170597#comment-17170597 ]

Shen Yinjie commented on YARN-9826:
-----------------------------------

Is there any progress on this issue? :)

> Blocked threads at EntityGroupFSTimelineStore#getCachedStore
> ------------------------------------------------------------
>
>                 Key: YARN-9826
>                 URL: https://issues.apache.org/jira/browse/YARN-9826
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: timelineserver
>    Affects Versions: 2.7.3
>            Reporter: Harunobu Daikoku
>            Priority: Minor
>
> We have observed this case several times on our production cluster where 100s
> of TimelineServer threads are blocked at the following synchronized block in
> EntityGroupFSTimelineStore#getCachedStore when our HDFS NameNode is under
> high load.
> {code:java}
>     synchronized (this.cachedLogs) {
>       // Note that the content in the cache log storage may be stale.
>       cacheItem = this.cachedLogs.get(groupId);
>       if (cacheItem == null) {
>         LOG.debug("Set up new cache item for id {}", groupId);
>         cacheItem = new EntityCacheItem(groupId, getConfig());
>         AppLogs appLogs = getAndSetAppLogs(groupId.getApplicationId());
>         if (appLogs != null) {
>           LOG.debug("Set applogs {} for group id {}", appLogs, groupId);
>           cacheItem.setAppLogs(appLogs);
>           this.cachedLogs.put(groupId, cacheItem);
>         } else {
>           LOG.warn("AppLogs for groupId {} is set to null!", groupId);
>         }
>       }
>     }
> {code}
> One thread inside the synchronized block performs multiple fs operations
> (fs.exists) inside getAndSetAppLogs, which could block other threads when,
> for instance, the NameNode RPC queue is full.
> One possible solution is to move getAndSetAppLogs outside the synchronized
> block.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
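The "move getAndSetAppLogs outside the synchronized block" idea from the description can be sketched in isolation as follows. This is a minimal illustration, not Hadoop code: the class SlowLookupOutsideLock, the method slowLookup (a stand-in for getAndSetAppLogs) and the String key/value types are all hypothetical. The lock is held only for the two short map accesses, so a slow lookup for one key should not block callers that already have a cache hit. The trade-off is that two threads may run the lookup for the same key at the same time.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the cache in getCachedStore; not the Hadoop class.
class SlowLookupOutsideLock {
    private final Map<String, String> cache = new HashMap<>();
    // Counts how often the slow path ran, for illustration only.
    final AtomicInteger lookups = new AtomicInteger();

    // Stand-in for getAndSetAppLogs: imagine remote fs.exists calls here.
    private String slowLookup(String key) {
        lookups.incrementAndGet();
        return "logs-for-" + key;
    }

    String get(String key) {
        // Short critical section: check the cache only.
        synchronized (cache) {
            String hit = cache.get(key);
            if (hit != null) {
                return hit;
            }
        }
        // The slow work runs with no lock held, so other callers are not blocked.
        String loaded = slowLookup(key);
        // Short critical section: publish the result unless another thread won.
        synchronized (cache) {
            String existing = cache.get(key);
            if (existing != null) {
                return existing;
            }
            cache.put(key, loaded);
            return loaded;
        }
    }
}
```

Whether a duplicate lookup is acceptable depends on whether the slow operation has side effects, which is the question the earlier comments in this thread discuss.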
[jira] [Commented] (YARN-9826) Blocked threads at EntityGroupFSTimelineStore#getCachedStore
[ https://issues.apache.org/jira/browse/YARN-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16934068#comment-16934068 ]

Akira Ajisaka commented on YARN-9826:
-------------------------------------

bq. I don't see any side effects with that change.

There are no side effects, but there may be duplicate log creations. I think we can use another lock object to avoid duplicate operations as follows:
{code}
  private final Object fsOpLock = new Object();

(snip)

    // Note that the content in the cache log storage may be stale.
    cacheItem = this.cachedLogs.get(groupId);
    // If the cache already exists, we don't need to hold any locks.
    if (cacheItem == null) {
      // Use lock to serialize fs operations
      synchronized(fsOpLock) {
        // Recheck cache to avoid duplicate fs operations
        cacheItem = this.cachedLogs.get(groupId);
        if (cacheItem == null) {
          LOG.debug("Set up new cache item for id {}", groupId);
          cacheItem = new EntityCacheItem(groupId, getConfig());
          AppLogs appLogs = getAndSetAppLogs(groupId.getApplicationId());
          if (appLogs != null) {
            LOG.debug("Set applogs {} for group id {}", appLogs, groupId);
            cacheItem.setAppLogs(appLogs);
            this.cachedLogs.put(groupId, cacheItem);
          } else {
            LOG.warn("AppLogs for groupId {} is set to null!", groupId);
          }
        }
      }
    }
{code}
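For readers who want to try the separate-lock pattern from the comment above outside Hadoop, here is a self-contained sketch under stated assumptions. SeparateLoadLockCache, slowLoad and loadsUnderContention are hypothetical names, the String values stand in for EntityCacheItem, and the map is assumed to be a Collections.synchronizedMap as in the discussion. It only illustrates the check, lock, recheck sequence and is not the patch itself.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; mirrors the shape of the proposal, not Hadoop code.
class SeparateLoadLockCache {
    private final Map<String, String> cache =
        Collections.synchronizedMap(new HashMap<>());
    // Dedicated lock that serializes only the slow load path.
    private final Object loadLock = new Object();
    final AtomicInteger loads = new AtomicInteger();

    // Stand-in for getAndSetAppLogs and its fs operations.
    private String slowLoad(String key) {
        loads.incrementAndGet();
        return "logs-for-" + key;
    }

    String get(String key) {
        // Cache hit: a single thread-safe map call, loadLock is never taken.
        String item = cache.get(key);
        if (item == null) {
            synchronized (loadLock) {
                // Recheck: another thread may have loaded it while we waited.
                item = cache.get(key);
                if (item == null) {
                    item = slowLoad(key);
                    cache.put(key, item);
                }
            }
        }
        return item;
    }

    // Starts several threads asking for the same key and reports the load count.
    static int loadsUnderContention(int threads) {
        SeparateLoadLockCache c = new SeparateLoadLockCache();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> c.get("application_1"));
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return c.loads.get();
    }
}
```

With this shape, cache hits never wait on loadLock, but cache misses for different keys still queue behind one another while a load is in progress.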
[jira] [Commented] (YARN-9826) Blocked threads at EntityGroupFSTimelineStore#getCachedStore
[ https://issues.apache.org/jira/browse/YARN-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931218#comment-16931218 ]

Harunobu Daikoku commented on YARN-9826:
----------------------------------------

[~Prabhu Joseph]
Although multiple threads could indeed execute getAndSetAppLogs with the same app id at the same time, I don't see any side effects with that change.

AFAIK, the following is the only piece of code that could modify state inside getAndSetAppLogs:
{code:java}
if (appState != AppState.UNKNOWN) {
  LOG.debug("Create and try to add new appLogs to appIdLogMap for {}",
      applicationId);
  appLogs = createAndPutAppLogsIfAbsent(
      applicationId, appDirPath, appState);
}
{code}
Apparently createAndPutAppLogsIfAbsent atomically updates appIdLogMap with ConcurrentMap#putIfAbsent(), so this doesn't have to be synchronized on this.cachedLogs.
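The atomicity of ConcurrentMap#putIfAbsent() that the comment above relies on can be checked with a small stand-alone program. This is a sketch, not the Hadoop implementation: PutIfAbsentDemo and countWinners are hypothetical names, and the Integer values stand in for AppLogs objects. Several threads race to insert a value for the same key, and only the caller that receives null from putIfAbsent has had its value stored.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; the map stands in for appIdLogMap.
class PutIfAbsentDemo {
    // Returns how many of the racing threads installed their own value.
    static int countWinners(int threads) {
        ConcurrentMap<String, Integer> map = new ConcurrentHashMap<>();
        AtomicInteger winners = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            final int id = i;
            workers[i] = new Thread(() -> {
                try {
                    start.await(); // release all threads at roughly the same time
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                // putIfAbsent returns null only for the call that stored its value.
                if (map.putIfAbsent("application_1", id) == null) {
                    winners.incrementAndGet();
                }
            });
            workers[i].start();
        }
        start.countDown();
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return winners.get();
    }
}
```

Every losing thread gets the winner's value back from putIfAbsent, so all callers end up sharing one entry for the key even though each of them constructed a candidate.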
[jira] [Commented] (YARN-9826) Blocked threads at EntityGroupFSTimelineStore#getCachedStore
[ https://issues.apache.org/jira/browse/YARN-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929749#comment-16929749 ]

Prabhu Joseph commented on YARN-9826:
-------------------------------------

[~hdaikoku] When getAndSetAppLogs is moved outside the synchronized block, there is a chance that multiple threads will run it for the same applicationId.
[jira] [Commented] (YARN-9826) Blocked threads at EntityGroupFSTimelineStore#getCachedStore
[ https://issues.apache.org/jira/browse/YARN-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929744#comment-16929744 ]

Prabhu Joseph commented on YARN-9826:
-------------------------------------

[~hdaikoku] cachedLogs is a Collections.synchronizedMap. Is a synchronized block required while accessing the map? cc [~tarunparimi].
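As background for the question above, the sketch below separates the two cases for a Collections.synchronizedMap. A single call such as get() is already thread-safe on its own. A check-then-put sequence is two calls, and another thread can run between them unless the caller holds a lock across both. The class and method names (SynchronizedMapCompoundOp, peek, getOrLoad) are hypothetical, and the example relies on the wrapper using itself as its lock, which is why the Javadoc tells callers to synchronize on the returned map when iterating.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; the map stands in for cachedLogs.
class SynchronizedMapCompoundOp {
    private final Map<String, Integer> cache =
        Collections.synchronizedMap(new HashMap<>());
    private int loads = 0; // only touched while holding the map's monitor

    // A single map call needs no extra synchronized block.
    Integer peek(String key) {
        return cache.get(key);
    }

    // get followed by put is a compound operation: hold the map's own
    // monitor so that no other thread can interleave between the two calls.
    int getOrLoad(String key) {
        synchronized (cache) {
            Integer value = cache.get(key);
            if (value == null) {
                value = ++loads;
                cache.put(key, value);
            }
            return value;
        }
    }
}
```

So the synchronized block in getCachedStore is not needed for the get() alone. It matters only for keeping the null check and the later put() together.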