[jira] [Commented] (MAPREDUCE-6436) JobHistory cache issue
[ https://issues.apache.org/jira/browse/MAPREDUCE-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15059858#comment-15059858 ]

Hudson commented on MAPREDUCE-6436:
-----------------------------------

ABORTED: Integrated in Hadoop-Hdfs-trunk-Java8 #698 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/698/])
Update CHANGES.txt to move MAPREDUCE-6436 from YARN to MAPREDUCE (zxu: rev 7092d47fc0b3b792dd31f967c01d460dc089f60b)
* hadoop-yarn-project/CHANGES.txt
* hadoop-mapreduce-project/CHANGES.txt

> JobHistory cache issue
> ----------------------
>
>                 Key: MAPREDUCE-6436
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6436
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Ryu Kobayashi
>            Assignee: Kai Sasaki
>            Priority: Blocker
>             Fix For: 2.8.0, 2.7.3, 2.6.4
>
>         Attachments: MAPREDUCE-6436.1.patch, MAPREDUCE-6436.2.patch,
> MAPREDUCE-6436.3.patch, MAPREDUCE-6436.4.patch, stacktrace1.txt,
> stacktrace2.txt, stacktrace3.txt
>
>
> Problem:
> HistoryFileManager.addIfAbsent produces a large amount of logs if the number of
> cached entries whose age is less than mapreduce.jobhistory.max-age-ms becomes
> much larger than mapreduce.jobhistory.joblist.cache.size.
> Example:
> For example, if the cache contains 5 entries in total and 10,000 entries
> newer than mapreduce.jobhistory.max-age-ms where
> mapreduce.jobhistory.joblist.cache.size is 2, the
> HistoryFileManager.addIfAbsent method produces 5 - 2 = 3 lines of the
> "Waiting to remove from JobListCache because it is not in done yet" message.
> It will attach a stacktrace.
> Impact:
> In addition to large disk consumption, this issue blocks JobHistory.getJob for
> a long time and slows job execution down significantly, because getJob is
> called by RPCs such as HistoryClientService.HSClientProtocolHandler.getJobReport.
> This impact happens because HistoryFileManager.UserLogDir.scanIfNeeded
> eventually calls HistoryFileManager.addIfAbsent in a synchronized block. When
> multiple threads call scanIfNeeded simultaneously, one of them acquires the
> lock and the other threads are blocked until the first thread completes the
> long-running HistoryFileManager.addIfAbsent call.
> Solution:
> * Reduce the amount of logs so that HistoryFileManager.addIfAbsent doesn't take
> too long.
> * Good to have if possible: HistoryFileManager.UserLogDir.scanIfNeeded skips
> scanning if another thread is already scanning. This changes the semantics of
> some HistoryFileManager methods (such as getAllFileInfo and getFileInfo)
> because scanIfNeeded keeps outdated state.
> * Good to have if possible: Make scanIfNeeded asynchronous so that RPC calls
> are not blocked by a loop at the scale of tens of thousands.
>
> This patch implemented the first item.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
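To make the arithmetic in the example concrete, here is a minimal Java sketch. It is not the actual HistoryFileManager code: `LogVolumeSketch`, `trimVerbose`, and `trimSummarized` are hypothetical names, and the model assumes that none of the cached entries has been moved to done yet, so none of them can be evicted. The summarized variant is only one possible way to reduce the log volume; the thread does not say how the patch does it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the log volume described in the issue.
// Assumption: every cached job is still pending (not in "done"), so no entry can be evicted.
class LogVolumeSketch {

    // One warning per entry above the configured cache size.
    static List<String> trimVerbose(List<String> cachedJobIds, int maxSize) {
        List<String> warnings = new ArrayList<>();
        int excess = Math.max(0, cachedJobIds.size() - maxSize);
        for (int i = 0; i < excess; i++) {
            warnings.add("Waiting to remove " + cachedJobIds.get(i)
                + " from JobListCache because it is not in done yet");
        }
        return warnings;
    }

    // A single warning that carries the count and one example key.
    static List<String> trimSummarized(List<String> cachedJobIds, int maxSize) {
        List<String> warnings = new ArrayList<>();
        int excess = Math.max(0, cachedJobIds.size() - maxSize);
        if (excess > 0) {
            warnings.add("Waiting to remove " + excess + " entries (e.g. "
                + cachedJobIds.get(0) + ") from JobListCache because they are not in done yet");
        }
        return warnings;
    }

    public static void main(String[] args) {
        List<String> jobs = Arrays.asList("job_1", "job_2", "job_3", "job_4", "job_5");
        System.out.println(trimVerbose(jobs, 2).size());    // 5 - 2 = 3 lines
        System.out.println(trimSummarized(jobs, 2).size()); // 1 line
    }
}
```

With 5 cached entries and a cache size of 2, the verbose variant emits 3 lines per call, matching the example in the description, while the summarized variant emits one line regardless of how far the cache is over its limit.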
[jira] [Commented] (MAPREDUCE-6436) JobHistory cache issue
[ https://issues.apache.org/jira/browse/MAPREDUCE-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15057625#comment-15057625 ]

zhihai xu commented on MAPREDUCE-6436:
--------------------------------------

Thanks [~djp] and [~lewuathe]!
I changed the priority to Blocker so that more people notice this potential performance issue.
+1 for the latest patch. Will commit it shortly.

> JobHistory cache issue
> ----------------------
>
>                 Key: MAPREDUCE-6436
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6436
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Ryu Kobayashi
>            Assignee: Kai Sasaki
>         Attachments: MAPREDUCE-6436.1.patch, MAPREDUCE-6436.2.patch,
> MAPREDUCE-6436.3.patch, MAPREDUCE-6436.4.patch, stacktrace1.txt,
> stacktrace2.txt, stacktrace3.txt
[jira] [Commented] (MAPREDUCE-6436) JobHistory cache issue
[ https://issues.apache.org/jira/browse/MAPREDUCE-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15057703#comment-15057703 ]

zhihai xu commented on MAPREDUCE-6436:
--------------------------------------

Committed it to trunk, branch-2, branch-2.6 and branch-2.7!
Thanks [~lewuathe] for the contributions! Thanks [~djp] for the additional review!
[jira] [Commented] (MAPREDUCE-6436) JobHistory cache issue
[ https://issues.apache.org/jira/browse/MAPREDUCE-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15057727#comment-15057727 ]

Hudson commented on MAPREDUCE-6436:
-----------------------------------

FAILURE: Integrated in Hadoop-trunk-Commit #8968 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8968/])
MAPREDUCE-6436. JobHistory cache issue. Contributed by Kai Sasaki (zxu: rev 5b7078d06921893200163a3d29c8901c3c0107cb)
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/HistoryFileManager.java
* hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (MAPREDUCE-6436) JobHistory cache issue
[ https://issues.apache.org/jira/browse/MAPREDUCE-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058917#comment-15058917 ]

zhihai xu commented on MAPREDUCE-6436:
--------------------------------------

Thanks for catching that, [~aw]. I just learned that branch-2.8 has been cut. I will commit it to branch-2.8 shortly.
[jira] [Commented] (MAPREDUCE-6436) JobHistory cache issue
[ https://issues.apache.org/jira/browse/MAPREDUCE-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058744#comment-15058744 ]

Allen Wittenauer commented on MAPREDUCE-6436:
---------------------------------------------

bq. Committed it to trunk, branch-2, branch-2.6 and branch-2.7!

You missed branch-2.8...
[jira] [Commented] (MAPREDUCE-6436) JobHistory cache issue
[ https://issues.apache.org/jira/browse/MAPREDUCE-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15059254#comment-15059254 ]

Kai Sasaki commented on MAPREDUCE-6436:
---------------------------------------

[~zxu] Thank you so much!
Do we need to create another JIRA as a follow-up to optimize the JobHistoryServer RPC path and implement the item below?

{quote}
Good to have if possible: Make scanIfNeeded asynchronous so that RPC calls are not blocked by a loop at scale of tens of thousands.
{quote}
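As a rough illustration of the quoted suggestion, the following Java sketch shows one way a scan could be moved off the RPC thread. It is a sketch under stated assumptions, not Hadoop code: `AsyncScanSketch`, `scanIfNeededAsync`, and `lookup` are hypothetical names, a `Map` stands in for the intermediate directory, and `putAll` stands in for the real directory scan.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: callers request a scan and return immediately.
class AsyncScanSketch {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final ExecutorService scanner = Executors.newSingleThreadExecutor();
    private final AtomicBoolean scanInFlight = new AtomicBoolean(false);
    final AtomicInteger scansRun = new AtomicInteger();

    // Returns true if this call scheduled a scan, false if one was already pending.
    boolean scanIfNeededAsync(Map<String, String> intermediateDir) {
        if (!scanInFlight.compareAndSet(false, true)) {
            return false; // another scan is queued or running; do not block
        }
        scanner.execute(() -> {
            try {
                cache.putAll(intermediateDir); // stand-in for the directory scan
                scansRun.incrementAndGet();
            } finally {
                scanInFlight.set(false);
            }
        });
        return true;
    }

    // Never blocks on a scan, so the answer may be stale.
    String lookup(String jobId) {
        return cache.get(jobId);
    }

    // Waits for queued scans to finish and stops the worker thread.
    void shutdown() {
        scanner.shutdown();
        try {
            scanner.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The trade-off is the one the issue description already names: `lookup` can return stale state, so a job that a synchronous scan would have found may be reported as missing until the background scan completes.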
[jira] [Commented] (MAPREDUCE-6436) JobHistory cache issue
[ https://issues.apache.org/jira/browse/MAPREDUCE-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15059548#comment-15059548 ]

Hudson commented on MAPREDUCE-6436:
-----------------------------------

FAILURE: Integrated in Hadoop-trunk-Commit #8973 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8973/])
Update CHANGES.txt to move MAPREDUCE-6436 from YARN to MAPREDUCE (zxu: rev 7092d47fc0b3b792dd31f967c01d460dc089f60b)
* hadoop-yarn-project/CHANGES.txt
* hadoop-mapreduce-project/CHANGES.txt
[jira] [Commented] (MAPREDUCE-6436) JobHistory cache issue
[ https://issues.apache.org/jira/browse/MAPREDUCE-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15059386#comment-15059386 ]

Kai Sasaki commented on MAPREDUCE-6436:
---------------------------------------

[~zxu] Thanks for clarifying. I created another JIRA, MAPREDUCE-6573, for reducing unnecessary calls to scanIntermediateDirectory.
Please discuss it on that JIRA. If it is not necessary, it can be closed. Thanks anyway.
[jira] [Commented] (MAPREDUCE-6436) JobHistory cache issue
[ https://issues.apache.org/jira/browse/MAPREDUCE-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058061#comment-15058061 ]

Hudson commented on MAPREDUCE-6436:
-----------------------------------

SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #694 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/694/])
MAPREDUCE-6436. JobHistory cache issue. Contributed by Kai Sasaki (zxu: rev 5b7078d06921893200163a3d29c8901c3c0107cb)
* hadoop-yarn-project/CHANGES.txt
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/HistoryFileManager.java
[jira] [Commented] (MAPREDUCE-6436) JobHistory cache issue
[ https://issues.apache.org/jira/browse/MAPREDUCE-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15059349#comment-15059349 ]

zhihai xu commented on MAPREDUCE-6436:
--------------------------------------

Thanks [~lewuathe] for the suggestion! There is a task, {{MoveIntermediateToDoneRunnable}}, which calls scanIntermediateDirectory periodically, so most of the time the job will be found in the cache {{jobListCache}}.
Also, making scanIfNeeded asynchronous may change the behavior of RPC calls: they may fail to find job information that could be found before.
I am thinking of another way to improve performance, which reduces the number of calls to scanIntermediateDirectory:
in getFileInfo, add a call to scanOldDirsForJob before scanIntermediateDirectory. This means calling scanOldDirsForJob twice: once before scanIntermediateDirectory and once after it.
{code}
  public HistoryFileInfo getFileInfo(JobId jobId) throws IOException {
    // FileInfo available in cache.
    HistoryFileInfo fileInfo = jobListCache.get(jobId);
    if (fileInfo != null) {
      return fileInfo;
    }
    // call scanOldDirsForJob before scanIntermediateDirectory
    fileInfo = scanOldDirsForJob(jobId);
    if (fileInfo != null) {
      return fileInfo;
    }
    // OK so scan the intermediate to be sure we did not lose it that way
    scanIntermediateDirectory();
    fileInfo = jobListCache.get(jobId);
    if (fileInfo != null) {
      return fileInfo;
    }
    // Intermediate directory does not contain job. Search through older ones.
    fileInfo = scanOldDirsForJob(jobId);
    if (fileInfo != null) {
      return fileInfo;
    }
    return null;
  }
{code}
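The effect of the lookup order proposed in the comment above can be checked with a small self-contained Java model. This is only a sketch: `LookupOrderSketch` and its fields are hypothetical, plain maps stand in for the cache and the directories, and strings stand in for `JobId` and `HistoryFileInfo`. The counter shows when the intermediate-directory scan is actually triggered.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the proposed getFileInfo ordering; not the real HistoryFileManager.
class LookupOrderSketch {
    final Map<String, String> jobListCache = new HashMap<>();
    final Map<String, String> intermediateDir = new HashMap<>();
    final Map<String, String> doneDir = new HashMap<>();
    int intermediateScans = 0; // how often the expensive scan ran

    private void scanIntermediateDirectory() {
        intermediateScans++;
        jobListCache.putAll(intermediateDir);
    }

    private String scanOldDirsForJob(String jobId) {
        return doneDir.get(jobId);
    }

    String getFileInfo(String jobId) {
        String info = jobListCache.get(jobId);
        if (info != null) {
            return info;
        }
        // Proposed extra check: look in the done directories first.
        info = scanOldDirsForJob(jobId);
        if (info != null) {
            return info;
        }
        scanIntermediateDirectory();
        info = jobListCache.get(jobId);
        if (info != null) {
            return info;
        }
        // The job may have been moved to done while the scan was running.
        return scanOldDirsForJob(jobId);
    }
}
```

In this model a job that is already in the done directories is returned without any intermediate scan, while a job that is only in the intermediate directory, or not present at all, still costs one scan per lookup.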
[jira] [Commented] (MAPREDUCE-6436) JobHistory cache issue
[ https://issues.apache.org/jira/browse/MAPREDUCE-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15059594#comment-15059594 ] zhihai xu commented on MAPREDUCE-6436: -- Just committed it to branch-2.8! > JobHistory cache issue > -- > > Key: MAPREDUCE-6436 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6436 > Project: Hadoop Map/Reduce > Issue Type: Improvement >Reporter: Ryu Kobayashi >Assignee: Kai Sasaki >Priority: Blocker > Fix For: 2.8.0, 2.7.3, 2.6.4 > > Attachments: MAPREDUCE-6436.1.patch, MAPREDUCE-6436.2.patch, > MAPREDUCE-6436.3.patch, MAPREDUCE-6436.4.patch, stacktrace1.txt, > stacktrace2.txt, stacktrace3.txt > > > Problem: > HistoryFileManager.addIfAbsent produces large amount of logs if number of > cached entries whose age is less than mapreduce.jobhistory.max-age-ms becomes > larger than mapreduce.jobhistory.joblist.cache.size by far. > Example: > For example, if the cache contains 5 entries in total and 10,000 entries > newer than mapreduce.jobhistory.max-age-ms where > mapreduce.jobhistory.joblist.cache.size is 2, > HistoryFileManager.addIfAbsent > method produces 5 - 2 = 3 lines of "Waiting to remove from > JobListCache because it is not in done yet" message. > It will attach a stacktrace. > Impact: > In addition to large disk consumption, this issue blocks JobHistory.getJob > long time and slows job execution down significantly because getJob is called > by RPC such as HistoryClientService.HSClientProtocolHandler.getJobReport. > This impact happens because HistoryFileManager.UserLogDir.scanIfNeeded > eventually calls HistoryFileManager.addIfAbsent in a synchronized block. When > multiple threads call scanIfNeeded simultaneously, one of them acquires lock > and the other threads are blocked until the first thread completes > long-running > HistoryFileManager.addIfAbsent call. > Solution: > * Reduce amount of logs so that HistoryFileManager.addIfAbsent doesn't take > too long time. 
> * Good to have if possible: HistoryFileManager.UserLogDir.scanIfNeeded skips > scanning if another thread is already scanning. This changes semantics of > some HistoryFileManager methods (such as getAllFileInfo and getFileInfo) > because scanIfNeeded keep outdated state. > * Good to have if possible: Make scanIfNeeded asynchronous so that RPC calls > are > not blocked by a loop at scale of tens of thousands. > > This patch implemented the first item. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
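Note on the second solution item quoted above (scanIfNeeded skips scanning if another thread is already scanning): one possible shape for this is a compare-and-set guard. The following is only a minimal sketch under assumptions. The class SkipIfScanningSketch and its methods are hypothetical names; they are not part of HistoryFileManager, and the committed patch implemented only the first item.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch; not the real HistoryFileManager.UserLogDir. */
final class SkipIfScanningSketch {
  // True while some caller is inside the scan.
  private final AtomicBoolean scanning = new AtomicBoolean(false);
  // Number of scans that actually ran (for illustration only).
  private final AtomicInteger scansRun = new AtomicInteger();

  /**
   * Runs the scan only if no other caller is currently scanning.
   * Returns false immediately (without blocking) when the scan is skipped,
   * so the caller then works with whatever state the last scan left behind.
   */
  boolean scanIfIdle(Runnable scan) {
    if (!scanning.compareAndSet(false, true)) {
      return false; // someone else is scanning; skip instead of waiting
    }
    try {
      scan.run();
      scansRun.incrementAndGet();
      return true;
    } finally {
      scanning.set(false); // always release the guard, even on exceptions
    }
  }

  int scansRun() {
    return scansRun.get();
  }

  public static void main(String[] args) {
    SkipIfScanningSketch guard = new SkipIfScanningSketch();
    boolean ran = guard.scanIfIdle(() -> System.out.println("scanning"));
    System.out.println("scan ran: " + ran);
  }
}
```

As the description itself points out, skipping changes the semantics of callers such as getAllFileInfo and getFileInfo, because a skipped caller may observe outdated state.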
[jira] [Commented] (MAPREDUCE-6436) JobHistory cache issue
[ https://issues.apache.org/jira/browse/MAPREDUCE-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15057236#comment-15057236 ] Hadoop QA commented on MAPREDUCE-6436: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 28s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 15s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 18s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 9s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 24s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 34s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 13s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} 
javadoc {color} | {color:green} 0m 16s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 21s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 15s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 15s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 18s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 18s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 8s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 24s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 42s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 12s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 16s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 5m 32s {color} | {color:red} hadoop-mapreduce-client-hs in the patch failed with JDK v1.8.0_66. 
{color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 5m 58s {color} | {color:green} hadoop-mapreduce-client-hs in the patch passed with JDK v1.7.0_91. {color} | | {color:red}-1{color} | {color:red} asflicense {color} | {color:red} 0m 20s {color} | {color:red} Patch generated 14 ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 25m 38s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_66 Failed junit tests | hadoop.mapreduce.v2.hs.TestHistoryFileManager | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12777649/MAPREDUCE-6436.4.patch | | JIRA Issue | MAPREDUCE-6436 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux ca50a68ad968 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014
[jira] [Commented] (MAPREDUCE-6436) JobHistory cache issue
[ https://issues.apache.org/jira/browse/MAPREDUCE-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15057241#comment-15057241 ] Kai Sasaki commented on MAPREDUCE-6436: --- [~djp] Yes, as described above, slowness in scanIfNeeded slows down HistoryClientService.HSClientProtocolHandler.getJobReport, which is called from the job client. In some cases, this causes a performance issue for the job. Usually, however, the job is returned from the JobListCache retained by HistoryFileManager, and in that case scanIntermediateDirectory is not required. So we cannot say that the performance issue occurs immediately whenever there are a lot of failed and pending job logs in the intermediate directory. I'm not sure whether we should set the JIRA as a blocker, though... > JobHistory cache issue > -- > > Key: MAPREDUCE-6436 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6436 > Project: Hadoop Map/Reduce > Issue Type: Bug >Reporter: Ryu Kobayashi >Assignee: Kai Sasaki > Attachments: MAPREDUCE-6436.1.patch, MAPREDUCE-6436.2.patch, > MAPREDUCE-6436.3.patch, MAPREDUCE-6436.4.patch, stacktrace1.txt, > stacktrace2.txt, stacktrace3.txt > > > Problem: > HistoryFileManager.addIfAbsent produces large amount of logs if number of > cached entries whose age is less than mapreduce.jobhistory.max-age-ms becomes > larger than mapreduce.jobhistory.joblist.cache.size by far. > Example: > For example, if the cache contains 5 entries in total and 10,000 entries > newer than mapreduce.jobhistory.max-age-ms where > mapreduce.jobhistory.joblist.cache.size is 2, > HistoryFileManager.addIfAbsent > method produces 5 - 2 = 3 lines of "Waiting to remove from > JobListCache because it is not in done yet" message. > It will attach a stacktrace. > Impact: > In addition to large disk consumption, this issue blocks JobHistory.getJob > long time and slows job execution down significantly because getJob is called > by RPC such as HistoryClientService.HSClientProtocolHandler.getJobReport.
> This impact happens because HistoryFileManager.UserLogDir.scanIfNeeded > eventually calls HistoryFileManager.addIfAbsent in a synchronized block. When > multiple threads call scanIfNeeded simultaneously, one of them acquires lock > and the other threads are blocked until the first thread completes > long-running > HistoryFileManager.addIfAbsent call. > Solution: > * Reduce amount of logs so that HistoryFileManager.addIfAbsent doesn't take > too long time. > * Good to have if possible: HistoryFileManager.UserLogDir.scanIfNeeded skips > scanning if another thread is already scanning. This changes semantics of > some HistoryFileManager methods (such as getAllFileInfo and getFileInfo) > because scanIfNeeded keep outdated state. > * Good to have if possible: Make scanIfNeeded asynchronous so that RPC calls > are > not blocked by a loop at scale of tens of thousands. > > This patch implemented the first item. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-6436) JobHistory cache issue
[ https://issues.apache.org/jira/browse/MAPREDUCE-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056154#comment-15056154 ] Junping Du commented on MAPREDUCE-6436: --- I think the impact of this issue could be more severe than our description above: "In addition to large disk consumption, this issue blocks JobHistory.getJob() long time and slows job execution down significantly because getJob is called by RPC such as HistoryClientService.HSClientProtocolHandler.getJobReport. This impact happens because HistoryFileManager.UserLogDir.scanIfNeeded eventually calls HistoryFileManager.addIfAbsent in a synchronized block. When multiple threads call scanIfNeeded simultaneously, one of them acquires lock and the other threads are blocked until the first thread completes long-running HistoryFileManager.addIfAbsent call." It could cause a serious OOM in the JHS, because a REST call of getJobs() could get blocked by some getJob() while unexpectedly caching other completedJob() in previous calls. Isn't it? [~zxu] and [~lewuathe], maybe we should set this JIRA as a blocker for 2.6.4 and 2.7.3? > JobHistory cache issue > -- > > Key: MAPREDUCE-6436 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6436 > Project: Hadoop Map/Reduce > Issue Type: Bug >Reporter: Ryu Kobayashi >Assignee: Kai Sasaki > Attachments: MAPREDUCE-6436.1.patch, MAPREDUCE-6436.2.patch, > MAPREDUCE-6436.3.patch, stacktrace1.txt, stacktrace2.txt, stacktrace3.txt > > > Problem: > HistoryFileManager.addIfAbsent produces large amount of logs if number of > cached entries whose age is less than mapreduce.jobhistory.max-age-ms becomes > larger than mapreduce.jobhistory.joblist.cache.size by far.
> Example: > For example, if the cache contains 5 entries in total and 10,000 entries > newer than mapreduce.jobhistory.max-age-ms where > mapreduce.jobhistory.joblist.cache.size is 2, > HistoryFileManager.addIfAbsent > method produces 5 - 2 = 3 lines of "Waiting to remove from > JobListCache because it is not in done yet" message. > It will attach a stacktrace. > Impact: > In addition to large disk consumption, this issue blocks JobHistory.getJob > long time and slows job execution down significantly because getJob is called > by RPC such as HistoryClientService.HSClientProtocolHandler.getJobReport. > This impact happens because HistoryFileManager.UserLogDir.scanIfNeeded > eventually calls HistoryFileManager.addIfAbsent in a synchronized block. When > multiple threads call scanIfNeeded simultaneously, one of them acquires lock > and the other threads are blocked until the first thread completes > long-running > HistoryFileManager.addIfAbsent call. > Solution: > * Reduce amount of logs so that HistoryFileManager.addIfAbsent doesn't take > too long time. > * Good to have if possible: HistoryFileManager.UserLogDir.scanIfNeeded skips > scanning if another thread is already scanning. This changes semantics of > some HistoryFileManager methods (such as getAllFileInfo and getFileInfo) > because scanIfNeeded keep outdated state. > * Good to have if possible: Make scanIfNeeded asynchronous so that RPC calls > are > not blocked by a loop at scale of tens of thousands. > > This patch implemented the first item. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-6436) JobHistory cache issue
[ https://issues.apache.org/jira/browse/MAPREDUCE-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15055461#comment-15055461 ] zhihai xu commented on MAPREDUCE-6436: -- Thanks for updating the patch [~lewuathe]! The new patch looks good except for the checkstyle issues:
{code}
./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/HistoryFileManager.java:265: Line is longer than 80 characters (found 97).
./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/HistoryFileManager.java:267: Line is longer than 80 characters (found 102).
./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/HistoryFileManager.java:268: Line is longer than 80 characters (found 118).
./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/HistoryFileManager.java:271: Line is longer than 80 characters (found 94).
./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/HistoryFileManager.java:272: Line is longer than 80 characters (found 114).
{code}
Could you fix the above checkstyle issues? > JobHistory cache issue > -- > > Key: MAPREDUCE-6436 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6436 > Project: Hadoop Map/Reduce > Issue Type: Bug >Reporter: Ryu Kobayashi >Assignee: Kai Sasaki > Attachments: MAPREDUCE-6436.1.patch, MAPREDUCE-6436.2.patch, > MAPREDUCE-6436.3.patch, stacktrace1.txt, stacktrace2.txt, stacktrace3.txt > > > Problem: > HistoryFileManager.addIfAbsent produces large amount of logs if number of > cached entries whose age is less than mapreduce.jobhistory.max-age-ms becomes > larger than mapreduce.jobhistory.joblist.cache.size by far.
> Example: > For example, if the cache contains 5 entries in total and 10,000 entries > newer than mapreduce.jobhistory.max-age-ms where > mapreduce.jobhistory.joblist.cache.size is 2, > HistoryFileManager.addIfAbsent > method produces 5 - 2 = 3 lines of "Waiting to remove from > JobListCache because it is not in done yet" message. > It will attach a stacktrace. > Impact: > In addition to large disk consumption, this issue blocks JobHistory.getJob > long time and slows job execution down significantly because getJob is called > by RPC such as HistoryClientService.HSClientProtocolHandler.getJobReport. > This impact happens because HistoryFileManager.UserLogDir.scanIfNeeded > eventually calls HistoryFileManager.addIfAbsent in a synchronized block. When > multiple threads call scanIfNeeded simultaneously, one of them acquires lock > and the other threads are blocked until the first thread completes > long-running > HistoryFileManager.addIfAbsent call. > Solution: > * Reduce amount of logs so that HistoryFileManager.addIfAbsent doesn't take > too long time. > * Good to have if possible: HistoryFileManager.UserLogDir.scanIfNeeded skips > scanning if another thread is already scanning. This changes semantics of > some HistoryFileManager methods (such as getAllFileInfo and getFileInfo) > because scanIfNeeded keep outdated state. > * Good to have if possible: Make scanIfNeeded asynchronous so that RPC calls > are > not blocked by a loop at scale of tens of thousands. > > This patch implemented the first item. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-6436) JobHistory cache issue
[ https://issues.apache.org/jira/browse/MAPREDUCE-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15055199#comment-15055199 ] zhihai xu commented on MAPREDUCE-6436: -- [~lewuathe], thanks for working on this issue. About the patch: we don't need to calculate the count for the entries being removed. Can we do all the calculations in the {{else}} section?
{code}
if (firstValue.didMoveFail()
    && firstValue.jobIndexInfo.getFinishTime() <= cutoff) {
  ...
} else {
  if (firstValue.didMoveFail()) {
    if (moveFailedCount == 0) {
      firstMoveFailedKey = key;
    }
    moveFailedCount += 1;
  } else {
    if (inIntermediateCount == 0) {
      firstInIntermediateKey = key;
    }
    inIntermediateCount += 1;
  }
}
{code}
> JobHistory cache issue > -- > > Key: MAPREDUCE-6436 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6436 > Project: Hadoop Map/Reduce > Issue Type: Bug >Reporter: Ryu Kobayashi >Assignee: Kai Sasaki > Attachments: MAPREDUCE-6436.1.patch, MAPREDUCE-6436.2.patch, > stacktrace1.txt, stacktrace2.txt, stacktrace3.txt > > > Problem: > HistoryFileManager.addIfAbsent produces large amount of logs if number of > cached entries whose age is less than mapreduce.jobhistory.max-age-ms becomes > larger than mapreduce.jobhistory.joblist.cache.size by far. > Example: > For example, if the cache contains 5 entries in total and 10,000 entries > newer than mapreduce.jobhistory.max-age-ms where > mapreduce.jobhistory.joblist.cache.size is 2, > HistoryFileManager.addIfAbsent > method produces 5 - 2 = 3 lines of "Waiting to remove from > JobListCache because it is not in done yet" message. > It will attach a stacktrace. > Impact: > In addition to large disk consumption, this issue blocks JobHistory.getJob > long time and slows job execution down significantly because getJob is called > by RPC such as HistoryClientService.HSClientProtocolHandler.getJobReport.
> This impact happens because HistoryFileManager.UserLogDir.scanIfNeeded > eventually calls HistoryFileManager.addIfAbsent in a synchronized block. When > multiple threads call scanIfNeeded simultaneously, one of them acquires lock > and the other threads are blocked until the first thread completes > long-running > HistoryFileManager.addIfAbsent call. > Solution: > * Reduce amount of logs so that HistoryFileManager.addIfAbsent doesn't take > too long time. > * Good to have if possible: HistoryFileManager.UserLogDir.scanIfNeeded skips > scanning if another thread is already scanning. This changes semantics of > some HistoryFileManager methods (such as getAllFileInfo and getFileInfo) > because scanIfNeeded keep outdated state. > * Good to have if possible: Make scanIfNeeded asynchronous so that RPC calls > are > not blocked by a loop at scale of tens of thousands. > > This patch implemented the first item. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
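The counting idea in the review comment above (track a count and the first key per category, then log a summary instead of one line per entry) could be completed roughly as follows. This is a sketch under stated assumptions, not the committed patch: EntrySketch, SummarizedLogSketch, the summarize method, and the message wording are all hypothetical; only the counter and first-key variable names are taken from the comment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a cached job entry; not the real Hadoop type. */
final class EntrySketch {
  final String key;
  final boolean moveFailed;
  final long finishTime;

  EntrySketch(String key, boolean moveFailed, long finishTime) {
    this.key = key;
    this.moveFailed = moveFailed;
    this.finishTime = finishTime;
  }
}

final class SummarizedLogSketch {
  /** Returns at most two summary lines instead of one line per retained entry. */
  static List<String> summarize(List<EntrySketch> surplus, long cutoff) {
    int moveFailedCount = 0;
    String firstMoveFailedKey = null;
    int inIntermediateCount = 0;
    String firstInIntermediateKey = null;

    for (EntrySketch e : surplus) {
      if (e.moveFailed && e.finishTime <= cutoff) {
        continue; // old enough to drop; real code would remove it from the cache
      }
      if (e.moveFailed) {
        if (moveFailedCount == 0) {
          firstMoveFailedKey = e.key;
        }
        moveFailedCount += 1;
      } else {
        if (inIntermediateCount == 0) {
          firstInIntermediateKey = e.key;
        }
        inIntermediateCount += 1;
      }
    }

    // Hypothetical message wording: one line per category, not per entry.
    List<String> lines = new ArrayList<>();
    if (moveFailedCount > 0) {
      lines.add("Waiting to remove MOVE_FAILED entries: count="
          + moveFailedCount + ", first=" + firstMoveFailedKey);
    }
    if (inIntermediateCount > 0) {
      lines.add("Waiting to remove IN_INTERMEDIATE entries: count="
          + inIntermediateCount + ", first=" + firstInIntermediateKey);
    }
    return lines;
  }

  public static void main(String[] args) {
    List<EntrySketch> surplus = Arrays.asList(
        new EntrySketch("job_1", true, 5L),
        new EntrySketch("job_2", false, 20L));
    for (String line : summarize(surplus, 10L)) {
      System.out.println(line);
    }
  }
}
```

With this shape the number of log lines per call is bounded by the number of categories, regardless of how far the cache exceeds mapreduce.jobhistory.joblist.cache.size.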
[jira] [Commented] (MAPREDUCE-6436) JobHistory cache issue
[ https://issues.apache.org/jira/browse/MAPREDUCE-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15055308#comment-15055308 ] Hadoop QA commented on MAPREDUCE-6436: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 20s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 18s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 19s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 11s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 25s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 39s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 15s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} 
javadoc {color} | {color:green} 0m 17s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 22s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 17s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 17s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 19s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 19s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 11s {color} | {color:red} Patch generated 5 new checkstyle issues in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs (total was 16, now 21). {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 25s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 47s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 15s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 17s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 5m 55s {color} | {color:green} hadoop-mapreduce-client-hs in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 6m 12s {color} | {color:green} hadoop-mapreduce-client-hs in the patch passed with JDK v1.7.0_91. {color} | | {color:red}-1{color} | {color:red} asflicense {color} | {color:red} 0m 23s {color} | {color:red} Patch generated 14 ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 27m 42s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12777379/MAPREDUCE-6436.3.patch | | JIRA Issue | MAPREDUCE-6436 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux 0098dd90cfb6 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3
[jira] [Commented] (MAPREDUCE-6436) JobHistory cache issue
[ https://issues.apache.org/jira/browse/MAPREDUCE-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15050335#comment-15050335 ] Kai Sasaki commented on MAPREDUCE-6436: --- [~zxu] Sorry for bothering you again. Could you review this? > JobHistory cache issue > -- > > Key: MAPREDUCE-6436 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6436 > Project: Hadoop Map/Reduce > Issue Type: Bug >Reporter: Ryu Kobayashi >Assignee: Ryu Kobayashi > Attachments: MAPREDUCE-6436.1.patch, MAPREDUCE-6436.2.patch, > stacktrace1.txt, stacktrace2.txt, stacktrace3.txt > > > Problem: > HistoryFileManager.addIfAbsent produces large amount of logs if number of > cached entries whose age is less than mapreduce.jobhistory.max-age-ms becomes > larger than mapreduce.jobhistory.joblist.cache.size by far. > Example: > For example, if the cache contains 5 entries in total and 10,000 entries > newer than mapreduce.jobhistory.max-age-ms where > mapreduce.jobhistory.joblist.cache.size is 2, > HistoryFileManager.addIfAbsent > method produces 5 - 2 = 3 lines of "Waiting to remove from > JobListCache because it is not in done yet" message. > It will attach a stacktrace. > Impact: > In addition to large disk consumption, this issue blocks JobHistory.getJob > long time and slows job execution down significantly because getJob is called > by RPC such as HistoryClientService.HSClientProtocolHandler.getJobReport. > This impact happens because HistoryFileManager.UserLogDir.scanIfNeeded > eventually calls HistoryFileManager.addIfAbsent in a synchronized block. When > multiple threads call scanIfNeeded simultaneously, one of them acquires lock > and the other threads are blocked until the first thread completes > long-running > HistoryFileManager.addIfAbsent call. > Solution: > * Reduce amount of logs so that HistoryFileManager.addIfAbsent doesn't take > too long time. 
> * Good to have if possible: HistoryFileManager.UserLogDir.scanIfNeeded skips > scanning if another thread is already scanning. This changes semantics of > some HistoryFileManager methods (such as getAllFileInfo and getFileInfo) > because scanIfNeeded keep outdated state. > * Good to have if possible: Make scanIfNeeded asynchronous so that RPC calls > are > not blocked by a loop at scale of tens of thousands. > > This patch implemented the first item. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-6436) JobHistory cache issue
[ https://issues.apache.org/jira/browse/MAPREDUCE-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035354#comment-15035354 ] Kai Sasaki commented on MAPREDUCE-6436: --- [~zxu] Hello, Zhihai. I'm sorry for the late response. [~ryu_kobayashi] asked me to take over this JIRA, so I'll update the current patch soon. Thank you. > JobHistory cache issue > -- > > Key: MAPREDUCE-6436 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6436 > Project: Hadoop Map/Reduce > Issue Type: Bug >Reporter: Ryu Kobayashi >Assignee: Ryu Kobayashi > Attachments: MAPREDUCE-6436.1.patch, stacktrace1.txt, > stacktrace2.txt, stacktrace3.txt > > > Problem: > HistoryFileManager.addIfAbsent produces large amount of logs if number of > cached entries whose age is less than mapreduce.jobhistory.max-age-ms becomes > larger than mapreduce.jobhistory.joblist.cache.size by far. > Example: > For example, if the cache contains 5 entries in total and 10,000 entries > newer than mapreduce.jobhistory.max-age-ms where > mapreduce.jobhistory.joblist.cache.size is 2, > HistoryFileManager.addIfAbsent > method produces 5 - 2 = 3 lines of "Waiting to remove from > JobListCache because it is not in done yet" message. > It will attach a stacktrace. > Impact: > In addition to large disk consumption, this issue blocks JobHistory.getJob > long time and slows job execution down significantly because getJob is called > by RPC such as HistoryClientService.HSClientProtocolHandler.getJobReport. > This impact happens because HistoryFileManager.UserLogDir.scanIfNeeded > eventually calls HistoryFileManager.addIfAbsent in a synchronized block. When > multiple threads call scanIfNeeded simultaneously, one of them acquires lock > and the other threads are blocked until the first thread completes > long-running > HistoryFileManager.addIfAbsent call. > Solution: > * Reduce amount of logs so that HistoryFileManager.addIfAbsent doesn't take > too long time.
> * Good to have if possible: HistoryFileManager.UserLogDir.scanIfNeeded skips > scanning if another thread is already scanning. This changes semantics of > some HistoryFileManager methods (such as getAllFileInfo and getFileInfo) > because scanIfNeeded keep outdated state. > * Good to have if possible: Make scanIfNeeded asynchronous so that RPC calls > are > not blocked by a loop at scale of tens of thousands. > > This patch implemented the first item. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-6436) JobHistory cache issue
[ https://issues.apache.org/jira/browse/MAPREDUCE-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035392#comment-15035392 ] Hadoop QA commented on MAPREDUCE-6436: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 4s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 18s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 18s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 10s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 24s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 37s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 13s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} 
javadoc {color} | {color:green} 0m 17s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 22s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 17s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 17s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 19s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 19s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 10s {color} | {color:red} Patch generated 1 new checkstyle issues in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs (total was 16, now 17). {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 25s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s {color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 45s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 14s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 16s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 5m 48s {color} | {color:green} hadoop-mapreduce-client-hs in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 6m 12s {color} | {color:green} hadoop-mapreduce-client-hs in the patch passed with JDK v1.7.0_85. {color} | | {color:red}-1{color} | {color:red} asflicense {color} | {color:red} 0m 22s {color} | {color:red} Patch generated 14 ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 27m 12s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12745774/MAPREDUCE-6436.1.patch | | JIRA Issue | MAPREDUCE-6436 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux d5be684a9740 3.13.0-36-lowlatency
[jira] [Commented] (MAPREDUCE-6436) JobHistory cache issue
[ https://issues.apache.org/jira/browse/MAPREDUCE-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035395#comment-15035395 ]

Hadoop QA commented on MAPREDUCE-6436:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 44s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 19s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 20s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 10s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 28s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 39s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 15s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 23s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 20s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 20s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 20s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 20s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 10s {color} | {color:red} Patch generated 5 new checkstyle issues in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs (total was 16, now 21). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 48s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 16s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 17s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 6m 17s {color} | {color:green} hadoop-mapreduce-client-hs in the patch passed with JDK v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 6m 19s {color} | {color:green} hadoop-mapreduce-client-hs in the patch passed with JDK v1.7.0_91. {color} |
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red} 0m 24s {color} | {color:red} Patch generated 14 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 28m 53s {color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:0ca8df7 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12775224/MAPREDUCE-6436.2.patch |
| JIRA Issue | MAPREDUCE-6436 |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
| uname | Linux 421014cad788 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3
[jira] [Commented] (MAPREDUCE-6436) JobHistory cache issue
[ https://issues.apache.org/jira/browse/MAPREDUCE-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632630#comment-14632630 ]

zhihai xu commented on MAPREDUCE-6436:
--

Thanks for working on this issue [~ryu_kobayashi]!
It looks like the log is only printed for a HistoryFileInfo in state {{IN_INTERMEDIATE}} or {{MOVE_FAILED}}. I think most HistoryFileInfo entries should be in state {{IN_DONE}}.

bq. HistoryFileManager.addIfAbsent method produces 5 - 2 = 3 lines of "Waiting to remove <key> from JobListCache because it is not in done yet" message

The above statement may not be a valid case unless you have a performance issue in HDFS that causes {{HistoryFileInfo#moveToDone}} to take a very long time. The direct cause of your issue may be an HDFS performance issue, but we can improve the logging to print fewer messages.

About your patch: changing {{scanIfNeeded}} to be non-blocking may not be good, because the following code in {{HistoryFileManager#getFileInfo}} expects {{jobListCache}} to contain the entry for the given job after {{scanIntermediateDirectory}} returns, which requires {{scanIfNeeded}} to block.
{code}
// OK so scan the intermediate to be sure we did not lose it that way
scanIntermediateDirectory();
fileInfo = jobListCache.get(jobId);
if (fileInfo != null) {
  return fileInfo;
}
{code}
Also, the implementation of {{scanIfNeeded}} makes sure that {{scanIntermediateDirectory(p)}} is called only once for a given modification time.
{code}
if (modTime != newModTime) {
  Path p = fs.getPath();
  try {
    scanIntermediateDirectory(p);
    // If scanning fails, we will scan again. We assume the failure is
    // temporary.
    modTime = newModTime;
  } catch (IOException e) {
    LOG.error("Error while trying to scan the directory " + p, e);
  }
} else {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Scan not needed of " + fs.getPath());
  }
}
{code}
So the performance overhead of {{scanIfNeeded}} should not be large. We can make a patch that prints fewer log messages.
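To illustrate the point about {{scanIfNeeded}}, here is a minimal, self-contained sketch of a synchronized scan guarded by a modification time. It is not the Hadoop code: the class {{ModTimeGuardedScanner}}, the {{Scan}} interface and the counter are hypothetical stand-ins, and the directory is reduced to a bare modification time. It only shows that concurrent callers with the same modification time trigger a single scan, and that a failed scan is retried on the next call.

```java
import java.io.IOException;

/** Hypothetical stand-in for the scanning logic; not the actual Hadoop class. */
class ModTimeGuardedScanner {
  /** Hypothetical callback standing in for scanIntermediateDirectory(p). */
  interface Scan {
    void run() throws IOException;
  }

  private long modTime = 0;
  private int scanCount = 0;

  /** Runs the scan only when the modification time differs from the last successful scan. */
  synchronized boolean scanIfNeeded(long newModTime, Scan scan) {
    if (modTime == newModTime) {
      return false; // nothing changed, so the caller returns quickly
    }
    try {
      scan.run();
      scanCount++;
      modTime = newModTime; // recorded only after a successful scan
      return true;
    } catch (IOException e) {
      return false; // modTime is unchanged, so the next call scans again
    }
  }

  synchronized int getScanCount() {
    return scanCount;
  }
}

class ModTimeGuardDemo {
  public static void main(String[] args) throws InterruptedException {
    ModTimeGuardedScanner scanner = new ModTimeGuardedScanner();
    Thread[] threads = new Thread[8];
    for (int i = 0; i < threads.length; i++) {
      // Every thread reports the same modification time.
      threads[i] = new Thread(() -> scanner.scanIfNeeded(42L, () -> { }));
      threads[i].start();
    }
    for (Thread t : threads) {
      t.join();
    }
    System.out.println("scans performed: " + scanner.getScanCount());
  }
}
```

The other callers still wait for the lock while the first scan runs, which is the blocking that {{getFileInfo}} relies on; what they avoid is repeating the scan.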
The following log is printed for a HistoryFileInfo in either the {{IN_INTERMEDIATE}} state or the {{MOVE_FAILED}} state. Can we add two counters, one for {{IN_INTERMEDIATE}} and the other for {{MOVE_FAILED}}? We can also save the first key for a HistoryFileInfo in state {{IN_INTERMEDIATE}} and the first key for one in state {{MOVE_FAILED}}, and print these two keys in the logs.
{code}
} else {
  LOG.warn("Waiting to remove " + key
      + " from JobListCache because it is not in done yet.");
}
{code}

JobHistory cache issue
--

Key: MAPREDUCE-6436
URL: https://issues.apache.org/jira/browse/MAPREDUCE-6436
Project: Hadoop Map/Reduce
Issue Type: Bug
Reporter: Ryu Kobayashi
Assignee: Ryu Kobayashi
Attachments: MAPREDUCE-6436.1.patch, stacktrace1.txt, stacktrace2.txt, stacktrace3.txt

Problem:
HistoryFileManager.addIfAbsent produces large amount of logs if number of cached entries whose age is less than mapreduce.jobhistory.max-age-ms becomes larger than mapreduce.jobhistory.joblist.cache.size by far.

Example:
For example, if the cache contains 5 entries in total and 10,000 entries newer than mapreduce.jobhistory.max-age-ms where mapreduce.jobhistory.joblist.cache.size is 2, HistoryFileManager.addIfAbsent method produces 5 - 2 = 3 lines of "Waiting to remove <key> from JobListCache because it is not in done yet" message.
It will attach a stacktrace.

Impact:
In addition to large disk consumption, this issue blocks JobHistory.getJob long time and slows job execution down significantly because getJob is called by RPC such as HistoryClientService.HSClientProtocolHandler.getJobReport.
This impact happens because HistoryFileManager.UserLogDir.scanIfNeeded eventually calls HistoryFileManager.addIfAbsent in a synchronized block. When multiple threads call scanIfNeeded simultaneously, one of them acquires lock and the other threads are blocked until the first thread completes long-running HistoryFileManager.addIfAbsent call.

Solution:
* Reduce amount of logs so that HistoryFileManager.addIfAbsent doesn't take too long time.
* Good to have if possible: HistoryFileManager.UserLogDir.scanIfNeeded skips scanning if another thread is already scanning. This changes semantics of some HistoryFileManager methods (such as getAllFileInfo and getFileInfo) because scanIfNeeded keep outdated state.
* Good to have if possible: Make scanIfNeeded asynchronous so that RPC calls are not blocked by a loop at scale of tens of thousands.

This patch implemented the first item.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
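The counter suggestion in the comment above (one counter and one saved first key per state, in place of one warning per entry) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the committed patch: {{PendingRemovalSummary}}, its {{State}} enum and the wording of the summary line are hypothetical, and only the state names come from the comment.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Hadoop: aggregates "not in done yet" warnings. */
class PendingRemovalSummary {
  /** State names as mentioned in the comment; the enum itself is a stand-in. */
  enum State { IN_INTERMEDIATE, IN_DONE, MOVE_FAILED }

  private int inIntermediateCount;
  private int moveFailedCount;
  private String firstInIntermediateKey;
  private String firstMoveFailedKey;

  /** Records one cache entry that cannot be removed yet; other states are ignored. */
  void record(String key, State state) {
    if (state == State.IN_INTERMEDIATE) {
      if (inIntermediateCount++ == 0) {
        firstInIntermediateKey = key;
      }
    } else if (state == State.MOVE_FAILED) {
      if (moveFailedCount++ == 0) {
        firstMoveFailedKey = key;
      }
    }
  }

  /** Builds at most two summary lines instead of one line per entry. */
  List<String> summaryLines() {
    List<String> lines = new ArrayList<>();
    if (inIntermediateCount > 0) {
      lines.add(format(State.IN_INTERMEDIATE, firstInIntermediateKey, inIntermediateCount));
    }
    if (moveFailedCount > 0) {
      lines.add(format(State.MOVE_FAILED, firstMoveFailedKey, moveFailedCount));
    }
    return lines;
  }

  // The message wording is an assumption made for this sketch.
  private static String format(State state, String firstKey, int count) {
    return "Waiting to remove " + count + " " + state + " entries (first key: "
        + firstKey + ") from JobListCache because they are not in done yet.";
  }
}

class PendingRemovalSummaryDemo {
  public static void main(String[] args) {
    // Hypothetical cache contents, keyed by a made-up job id.
    Map<String, PendingRemovalSummary.State> cache = new LinkedHashMap<>();
    cache.put("job_1", PendingRemovalSummary.State.IN_INTERMEDIATE);
    cache.put("job_2", PendingRemovalSummary.State.MOVE_FAILED);
    cache.put("job_3", PendingRemovalSummary.State.IN_INTERMEDIATE);

    PendingRemovalSummary summary = new PendingRemovalSummary();
    for (Map.Entry<String, PendingRemovalSummary.State> e : cache.entrySet()) {
      summary.record(e.getKey(), e.getValue());
    }
    // One line per state is printed, regardless of how many entries were recorded.
    for (String line : summary.summaryLines()) {
      System.out.println(line);
    }
  }
}
```

With this shape the number of log lines per pass is bounded by the number of tracked states rather than by the number of cache entries, which is the effect the first Solution item asks for.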
[jira] [Commented] (MAPREDUCE-6436) JobHistory cache issue
[ https://issues.apache.org/jira/browse/MAPREDUCE-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630890#comment-14630890 ]

Hadoop QA commented on MAPREDUCE-6436:
--

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 16m 18s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac | 8m 4s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 10m 12s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 0m 30s | The applied patch generated 1 new checkstyle issues (total was 16, now 17). |
| {color:red}-1{color} | whitespace | 0m 0s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install | 1m 25s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 0m 53s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | mapreduce tests | 5m 53s | Tests passed in hadoop-mapreduce-client-hs. |
| | | 44m 15s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12745774/MAPREDUCE-6436.1.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / ee36f4f |
| checkstyle | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5894/artifact/patchprocess/diffcheckstylehadoop-mapreduce-client-hs.txt |
| whitespace | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5894/artifact/patchprocess/whitespace.txt |
| hadoop-mapreduce-client-hs test log | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5894/artifact/patchprocess/testrun_hadoop-mapreduce-client-hs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5894/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5894/console |

This message was automatically generated.