[jira] [Commented] (TEZ-3967) DAGImpl: dag lock is unfair and can starve the writers
[ https://issues.apache.org/jira/browse/TEZ-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102353#comment-17102353 ] László Bodor commented on TEZ-3967: --- [~jeagles], [~gopalv], [~jeagles]: could you please look back to this patch once TEZ-4171 is resolved? I can see the [^TEZ-3967.03.patch] basically solves a lock contention problem by generating dag status periodically and returning a cached value, however, regarding TEZ-4171 I realized again that the dag state enum is a "privileged" property, which should contain the most up-to-date value which is possible... currently, in [^TEZ-3967.03.patch], i simply return the cached value like: {code} public DAGStatusBuilder getStatus() { countGetDagStatus += 1; return cachedDagStatusBuilder; } {code} but I would change it in a way that force the returned state to be the latest (no matter of the additional read-lock compared to the original optimization), otherwise, this patch could ruin the optimization of e.g. TEZ-4171 I think {code} public DAGStatusBuilder getStatus() { countGetDagStatus += 1; cachedDagStatusBuilder.setState(getState()); return cachedDagStatusBuilder; } {code} I don't think I can optimize this further at the moment, only if we implement an endpoint which returns only the state, I guess most of the time while optimizing this part (e.g. TEZ-4171), client is only interested in the state, not the counters (querying it in every ~100ms) > DAGImpl: dag lock is unfair and can starve the writers > -- > > Key: TEZ-3967 > URL: https://issues.apache.org/jira/browse/TEZ-3967 > Project: Apache Tez > Issue Type: Bug >Reporter: Gopal Vijayaraghavan >Assignee: László Bodor >Priority: Major > Attachments: TEZ-3967.01.patch, TEZ-3967.02.patch, TEZ-3967.03.patch > > > Found when debugging HIVE-20103, that a reader arriving when another reader > is active can postpone a writer from obtaining a write-lock. > This is fundamentally bad for the DAGImpl as useful progress can only happen > when the writeLock is held. > {code} > public void handle(DAGEvent event) { > ... > try { > writeLock.lock(); > {code} > {code} >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x7efb02246f40> (a > java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) > at > java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943) > at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:1162) > at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:149) > at > org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:2251) > at > org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:2242) > at > org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:180) > at > org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:115) > at java.lang.Thread.run(Thread.java:745) > {code} > while read-lock is passed around between > {code} >at > org.apache.tez.dag.app.dag.impl.DAGImpl.getDAGStatus(DAGImpl.java:901) > at > org.apache.tez.dag.app.dag.impl.DAGImpl.getDAGStatus(DAGImpl.java:940) > at > org.apache.tez.dag.api.client.DAGClientHandler.getDAGStatus(DAGClientHandler.java:73) > {code} > calls. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TEZ-3967) DAGImpl: dag lock is unfair and can starve the writers
[ https://issues.apache.org/jira/browse/TEZ-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17095928#comment-17095928 ] Jonathan Turner Eagles commented on TEZ-3967: - Going to prioritize this review for tomorrow. Thanks for your patience > DAGImpl: dag lock is unfair and can starve the writers > -- > > Key: TEZ-3967 > URL: https://issues.apache.org/jira/browse/TEZ-3967 > Project: Apache Tez > Issue Type: Bug >Reporter: Gopal Vijayaraghavan >Assignee: László Bodor >Priority: Major > Attachments: TEZ-3967.01.patch, TEZ-3967.02.patch, TEZ-3967.03.patch > > > Found when debugging HIVE-20103, that a reader arriving when another reader > is active can postpone a writer from obtaining a write-lock. > This is fundamentally bad for the DAGImpl as useful progress can only happen > when the writeLock is held. > {code} > public void handle(DAGEvent event) { > ... > try { > writeLock.lock(); > {code} > {code} >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x7efb02246f40> (a > java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) > at > java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943) > at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:1162) > at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:149) > at > org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:2251) > at > org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:2242) > at > org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:180) > at > org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:115) > at java.lang.Thread.run(Thread.java:745) > {code} > while read-lock is passed around between > {code} >at > org.apache.tez.dag.app.dag.impl.DAGImpl.getDAGStatus(DAGImpl.java:901) > at > org.apache.tez.dag.app.dag.impl.DAGImpl.getDAGStatus(DAGImpl.java:940) > at > org.apache.tez.dag.api.client.DAGClientHandler.getDAGStatus(DAGClientHandler.java:73) > {code} > calls. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TEZ-3967) DAGImpl: dag lock is unfair and can starve the writers
[ https://issues.apache.org/jira/browse/TEZ-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17088683#comment-17088683 ] TezQA commented on TEZ-3967: | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 9m 14s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 1s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} master Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 22s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 19s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 11s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 7s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 19s{color} | {color:green} master passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 9s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 36s{color} | {color:green} master passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 8s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 48s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 45s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 45s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 9s{color} | {color:orange} tez-common: The patch generated 1 new + 12 unchanged - 0 fixed = 13 total (was 12) {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 19s{color} | {color:orange} tez-dag: The patch generated 2 new + 129 unchanged - 0 fixed = 131 total (was 129) {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 54s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 34s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 42s{color} | {color:green} tez-api in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 24s{color} | {color:green} tez-common in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 4m 38s{color} | {color:red} tez-dag in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 22s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 34m 19s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | tez.dag.app.TestSpeculation | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.8 Server=19.03.8 base: https://builds.apache.org/job/PreCommit-TEZ-Build/362/artifact/out/Dockerfile | | JIRA Issue | TEZ-3967 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13000702/TEZ-3967.03.patch | | Optional Tests | dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile | | uname | Linux a85563ff7ae8 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | personality/tez.sh | | git revision | master / e7c24f0 | | Default Java | 1.8.0_242 | | checkstyle | https://builds.apache.org/job/PreCommit-TEZ-Build/362/artifact/out/di
[jira] [Commented] (TEZ-3967) DAGImpl: dag lock is unfair and can starve the writers
[ https://issues.apache.org/jira/browse/TEZ-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17088651#comment-17088651 ] László Bodor commented on TEZ-3967: --- [~jeagles]: in [^TEZ-3967.03.patch], there is a pluggable "instrumented" read-write lock, what I've used in the unit tests, for instance: {code} mvn test -Dtest=TestDAGImpl#testGroupDAGCompletionWithCommitSuccess -pl tez-dag {code} I can see that the numbers were better in case of the new approach, let me share some numbers I believe the most important is "AVG_LOCK_BLOCKED" kind of numbers, which shows the average time spent before and after calling .lock() (I guess this is the best way to check for how much time a thread is blocked blocked because of locks) with an average 1ms improvement per lock, it can add up into a major improvement for a bigger/longer dag (considering that testGroupDAGCompletionWithCommitSuccess is a quite short-living unit test) WITHOUT THE PATCH: average blocked time for write lock: 2.8, 3.42, 2.82 {code} READ_AVG_LOCK_BLOCKED_TIMES_MS: 2.1760985915492954, READ_COUNT_LOCK_ACQUIRED_BY_LOCK_METHOD: 71, READ_MAX_LOCK_BLOCKED_TIME_MS: 15.031, READ_SUM_LOCK_BLOCKED_TIMES_MS: 154.503, WRITE_AVG_LOCK_BLOCKED_TIMES_MS: 2.809090909090909, WRITE_COUNT_LOCK_ACQUIRED_BY_LOCK_METHOD: 11, WRITE_MAX_LOCK_BLOCKED_TIME_MS: 16.09, WRITE_SUM_LOCK_BLOCKED_TIME_MS: 30.9] READ_AVG_LOCK_BLOCKED_TIMES_MS: 2.330492957746479, READ_COUNT_LOCK_ACQUIRED_BY_LOCK_METHOD: 71, READ_MAX_LOCK_BLOCKED_TIME_MS: 18.983, READ_SUM_LOCK_BLOCKED_TIMES_MS: 165.465, WRITE_AVG_LOCK_BLOCKED_TIMES_MS: 3.4274545454545455, WRITE_COUNT_LOCK_ACQUIRED_BY_LOCK_METHOD: 11, WRITE_MAX_LOCK_BLOCKED_TIME_MS: 17.092, WRITE_SUM_LOCK_BLOCKED_TIME_MS: 37.702] READ_AVG_LOCK_BLOCKED_TIMES_MS: 2.427774647887324, READ_COUNT_LOCK_ACQUIRED_BY_LOCK_METHOD: 71, READ_MAX_LOCK_BLOCKED_TIME_MS: 15.377, READ_SUM_LOCK_BLOCKED_TIMES_MS: 172.372, WRITE_AVG_LOCK_BLOCKED_TIMES_MS: 2.8256363636363635, WRITE_COUNT_LOCK_ACQUIRED_BY_LOCK_METHOD: 11, WRITE_MAX_LOCK_BLOCKED_TIME_MS: 16.377, WRITE_SUM_LOCK_BLOCKED_TIME_MS: 31.082] {code} WITH THE PATCH: average blocked time for write lock: 2.01, 1.51, 1.65 {code} READ_AVG_LOCK_BLOCKED_TIMES_MS: 1.8155, READ_COUNT_LOCK_ACQUIRED_BY_LOCK_METHOD: 60, READ_MAX_LOCK_BLOCKED_TIME_MS: 6.974, READ_SUM_LOCK_BLOCKED_TIMES_MS: 108.93, WRITE_AVG_LOCK_BLOCKED_TIMES_MS: 2.0169, WRITE_COUNT_LOCK_ACQUIRED_BY_LOCK_METHOD: 10, WRITE_MAX_LOCK_BLOCKED_TIME_MS: 4.047, WRITE_SUM_LOCK_BLOCKED_TIME_MS: 20.169] READ_AVG_LOCK_BLOCKED_TIMES_MS: 1.46175, READ_COUNT_LOCK_ACQUIRED_BY_LOCK_METHOD: 60, READ_MAX_LOCK_BLOCKED_TIME_MS: 5.533, READ_SUM_LOCK_BLOCKED_TIMES_MS: 87.705, WRITE_AVG_LOCK_BLOCKED_TIMES_MS: 1.5105, WRITE_COUNT_LOCK_ACQUIRED_BY_LOCK_METHOD: 10, WRITE_MAX_LOCK_BLOCKED_TIME_MS: 3.642, WRITE_SUM_LOCK_BLOCKED_TIME_MS: 15.105] READ_AVG_LOCK_BLOCKED_TIMES_MS: 1.50095, READ_COUNT_LOCK_ACQUIRED_BY_LOCK_METHOD: 60, READ_MAX_LOCK_BLOCKED_TIME_MS: 4.009, READ_SUM_LOCK_BLOCKED_TIMES_MS: 90.057, WRITE_AVG_LOCK_BLOCKED_TIMES_MS: 1.6573, WRITE_COUNT_LOCK_ACQUIRED_BY_LOCK_METHOD: 10, WRITE_MAX_LOCK_BLOCKED_TIME_MS: 5.042, WRITE_SUM_LOCK_BLOCKED_TIME_MS: 16.573] {code} > DAGImpl: dag lock is unfair and can starve the writers > -- > > Key: TEZ-3967 > URL: https://issues.apache.org/jira/browse/TEZ-3967 > Project: Apache Tez > Issue Type: Bug >Reporter: Gopal Vijayaraghavan >Assignee: László Bodor >Priority: Major > Attachments: TEZ-3967.01.patch, TEZ-3967.02.patch, TEZ-3967.03.patch > > > Found when debugging HIVE-20103, that a reader arriving when another reader > is active can postpone a writer from obtaining a write-lock. > This is fundamentally bad for the DAGImpl as useful progress can only happen > when the writeLock is held. > {code} > public void handle(DAGEvent event) { > ... > try { > writeLock.lock(); > {code} > {code} >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x7efb02246f40> (a > java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) > at > java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943) > at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:1162) > at org
[jira] [Commented] (TEZ-3967) DAGImpl: dag lock is unfair and can starve the writers
[ https://issues.apache.org/jira/browse/TEZ-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17087862#comment-17087862 ] László Bodor commented on TEZ-3967: --- [~jeagles]: to be honest, I haven't tested this on cluster, but if patch looks good, I'll check it, let me find a way to measure lock contention somehow > DAGImpl: dag lock is unfair and can starve the writers > -- > > Key: TEZ-3967 > URL: https://issues.apache.org/jira/browse/TEZ-3967 > Project: Apache Tez > Issue Type: Bug >Reporter: Gopal Vijayaraghavan >Assignee: László Bodor >Priority: Major > Attachments: TEZ-3967.01.patch, TEZ-3967.02.patch > > > Found when debugging HIVE-20103, that a reader arriving when another reader > is active can postpone a writer from obtaining a write-lock. > This is fundamentally bad for the DAGImpl as useful progress can only happen > when the writeLock is held. > {code} > public void handle(DAGEvent event) { > ... > try { > writeLock.lock(); > {code} > {code} >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x7efb02246f40> (a > java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) > at > java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943) > at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:1162) > at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:149) > at > org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:2251) > at > org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:2242) > at > org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:180) > at > org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:115) > at java.lang.Thread.run(Thread.java:745) > {code} > while read-lock is passed around between > {code} >at > org.apache.tez.dag.app.dag.impl.DAGImpl.getDAGStatus(DAGImpl.java:901) > at > org.apache.tez.dag.app.dag.impl.DAGImpl.getDAGStatus(DAGImpl.java:940) > at > org.apache.tez.dag.api.client.DAGClientHandler.getDAGStatus(DAGClientHandler.java:73) > {code} > calls. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TEZ-3967) DAGImpl: dag lock is unfair and can starve the writers
[ https://issues.apache.org/jira/browse/TEZ-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17087806#comment-17087806 ] Jonathan Turner Eagles commented on TEZ-3967: - [~abstractdog], what did you find as far as lock contention goes with the patch vs without. I seems like this kind of approach will be better as reads should not need to build the cache with the readlock anymore, but numbers will prove this better. > DAGImpl: dag lock is unfair and can starve the writers > -- > > Key: TEZ-3967 > URL: https://issues.apache.org/jira/browse/TEZ-3967 > Project: Apache Tez > Issue Type: Bug >Reporter: Gopal Vijayaraghavan >Assignee: László Bodor >Priority: Major > Attachments: TEZ-3967.01.patch, TEZ-3967.02.patch > > > Found when debugging HIVE-20103, that a reader arriving when another reader > is active can postpone a writer from obtaining a write-lock. > This is fundamentally bad for the DAGImpl as useful progress can only happen > when the writeLock is held. > {code} > public void handle(DAGEvent event) { > ... > try { > writeLock.lock(); > {code} > {code} >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x7efb02246f40> (a > java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) > at > java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943) > at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:1162) > at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:149) > at > org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:2251) > at > org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:2242) > at > org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:180) > at > org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:115) > at java.lang.Thread.run(Thread.java:745) > {code} > while read-lock is passed around between > {code} >at > org.apache.tez.dag.app.dag.impl.DAGImpl.getDAGStatus(DAGImpl.java:901) > at > org.apache.tez.dag.app.dag.impl.DAGImpl.getDAGStatus(DAGImpl.java:940) > at > org.apache.tez.dag.api.client.DAGClientHandler.getDAGStatus(DAGClientHandler.java:73) > {code} > calls. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TEZ-3967) DAGImpl: dag lock is unfair and can starve the writers
[ https://issues.apache.org/jira/browse/TEZ-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17087761#comment-17087761 ] László Bodor commented on TEZ-3967: --- [~gopalv], [~jeagles]: could you please take a look? > DAGImpl: dag lock is unfair and can starve the writers > -- > > Key: TEZ-3967 > URL: https://issues.apache.org/jira/browse/TEZ-3967 > Project: Apache Tez > Issue Type: Bug >Reporter: Gopal Vijayaraghavan >Assignee: László Bodor >Priority: Major > Attachments: TEZ-3967.01.patch, TEZ-3967.02.patch > > > Found when debugging HIVE-20103, that a reader arriving when another reader > is active can postpone a writer from obtaining a write-lock. > This is fundamentally bad for the DAGImpl as useful progress can only happen > when the writeLock is held. > {code} > public void handle(DAGEvent event) { > ... > try { > writeLock.lock(); > {code} > {code} >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x7efb02246f40> (a > java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) > at > java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943) > at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:1162) > at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:149) > at > org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:2251) > at > org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:2242) > at > org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:180) > at > org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:115) > at java.lang.Thread.run(Thread.java:745) > {code} > while read-lock is passed around between > {code} >at > org.apache.tez.dag.app.dag.impl.DAGImpl.getDAGStatus(DAGImpl.java:901) > at > org.apache.tez.dag.app.dag.impl.DAGImpl.getDAGStatus(DAGImpl.java:940) > at > org.apache.tez.dag.api.client.DAGClientHandler.getDAGStatus(DAGClientHandler.java:73) > {code} > calls. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TEZ-3967) DAGImpl: dag lock is unfair and can starve the writers
[ https://issues.apache.org/jira/browse/TEZ-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17085027#comment-17085027 ] TezQA commented on TEZ-3967: | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 35s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} master Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 23s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 8s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 55s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 54s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 57s{color} | {color:green} master passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 8s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 41s{color} | {color:green} master passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 9s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 31s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 30s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 9s{color} | {color:orange} tez-common: The patch generated 1 new + 12 unchanged - 0 fixed = 13 total (was 12) {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 20s{color} | {color:orange} tez-dag: The patch generated 4 new + 129 unchanged - 0 fixed = 133 total (was 129) {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 33s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 2s{color} | {color:red} tez-dag generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 25s{color} | {color:green} tez-common in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 4m 27s{color} | {color:green} tez-dag in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 17s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 19m 39s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | FindBugs | module:tez-dag | | | Unread field:field be static? At DAGImpl.java:[line 2477] | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.8 Server=19.03.8 base: https://builds.apache.org/job/PreCommit-TEZ-Build/355/artifact/out/Dockerfile | | JIRA Issue | TEZ-3967 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13000167/TEZ-3967.02.patch | | Optional Tests | dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile | | uname | Linux 209ac2d76222 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | personality/tez.sh | | git revision | master / 214c544 | | Default Java | 1.8.0_242 | | checkstyle | https://builds.apache.org/job/PreCommit-TEZ-Build/355/artifact/out/diff-checkstyle-tez-common.txt | | checkstyle |
[jira] [Commented] (TEZ-3967) DAGImpl: dag lock is unfair and can starve the writers
[ https://issues.apache.org/jira/browse/TEZ-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17085020#comment-17085020 ] TezQA commented on TEZ-3967: | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 10m 44s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} master Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 3m 3s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 48s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 55s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 40s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 8s{color} | {color:green} master passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 8s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 11s{color} | {color:green} master passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 9s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 32s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 30s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 9s{color} | {color:orange} tez-common: The patch generated 1 new + 12 unchanged - 0 fixed = 13 total (was 12) {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 19s{color} | {color:orange} tez-dag: The patch generated 4 new + 129 unchanged - 0 fixed = 133 total (was 129) {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 32s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 4s{color} | {color:red} tez-dag generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 31s{color} | {color:green} tez-common in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 4m 19s{color} | {color:green} tez-dag in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 26s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 37m 40s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | FindBugs | module:tez-dag | | | Unread field:field be static? At DAGImpl.java:[line 2477] | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.8 Server=19.03.8 base: https://builds.apache.org/job/PreCommit-TEZ-Build/354/artifact/out/Dockerfile | | JIRA Issue | TEZ-3967 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13000167/TEZ-3967.02.patch | | Optional Tests | dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile | | uname | Linux 0fb113e146dd 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | personality/tez.sh | | git revision | master / 214c544 | | Default Java | 1.8.0_242 | | checkstyle | https://builds.apache.org/job/PreCommit-TEZ-Build/354/artifact/out/diff-checkstyle-tez-common.txt | | checkstyle |
[jira] [Commented] (TEZ-3967) DAGImpl: dag lock is unfair and can starve the writers
[ https://issues.apache.org/jira/browse/TEZ-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084977#comment-17084977 ] László Bodor commented on TEZ-3967: --- [~jeagles]: attached [^TEZ-3967.02.patch] , which contains the following basically: 1. DagStatusHandler: which encapsulates the dag status refresh logic 2. I noticed that getCachedCounters were invalid, [it didn't refresh cachedCounters at all|https://github.com/apache/tez/blob/214c544472f80997755b62c382f016fbe8373177/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java]...but with this patch I started to use cached counters the basic idea is to refresh different fields with a different frequency (please let me know which should be configurable in your opinion) 2a) dag status refresh: 500ms - DAG_STATUS_BUILD_INTERVAL_MS 2b) counters refresh: >1000ms - COUNTER_CACHE_INTERVAL_THRESHOLD_FAST_MS 3. DrainDispatcher minor fix (I've found sometimes that it was hanging in the while loop, even when queue was empty), this class is used only in tests 4. TestDAGImpl: added some dag.entityUpdateTracker.stop(), becuase we manually set entityUpdateTracker into DAGImpl, but there is another one, which already started its thread (I haven't indentified error because of this, so just for clarity's sake) > DAGImpl: dag lock is unfair and can starve the writers > -- > > Key: TEZ-3967 > URL: https://issues.apache.org/jira/browse/TEZ-3967 > Project: Apache Tez > Issue Type: Bug >Reporter: Gopal Vijayaraghavan >Assignee: László Bodor >Priority: Major > Attachments: TEZ-3967.01.patch, TEZ-3967.02.patch > > > Found when debugging HIVE-20103, that a reader arriving when another reader > is active can postpone a writer from obtaining a write-lock. > This is fundamentally bad for the DAGImpl as useful progress can only happen > when the writeLock is held. > {code} > public void handle(DAGEvent event) { > ... > try { > writeLock.lock(); > {code} > {code} >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x7efb02246f40> (a > java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) > at > java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943) > at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:1162) > at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:149) > at > org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:2251) > at > org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:2242) > at > org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:180) > at > org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:115) > at java.lang.Thread.run(Thread.java:745) > {code} > while read-lock is passed around between > {code} >at > org.apache.tez.dag.app.dag.impl.DAGImpl.getDAGStatus(DAGImpl.java:901) > at > org.apache.tez.dag.app.dag.impl.DAGImpl.getDAGStatus(DAGImpl.java:940) > at > org.apache.tez.dag.api.client.DAGClientHandler.getDAGStatus(DAGClientHandler.java:73) > {code} > calls. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TEZ-3967) DAGImpl: dag lock is unfair and can starve the writers
[ https://issues.apache.org/jira/browse/TEZ-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17076646#comment-17076646 ] László Bodor commented on TEZ-3967: --- thanks for the quick response [~jeagles]! "Alternatively, a new approach could be made where the calculations are done on a periodic basis and then the readcalls would return only cached values." I was thinking about the same, I think it's not much more effort than hunting for a better locking mechanism, I'll try that > DAGImpl: dag lock is unfair and can starve the writers > -- > > Key: TEZ-3967 > URL: https://issues.apache.org/jira/browse/TEZ-3967 > Project: Apache Tez > Issue Type: Bug >Reporter: Gopal Vijayaraghavan >Assignee: László Bodor >Priority: Major > Attachments: TEZ-3967.01.patch > > > Found when debugging HIVE-20103, that a reader arriving when another reader > is active can postpone a writer from obtaining a write-lock. > This is fundamentally bad for the DAGImpl as useful progress can only happen > when the writeLock is held. > {code} > public void handle(DAGEvent event) { > ... > try { > writeLock.lock(); > {code} > {code} >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x7efb02246f40> (a > java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) > at > java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943) > at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:1162) > at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:149) > at > org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:2251) > at > org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:2242) > at > org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:180) > at > org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:115) > at java.lang.Thread.run(Thread.java:745) > {code} > while read-lock is passed around between > {code} >at > org.apache.tez.dag.app.dag.impl.DAGImpl.getDAGStatus(DAGImpl.java:901) > at > org.apache.tez.dag.app.dag.impl.DAGImpl.getDAGStatus(DAGImpl.java:940) > at > org.apache.tez.dag.api.client.DAGClientHandler.getDAGStatus(DAGClientHandler.java:73) > {code} > calls. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TEZ-3967) DAGImpl: dag lock is unfair and can starve the writers
[ https://issues.apache.org/jira/browse/TEZ-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17076543#comment-17076543 ] Jonathan Turner Eagles commented on TEZ-3967: - Perhaps, this fails to address the issue at hand. For the getAllCounters case, by moving the getInternalState outside of the readlock, the cost of the call is actually more. getInternalState has to lock and unlock the readlock. While splitting the lock is more costly, there is opportunity for another call to acquire the read lock, perhaps reducing the average latency time of acquiring the lock for other calls. On the other hand, you all the possibility of getting inconsistent results as a write could have taken place in between the two readlocks. Similar for the getDAGStatus calls that were changed as well. The difference there is that some of the changes inside of the readlock could actually be moved outside of the readlock without risk, though not sure how much benefit it will be. {code:title=Safe to access outside of the readlock} ProgressBuilder dagProgress = new ProgressBuilder(); dagProgress.setTotalTaskCount(totalTaskCount); dagProgress.setSucceededTaskCount(totalSucceededTaskCount); dagProgress.setRunningTaskCount(totalRunningTaskCount); dagProgress.setFailedTaskCount(totalFailedTaskCount); dagProgress.setKilledTaskCount(totalKilledTaskCount); dagProgress.setFailedTaskAttemptCount(totalFailedTaskAttemptCount); dagProgress.setKilledTaskAttemptCount(totalKilledTaskAttemptCount); dagProgress.setRejectedTaskAttemptCount(totalRejectedTaskAttemptCount); {code} Alternatively, a new approach could be made where the calculations are done on a periodic basis and then the readcalls would return only cached values. > DAGImpl: dag lock is unfair and can starve the writers > -- > > Key: TEZ-3967 > URL: https://issues.apache.org/jira/browse/TEZ-3967 > Project: Apache Tez > Issue Type: Bug >Reporter: Gopal Vijayaraghavan >Assignee: László Bodor >Priority: Major > Attachments: TEZ-3967.01.patch > > > Found when debugging HIVE-20103, that a reader arriving when another reader > is active can postpone a writer from obtaining a write-lock. > This is fundamentally bad for the DAGImpl as useful progress can only happen > when the writeLock is held. > {code} > public void handle(DAGEvent event) { > ... > try { > writeLock.lock(); > {code} > {code} >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x7efb02246f40> (a > java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) > at > java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943) > at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:1162) > at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:149) > at > org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:2251) > at > org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:2242) > at > org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:180) > at > org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:115) > at java.lang.Thread.run(Thread.java:745) > {code} > while read-lock is passed around between > {code} >at > org.apache.tez.dag.app.dag.impl.DAGImpl.getDAGStatus(DAGImpl.java:901) > at > org.apache.tez.dag.app.dag.impl.DAGImpl.getDAGStatus(DAGImpl.java:940) > at > org.apache.tez.dag.api.client.DAGClientHandler.getDAGStatus(DAGClientHandler.java:73) > {code} > calls. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TEZ-3967) DAGImpl: dag lock is unfair and can starve the writers
[ https://issues.apache.org/jira/browse/TEZ-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17076468#comment-17076468 ] László Bodor commented on TEZ-3967: --- [~gopalv], [~jeagles]: could you please take a look at [^TEZ-3967.01.patch] it's basically about narrowing the scope of the read locking 1. removed read lock from getDAGStatus, as it doesn't contain sensitive operation I think (except counters, but it's handled in called getAllCounters method) -> first I thought that iterating vertexMap's entryset could be dangerous without lock, but I don't expect ConcurrentModificationException as vertexMap is initialized initializeDAG and then the number of entries are not supposed to change 2. getAllCounters removed readlock from the beginning of the method, as mayBeConstructFinalFullCounters is protected with a fullCountersLock itself + getInternalState method also has its own locking do you think this patch can help getDAGStatus lock footprint? if so, let me know if you want me to do any performance/sanity check > DAGImpl: dag lock is unfair and can starve the writers > -- > > Key: TEZ-3967 > URL: https://issues.apache.org/jira/browse/TEZ-3967 > Project: Apache Tez > Issue Type: Bug >Reporter: Gopal Vijayaraghavan >Assignee: László Bodor >Priority: Major > Attachments: TEZ-3967.01.patch > > > Found when debugging HIVE-20103, that a reader arriving when another reader > is active can postpone a writer from obtaining a write-lock. > This is fundamentally bad for the DAGImpl as useful progress can only happen > when the writeLock is held. > {code} > public void handle(DAGEvent event) { > ... > try { > writeLock.lock(); > {code} > {code} >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x7efb02246f40> (a > java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) > at > java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943) > at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:1162) > at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:149) > at > org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:2251) > at > org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:2242) > at > org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:180) > at > org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:115) > at java.lang.Thread.run(Thread.java:745) > {code} > while read-lock is passed around between > {code} >at > org.apache.tez.dag.app.dag.impl.DAGImpl.getDAGStatus(DAGImpl.java:901) > at > org.apache.tez.dag.app.dag.impl.DAGImpl.getDAGStatus(DAGImpl.java:940) > at > org.apache.tez.dag.api.client.DAGClientHandler.getDAGStatus(DAGClientHandler.java:73) > {code} > calls. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TEZ-3967) DAGImpl: dag lock is unfair and can starve the writers
[ https://issues.apache.org/jira/browse/TEZ-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17075164#comment-17075164 ] TezQA commented on TEZ-3967: | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 10m 21s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} master Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 11s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 32s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 35s{color} | {color:green} master passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 9s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 8s{color} | {color:green} master passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 21s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 20s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 20s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 18s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 2s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 4m 57s{color} | {color:red} tez-dag in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 9s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 25m 34s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | tez.dag.app.TestSpeculation | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.8 Server=19.03.8 base: https://builds.apache.org/job/PreCommit-TEZ-Build/335/artifact/out/Dockerfile | | JIRA Issue | TEZ-3967 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12998833/TEZ-3967.01.patch | | Optional Tests | dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile | | uname | Linux 0f7807f2bd32 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | personality/tez.sh | | git revision | master / b5b432b | | Default Java | 1.8.0_242 | | unit | https://builds.apache.org/job/PreCommit-TEZ-Build/335/artifact/out/patch-unit-tez-dag.txt | | Test Results | https://builds.apache.org/job/PreCommit-TEZ-Build/335/testReport/ | | Max. process+thread count | 228 (vs. ulimit of 5500) | | modules | C: tez-dag U: tez-dag | | Console output | https://builds.apache.org/job/PreCommit-TEZ-Build/335/console | | versions | git=2.7.4 maven=3.3.9 findbugs=3.0.1 | | Powered by | Apache Yetus 0.11.1 https://yetus.apache.org | This message was automatically generated. > DAGImpl: dag lock is unfair and can starve the writers > -- > > Key: TEZ-3967 > URL: https://issues.apache.org/jira/browse/TEZ-3967 > Project: Apache Tez > Issue Type: Bug >
[jira] [Commented] (TEZ-3967) DAGImpl: dag lock is unfair and can starve the writers
[ https://issues.apache.org/jira/browse/TEZ-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542205#comment-16542205 ] Gopal V commented on TEZ-3967: -- {code} TEZ_DAG_STATUS_CHECK_INTERVAL("hive.tez.dag.status.check.interval", "500ms", new TimeValidator(TimeUnit.MILLISECONDS), "Interval between subsequent DAG status invocation."), {code} bq. I can see unfair locking being better for through put for the exact exact same reason you are suggesting The trouble is that overlapping read-locks are starving out the write-lock (because the read-lock can lock over the other read-lock) right now. Yes, fairness is not free - if you have a different write-preferred locking for this, I think that's what we are really looking for. > DAGImpl: dag lock is unfair and can starve the writers > -- > > Key: TEZ-3967 > URL: https://issues.apache.org/jira/browse/TEZ-3967 > Project: Apache Tez > Issue Type: Bug >Reporter: Gopal V >Priority: Major > > Found when debugging HIVE-20103, that a reader arriving when another reader > is active can postpone a writer from obtaining a write-lock. > This is fundamentally bad for the DAGImpl as useful progress can only happen > when the writeLock is held. > {code} > public void handle(DAGEvent event) { > ... > try { > writeLock.lock(); > {code} > {code} >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x7efb02246f40> (a > java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) > at > java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943) > at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:1162) > at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:149) > at > org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:2251) > at > org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:2242) > at > org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:180) > at > org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:115) > at java.lang.Thread.run(Thread.java:745) > {code} > while read-lock is passed around between > {code} >at > org.apache.tez.dag.app.dag.impl.DAGImpl.getDAGStatus(DAGImpl.java:901) > at > org.apache.tez.dag.app.dag.impl.DAGImpl.getDAGStatus(DAGImpl.java:940) > at > org.apache.tez.dag.api.client.DAGClientHandler.getDAGStatus(DAGClientHandler.java:73) > {code} > calls. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-3967) DAGImpl: dag lock is unfair and can starve the writers
[ https://issues.apache.org/jira/browse/TEZ-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542031#comment-16542031 ] Jonathan Eagles commented on TEZ-3967: -- [~gopalv], can you indicate what the dag status poll interval is and also the number of vertices and task counts. Depending on the configuration and scale, I can see unfair locking being better for through put for the exact exact same reason you are suggesting and current locking method can be better for getting "real work" done. > DAGImpl: dag lock is unfair and can starve the writers > -- > > Key: TEZ-3967 > URL: https://issues.apache.org/jira/browse/TEZ-3967 > Project: Apache Tez > Issue Type: Bug >Reporter: Gopal V >Priority: Major > > Found when debugging HIVE-20103, that a reader arriving when another reader > is active can postpone a writer from obtaining a write-lock. > This is fundamentally bad for the DAGImpl as useful progress can only happen > when the writeLock is held. > {code} > public void handle(DAGEvent event) { > ... > try { > writeLock.lock(); > {code} > {code} >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x7efb02246f40> (a > java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) > at > java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943) > at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:1162) > at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:149) > at > org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:2251) > at > org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:2242) > at > org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:180) > at > org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:115) > at java.lang.Thread.run(Thread.java:745) > {code} > while read-lock is passed around between > {code} >at > org.apache.tez.dag.app.dag.impl.DAGImpl.getDAGStatus(DAGImpl.java:901) > at > org.apache.tez.dag.app.dag.impl.DAGImpl.getDAGStatus(DAGImpl.java:940) > at > org.apache.tez.dag.api.client.DAGClientHandler.getDAGStatus(DAGClientHandler.java:73) > {code} > calls. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-3967) DAGImpl: dag lock is unfair and can starve the writers
[ https://issues.apache.org/jira/browse/TEZ-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541957#comment-16541957 ] Jaume M commented on TEZ-3967: -- [~gopalv] the description of the problem makes sense to me but I'm having problems experimentally proving that it actually leads to better results. * I'm starting a tez job with many vertices and failures (I believe the vertices ending and failing will require access to the writelock) * Also at the same time I'm polling with one thread the status of the DAG. I'd like to do it with more but the {{DAGClientImpl}} is not thread safe. * I'm measuring the average time waiting for the writelock when the polling thread waits {0, 10, 100, 1000} milliseconds after each poll but I'm not seeing any significant results Is there a scenario in which many threads could poll the state? > DAGImpl: dag lock is unfair and can starve the writers > -- > > Key: TEZ-3967 > URL: https://issues.apache.org/jira/browse/TEZ-3967 > Project: Apache Tez > Issue Type: Bug >Reporter: Gopal V >Priority: Major > > Found when debugging HIVE-20103, that a reader arriving when another reader > is active can postpone a writer from obtaining a write-lock. > This is fundamentally bad for the DAGImpl as useful progress can only happen > when the writeLock is held. > {code} > public void handle(DAGEvent event) { > ... > try { > writeLock.lock(); > {code} > {code} >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x7efb02246f40> (a > java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) > at > java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943) > at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:1162) > at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:149) > at > org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:2251) > at > org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:2242) > at > org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:180) > at > org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:115) > at java.lang.Thread.run(Thread.java:745) > {code} > while read-lock is passed around between > {code} >at > org.apache.tez.dag.app.dag.impl.DAGImpl.getDAGStatus(DAGImpl.java:901) > at > org.apache.tez.dag.app.dag.impl.DAGImpl.getDAGStatus(DAGImpl.java:940) > at > org.apache.tez.dag.api.client.DAGClientHandler.getDAGStatus(DAGClientHandler.java:73) > {code} > calls. -- This message was sent by Atlassian JIRA (v7.6.3#76005)