[jira] [Commented] (HDFS-13112) Token expiration edits may cause log corruption or deadlock
[ https://issues.apache.org/jira/browse/HDFS-13112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366407#comment-16366407 ] Hudson commented on HDFS-13112: --- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #13666 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/13666/]) HDFS-13112. Token expiration edits may cause log corruption or deadlock. (kihwal: rev 47473952e56b0380147d42f4110ad03c2276c961) * (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/RwLock.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/security/token/delegation/DelegationTokenSecretManager.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestSecurityTokenEditLog.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystemLock.java > Token expiration edits may cause log corruption or deadlock > --- > > Key: HDFS-13112 > URL: https://issues.apache.org/jira/browse/HDFS-13112 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.1.0-beta, 0.23.8 >Reporter: Daryn Sharp >Assignee: Daryn Sharp >Priority: Critical > Fix For: 3.1.0, 2.10.0, 2.9.1, 3.0.1, 2.8.4, 2.7.6, 3.2.0 > > Attachments: HDFS-13112.1.patch, HDFS-13112.patch > > > HDFS-4477 specifically did not acquire the fsn lock during token cancellation > based on the belief that edit logs are thread-safe. However, log rolling is > not thread-safe. Failure to externally synchronize on the fsn lock during a > roll will cause problems. > For sync edit logging, it may cause corruption by interspersing edits with > the end/start segment edits. Async edit logging may encounter a deadlock if > the log queue overflows. Luckily, losing the race is extremely rare. In ~5 > years, we've never encountered it. However, HDFS-13051 lost the race with > async edits. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13112) Token expiration edits may cause log corruption or deadlock
[ https://issues.apache.org/jira/browse/HDFS-13112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366307#comment-16366307 ] Kihwal Lee commented on HDFS-13112: --- Committed this to trunk, branch-3.1, branch-3.0, branch-3.0.1, branch-2, branch-2.9, branch-2.8 and branch-2.7. Thanks for fixing this, Daryn and also I appreciate the review from Xiao. > Token expiration edits may cause log corruption or deadlock > --- > > Key: HDFS-13112 > URL: https://issues.apache.org/jira/browse/HDFS-13112 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.1.0-beta, 0.23.8 >Reporter: Daryn Sharp >Assignee: Daryn Sharp >Priority: Critical > Fix For: 3.1.0, 2.10.0, 2.9.1, 3.0.1, 2.8.4, 2.7.6, 3.2.0 > > Attachments: HDFS-13112.1.patch, HDFS-13112.patch > > > HDFS-4477 specifically did not acquire the fsn lock during token cancellation > based on the belief that edit logs are thread-safe. However, log rolling is > not thread-safe. Failure to externally synchronize on the fsn lock during a > roll will cause problems. > For sync edit logging, it may cause corruption by interspersing edits with > the end/start segment edits. Async edit logging may encounter a deadlock if > the log queue overflows. Luckily, losing the race is extremely rare. In ~5 > years, we've never encountered it. However, HDFS-13051 lost the race with > async edits. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13112) Token expiration edits may cause log corruption or deadlock
[ https://issues.apache.org/jira/browse/HDFS-13112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366274#comment-16366274 ] Kihwal Lee commented on HDFS-13112: --- Will commit it shortly. > Token expiration edits may cause log corruption or deadlock > --- > > Key: HDFS-13112 > URL: https://issues.apache.org/jira/browse/HDFS-13112 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.1.0-beta, 0.23.8 >Reporter: Daryn Sharp >Assignee: Daryn Sharp >Priority: Critical > Attachments: HDFS-13112.1.patch, HDFS-13112.patch > > > HDFS-4477 specifically did not acquire the fsn lock during token cancellation > based on the belief that edit logs are thread-safe. However, log rolling is > not thread-safe. Failure to externally synchronize on the fsn lock during a > roll will cause problems. > For sync edit logging, it may cause corruption by interspersing edits with > the end/start segment edits. Async edit logging may encounter a deadlock if > the log queue overflows. Luckily, losing the race is extremely rare. In ~5 > years, we've never encountered it. However, HDFS-13051 lost the race with > async edits. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13112) Token expiration edits may cause log corruption or deadlock
[ https://issues.apache.org/jira/browse/HDFS-13112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366261#comment-16366261 ] Xiao Chen commented on HDFS-13112: -- Thanks for the comments Daryn. Given this really is just update master key which is infrequent, I'm +1 on the patch as-is too. > Token expiration edits may cause log corruption or deadlock > --- > > Key: HDFS-13112 > URL: https://issues.apache.org/jira/browse/HDFS-13112 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.1.0-beta, 0.23.8 >Reporter: Daryn Sharp >Assignee: Daryn Sharp >Priority: Critical > Attachments: HDFS-13112.1.patch, HDFS-13112.patch > > > HDFS-4477 specifically did not acquire the fsn lock during token cancellation > based on the belief that edit logs are thread-safe. However, log rolling is > not thread-safe. Failure to externally synchronize on the fsn lock during a > roll will cause problems. > For sync edit logging, it may cause corruption by interspersing edits with > the end/start segment edits. Async edit logging may encounter a deadlock if > the log queue overflows. Luckily, losing the race is extremely rare. In ~5 > years, we've never encountered it. However, HDFS-13051 lost the race with > async edits. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13112) Token expiration edits may cause log corruption or deadlock
[ https://issues.apache.org/jira/browse/HDFS-13112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366049#comment-16366049 ] Daryn Sharp commented on HDFS-13112: Xiao, good questions. Yes, typically the edit log should not be synced while holding the lock, esp. the write lock because it stalls everything. Simple answer, it was already there. Infrequent syncs with the read lock are probably "ok". I could just remove the sync. The only "risk" is issuing tokens with a lost key, which isn't an issue because if the token is synced, its secret was implicitly synced. Expiry edits don't need a sync for the reason you state. Failover will expire them. Unlike an explicit cancel, an expiry isn't essential for consistency. > Token expiration edits may cause log corruption or deadlock > --- > > Key: HDFS-13112 > URL: https://issues.apache.org/jira/browse/HDFS-13112 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.1.0-beta, 0.23.8 >Reporter: Daryn Sharp >Assignee: Daryn Sharp >Priority: Critical > Attachments: HDFS-13112.1.patch, HDFS-13112.patch > > > HDFS-4477 specifically did not acquire the fsn lock during token cancellation > based on the belief that edit logs are thread-safe. However, log rolling is > not thread-safe. Failure to externally synchronize on the fsn lock during a > roll will cause problems. > For sync edit logging, it may cause corruption by interspersing edits with > the end/start segment edits. Async edit logging may encounter a deadlock if > the log queue overflows. Luckily, losing the race is extremely rare. In ~5 > years, we've never encountered it. However, HDFS-13051 lost the race with > async edits. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13112) Token expiration edits may cause log corruption or deadlock
[ https://issues.apache.org/jira/browse/HDFS-13112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359271#comment-16359271 ] Xiao Chen commented on HDFS-13112: -- Thanks for the fix Daryn and Kihwal. I have not reviewed as careful as Kihwal did, but from what I see, LGTM. :) One question: {code:title=FSNamesystem.java} public void logUpdateMasterKey(DelegationKey key) { ... assert hasReadLock(); getEditLog().logUpdateMasterKey(key); getEditLog().logSync(); } {code} I think {{logSync}} is usually done outside of the FSN lock, why not do the same here? Also just to confirm my understanding: the comment in {{logExpireDelegationToken}} says that expiration edits are batched, which is reasonable. In code there is no {{logSync}} called at the end of the {{removeExpiredToken}}, but we don't necessarily have to call it because worst case is we lost it on failover, and new NN will still remove it in the next interval. > Token expiration edits may cause log corruption or deadlock > --- > > Key: HDFS-13112 > URL: https://issues.apache.org/jira/browse/HDFS-13112 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.1.0-beta, 0.23.8 >Reporter: Daryn Sharp >Assignee: Daryn Sharp >Priority: Critical > Attachments: HDFS-13112.1.patch, HDFS-13112.patch > > > HDFS-4477 specifically did not acquire the fsn lock during token cancellation > based on the belief that edit logs are thread-safe. However, log rolling is > not thread-safe. Failure to externally synchronize on the fsn lock during a > roll will cause problems. > For sync edit logging, it may cause corruption by interspersing edits with > the end/start segment edits. Async edit logging may encounter a deadlock if > the log queue overflows. Luckily, losing the race is extremely rare. In ~5 > years, we've never encountered it. However, HDFS-13051 lost the race with > async edits. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13112) Token expiration edits may cause log corruption or deadlock
[ https://issues.apache.org/jira/browse/HDFS-13112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359024#comment-16359024 ] genericqa commented on HDFS-13112: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 44s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 15m 39s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 54s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 36s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 56s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 10m 17s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 50s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 51s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 55s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 50s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 50s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 36s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 218 unchanged - 0 fixed = 219 total (was 218) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 54s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 9m 46s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 56s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 50s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red}135m 25s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 18s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}183m 3s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.web.TestWebHdfsTimeouts | | | hadoop.hdfs.TestErasureCodingPolicies | | | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:5b98639 | | JIRA Issue | HDFS-13112 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12909969/HDFS-13112.1.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 4b445362a84a 4.4.0-64-generic #85-Ubuntu SMP Mon Feb 20 11:50:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 543f3ab | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_151 | | findbugs | v3.1.0-RC1 | | checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/23014/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt | | unit | https://builds.apache.org/job/PreCommit-HDFS-Build/23014/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt | | Test Results |
[jira] [Commented] (HDFS-13112) Token expiration edits may cause log corruption or deadlock
[ https://issues.apache.org/jira/browse/HDFS-13112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358820#comment-16358820 ] Daryn Sharp commented on HDFS-13112: Ok, state transitions hold the write lock when stopping the secret manager # I need to acquire the lock interruptibly to avoid the deadlock. # The write lock's interruptible method was exposed but not read, so added that. # The noInterruptsLock technically isn't necessary anymore if caller stopping the secret manager has the write lock, but per comments I left it there for safety. # The methods no longer throw InterruptedIOException, but leave or set the interrupt flag if interrupted. Why? ** The abstract secret manager's master key roll currently catches ioes and plows ahead. Expects the while (!done) to exit cleanly. Survives the interrupt. But can cause expiry to crash. ** Expiring a token does not catch exceptions, so an interrupt is currently fatal. Not good. ** The run loop's sleep catches interrupted exceptions, allowing it to reach the while (!done). ** So why I do I leave the interrupt set instead of throwing? Less risky to avoid changing the abstract secret manager. Rolling a key will catch the interrupt. If it also decides to expire tokens in the same cycle, it will try to acquire the read lock again, and deadlock again. Leaving the thread interrupted prevents that and allows the run loop to hit the exit condition. ** Went ahead and individually lock per-token, in the off case there's a glut of tokens to expire and edit logging is being slow (think QJM). > Token expiration edits may cause log corruption or deadlock > --- > > Key: HDFS-13112 > URL: https://issues.apache.org/jira/browse/HDFS-13112 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.1.0-beta, 0.23.8 >Reporter: Daryn Sharp >Assignee: Daryn Sharp >Priority: Critical > Attachments: HDFS-13112.1.patch, HDFS-13112.patch > > > HDFS-4477 specifically did not acquire the fsn lock during token cancellation > based on the belief that edit logs are thread-safe. However, log rolling is > not thread-safe. Failure to externally synchronize on the fsn lock during a > roll will cause problems. > For sync edit logging, it may cause corruption by interspersing edits with > the end/start segment edits. Async edit logging may encounter a deadlock if > the log queue overflows. Luckily, losing the race is extremely rare. In ~5 > years, we've never encountered it. However, HDFS-13051 lost the race with > async edits. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13112) Token expiration edits may cause log corruption or deadlock
[ https://issues.apache.org/jira/browse/HDFS-13112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357180#comment-16357180 ] Kihwal Lee commented on HDFS-13112: --- The handler for {{enterSafeMode}} has acquired a write lock. The secret manager is stuck waiting for read lock. {noformat} "Thread[Thread-206,5,main]" #287 daemon prio=5 os_prio=0 tid=0x7f5ee8f59ec0 nid=0x6348 waiting on condition [0x7f5eb8acd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0xd8b103c8> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283) at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727) at org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.readLock(FSNamesystemLock.java:142) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.readLock(FSNamesystem.java:1580) at org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenSecretManager.logUpdateMasterKey (DelegationTokenSecretManager.java:374) at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.updateCurrentKey (AbstractDelegationTokenSecretManager.java:353) at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.rollMasterKey (AbstractDelegationTokenSecretManager.java:376) at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager$ExpiredTokenRemover.run (AbstractDelegationTokenSecretManager.java:683) at java.lang.Thread.run(Thread.java:748) {noformat} > Token expiration edits may cause log corruption or deadlock > --- > > Key: HDFS-13112 > URL: https://issues.apache.org/jira/browse/HDFS-13112 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.1.0-beta, 0.23.8 >Reporter: Daryn Sharp >Assignee: Daryn Sharp >Priority: Critical > Attachments: HDFS-13112.patch > > > HDFS-4477 specifically did not acquire the fsn lock during token cancellation > based on the belief that edit logs are thread-safe. However, log rolling is > not thread-safe. Failure to externally synchronize on the fsn lock during a > roll will cause problems. > For sync edit logging, it may cause corruption by interspersing edits with > the end/start segment edits. Async edit logging may encounter a deadlock if > the log queue overflows. Luckily, losing the race is extremely rare. In ~5 > years, we've never encountered it. However, HDFS-13051 lost the race with > async edits. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13112) Token expiration edits may cause log corruption or deadlock
[ https://issues.apache.org/jira/browse/HDFS-13112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357050#comment-16357050 ] Kihwal Lee commented on HDFS-13112: --- When I ran the failed tests, the following is reproduced. It seems related to the change. Please investigate. {noformat} [INFO] --- [INFO] T E S T S [INFO] --- [INFO] Running org.apache.hadoop.hdfs.server.namenode.ha.TestHAStateTransitions [ERROR] Tests run: 11, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 111.306 s <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.ha.TestHAStateTransitions [ERROR] testSecretManagerState(org.apache.hadoop.hdfs.server.namenode.ha.TestHAStateTransitions) Time elapsed: 60.008 s <<< ERROR! java.lang.Exception: test timed out after 6 milliseconds at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1252) at java.lang.Thread.join(Thread.java:1326) at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.stopThreads(AbstractDelegationTokenSecretManager.java:653) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.stopSecretManager(FSNamesystem.java:1143) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.enterSafeMode(FSNamesystem.java:4535) at org.apache.hadoop.hdfs.server.namenode.NameNodeAdapter.enterSafeMode(NameNodeAdapter.java:100) at org.apache.hadoop.hdfs.server.namenode.ha.TestHAStateTransitions.testSecretManagerState(TestHAStateTransitions.java:525) {noformat} This passes without the patch. {noformat} mvn test -Dtest=TestHAStateTransitions#testSecretManagerState {noformat} > Token expiration edits may cause log corruption or deadlock > --- > > Key: HDFS-13112 > URL: https://issues.apache.org/jira/browse/HDFS-13112 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.1.0-beta, 0.23.8 >Reporter: Daryn Sharp >Assignee: Daryn Sharp >Priority: Critical > Attachments: HDFS-13112.patch > > > HDFS-4477 specifically did not acquire the fsn lock during token cancellation > based on the belief that edit logs are thread-safe. However, log rolling is > not thread-safe. Failure to externally synchronize on the fsn lock during a > roll will cause problems. > For sync edit logging, it may cause corruption by interspersing edits with > the end/start segment edits. Async edit logging may encounter a deadlock if > the log queue overflows. Luckily, losing the race is extremely rare. In ~5 > years, we've never encountered it. However, HDFS-13051 lost the race with > async edits. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13112) Token expiration edits may cause log corruption or deadlock
[ https://issues.apache.org/jira/browse/HDFS-13112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355844#comment-16355844 ] Kihwal Lee commented on HDFS-13112: --- The patch looks good. - The addition of read locks ensures these edit logging activities do not collide with edit rolling or HA transitions(In addition to the level of safety provided by {{noInterruptsLock}}). - A write lock is not required since these don't change any state other threads are accessing with a read lock. And only the secret manager is edit logging with a read lock and all others are using a write lock, there can be no concurrent edit logging and it covers the general {{FSEditLog}} thread safety issue, not only the issue between logging and rolling. Now, if we believe that it is only unsafe between edit logging and rolling (i.e. normal edit logging activities are thread safe), we could make {{getDelegationToken()}}, {{renewDelegationToken()}} and {{cancelDelegationToken()}} acquire a read lock. And perhaps lease-related calls too. Any thoughts on this? In any case, I'm +1 on the patch. If you think we can make additional locking changes, please file a follow-up jira. > Token expiration edits may cause log corruption or deadlock > --- > > Key: HDFS-13112 > URL: https://issues.apache.org/jira/browse/HDFS-13112 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.1.0-beta, 0.23.8 >Reporter: Daryn Sharp >Assignee: Daryn Sharp >Priority: Critical > Attachments: HDFS-13112.patch > > > HDFS-4477 specifically did not acquire the fsn lock during token cancellation > based on the belief that edit logs are thread-safe. However, log rolling is > not thread-safe. Failure to externally synchronize on the fsn lock during a > roll will cause problems. > For sync edit logging, it may cause corruption by interspersing edits with > the end/start segment edits. Async edit logging may encounter a deadlock if > the log queue overflows. Luckily, losing the race is extremely rare. In ~5 > years, we've never encountered it. However, HDFS-13051 lost the race with > async edits. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13112) Token expiration edits may cause log corruption or deadlock
[ https://issues.apache.org/jira/browse/HDFS-13112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354790#comment-16354790 ] genericqa commented on HDFS-13112: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 45s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 7s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 44s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 4s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 34s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 14s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 59s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 10s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 3s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 3s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 45s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 2 new + 210 unchanged - 0 fixed = 212 total (was 210) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 11s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 26s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 7s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 53s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 77m 36s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 23s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}133m 56s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.server.datanode.TestDirectoryScanner | | | hadoop.hdfs.TestFileCreationClient | | | hadoop.hdfs.server.namenode.ha.TestHAStateTransitions | | | hadoop.hdfs.server.balancer.TestBalancer | | | hadoop.hdfs.TestLeaseRecovery | | | hadoop.hdfs.TestDFSClientRetries | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:5b98639 | | JIRA Issue | HDFS-13112 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12909478/HDFS-13112.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 561553e004c6 3.13.0-139-generic #188-Ubuntu SMP Tue Jan 9 14:43:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 4304fcd | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_151 | | findbugs | v3.1.0-RC1 | | checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/22960/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt | | unit |
[jira] [Commented] (HDFS-13112) Token expiration edits may cause log corruption or deadlock
[ https://issues.apache.org/jira/browse/HDFS-13112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354240#comment-16354240 ] Daryn Sharp commented on HDFS-13112: Submit is pending HADOOP-15212. > Token expiration edits may cause log corruption or deadlock > --- > > Key: HDFS-13112 > URL: https://issues.apache.org/jira/browse/HDFS-13112 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.1.0-beta, 0.23.8 >Reporter: Daryn Sharp >Assignee: Daryn Sharp >Priority: Critical > Attachments: HDFS-13112.patch > > > HDFS-4477 specifically did not acquire the fsn lock during token cancellation > based on the belief that edit logs are thread-safe. However, log rolling is > not thread-safe. Failure to externally synchronize on the fsn lock during a > roll will cause problems. > For sync edit logging, it may cause corruption by interspersing edits with > the end/start segment edits. Async edit logging may encounter a deadlock if > the log queue overflows. Luckily, losing the race is extremely rare. In ~5 > years, we've never encountered it. However, HDFS-13051 lost the race with > async edits. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org