[jira] [Commented] (HBASE-11868) Data loss in hlog when the hdfs is unavailable
[ https://issues.apache.org/jira/browse/HBASE-11868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119446#comment-14119446 ]

Hudson commented on HBASE-11868:

FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #465 (See [https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/465/])
Revert HBASE-11868 Data loss in hlog when the hdfs is unavailable (Liu Shaohui) (apurtell: rev ee32706c5d93fb3de6f4aba09174d34ca3879f6d)
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java

Data loss in hlog when the hdfs is unavailable

Key: HBASE-11868
URL: https://issues.apache.org/jira/browse/HBASE-11868
Project: HBase
Issue Type: Bug
Affects Versions: 0.98.5
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Blocker
Fix For: 0.98.6
Attachments: HBASE-11868-0.98-v1.diff, HBASE-11868-0.98-v2.diff

When using the new thread model in HBase 0.98, we found a bug that may cause data loss when HDFS is unavailable.

When writing WAL edits to the hlog in doMiniBatchMutation of HRegion, the hlog first calls appendNoSync to write the edits to the hlog and then calls sync with the txid.

Assume that the txid of the current write is 10, syncedTillHere in the hlog is 9, and failedTxid is 0. When HDFS is unavailable, the AsyncWriter or AsyncSyncer fails to append or sync the edits, and then updates both syncedTillHere and failedTxid to 10. When the hlog then calls sync with txid 10, failedTxid is never checked: txid equals syncedTillHere, so the while loop that contains the check is skipped.

The client thinks the write succeeded, but the data was written only to the memstore, not to the hlog. If the regionserver goes down before the memstore is flushed, the data is lost.

See: FSHLog.java #1348
{code}
// sync all transactions up to the specified txid
private void syncer(long txid) throws IOException {
  synchronized (this.syncedTillHere) {
    while (this.syncedTillHere.get() < txid) {
      try {
        this.syncedTillHere.wait();

        // only reached when the loop body runs, i.e. when syncedTillHere < txid
        if (txid <= this.failedTxid.get()) {
          assert asyncIOE != null :
            "current txid is among(under) failed txids, but asyncIOE is null!";
          throw asyncIOE;
        }
      } catch (InterruptedException e) {
        LOG.debug("interrupted while waiting for notification from AsyncNotifier");
      }
    }
  }
}
{code}

We can fix this issue by moving the comparison of txid and failedTxid outside the while block.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
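The scenario from the issue description, and the proposed fix of moving the txid/failedTxid comparison outside the while block, can be sketched as below. This is a minimal illustration and not the committed patch: the class SyncerSketch, the methods recordFailure, syncerBuggy, syncerFixed and the helper methods are hypothetical stand-ins for the relevant parts of FSHLog, and turning an interrupt into an InterruptedIOException is a choice of this sketch rather than the behavior of the original code.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the sync bookkeeping in FSHLog (names are assumptions). */
final class SyncerSketch {
  private final AtomicLong syncedTillHere;
  private final AtomicLong failedTxid;
  private IOException asyncIOE; // guarded by the monitor of syncedTillHere

  SyncerSketch(long syncedTillHere, long failedTxid) {
    this.syncedTillHere = new AtomicLong(syncedTillHere);
    this.failedTxid = new AtomicLong(failedTxid);
  }

  /** Simulates AsyncWriter/AsyncSyncer failing: both counters advance to txid. */
  void recordFailure(long txid, IOException cause) {
    synchronized (syncedTillHere) {
      asyncIOE = cause;
      failedTxid.set(txid);
      syncedTillHere.set(txid);
      syncedTillHere.notifyAll();
    }
  }

  /** Shape of the reported code: the failure check lives inside the wait loop. */
  void syncerBuggy(long txid) throws IOException {
    synchronized (syncedTillHere) {
      while (syncedTillHere.get() < txid) {
        awaitNotification();
        throwIfFailed(txid);
      }
      // If syncedTillHere >= txid on entry, the check above never runs.
    }
  }

  /** Proposed shape: always compare txid with failedTxid after the loop. */
  void syncerFixed(long txid) throws IOException {
    synchronized (syncedTillHere) {
      while (syncedTillHere.get() < txid) {
        awaitNotification();
      }
      throwIfFailed(txid);
    }
  }

  // Caller must hold the monitor of syncedTillHere.
  private void awaitNotification() throws IOException {
    try {
      syncedTillHere.wait();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new InterruptedIOException("interrupted while waiting for sync");
    }
  }

  private void throwIfFailed(long txid) throws IOException {
    if (txid <= failedTxid.get()) {
      throw asyncIOE != null ? asyncIOE : new IOException("txid " + txid + " failed");
    }
  }

  public static void main(String[] args) throws IOException {
    // Values from the issue: txid 10, syncedTillHere 9, failedTxid 0.
    SyncerSketch wal = new SyncerSketch(9, 0);
    wal.recordFailure(10, new IOException("HDFS unavailable"));
    wal.syncerBuggy(10); // returns normally, so the failure goes unnoticed
    try {
      wal.syncerFixed(10);
      System.out.println("fixed syncer: no error reported");
    } catch (IOException e) {
      System.out.println("fixed syncer reported: " + e.getMessage());
    }
  }
}
```

With the check after the loop, a caller whose txid was already covered by syncedTillHere when it entered syncer still sees the recorded failure instead of a silent success.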
[jira] [Commented] (HBASE-11868) Data loss in hlog when the hdfs is unavailable
[ https://issues.apache.org/jira/browse/HBASE-11868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119502#comment-14119502 ]

Hudson commented on HBASE-11868:

FAILURE: Integrated in HBase-0.98 #493 (See [https://builds.apache.org/job/HBase-0.98/493/])
HBASE-11868 Data loss in hlog when the hdfs is unavailable (Liu Shaohui) (apurtell: rev 39771b8f73a6e6eae12e8b3bdb7dd1fe13edc83c)
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
[jira] [Commented] (HBASE-11868) Data loss in hlog when the hdfs is unavailable
[ https://issues.apache.org/jira/browse/HBASE-11868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119664#comment-14119664 ]

Hudson commented on HBASE-11868:

FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #466 (See [https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/466/])
HBASE-11868 Data loss in hlog when the hdfs is unavailable (Liu Shaohui) (apurtell: rev 39771b8f73a6e6eae12e8b3bdb7dd1fe13edc83c)
* hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java
[jira] [Commented] (HBASE-11868) Data loss in hlog when the hdfs is unavailable
[ https://issues.apache.org/jira/browse/HBASE-11868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119075#comment-14119075 ]

Hudson commented on HBASE-11868:

FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #463 (See [https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/463/])
HBASE-11868 Data loss in hlog when the hdfs is unavailable (Liu Shaohui) (apurtell: rev fd10bde5af20d6db96207cc2e29b779e117acf19)
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java

Key: HBASE-11868
Fix For: 0.98.6
Attachments: HBASE-11868-0.98-v1.diff
[jira] [Commented] (HBASE-11868) Data loss in hlog when the hdfs is unavailable
[ https://issues.apache.org/jira/browse/HBASE-11868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119133#comment-14119133 ]

Hudson commented on HBASE-11868:

FAILURE: Integrated in HBase-0.98 #489 (See [https://builds.apache.org/job/HBase-0.98/489/])
HBASE-11868 Data loss in hlog when the hdfs is unavailable (Liu Shaohui) (apurtell: rev fd10bde5af20d6db96207cc2e29b779e117acf19)
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java

Key: HBASE-11868
Fix For: 0.98.6
Attachments: HBASE-11868-0.98-v1.diff
[jira] [Commented] (HBASE-11868) Data loss in hlog when the hdfs is unavailable
[ https://issues.apache.org/jira/browse/HBASE-11868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119209#comment-14119209 ]

Liu Shaohui commented on HBASE-11868:

[~apurtell] Let me fix the failed tests.

Key: HBASE-11868
Fix For: 0.98.7
Attachments: HBASE-11868-0.98-v1.diff
[jira] [Commented] (HBASE-11868) Data loss in hlog when the hdfs is unavailable
[ https://issues.apache.org/jira/browse/HBASE-11868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119223#comment-14119223 ]

Andrew Purtell commented on HBASE-11868:

Hi [~lshmouse], if it's possible to do that in the next few hours this can make .6.

Key: HBASE-11868
Fix For: 0.98.7
Attachments: HBASE-11868-0.98-v1.diff
[jira] [Commented] (HBASE-11868) Data loss in hlog when the hdfs is unavailable
[ https://issues.apache.org/jira/browse/HBASE-11868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119295#comment-14119295 ]

Liu Shaohui commented on HBASE-11868:

[~apurtell] The reason is that failedTxid is initialized to 0. If a test performs no updates, its sync operation with txid = 0 fails, because unflushedEntries is 0, which equals failedTxid. Changing the initial value of failedTxid to -1 will fix the failed tests.

Key: HBASE-11868
Fix For: 0.98.7
Attachments: HBASE-11868-0.98-v1.diff
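The effect of the initial failedTxid value described in the comment above can be illustrated with the comparison alone. This is a hedged sketch: FailedTxidInitSketch and isMarkedFailed are hypothetical names, and the only thing taken from the thread is the txid <= failedTxid comparison and the two candidate initial values, 0 and -1.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical illustration of the failedTxid initial-value problem. */
final class FailedTxidInitSketch {
  /** Same comparison as in syncer: a txid at or below failedTxid counts as failed. */
  static boolean isMarkedFailed(long txid, AtomicLong failedTxid) {
    return txid <= failedTxid.get();
  }

  public static void main(String[] args) {
    long txid = 0; // sync on a log that has received no edits yet

    // Initial value 0: txid 0 looks failed although nothing has gone wrong.
    System.out.println(isMarkedFailed(txid, new AtomicLong(0)));  // prints "true"

    // Initial value -1: txid 0 is no longer treated as failed.
    System.out.println(isMarkedFailed(txid, new AtomicLong(-1))); // prints "false"
  }
}
```

Once the comparison runs unconditionally, the sentinel has to lie below every txid a caller can pass, which is why -1 works where 0 does not.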
[jira] [Commented] (HBASE-11868) Data loss in hlog when the hdfs is unavailable
[ https://issues.apache.org/jira/browse/HBASE-11868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119410#comment-14119410 ]

Hudson commented on HBASE-11868:

FAILURE: Integrated in HBase-0.98 #492 (See [https://builds.apache.org/job/HBase-0.98/492/])
Revert HBASE-11868 Data loss in hlog when the hdfs is unavailable (Liu Shaohui) (apurtell: rev ee32706c5d93fb3de6f4aba09174d34ca3879f6d)
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java
[jira] [Commented] (HBASE-11868) Data loss in hlog when the hdfs is unavailable
[ https://issues.apache.org/jira/browse/HBASE-11868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117202#comment-14117202 ]

Honghua Feng commented on HBASE-11868:

+1 nice finding, thanks [~lshmouse] for the patch

Key: HBASE-11868
Attachments: HBASE-11868-0.98-v1.diff
[jira] [Commented] (HBASE-11868) Data loss in hlog when the hdfs is unavailable
[ https://issues.apache.org/jira/browse/HBASE-11868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14116990#comment-14116990 ]

Liu Shaohui commented on HBASE-11868:

[~apurtell] I think it is a critical bug in hlog. Would you want to fix this in 0.98.6?
[jira] [Commented] (HBASE-11868) Data loss in hlog when the hdfs is unavailable
[ https://issues.apache.org/jira/browse/HBASE-11868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14116995#comment-14116995 ]

Hadoop QA commented on HBASE-11868:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12665695/HBASE-11868-0.98-v1.diff
against trunk revision .
ATTACHMENT ID: 12665695

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/10662//console

This message is automatically generated.

Key: HBASE-11868
Attachments: HBASE-11868-0.98-v1.diff