[jira] [Commented] (HBASE-14028) DistributedLogReplay drops edits when ITBLL 125M
[ https://issues.apache.org/jira/browse/HBASE-14028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618938#comment-14618938 ]

stack commented on HBASE-14028:
-------------------------------

Latest: I see a swath of posted edits not showing up on the far end, though they were apparently successfully digested on the remote end (found by aligning the count of edits with some extra logging of sequenceids added in the testbed):

2015-07-07 07:16:35,728 DEBUG [RS_LOG_REPLAY_OPS-c2024:16020-0-Writer-2] wal.WALEditsReplaySink: Replayed 231 edits in 2458ms into region=IntegrationTestBigLinkedList,\x7F\xFF\xFF\xFF\xFF\xFF\xFF\xF8,1436277280607.bb166b99140bcd32df68676b4e1b60b2., hostname=c2025.halxg.cloudera.com,16020,1436278565173, seqNum=320072187, lastSequenceId=280072763

At the time, the recovering region is flushing -- a few logs are being replayed into this recovering region concurrently -- which is what is unusual around this event. I don't really see filtering going on sink-side (except when not the primary replica). Adding more logging and retrying.


DistributedLogReplay drops edits when ITBLL 125M
------------------------------------------------

                Key: HBASE-14028
                URL: https://issues.apache.org/jira/browse/HBASE-14028
            Project: HBase
         Issue Type: Bug
         Components: Recovery
   Affects Versions: 1.2.0
           Reporter: stack

Testing DLR before the 1.2.0RC gets cut, we are dropping edits. The issue seems to be around replay into a deployed region on a server that dies before all edits have finished replaying. Logging is sparse on sequenceid accounting, so I can't tell for sure how it is happening (or whether our accounting now being done by Store is messing up DLR). Digging.

I notice also that DLR does not refresh its cache of the region location on error -- it just keeps retrying until the whole WAL fails after 8 retries... about 30 seconds. We could do a bit of refactoring and have the replay find the region in its new location if it moved during DLR replay.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
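The location-refresh refactor suggested at the end of the issue description could be sketched roughly as follows. This is a hypothetical illustration, assuming invented stand-ins: `Locator` and `Sink` are not the real WALEditsReplaySink internals, and only the 8-retry limit comes from the issue text.

```java
// Hypothetical sketch of replaying edits with a location refresh between
// retries. Locator and Sink are invented stand-ins for HBase's region
// location cache and replay RPC; they are not the real 1.2 internals.
import java.util.List;

public class ReplaySinkSketch {
    interface Locator { String locate(boolean reload); }
    interface Sink { void replay(String server, List<String> edits) throws Exception; }

    static final int MAX_RETRIES = 8;  // matches the 8 retries noted in the issue

    static String replayWithRefresh(Locator locator, Sink sink, List<String> edits)
            throws Exception {
        String server = locator.locate(false);  // use the cached location first
        for (int attempt = 1; ; attempt++) {
            try {
                sink.replay(server, edits);
                return server;  // replay landed; report where
            } catch (Exception e) {
                if (attempt >= MAX_RETRIES) throw e;  // give up, as today, after 8 tries
                // The proposed change: reload the location so a region that
                // moved mid-recovery is found on its new server instead of
                // failing the whole WAL against the stale one.
                server = locator.locate(true);
            }
        }
    }
}
```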
[jira] [Commented] (HBASE-14028) DistributedLogReplay drops edits when ITBLL 125M
[ https://issues.apache.org/jira/browse/HBASE-14028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619615#comment-14619615 ]

stack commented on HBASE-14028:
-------------------------------

I added logging and reran. Found another failure type beyond the replay-over-a-concurrent-flush described above. High level: the region opens, we start to replay edits, but well before the replay can finish, the server hosting the newly opened region crashes. Edits in the WAL we were replaying get skipped on the second attempt.

Here is the open before the crash:

2015-07-08 12:45:38,317 DEBUG [RS_OPEN_REGION-c2023:16020-0] wal.WALSplitter: Wrote region seqId=hdfs://c2020.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/467eaf13c7ce1f2e1afb1c567322c9e7/recovered.edits/760185051.seqid to file, newSeqId=760185051, maxSeqId=720162792

Here is the open after the crash:

2015-07-08 12:45:49,920 DEBUG [RS_OPEN_REGION-c2025:16020-1] wal.WALSplitter: Wrote region seqId=hdfs://c2020.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/467eaf13c7ce1f2e1afb1c567322c9e7/recovered.edits/800185051.seqid to file, newSeqId=800185051, maxSeqId=760185051

See how the newSeqId from the first open becomes the maxSeqId the second time we open. This is broken (it is the well-padded sequence id set well in advance of any edits that could come in during replay). See how on the subsequent replay we end up skipping most of the edits:

2015-07-08 12:46:25,103 INFO [RS_LOG_REPLAY_OPS-c2025:16020-1] wal.WALSplitter: Processed 80 edits across 0 regions; edits skipped=1583; log file=hdfs://c2020.halxg.cloudera.com:8020/hbase/WALs/c2021.halxg.cloudera.com,16020,1436383987497-splitting/c2021.halxg.cloudera.com%2C16020%2C1436383987497.default.1436384632799, length=72993715, corrupted=false, progress failed=false

(Says 80 edits for ZERO regions...)

The maximum sequence id in the WAL to replay is 720185601, even though we did not replay all edits. So, at least two issues.

Let me put this aside since it looks like it won't make hbase-1.2.0 at this late stage.
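Plugging the exact sequence ids from the logs above into the skip check makes the drop visible. The `skipped` predicate here is a simplified model of replay's "already persisted" test, not the actual WALSplitter code:

```java
// Illustration using the exact sequence ids from the logs above. The
// skipped() predicate is a simplified model of replay's "already
// persisted" check, not the real HBase code.
public class SeqIdSkipSketch {
    // Replay drops an edit whose sequence id is at or below the region's
    // recorded max seqid, on the assumption it was already persisted.
    static boolean skipped(long editSeqId, long regionMaxSeqId) {
        return editSeqId <= regionMaxSeqId;
    }

    public static void main(String[] args) {
        long newSeqIdFirstOpen  = 760_185_051L;  // padded id written at the first open
        long maxSeqIdSecondOpen = newSeqIdFirstOpen;  // that padded id returns as maxSeqId after the crash
        long highestEditInWal   = 720_185_601L;  // maximum sequence id actually in the WAL

        // Every real edit sits below the padded marker, so the second
        // replay treats the whole WAL as already persisted and drops it.
        assert skipped(highestEditInWal, maxSeqIdSecondOpen);
        // Against the correct pre-crash maxSeqId (720,162,792) the same
        // edit would have been replayed.
        assert !skipped(highestEditInWal, 720_162_792L);
    }
}
```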
[jira] [Commented] (HBASE-14028) DistributedLogReplay drops edits when ITBLL 125M
[ https://issues.apache.org/jira/browse/HBASE-14028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14617997#comment-14617997 ]

stack commented on HBASE-14028:
-------------------------------

I have been playing more with this. Losing data is pretty easy to do. Trying to find why the end of a WAL goes missing during replay; there is not enough info to debug, and it is a little tough to trace where we're at at any one time. Trying to backfill.
[jira] [Commented] (HBASE-14028) DistributedLogReplay drops edits when ITBLL 125M
[ https://issues.apache.org/jira/browse/HBASE-14028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14615868#comment-14615868 ]

Vladimir Rodionov commented on HBASE-14028:
-------------------------------------------

This recovery-from-failure-during-recovery-from-failure thing looks quite complicated to me. I am working on HBASE-7912, and one of the improvements on the list is WALPlayer into HFiles followed by a bulk load. Pounding HBase with millions of puts is not the right approach.
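The WALPlayer-into-HFiles route suggested above would look roughly like the following. This is a sketch, not a tested recipe: it assumes a running cluster, the output path and WAL-directory placeholder are made up, and the bulk-output property name has changed across HBase versions (hlog.bulk.output in older releases, wal.bulk.output in later ones), so check it against the version in hand.

```shell
# Sketch only: convert WAL edits to HFiles instead of replaying puts.
# Property name and paths are assumptions -- verify against your HBase version.
hbase org.apache.hadoop.hbase.mapreduce.WALPlayer \
  -Dwal.bulk.output=/tmp/itbll-hfiles \
  /hbase/WALs/<crashed-server>-splitting \
  IntegrationTestBigLinkedList

# Then bulk-load the generated HFiles in one shot:
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
  /tmp/itbll-hfiles IntegrationTestBigLinkedList
```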
[jira] [Commented] (HBASE-14028) DistributedLogReplay drops edits when ITBLL 125M
[ https://issues.apache.org/jira/browse/HBASE-14028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14615977#comment-14615977 ]

stack commented on HBASE-14028:
-------------------------------

bq. This recovery-from-failure-during-recovery-from-failure thing looks quite complicated to me.

Yes. It should work. All the pieces are there. Smile. I've done a few more runs, and it passes sometimes. Let me try to figure out the hole.