[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
[ https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16590273#comment-16590273 ] stack commented on HBASE-18152: --- Thank you [~allan163] It might be a different corruption type. Hopefully there'll be a clue when we dig in on the WALs. I've not seen it since the fix for the race condition. Cross-fingers. > [AMv2] Corrupt Procedure WAL file; procedure data stored out of order > - > > Key: HBASE-18152 > URL: https://issues.apache.org/jira/browse/HBASE-18152 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Affects Versions: 2.0.0 >Reporter: stack >Assignee: stack >Priority: Critical > Fix For: 3.0.0 > > Attachments: > 0001-TestWALProcedureExecutore-order-checking-test-that-d.patch, > HBASE-18152.master.001.patch, > hbase-hbase-master-ctr-e138-1518143905142-221855-01-02.hwx.site.log.gz, > pv2-0036.log, pv2-0047.log, > reading_bad_wal.patch > > > I've seen corruption from time-to-time testing. Its rare enough. Often we > can get over it but sometimes we can't. It took me a while to capture an > instance of corruption. Turns out we are write to the WAL out-of-order which > undoes a basic tenet; that WAL content is ordered in line w/ execution. > Below I'll post a corrupt WAL. > Looking at the write-side, there is a lot going on. I'm not clear on how we > could write out of order. Will try and get more insight. Meantime parking > this issue here to fill data into. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
[ https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589954#comment-16589954 ] Allan Yang commented on HBASE-18152: {quote} Do you have the WAL files still? {quote} Sorry, I didn't keep them. I will keep them next time and have a look. > [AMv2] Corrupt Procedure WAL file; procedure data stored out of order > - > > Key: HBASE-18152 > URL: https://issues.apache.org/jira/browse/HBASE-18152 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Affects Versions: 2.0.0 >Reporter: stack >Assignee: stack >Priority: Critical > Fix For: 3.0.0 > > Attachments: > 0001-TestWALProcedureExecutore-order-checking-test-that-d.patch, > HBASE-18152.master.001.patch, > hbase-hbase-master-ctr-e138-1518143905142-221855-01-02.hwx.site.log.gz, > pv2-0036.log, pv2-0047.log, > reading_bad_wal.patch > > > I've seen corruption from time-to-time testing. Its rare enough. Often we > can get over it but sometimes we can't. It took me a while to capture an > instance of corruption. Turns out we are write to the WAL out-of-order which > undoes a basic tenet; that WAL content is ordered in line w/ execution. > Below I'll post a corrupt WAL. > Looking at the write-side, there is a lot going on. I'm not clear on how we > could write out of order. Will try and get more insight. Meantime parking > this issue here to fill data into. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
[ https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588978#comment-16588978 ] stack commented on HBASE-18152: --- (Arr) [~allan163] Do you have the WAL files still? If so, attach and I can take a look. Or if you want to look, there is a pretty-printer for the master wal procedure files. What I've seen in the above is that the order in which we write the procedures into the WAL gets flipped... usually an update on a particular procedure has stored step N+1 before step N. > [AMv2] Corrupt Procedure WAL file; procedure data stored out of order > - > > Key: HBASE-18152 > URL: https://issues.apache.org/jira/browse/HBASE-18152 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Affects Versions: 2.0.0 >Reporter: stack >Assignee: stack >Priority: Critical > Fix For: 3.0.0 > > Attachments: > 0001-TestWALProcedureExecutore-order-checking-test-that-d.patch, > HBASE-18152.master.001.patch, > hbase-hbase-master-ctr-e138-1518143905142-221855-01-02.hwx.site.log.gz, > pv2-0036.log, pv2-0047.log, > reading_bad_wal.patch > > > I've seen corruption from time-to-time testing. Its rare enough. Often we > can get over it but sometimes we can't. It took me a while to capture an > instance of corruption. Turns out we are write to the WAL out-of-order which > undoes a basic tenet; that WAL content is ordered in line w/ execution. > Below I'll post a corrupt WAL. > Looking at the write-side, there is a lot going on. I'm not clear on how we > could write out of order. Will try and get more insight. Meantime parking > this issue here to fill data into. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
[ https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588571#comment-16588571 ] Allan Yang commented on HBASE-18152: My test env reproduced this issue again with HBASE-20939. Maybe this issue is not totally fixed by it. {code} 2018-08-22 15:57:35,010 ERROR [Thread-14] procedure2.ProcedureExecutor(388): Corrupt pid=16633, ppid=16510, state=WAITING_TIMEOUT:REGION_TRANSITION_QUEUE, hasLock=false; AssignProcedure table=IntegrationTestBigLinkedList, region=3407f9af348ff198a3bdd3a9ae7db02c {code} > [AMv2] Corrupt Procedure WAL file; procedure data stored out of order > - > > Key: HBASE-18152 > URL: https://issues.apache.org/jira/browse/HBASE-18152 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Affects Versions: 2.0.0 >Reporter: stack >Assignee: stack >Priority: Critical > Fix For: 3.0.0 > > Attachments: > 0001-TestWALProcedureExecutore-order-checking-test-that-d.patch, > HBASE-18152.master.001.patch, > hbase-hbase-master-ctr-e138-1518143905142-221855-01-02.hwx.site.log.gz, > pv2-0036.log, pv2-0047.log, > reading_bad_wal.patch > > > I've seen corruption from time-to-time testing. Its rare enough. Often we > can get over it but sometimes we can't. It took me a while to capture an > instance of corruption. Turns out we are write to the WAL out-of-order which > undoes a basic tenet; that WAL content is ordered in line w/ execution. > Below I'll post a corrupt WAL. > Looking at the write-side, there is a lot going on. I'm not clear on how we > could write out of order. Will try and get more insight. Meantime parking > this issue here to fill data into. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
[ https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559196#comment-16559196 ] stack commented on HBASE-18152: --- I think HBASE-20939 uncovers the root of corruption I've been trying to figure in here. > [AMv2] Corrupt Procedure WAL file; procedure data stored out of order > - > > Key: HBASE-18152 > URL: https://issues.apache.org/jira/browse/HBASE-18152 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Affects Versions: 2.0.0 >Reporter: stack >Assignee: stack >Priority: Critical > Fix For: 3.0.0 > > Attachments: > 0001-TestWALProcedureExecutore-order-checking-test-that-d.patch, > HBASE-18152.master.001.patch, > hbase-hbase-master-ctr-e138-1518143905142-221855-01-02.hwx.site.log.gz, > pv2-0036.log, pv2-0047.log, > reading_bad_wal.patch > > > I've seen corruption from time-to-time testing. Its rare enough. Often we > can get over it but sometimes we can't. It took me a while to capture an > instance of corruption. Turns out we are write to the WAL out-of-order which > undoes a basic tenet; that WAL content is ordered in line w/ execution. > Below I'll post a corrupt WAL. > Looking at the write-side, there is a lot going on. I'm not clear on how we > could write out of order. Will try and get more insight. Meantime parking > this issue here to fill data into. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
[ https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555146#comment-16555146 ] stack commented on HBASE-18152: --- A few notes: * I spent a good part of today trying to reproduce but was unable (see [^0001-TestWALProcedureExecutore-order-checking-test-that-d.patch] ). Even with high-concurrency and artificial friction, 160 concurrent worker threads always wrote in-order against default 16 slots. * Interesting is that a Worker thread will usually try and run a Procedure through to the end UNLESS it suspends or returns true from its isYieldAfterExecutionStep implementation. In this case, the AssignProcedure on the *Staring* step suspends itself as part of normal operation; a thread in the AssignmentManager is in charge of wakening the assign. This is how we make sure meta goes out first and how we batch up assigning. This technique makes the AssignProcedure in particular susceptible to a thread-switch when we move to *Dispatch*. * Let me see what would take to get rid of this WALProcedureStore. It is doing what we do elsewhere writing WAL but using different primitives (Conditions) with recovery and WAL rolling. > [AMv2] Corrupt Procedure WAL file; procedure data stored out of order > - > > Key: HBASE-18152 > URL: https://issues.apache.org/jira/browse/HBASE-18152 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Affects Versions: 2.0.0 >Reporter: stack >Assignee: stack >Priority: Critical > Fix For: 3.0.0 > > Attachments: > 0001-TestWALProcedureExecutore-order-checking-test-that-d.patch, > HBASE-18152.master.001.patch, > hbase-hbase-master-ctr-e138-1518143905142-221855-01-02.hwx.site.log.gz, > pv2-0036.log, pv2-0047.log, > reading_bad_wal.patch > > > I've seen corruption from time-to-time testing. Its rare enough. Often we > can get over it but sometimes we can't. It took me a while to capture an > instance of corruption. Turns out we are write to the WAL out-of-order which > undoes a basic tenet; that WAL content is ordered in line w/ execution. > Below I'll post a corrupt WAL. > Looking at the write-side, there is a lot going on. I'm not clear on how we > could write out of order. Will try and get more insight. Meantime parking > this issue here to fill data into. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
[ https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554562#comment-16554562 ] Josh Elser commented on HBASE-18152: {quote}Looking at this was procedure store I'm currently trying to figure why we have a third way of writing wals and was thinking of putting in place fshlog or asyncfs. {quote} Strong +1 on that. I know Enis was frustrated about this way back when. {quote}I appreciate the sort suggestion. I am not sure the edits we're part of the same batch so am unsure it would help. Might be worth trying though? {quote} Yeah, could be hair-brained. Not sure yet. I'll give it another look and see what I can come up with. I was trying to come up with something that doesn't just funnel our proc executors into a single synchronized block. But... maybe I don't need to be worried about that? I don't have a good feeling for how critical this part is – simpler fix might be to just hoist up the synchronization a level higher (e.g. out of pushData and into insert, update, delete) > [AMv2] Corrupt Procedure WAL file; procedure data stored out of order > - > > Key: HBASE-18152 > URL: https://issues.apache.org/jira/browse/HBASE-18152 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Affects Versions: 2.0.0 >Reporter: stack >Assignee: stack >Priority: Critical > Fix For: 3.0.0 > > Attachments: HBASE-18152.master.001.patch, > hbase-hbase-master-ctr-e138-1518143905142-221855-01-02.hwx.site.log.gz, > pv2-0036.log, pv2-0047.log, > reading_bad_wal.patch > > > I've seen corruption from time-to-time testing. Its rare enough. Often we > can get over it but sometimes we can't. It took me a while to capture an > instance of corruption. Turns out we are write to the WAL out-of-order which > undoes a basic tenet; that WAL content is ordered in line w/ execution. > Below I'll post a corrupt WAL. > Looking at the write-side, there is a lot going on. I'm not clear on how we > could write out of order. Will try and get more insight. Meantime parking > this issue here to fill data into. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
[ https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554540#comment-16554540 ] stack commented on HBASE-18152: --- [~elserj] thank you for jumping in. I appreciate being able to chat on this. I'm trying to make a test but it's a bit hard manufacturing. Looking at this was procedure store I'm currently trying to figure why we have a third way of writing wals and was thinking of putting in place fshlog or asyncfs. I appreciate the sort suggestion. I am not sure the edits we're part of the same batch so am unsure it would help. Might be worth trying though? Any other ideas welcome. Thanks. > [AMv2] Corrupt Procedure WAL file; procedure data stored out of order > - > > Key: HBASE-18152 > URL: https://issues.apache.org/jira/browse/HBASE-18152 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Affects Versions: 2.0.0 >Reporter: stack >Assignee: stack >Priority: Critical > Fix For: 3.0.0 > > Attachments: HBASE-18152.master.001.patch, > hbase-hbase-master-ctr-e138-1518143905142-221855-01-02.hwx.site.log.gz, > pv2-0036.log, pv2-0047.log, > reading_bad_wal.patch > > > I've seen corruption from time-to-time testing. Its rare enough. Often we > can get over it but sometimes we can't. It took me a while to capture an > instance of corruption. Turns out we are write to the WAL out-of-order which > undoes a basic tenet; that WAL content is ordered in line w/ execution. > Below I'll post a corrupt WAL. > Looking at the write-side, there is a lot going on. I'm not clear on how we > could write out of order. Will try and get more insight. Meantime parking > this issue here to fill data into. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
[ https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554529#comment-16554529 ] Josh Elser commented on HBASE-18152: {quote}Notice how they are on different worker threads. The procedures "execute" in the correct order but the edits land in the store in another order, something quiet possible given the distance between execute, post-execution cleanup, and store. {quote} Good find, sir! {quote}Looking at code, when we write we first fill one of a fixed number of "slots" with the serialized update of Procedure STEP (See WALProcedureStore#insert/#update). We then try to 'push' the data to the WAL (See WPS#pushData). Inside in pushData is a reentrant lock to ensure single-threaded-access updating WAL. During a startup, we can burst to have 10x the Workers we usually have. The "slot" number remains constant. So contention for slots and then to obtain the reentrant lock. Reentrant locks make no guarantees around the order in which threads get scheduled (This is not a 'fair' reentrant lock and even then, still no guarantees) so the code here is suspect. {quote} Totally makes sense. But I'm not sure what a "fix" would look like (not sure if you've started thinking about that yet, sorry to derail if you haven't) Right now, each slot holds the serialized Procedure and then the syncLoop will take all slots and write them to the file. What if we make the {{ByteSlot[]}} instead a {{TreeMap}}, locally sorting each batch of slots (really, Procedures), that we're going to group commit to the WAL? I think that would work since we our Slot never holds more than one Procedure at a time (multiple subProcs, but I'm not sure if that matters?). Tell me to pump the brakes if I'm not being helpful :) > [AMv2] Corrupt Procedure WAL file; procedure data stored out of order > - > > Key: HBASE-18152 > URL: https://issues.apache.org/jira/browse/HBASE-18152 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Affects Versions: 2.0.0 >Reporter: stack >Assignee: stack >Priority: Critical > Fix For: 3.0.0 > > Attachments: HBASE-18152.master.001.patch, > hbase-hbase-master-ctr-e138-1518143905142-221855-01-02.hwx.site.log.gz, > pv2-0036.log, pv2-0047.log, > reading_bad_wal.patch > > > I've seen corruption from time-to-time testing. Its rare enough. Often we > can get over it but sometimes we can't. It took me a while to capture an > instance of corruption. Turns out we are write to the WAL out-of-order which > undoes a basic tenet; that WAL content is ordered in line w/ execution. > Below I'll post a corrupt WAL. > Looking at the write-side, there is a lot going on. I'm not clear on how we > could write out of order. Will try and get more insight. Meantime parking > this issue here to fill data into. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
[ https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553919#comment-16553919 ] stack commented on HBASE-18152: --- ProcedureStore depends on updates being ordered. For the above case, the two steps -- "Starting" and "Dispatch" -- were run by different Worker threads (PEWorker-8 and then PEWorker-10). The two steps are separated in time by ~70ms and to run, each needs to take the lock on the region. But the updates went into the WAL flipped from how they ran chronologically and how they appeared in the log above. Looking at code, when we write we first fill one of a fixed number of "slots" with the serialized update of Procedure STEP (See WALProcedureStore#insert/#update). We then try to 'push' the data to the WAL (See WPS#pushData). Inside in pushData is a reentrant lock to ensure single-threaded-access updating WAL. During a startup, we can burst to have 10x the Workers we usually have. The "slot" number remains constant. So contention for slots and then to obtain the reentrant lock. Reentrant locks make no guarantees around the order in which threads get scheduled (This is not a 'fair' reentrant lock and even then, still no guarantees) so the code here is suspect. Trying to write a test to prove flawed ordering/corruption. > [AMv2] Corrupt Procedure WAL file; procedure data stored out of order > - > > Key: HBASE-18152 > URL: https://issues.apache.org/jira/browse/HBASE-18152 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Affects Versions: 2.0.0 >Reporter: stack >Assignee: stack >Priority: Critical > Fix For: 3.0.0 > > Attachments: HBASE-18152.master.001.patch, > hbase-hbase-master-ctr-e138-1518143905142-221855-01-02.hwx.site.log.gz, > pv2-0036.log, pv2-0047.log, > reading_bad_wal.patch > > > I've seen corruption from time-to-time testing. Its rare enough. Often we > can get over it but sometimes we can't. It took me a while to capture an > instance of corruption. Turns out we are write to the WAL out-of-order which > undoes a basic tenet; that WAL content is ordered in line w/ execution. > Below I'll post a corrupt WAL. > Looking at the write-side, there is a lot going on. I'm not clear on how we > could write out of order. Will try and get more insight. Meantime parking > this issue here to fill data into. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
[ https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551344#comment-16551344 ] stack commented on HBASE-18152: --- Saw this again. Here is current example: {code} 2018-07-17 22:33:03,678 WARN [Thread-18] wal.ProcedureWALFormatReader: NOT INCREASING! current=class_name: "org.apache.hadoop.hbase.master.assignment.AssignProcedure" parent_id: 3902 proc_id: 3906 submitted_time: 1531890545613 owner: "stack" state: RUNNABLE stack_id: 6 stack_id: 16 last_update: 1531890545980 state_message { type_url: "type.googleapis.com/hbase.pb.AssignRegionStateData" value: "\b\002\022F\b\200\336\354\331\312,\022\'\n\adefault\022\034IntegrationTestBigLinkedList\032\004\230\344(~\"\b\234q\307\034q\307\034h(\\0008\000" } , candidate=class_name: "org.apache.hadoop.hbase.master.assignment.AssignProcedure" parent_id: 3902 proc_id: 3906 submitted_time: 1531890545613 owner: "stack" state: RUNNABLE stack_id: 6 last_update: 1531890545812 state_message { type_url: "type.googleapis.com/hbase.pb.AssignRegionStateData" value: "\b\002\022F\b\200\336\354\331\312,\022\'\n\adefault\022\034IntegrationTestBigLinkedList\032\004\230\344(~\"\b\234q\307\034q\307\034h(\\0008\000" } ... {code} We have written the entries backwards indeed. Looking at the logging for this particular procedure, I see the below entries which correspond to the last_update times: {code} 2018-07-17 22:09:05,812 INFO [PEWorker-8] assignment.AssignProcedure: Starting pid=3906, ppid=3902, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=IntegrationTestBigLinkedList, region=26d8fa5172ffdf3e89319e4b73fbf164; rit=OFFLINE, location=ve0538.halxg.cloudera.com,16020,1531890104902; forceNewPlan=false, retain=true 2018-07-17 22:09:05,980 INFO [PEWorker-10] assignment.RegionTransitionProcedure: Dispatch pid=3906, ppid=3902, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=IntegrationTestBigLinkedList, region=26d8fa5172ffdf3e89319e4b73fbf164; rit=OPENING, location=ve0536.halxg.cloudera.com,16020,1531889066114 {code} Notice how they are on different worker threads. The procedures "execute" in the correct order but the edits land in the store in another order, something quiet possible given the distance between execute, post-execution cleanup, and store. > [AMv2] Corrupt Procedure WAL file; procedure data stored out of order > - > > Key: HBASE-18152 > URL: https://issues.apache.org/jira/browse/HBASE-18152 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Affects Versions: 2.0.0 >Reporter: stack >Assignee: stack >Priority: Critical > Fix For: 3.0.0 > > Attachments: HBASE-18152.master.001.patch, > hbase-hbase-master-ctr-e138-1518143905142-221855-01-02.hwx.site.log.gz, > pv2-0036.log, pv2-0047.log, > reading_bad_wal.patch > > > I've seen corruption from time-to-time testing. Its rare enough. Often we > can get over it but sometimes we can't. It took me a while to capture an > instance of corruption. Turns out we are write to the WAL out-of-order which > undoes a basic tenet; that WAL content is ordered in line w/ execution. > Below I'll post a corrupt WAL. > Looking at the write-side, there is a lot going on. I'm not clear on how we > could write out of order. Will try and get more insight. Meantime parking > this issue here to fill data into. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
[ https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16528042#comment-16528042 ] stack commented on HBASE-18152: --- I just manufactured a corrupt MasterWALProc by kill -9'ing Master (odd because this was a cluster that had just been doing hours of ITBLL with server killings). Recovery of the last WAL seems pretty crappy too. We seem to be missing 100 pids off the end. Is it the case that any WAL cut off on its tail generates "NOT INCREASING" messages? Let me check... > [AMv2] Corrupt Procedure WAL file; procedure data stored out of order > - > > Key: HBASE-18152 > URL: https://issues.apache.org/jira/browse/HBASE-18152 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Affects Versions: 2.0.0 >Reporter: stack >Assignee: stack >Priority: Critical > Fix For: 3.0.0 > > Attachments: HBASE-18152.master.001.patch, > hbase-hbase-master-ctr-e138-1518143905142-221855-01-02.hwx.site.log.gz, > pv2-0036.log, pv2-0047.log, > reading_bad_wal.patch > > > I've seen corruption from time-to-time testing. Its rare enough. Often we > can get over it but sometimes we can't. It took me a while to capture an > instance of corruption. Turns out we are write to the WAL out-of-order which > undoes a basic tenet; that WAL content is ordered in line w/ execution. > Below I'll post a corrupt WAL. > Looking at the write-side, there is a lot going on. I'm not clear on how we > could write out of order. Will try and get more insight. Meantime parking > this issue here to fill data into. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
[ https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518554#comment-16518554 ] stack commented on HBASE-18152: --- Had another incidence here on internal test rig. Procedure updates seem to have gone in in reversed order The 'last update' time and the list of stack ids -- the sub-procedures the root needs to roll back are out of order with the first update having two stack ids where the one that comes after has one (see example below) A bunch of sub-procedure Assigns of a SCP were written this way. I'm having difficulty because the update to the Store is under region lock so the updates should be in order but timeout doesn't have lock. The incident that brought this on does not show in the logs because a STUCK spew rolled away the logs. In another issue I tune down the STUCK logging. Hopefully it helps. Would like to figure what the 'event' was. Would help narrow which path brought this about (also my master WAL doesn't show the events that were a problem... They seem to have been cut off... the file was not closed nicely). {code} 2018-06-14 12:57:16,956 WARN org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormatReader: NOT INCREASING! current=class_name: "org.apache.hadoop.hbase.master.assignment.AssignProcedure" parent_id: 864403 proc_id: 916353 submitted_time: 1528730391463 owner: "hbase" state: RUNNABLE stack_id: 251 stack_id: 332 last_update: 1528730393598 state_message { type_url: "type.googleapis.com/hbase.pb.AssignRegionStateData" value: "\b\002\022Y\b\201\327\353\230\271,\0226\n\adefault\022+IntegrationTestBigLinkedList_20180524090611\032\b\274\240D\345\226\ad\020\"\b\274\246\375\243\rc\370\221(\\0008\000" } , candidate=class_name: "org.apache.hadoop.hbase.master.assignment.AssignProcedure" parent_id: 864403 proc_id: 916353 submitted_time: 1528730391463 owner: "hbase" state: RUNNABLE stack_id: 251 last_update: 1528730393404 state_message { type_url: "type.googleapis.com/hbase.pb.AssignRegionStateData" value: "\b\002\022Y\b\201\327\353\230\271,\0226\n\adefault\022+IntegrationTestBigLinkedList_20180524090611\032\b\274\240D\345\226\ad\020\"\b\274\246\375\243\rc\370\221(\\0008\000" } {code} > [AMv2] Corrupt Procedure WAL file; procedure data stored out of order > - > > Key: HBASE-18152 > URL: https://issues.apache.org/jira/browse/HBASE-18152 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Affects Versions: 2.0.0 >Reporter: stack >Assignee: stack >Priority: Critical > Fix For: 3.0.0 > > Attachments: HBASE-18152.master.001.patch, > hbase-hbase-master-ctr-e138-1518143905142-221855-01-02.hwx.site.log.gz, > pv2-0036.log, pv2-0047.log, > reading_bad_wal.patch > > > I've seen corruption from time-to-time testing. Its rare enough. Often we > can get over it but sometimes we can't. It took me a while to capture an > instance of corruption. Turns out we are write to the WAL out-of-order which > undoes a basic tenet; that WAL content is ordered in line w/ execution. > Below I'll post a corrupt WAL. > Looking at the write-side, there is a lot going on. I'm not clear on how we > could write out of order. Will try and get more insight. Meantime parking > this issue here to fill data into. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
[ https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437836#comment-16437836 ] Josh Elser commented on HBASE-18152: Oh, exciting. I have the pv2 wal still in oldWALs too. Let me look at that. > [AMv2] Corrupt Procedure WAL file; procedure data stored out of order > - > > Key: HBASE-18152 > URL: https://issues.apache.org/jira/browse/HBASE-18152 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Affects Versions: 2.0.0 >Reporter: stack >Assignee: stack >Priority: Critical > Fix For: 2.0.0 > > Attachments: HBASE-18152.master.001.patch, > hbase-hbase-master-ctr-e138-1518143905142-221855-01-02.hwx.site.log.gz, > pv2-0036.log, pv2-0047.log, > reading_bad_wal.patch > > > I've seen corruption from time-to-time testing. Its rare enough. Often we > can get over it but sometimes we can't. It took me a while to capture an > instance of corruption. Turns out we are write to the WAL out-of-order which > undoes a basic tenet; that WAL content is ordered in line w/ execution. > Below I'll post a corrupt WAL. > Looking at the write-side, there is a lot going on. I'm not clear on how we > could write out of order. Will try and get more insight. Meantime parking > this issue here to fill data into. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
[ https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437835#comment-16437835 ] Josh Elser commented on HBASE-18152: I think I saw something similar to this, but it may be less of the bug Stack outlines and more self-inflicted. I was trying out 2.0.0-rc0 on Hadoop 3. The version of Hadoop3 I had installed doesn't work with the asyncwal, which I noticed after tryign to start up the first time. Then, I realized I needed to set the meta-wal provider configuration option in addition to the normal wal provider option. IIRC, {{stop-hbase.sh}} hung so I ended up {{kill}}'ing the process. This caused a bunch of WAL split failures in the Master to get logged around {{2018-04-12 21:43:01}}. Interestingly, the Master eventually started complaining around the region transition that the connection was null (because the server was stopping, I assume) {code:java} 2018-04-12 21:43:05,781 WARN [PEWorker-8] assignment.RegionTransitionProcedure: Retryable error trying to transition: pid=3586, ppid=3534, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=TestTable, region=0e5c6da974e1d3753ba0cb6a8f0db5b1; rit=OPENING, location=ctr-e138-1518143905142-221855-01-09.hwx.site,16020,1523564755147 java.io.InterruptedIOException: Interrupted after 7 tries while maxAttempts=31 at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:173) at org.apache.hadoop.hbase.client.HTable.get(HTable.java:386) at org.apache.hadoop.hbase.client.HTable.get(HTable.java:360) at org.apache.hadoop.hbase.MetaTableAccessor.getTableState(MetaTableAccessor.java:1076) at org.apache.hadoop.hbase.master.TableStateManager.readMetaState(TableStateManager.java:213) at org.apache.hadoop.hbase.master.TableStateManager.getTableState(TableStateManager.java:191) at org.apache.hadoop.hbase.master.assignment.AssignProcedure.startTransition(AssignProcedure.java:163) at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:295) at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:85) at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:845) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1453) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1221) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:75) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1741) {code} After lots of these, I think I eventually kill'ed the Master and restarted it. When it came back, I got all of the corrupt pv2 wal errors: {noformat} 2018-04-12 22:05:24,050 INFO [master/ctr-e138-1518143905142-221855-01-02:16000] util.FSHDFSUtils: Recover lease on dfs file hdfs://ctr-e138-1518143905142-221855-01-02.hwx.site:8020/user/hbase/data/MasterProcWALs/pv2-0008.log 2018-04-12 22:05:24,054 INFO [master/ctr-e138-1518143905142-221855-01-02:16000] util.FSHDFSUtils: Recovered lease, attempt=0 on file=hdfs://ctr-e138-1518143905142-221855-01-02.hwx.site:8020/user/hbase/data/MasterProcWALs/pv2-0008.log after 3ms 2018-04-12 22:05:24,103 INFO [master/ctr-e138-1518143905142-221855-01-02:16000] procedure2.ProcedureExecutor: Recovered WALProcedureStore lease in 55msec 2018-04-12 22:05:24,431 ERROR [master/ctr-e138-1518143905142-221855-01-02:16000] procedure2.ProcedureExecutor: Corrupt pid=3580, ppid=3534, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=TestTable, region=71fefffe6b5b3cf1cb6d3328a5a58690 2018-04-12 22:05:24,431 ERROR [master/ctr-e138-1518143905142-221855-01-02:16000] procedure2.ProcedureExecutor: Corrupt pid=3586, ppid=3534, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=TestTable, region=0e5c6da974e1d3753ba0cb6a8f0db5b1 2018-04-12 22:05:24,431 ERROR [master/ctr-e138-1518143905142-221855-01-02:16000] procedure2.ProcedureExecutor: Corrupt pid=3593, ppid=3534, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=TestTable, region=f19fbb2563464bcafd528eeba863c10f 2018-04-12 22:05:24,431 ERROR [master/ctr-e138-1518143905142-221855-01-02:16000] procedure2.ProcedureExecutor: Corrupt pid=3592, ppid=3534, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=TestTable, region=7df42ee12cc6ff354144e7915e6f0c5a {noformat} Will attach the relevant master log. I have ITBLL running now, but will try to reproduce these steps and see if I can get something reliably happening. > [AMv2] Corrupt Procedure WAL file; procedure data stored out of order > -
[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
[ https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129073#comment-16129073 ] stack commented on HBASE-18152: --- Made this a standalone issue. > [AMv2] Corrupt Procedure WAL file; procedure data stored out of order > - > > Key: HBASE-18152 > URL: https://issues.apache.org/jira/browse/HBASE-18152 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Affects Versions: 2.0.0 >Reporter: stack >Assignee: stack >Priority: Critical > Fix For: 2.0.0 > > Attachments: HBASE-18152.master.001.patch, > pv2-0036.log, pv2-0047.log, > reading_bad_wal.patch > > > I've seen corruption from time-to-time testing. Its rare enough. Often we > can get over it but sometimes we can't. It took me a while to capture an > instance of corruption. Turns out we are write to the WAL out-of-order which > undoes a basic tenet; that WAL content is ordered in line w/ execution. > Below I'll post a corrupt WAL. > Looking at the write-side, there is a lot going on. I'm not clear on how we > could write out of order. Will try and get more insight. Meantime parking > this issue here to fill data into. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
[ https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16081631#comment-16081631 ] Hudson commented on HBASE-18152: FAILURE: Integrated in Jenkins build HBase-1.4 #803 (See [https://builds.apache.org/job/HBase-1.4/803/]) Revert "HBASE-18216 [AMv2] Workaround for HBASE-18152, corrupt procedure (apurtell: rev 844596e09e7961f506ab2e8bb4aee5557e1b4a44) * (edit) hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/wal/ProcedureWALFormatReader.java > [AMv2] Corrupt Procedure WAL file; procedure data stored out of order > - > > Key: HBASE-18152 > URL: https://issues.apache.org/jira/browse/HBASE-18152 > Project: HBase > Issue Type: Sub-task > Components: Region Assignment >Affects Versions: 2.0.0 >Reporter: stack >Assignee: stack >Priority: Critical > Fix For: 2.0.0 > > Attachments: HBASE-18152.master.001.patch, > pv2-0036.log, pv2-0047.log, > reading_bad_wal.patch > > > I've seen corruption from time-to-time testing. Its rare enough. Often we > can get over it but sometimes we can't. It took me a while to capture an > instance of corruption. Turns out we are write to the WAL out-of-order which > undoes a basic tenet; that WAL content is ordered in line w/ execution. > Below I'll post a corrupt WAL. > Looking at the write-side, there is a lot going on. I'm not clear on how we > could write out of order. Will try and get more insight. Meantime parking > this issue here to fill data into. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
[ https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071058#comment-16071058 ] Hudson commented on HBASE-18152: FAILURE: Integrated in Jenkins build HBase-1.4 #796 (See [https://builds.apache.org/job/HBase-1.4/796/]) HBASE-18216 [AMv2] Workaround for HBASE-18152, corrupt procedure WAL (stack: rev 8ce3f49f8034417734ef1a3e6c624a601a22ace9) * (edit) hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/wal/ProcedureWALFormatReader.java > [AMv2] Corrupt Procedure WAL file; procedure data stored out of order > - > > Key: HBASE-18152 > URL: https://issues.apache.org/jira/browse/HBASE-18152 > Project: HBase > Issue Type: Sub-task > Components: Region Assignment >Affects Versions: 2.0.0 >Reporter: stack >Assignee: stack >Priority: Critical > Fix For: 2.0.0 > > Attachments: HBASE-18152.master.001.patch, > pv2-0036.log, pv2-0047.log, > reading_bad_wal.patch > > > I've seen corruption from time-to-time testing. Its rare enough. Often we > can get over it but sometimes we can't. It took me a while to capture an > instance of corruption. Turns out we are write to the WAL out-of-order which > undoes a basic tenet; that WAL content is ordered in line w/ execution. > Below I'll post a corrupt WAL. > Looking at the write-side, there is a lot going on. I'm not clear on how we > could write out of order. Will try and get more insight. Meantime parking > this issue here to fill data into. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
[ https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16052644#comment-16052644 ] stack commented on HBASE-18152: --- Looks like we have a version of this problem in branch-1 too. This is from a [~tsuna] 1.3.1 log: {code} 2017-06-09 01:03:34,499 ERROR [r12s3:9102.activeMasterManager] procedure2.ProcedureExecutor: corrupted procedure: Procedure=org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure (id=96, owner=, state=RUNNABLE, startTime=6480hrs, 32mins, 51sec ago, lastUpdate=6480hrs, 32mins, 51sec ago) 2017-06-09 01:03:34,499 ERROR [r12s3:9102.activeMasterManager] procedure2.ProcedureExecutor: corrupted procedure: Procedure=org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure (id=15, owner=, state=RUNNABLE, startTime=7032hrs, 28mins, 23sec ago, lastUpdate=7032hrs, 28mins, 23sec ago) 2017-06-09 01:03:34,499 ERROR [r12s3:9102.activeMasterManager] procedure2.ProcedureExecutor: corrupted procedure: Procedure=org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure (id=77, owner=, state=RUNNABLE, startTime=7032hrs, 21mins, 11sec ago, lastUpdate=7032hrs, 21mins, 11sec ago) {code} > [AMv2] Corrupt Procedure WAL file; procedure data stored out of order > - > > Key: HBASE-18152 > URL: https://issues.apache.org/jira/browse/HBASE-18152 > Project: HBase > Issue Type: Sub-task > Components: Region Assignment >Affects Versions: 2.0.0 >Reporter: stack >Assignee: stack >Priority: Critical > Fix For: 2.0.0 > > Attachments: HBASE-18152.master.001.patch, > pv2-0036.log, pv2-0047.log, > reading_bad_wal.patch > > > I've seen corruption from time-to-time testing. Its rare enough. Often we > can get over it but sometimes we can't. It took me a while to capture an > instance of corruption. Turns out we are write to the WAL out-of-order which > undoes a basic tenet; that WAL content is ordered in line w/ execution. > Below I'll post a corrupt WAL. > Looking at the write-side, there is a lot going on. I'm not clear on how we > could write out of order. Will try and get more insight. Meantime parking > this issue here to fill data into. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
[ https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048833#comment-16048833 ] Hudson commented on HBASE-18152: FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #3192 (See [https://builds.apache.org/job/HBase-Trunk_matrix/3192/]) HBASE-18216 [AMv2] Workaround for HBASE-18152, corrupt procedure WAL (stack: rev 0b43353bf76f19e020e2831691a832722b590915) * (edit) hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/wal/ProcedureWALFormatReader.java HBASE-18216 [AMv2] Workaround for HBASE-18152, corrupt procedure WAL; (stack: rev 550b6c585e0390bc80516e64df8bd1a3a6e10e23) * (edit) hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/wal/ProcedureWALFormatReader.java > [AMv2] Corrupt Procedure WAL file; procedure data stored out of order > - > > Key: HBASE-18152 > URL: https://issues.apache.org/jira/browse/HBASE-18152 > Project: HBase > Issue Type: Sub-task > Components: Region Assignment >Affects Versions: 2.0.0 >Reporter: stack >Assignee: stack >Priority: Critical > Fix For: 2.0.0 > > Attachments: HBASE-18152.master.001.patch, > pv2-0036.log, pv2-0047.log, > reading_bad_wal.patch > > > I've seen corruption from time-to-time testing. Its rare enough. Often we > can get over it but sometimes we can't. It took me a while to capture an > instance of corruption. Turns out we are write to the WAL out-of-order which > undoes a basic tenet; that WAL content is ordered in line w/ execution. > Below I'll post a corrupt WAL. > Looking at the write-side, there is a lot going on. I'm not clear on how we > could write out of order. Will try and get more insight. Meantime parking > this issue here to fill data into. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
[ https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048805#comment-16048805 ] Hudson commented on HBASE-18152: FAILURE: Integrated in Jenkins build HBase-2.0 #40 (See [https://builds.apache.org/job/HBase-2.0/40/]) HBASE-18216 [AMv2] Workaround for HBASE-18152, corrupt procedure WAL (stack: rev 1eb2f2593d00b6e92f31fe9811fdd656fcf2847b) * (edit) hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/wal/ProcedureWALFormatReader.java HBASE-18216 [AMv2] Workaround for HBASE-18152, corrupt procedure WAL; (stack: rev 16a5a9db3edf3594b492f548b78bf8342884373f) * (edit) hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/wal/ProcedureWALFormatReader.java > [AMv2] Corrupt Procedure WAL file; procedure data stored out of order > - > > Key: HBASE-18152 > URL: https://issues.apache.org/jira/browse/HBASE-18152 > Project: HBase > Issue Type: Sub-task > Components: Region Assignment >Affects Versions: 2.0.0 >Reporter: stack >Assignee: stack >Priority: Critical > Fix For: 2.0.0 > > Attachments: HBASE-18152.master.001.patch, > pv2-0036.log, pv2-0047.log, > reading_bad_wal.patch > > > I've seen corruption from time-to-time testing. Its rare enough. Often we > can get over it but sometimes we can't. It took me a while to capture an > instance of corruption. Turns out we are write to the WAL out-of-order which > undoes a basic tenet; that WAL content is ordered in line w/ execution. > Below I'll post a corrupt WAL. > Looking at the write-side, there is a lot going on. I'm not clear on how we > could write out of order. Will try and get more insight. Meantime parking > this issue here to fill data into. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
[ https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048704#comment-16048704 ] stack commented on HBASE-18152: --- Pushed workaround over in HBASE-18216. Need to figure root cause still. > [AMv2] Corrupt Procedure WAL file; procedure data stored out of order > - > > Key: HBASE-18152 > URL: https://issues.apache.org/jira/browse/HBASE-18152 > Project: HBase > Issue Type: Sub-task > Components: Region Assignment >Affects Versions: 2.0.0 >Reporter: stack >Assignee: stack >Priority: Critical > Fix For: 2.0.0 > > Attachments: HBASE-18152.master.001.patch, > pv2-0036.log, pv2-0047.log, > reading_bad_wal.patch > > > I've seen corruption from time-to-time testing. Its rare enough. Often we > can get over it but sometimes we can't. It took me a while to capture an > instance of corruption. Turns out we are write to the WAL out-of-order which > undoes a basic tenet; that WAL content is ordered in line w/ execution. > Below I'll post a corrupt WAL. > Looking at the write-side, there is a lot going on. I'm not clear on how we > could write out of order. Will try and get more insight. Meantime parking > this issue here to fill data into. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
[ https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045790#comment-16045790 ] stack commented on HBASE-18152: --- Looking at the corruption in 36.log, we are indeed missing stuff off the end of the WAL. The missing entries look like they would have been on the end of the WAL. There is a second at least between their event and the master crash. I presume in this second its thread messing around trying to persist to the WAL store. We are writing edits out-of-order in some circumstance (see experience w/ previous log and the attached workaround patch which helps...). This makes for possibility of their being holes if we expect events in-order as the smart verification check does. Need to dig in on the way we log. > [AMv2] Corrupt Procedure WAL file; procedure data stored out of order > - > > Key: HBASE-18152 > URL: https://issues.apache.org/jira/browse/HBASE-18152 > Project: HBase > Issue Type: Sub-task > Components: Region Assignment >Affects Versions: 2.0.0 >Reporter: stack >Assignee: stack >Priority: Critical > Fix For: 2.0.0 > > Attachments: HBASE-17537.master.002.patch, > HBASE-18152.master.001.patch, pv2-0036.log, > pv2-0047.log, reading_bad_wal.patch > > > I've seen corruption from time-to-time testing. Its rare enough. Often we > can get over it but sometimes we can't. It took me a while to capture an > instance of corruption. Turns out we are write to the WAL out-of-order which > undoes a basic tenet; that WAL content is ordered in line w/ execution. > Below I'll post a corrupt WAL. > Looking at the write-side, there is a lot going on. I'm not clear on how we > could write out of order. Will try and get more insight. Meantime parking > this issue here to fill data into. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
[ https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1603#comment-1603 ] stack commented on HBASE-18152: --- This is what corrupt procedures show up as: {code} 2017-04-17 22:34:50,044 ERROR [ve0524:16000.masterManager] procedure2.ProcedureExecutor: Corrupt Procedure=org.apache.hadoop.hbase.master.assignment.AssignProcedure (id=16, parent=7, owner=stack, state=RUNNABLE, submittedTime=14sec ago, lastUpdate=14sec ago) 2017-04-17 22:34:50,044 ERROR [ve0524:16000.masterManager] procedure2.ProcedureExecutor: Corrupt Procedure=org.apache.hadoop.hbase.master.assignment.AssignProcedure (id=13, parent=7, owner=stack, state=RUNNABLE, submittedTime=14sec ago, lastUpdate=14sec ago) 2017-04-17 22:34:50,044 ERROR [ve0524:16000.masterManager] procedure2.ProcedureExecutor: Corrupt Procedure=org.apache.hadoop.hbase.master.assignment.AssignProcedure (id=11, parent=7, owner=stack, state=RUNNABLE, submittedTime=14sec ago, lastUpdate=14sec ago) {code} Attached patch is a workaround that at read time checks that the new-found entry is for sure 'increasing' when compared to current entry we have for a Procedure. If not, we WARN and drop it. This workaround is good for the corruption shown here. Will run with it to see if I can find other corruption types and to see if I can figure how we are writing out of order. > [AMv2] Corrupt Procedure WAL file; procedure data stored out of order > - > > Key: HBASE-18152 > URL: https://issues.apache.org/jira/browse/HBASE-18152 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Affects Versions: 2.0.0 >Reporter: stack >Assignee: stack >Priority: Critical > Fix For: 2.0.0 > > Attachments: HBASE-18152.master.001.patch, > pv2-0047.log, reading_bad_wal.patch > > > I've seen corruption from time-to-time testing. Its rare enough. Often we > can get over it but sometimes we can't. It took me a while to capture an > instance of corruption. Turns out we are write to the WAL out-of-order which > undoes a basic tenet; that WAL content is ordered in line w/ execution. > Below I'll post a corrupt WAL. > Looking at the write-side, there is a lot going on. I'm not clear on how we > could write out of order. Will try and get more insight. Meantime parking > this issue here to fill data into. -- This message was sent by Atlassian JIRA (v6.3.15#6346)