[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order

2018-08-23 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16590273#comment-16590273
 ] 

stack commented on HBASE-18152:
---

Thank you [~allan163] It might be a different corruption type. Hopefully 
there'll be a clue when we dig in on the WALs.

I've not seen it since the fix for the race condition. Cross-fingers.

> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> -
>
> Key: HBASE-18152
> URL: https://issues.apache.org/jira/browse/HBASE-18152
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: stack
>Priority: Critical
> Fix For: 3.0.0
>
> Attachments: 
> 0001-TestWALProcedureExecutore-order-checking-test-that-d.patch, 
> HBASE-18152.master.001.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-221855-01-02.hwx.site.log.gz, 
> pv2-0036.log, pv2-0047.log, 
> reading_bad_wal.patch
>
>
> I've seen corruption from time-to-time testing.  Its rare enough. Often we 
> can get over it but sometimes we can't. It took me a while to capture an 
> instance of corruption. Turns out we are write to the WAL out-of-order which 
> undoes a basic tenet; that WAL content is ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write-side, there is a lot going on. I'm not clear on how we 
> could write out of order. Will try and get more insight. Meantime parking 
> this issue here to fill data into.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order

2018-08-23 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589954#comment-16589954
 ] 

Allan Yang commented on HBASE-18152:


{quote}
Do you have the WAL files still? 
{quote}
Sorry, I didn't keep them. I will keep them next time and have a look.

> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> -
>
> Key: HBASE-18152
> URL: https://issues.apache.org/jira/browse/HBASE-18152
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: stack
>Priority: Critical
> Fix For: 3.0.0
>
> Attachments: 
> 0001-TestWALProcedureExecutore-order-checking-test-that-d.patch, 
> HBASE-18152.master.001.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-221855-01-02.hwx.site.log.gz, 
> pv2-0036.log, pv2-0047.log, 
> reading_bad_wal.patch
>
>
> I've seen corruption from time-to-time testing.  Its rare enough. Often we 
> can get over it but sometimes we can't. It took me a while to capture an 
> instance of corruption. Turns out we are write to the WAL out-of-order which 
> undoes a basic tenet; that WAL content is ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write-side, there is a lot going on. I'm not clear on how we 
> could write out of order. Will try and get more insight. Meantime parking 
> this issue here to fill data into.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order

2018-08-22 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588978#comment-16588978
 ] 

stack commented on HBASE-18152:
---

(Arr)

[~allan163] Do you have the WAL files still? If so, attach and I can take a 
look. Or if you want to look, there is a pretty-printer for the master wal 
procedure files. What I've seen in the above is that the order in which we 
write the procedures into the WAL gets flipped... usually an update on a 
particular procedure has stored step N+1 before step N.

> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> -
>
> Key: HBASE-18152
> URL: https://issues.apache.org/jira/browse/HBASE-18152
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: stack
>Priority: Critical
> Fix For: 3.0.0
>
> Attachments: 
> 0001-TestWALProcedureExecutore-order-checking-test-that-d.patch, 
> HBASE-18152.master.001.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-221855-01-02.hwx.site.log.gz, 
> pv2-0036.log, pv2-0047.log, 
> reading_bad_wal.patch
>
>
> I've seen corruption from time-to-time testing.  Its rare enough. Often we 
> can get over it but sometimes we can't. It took me a while to capture an 
> instance of corruption. Turns out we are write to the WAL out-of-order which 
> undoes a basic tenet; that WAL content is ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write-side, there is a lot going on. I'm not clear on how we 
> could write out of order. Will try and get more insight. Meantime parking 
> this issue here to fill data into.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order

2018-08-22 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588571#comment-16588571
 ] 

Allan Yang commented on HBASE-18152:


My test env reproduced this issue again with HBASE-20939. Maybe this issue is 
not totally fixed by it.
{code}
2018-08-22 15:57:35,010 ERROR [Thread-14] procedure2.ProcedureExecutor(388): 
Corrupt pid=16633, ppid=16510, state=WAITING_TIMEOUT:REGION_TRANSITION_QUEUE, 
hasLock=false; AssignProcedure table=IntegrationTestBigLinkedList, 
region=3407f9af348ff198a3bdd3a9ae7db02c
{code}

> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> -
>
> Key: HBASE-18152
> URL: https://issues.apache.org/jira/browse/HBASE-18152
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: stack
>Priority: Critical
> Fix For: 3.0.0
>
> Attachments: 
> 0001-TestWALProcedureExecutore-order-checking-test-that-d.patch, 
> HBASE-18152.master.001.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-221855-01-02.hwx.site.log.gz, 
> pv2-0036.log, pv2-0047.log, 
> reading_bad_wal.patch
>
>
> I've seen corruption from time-to-time testing.  Its rare enough. Often we 
> can get over it but sometimes we can't. It took me a while to capture an 
> instance of corruption. Turns out we are write to the WAL out-of-order which 
> undoes a basic tenet; that WAL content is ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write-side, there is a lot going on. I'm not clear on how we 
> could write out of order. Will try and get more insight. Meantime parking 
> this issue here to fill data into.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order

2018-07-26 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559196#comment-16559196
 ] 

stack commented on HBASE-18152:
---

I think HBASE-20939 uncovers the root of corruption I've been trying to figure 
in here.

> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> -
>
> Key: HBASE-18152
> URL: https://issues.apache.org/jira/browse/HBASE-18152
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: stack
>Priority: Critical
> Fix For: 3.0.0
>
> Attachments: 
> 0001-TestWALProcedureExecutore-order-checking-test-that-d.patch, 
> HBASE-18152.master.001.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-221855-01-02.hwx.site.log.gz, 
> pv2-0036.log, pv2-0047.log, 
> reading_bad_wal.patch
>
>
> I've seen corruption from time-to-time testing.  Its rare enough. Often we 
> can get over it but sometimes we can't. It took me a while to capture an 
> instance of corruption. Turns out we are write to the WAL out-of-order which 
> undoes a basic tenet; that WAL content is ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write-side, there is a lot going on. I'm not clear on how we 
> could write out of order. Will try and get more insight. Meantime parking 
> this issue here to fill data into.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order

2018-07-24 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555146#comment-16555146
 ] 

stack commented on HBASE-18152:
---

A few notes:

 * I spent a good part of today trying to reproduce but was unable (see  
[^0001-TestWALProcedureExecutore-order-checking-test-that-d.patch] ). Even with 
high-concurrency and artificial friction, 160 concurrent worker threads always 
wrote in-order against default 16 slots.
 * Interesting is that a Worker thread will usually try and run a Procedure 
through to the end UNLESS it suspends or returns true from its 
isYieldAfterExecutionStep implementation. In this case, the AssignProcedure on 
the *Staring* step suspends itself as part of normal operation; a thread in the 
AssignmentManager is in charge of wakening the assign. This is how we make sure 
meta goes out first and how we batch up assigning. This technique makes the 
AssignProcedure in particular susceptible to a thread-switch when we move to 
*Dispatch*.
 * Let me see what would take to get rid of this WALProcedureStore. It is doing 
what we do elsewhere writing WAL but using different primitives (Conditions) 
with recovery and WAL rolling.
 



> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> -
>
> Key: HBASE-18152
> URL: https://issues.apache.org/jira/browse/HBASE-18152
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: stack
>Priority: Critical
> Fix For: 3.0.0
>
> Attachments: 
> 0001-TestWALProcedureExecutore-order-checking-test-that-d.patch, 
> HBASE-18152.master.001.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-221855-01-02.hwx.site.log.gz, 
> pv2-0036.log, pv2-0047.log, 
> reading_bad_wal.patch
>
>
> I've seen corruption from time-to-time testing.  Its rare enough. Often we 
> can get over it but sometimes we can't. It took me a while to capture an 
> instance of corruption. Turns out we are write to the WAL out-of-order which 
> undoes a basic tenet; that WAL content is ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write-side, there is a lot going on. I'm not clear on how we 
> could write out of order. Will try and get more insight. Meantime parking 
> this issue here to fill data into.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order

2018-07-24 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554562#comment-16554562
 ] 

Josh Elser commented on HBASE-18152:


{quote}Looking at this was procedure store I'm currently trying to figure why 
we have a third way of writing wals and was thinking of putting in place fshlog 
or asyncfs.
{quote}
Strong +1 on that. I know Enis was frustrated about this way back when.
{quote}I appreciate the sort suggestion. I am not sure the edits we're part of 
the same batch so am unsure it would help. Might be worth trying though?
{quote}
Yeah, could be hair-brained. Not sure yet.

I'll give it another look and see what I can come up with. I was trying to come 
up with something that doesn't just funnel our proc executors into a single 
synchronized block. But... maybe I don't need to be worried about that? I don't 
have a good feeling for how critical this part is – simpler fix might be to 
just hoist up the synchronization a level higher (e.g. out of pushData and into 
insert, update, delete)

> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> -
>
> Key: HBASE-18152
> URL: https://issues.apache.org/jira/browse/HBASE-18152
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: stack
>Priority: Critical
> Fix For: 3.0.0
>
> Attachments: HBASE-18152.master.001.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-221855-01-02.hwx.site.log.gz, 
> pv2-0036.log, pv2-0047.log, 
> reading_bad_wal.patch
>
>
> I've seen corruption from time-to-time testing.  Its rare enough. Often we 
> can get over it but sometimes we can't. It took me a while to capture an 
> instance of corruption. Turns out we are write to the WAL out-of-order which 
> undoes a basic tenet; that WAL content is ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write-side, there is a lot going on. I'm not clear on how we 
> could write out of order. Will try and get more insight. Meantime parking 
> this issue here to fill data into.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order

2018-07-24 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554540#comment-16554540
 ] 

stack commented on HBASE-18152:
---

[~elserj] thank you for jumping in. I appreciate being able to chat on this.  
I'm trying to make a test but it's a bit hard manufacturing. Looking at this 
was procedure store I'm currently trying to figure why we have a third way of 
writing wals and was thinking of putting in place fshlog or asyncfs.

I appreciate the sort suggestion. I am not sure the edits we're part of the 
same batch so am unsure it would help.  Might be worth trying though?

Any other ideas welcome.

Thanks.

> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> -
>
> Key: HBASE-18152
> URL: https://issues.apache.org/jira/browse/HBASE-18152
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: stack
>Priority: Critical
> Fix For: 3.0.0
>
> Attachments: HBASE-18152.master.001.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-221855-01-02.hwx.site.log.gz, 
> pv2-0036.log, pv2-0047.log, 
> reading_bad_wal.patch
>
>
> I've seen corruption from time-to-time testing.  Its rare enough. Often we 
> can get over it but sometimes we can't. It took me a while to capture an 
> instance of corruption. Turns out we are write to the WAL out-of-order which 
> undoes a basic tenet; that WAL content is ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write-side, there is a lot going on. I'm not clear on how we 
> could write out of order. Will try and get more insight. Meantime parking 
> this issue here to fill data into.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order

2018-07-24 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554529#comment-16554529
 ] 

Josh Elser commented on HBASE-18152:


{quote}Notice how they are on different worker threads. The procedures 
"execute" in the correct order but the edits land in the store in another 
order, something quiet possible given the distance between execute, 
post-execution cleanup, and store.
{quote}
Good find, sir!
{quote}Looking at code, when we write we first fill one of a fixed number of 
"slots" with the serialized update of Procedure STEP (See 
WALProcedureStore#insert/#update). We then try to 'push' the data to the WAL 
(See WPS#pushData). Inside in pushData is a reentrant lock to ensure 
single-threaded-access updating WAL. During a startup, we can burst to have 10x 
the Workers we usually have. The "slot" number remains constant. So contention 
for slots and then to obtain the reentrant lock. Reentrant locks make no 
guarantees around the order in which threads get scheduled (This is not a 
'fair' reentrant lock and even then, still no guarantees) so the code here is 
suspect.
{quote}
Totally makes sense. But I'm not sure what a "fix" would look like (not sure if 
you've started thinking about that yet, sorry to derail if you haven't)

Right now, each slot holds the serialized Procedure and then the syncLoop will 
take all slots and write them to the file. What if we make the {{ByteSlot[]}} 
instead a {{TreeMap}}, locally sorting each batch of slots 
(really, Procedures), that we're going to group commit to the WAL?

I think that would work since we our Slot never holds more than one Procedure 
at a time (multiple subProcs, but I'm not sure if that matters?).

Tell me to pump the brakes if I'm not being helpful :)

> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> -
>
> Key: HBASE-18152
> URL: https://issues.apache.org/jira/browse/HBASE-18152
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: stack
>Priority: Critical
> Fix For: 3.0.0
>
> Attachments: HBASE-18152.master.001.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-221855-01-02.hwx.site.log.gz, 
> pv2-0036.log, pv2-0047.log, 
> reading_bad_wal.patch
>
>
> I've seen corruption from time-to-time testing.  Its rare enough. Often we 
> can get over it but sometimes we can't. It took me a while to capture an 
> instance of corruption. Turns out we are write to the WAL out-of-order which 
> undoes a basic tenet; that WAL content is ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write-side, there is a lot going on. I'm not clear on how we 
> could write out of order. Will try and get more insight. Meantime parking 
> this issue here to fill data into.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order

2018-07-24 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553919#comment-16553919
 ] 

stack commented on HBASE-18152:
---

ProcedureStore depends on updates being ordered. For the above case, the two 
steps -- "Starting" and "Dispatch" -- were run by different Worker threads 
(PEWorker-8 and then PEWorker-10). The two steps are separated in time by ~70ms 
and to run, each needs to take the lock on the region. But the updates went 
into the WAL flipped from how they ran chronologically and how they appeared in 
the log above.

Looking at code, when we write we first fill one of a fixed number of "slots" 
with the serialized update of Procedure STEP (See 
WALProcedureStore#insert/#update). We then try to 'push' the data to the WAL 
(See WPS#pushData). Inside in pushData is a reentrant lock to ensure 
single-threaded-access updating WAL. During a startup, we can burst to have 10x 
the Workers we usually have. The "slot" number remains constant. So contention 
for slots and then to obtain the reentrant lock. Reentrant locks make no 
guarantees around the order in which threads get scheduled (This is not a 
'fair' reentrant lock and even then, still no guarantees) so the code here is 
suspect.

Trying to write a test to prove flawed ordering/corruption.

> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> -
>
> Key: HBASE-18152
> URL: https://issues.apache.org/jira/browse/HBASE-18152
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: stack
>Priority: Critical
> Fix For: 3.0.0
>
> Attachments: HBASE-18152.master.001.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-221855-01-02.hwx.site.log.gz, 
> pv2-0036.log, pv2-0047.log, 
> reading_bad_wal.patch
>
>
> I've seen corruption from time-to-time testing.  Its rare enough. Often we 
> can get over it but sometimes we can't. It took me a while to capture an 
> instance of corruption. Turns out we are write to the WAL out-of-order which 
> undoes a basic tenet; that WAL content is ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write-side, there is a lot going on. I'm not clear on how we 
> could write out of order. Will try and get more insight. Meantime parking 
> this issue here to fill data into.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order

2018-07-20 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551344#comment-16551344
 ] 

stack commented on HBASE-18152:
---

Saw this again. Here is current example:

{code}
2018-07-17 22:33:03,678 WARN  [Thread-18] wal.ProcedureWALFormatReader: NOT 
INCREASING! current=class_name: 
"org.apache.hadoop.hbase.master.assignment.AssignProcedure"
parent_id: 3902
proc_id: 3906
submitted_time: 1531890545613
owner: "stack"
state: RUNNABLE
stack_id: 6
stack_id: 16
last_update: 1531890545980
state_message {
  type_url: "type.googleapis.com/hbase.pb.AssignRegionStateData"
  value: 
"\b\002\022F\b\200\336\354\331\312,\022\'\n\adefault\022\034IntegrationTestBigLinkedList\032\004\230\344(~\"\b\234q\307\034q\307\034h(\\0008\000"
}
, candidate=class_name: 
"org.apache.hadoop.hbase.master.assignment.AssignProcedure"
parent_id: 3902
proc_id: 3906
submitted_time: 1531890545613
owner: "stack"
state: RUNNABLE
stack_id: 6
last_update: 1531890545812
state_message {
  type_url: "type.googleapis.com/hbase.pb.AssignRegionStateData"
  value: 
"\b\002\022F\b\200\336\354\331\312,\022\'\n\adefault\022\034IntegrationTestBigLinkedList\032\004\230\344(~\"\b\234q\307\034q\307\034h(\\0008\000"
}
...
{code}

We have written the entries backwards indeed.

Looking at the logging for this particular procedure, I see the below entries 
which correspond to the last_update times:

{code}

2018-07-17 22:09:05,812 INFO  [PEWorker-8] assignment.AssignProcedure: Starting 
pid=3906, ppid=3902, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=IntegrationTestBigLinkedList, region=26d8fa5172ffdf3e89319e4b73fbf164; 
rit=OFFLINE, location=ve0538.halxg.cloudera.com,16020,1531890104902; 
forceNewPlan=false, retain=true

2018-07-17 22:09:05,980 INFO  [PEWorker-10] 
assignment.RegionTransitionProcedure: Dispatch pid=3906, ppid=3902, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure 
table=IntegrationTestBigLinkedList, region=26d8fa5172ffdf3e89319e4b73fbf164; 
rit=OPENING, location=ve0536.halxg.cloudera.com,16020,1531889066114
{code}

Notice how they are on different worker threads. The procedures "execute" in 
the correct order but the edits land in the store in another order, something 
quiet possible given the distance between execute, post-execution cleanup, and 
store.



> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> -
>
> Key: HBASE-18152
> URL: https://issues.apache.org/jira/browse/HBASE-18152
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: stack
>Priority: Critical
> Fix For: 3.0.0
>
> Attachments: HBASE-18152.master.001.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-221855-01-02.hwx.site.log.gz, 
> pv2-0036.log, pv2-0047.log, 
> reading_bad_wal.patch
>
>
> I've seen corruption from time-to-time testing.  Its rare enough. Often we 
> can get over it but sometimes we can't. It took me a while to capture an 
> instance of corruption. Turns out we are write to the WAL out-of-order which 
> undoes a basic tenet; that WAL content is ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write-side, there is a lot going on. I'm not clear on how we 
> could write out of order. Will try and get more insight. Meantime parking 
> this issue here to fill data into.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order

2018-06-29 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16528042#comment-16528042
 ] 

stack commented on HBASE-18152:
---

I just manufactured a corrupt MasterWALProc by kill -9'ing Master (odd because 
this was a cluster that had just been doing hours of ITBLL with server 
killings). Recovery of the last WAL seems pretty crappy too. We seem to be 
missing 100 pids off the end. Is it the case that any WAL cut off on its tail 
generates "NOT INCREASING" messages? Let me check...

> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> -
>
> Key: HBASE-18152
> URL: https://issues.apache.org/jira/browse/HBASE-18152
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: stack
>Priority: Critical
> Fix For: 3.0.0
>
> Attachments: HBASE-18152.master.001.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-221855-01-02.hwx.site.log.gz, 
> pv2-0036.log, pv2-0047.log, 
> reading_bad_wal.patch
>
>
> I've seen corruption from time-to-time testing.  Its rare enough. Often we 
> can get over it but sometimes we can't. It took me a while to capture an 
> instance of corruption. Turns out we are write to the WAL out-of-order which 
> undoes a basic tenet; that WAL content is ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write-side, there is a lot going on. I'm not clear on how we 
> could write out of order. Will try and get more insight. Meantime parking 
> this issue here to fill data into.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order

2018-06-20 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518554#comment-16518554
 ] 

stack commented on HBASE-18152:
---

Had another incidence here on internal test rig.

Procedure updates seem to have gone in in reversed order The 'last update' 
time and the list of stack ids -- the sub-procedures the root needs to roll 
back are out of order with the first update having two stack ids where the one 
that comes after has one (see example below) A bunch of sub-procedure 
Assigns of a SCP were written this way.  I'm having difficulty because the 
update to the Store is under region lock so the updates should be in order 
but timeout doesn't have lock.

The incident that brought this on does not show in the logs because a STUCK 
spew rolled away the logs. In another issue I tune down the STUCK logging. 
Hopefully it helps. Would like to figure what the 'event' was. Would help 
narrow which path brought this about (also my master WAL doesn't show the 
events that were a problem... They seem to have been cut off... the file was 
not closed nicely).

{code}
2018-06-14 12:57:16,956 WARN 
org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormatReader: NOT 
INCREASING! current=class_name: 
"org.apache.hadoop.hbase.master.assignment.AssignProcedure"
parent_id: 864403
proc_id: 916353
submitted_time: 1528730391463
owner: "hbase"
state: RUNNABLE
stack_id: 251
stack_id: 332
last_update: 1528730393598
state_message {
  type_url: "type.googleapis.com/hbase.pb.AssignRegionStateData"
  value: 
"\b\002\022Y\b\201\327\353\230\271,\0226\n\adefault\022+IntegrationTestBigLinkedList_20180524090611\032\b\274\240D\345\226\ad\020\"\b\274\246\375\243\rc\370\221(\\0008\000"
}
, candidate=class_name: 
"org.apache.hadoop.hbase.master.assignment.AssignProcedure"
parent_id: 864403
proc_id: 916353
submitted_time: 1528730391463
owner: "hbase"
state: RUNNABLE
stack_id: 251
last_update: 1528730393404
state_message {
  type_url: "type.googleapis.com/hbase.pb.AssignRegionStateData"
  value: 
"\b\002\022Y\b\201\327\353\230\271,\0226\n\adefault\022+IntegrationTestBigLinkedList_20180524090611\032\b\274\240D\345\226\ad\020\"\b\274\246\375\243\rc\370\221(\\0008\000"
}
{code}



> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> -
>
> Key: HBASE-18152
> URL: https://issues.apache.org/jira/browse/HBASE-18152
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: stack
>Priority: Critical
> Fix For: 3.0.0
>
> Attachments: HBASE-18152.master.001.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-221855-01-02.hwx.site.log.gz, 
> pv2-0036.log, pv2-0047.log, 
> reading_bad_wal.patch
>
>
> I've seen corruption from time-to-time testing.  Its rare enough. Often we 
> can get over it but sometimes we can't. It took me a while to capture an 
> instance of corruption. Turns out we are write to the WAL out-of-order which 
> undoes a basic tenet; that WAL content is ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write-side, there is a lot going on. I'm not clear on how we 
> could write out of order. Will try and get more insight. Meantime parking 
> this issue here to fill data into.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order

2018-04-13 Thread Josh Elser (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437836#comment-16437836
 ] 

Josh Elser commented on HBASE-18152:


Oh, exciting. I have the pv2 wal still in oldWALs too. Let me look at that.

> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> -
>
> Key: HBASE-18152
> URL: https://issues.apache.org/jira/browse/HBASE-18152
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: stack
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: HBASE-18152.master.001.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-221855-01-02.hwx.site.log.gz, 
> pv2-0036.log, pv2-0047.log, 
> reading_bad_wal.patch
>
>
> I've seen corruption from time-to-time testing.  Its rare enough. Often we 
> can get over it but sometimes we can't. It took me a while to capture an 
> instance of corruption. Turns out we are write to the WAL out-of-order which 
> undoes a basic tenet; that WAL content is ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write-side, there is a lot going on. I'm not clear on how we 
> could write out of order. Will try and get more insight. Meantime parking 
> this issue here to fill data into.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order

2018-04-13 Thread Josh Elser (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437835#comment-16437835
 ] 

Josh Elser commented on HBASE-18152:


I think I saw something similar to this, but it may be less of the bug Stack 
outlines and more self-inflicted. I was trying out 2.0.0-rc0 on Hadoop 3.

The version of Hadoop3 I had installed doesn't work with the asyncwal, which I 
noticed after tryign to start up the first time. Then, I realized I needed to 
set the meta-wal provider configuration option in addition to the normal wal 
provider option. IIRC, {{stop-hbase.sh}} hung so I ended up {{kill}}'ing the 
process. This caused a bunch of WAL split failures in the Master to get logged 
around {{2018-04-12 21:43:01}}.

Interestingly, the Master eventually started complaining around the region 
transition that the connection was null (because the server was stopping, I 
assume)
{code:java}
2018-04-12 21:43:05,781 WARN  [PEWorker-8] 
assignment.RegionTransitionProcedure: Retryable error trying to transition: 
pid=3586, ppid=3534, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=TestTable, region=0e5c6da974e1d3753ba0cb6a8f0db5b1; rit=OPENING, 
location=ctr-e138-1518143905142-221855-01-09.hwx.site,16020,1523564755147
java.io.InterruptedIOException: Interrupted after 7 tries while maxAttempts=31
    at 
org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:173)
    at org.apache.hadoop.hbase.client.HTable.get(HTable.java:386)
    at org.apache.hadoop.hbase.client.HTable.get(HTable.java:360)
    at 
org.apache.hadoop.hbase.MetaTableAccessor.getTableState(MetaTableAccessor.java:1076)
    at 
org.apache.hadoop.hbase.master.TableStateManager.readMetaState(TableStateManager.java:213)
    at 
org.apache.hadoop.hbase.master.TableStateManager.getTableState(TableStateManager.java:191)
    at 
org.apache.hadoop.hbase.master.assignment.AssignProcedure.startTransition(AssignProcedure.java:163)
    at 
org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:295)
    at 
org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:85)
    at 
org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:845)
    at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1453)
    at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1221)
    at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:75)
    at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1741)
{code}
After lots of these, I think I eventually kill'ed the Master and restarted it. 
When it came back, I got all of the corrupt pv2 wal errors:
{noformat}
2018-04-12 22:05:24,050 INFO  
[master/ctr-e138-1518143905142-221855-01-02:16000] util.FSHDFSUtils: 
Recover lease on dfs file 
hdfs://ctr-e138-1518143905142-221855-01-02.hwx.site:8020/user/hbase/data/MasterProcWALs/pv2-0008.log
2018-04-12 22:05:24,054 INFO  
[master/ctr-e138-1518143905142-221855-01-02:16000] util.FSHDFSUtils: 
Recovered lease, attempt=0 on 
file=hdfs://ctr-e138-1518143905142-221855-01-02.hwx.site:8020/user/hbase/data/MasterProcWALs/pv2-0008.log
 after 3ms
2018-04-12 22:05:24,103 INFO  
[master/ctr-e138-1518143905142-221855-01-02:16000] 
procedure2.ProcedureExecutor: Recovered WALProcedureStore lease in 55msec
2018-04-12 22:05:24,431 ERROR 
[master/ctr-e138-1518143905142-221855-01-02:16000] 
procedure2.ProcedureExecutor: Corrupt pid=3580, ppid=3534, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=TestTable, 
region=71fefffe6b5b3cf1cb6d3328a5a58690
2018-04-12 22:05:24,431 ERROR 
[master/ctr-e138-1518143905142-221855-01-02:16000] 
procedure2.ProcedureExecutor: Corrupt pid=3586, ppid=3534, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=TestTable, 
region=0e5c6da974e1d3753ba0cb6a8f0db5b1
2018-04-12 22:05:24,431 ERROR 
[master/ctr-e138-1518143905142-221855-01-02:16000] 
procedure2.ProcedureExecutor: Corrupt pid=3593, ppid=3534, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=TestTable, 
region=f19fbb2563464bcafd528eeba863c10f
2018-04-12 22:05:24,431 ERROR 
[master/ctr-e138-1518143905142-221855-01-02:16000] 
procedure2.ProcedureExecutor: Corrupt pid=3592, ppid=3534, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=TestTable, 
region=7df42ee12cc6ff354144e7915e6f0c5a
{noformat}
Will attach the relevant master log. I have ITBLL running now, but will try to 
reproduce these steps and see if I can get something reliably happening.

> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> -

[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order

2017-08-16 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129073#comment-16129073
 ] 

stack commented on HBASE-18152:
---

Made this a standalone issue.

> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> -
>
> Key: HBASE-18152
> URL: https://issues.apache.org/jira/browse/HBASE-18152
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: stack
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: HBASE-18152.master.001.patch, 
> pv2-0036.log, pv2-0047.log, 
> reading_bad_wal.patch
>
>
> I've seen corruption from time-to-time testing.  Its rare enough. Often we 
> can get over it but sometimes we can't. It took me a while to capture an 
> instance of corruption. Turns out we are write to the WAL out-of-order which 
> undoes a basic tenet; that WAL content is ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write-side, there is a lot going on. I'm not clear on how we 
> could write out of order. Will try and get more insight. Meantime parking 
> this issue here to fill data into.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order

2017-07-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16081631#comment-16081631
 ] 

Hudson commented on HBASE-18152:


FAILURE: Integrated in Jenkins build HBase-1.4 #803 (See 
[https://builds.apache.org/job/HBase-1.4/803/])
Revert "HBASE-18216 [AMv2] Workaround for HBASE-18152, corrupt procedure 
(apurtell: rev 844596e09e7961f506ab2e8bb4aee5557e1b4a44)
* (edit) 
hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/wal/ProcedureWALFormatReader.java


> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> -
>
> Key: HBASE-18152
> URL: https://issues.apache.org/jira/browse/HBASE-18152
> Project: HBase
>  Issue Type: Sub-task
>  Components: Region Assignment
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: stack
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: HBASE-18152.master.001.patch, 
> pv2-0036.log, pv2-0047.log, 
> reading_bad_wal.patch
>
>
> I've seen corruption from time-to-time testing.  Its rare enough. Often we 
> can get over it but sometimes we can't. It took me a while to capture an 
> instance of corruption. Turns out we are write to the WAL out-of-order which 
> undoes a basic tenet; that WAL content is ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write-side, there is a lot going on. I'm not clear on how we 
> could write out of order. Will try and get more insight. Meantime parking 
> this issue here to fill data into.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order

2017-07-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071058#comment-16071058
 ] 

Hudson commented on HBASE-18152:


FAILURE: Integrated in Jenkins build HBase-1.4 #796 (See 
[https://builds.apache.org/job/HBase-1.4/796/])
HBASE-18216 [AMv2] Workaround for HBASE-18152, corrupt procedure WAL (stack: 
rev 8ce3f49f8034417734ef1a3e6c624a601a22ace9)
* (edit) 
hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/wal/ProcedureWALFormatReader.java


> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> -
>
> Key: HBASE-18152
> URL: https://issues.apache.org/jira/browse/HBASE-18152
> Project: HBase
>  Issue Type: Sub-task
>  Components: Region Assignment
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: stack
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: HBASE-18152.master.001.patch, 
> pv2-0036.log, pv2-0047.log, 
> reading_bad_wal.patch
>
>
> I've seen corruption from time-to-time testing.  Its rare enough. Often we 
> can get over it but sometimes we can't. It took me a while to capture an 
> instance of corruption. Turns out we are write to the WAL out-of-order which 
> undoes a basic tenet; that WAL content is ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write-side, there is a lot going on. I'm not clear on how we 
> could write out of order. Will try and get more insight. Meantime parking 
> this issue here to fill data into.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order

2017-06-16 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16052644#comment-16052644
 ] 

stack commented on HBASE-18152:
---

Looks like we have a version of this problem in branch-1 too. This is from a 
[~tsuna] 1.3.1 log:

{code}
2017-06-09 01:03:34,499 ERROR [r12s3:9102.activeMasterManager] 
procedure2.ProcedureExecutor: corrupted procedure: 
Procedure=org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure (id=96, 
owner=, state=RUNNABLE, startTime=6480hrs, 32mins, 51sec ago, 
lastUpdate=6480hrs, 32mins, 51sec ago)
2017-06-09 01:03:34,499 ERROR [r12s3:9102.activeMasterManager] 
procedure2.ProcedureExecutor: corrupted procedure: 
Procedure=org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure (id=15, 
owner=, state=RUNNABLE, startTime=7032hrs, 28mins, 23sec ago, 
lastUpdate=7032hrs, 28mins, 23sec ago)
2017-06-09 01:03:34,499 ERROR [r12s3:9102.activeMasterManager] 
procedure2.ProcedureExecutor: corrupted procedure: 
Procedure=org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure (id=77, 
owner=, state=RUNNABLE, startTime=7032hrs, 21mins, 11sec ago, 
lastUpdate=7032hrs, 21mins, 11sec ago)
{code}

> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> -
>
> Key: HBASE-18152
> URL: https://issues.apache.org/jira/browse/HBASE-18152
> Project: HBase
>  Issue Type: Sub-task
>  Components: Region Assignment
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: stack
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: HBASE-18152.master.001.patch, 
> pv2-0036.log, pv2-0047.log, 
> reading_bad_wal.patch
>
>
> I've seen corruption from time-to-time testing.  Its rare enough. Often we 
> can get over it but sometimes we can't. It took me a while to capture an 
> instance of corruption. Turns out we are write to the WAL out-of-order which 
> undoes a basic tenet; that WAL content is ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write-side, there is a lot going on. I'm not clear on how we 
> could write out of order. Will try and get more insight. Meantime parking 
> this issue here to fill data into.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order

2017-06-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048833#comment-16048833
 ] 

Hudson commented on HBASE-18152:


FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #3192 (See 
[https://builds.apache.org/job/HBase-Trunk_matrix/3192/])
HBASE-18216 [AMv2] Workaround for HBASE-18152, corrupt procedure WAL (stack: 
rev 0b43353bf76f19e020e2831691a832722b590915)
* (edit) 
hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/wal/ProcedureWALFormatReader.java
HBASE-18216 [AMv2] Workaround for HBASE-18152, corrupt procedure WAL; (stack: 
rev 550b6c585e0390bc80516e64df8bd1a3a6e10e23)
* (edit) 
hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/wal/ProcedureWALFormatReader.java


> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> -
>
> Key: HBASE-18152
> URL: https://issues.apache.org/jira/browse/HBASE-18152
> Project: HBase
>  Issue Type: Sub-task
>  Components: Region Assignment
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: stack
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: HBASE-18152.master.001.patch, 
> pv2-0036.log, pv2-0047.log, 
> reading_bad_wal.patch
>
>
> I've seen corruption from time-to-time testing.  Its rare enough. Often we 
> can get over it but sometimes we can't. It took me a while to capture an 
> instance of corruption. Turns out we are write to the WAL out-of-order which 
> undoes a basic tenet; that WAL content is ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write-side, there is a lot going on. I'm not clear on how we 
> could write out of order. Will try and get more insight. Meantime parking 
> this issue here to fill data into.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order

2017-06-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048805#comment-16048805
 ] 

Hudson commented on HBASE-18152:


FAILURE: Integrated in Jenkins build HBase-2.0 #40 (See 
[https://builds.apache.org/job/HBase-2.0/40/])
HBASE-18216 [AMv2] Workaround for HBASE-18152, corrupt procedure WAL (stack: 
rev 1eb2f2593d00b6e92f31fe9811fdd656fcf2847b)
* (edit) 
hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/wal/ProcedureWALFormatReader.java
HBASE-18216 [AMv2] Workaround for HBASE-18152, corrupt procedure WAL; (stack: 
rev 16a5a9db3edf3594b492f548b78bf8342884373f)
* (edit) 
hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/wal/ProcedureWALFormatReader.java


> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> -
>
> Key: HBASE-18152
> URL: https://issues.apache.org/jira/browse/HBASE-18152
> Project: HBase
>  Issue Type: Sub-task
>  Components: Region Assignment
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: stack
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: HBASE-18152.master.001.patch, 
> pv2-0036.log, pv2-0047.log, 
> reading_bad_wal.patch
>
>
> I've seen corruption from time-to-time testing.  Its rare enough. Often we 
> can get over it but sometimes we can't. It took me a while to capture an 
> instance of corruption. Turns out we are write to the WAL out-of-order which 
> undoes a basic tenet; that WAL content is ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write-side, there is a lot going on. I'm not clear on how we 
> could write out of order. Will try and get more insight. Meantime parking 
> this issue here to fill data into.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order

2017-06-13 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048704#comment-16048704
 ] 

stack commented on HBASE-18152:
---

Pushed workaround over in HBASE-18216. Need to figure root cause still.

> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> -
>
> Key: HBASE-18152
> URL: https://issues.apache.org/jira/browse/HBASE-18152
> Project: HBase
>  Issue Type: Sub-task
>  Components: Region Assignment
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: stack
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: HBASE-18152.master.001.patch, 
> pv2-0036.log, pv2-0047.log, 
> reading_bad_wal.patch
>
>
> I've seen corruption from time-to-time testing.  Its rare enough. Often we 
> can get over it but sometimes we can't. It took me a while to capture an 
> instance of corruption. Turns out we are write to the WAL out-of-order which 
> undoes a basic tenet; that WAL content is ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write-side, there is a lot going on. I'm not clear on how we 
> could write out of order. Will try and get more insight. Meantime parking 
> this issue here to fill data into.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order

2017-06-10 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045790#comment-16045790
 ] 

stack commented on HBASE-18152:
---

Looking at the corruption in 36.log, we are indeed missing stuff off the end of 
the WAL. The missing entries look like they would have been on the end of the 
WAL.  There is a second at least between their event and the master crash. I 
presume in this second its thread messing around trying to persist to the WAL 
store. We are writing edits out-of-order in some circumstance (see experience 
w/ previous log and the attached workaround patch which helps...). This makes 
for possibility of their being holes if we expect events in-order as the smart 
verification check does. Need to dig in on the way we log.

> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> -
>
> Key: HBASE-18152
> URL: https://issues.apache.org/jira/browse/HBASE-18152
> Project: HBase
>  Issue Type: Sub-task
>  Components: Region Assignment
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: stack
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: HBASE-17537.master.002.patch, 
> HBASE-18152.master.001.patch, pv2-0036.log, 
> pv2-0047.log, reading_bad_wal.patch
>
>
> I've seen corruption from time-to-time testing.  Its rare enough. Often we 
> can get over it but sometimes we can't. It took me a while to capture an 
> instance of corruption. Turns out we are write to the WAL out-of-order which 
> undoes a basic tenet; that WAL content is ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write-side, there is a lot going on. I'm not clear on how we 
> could write out of order. Will try and get more insight. Meantime parking 
> this issue here to fill data into.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order

2017-06-02 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1603#comment-1603
 ] 

stack commented on HBASE-18152:
---

This is what corrupt procedures show up as:
{code}
2017-04-17 22:34:50,044 ERROR [ve0524:16000.masterManager] 
procedure2.ProcedureExecutor: Corrupt 
Procedure=org.apache.hadoop.hbase.master.assignment.AssignProcedure (id=16, 
parent=7, owner=stack, state=RUNNABLE, submittedTime=14sec ago, 
lastUpdate=14sec ago)   

2017-04-17 22:34:50,044 ERROR 
[ve0524:16000.masterManager] procedure2.ProcedureExecutor: Corrupt 
Procedure=org.apache.hadoop.hbase.master.assignment.AssignProcedure (id=13, 
parent=7, owner=stack, state=RUNNABLE, submittedTime=14sec ago, 
lastUpdate=14sec ago)
2017-04-17 22:34:50,044 ERROR [ve0524:16000.masterManager] 
procedure2.ProcedureExecutor: Corrupt 
Procedure=org.apache.hadoop.hbase.master.assignment.AssignProcedure (id=11, 
parent=7, owner=stack, state=RUNNABLE, submittedTime=14sec ago, 
lastUpdate=14sec ago)
{code}

Attached patch is a workaround that at read time checks that the new-found 
entry is for sure 'increasing' when compared to current entry we have for a 
Procedure. If not, we WARN and drop it. This workaround is good for the 
corruption shown here. Will run with it to see if I can find other corruption 
types and to see if I can figure how we are writing out of order.

> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> -
>
> Key: HBASE-18152
> URL: https://issues.apache.org/jira/browse/HBASE-18152
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: stack
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: HBASE-18152.master.001.patch, 
> pv2-0047.log, reading_bad_wal.patch
>
>
> I've seen corruption from time-to-time testing.  Its rare enough. Often we 
> can get over it but sometimes we can't. It took me a while to capture an 
> instance of corruption. Turns out we are write to the WAL out-of-order which 
> undoes a basic tenet; that WAL content is ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write-side, there is a lot going on. I'm not clear on how we 
> could write out of order. Will try and get more insight. Meantime parking 
> this issue here to fill data into.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)