tianhang tang created HBASE-23693:
-------------------------------------

             Summary: Split failure may cause region hole and data loss
                 Key: HBASE-23693
                 URL: https://issues.apache.org/jira/browse/HBASE-23693
             Project: HBase
          Issue Type: Bug
          Components: master
    Affects Versions: 1.4.8
            Reporter: tianhang tang


To mock this case, I added a sleep in SplitTransactionImpl.execute after the 
PONR (point of no return) and before openDaughters:
{code:java}
public PairOfSameType<Region> execute(final Server server,
      final RegionServerServices services, User user) throws IOException {
    this.server = server;
    this.rsServices = services;
    useZKForAssignment = server == null ? true :
      ConfigUtil.useZKForAssignment(server.getConfiguration());
    if (useCoordinatedStateManager(server)) {
      std =
          ((BaseCoordinatedStateManager) server.getCoordinatedStateManager())
              .getSplitTransactionCoordination().getDefaultDetails();
    }
    PairOfSameType<Region> regions = createDaughters(server, services, user);
    if (this.parent.getCoprocessorHost() != null) {
      if (user == null) {
        parent.getCoprocessorHost().preSplitAfterPONR();
      } else {
        try {
          user.getUGI().doAs(new PrivilegedExceptionAction<Void>() {
            @Override
            public Void run() throws Exception {
              parent.getCoprocessorHost().preSplitAfterPONR();
              return null;
            }
          });
        } catch (InterruptedException ie) {
          InterruptedIOException iioe = new InterruptedIOException();
          iioe.initCause(ie);
          throw iioe;
        }
      }
    }
    
    // Injected for reproduction only: stall for one hour after the PONR,
    // before the daughter regions are opened.
    try {
      Thread.sleep(1000 * 60 * 60);
    } catch (InterruptedException e) {
      e.printStackTrace();
    }

    regions = stepsAfterPONR(server, services, regions, user);

    transition(SplitTransactionPhase.COMPLETED);

    return regions;
  }
{code}
With this change the split transaction hangs after the PONR.

Then I tried to reproduce the problem:

1. Create a test table and move it into a test rsgroup that contains only 
one RS.

2. Trigger a region split.

3. The split transaction passes the PONR and then sleeps; the regioninfo in 
meta has already been updated.

4. Kill the RS process to simulate a machine crash.

5. ServerCrashProcedure cleans up the SPLITTING_NEW regions, so the daughter 
regions are deleted.

6. ServerCrashProcedure tries to assign the parent region. Because the RS is 
down, the assign fails, and the region state is set to FAILED_OPEN and put 
back into regionsInTransition. At this point, because of the RS crash, the 
region's node under the ZK region-in-transition path no longer exists.

7. The CatalogJanitor thread is blocked because of the RIT.

8. Switch the active master.

9. The CatalogJanitor thread on the new master runs normally and cleans up 
the parent region, because split = true && offline = true in the meta table.

10. The test table now has a hole, and data is lost.
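
The sequence above can be illustrated with a toy model. This is only a sketch 
under stated assumptions: SplitHoleModel, RegionRow and all method names are 
hypothetical stand-ins, not HBase classes, and the janitor rule is reduced to 
the split = true && offline = true condition from step 9.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model of the meta table; none of these types are real HBase classes.
class SplitHoleModel {
    /** Hypothetical meta row. An empty endKey means "end of table". */
    static final class RegionRow {
        final String startKey;
        final String endKey;
        boolean split;
        boolean offline;

        RegionRow(String startKey, String endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }

        /** True if the row key falls inside [startKey, endKey). */
        boolean contains(String key) {
            return startKey.compareTo(key) <= 0
                && (endKey.isEmpty() || key.compareTo(endKey) < 0);
        }
    }

    final Map<String, RegionRow> meta = new TreeMap<>();

    /** Steps 2-3: past the PONR, meta marks the parent split/offline and lists the daughters. */
    void splitPastPonr(String parent, String splitKey) {
        RegionRow p = meta.get(parent);
        p.split = true;
        p.offline = true;
        meta.put(parent + "-a", new RegionRow(p.startKey, splitKey));
        meta.put(parent + "-b", new RegionRow(splitKey, p.endKey));
    }

    /** Step 5: crash handling deletes the daughters but leaves the parent row as is. */
    void crashCleanup(String parent) {
        meta.remove(parent + "-a");
        meta.remove(parent + "-b");
    }

    /** Step 9: the janitor drops every row that looks like a finished split parent. */
    void janitorScan() {
        meta.values().removeIf(r -> r.split && r.offline);
    }

    /** True if some row in meta still covers the given row key. */
    boolean isCovered(String key) {
        return meta.values().stream().anyMatch(r -> r.contains(key));
    }

    public static void main(String[] args) {
        SplitHoleModel m = new SplitHoleModel();
        m.meta.put("parent", new RegionRow("", ""));
        m.splitPastPonr("parent", "m");
        m.crashCleanup("parent");
        System.out.println("covered before janitor: " + m.isCovered("k"));
        m.janitorScan();
        System.out.println("covered after janitor: " + m.isCovered("k"));
    }
}
```

In this model the key "k" is still covered by the parent row after the crash 
cleanup, and by nothing once the janitor has run, which corresponds to the 
hole in step 10.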

 

I modified the code so that when ServerCrashProcedure cleans up the daughter 
regions, it also updates the parent regioninfo in the meta table. With that 
change the problem no longer reproduces.
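
The idea behind that change might look roughly like the following. This is a 
sketch, not the patch itself: CrashCleanupSketch, Flags and both methods are 
hypothetical, and it assumes that "updating the parent regioninfo" means 
clearing the split and offline flags that the janitor checks.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the cleanup step; not the actual HBase patch.
class CrashCleanupSketch {
    /** Hypothetical meta row holding only the two flags the janitor looks at. */
    static final class Flags {
        boolean split;
        boolean offline;

        Flags(boolean split, boolean offline) {
            this.split = split;
            this.offline = offline;
        }
    }

    /**
     * Removes the daughter rows and, in the same step, marks the parent as an
     * ordinary region again (assumed meaning of "update the parent regioninfo").
     */
    static void cleanupFailedSplit(Map<String, Flags> meta, String parent,
                                   String daughterA, String daughterB) {
        meta.remove(daughterA);
        meta.remove(daughterB);
        Flags p = meta.get(parent);
        if (p != null) {
            p.split = false;
            p.offline = false;
        }
    }

    /** Simplified janitor rule from step 9: split && offline means collectable. */
    static boolean janitorWouldDelete(Flags f) {
        return f.split && f.offline;
    }

    public static void main(String[] args) {
        Map<String, Flags> meta = new HashMap<>();
        meta.put("parent", new Flags(true, true));
        meta.put("daughterA", new Flags(false, false));
        meta.put("daughterB", new Flags(false, false));
        cleanupFailedSplit(meta, "parent", "daughterA", "daughterB");
        System.out.println("janitor would delete parent: "
            + janitorWouldDelete(meta.get("parent")));
    }
}
```

Once the parent row no longer looks like a completed split, a later 
CatalogJanitor scan on any master has no reason to remove it.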


I will upload the patch later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
