Todd Lipcon created KUDU-2152:
---------------------------------

             Summary: Tablet stuck under-replicated after some kind of tablet copy issue
                 Key: KUDU-2152
                 URL: https://issues.apache.org/jira/browse/KUDU-2152
             Project: Kudu
          Issue Type: Bug
          Components: consensus
    Affects Versions: 1.5.0
            Reporter: Todd Lipcon
            Priority: Critical


I was stress testing with the following setup:
- 8 servers (n1-standard-4 GCE boxes)
- created a bunch of 100-tablet tables using loadgen until I had ~2500 
replicas on each server
- mounted another server using sshfs and put cmeta on that mount point (to slow 
down cmeta writes)
- stress -c4 on all machines
- shut down a server, wait for re-replication (green ksck), restart the 
server, rinse and repeat

Eventually I got a stuck tablet. ksck reports:

{code}
Tablet 271df8901d98442cb478593babd8a609 of table 'loadgen_auto_8e32cb07eb83458da4ec4d228bcb0f5a' is under-replicated: 1 replica(s) not RUNNING
  20d4d86f182043398594b67492d13fdc (kudu513-8.gce.cloudera.com:7050): RUNNING [LEADER]
  c2ea8f22f4034bcc97e26c9236811960 (kudu513-1.gce.cloudera.com:7050): bad state
    State:       STOPPED
    Data state:  TABLET_DATA_COPYING
    Last status: Deleted tablet blocks from disk
  cd0997b908ad41839f56a1b61210f2d4 (kudu513-3.gce.cloudera.com:7050): RUNNING

1 replicas' active configs differ from the master's.
  All the peers reported by the master and tablet servers are:
  A = 20d4d86f182043398594b67492d13fdc
  D = 471027436ee8405ab7cdf8d22407696b
  B = c2ea8f22f4034bcc97e26c9236811960
  C = cd0997b908ad41839f56a1b61210f2d4

The consensus matrix is:
 Config source |      Voters      | Current term | Config index | Committed?
---------------+------------------+--------------+--------------+------------
 master        | A*      B   C    |              |              | Yes
 A             | A*      B   C    | 11           | 29           | Yes
 B             |     D   B   C    | 9            | 23           | Yes
 C             | A*      B   C    | 11           | 29           | Yes
{code}
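
The mismatch in the matrix can be checked mechanically. The following is a minimal Python sketch, not ksck's actual logic: the `configs` dictionary is transcribed by hand from the matrix above, and `stale_sources` is a hypothetical helper name. It lists every config source whose voter set differs from the master's.

```python
# Voter sets, terms and config indexes transcribed from the consensus matrix.
# The master row reports no term or index, so those are left as None.
configs = {
    "master": {"voters": {"A", "B", "C"}, "term": None, "index": None},
    "A": {"voters": {"A", "B", "C"}, "term": 11, "index": 29},
    "B": {"voters": {"D", "B", "C"}, "term": 9, "index": 23},
    "C": {"voters": {"A", "B", "C"}, "term": 11, "index": 29},
}


def stale_sources(configs, reference="master"):
    """Return the config sources whose voter set differs from the reference's.

    Hypothetical helper for illustration only.
    """
    expected = configs[reference]["voters"]
    return sorted(
        source
        for source, config in configs.items()
        if source != reference and config["voters"] != expected
    )


print(stale_sources(configs))  # ['B']
```

Only "B" is flagged: it still holds the older config (term 9, index 23) that lists "D" as a voter instead of "A".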

The leader ("A" above) keeps reporting that its requests to "B" fail with 
TABLET_NOT_RUNNING, so it never evicts "B": the leader treats 
TABLET_NOT_RUNNING as a temporary condition, assuming that it actually means 
BOOTSTRAPPING. In this case "B" is STOPPED with data state 
TABLET_DATA_COPYING, so the condition never clears.
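
One way to model why the leader never gives up on "B" is sketched below in Python. This is an illustration of the behavior described above, not Kudu's actual code: `FollowerHealth`, `record_response`, `should_evict`, the `NETWORK_ERROR` code and the 300-second timeout are all assumptions made for the example. The key assumption is that an error classified as transient resets the follower's failure clock, so a follower that answers TABLET_NOT_RUNNING forever never accumulates enough failure time to be evicted.

```python
from dataclasses import dataclass

# Assumption for this sketch: errors in this set are treated as temporary.
TRANSIENT_ERRORS = {"TABLET_NOT_RUNNING"}


@dataclass
class FollowerHealth:
    """Hypothetical per-follower bookkeeping kept by the leader."""
    seconds_unreachable: float = 0.0


def record_response(health, error, elapsed_sec):
    """Update the failure clock after one request attempt.

    A success (error is None) or a transient error resets the clock;
    any other error adds the elapsed time to it.
    """
    if error is None or error in TRANSIENT_ERRORS:
        health.seconds_unreachable = 0.0
    else:
        health.seconds_unreachable += elapsed_sec


def should_evict(health, timeout_sec=300.0):
    """Evict once the follower has been failing for timeout_sec or longer."""
    return health.seconds_unreachable >= timeout_sec


# A follower stuck returning TABLET_NOT_RUNNING is never evicted.
stuck = FollowerHealth()
for _ in range(1000):
    record_response(stuck, "TABLET_NOT_RUNNING", 1.0)
print(should_evict(stuck))  # False

# A follower failing with a non-transient error is evicted after the timeout.
dead = FollowerHealth()
for _ in range(300):
    record_response(dead, "NETWORK_ERROR", 1.0)
print(should_evict(dead))  # True
```

Under this model, distinguishing "still bootstrapping" from "stopped and not coming back" is what would let the leader evict a replica in the state "B" is in.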

The last entries in "B"'s log were:

{code}
I0920 16:41:48.556422  3808 tablet_copy_client.cc:209] T 271df8901d98442cb478593babd8a609 P c2ea8f22f4034bcc97e26c9236811960: tablet copy: Beginning tablet copy session from remote peer at address kudu513-8.gce.cloudera.com:7050
I0920 16:41:48.562335  3808 ts_tablet_manager.cc:1118] T 271df8901d98442cb478593babd8a609 P c2ea8f22f4034bcc97e26c9236811960: Deleting tablet data with delete state TABLET_DATA_COPYING
W0920 16:41:48.578610  3808 env_util.cc:277] Failed to determine if path is a directory: /data0/ts-data/tablet-meta/271df8901d98442cb478593babd8a609.kudutmp.2Tu0Uy: Not found: /data0/ts-data/tablet-meta/271df8901d98442cb478593babd8a609.kudutmp.2Tu0Uy: No such file or directory (error 2)
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
