Wellington Chevreuil created HBASE-21461:
--------------------------------------------

             Summary: Region CoProcessor for splitting large WAL entries into 
smaller batches, to handle situations where faulty ingestion has created too 
many mutations for the same cell in a single batch
                 Key: HBASE-21461
                 URL: https://issues.apache.org/jira/browse/HBASE-21461
             Project: HBase
          Issue Type: New Feature
          Components: hbase-operator-tools, Replication
            Reporter: Wellington Chevreuil
            Assignee: Wellington Chevreuil


On deployments with replication enabled, faulty ingestion clients may produce a 
single WalEntry containing too many edits for the same cell. This causes 
*ReplicationSink* in the target cluster to attempt a single batch mutation with 
too many operations, which in turn can lead to very large RPC requests that may 
not fit in the final target RS RPC call queue. In this case, the messages below 
are seen on the target RS while trying to perform the sink:
{noformat}

WARN org.apache.hadoop.hbase.client.AsyncProcess: #690, table=TABLE_NAME, 
attempt=4/4 failed=2ops, last exception: 
org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.CallQueueTooBigException):
 Call queue is full on /0.0.0.0:60020, is hbase.ipc.server.max.callqueue.size 
too small? on regionserver01.example.com,60020,1524334173359, tracking started 
Fri Sep 07 10:35:53 IST 2018; not retrying 2 - final failure
2018-09-07 10:40:59,506 ERROR 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to 
accept edit because:
org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 2 
actions: RemoteWithExtrasException: 2 times, 
at 
org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:247)
at 
org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1800(AsyncProcess.java:227)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.getErrors(AsyncProcess.java:1663)
at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:982)
at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:996){noformat}
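
For illustration only, the sketch below shows one hypothetical client pattern 
that could produce such an oversized WalEntry: a single batch put queuing tens 
of thousands of versions of the same cell. The table name, column family and 
counts are made up and not taken from any actual workload.
{code:java}
// Hypothetical sketch of a "faulty ingestion" pattern: tens of thousands of
// versions of the same cell queued into one client batch. Table/family/counts
// are made up for illustration.
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FaultyIngestionExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("TABLE_NAME"))) {
      List<Put> batch = new ArrayList<>();
      long baseTs = System.currentTimeMillis();
      for (int i = 0; i < 50000; i++) {
        // Every Put targets the same row/column, only the timestamp differs.
        Put p = new Put(Bytes.toBytes("same-row"));
        p.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), baseTs + i,
            Bytes.toBytes("v" + i));
        batch.add(p);
      }
      // Shipped to the region in one go; the resulting WAL edits for this row
      // are then picked up by replication as a very large entry.
      table.put(batch);
    }
  }
}
{code}
Because every mutation in the batch targets the same row (hence the same 
region), the edits tend to be grouped into very large WAL entries, which 
replication then ships and the sink re-applies as correspondingly large 
batches.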

When this problem manifests, replication gets stuck and WAL files pile up in 
the source cluster's WALs/oldWALs folders. The typical workaround requires 
manual cleanup of the replication znodes in ZK, plus manual replay of the WAL 
files containing the large entry.

This CP would handle the issue by checking for large WAL entries and splitting 
those into smaller batches in the *preReplicateLogEntries* method hook.
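
For reference, here is a minimal sketch of the splitting idea in isolation, not 
the actual coprocessor wiring: it assumes the oversized entry's edits have 
already been materialized as client mutations and simply re-applies them in 
fixed-size chunks. The *BatchSplitter* class and the *MAX_OPS_PER_BATCH* 
threshold are illustrative names, not existing HBase ones.
{code:java}
// Illustrative sketch of the splitting idea only: apply a large set of
// mutations in fixed-size chunks so no single batch RPC grows too large.
// BatchSplitter and MAX_OPS_PER_BATCH are made-up names for this example.
import java.util.List;
import org.apache.hadoop.hbase.client.Row;
import org.apache.hadoop.hbase.client.Table;

public final class BatchSplitter {
  private static final int MAX_OPS_PER_BATCH = 1000; // illustrative threshold

  private BatchSplitter() {
  }

  /** Re-applies the given operations in chunks instead of one oversized batch. */
  public static void applyInChunks(Table table, List<? extends Row> ops)
      throws Exception {
    for (int start = 0; start < ops.size(); start += MAX_OPS_PER_BATCH) {
      int end = Math.min(start + MAX_OPS_PER_BATCH, ops.size());
      List<? extends Row> chunk = ops.subList(start, end);
      Object[] results = new Object[chunk.size()];
      // Each chunk goes out as its own batch call / RPC.
      table.batch(chunk, results);
    }
  }
}
{code}
In the CP itself, the same chunking would be applied to the cells carried by 
each large WAL entry before they reach the sink's single batch call, keeping 
every individual RPC within the call queue limits.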

*Additional Note*: HBASE-18027 introduced some safeguards against such large 
RPC requests, which may already help avoid this scenario. Those are not 
available for the 1.2 releases, though, so this CP tool may still be relevant 
for 1.2 clusters. It may also still be worth having as a workaround for any 
potential, as yet unknown, large-RPC issue scenarios.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
