[jira] [Commented] (HBASE-25848) Add flexibility to backup replication in case replication filter throws an exception

2022-10-10 Thread Sandeep Pal (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17615265#comment-17615265
 ] 

Sandeep Pal commented on HBASE-25848:
-

This solved the problem, since we used it in our custom replication 
endpoint, which was using a custom WALEntryFilter.
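
For context, a minimal sketch (the class and service names below are hypothetical, not taken from our endpoint) of the kind of custom WALEntryFilter this flexibility protects: one whose filtering decision depends on an external service that can be briefly unavailable.
{code:java}
// Hypothetical illustration only: a filter that may throw while a backing
// service is temporarily down. With the backup/retry option from this issue,
// such a failure backs off and retries instead of aborting the region server.
public class ExternalLookupWALEntryFilter implements WALEntryFilter {
  private final RoutingService routing; // assumed external dependency

  public ExternalLookupWALEntryFilter(RoutingService routing) {
    this.routing = routing;
  }

  @Override
  public WAL.Entry filter(WAL.Entry entry) {
    // May throw a RuntimeException on lookup timeouts; expected to recover later.
    return routing.shouldReplicate(entry) ? entry : null;
  }
}
{code}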

> Add flexibility to backup replication in case replication filter throws an 
> exception
> 
>
> Key: HBASE-25848
> URL: https://issues.apache.org/jira/browse/HBASE-25848
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha-1, 1.6.0, 1.8.0, 2.6.0
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.3.6, 2.4.4
>
>
> There may be situations when the wal entry filter might result in some 
> temporary issues but expected to recover at some point in time. In this case, 
> we should have an option to backup replication and retry until the wal entry 
> filter recovers instead of just aborting the region server.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-25596) Fix NPE in ReplicationSourceManager as well as avoid permanently unreplicated data due to EOFException from WAL

2021-06-10 Thread Sandeep Pal (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360612#comment-17360612
 ] 

Sandeep Pal commented on HBASE-25596:
-

Sure, if it is true that "we will never throw EOFException to upper layer, so the 
problem you described here should not happen", then it makes sense that there 
will be no unshipped batch. I was not aware of this. Let me check the code once 
more carefully and I will create a PR to revert it. Thanks [~zhangduo]

> Fix NPE in ReplicationSourceManager as well as avoid permanently unreplicated 
> data due to EOFException from WAL
> ---
>
> Key: HBASE-25596
> URL: https://issues.apache.org/jira/browse/HBASE-25596
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Critical
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.2
>
>
> There seems to be a major issue with how we handle the EOF exception from 
> WALEntryStream. 
> Problem:
> When we see EOFException, we try to handle it and remove it from the log 
> queue, but we never try to ship the existing batch of entries. *This is a 
> permanent data loss in replication.*
>  
> Secondly, we do not stop the reader on encountering the EOFException and thus 
> if EOFException was on the last WAL, we still try to process the WALEntry 
> stream and ship the empty batch with lastWALPath set to null. This is the 
> reason of NPE as below which *crash* the region server. 
> {code:java}
> 2021-02-16 15:33:21,293 ERROR [,60020,1613262147968] 
> regionserver.ReplicationSource - Unexpected exception in 
> ReplicationSourceWorkerThread, 
> currentPath=nulljava.lang.NullPointerExceptionat 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:193)at
>  
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogPosition(ReplicationSource.java:831)at
>  
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(ReplicationSource.java:746)at
>  
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:650)2021-02-16
>  15:33:21,294 INFO [,60020,1613262147968] regionserver.HRegionServer - 
> STOPPED: Unexpected exception in ReplicationSourceWorkerThread
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25596) Fix NPE in ReplicationSourceManager as well as avoid permanently unreplicated data due to EOFException from WAL

2021-06-09 Thread Sandeep Pal (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360571#comment-17360571
 ] 

Sandeep Pal commented on HBASE-25596:
-

I will try to write a UT for this. 

>>To be more clear, if we hit EOFException, it means that the WAL file is 
>>empty, so we are safe to just skip this file without shipping the 'existing' 
>>batch

I believe there can be entries from multiple WAL files in an existing batch; I 
am referring to the branch-1 code. We only break the batch when we hit certain 
thresholds, as in the code below. 
{code:java}
  if (totalBufferTooLarge || batch.getHeapSize() >= replicationBatchSizeCapacity
      || batch.getNbEntries() >= replicationBatchCountCapacity) {
    break;
  }
{code}

> Fix NPE in ReplicationSourceManager as well as avoid permanently unreplicated 
> data due to EOFException from WAL
> ---
>
> Key: HBASE-25596
> URL: https://issues.apache.org/jira/browse/HBASE-25596
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Critical
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.2
>
>
> There seems to be a major issue with how we handle the EOF exception from 
> WALEntryStream. 
> Problem:
> When we see EOFException, we try to handle it and remove it from the log 
> queue, but we never try to ship the existing batch of entries. *This is a 
> permanent data loss in replication.*
>  
> Secondly, we do not stop the reader on encountering the EOFException and thus 
> if EOFException was on the last WAL, we still try to process the WALEntry 
> stream and ship the empty batch with lastWALPath set to null. This is the 
> reason of NPE as below which *crash* the region server. 
> {code:java}
> 2021-02-16 15:33:21,293 ERROR [,60020,1613262147968] 
> regionserver.ReplicationSource - Unexpected exception in 
> ReplicationSourceWorkerThread, 
> currentPath=nulljava.lang.NullPointerExceptionat 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:193)at
>  
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogPosition(ReplicationSource.java:831)at
>  
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(ReplicationSource.java:746)at
>  
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:650)2021-02-16
>  15:33:21,294 INFO [,60020,1613262147968] regionserver.HRegionServer - 
> STOPPED: Unexpected exception in ReplicationSourceWorkerThread
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25596) Fix NPE in ReplicationSourceManager as well as avoid permanently unreplicated data due to EOFException from WAL

2021-06-09 Thread Sandeep Pal (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360525#comment-17360525
 ] 

Sandeep Pal commented on HBASE-25596:
-

[~zhangduo] This is where I think we will not replicate. 

 

 
{code:java}
while (hasNext) {
  Entry entry = entryStream.next();   // <--- we hit an exception here
  entry = filterEntry(entry);
  if (entry != null) {
    WALEdit edit = entry.getEdit();
    if (edit != null && !edit.isEmpty()) {
      long entrySize = getEntrySizeIncludeBulkLoad(entry);
      long entrySizeExcludeBulkLoad = getEntrySizeExcludeBulkLoad(entry);
      batch.addEntry(entry, entrySize);
      // ...
      if (totalBufferTooLarge || batch.getHeapSize() >= replicationBatchSizeCapacity
          || batch.getNbEntries() >= replicationBatchCountCapacity) {
        break;
      }
    }
  }
  hasNext = entryStream.hasNext();
}
{code}
 

While reading WALs we add entries to the batch, but in between we hit an 
exception, say on the next, empty WAL file. We won't replicate the existing 
batch, which might have entries from the previous WAL file. I am referring to 
the branch-1 code 
[here|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceWALReaderThread.java#L165].
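
To make the suggestion concrete, here is a rough sketch of the ordering being argued for (method names such as readWALEntries, addBatchToShippingQueue, and handleEofException are used loosely here as placeholders, not quoted from the actual patch): ship whatever has already been batched before dropping the WAL that hit EOF.
{code:java}
// Sketch only, not the committed fix: preserve entries already read from earlier
// WALs before handling the EOF on the empty/truncated one.
try {
  readWALEntries(entryStream, batch);      // may hit EOFException on an empty trailing WAL
} catch (EOFException eof) {
  if (batch.getNbEntries() > 0) {
    addBatchToShippingQueue(batch);        // entries from previous WALs are not lost
  }
  handleEofException(eof);                 // then remove the problematic WAL from the queue
}
{code}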
 

 

 

> Fix NPE in ReplicationSourceManager as well as avoid permanently unreplicated 
> data due to EOFException from WAL
> ---
>
> Key: HBASE-25596
> URL: https://issues.apache.org/jira/browse/HBASE-25596
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Critical
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.2
>
>
> There seems to be a major issue with how we handle the EOF exception from 
> WALEntryStream. 
> Problem:
> When we see EOFException, we try to handle it and remove it from the log 
> queue, but we never try to ship the existing batch of entries. *This is a 
> permanent data loss in replication.*
>  
> Secondly, we do not stop the reader on encountering the EOFException and thus 
> if EOFException was on the last WAL, we still try to process the WALEntry 
> stream and ship the empty batch with lastWALPath set to null. This is the 
> reason of NPE as below which *crash* the region server. 
> {code:java}
> 2021-02-16 15:33:21,293 ERROR [,60020,1613262147968] 
> regionserver.ReplicationSource - Unexpected exception in 
> ReplicationSourceWorkerThread, 
> currentPath=nulljava.lang.NullPointerExceptionat 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:193)at
>  
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogPosition(ReplicationSource.java:831)at
>  
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(ReplicationSource.java:746)at
>  
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:650)2021-02-16
>  15:33:21,294 INFO [,60020,1613262147968] regionserver.HRegionServer - 
> STOPPED: Unexpected exception in ReplicationSourceWorkerThread
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25596) Fix NPE in ReplicationSourceManager as well as avoid permanently unreplicated data due to EOFException from WAL

2021-06-02 Thread Sandeep Pal (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355899#comment-17355899
 ] 

Sandeep Pal commented on HBASE-25596:
-

[~anoop.hbase] Thanks for catching this; I agree that it's a miss. In fact, I 
believe even the original code was not correct. The sleep should be reset at the 
end regardless of whether the batch is null or not.


{code:java}
if (batch == null) {
  // either the queue have no WAL to read
  // or got no new entries (didn't advance position in WAL)
  handleEmptyWALEntryBatch();
  entryStream.reset(); // reuse stream
} else {
  addBatchToShippingQueue(batch);
}

// reset the sleep here <--
{code}
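
In other words, a sketch of the suggestion (assuming sleepMultiplier is the usual backoff counter in the reader thread):
{code:java}
if (batch == null) {
  handleEmptyWALEntryBatch();
  entryStream.reset(); // reuse stream
} else {
  addBatchToShippingQueue(batch);
}
// Reset the backoff unconditionally at the end of the pass, whether or not
// the batch was null, so a past failure does not keep inflating the sleep.
sleepMultiplier = 1;
{code}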
 

Let me know if you want me to fix it or you are already on it. 

> Fix NPE in ReplicationSourceManager as well as avoid permanently unreplicated 
> data due to EOFException from WAL
> ---
>
> Key: HBASE-25596
> URL: https://issues.apache.org/jira/browse/HBASE-25596
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Critical
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.2
>
>
> There seems to be a major issue with how we handle the EOF exception from 
> WALEntryStream. 
> Problem:
> When we see EOFException, we try to handle it and remove it from the log 
> queue, but we never try to ship the existing batch of entries. *This is a 
> permanent data loss in replication.*
>  
> Secondly, we do not stop the reader on encountering the EOFException and thus 
> if EOFException was on the last WAL, we still try to process the WALEntry 
> stream and ship the empty batch with lastWALPath set to null. This is the 
> reason of NPE as below which *crash* the region server. 
> {code:java}
> 2021-02-16 15:33:21,293 ERROR [,60020,1613262147968] 
> regionserver.ReplicationSource - Unexpected exception in 
> ReplicationSourceWorkerThread, 
> currentPath=nulljava.lang.NullPointerExceptionat 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:193)at
>  
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogPosition(ReplicationSource.java:831)at
>  
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(ReplicationSource.java:746)at
>  
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:650)2021-02-16
>  15:33:21,294 INFO [,60020,1613262147968] regionserver.HRegionServer - 
> STOPPED: Unexpected exception in ReplicationSourceWorkerThread
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work started] (HBASE-25927) Fix the log messages by not stringifying the exceptions in log

2021-05-28 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HBASE-25927 started by Sandeep Pal.
---
> Fix the log messages by not stringifying the exceptions in log
> --
>
> Key: HBASE-25927
> URL: https://issues.apache.org/jira/browse/HBASE-25927
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Minor
>
> There are few places where we stringify the exceptions and log, instead we 
> should just pass them as a parameter to see the stack trace in good format. 
> For example: 
> https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceWALReader.java#L175



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25927) Fix the log messages by not stringifying the exceptions in log

2021-05-26 Thread Sandeep Pal (Jira)
Sandeep Pal created HBASE-25927:
---

 Summary: Fix the log messages by not stringifying the exceptions 
in log
 Key: HBASE-25927
 URL: https://issues.apache.org/jira/browse/HBASE-25927
 Project: HBase
  Issue Type: Bug
Reporter: Sandeep Pal
Assignee: Sandeep Pal


There are a few places where we stringify the exceptions and log them; instead, 
we should just pass the exception as a parameter so the stack trace is printed 
in a proper format. 

For example: 
https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceWALReader.java#L175
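
For illustration, the general pattern (a generic example, not the exact line from that file):
{code:java}
// Concatenating the exception into the message drops the stack trace:
LOG.warn("Failed to read stream of replication entries: " + e);

// Passing the exception as the last argument lets the logger print the full trace:
LOG.warn("Failed to read stream of replication entries", e);
{code}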



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HBASE-25848) Add flexibility to backup replication in case replication filter throws an exception

2021-05-21 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal resolved HBASE-25848.
-
Resolution: Fixed

> Add flexibility to backup replication in case replication filter throws an 
> exception
> 
>
> Key: HBASE-25848
> URL: https://issues.apache.org/jira/browse/HBASE-25848
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha-1, 1.6.0, 2.6.7, 1.8.0
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
> Fix For: 3.0.0-alpha-1, 1.7.0, 1.8.0, 2.3.6, 2.4.4, 2.3.5.1
>
>
> There may be situations when the wal entry filter might result in some 
> temporary issues but expected to recover at some point in time. In this case, 
> we should have an option to backup replication and retry until the wal entry 
> filter recovers instead of just aborting the region server.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25848) Add flexibility to backup replication in case replication filter throws an exception

2021-05-21 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-25848:

Fix Version/s: 1.8.0
   1.7.0

> Add flexibility to backup replication in case replication filter throws an 
> exception
> 
>
> Key: HBASE-25848
> URL: https://issues.apache.org/jira/browse/HBASE-25848
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha-1, 1.6.0, 2.6.7, 1.8.0
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
> Fix For: 3.0.0-alpha-1, 1.7.0, 1.8.0, 2.3.6, 2.4.4, 2.3.5.1
>
>
> There may be situations when the wal entry filter might result in some 
> temporary issues but expected to recover at some point in time. In this case, 
> we should have an option to backup replication and retry until the wal entry 
> filter recovers instead of just aborting the region server.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25848) Add flexibility to backup replication in case replication filter throws an exception

2021-05-21 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-25848:

Affects Version/s: (was: 2.0.0)
   1.8.0
   2.6.7

> Add flexibility to backup replication in case replication filter throws an 
> exception
> 
>
> Key: HBASE-25848
> URL: https://issues.apache.org/jira/browse/HBASE-25848
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha-1, 1.6.0, 2.6.7, 1.8.0
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.3.6, 2.4.4, 2.3.5.1
>
>
> There may be situations when the wal entry filter might result in some 
> temporary issues but expected to recover at some point in time. In this case, 
> we should have an option to backup replication and retry until the wal entry 
> filter recovers instead of just aborting the region server.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25848) Add flexibility to backup replication in case replication filter throws an exception

2021-05-21 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-25848:

Affects Version/s: (was: 2.6.7)
   2.0.0

> Add flexibility to backup replication in case replication filter throws an 
> exception
> 
>
> Key: HBASE-25848
> URL: https://issues.apache.org/jira/browse/HBASE-25848
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha-1, 2.0.0, 1.6.0
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.3.6, 2.4.4, 2.3.5.1
>
>
> There may be situations when the wal entry filter might result in some 
> temporary issues but expected to recover at some point in time. In this case, 
> we should have an option to backup replication and retry until the wal entry 
> filter recovers instead of just aborting the region server.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25848) Add flexibility to backup replication in case replication filter throws an exception

2021-05-21 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-25848:

Fix Version/s: 2.3.5.1
   2.4.4
   2.3.6
   3.0.0-alpha-1

> Add flexibility to backup replication in case replication filter throws an 
> exception
> 
>
> Key: HBASE-25848
> URL: https://issues.apache.org/jira/browse/HBASE-25848
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha-1, 1.6.0, 2.6.7
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.3.6, 2.4.4, 2.3.5.1
>
>
> There may be situations when the wal entry filter might result in some 
> temporary issues but expected to recover at some point in time. In this case, 
> we should have an option to backup replication and retry until the wal entry 
> filter recovers instead of just aborting the region server.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work started] (HBASE-25848) Add flexibility to backup replication in case replication filter throws an exception

2021-05-17 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HBASE-25848 started by Sandeep Pal.
---
> Add flexibility to backup replication in case replication filter throws an 
> exception
> 
>
> Key: HBASE-25848
> URL: https://issues.apache.org/jira/browse/HBASE-25848
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha-1, 1.6.0, 2.6.7
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
>
> There may be situations when the wal entry filter might result in some 
> temporary issues but expected to recover at some point in time. In this case, 
> we should have an option to backup replication and retry until the wal entry 
> filter recovers instead of just aborting the region server.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25848) Add flexibility to backup replication in case replication filter throws an exception

2021-05-17 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-25848:

Description: There may be situations when the wal entry filter might result 
in some temporary issues but expected to recover at some point in time. In this 
case, we should have an option to backup replication and retry until the wal 
entry filter recovers instead of just aborting the region server.  (was: There 
may be situations when the wal entry stream might result in some temporary 
issues but expected to recover at some point of time. In this case, we should 
have an option to backup replication and retry until the wal entry filter 
recover instead of just aborting the region server.)

> Add flexibility to backup replication in case replication filter throws an 
> exception
> 
>
> Key: HBASE-25848
> URL: https://issues.apache.org/jira/browse/HBASE-25848
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha-1, 1.6.0, 2.6.7
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
>
> There may be situations when the wal entry filter might result in some 
> temporary issues but expected to recover at some point in time. In this case, 
> we should have an option to backup replication and retry until the wal entry 
> filter recovers instead of just aborting the region server.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25848) Add flexibility to backup replication in case replication filter throws an exception

2021-05-05 Thread Sandeep Pal (Jira)
Sandeep Pal created HBASE-25848:
---

 Summary: Add flexibility to backup replication in case replication 
filter throws an exception
 Key: HBASE-25848
 URL: https://issues.apache.org/jira/browse/HBASE-25848
 Project: HBase
  Issue Type: Bug
Affects Versions: 1.6.0, 3.0.0-alpha-1, 2.6.7
Reporter: Sandeep Pal
Assignee: Sandeep Pal


There may be situations when the wal entry stream might result in some 
temporary issues but expected to recover at some point of time. In this case, 
we should have an option to backup replication and retry until the wal entry 
filter recover instead of just aborting the region server.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25741) Replication Source still having the replication metrics for peer ID which doesn't exist.

2021-04-06 Thread Sandeep Pal (Jira)
Sandeep Pal created HBASE-25741:
---

 Summary: Replication Source still having the replication metrics 
for peer ID which doesn't exist.
 Key: HBASE-25741
 URL: https://issues.apache.org/jira/browse/HBASE-25741
 Project: HBase
  Issue Type: Bug
Affects Versions: 1.8.0
Reporter: Sandeep Pal
Assignee: Sandeep Pal


We have observed that replication source metrics for a peer exist on some region 
servers even though the peer has been removed. This is because when we encounter 
the NoNodeException in ReplicationSource, it calls the `peerRemoved` workflow, 
which should eventually terminate the source and remove it from the source 
manager. Now, the problem is that the ReplicationSource thread terminates itself, 
and thus the removePeer action is never completed, leaving the metrics there 
forever for that source. This is the flow: the replication source tries to clean WALs 
[here|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java#L801]
 and on NoNodeException it calls 
[peerRemoved|https://github.com/apache/hbase/blob/b231dd620f107b488b88599e16dc846eb856972c/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java#L244],
 terminating the source (itself) and leaving the terminated source in the 
source manager without clearing its 
[metrics|https://github.com/apache/hbase/blob/b231dd620f107b488b88599e16dc846eb856972c/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java#L645].
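
A minimal sketch of the intended cleanup (the method names here are assumptions for illustration, not the actual HBase code): whichever path notices the missing peer, the source should be dropped from the manager, terminated, and its per-peer metrics cleared.
{code:java}
// Hypothetical sketch only: the full removePeer workflow that should still run even
// when the ReplicationSource thread itself is the one that noticed the NoNodeException.
void removePeerAndCleanup(String peerId) {
  ReplicationSourceInterface src = sources.remove(peerId); // drop from the source manager
  if (src != null) {
    src.terminate("Peer " + peerId + " removed");          // stop reader/shipper threads
    src.getSourceMetrics().clear();                        // otherwise the metrics leak forever
  }
}
{code}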

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-25664) Adding replication peer should handle the undeleted queue exception

2021-03-15 Thread Sandeep Pal (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17301425#comment-17301425
 ] 

Sandeep Pal edited comment on HBASE-25664 at 3/15/21, 6:43 AM:
---

[~zhangduo] That is why I specifically mentioned that "I am referring to 
branch-1". I believe in branch-2 and above, we depend upon RPCs instead of ZK 
but I am not fully aware if RPCs can be delayed like ZK. 


was (Author: sandeep.pal):
[~zhangduo] That is why I specifically mentioned that "I am referring to 
branch-1". I believe in branch-2 and above, we depends upon RPCs instead of ZK 
but I am not fully aware if RPCs can be delayed like ZK. 

> Adding replication peer should handle the undeleted queue exception
> ---
>
> Key: HBASE-25664
> URL: https://issues.apache.org/jira/browse/HBASE-25664
> Project: HBase
>  Issue Type: Improvement
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
>
> Currently, if we try to add a peer and there is a replication queue existing 
> for that peer, it doesn't let the replication peer created. 
> Instead, we should delete the queue and proceed with the creating of 
> replication peer. Any queue without no corresponding replication peer is 
> useless anyway. So, we shouldn't wait for cleaner to come and clean it before 
> creating the peer. 
>  
> {code:java}
> org.apache.hadoop.hbase.replication.ReplicationException: undeleted queue for 
> peerId: xyz_peerid, replicator: hostname.fakeaddress.com,60020,1607576586258, 
> queueId: xyz_peerid
> java.lang.RuntimeException: 
> org.apache.hadoop.hbase.replication.ReplicationException: undeleted queue for 
> peerId: xyz_peerid, replicator: hostname.fakeaddress.com,60020,1607576586258, 
> queueId: xyz_peerid
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25664) Adding replication peer should handle the undeleted queue exception

2021-03-15 Thread Sandeep Pal (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17301425#comment-17301425
 ] 

Sandeep Pal commented on HBASE-25664:
-

[~zhangduo] That is why I specifically mentioned that "I am referring to 
branch-1". I believe in branch-2 and above, we depends upon RPCs instead of ZK 
but I am not fully aware if RPCs can be delayed like ZK. 

> Adding replication peer should handle the undeleted queue exception
> ---
>
> Key: HBASE-25664
> URL: https://issues.apache.org/jira/browse/HBASE-25664
> Project: HBase
>  Issue Type: Improvement
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
>
> Currently, if we try to add a peer and there is a replication queue existing 
> for that peer, it doesn't let the replication peer created. 
> Instead, we should delete the queue and proceed with the creating of 
> replication peer. Any queue without no corresponding replication peer is 
> useless anyway. So, we shouldn't wait for cleaner to come and clean it before 
> creating the peer. 
>  
> {code:java}
> org.apache.hadoop.hbase.replication.ReplicationException: undeleted queue for 
> peerId: xyz_peerid, replicator: hostname.fakeaddress.com,60020,1607576586258, 
> queueId: xyz_peerid
> java.lang.RuntimeException: 
> org.apache.hadoop.hbase.replication.ReplicationException: undeleted queue for 
> peerId: xyz_peerid, replicator: hostname.fakeaddress.com,60020,1607576586258, 
> queueId: xyz_peerid
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25664) Adding replication peer should handle the undeleted queue exception

2021-03-14 Thread Sandeep Pal (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17301401#comment-17301401
 ] 

Sandeep Pal commented on HBASE-25664:
-

I believe it is this: _"It is still being deleted in the background"._

This is because queue removal depends upon the zk listener being invoked after the 
peer node gets deleted. This may get delayed, since there is no guarantee of 
when the zk notification will arrive (I am referring to branch-1).

 >This may cause trouble for the RemovePeerProcedure?
I believe we need to handle it accordingly if the queue is already deleted. 
But IMO, HBase should not prevent peer creation just because a queue exists.

Let me know what you think.

> Adding replication peer should handle the undeleted queue exception
> ---
>
> Key: HBASE-25664
> URL: https://issues.apache.org/jira/browse/HBASE-25664
> Project: HBase
>  Issue Type: Improvement
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
>
> Currently, if we try to add a peer and there is a replication queue existing 
> for that peer, it doesn't let the replication peer created. 
> Instead, we should delete the queue and proceed with the creating of 
> replication peer. Any queue without no corresponding replication peer is 
> useless anyway. So, we shouldn't wait for cleaner to come and clean it before 
> creating the peer. 
>  
> {code:java}
> org.apache.hadoop.hbase.replication.ReplicationException: undeleted queue for 
> peerId: xyz_peerid, replicator: hostname.fakeaddress.com,60020,1607576586258, 
> queueId: xyz_peerid
> java.lang.RuntimeException: 
> org.apache.hadoop.hbase.replication.ReplicationException: undeleted queue for 
> peerId: xyz_peerid, replicator: hostname.fakeaddress.com,60020,1607576586258, 
> queueId: xyz_peerid
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25664) Adding replication peer should handle the undeleted queue exception

2021-03-14 Thread Sandeep Pal (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17301369#comment-17301369
 ] 

Sandeep Pal commented on HBASE-25664:
-

[~zhangduo] Do you mean deleting the replication queues is time-consuming? I was 
thinking we should just remove the znodes for the queues but not the actual WALs; 
the WALs can later be cleaned by `ReplicationLogCleaner`. This is to make 
sure the new peer doesn't replicate anything left over from the previously 
existing peer. What do you think? 
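
A rough sketch of that idea (the znode layout and variable names are assumed for illustration): only the queue metadata under the replication rs znode is removed; the WAL files stay on HDFS for ReplicationLogCleaner to reclaim.
{code:java}
// Sketch only: delete leftover queue znodes for peerId across all replicators
// before (re)creating the peer; the WAL files themselves are untouched.
List<String> replicators = ZKUtil.listChildrenNoWatch(zkw, replicationRsZNode);
if (replicators != null) {
  for (String replicator : replicators) {
    String rsZnode = ZKUtil.joinZNode(replicationRsZNode, replicator);
    String queueZnode = ZKUtil.joinZNode(rsZnode, peerId);
    ZKUtil.deleteNodeRecursively(zkw, queueZnode); // removes queue metadata only
  }
}
{code}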

> Adding replication peer should handle the undeleted queue exception
> ---
>
> Key: HBASE-25664
> URL: https://issues.apache.org/jira/browse/HBASE-25664
> Project: HBase
>  Issue Type: Improvement
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
>
> Currently, if we try to add a peer and there is a replication queue existing 
> for that peer, it doesn't let the replication peer created. 
> Instead, we should delete the queue and proceed with the creating of 
> replication peer. Any queue without no corresponding replication peer is 
> useless anyway. So, we shouldn't wait for cleaner to come and clean it before 
> creating the peer. 
>  
> {code:java}
> org.apache.hadoop.hbase.replication.ReplicationException: undeleted queue for 
> peerId: xyz_peerid, replicator: hostname.fakeaddress.com,60020,1607576586258, 
> queueId: xyz_peerid
> java.lang.RuntimeException: 
> org.apache.hadoop.hbase.replication.ReplicationException: undeleted queue for 
> peerId: xyz_peerid, replicator: hostname.fakeaddress.com,60020,1607576586258, 
> queueId: xyz_peerid
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25664) Adding replication peer should handle the undeleted queue exception

2021-03-14 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-25664:

Description: 
Currently, if we try to add a peer and there is a replication queue existing 
for that peer, it doesn't let the replication peer created. 

Instead, we should delete the queue and proceed with the creating of 
replication peer. Any queue without no corresponding replication peer is 
useless anyway. So, we shouldn't wait for cleaner to come and clean it before 
creating the peer. 

 
{code:java}
org.apache.hadoop.hbase.replication.ReplicationException: undeleted queue for 
peerId: xyz_peerid, replicator: hostname.fakeaddress.com,60020,1607576586258, 
queueId: xyz_peerid
java.lang.RuntimeException: 
org.apache.hadoop.hbase.replication.ReplicationException: undeleted queue for 
peerId: xyz_peerid, replicator: hostname.fakeaddress.com,60020,1607576586258, 
queueId: xyz_peerid

{code}

  was:
Currently, if we try to add a peer and there is a replication queue existing 
for that peer, it doesn't let the replication peer created. 

Instead, we should delete the queue and proceed with the creating of 
replication peer. Any queue without no corresponding replication peer is 
useless anyway. So, we shouldn't wait for cleaner to come and clean it before 
creating the peer. 


> Adding replication peer should handle the undeleted queue exception
> ---
>
> Key: HBASE-25664
> URL: https://issues.apache.org/jira/browse/HBASE-25664
> Project: HBase
>  Issue Type: Improvement
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
>
> Currently, if we try to add a peer and there is a replication queue existing 
> for that peer, it doesn't let the replication peer created. 
> Instead, we should delete the queue and proceed with the creating of 
> replication peer. Any queue without no corresponding replication peer is 
> useless anyway. So, we shouldn't wait for cleaner to come and clean it before 
> creating the peer. 
>  
> {code:java}
> org.apache.hadoop.hbase.replication.ReplicationException: undeleted queue for 
> peerId: xyz_peerid, replicator: hostname.fakeaddress.com,60020,1607576586258, 
> queueId: xyz_peerid
> java.lang.RuntimeException: 
> org.apache.hadoop.hbase.replication.ReplicationException: undeleted queue for 
> peerId: xyz_peerid, replicator: hostname.fakeaddress.com,60020,1607576586258, 
> queueId: xyz_peerid
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25664) Adding replication peer should handle the undeleted queue exception

2021-03-14 Thread Sandeep Pal (Jira)
Sandeep Pal created HBASE-25664:
---

 Summary: Adding replication peer should handle the undeleted queue 
exception
 Key: HBASE-25664
 URL: https://issues.apache.org/jira/browse/HBASE-25664
 Project: HBase
  Issue Type: Improvement
Reporter: Sandeep Pal
Assignee: Sandeep Pal


Currently, if we try to add a peer and there is an existing replication queue 
for that peer, the replication peer is not allowed to be created. 

Instead, we should delete the queue and proceed with creating the 
replication peer. Any queue with no corresponding replication peer is 
useless anyway, so we shouldn't wait for the cleaner to come and clean it up 
before creating the peer. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25627) HBase replication should have a metric to represent if it cannot talk to peer's zk

2021-03-04 Thread Sandeep Pal (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295674#comment-17295674
 ] 

Sandeep Pal commented on HBASE-25627:
-

Following up on the discussion 
[here|https://github.com/apache/hbase/pull/3009#discussion_r586815267], 
[~bharathv]'s suggestion is to have a metric at the source initialization level 
instead, which will ultimately keep track of ZK peer connection issues. 

We can keep track of the number of sources getting initialized, and if they are 
stuck, there can be monitoring on that metric.
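
A hedged sketch of what that could look like (the metric name and increment/decrement helpers below are assumptions, not the final implementation): bracket the source initialization, so a gauge that stays above zero flags sources stuck talking to the peer's ZK.
{code:java}
// Hypothetical illustration: a gauge of sources currently stuck in initialization.
metrics.incrSourcesInitializing();              // entering ReplicationSource init
UUID peerClusterId = null;
while (peerClusterId == null && isSourceActive()) {
  peerClusterId = replicationEndpoint.getPeerUUID();  // null while peer ZK is unreachable
  if (peerClusterId == null) {
    sleepForRetries("Cannot contact the peer's zk ensemble", sleepMultiplier);
  }
}
metrics.decrSourcesInitializing();              // initialization finished
{code}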

> HBase replication should have a metric to represent if it cannot talk to 
> peer's zk
> --
>
> Key: HBASE-25627
> URL: https://issues.apache.org/jira/browse/HBASE-25627
> Project: HBase
>  Issue Type: Improvement
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
>
> There can be situation when the cluster is not able to talk to peer cluster 
> ZK, in that case, yes the logQueue will be accumulating but without digging 
> into the logs, we cannot know what's the reason of loqQueue getting 
> accumulating on the source. 
> Since the replication source doesn't even start the shipper in this case, it 
> is good to have a dedicated metric if the RS cannot talk to the peer's ZK at 
> all. 
>  
> {code:java}
> 2021-03-03 04:02:10,704 DEBUG [peerId] zookeeper.RecoverableZooKeeper - 
> Possibly transient ZooKeeper, 
> quorum=zookeeper-0.zookeeper-a.fakeAddress:2181,zookeeper-1.zookeeper-a.fakeAddress:2181,zookeeper-2.zookeeper-a.fakeAddress:2181,zookeeper-3.zookeeper-a.fakeAddress:2181,zookeeper-4.zookeeper-a.fakeAddress:2181,
>  exception=org.apache.zookeeper.KeeperException$AuthFailedException: 
> KeeperErrorCode = AuthFailed for /hbase/hbaseid2021-03-03 04:02:10,704 DEBUG 
> [peerId] zookeeper.RecoverableZooKeeper - Possibly transient ZooKeeper, 
> quorum=zookeeper-0.zookeeper-a.fakeAddress:2181,zookeeper-1.zookeeper-a.fakeAddress:2181,zookeeper-2.zookeeper-a.fakeAddress:2181,zookeeper-3.zookeeper-a.fakeAddress:2181,zookeeper-4.zookeeper-a.fakeAddress:2181,
>  exception=org.apache.zookeeper.KeeperException$AuthFailedException: 
> KeeperErrorCode = AuthFailed for 
> /hbase/hbaseidorg.apache.zookeeper.KeeperException$AuthFailedException: 
> KeeperErrorCode = AuthFailed for /hbase/hbaseid at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:126) at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:54) at 
> org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1119) at 
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:284)
>  at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:469) at 
> org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
>  at 
> org.apache.hadoop.hbase.zookeeper.ZKClusterId.getUUIDForCluster(ZKClusterId.java:96)
>  at 
> org.apache.hadoop.hbase.replication.HBaseReplicationEndpoint.getPeerUUID(HBaseReplicationEndpoint.java:104)
>  at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:306)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work started] (HBASE-25627) HBase replication should have a metric to represent if it cannot talk to peer's zk

2021-03-02 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HBASE-25627 started by Sandeep Pal.
---
> HBase replication should have a metric to represent if it cannot talk to 
> peer's zk
> --
>
> Key: HBASE-25627
> URL: https://issues.apache.org/jira/browse/HBASE-25627
> Project: HBase
>  Issue Type: Improvement
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
>
> There can be situation when the cluster is not able to talk to peer cluster 
> ZK, in that case, yes the logQueue will be accumulating but without digging 
> into the logs, we cannot know what's the reason of loqQueue getting 
> accumulating on the source. 
> Since the replication source doesn't even start the shipper in this case, it 
> is good to have a dedicated metric if the RS cannot talk to the peer's ZK at 
> all. 
>  
> {code:java}
> 2021-03-03 04:02:10,704 DEBUG [peerId] zookeeper.RecoverableZooKeeper - 
> Possibly transient ZooKeeper, 
> quorum=zookeeper-0.zookeeper-a.fakeAddress:2181,zookeeper-1.zookeeper-a.fakeAddress:2181,zookeeper-2.zookeeper-a.fakeAddress:2181,zookeeper-3.zookeeper-a.fakeAddress:2181,zookeeper-4.zookeeper-a.fakeAddress:2181,
>  exception=org.apache.zookeeper.KeeperException$AuthFailedException: 
> KeeperErrorCode = AuthFailed for /hbase/hbaseid2021-03-03 04:02:10,704 DEBUG 
> [peerId] zookeeper.RecoverableZooKeeper - Possibly transient ZooKeeper, 
> quorum=zookeeper-0.zookeeper-a.fakeAddress:2181,zookeeper-1.zookeeper-a.fakeAddress:2181,zookeeper-2.zookeeper-a.fakeAddress:2181,zookeeper-3.zookeeper-a.fakeAddress:2181,zookeeper-4.zookeeper-a.fakeAddress:2181,
>  exception=org.apache.zookeeper.KeeperException$AuthFailedException: 
> KeeperErrorCode = AuthFailed for 
> /hbase/hbaseidorg.apache.zookeeper.KeeperException$AuthFailedException: 
> KeeperErrorCode = AuthFailed for /hbase/hbaseid at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:126) at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:54) at 
> org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1119) at 
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:284)
>  at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:469) at 
> org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
>  at 
> org.apache.hadoop.hbase.zookeeper.ZKClusterId.getUUIDForCluster(ZKClusterId.java:96)
>  at 
> org.apache.hadoop.hbase.replication.HBaseReplicationEndpoint.getPeerUUID(HBaseReplicationEndpoint.java:104)
>  at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:306)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25627) HBase replication should have a metric to represent if it cannot talk to peer's zk

2021-03-02 Thread Sandeep Pal (Jira)
Sandeep Pal created HBASE-25627:
---

 Summary: HBase replication should have a metric to represent if it 
cannot talk to peer's zk
 Key: HBASE-25627
 URL: https://issues.apache.org/jira/browse/HBASE-25627
 Project: HBase
  Issue Type: Improvement
Reporter: Sandeep Pal
Assignee: Sandeep Pal


There can be situations when the cluster is not able to talk to the peer cluster's 
ZK. In that case, yes, the logQueue will be accumulating, but without digging into 
the logs we cannot know the reason the logQueue is accumulating on 
the source. 

Since the replication source doesn't even start the shipper in this case, it is 
good to have a dedicated metric for when the RS cannot talk to the peer's ZK at all. 

 
{code:java}
2021-03-03 04:02:10,704 DEBUG [peerId] zookeeper.RecoverableZooKeeper - 
Possibly transient ZooKeeper, 
quorum=zookeeper-0.zookeeper-a.fakeAddress:2181,zookeeper-1.zookeeper-a.fakeAddress:2181,zookeeper-2.zookeeper-a.fakeAddress:2181,zookeeper-3.zookeeper-a.fakeAddress:2181,zookeeper-4.zookeeper-a.fakeAddress:2181,
 exception=org.apache.zookeeper.KeeperException$AuthFailedException: 
KeeperErrorCode = AuthFailed for /hbase/hbaseid2021-03-03 04:02:10,704 DEBUG 
[peerId] zookeeper.RecoverableZooKeeper - Possibly transient ZooKeeper, 
quorum=zookeeper-0.zookeeper-a.fakeAddress:2181,zookeeper-1.zookeeper-a.fakeAddress:2181,zookeeper-2.zookeeper-a.fakeAddress:2181,zookeeper-3.zookeeper-a.fakeAddress:2181,zookeeper-4.zookeeper-a.fakeAddress:2181,
 exception=org.apache.zookeeper.KeeperException$AuthFailedException: 
KeeperErrorCode = AuthFailed for 
/hbase/hbaseidorg.apache.zookeeper.KeeperException$AuthFailedException: 
KeeperErrorCode = AuthFailed for /hbase/hbaseid at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:126) at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:54) at 
org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1119) at 
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:284)
 at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:469) at 
org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
 at 
org.apache.hadoop.hbase.zookeeper.ZKClusterId.getUUIDForCluster(ZKClusterId.java:96)
 at 
org.apache.hadoop.hbase.replication.HBaseReplicationEndpoint.getPeerUUID(HBaseReplicationEndpoint.java:104)
 at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:306)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25596) Fix NPE in ReplicationSourceManager as well as avoid permanently unreplicated data due to EOFException from WAL

2021-03-02 Thread Sandeep Pal (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17293980#comment-17293980
 ] 

Sandeep Pal commented on HBASE-25596:
-

Sorry, please scratch that; I will create the PR. I got confused with another Jira.

> Fix NPE in ReplicationSourceManager as well as avoid permanently unreplicated 
> data due to EOFException from WAL
> ---
>
> Key: HBASE-25596
> URL: https://issues.apache.org/jira/browse/HBASE-25596
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Critical
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.5.0
>
>
> There seems to be a major issue with how we handle the EOF exception from 
> WALEntryStream. 
> Problem:
> When we see EOFException, we try to handle it and remove it from the log 
> queue, but we never try to ship the existing batch of entries. *This is a 
> permanent data loss in replication.*
>  
> Secondly, we do not stop the reader on encountering the EOFException and thus 
> if EOFException was on the last WAL, we still try to process the WALEntry 
> stream and ship the empty batch with lastWALPath set to null. This is the 
> reason of NPE as below which *crash* the region server. 
> {code:java}
> 2021-02-16 15:33:21,293 ERROR [,60020,1613262147968] 
> regionserver.ReplicationSource - Unexpected exception in 
> ReplicationSourceWorkerThread, 
> currentPath=nulljava.lang.NullPointerExceptionat 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:193)at
>  
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogPosition(ReplicationSource.java:831)at
>  
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(ReplicationSource.java:746)at
>  
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:650)2021-02-16
>  15:33:21,294 INFO [,60020,1613262147968] regionserver.HRegionServer - 
> STOPPED: Unexpected exception in ReplicationSourceWorkerThread
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25596) Fix NPE in ReplicationSourceManager as well as avoid permanently unreplicated data due to EOFException from WAL

2021-03-02 Thread Sandeep Pal (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17293979#comment-17293979
 ] 

Sandeep Pal commented on HBASE-25596:
-

This was only for branch-1; do you mean recreating the PR for branch-1?

> Fix NPE in ReplicationSourceManager as well as avoid permanently unreplicated 
> data due to EOFException from WAL
> ---
>
> Key: HBASE-25596
> URL: https://issues.apache.org/jira/browse/HBASE-25596
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Critical
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.5.0
>
>
> There seems to be a major issue with how we handle the EOF exception from 
> WALEntryStream. 
> Problem:
> When we see EOFException, we try to handle it and remove it from the log 
> queue, but we never try to ship the existing batch of entries. *This is a 
> permanent data loss in replication.*
>  
> Secondly, we do not stop the reader on encountering the EOFException and thus 
> if EOFException was on the last WAL, we still try to process the WALEntry 
> stream and ship the empty batch with lastWALPath set to null. This is the 
> reason of NPE as below which *crash* the region server. 
> {code:java}
> 2021-02-16 15:33:21,293 ERROR [,60020,1613262147968] 
> regionserver.ReplicationSource - Unexpected exception in 
> ReplicationSourceWorkerThread, 
> currentPath=nulljava.lang.NullPointerExceptionat 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:193)at
>  
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogPosition(ReplicationSource.java:831)at
>  
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(ReplicationSource.java:746)at
>  
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:650)2021-02-16
>  15:33:21,294 INFO [,60020,1613262147968] regionserver.HRegionServer - 
> STOPPED: Unexpected exception in ReplicationSourceWorkerThread
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25613) [Branch-2 and Master]Handle the NoNode exception in remove log replication in a better way then just log WARN

2021-02-26 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-25613:

Description: 
Currently, when we see the NoNodeException in the replication source while 
removing a log from ZK, we swallow that exception and log WARN. 

 

In certain cases, we might have the peer removed and corresponding logs 
removing as well but the replication source continuous to run because of an RPC 
failure or anything. 

In stead of just log WARN we should check if the peer is removed, if it is the 
case, we should terminate the source or try to execute the removePeer workflow 
again.

 

This would prevent the orphaned source execution infinitely. 

  was:
Currently, when we see the NoNodeException in the replication source while 
removing a log from ZK, we swallow that exception and log WARN. 

 

In certain cases, we might have the peer removed and corresponding logs 
removing as well but the replication source continuous to run because of an RPC 
failure or anything. 

In stead of just log WARN we should check if the peer is removed, it it we 
should terminate the source or try to execute the removePeer workflow again.

 

This would prevent the orphaned source execution infinitely. 


> [Branch-2 and Master]Handle the NoNode exception in remove log replication in 
> a better way then just log WARN
> -
>
> Key: HBASE-25613
> URL: https://issues.apache.org/jira/browse/HBASE-25613
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0
>
>
> Currently, when we see the NoNodeException in the replication source while 
> removing a log from ZK, we swallow that exception and log WARN. 
>  
> In certain cases, we might have the peer removed and corresponding logs 
> removing as well but the replication source continuous to run because of an 
> RPC failure or anything. 
> In stead of just log WARN we should check if the peer is removed, if it is 
> the case, we should terminate the source or try to execute the removePeer 
> workflow again.
>  
> This would prevent the orphaned source execution infinitely. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25583) Handle the NoNode exception in remove log replication and avoid RS crash

2021-02-26 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-25583:

Priority: Critical  (was: Major)

> Handle the NoNode exception in remove log replication and avoid RS crash
> 
>
> Key: HBASE-25583
> URL: https://issues.apache.org/jira/browse/HBASE-25583
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Critical
> Fix For: 1.7.0
>
>
> Should not crash the region server it there is a NoNode exception while 
> removing the log
>  We should look into the excpetion and if it is NoNode we shouldn't crash. 
> There might be a possiblity the node was deleted as part of peer tear down.
> {code:java}
> @Override
> public void removeLog(String queueId, String filename) {
> try { 
>   String znode = ZKUtil.joinZNode(this.myQueuesZnode, queueId); 
>   znode = ZKUtil.joinZNode(znode, filename); 
> ZKUtil.deleteNode(this.zookeeper, znode); }
> catch (KeeperException e) { 
>   this.abortable.abort("Failed to remove wal from queue (queueId=" + queueId 
> + ", filename=" + filename + ")", e); }
> }
> {code}
> This was the exception observed on region servers:
> {code:java}
> 2021-02-16 20:11:58,567 FATAL [95922885,xyz_peer] regionserver.HRegionServer 
> - ABORTING region server regionserver-111,60020,1613495922885: Failed to 
> remove wal from queue (queueId=xyz_peer, 
> filename=regionserver-111%2C60020%2C1613495922885.1613505863058)
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = 
> NoNode for 
> /hbase/replication/rs/regionserver-111,60020,1613495922885/xyz_peer/regionserver-111%2C60020%2C1613495922885.16135058630
> 58
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:114)
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
> at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:890)
> at 
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:238)
> at 
> org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1341)
> at 
> org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1330)
> at 
> org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.removeLog(ReplicationQueuesZKImpl.java:142)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.cleanOldLogs(ReplicationSourceMana
> ger.java:232)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.cleanOldLogs(ReplicationSourceMana
> ger.java:222)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(Replica
> tionSourceManager.java:198)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogP
> osition(ReplicationSource.java:831)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(
> ReplicationSource.java:746)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(Replic
> ationSource.java:650)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25613) [Branch-2 and Master]Handle the NoNode exception in remove log replication in a better way then just log WARN

2021-02-26 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-25613:

Summary: [Branch-2 and Master]Handle the NoNode exception in remove log 
replication in a better way then just log WARN  (was: Handle the NoNode 
exception in remove log replication in a better way then just log WARN)

> [Branch-2 and Master]Handle the NoNode exception in remove log replication in 
> a better way then just log WARN
> -
>
> Key: HBASE-25613
> URL: https://issues.apache.org/jira/browse/HBASE-25613
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0
>
>
> Currently, when we see the NoNodeException in the replication source while 
> removing a log from ZK, we swallow that exception and log WARN. 
>  
> In certain cases, the peer may have been removed and the corresponding logs 
> removed as well, but the replication source continues to run because of an 
> RPC failure or similar. 
> Instead of just logging WARN, we should check if the peer is removed; if it 
> is, we should terminate the source or try to execute the removePeer workflow again.
>  
> This would prevent the orphaned source from running indefinitely. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25613) Handle the NoNode exception in remove log replication in a better way than just log WARN

2021-02-26 Thread Sandeep Pal (Jira)
Sandeep Pal created HBASE-25613:
---

 Summary: Handle the NoNode exception in remove log replication in 
a better way than just log WARN
 Key: HBASE-25613
 URL: https://issues.apache.org/jira/browse/HBASE-25613
 Project: HBase
  Issue Type: Bug
Reporter: Sandeep Pal
Assignee: Sandeep Pal
 Fix For: 3.0.0-alpha-1, 2.5.0


Currently, when we see the NoNodeException in the replication source while 
removing a log from ZK, we swallow that exception and log WARN. 

 

In certain cases, the peer may have been removed and the corresponding logs 
removed as well, but the replication source continues to run because of an RPC 
failure or similar. 

Instead of just logging WARN, we should check if the peer is removed; if it is, 
we should terminate the source or try to execute the removePeer workflow again.

 

This would prevent the orphaned source from running indefinitely. 
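A rough, hypothetical sketch of that check (removeLogZNode(), peerExists(), peerId, source and LOG are illustrative names, not the actual HBase fields): on a NoNodeException, verify the peer still exists before deciding whether to merely warn or to terminate the orphaned source.

{code:java}
try {
  removeLogZNode(queueId, filename);   // assumed wrapper around the ZK delete
} catch (KeeperException.NoNodeException e) {
  if (!peerExists(peerId)) {
    // Peer is gone: this source is orphaned, so terminate it (or re-drive the
    // removePeer workflow) instead of leaving it running indefinitely.
    source.terminate("Peer " + peerId + " was removed; stopping orphaned source");
  } else {
    LOG.warn("WAL znode already removed for queue " + queueId, e);
  }
}
{code}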



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25583) Handle the NoNode exception in remove log replication and avoid RS crash

2021-02-26 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-25583:

Summary: Handle the NoNode exception in remove log replication and avoid RS 
crash  (was: Handle the NoNode exception in remove log replication)

> Handle the NoNode exception in remove log replication and avoid RS crash
> 
>
> Key: HBASE-25583
> URL: https://issues.apache.org/jira/browse/HBASE-25583
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
> Fix For: 1.7.0
>
>
> Should not crash the region server if there is a NoNode exception while 
> removing the log.
>  We should look into the exception and if it is NoNode we shouldn't crash. 
> There might be a possibility the node was deleted as part of peer tear down.
> {code:java}
> @Override
> public void removeLog(String queueId, String filename) {
>   try {
>     String znode = ZKUtil.joinZNode(this.myQueuesZnode, queueId);
>     znode = ZKUtil.joinZNode(znode, filename);
>     ZKUtil.deleteNode(this.zookeeper, znode);
>   } catch (KeeperException e) {
>     this.abortable.abort("Failed to remove wal from queue (queueId=" + queueId
>       + ", filename=" + filename + ")", e);
>   }
> }
> {code}
> This was the exception observed on region servers:
> {code:java}
> 2021-02-16 20:11:58,567 FATAL [95922885,xyz_peer] regionserver.HRegionServer 
> - ABORTING region server regionserver-111,60020,1613495922885: Failed to 
> remove wal from queue (queueId=xyz_peer, 
> filename=regionserver-111%2C60020%2C1613495922885.1613505863058)
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = 
> NoNode for 
> /hbase/replication/rs/regionserver-111,60020,1613495922885/xyz_peer/regionserver-111%2C60020%2C1613495922885.16135058630
> 58
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:114)
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
> at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:890)
> at 
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:238)
> at 
> org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1341)
> at 
> org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1330)
> at 
> org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.removeLog(ReplicationQueuesZKImpl.java:142)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.cleanOldLogs(ReplicationSourceMana
> ger.java:232)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.cleanOldLogs(ReplicationSourceMana
> ger.java:222)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(Replica
> tionSourceManager.java:198)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogP
> osition(ReplicationSource.java:831)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(
> ReplicationSource.java:746)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(Replic
> ationSource.java:650)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25583) Handle the NoNode exception in remove log replication and avoid crash

2021-02-23 Thread Sandeep Pal (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17289358#comment-17289358
 ] 

Sandeep Pal commented on HBASE-25583:
-

[~apurtell] We don't abort the RS in branch-2 and above for `NoNodeException`, 
but there is an improvement I see, which is terminating the source if no peer 
exists in the source path for `NoNodeException`. Also, branch-2 and above don't 
rely on a ZK listener for peer removal; it's RPCs from the master to the RSes.

If the same thing happened on branch-2 and master, it would keep the source 
running until the RS restarts, which is not the right thing IMO. 

> Handle the NoNode exception in remove log replication and avoid crash
> -
>
> Key: HBASE-25583
> URL: https://issues.apache.org/jira/browse/HBASE-25583
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.3.5, 2.4.2
>
>
> Should not crash the region server if there is a NoNode exception while 
> removing the log.
>  We should look into the exception and if it is NoNode we shouldn't crash. 
> There might be a possibility the node was deleted as part of peer tear down.
> {code:java}
> @Override
> public void removeLog(String queueId, String filename) {
>   try {
>     String znode = ZKUtil.joinZNode(this.myQueuesZnode, queueId);
>     znode = ZKUtil.joinZNode(znode, filename);
>     ZKUtil.deleteNode(this.zookeeper, znode);
>   } catch (KeeperException e) {
>     this.abortable.abort("Failed to remove wal from queue (queueId=" + queueId
>       + ", filename=" + filename + ")", e);
>   }
> }
> {code}
> This was the exception observed on region servers:
> {code:java}
> 2021-02-16 20:11:58,567 FATAL [95922885,xyz_peer] regionserver.HRegionServer 
> - ABORTING region server regionserver-111,60020,1613495922885: Failed to 
> remove wal from queue (queueId=xyz_peer, 
> filename=regionserver-111%2C60020%2C1613495922885.1613505863058)
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = 
> NoNode for 
> /hbase/replication/rs/regionserver-111,60020,1613495922885/xyz_peer/regionserver-111%2C60020%2C1613495922885.16135058630
> 58
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:114)
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
> at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:890)
> at 
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:238)
> at 
> org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1341)
> at 
> org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1330)
> at 
> org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.removeLog(ReplicationQueuesZKImpl.java:142)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.cleanOldLogs(ReplicationSourceMana
> ger.java:232)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.cleanOldLogs(ReplicationSourceMana
> ger.java:222)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(Replica
> tionSourceManager.java:198)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogP
> osition(ReplicationSource.java:831)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(
> ReplicationSource.java:746)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(Replic
> ationSource.java:650)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work started] (HBASE-25583) Handle the NoNode exception in remove log replication and avoid crash

2021-02-22 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HBASE-25583 started by Sandeep Pal.
---
> Handle the NoNode exception in remove log replication and avoid crash
> -
>
> Key: HBASE-25583
> URL: https://issues.apache.org/jira/browse/HBASE-25583
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.3.5, 2.4.2
>
>
> Should not crash the region server if there is a NoNode exception while 
> removing the log.
>  We should look into the exception and if it is NoNode we shouldn't crash. 
> There might be a possibility the node was deleted as part of peer tear down.
> {code:java}
> @Override
> public void removeLog(String queueId, String filename) {
>   try {
>     String znode = ZKUtil.joinZNode(this.myQueuesZnode, queueId);
>     znode = ZKUtil.joinZNode(znode, filename);
>     ZKUtil.deleteNode(this.zookeeper, znode);
>   } catch (KeeperException e) {
>     this.abortable.abort("Failed to remove wal from queue (queueId=" + queueId
>       + ", filename=" + filename + ")", e);
>   }
> }
> {code}
> This was the exception observed on region servers:
> {code:java}
> 2021-02-16 20:11:58,567 FATAL [95922885,xyz_peer] regionserver.HRegionServer 
> - ABORTING region server regionserver-111,60020,1613495922885: Failed to 
> remove wal from queue (queueId=xyz_peer, 
> filename=regionserver-111%2C60020%2C1613495922885.1613505863058)
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = 
> NoNode for 
> /hbase/replication/rs/regionserver-111,60020,1613495922885/xyz_peer/regionserver-111%2C60020%2C1613495922885.16135058630
> 58
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:114)
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
> at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:890)
> at 
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:238)
> at 
> org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1341)
> at 
> org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1330)
> at 
> org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.removeLog(ReplicationQueuesZKImpl.java:142)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.cleanOldLogs(ReplicationSourceMana
> ger.java:232)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.cleanOldLogs(ReplicationSourceMana
> ger.java:222)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(Replica
> tionSourceManager.java:198)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogP
> osition(ReplicationSource.java:831)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(
> ReplicationSource.java:746)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(Replic
> ationSource.java:650)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25596) Fix NPE in ReplicationSourceManager as well as avoid permanently unreplicated data due to EOFException from WAL

2021-02-22 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-25596:

Description: 
There seems to be a major issue with how we handle the EOF exception from 
WALEntryStream. 

Problem:

When we see EOFException, we try to handle it and remove it from the log queue, 
but we never try to ship the existing batch of entries. *This is a permanent 
data loss in replication.*

 

Secondly, we do not stop the reader on encountering the EOFException, and thus 
if the EOFException was on the last WAL, we still try to process the WALEntry 
stream and ship an empty batch with lastWALPath set to null. This is the 
reason for the NPE below, which *crashes* the region server. 
{code:java}
2021-02-16 15:33:21,293 ERROR [,60020,1613262147968] 
regionserver.ReplicationSource - Unexpected exception in 
ReplicationSourceWorkerThread, currentPath=nulljava.lang.NullPointerExceptionat 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:193)at
 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogPosition(ReplicationSource.java:831)at
 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(ReplicationSource.java:746)at
 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:650)2021-02-16
 15:33:21,294 INFO [,60020,1613262147968] regionserver.HRegionServer - STOPPED: 
Unexpected exception in ReplicationSourceWorkerThread
{code}
 

 

  was:
There seems to be a major issue with how we handle the EOF exception from 
WALEntryStream. 

Problem:

When we see EOFException, we try to handle it and remove it from the log queue, 
but we never try to ship the existing batch of entries. *This is a permanent 
data loss in replication.*

 

Secondly, we do not stop the reader on encountering the EOFException and thus 
if EOFException was on the last WAL, we still try to process the WALEntry 
stream and ship the empty batch with lastWALPath set to null. This is the 
reason of NPE as below. 
{code:java}
2021-02-16 15:33:21,293 ERROR [,60020,1613262147968] 
regionserver.ReplicationSource - Unexpected exception in 
ReplicationSourceWorkerThread, currentPath=nulljava.lang.NullPointerExceptionat 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:193)at
 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogPosition(ReplicationSource.java:831)at
 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(ReplicationSource.java:746)at
 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:650)2021-02-16
 15:33:21,294 INFO [,60020,1613262147968] regionserver.HRegionServer - STOPPED: 
Unexpected exception in ReplicationSourceWorkerThread
{code}
 

 


> Fix NPE in ReplicationSourceManager as well as avoid permanently unreplicated 
> data due to EOFException from WAL
> ---
>
> Key: HBASE-25596
> URL: https://issues.apache.org/jira/browse/HBASE-25596
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Critical
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.3.5, 2.4.2
>
>
> There seems to be a major issue with how we handle the EOF exception from 
> WALEntryStream. 
> Problem:
> When we see EOFException, we try to handle it and remove it from the log 
> queue, but we never try to ship the existing batch of entries. *This is a 
> permanent data loss in replication.*
>  
> Secondly, we do not stop the reader on encountering the EOFException, and thus 
> if the EOFException was on the last WAL, we still try to process the WALEntry 
> stream and ship an empty batch with lastWALPath set to null. This is the 
> reason for the NPE below, which *crashes* the region server. 
> {code:java}
> 2021-02-16 15:33:21,293 ERROR [,60020,1613262147968] 
> regionserver.ReplicationSource - Unexpected exception in 
> ReplicationSourceWorkerThread, 
> currentPath=nulljava.lang.NullPointerExceptionat 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:193)at
>  
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogPosition(ReplicationSource.java:831)at
>  
> 

[jira] [Work started] (HBASE-25596) Fix NPE in ReplicationSourceManager as well as avoid permanently unreplicated data due to EOFException from WAL

2021-02-22 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HBASE-25596 started by Sandeep Pal.
---
> Fix NPE in ReplicationSourceManager as well as avoid permanently unreplicated 
> data due to EOFException from WAL
> ---
>
> Key: HBASE-25596
> URL: https://issues.apache.org/jira/browse/HBASE-25596
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Critical
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.3.5, 2.4.2
>
>
> There seems to be a major issue with how we handle the EOF exception from 
> WALEntryStream. 
> Problem:
> When we see EOFException, we try to handle it and remove it from the log 
> queue, but we never try to ship the existing batch of entries. *This is a 
> permanent data loss in replication.*
>  
> Secondly, we do not stop the reader on encountering the EOFException, and thus 
> if the EOFException was on the last WAL, we still try to process the WALEntry 
> stream and ship an empty batch with lastWALPath set to null. This is the 
> reason for the NPE below. 
> {code:java}
> 2021-02-16 15:33:21,293 ERROR [,60020,1613262147968] 
> regionserver.ReplicationSource - Unexpected exception in 
> ReplicationSourceWorkerThread, 
> currentPath=nulljava.lang.NullPointerExceptionat 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:193)at
>  
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogPosition(ReplicationSource.java:831)at
>  
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(ReplicationSource.java:746)at
>  
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:650)2021-02-16
>  15:33:21,294 INFO [,60020,1613262147968] regionserver.HRegionServer - 
> STOPPED: Unexpected exception in ReplicationSourceWorkerThread
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25596) Fix NPE in ReplicationSourceManager as well as avoid permanently unreplicated data due to EOFException from WAL

2021-02-22 Thread Sandeep Pal (Jira)
Sandeep Pal created HBASE-25596:
---

 Summary: Fix NPE in ReplicationSourceManager as well as avoid 
permanently unreplicated data due to EOFException from WAL
 Key: HBASE-25596
 URL: https://issues.apache.org/jira/browse/HBASE-25596
 Project: HBase
  Issue Type: Bug
Reporter: Sandeep Pal
Assignee: Sandeep Pal


There seems to be a major issue with how we handle the EOF exception from 
WALEntryStream. 

Problem:

When we see EOFException, we try to handle it and remove it from the log queue, 
but we never try to ship the existing batch of entries. *This is a permanent 
data loss in replication.*

 

Secondly, we do not stop the reader on encountering the EOFException, and thus 
if the EOFException was on the last WAL, we still try to process the WALEntry 
stream and ship an empty batch with lastWALPath set to null. This is the 
reason for the NPE below. 
{code:java}
2021-02-16 15:33:21,293 ERROR [,60020,1613262147968] 
regionserver.ReplicationSource - Unexpected exception in 
ReplicationSourceWorkerThread, currentPath=nulljava.lang.NullPointerExceptionat 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:193)at
 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogPosition(ReplicationSource.java:831)at
 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(ReplicationSource.java:746)at
 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:650)2021-02-16
 15:33:21,294 INFO [,60020,1613262147968] regionserver.HRegionServer - STOPPED: 
Unexpected exception in ReplicationSourceWorkerThread
{code}
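A hedged sketch of the ordering argued for above (batch, shipper, logQueue, currentWal, readNextEntriesInto() and stopReader() are illustrative names, not the actual reader fields): ship whatever was already read before removing the WAL from the queue, and stop cleanly when nothing is left.

{code:java}
try {
  readNextEntriesInto(batch);        // assumed read step that may hit the truncated WAL
} catch (EOFException eof) {
  if (!batch.isEmpty()) {
    shipper.ship(batch);             // ship the partial batch instead of silently dropping it
  }
  logQueue.remove(currentWal);       // only then discard the truncated/empty WAL
  if (logQueue.isEmpty()) {
    stopReader();                    // avoid shipping an empty batch with a null lastWALPath
  }
}
{code}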
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25583) Handle the NoNode exception in remove log replication and avoid crash

2021-02-18 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-25583:

Description: 
Should not crash the region server if there is a NoNode exception while 
removing the log.
 We should look into the exception and if it is NoNode we shouldn't crash. 
There might be a possibility the node was deleted as part of peer tear down.
{code:java}
@Override
public void removeLog(String queueId, String filename) {
  try {
    String znode = ZKUtil.joinZNode(this.myQueuesZnode, queueId);
    znode = ZKUtil.joinZNode(znode, filename);
    ZKUtil.deleteNode(this.zookeeper, znode);
  } catch (KeeperException e) {
    this.abortable.abort("Failed to remove wal from queue (queueId=" + queueId
      + ", filename=" + filename + ")", e);
  }
}
{code}
This was the exception observed on region servers:
{code:java}
2021-02-16 20:11:58,567 FATAL [95922885,xyz_peer] regionserver.HRegionServer - 
ABORTING region server regionserver-111,60020,1613495922885: Failed to remove 
wal from queue (queueId=xyz_peer, 
filename=regionserver-111%2C60020%2C1613495922885.1613505863058)
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode 
for 
/hbase/replication/rs/regionserver-111,60020,1613495922885/xyz_peer/regionserver-111%2C60020%2C1613495922885.16135058630
58
at org.apache.zookeeper.KeeperException.create(KeeperException.java:114)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:890)
at 
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:238)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1341)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1330)
at 
org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.removeLog(ReplicationQueuesZKImpl.java:142)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.cleanOldLogs(ReplicationSourceMana
ger.java:232)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.cleanOldLogs(ReplicationSourceMana
ger.java:222)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(Replica
tionSourceManager.java:198)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogP
osition(ReplicationSource.java:831)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(
ReplicationSource.java:746)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(Replic
ationSource.java:650)
{code}

  was:
Should not crash the region server if there is a NoNode exception while 
removing the log.
We should look into the exception and if it is NoNode we shouldn't crash. There 
might be a possibility the node was deleted as part of peer tear down. 

`  @Override
  public void removeLog(String queueId, String filename) {
try {
  String znode = ZKUtil.joinZNode(this.myQueuesZnode, queueId);
  znode = ZKUtil.joinZNode(znode, filename);
  ZKUtil.deleteNode(this.zookeeper, znode);
} catch (KeeperException e) {

  this.abortable.abort("Failed to remove wal from queue (queueId=" + 
queueId + ", filename="
  + filename + ")", e);
}
  }


> Handle the NoNode exception in remove log replication and avoid crash
> -
>
> Key: HBASE-25583
> URL: https://issues.apache.org/jira/browse/HBASE-25583
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
>
> Should not crash the region server if there is a NoNode exception while 
> removing the log.
>  We should look into the exception and if it is NoNode we shouldn't crash. 
> There might be a possibility the node was deleted as part of peer tear down.
> {code:java}
> @Override
> public void removeLog(String queueId, String filename) {
>   try {
>     String znode = ZKUtil.joinZNode(this.myQueuesZnode, queueId);
>     znode = ZKUtil.joinZNode(znode, filename);
>     ZKUtil.deleteNode(this.zookeeper, znode);
>   } catch (KeeperException e) {
>     this.abortable.abort("Failed to remove wal from queue (queueId=" + queueId
>       + ", filename=" + filename + ")", e);
>   }
> }
> {code}
> This was the exception observed on region servers:
> {code:java}
> 2021-02-16 20:11:58,567 FATAL [95922885,xyz_peer] regionserver.HRegionServer 
> - ABORTING region server regionserver-111,60020,1613495922885: Failed to 
> remove wal from queue (queueId=xyz_peer, 
> filename=regionserver-111%2C60020%2C1613495922885.1613505863058)
> 

[jira] [Created] (HBASE-25583) Handle the NoNode exception in remove log replication and avoid crash

2021-02-17 Thread Sandeep Pal (Jira)
Sandeep Pal created HBASE-25583:
---

 Summary: Handle the NoNode exception in remove log replication and 
avoid crash
 Key: HBASE-25583
 URL: https://issues.apache.org/jira/browse/HBASE-25583
 Project: HBase
  Issue Type: Bug
Reporter: Sandeep Pal
Assignee: Sandeep Pal


Should not crash the region server if there is a NoNode exception while 
removing the log.
We should look into the exception and if it is NoNode we shouldn't crash. There 
might be a possibility the node was deleted as part of peer tear down. 

`  @Override
  public void removeLog(String queueId, String filename) {
try {
  String znode = ZKUtil.joinZNode(this.myQueuesZnode, queueId);
  znode = ZKUtil.joinZNode(znode, filename);
  ZKUtil.deleteNode(this.zookeeper, znode);
} catch (KeeperException e) {

  this.abortable.abort("Failed to remove wal from queue (queueId=" + 
queueId + ", filename="
  + filename + ")", e);
}
  }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work started] (HBASE-25541) In WALEntryStream, set the current path to null while dequeing the log

2021-02-16 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HBASE-25541 started by Sandeep Pal.
---
> In WALEntryStream, set the current path to null while dequeing the log
> --
>
> Key: HBASE-25541
> URL: https://issues.apache.org/jira/browse/HBASE-25541
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 1.6.0, 1.7.0, 1.8.0
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25541) In WALEntryStream, set the current path to null while dequeing the log

2021-02-16 Thread Sandeep Pal (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17285500#comment-17285500
 ] 

Sandeep Pal commented on HBASE-25541:
-

Thanks for putting this up [~bharathv]. You are right, on dequeuing, the current 
path still points to the old log, which does not seem correct.
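A tiny illustrative sketch of the suggestion (dequeueCurrentLog(), logQueue and currentPath stand in for the real WALEntryStream members):

{code:java}
private void dequeueCurrentLog() {
  logQueue.remove();
  // Previously currentPath kept pointing at the WAL that was just removed;
  // clearing it ensures callers never see a stale path.
  currentPath = null;
}
{code}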

> In WALEntryStream, set the current path to null while dequeing the log
> --
>
> Key: HBASE-25541
> URL: https://issues.apache.org/jira/browse/HBASE-25541
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 1.6.0, 1.7.0, 1.8.0
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25541) In WALEntryStream, set the current path to null while dequeing the log

2021-02-01 Thread Sandeep Pal (Jira)
Sandeep Pal created HBASE-25541:
---

 Summary: In WALEntryStream, set the current path to null while 
dequeing the log
 Key: HBASE-25541
 URL: https://issues.apache.org/jira/browse/HBASE-25541
 Project: HBase
  Issue Type: Improvement
Affects Versions: 1.6.0, 1.7.0, 1.8.0
Reporter: Sandeep Pal
Assignee: Sandeep Pal






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25383) HBase doesn't update and remove the peer config from hbase.replication.source.custom.walentryfilters if the config is already set on the peer.

2020-12-22 Thread Sandeep Pal (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253329#comment-17253329
 ] 

Sandeep Pal commented on HBASE-25383:
-

Thanks [~vjasani], will submit a patch for branch-1 in a few hours. 

> HBase doesn't update and remove the peer config from 
> hbase.replication.source.custom.walentryfilters if the config is already set 
> on the peer. 
> ---
>
> Key: HBASE-25383
> URL: https://issues.apache.org/jira/browse/HBASE-25383
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.3.4, 2.5.0, 2.4.1
>
>
> Currently, we cannot update the peer-based config even if we change the value 
> of the config.
> Secondly, once the configuration is added, there is no smooth way to remove 
> the peer config.   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work started] (HBASE-25436) Backport HBASE-25383 to branch-1

2020-12-22 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HBASE-25436 started by Sandeep Pal.
---
> Backport HBASE-25383 to branch-1
> 
>
> Key: HBASE-25436
> URL: https://issues.apache.org/jira/browse/HBASE-25436
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Viraj Jasani
>Assignee: Sandeep Pal
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HBASE-25436) Backport HBASE-25383 to branch-1

2020-12-22 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal reassigned HBASE-25436:
---

Assignee: Sandeep Pal

> Backport HBASE-25383 to branch-1
> 
>
> Key: HBASE-25436
> URL: https://issues.apache.org/jira/browse/HBASE-25436
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Viraj Jasani
>Assignee: Sandeep Pal
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25383) HBase doesn't update and remove the peer config from hbase.replication.source.custom.walentryfilters if the config is already set on the peer.

2020-12-14 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-25383:

Description: 
Currently, we cannot update the peer-based config even if we change the value 
of the config.
Secondly, once the configuration is added, there is no smooth way to remove the 
peer config.   
Summary: HBase doesn't update and remove the peer config from 
hbase.replication.source.custom.walentryfilters if the config is already set on 
the peer.   (was: HBase doesn't update the peer config from 
hbase.replication.source.custom.walentryfilters if the config is already set on 
the peer. )

> HBase doesn't update and remove the peer config from 
> hbase.replication.source.custom.walentryfilters if the config is already set 
> on the peer. 
> ---
>
> Key: HBASE-25383
> URL: https://issues.apache.org/jira/browse/HBASE-25383
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
>
> Currently, we cannot update the peer-based config even if we change the value 
> of the config.
> Secondly, once the configuration is added, there is no smooth way to remove 
> the peer config.   
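For illustration only, assuming an HBase 2.x style Admin API (the actual fix in this issue may differ): updating the value of hbase.replication.source.custom.walentryfilters on an existing peer would look roughly like the sketch below; as the issue notes, there is no smooth way to remove the key once it has been set.

{code:java}
// "1" and org.example.MyWalEntryFilter are example values, not taken from the issue.
ReplicationPeerConfig current = admin.getReplicationPeerConfig("1");
ReplicationPeerConfig updated = ReplicationPeerConfig.newBuilder(current)
    .putConfiguration("hbase.replication.source.custom.walentryfilters",
        "org.example.MyWalEntryFilter")
    .build();
admin.updateReplicationPeerConfig("1", updated);
{code}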



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25383) HBase doesn't update the peer config from hbase.replication.source.custom.walentryfilters if the config is already set on the peer.

2020-12-10 Thread Sandeep Pal (Jira)
Sandeep Pal created HBASE-25383:
---

 Summary: HBase doesn't update the peer config from 
hbase.replication.source.custom.walentryfilters if the config is already set on 
the peer. 
 Key: HBASE-25383
 URL: https://issues.apache.org/jira/browse/HBASE-25383
 Project: HBase
  Issue Type: Bug
Reporter: Sandeep Pal
Assignee: Sandeep Pal






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24859) Optimize in-memory representation of mapreduce TableSplit objects

2020-10-30 Thread Sandeep Pal (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17223865#comment-17223865
 ] 

Sandeep Pal commented on HBASE-24859:
-

[~apurtell] I raised a PR for master as well 
https://github.com/apache/hbase/pull/2609. Sorry, didn't know about the policy 
that master goes before other branches. Will take care of that for PRs in 
the future. 

> Optimize in-memory representation of mapreduce TableSplit objects
> -
>
> Key: HBASE-24859
> URL: https://issues.apache.org/jira/browse/HBASE-24859
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Affects Versions: 3.0.0-alpha-1, 2.3.3, 1.7.0, 2.4.0, 2.2.7
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
> Fix For: 1.7.0
>
> Attachments: Screen Shot 2020-08-26 at 8.44.34 AM.png, hbase-24859.png
>
>
> It has been observed that when the table has too many regions, MR jobs 
> consume a lot of memory in the client. This is because we keep the region 
> level information in memory and the memory heavy object is TableSplit because 
> of the Scan object as a part of it.
> However, it looks like the TableInputFormat for single table doesn't need to 
> store the scan object in the TableSplit because we do not use it and all the 
> splits are expected to have the exact same scan object. In TableInputFormat 
> we use the scan object directly from the MR conf.
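A hedged sketch of the single-table idea (the surrounding record-reader context and the conf variable are assumed; TableInputFormat.SCAN and TableMapReduceUtil.convertStringToScan are existing HBase names): the Scan is rebuilt once from the job configuration instead of being serialized into every TableSplit.

{code:java}
// Assumed context: inside the input format / record reader, with access to the job conf.
String serializedScan = conf.get(TableInputFormat.SCAN);
Scan scan = TableMapReduceUtil.convertStringToScan(serializedScan);
// Every split uses this one Scan, so the splits themselves no longer need to carry it.
{code}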



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25226) Optimize in-memory representation for HBase map reduce table splits for MultiTableInputFormat

2020-10-28 Thread Sandeep Pal (Jira)
Sandeep Pal created HBASE-25226:
---

 Summary: Optimize in-memory representation for HBase map reduce 
table splits for MultiTableInputFormat
 Key: HBASE-25226
 URL: https://issues.apache.org/jira/browse/HBASE-25226
 Project: HBase
  Issue Type: Improvement
Reporter: Sandeep Pal
Assignee: Sandeep Pal


It has been observed that when the table has too many regions, MR jobs consume 
a lot of memory in the client. This is because we keep the region level 
information in memory and the memory heavy object is TableSplit because of the 
Scan object as a part of it.

There is a jira 
[HBASE-24859|https://issues.apache.org/jira/projects/HBASE/issues/HBASE-24859] 
which fixes this for the single-table TableInputFormat, because we do not use the 
scan object from TableSplit in this case. 
However, it looks like we can do some optimization in the case of 
MultiTableInputFormat as well, since each split is not required to carry the 
memory-heavy scan object.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work started] (HBASE-24859) Optimize in-memory representation of mapreduce TableSplit objects

2020-10-27 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HBASE-24859 started by Sandeep Pal.
---
> Optimize in-memory representation of mapreduce TableSplit objects
> -
>
> Key: HBASE-24859
> URL: https://issues.apache.org/jira/browse/HBASE-24859
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
> Attachments: Screen Shot 2020-08-26 at 8.44.34 AM.png, hbase-24859.png
>
>
> It has been observed that when the table has too many regions, MR jobs 
> consume a lot of memory in the client. This is because we keep the region 
> level information in memory and the memory heavy object is TableSplit because 
> of the Scan object as a part of it.
> However, it looks like the TableInputFormat for single table doesn't need to 
> store the scan object in the TableSplit because we do not use it and all the 
> splits are expected to have the exact same scan object. In TableInputFormat 
> we use the scan object directly from the MR conf.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24859) Optimize in-memory representation for HBase map reduce table splits

2020-10-27 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-24859:

Summary: Optimize in-memory representation for HBase map reduce table 
splits  (was: Improve the storage cost for HBase map reduce table splits)

> Optimize in-memory representation for HBase map reduce table splits
> ---
>
> Key: HBASE-24859
> URL: https://issues.apache.org/jira/browse/HBASE-24859
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
> Attachments: Screen Shot 2020-08-26 at 8.44.34 AM.png, hbase-24859.png
>
>
> It has been observed that when the table has too many regions, MR jobs 
> consume a lot of memory in the client. This is because we keep the region 
> level information in memory and the memory heavy object is TableSplit because 
> of the Scan object as a part of it.
> However, it looks like the TableInputFormat for single table doesn't need to 
> store the scan object in the TableSplit because we do not use it and all the 
> splits are expected to have the exact same scan object. In TableInputFormat 
> we use the scan object directly from the MR conf.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24859) Improve the storage cost for HBase map reduce table splits

2020-10-27 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-24859:

Description: 
It has been observed that when the table has too many regions, MR jobs consume 
a lot of memory in the client. This is because we keep the region level 
information in memory and the memory heavy object is TableSplit because of the 
Scan object as a part of it.

However, it looks like the TableInputFormat for single table doesn't need to 
store the scan object in the TableSplit because we do not use it and all the 
splits are expected to have the exact same scan object. In TableInputFormat we 
use the scan object directly from the MR conf.

  was:
It has been observed that when the table has too many regions, MR jobs consume 
more memory in the client. This is because we keep the region level information 
in memory and the memory heavy object is TableSplit because of the Scan object 
as a part of it.
We can optimize the memory consumption by not loading the region level 
information if the region is empty based on the configuration.
The default configuration can lead to all TableSplits in memory (no change from 
the current), but the configuration can enable the map-reduce job to ignore the 
empty regions. The configuration can be a part of MR job based. 



> Improve the storage cost for HBase map reduce table splits
> --
>
> Key: HBASE-24859
> URL: https://issues.apache.org/jira/browse/HBASE-24859
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
> Attachments: Screen Shot 2020-08-26 at 8.44.34 AM.png, hbase-24859.png
>
>
> It has been observed that when the table has too many regions, MR jobs 
> consume a lot of memory in the client. This is because we keep the region 
> level information in memory and the memory heavy object is TableSplit because 
> of the Scan object as a part of it.
> However, it looks like the TableInputFormat for single table doesn't need to 
> store the scan object in the TableSplit because we do not use it and all the 
> splits are expected to have the exact same scan object. In TableInputFormat 
> we use the scan object directly from the MR conf.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24859) Improve the storage cost for HBase map reduce table splits

2020-10-27 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-24859:

Summary: Improve the storage cost for HBase map reduce table splits  (was: 
Remove the empty regions from the hbase mapreduce splits)

> Improve the storage cost for HBase map reduce table splits
> --
>
> Key: HBASE-24859
> URL: https://issues.apache.org/jira/browse/HBASE-24859
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
> Attachments: Screen Shot 2020-08-26 at 8.44.34 AM.png, hbase-24859.png
>
>
> It has been observed that when the table has too many regions, MR jobs 
> consume more memory in the client. This is because we keep the region level 
> information in memory and the memory heavy object is TableSplit because of 
> the Scan object as a part of it.
> We can optimize the memory consumption by not loading the region level 
> information if the region is empty based on the configuration.
> The default configuration can lead to all TableSplits in memory (no change 
> from the current), but the configuration can enable the map-reduce job to 
> ignore the empty regions. The configuration can be a part of MR job based. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25222) Add a cost function to move the daughter regions of a recent split to different region servers

2020-10-27 Thread Sandeep Pal (Jira)
Sandeep Pal created HBASE-25222:
---

 Summary: Add a cost function to move the daughter regions of a 
recent split to different region servers 
 Key: HBASE-25222
 URL: https://issues.apache.org/jira/browse/HBASE-25222
 Project: HBase
  Issue Type: Improvement
Reporter: Sandeep Pal
Assignee: Sandeep Pal


In HBase, hotspot regions are easily formed whenever there is key skew and a 
high write volume. A few regions grow really fast, which also makes the few 
region servers hosting them a bottleneck. 
It would be beneficial to add a cost function that moves the daughter regions 
after a split to different region servers. This way, writes to a hot key range 
will be distributed across multiple region servers. 
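A purely illustrative sketch of what such a cost term could look like (SplitPair, recentSplits and serverOf() are assumptions, not the actual balancer API): penalize placements where both daughters of a recent split land on the same region server.

{code:java}
// Pseudo-cost: for each recently split parent, add a penalty when both daughter
// regions are hosted by the same region server, then normalize to [0, 1].
double cost = 0;
for (SplitPair pair : recentSplits) {
  if (serverOf(pair.daughterA).equals(serverOf(pair.daughterB))) {
    cost += 1.0;   // co-located daughters keep the hot key range on a single server
  }
}
return recentSplits.isEmpty() ? 0 : cost / recentSplits.size();
{code}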



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25193) Add support for row prefix and type in the WAL Pretty Printer

2020-10-21 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-25193:

Description: 
Currently, the WAL Pretty Printer has an option to filter the keys with an 
exact match of row. However, it is super useful sometimes to have a row key 
prefix instead of an exact match.

The prefix can act as a full match filter as well due to the nature of the 
prefix.

Secondly, the WAL Pretty Printer does not print the cell type in any of 
the branches. 
Lastly, the rowkey-only option prints additional information as well. 

  was:
Currently, the WAL Pretty Printer has an option to filter the keys with an 
exact match of row. However, it is super useful sometimes to have a row key 
prefix instead of an exact match.

The prefix can act as a full match filter as well due to the nature of the 
prefix.

Additionally, we are not having the cell type in the WAL Pretty Printer in any 
of the branches. 


> Add support for row prefix and type in the WAL Pretty Printer
> -
>
> Key: HBASE-25193
> URL: https://issues.apache.org/jira/browse/HBASE-25193
> Project: HBase
>  Issue Type: Improvement
>  Components: wal
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Minor
>
> Currently, the WAL Pretty Printer has an option to filter the keys with an 
> exact match of row. However, it is super useful sometimes to have a row key 
> prefix instead of an exact match.
> The prefix can act as a full match filter as well due to the nature of the 
> prefix.
> Secondly, the WAL Pretty Printer does not print the cell type in any of 
> the branches. 
> Lastly, the rowkey-only option prints additional information as well. 
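A small illustrative check of why a prefix filter subsumes the exact-match case (Bytes.startsWith and CellUtil.cloneRow are existing HBase utility methods; rowPrefix is the assumed option value):

{code:java}
byte[] row = CellUtil.cloneRow(cell);
// A full row key is trivially its own prefix, so passing the complete key as the
// prefix behaves exactly like the existing exact-match row filter.
boolean matches = Bytes.startsWith(row, Bytes.toBytes(rowPrefix));
{code}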



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25193) Add support for row prefix and type in the WAL Pretty Printer and some minor fixes

2020-10-21 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-25193:

Summary: Add support for row prefix and type in the WAL Pretty Printer and 
some minor fixes  (was: Add support for row prefix and type in the WAL Pretty 
Printer)

> Add support for row prefix and type in the WAL Pretty Printer and some minor 
> fixes
> --
>
> Key: HBASE-25193
> URL: https://issues.apache.org/jira/browse/HBASE-25193
> Project: HBase
>  Issue Type: Improvement
>  Components: wal
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Minor
>
> Currently, the WAL Pretty Printer has an option to filter the keys with an 
> exact match of row. However, it is super useful sometimes to have a row key 
> prefix instead of an exact match.
> The prefix can act as a full match filter as well due to the nature of the 
> prefix.
> Secondly, the WAL Pretty Printer does not print the cell type in any of 
> the branches. 
> Lastly, the rowkey-only option prints additional information as well. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25193) Add support for row prefix and type in the WAL Pretty Printer

2020-10-19 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-25193:

Description: 
Currently, the WAL Pretty Printer has an option to filter the keys with an 
exact match of row. However, it is super useful sometimes to have a row key 
prefix instead of an exact match.

The prefix can act as a full match filter as well due to the nature of the 
prefix.

Additionally, the WAL Pretty Printer does not print the cell type in any 
of the branches. 

  was:
Currently, the WAL Pretty Printer has an option to filter the keys with an 
exact match of row. However, it is super useful sometimes to have a row key 
prefix instead of an exact match.

The prefix can act as a full match filter as well due to the nature of the 
prefix.


> Add support for row prefix and type in the WAL Pretty Printer
> -
>
> Key: HBASE-25193
> URL: https://issues.apache.org/jira/browse/HBASE-25193
> Project: HBase
>  Issue Type: Improvement
>  Components: wal
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Minor
>
> Currently, the WAL Pretty Printer has an option to filter the keys with an 
> exact match of row. However, it is super useful sometimes to have a row key 
> prefix instead of an exact match.
> The prefix can act as a full match filter as well due to the nature of the 
> prefix.
> Additionally, the WAL Pretty Printer does not print the cell type in any 
> of the branches. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25193) Add support for row prefix and type in the WAL Pretty Printer

2020-10-19 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-25193:

Summary: Add support for row prefix and type in the WAL Pretty Printer  
(was: Add support for row prefix in the WAL Pretty Printer)

> Add support for row prefix and type in the WAL Pretty Printer
> -
>
> Key: HBASE-25193
> URL: https://issues.apache.org/jira/browse/HBASE-25193
> Project: HBase
>  Issue Type: Improvement
>  Components: wal
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Minor
>
> Currently, the WAL Pretty Printer has an option to filter the keys with an 
> exact match of row. However, it is super useful sometimes to have a row key 
> prefix instead of an exact match.
> The prefix can act as a full match filter as well due to the nature of the 
> prefix.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25193) Add support for row prefix in the WAL Pretty Printer

2020-10-17 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-25193:

Description: 
Currently, the WAL Pretty Printer has an option to filter the keys with an 
exact match of row. However, it is super useful sometimes to have a row key 
prefix instead of an exact match.

The prefix can act as a full match filter as well due to the nature of the 
prefix.

  was:
Current the WAL Pretty Printer has an option to filter the keys with exact 
match. However, it is super useful sometimes to have a row key prefix instead 
of exact match.

The prefix can act as full match filter as well due to the nature of prefix. 


> Add support for row prefix in the WAL Pretty Printer
> 
>
> Key: HBASE-25193
> URL: https://issues.apache.org/jira/browse/HBASE-25193
> Project: HBase
>  Issue Type: Improvement
>  Components: wal
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Minor
>
> Currently, the WAL Pretty Printer has an option to filter the keys with an 
> exact match of row. However, it is super useful sometimes to have a row key 
> prefix instead of an exact match.
> The prefix can act as a full match filter as well due to the nature of the 
> prefix.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25193) Add support for row prefix in the WAL Pretty Printer

2020-10-17 Thread Sandeep Pal (Jira)
Sandeep Pal created HBASE-25193:
---

 Summary: Add support for row prefix in the WAL Pretty Printer
 Key: HBASE-25193
 URL: https://issues.apache.org/jira/browse/HBASE-25193
 Project: HBase
  Issue Type: Improvement
  Components: wal
Reporter: Sandeep Pal
Assignee: Sandeep Pal


Currently, the WAL Pretty Printer has an option to filter the keys with an exact 
match of row. However, it is super useful sometimes to have a row key prefix 
instead of an exact match.

The prefix can act as a full match filter as well due to the nature of the prefix. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HBASE-24974) Provide a flexibility to print only row key and filter for multiple tables in the WALPrettyPrinter

2020-09-12 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal resolved HBASE-24974.
-
Fix Version/s: 2.6.7
   1.7.0
   3.0.0-alpha-1
   Resolution: Fixed

> Provide a flexibility to print only row key and filter for multiple tables in 
> the WALPrettyPrinter
> --
>
> Key: HBASE-24974
> URL: https://issues.apache.org/jira/browse/HBASE-24974
> Project: HBase
>  Issue Type: Improvement
>  Components: wal
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Minor
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.6.7
>
>
> Currently, 
> [WALPrettyPrinter|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALPrettyPrinter.java]
>  provides an option to ignore the values in the output, but it prints the 
> whole cell and has no option to ignore some information from the cell. 
> Sometimes, the user may only need the row keys from WAL and it may reduce the 
> size of output from WALPrettyPrinter significantly. 
> We should provide flexibility to output only rowkey from the cell. 
> In addition we should increase the flexibility for providing multiple tables 
> in the table filter. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work started] (HBASE-24974) Provide a flexibility to print only row key and filter for multiple tables in the WALPrettyPrinter

2020-09-12 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HBASE-24974 started by Sandeep Pal.
---
> Provide a flexibility to print only row key and filter for multiple tables in 
> the WALPrettyPrinter
> --
>
> Key: HBASE-24974
> URL: https://issues.apache.org/jira/browse/HBASE-24974
> Project: HBase
>  Issue Type: Improvement
>  Components: wal
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Minor
>
> Currently, 
> [WALPrettyPrinter|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALPrettyPrinter.java]
>  provides an option to ignore the values in the output, but it prints the 
> whole cell and has no option to ignore some information from the cell. 
> Sometimes, the user may only need the row keys from WAL and it may reduce the 
> size of output from WALPrettyPrinter significantly. 
> We should provide flexibility to output only rowkey from the cell. 
> In addition we should increase the flexibility for providing multiple tables 
> in the table filter. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24974) Provide a flexibility to print only row key and filter for multiple tables in the WALPrettyPrinter

2020-09-02 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-24974:

Description: 
Currently, 
[WALPrettyPrinter|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALPrettyPrinter.java]
 provides an option to ignore the values in the output, but it prints the whole 
cell and has no option to ignore some information from the cell. Sometimes, the 
user may only need the row keys from WAL and it may reduce the size of output 
from WALPrettyPrinter significantly. 

We should provide flexibility to output only rowkey from the cell. 

In addition we should increase the flexibility for providing multiple tables in 
the table filter. 

  was:
Currently, 
[WALPrettyPrinter|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALPrettyPrinter.java]
 provides an option to ignore the values in the output, but it prints the whole 
cell and has no option to ignore some information from the cell. Sometimes, the 
user may only need the row keys from WAL and it may reduce the size of output 
from WALPrettyPrinter significantly. 

We should provide flexibility to output only rowkey from the cell. 

Summary: Provide a flexibility to print only row key and filter for 
multiple tables in the WALPrettyPrinter  (was: Provide a flexibility to print 
only row key in the WALPrettyPrinter)

> Provide a flexibility to print only row key and filter for multiple tables in 
> the WALPrettyPrinter
> --
>
> Key: HBASE-24974
> URL: https://issues.apache.org/jira/browse/HBASE-24974
> Project: HBase
>  Issue Type: Improvement
>  Components: wal
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Minor
>
> Currently, 
> [WALPrettyPrinter|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALPrettyPrinter.java]
>  provides an option to ignore the values in the output, but it prints the 
> whole cell and has no option to ignore some information from the cell. 
> Sometimes, the user may only need the row keys from WAL and it may reduce the 
> size of output from WALPrettyPrinter significantly. 
> We should provide flexibility to output only rowkey from the cell. 
> In addition we should increase the flexibility for providing multiple tables 
> in the table filter. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24974) Provide a flexibility to print only row key in the WALPrettyPrinter

2020-09-01 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-24974:

Description: 
Currently, 
[WALPrettyPrinter|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALPrettyPrinter.java]
 provides an option to ignore the values in the output, but it prints the whole 
cell and has no option to ignore some information from the cell. Sometimes, the 
user may only need the row keys from WAL and it may reduce the size of output 
from WALPrettyPrinter significantly. 

We should provide flexibility to output only rowkey from the cell. 

  was:
Currently, 
[WALPrettyPrinter|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALPrettyPrinter.java]
 provides an option to ignore the values in the output, but it prints the whole 
cell. Sometimes, the user may only need the row keys from WAL and it may reduce 
the size of output from WALPrettyPrinter significantly. 

We should provide flexibility to output only rowkey from the cell. 


> Provide a flexibility to print only row key in the WALPrettyPrinter
> ---
>
> Key: HBASE-24974
> URL: https://issues.apache.org/jira/browse/HBASE-24974
> Project: HBase
>  Issue Type: Improvement
>  Components: wal
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Minor
>
> Currently, 
> [WALPrettyPrinter|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALPrettyPrinter.java]
>  provides an option to ignore the values in the output, but it prints the 
> whole cell and has no option to ignore some information from the cell. 
> Sometimes, the user may only need the row keys from WAL and it may reduce the 
> size of output from WALPrettyPrinter significantly. 
> We should provide flexibility to output only rowkey from the cell. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-24974) Provide a flexibility to print only row key in the WALPrettyPrinter

2020-09-01 Thread Sandeep Pal (Jira)
Sandeep Pal created HBASE-24974:
---

 Summary: Provide a flexibility to print only row key in the 
WALPrettyPrinter
 Key: HBASE-24974
 URL: https://issues.apache.org/jira/browse/HBASE-24974
 Project: HBase
  Issue Type: Improvement
  Components: wal
Reporter: Sandeep Pal
Assignee: Sandeep Pal


Currently, 
[WALPrettyPrinter|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALPrettyPrinter.java]
 provides an option to ignore the values in the output, but it prints the whole 
cell. Sometimes, the user may only need the row keys from WAL and it may reduce 
the size of output from WALPrettyPrinter significantly. 

We should provide flexibility to output only rowkey from the cell. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24859) Remove the empty regions from the hbase mapreduce splits

2020-08-26 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-24859:

Attachment: (was: screenshot-1.png)

> Remove the empty regions from the hbase mapreduce splits
> 
>
> Key: HBASE-24859
> URL: https://issues.apache.org/jira/browse/HBASE-24859
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
> Attachments: Screen Shot 2020-08-26 at 8.44.34 AM.png, hbase-24859.png
>
>
> It has been observed that when the table has too many regions, MR jobs 
> consume more memory in the client. This is because we keep the region level 
> information in memory and the memory heavy object is TableSplit because of 
> the Scan object as a part of it.
> We can optimize the memory consumption by not loading the region level 
> information if the region is empty based on the configuration.
> The default configuration can lead to all TableSplits in memory (no change 
> from the current), but the configuration can enable the map-reduce job to 
> ignore the empty regions. The configuration can be set per MR job. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-24859) Remove the empty regions from the hbase mapreduce splits

2020-08-26 Thread Sandeep Pal (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17185280#comment-17185280
 ] 

Sandeep Pal edited comment on HBASE-24859 at 8/26/20, 5:49 PM:
---

[~bharathv] [~shahrs87] [~apurtell]
The heap is predominantly occupied by TableSplit and especially the 
[scan|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableSplit.java#L87]
 within the TableSplit.

 !Screen Shot 2020-08-26 at 8.44.34 AM.png! 


was (Author: sandeep.pal):
[~bharathv] [~shahrs87] [~apurtell]
The heap is predominantly occupied by TableSplit and especially the 
[scan|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableSplit.java#L87]
 within the TableSplit.



> Remove the empty regions from the hbase mapreduce splits
> 
>
> Key: HBASE-24859
> URL: https://issues.apache.org/jira/browse/HBASE-24859
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
> Attachments: Screen Shot 2020-08-26 at 8.44.34 AM.png, 
> hbase-24859.png, screenshot-1.png
>
>
> It has been observed that when the table has too many regions, MR jobs 
> consume more memory in the client. This is because we keep the region level 
> information in memory and the memory heavy object is TableSplit because of 
> the Scan object as a part of it.
> We can optimize the memory consumption by not loading the region level 
> information if the region is empty based on the configuration.
> The default configuration can lead to all TableSplits in memory (no change 
> from the current), but the configuration can enable the map-reduce job to 
> ignore the empty regions. The configuration can be set per MR job. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-24859) Remove the empty regions from the hbase mapreduce splits

2020-08-26 Thread Sandeep Pal (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17185280#comment-17185280
 ] 

Sandeep Pal edited comment on HBASE-24859 at 8/26/20, 5:49 PM:
---

[~bharathv] [~shahrs87] [~apurtell]
The heap is predominantly occupied by TableSplit and especially the 
[scan|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableSplit.java#L87]
 within the TableSplit.




was (Author: sandeep.pal):
[~bharathv] [~shahrs87] [~apurtell]
The heap is predominantly occupied by TableSplit and especially the 
[scan|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableSplit.java#L87]
 within the TableSplit.

 !screenshot-1.png! 

> Remove the empty regions from the hbase mapreduce splits
> 
>
> Key: HBASE-24859
> URL: https://issues.apache.org/jira/browse/HBASE-24859
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
> Attachments: Screen Shot 2020-08-26 at 8.44.34 AM.png, 
> hbase-24859.png, screenshot-1.png
>
>
> It has been observed that when the table has too many regions, MR jobs 
> consume more memory in the client. This is because we keep the region level 
> information in memory and the memory heavy object is TableSplit because of 
> the Scan object as a part of it.
> We can optimize the memory consumption by not loading the region level 
> information if the region is empty based on the configuration.
> The default configuration can lead to all TableSplits in memory (no change 
> from the current), but the configuration can enable the map-reduce job to 
> ignore the empty regions. The configuration can be set per MR job. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-24859) Remove the empty regions from the hbase mapreduce splits

2020-08-26 Thread Sandeep Pal (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17185280#comment-17185280
 ] 

Sandeep Pal edited comment on HBASE-24859 at 8/26/20, 4:00 PM:
---

[~bharathv] [~shahrs87] [~apurtell]
The heap is predominantly occupied by TableSplit and especially the 
[scan|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableSplit.java#L87]
 within the TableSplit.

 !screenshot-1.png! 


was (Author: sandeep.pal):
[~bharathv][~shahrs87]
The heap is predominantly occupied by TableSplit and especially the 
[scan|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableSplit.java#L87]
 within the TableSplit.

 !screenshot-1.png! 

> Remove the empty regions from the hbase mapreduce splits
> 
>
> Key: HBASE-24859
> URL: https://issues.apache.org/jira/browse/HBASE-24859
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
> Attachments: hbase-24859.png, screenshot-1.png
>
>
> It has been observed that when the table has too many regions, MR jobs 
> consume more memory in the client. This is because we keep the region level 
> information in memory and the memory heavy object is TableSplit because of 
> the Scan object as a part of it.
> We can optimize the memory consumption by not loading the region level 
> information if the region is empty based on the configuration.
> The default configuration can lead to all TableSplits in memory (no change 
> from the current), but the configuration can enable the map-reduce job to 
> ignore the empty regions. The configuration can be set per MR job. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24859) Remove the empty regions from the hbase mapreduce splits

2020-08-26 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-24859:

Description: 
It has been observed that when the table has too many regions, MR jobs consume 
more memory in the client. This is because we keep the region level information 
in memory and the memory heavy object is TableSplit because of the Scan object 
as a part of it.
We can optimize the memory consumption by not loading the region level 
information if the region is empty based on the configuration.
The default configuration can lead to all TableSplits in memory (no change from 
the current), but the configuration can enable the map-reduce job to ignore the 
empty regions. The configuration can be set per MR job. 


  was:
It has been observed that when the table has too many regions, MR jobs consume 
more memory in the client. This is because we keep the region level information 
in memory and the memory heavy object is TableSplit because of Scan object as a 
part of it.
We can optimize the memory consumption by not loading the region level 
information if the region is empty. 


> Remove the empty regions from the hbase mapreduce splits
> 
>
> Key: HBASE-24859
> URL: https://issues.apache.org/jira/browse/HBASE-24859
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
> Attachments: hbase-24859.png, screenshot-1.png
>
>
> It has been observed that when the table has too many regions, MR jobs 
> consume more memory in the client. This is because we keep the region level 
> information in memory and the memory heavy object is TableSplit because of 
> the Scan object as a part of it.
> We can optimize the memory consumption by not loading the region level 
> information if the region is empty based on the configuration.
> The default configuration can lead to all TableSplits in memory (no change 
> from the current), but the configuration can enable the map-reduce job to 
> ignore the empty regions. The configuration can be set per MR job. 
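As a minimal illustration of the idea (assumed names: the configuration key below is hypothetical, and the split length is taken as the region size estimate that the input format records), a post-filter over the computed splits might look like this:

{code:java}
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.mapreduce.TableSplit;
import org.apache.hadoop.mapreduce.InputSplit;

public class EmptyRegionSplitFilter {
  /** Hypothetical per-job flag; the real change may use a different key. */
  static final String SKIP_EMPTY_REGIONS = "hbase.mapreduce.splits.skip.empty.regions";

  /** Drops TableSplits whose region size estimate is zero when the flag is enabled. */
  static List<InputSplit> filter(Configuration conf, List<InputSplit> splits) {
    if (!conf.getBoolean(SKIP_EMPTY_REGIONS, false)) {
      return splits; // default: keep a split per region, unchanged behaviour
    }
    List<InputSplit> kept = new ArrayList<>();
    for (InputSplit split : splits) {
      // TableSplit#getLength() carries the region size estimate; 0 means an empty region.
      if (split instanceof TableSplit && ((TableSplit) split).getLength() == 0) {
        continue;
      }
      kept.add(split);
    }
    return kept;
  }
}
{code}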



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24859) Remove the empty regions from the hbase mapreduce splits

2020-08-26 Thread Sandeep Pal (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17185280#comment-17185280
 ] 

Sandeep Pal commented on HBASE-24859:
-

[~bharathv][~shahrs87]
The heap is predominantly occupied by TableSplit and especially the 
[scan|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableSplit.java#L87]
 within the TableSplit.

 !screenshot-1.png! 

> Remove the empty regions from the hbase mapreduce splits
> 
>
> Key: HBASE-24859
> URL: https://issues.apache.org/jira/browse/HBASE-24859
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
> Attachments: hbase-24859.png, screenshot-1.png
>
>
> It has been observed that when the table has too many regions, MR jobs 
> consume more memory in the client. This is because we keep the region level 
> information in memory and the memory heavy object is TableSplit because of 
> Scan object as a part of it.
> We can optimize the memory consumption by not loading the region level 
> information if the region is empty. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24859) Remove the empty regions from the hbase mapreduce splits

2020-08-26 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-24859:

Attachment: screenshot-1.png

> Remove the empty regions from the hbase mapreduce splits
> 
>
> Key: HBASE-24859
> URL: https://issues.apache.org/jira/browse/HBASE-24859
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
> Attachments: hbase-24859.png, screenshot-1.png
>
>
> It has been observed that when the table has too many regions, MR jobs 
> consume more memory in the client. This is because we keep the region level 
> information in memory and the memory heavy object is TableSplit because of 
> Scan object as a part of it.
> We can optimize the memory consumption by not loading the region level 
> information if the region is empty. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Issue Comment Deleted] (HBASE-24859) Remove the empty regions from the hbase mapreduce splits

2020-08-24 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-24859:

Comment: was deleted

(was: Closing this since it may lead to data being missed if new data is received 
after the table splits information has been passed to MR. )

> Remove the empty regions from the hbase mapreduce splits
> 
>
> Key: HBASE-24859
> URL: https://issues.apache.org/jira/browse/HBASE-24859
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
>
> It has been observed that when the table has too many regions, MR jobs 
> consume more memory in the client. This is because we keep the region level 
> information in memory and the memory heavy object is TableSplit because of 
> Scan object as a part of it.
> We can optimize the memory consumption by not loading the region level 
> information if the region is empty. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HBASE-24859) Remove the empty regions from the hbase mapreduce splits

2020-08-24 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal reopened HBASE-24859:
-

> Remove the empty regions from the hbase mapreduce splits
> 
>
> Key: HBASE-24859
> URL: https://issues.apache.org/jira/browse/HBASE-24859
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
>
> It has been observed that when the table has too many regions, MR jobs 
> consume more memory in the client. This is because we keep the region level 
> information in memory and the memory heavy object is TableSplit because of 
> Scan object as a part of it.
> We can optimize the memory consumption by not loading the region level 
> information if the region is empty. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HBASE-24859) Remove the empty regions from the hbase mapreduce splits

2020-08-11 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal resolved HBASE-24859.
-
Resolution: Abandoned

Closing this since it may lead to data being missed if new data is received after 
the table splits information has been passed to MR. 

> Remove the empty regions from the hbase mapreduce splits
> 
>
> Key: HBASE-24859
> URL: https://issues.apache.org/jira/browse/HBASE-24859
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
>
> It has been observed that when the table has too many regions, MR jobs 
> consume more memory in the client. This is because we keep the region level 
> information in memory and the memory heavy object is TableSplit because of 
> Scan object as a part of it.
> We can optimize the memory consumption by not loading the region level 
> information if the region is empty. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-24859) Remove the empty regions from the hbase mapreduce splits

2020-08-11 Thread Sandeep Pal (Jira)
Sandeep Pal created HBASE-24859:
---

 Summary: Remove the empty regions from the hbase mapreduce splits
 Key: HBASE-24859
 URL: https://issues.apache.org/jira/browse/HBASE-24859
 Project: HBase
  Issue Type: Improvement
  Components: mapreduce
Reporter: Sandeep Pal
Assignee: Sandeep Pal


It has been observed that when the table has too many regions, MR jobs consume 
more memory in the client. This is because we keep the region level information 
in memory and the memory heavy object is TableSplit because of Scan object as a 
part of it.
We can optimize the memory consumption by not loading the region level 
information if the region is empty. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work started] (HBASE-24788) Zookeeper connection leakage on hbase mapreduce jobs

2020-07-28 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HBASE-24788 started by Sandeep Pal.
---
> Zookeeper connection leakage on hbase mapreduce jobs
> 
>
> Key: HBASE-24788
> URL: https://issues.apache.org/jira/browse/HBASE-24788
> Project: HBase
>  Issue Type: Bug
>  Components: mapreduce
>Affects Versions: 1.6.0, master
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
>
> Observed the significant increase in ZK connection on performance testing on 
> map reduce jobs. Turns out the 
> [TableOutputFormat.checkOutputSpecs()|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.java#L182]
>   is not closing the connection it uses to get the hbase admin. It closes the 
> hbase admin but never closes the connection used to get the admin.  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24788) Zookeeper connection leakage on hbase mapreduce jobs

2020-07-28 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-24788:

Description: Observed the significant increase in ZK connection on 
performance testing on map reduce jobs. Turns out the 
[TableOutputFormat.checkOutputSpecs()|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.java#L182]
  is not closing the connection it uses to get the hbase admin. It closes the 
hbase admin but never close the connection to get the admin.(was: Observed 
the significant increase in ZK connection on performance testing on map reduce 
jobs. Turns out the 
[TableOutputFormat|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.java#L182]
 is not not closing the connection it uses to get the hbase admin. It closes 
the hbase admin but never close the connection to get the admin.  )

> Zookeeper connection leakage on hbase mapreduce jobs
> 
>
> Key: HBASE-24788
> URL: https://issues.apache.org/jira/browse/HBASE-24788
> Project: HBase
>  Issue Type: Bug
>  Components: mapreduce
>Affects Versions: 1.6.0, master
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
>
> Observed the significant increase in ZK connection on performance testing on 
> map reduce jobs. Turns out the 
> [TableOutputFormat.checkOutputSpecs()|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.java#L182]
>   is not closing the connection it uses to get the hbase admin. It closes the 
> hbase admin but never closes the connection used to get the admin.  
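For illustration, the shape of fix the report points at (not the actual patch): make checkOutputSpecs() close the Connection it creates along with the Admin, for example with try-with-resources. The table name and error handling below are placeholders:

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class OutputSpecCheck {
  /** Verifies the output table exists, closing both the Admin and the Connection behind it. */
  static void checkOutputTable(Configuration conf, String table) throws IOException {
    // try-with-resources closes the Admin first and then the Connection that produced it,
    // so no ZooKeeper connection is left behind after the check.
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Admin admin = connection.getAdmin()) {
      if (!admin.tableExists(TableName.valueOf(table))) {
        throw new IOException("Output table " + table + " does not exist");
      }
    }
  }
}
{code}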



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24788) Zookeeper connection leakage on hbase mapreduce jobs

2020-07-28 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-24788:

Description: Observed the significant increase in ZK connection on 
performance testing on map reduce jobs. Turns out the 
[TableOutputFormat|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.java#L182]
 is not not closing the connection it uses to get the hbase admin. It closes 
the hbase admin but never close the connection to get the admin.(was: 
Observed the significant increase in ZK connection on performance testing on 
map reduce jobs. Turns out the TableOutputFormat is not not closing the 
connection it uses to get the hbase admin. It closes the hbase admin but never 
close the connection to get the admin.  )

> Zookeeper connection leakage on hbase mapreduce jobs
> 
>
> Key: HBASE-24788
> URL: https://issues.apache.org/jira/browse/HBASE-24788
> Project: HBase
>  Issue Type: Bug
>  Components: mapreduce
>Affects Versions: 1.6.0, master
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
>
> Observed the significant increase in ZK connection on performance testing on 
> map reduce jobs. Turns out the 
> [TableOutputFormat|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.java#L182]
>  is not not closing the connection it uses to get the hbase admin. It closes 
> the hbase admin but never closes the connection used to get the admin.  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-24788) Zookeeper connection leakage on hbase mapreduce jobs

2020-07-28 Thread Sandeep Pal (Jira)
Sandeep Pal created HBASE-24788:
---

 Summary: Zookeeper connection leakage on hbase mapreduce jobs
 Key: HBASE-24788
 URL: https://issues.apache.org/jira/browse/HBASE-24788
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 1.6.0, master
Reporter: Sandeep Pal
Assignee: Sandeep Pal


Observed the significant increase in ZK connection on performance testing on 
map reduce jobs. Turns out the TableOutputFormat is not not closing the 
connection it uses to get the hbase admin. It closes the hbase admin but never 
closes the connection used to get the admin.  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24716) Do the error handling for replication admin failures

2020-07-10 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-24716:

Description: 
[listPeerConfigs()|https://github.com/apache/hbase/blob/branch-1/hbase-client/src/main/java/org/apache/hadoop/hbase/client/replication/ReplicationAdmin.java#L295]
 for getting the list of peers along with their configuration is not a reliable 
API.

It is not very robust to errors, logs FATAL and swallows the 
[exceptions|https://github.com/apache/hbase/blob/branch-1/hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationPeersZKImpl.java#L254]

 

Snippet:


{code:java}
catch (KeeperException e) {
 this.abortable.abort("Cannot get the list of peers ", e);
} catch (ReplicationException e) {
 this.abortable.abort("Cannot get the list of peers ", e);
}
return peers;
{code}


 


The abortable (connection in this case) also doesn't abort the region server 
and just logs. This makes upstream believe that there is nothing wrong and 
proceed without any action which is not good.

 

 
{code:java}
2020-07-07 23:11:37,857 FATAL [14774961,peer_id] client.ConnectionManager$HConnectionImplementation - Cannot get the list of peers
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/replication/peers
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
  at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1549)
  at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:312)
  at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenNoWatch(ZKUtil.java:513)
  at org.apache.hadoop.hbase.replication.ReplicationPeersZKImpl.getAllPeerConfigs(ReplicationPeersZKImpl.java:249)
  at org.apache.hadoop.hbase.client.replication.ReplicationAdmin.listPeerConfigs(ReplicationAdmin.java:332)
{code}
 

 

  was:
[listPeerConfigs()|https://github.com/apache/hbase/blob/branch-1/hbase-client/src/main/java/org/apache/hadoop/hbase/client/replication/ReplicationAdmin.java#L295]
 for getting the list of peers along with their configuration is not a reliable 
API.

It is not very robust to errors, logs FATAL and swallows the 
[exceptions|https://github.com/apache/hbase/blob/branch-1/hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationPeersZKImpl.java#L254]

 

Snippet:

catch (KeeperException e) {
 this.abortable.abort("Cannot get the list of peers ", e);
} catch (ReplicationException e) {
 this.abortable.abort("Cannot get the list of peers ", e);
}
return peers;

 


The abortable (connection in this case) also doesn't abort the region server 
and just logs. This makes upstream believe that there is nothing wrong and 
proceed without any action which is not good.

 

 
{code:java}
2020-07-07 23:11:37,857 FATAL [14774961,peer_id] 
client.ConnectionManager$HConnectionImplementation - Cannot get the list of 
peersorg.apache.zookeeper.KeeperException$SessionExpiredException: 
KeeperErrorCode = Session expired for /hbase/replication/peersat 
org.apache.zookeeper.KeeperException.create(KeeperException.java:130)at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:54)at 
org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1549)at 
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:312)at
 
org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenNoWatch(ZKUtil.java:513)at 
org.apache.hadoop.hbase.replication.ReplicationPeersZKImpl.getAllPeerConfigs(ReplicationPeersZKImpl.java:249)at
 
org.apache.hadoop.hbase.client.replication.ReplicationAdmin.listPeerConfigs(ReplicationAdmin.java:332)
{code}
 

 


> Do the error handling for replication admin failures
> 
>
> Key: HBASE-24716
> URL: https://issues.apache.org/jira/browse/HBASE-24716
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
>
> [listPeerConfigs()|https://github.com/apache/hbase/blob/branch-1/hbase-client/src/main/java/org/apache/hadoop/hbase/client/replication/ReplicationAdmin.java#L295]
>  for getting the list of peers along with their configuration is not a 
> reliable API.
> It is not very robust to errors, logs FATAL and swallows the 
> [exceptions|https://github.com/apache/hbase/blob/branch-1/hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationPeersZKImpl.java#L254]
>  
> Snippet:
> {code:java}
> catch (KeeperException e) {
>  this.abortable.abort("Cannot get the list of peers ", e);
> } catch (ReplicationException e) {
>  this.abortable.abort("Cannot get the list of peers ", e);
> }
> return peers;
> {code}
>  
> The abortable (connection in 

[jira] [Updated] (HBASE-24716) Do the error handling for replication admin failures

2020-07-10 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-24716:

Description: 
[listPeerConfigs()|https://github.com/apache/hbase/blob/branch-1/hbase-client/src/main/java/org/apache/hadoop/hbase/client/replication/ReplicationAdmin.java#L295]
 for getting the list of peers along with their configuration is not a reliable 
API.

It is not very robust to errors, logs FATAL and swallows the 
[exceptions|https://github.com/apache/hbase/blob/branch-1/hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationPeersZKImpl.java#L254]

 

Snippet:

catch (KeeperException e) {
 this.abortable.abort("Cannot get the list of peers ", e);
} catch (ReplicationException e) {
 this.abortable.abort("Cannot get the list of peers ", e);
}
return peers;

 


The abortable (connection in this case) also doesn't abort the region server 
and just logs. This makes upstream believe that there is nothing wrong and 
proceed without any action which is not good.

 

 
{code:java}
2020-07-07 23:11:37,857 FATAL [14774961,peer_id] 
client.ConnectionManager$HConnectionImplementation - Cannot get the list of 
peersorg.apache.zookeeper.KeeperException$SessionExpiredException: 
KeeperErrorCode = Session expired for /hbase/replication/peersat 
org.apache.zookeeper.KeeperException.create(KeeperException.java:130)at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:54)at 
org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1549)at 
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:312)at
 
org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenNoWatch(ZKUtil.java:513)at 
org.apache.hadoop.hbase.replication.ReplicationPeersZKImpl.getAllPeerConfigs(ReplicationPeersZKImpl.java:249)at
 
org.apache.hadoop.hbase.client.replication.ReplicationAdmin.listPeerConfigs(ReplicationAdmin.java:332)
{code}
 

 

  was:
[listPeerConfigs()|https://github.com/apache/hbase/blob/branch-1/hbase-client/src/main/java/org/apache/hadoop/hbase/client/replication/ReplicationAdmin.java#L295]
 for getting the list of peers along with their configuration is not a reliable 
API.

It is not very robust to errors, logs FATAL and swallows the 
[exceptions|[https://github.com/apache/hbase/blob/branch-1/hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationPeersZKImpl.java#L254]

 

Snippet:

catch (KeeperException e) {
 this.abortable.abort("Cannot get the list of peers ", e);
} catch (ReplicationException e) {
 this.abortable.abort("Cannot get the list of peers ", e);
}
return peers;

 


The abortable (connection in this case) also doesn't abort the region server 
and just logs. This makes upstream believe that there is nothing wrong and 
proceed without any action which is not good.

 

 
{code:java}
2020-07-07 23:11:37,857 FATAL [14774961,peer_id] 
client.ConnectionManager$HConnectionImplementation - Cannot get the list of 
peersorg.apache.zookeeper.KeeperException$SessionExpiredException: 
KeeperErrorCode = Session expired for /hbase/replication/peersat 
org.apache.zookeeper.KeeperException.create(KeeperException.java:130)at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:54)at 
org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1549)at 
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:312)at
 
org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenNoWatch(ZKUtil.java:513)at 
org.apache.hadoop.hbase.replication.ReplicationPeersZKImpl.getAllPeerConfigs(ReplicationPeersZKImpl.java:249)at
 
org.apache.hadoop.hbase.client.replication.ReplicationAdmin.listPeerConfigs(ReplicationAdmin.java:332)
{code}
 

 


> Do the error handling for replication admin failures
> 
>
> Key: HBASE-24716
> URL: https://issues.apache.org/jira/browse/HBASE-24716
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
>
> [listPeerConfigs()|https://github.com/apache/hbase/blob/branch-1/hbase-client/src/main/java/org/apache/hadoop/hbase/client/replication/ReplicationAdmin.java#L295]
>  for getting the list of peers along with their configuration is not a 
> reliable API.
> It is not very robust to errors, logs FATAL and swallows the 
> [exceptions|https://github.com/apache/hbase/blob/branch-1/hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationPeersZKImpl.java#L254]
>  
> Snippet:
> catch (KeeperException e) {
>  this.abortable.abort("Cannot get the list of peers ", e);
> } catch (ReplicationException e) {
>  this.abortable.abort("Cannot get the list of peers ", e);
> }
> return peers;
>  
> The abortable (connection in this case) also doesn't abort the region 

[jira] [Updated] (HBASE-24716) Do the error handling for replication admin failures

2020-07-10 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-24716:

Description: 
[listPeerConfigs()|https://github.com/apache/hbase/blob/branch-1/hbase-client/src/main/java/org/apache/hadoop/hbase/client/replication/ReplicationAdmin.java#L295]
 for getting the list of peers along with their configuration is not a reliable 
API.

It is not very robust to errors, logs FATAL and swallows the 
[exceptions|[https://github.com/apache/hbase/blob/branch-1/hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationPeersZKImpl.java#L254]

 

Snippet:

catch (KeeperException e) {
 this.abortable.abort("Cannot get the list of peers ", e);
} catch (ReplicationException e) {
 this.abortable.abort("Cannot get the list of peers ", e);
}
return peers;

 


The abortable (connection in this case) also doesn't abort the region server 
and just logs. This makes upstream believe that there is nothing wrong and 
proceed without any action which is not good.

 

 
{code:java}
2020-07-07 23:11:37,857 FATAL [14774961,peer_id] 
client.ConnectionManager$HConnectionImplementation - Cannot get the list of 
peersorg.apache.zookeeper.KeeperException$SessionExpiredException: 
KeeperErrorCode = Session expired for /hbase/replication/peersat 
org.apache.zookeeper.KeeperException.create(KeeperException.java:130)at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:54)at 
org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1549)at 
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:312)at
 
org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenNoWatch(ZKUtil.java:513)at 
org.apache.hadoop.hbase.replication.ReplicationPeersZKImpl.getAllPeerConfigs(ReplicationPeersZKImpl.java:249)at
 
org.apache.hadoop.hbase.client.replication.ReplicationAdmin.listPeerConfigs(ReplicationAdmin.java:332)
{code}
 

 

  was:
[listPeerConfigs()|[https://git.soma.salesforce.com/bigdata-packaging/hbase/blob/1.6.0-sfdc-1/hbase-client/src/main/java/org/apache/hadoop/hbase/client/replication/ReplicationAdmin.java#L295]]
 for getting the list of peers along with their configuration is not a reliable 
API.

It is not very robust to errors, logs FATAL and swallows the 
[exceptions|[https://github.com/apache/hbase/blob/branch-1/hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationPeersZKImpl.java#L254]]
 

 

Snippet:

catch (KeeperException e) {
 this.abortable.abort("Cannot get the list of peers ", e);
} catch (ReplicationException e) {
 this.abortable.abort("Cannot get the list of peers ", e);
}
return peers;

 


The abortable (connection in this case) also doesn't abort the region server 
and just logs. This makes upstream believe that there is nothing wrong and 
proceed without any action which is not good.

 

 
{code:java}
2020-07-07 23:11:37,857 FATAL [14774961,peer_id] 
client.ConnectionManager$HConnectionImplementation - Cannot get the list of 
peersorg.apache.zookeeper.KeeperException$SessionExpiredException: 
KeeperErrorCode = Session expired for /hbase/replication/peersat 
org.apache.zookeeper.KeeperException.create(KeeperException.java:130)at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:54)at 
org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1549)at 
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:312)at
 
org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenNoWatch(ZKUtil.java:513)at 
org.apache.hadoop.hbase.replication.ReplicationPeersZKImpl.getAllPeerConfigs(ReplicationPeersZKImpl.java:249)at
 
org.apache.hadoop.hbase.client.replication.ReplicationAdmin.listPeerConfigs(ReplicationAdmin.java:332)
{code}
 

 


> Do the error handling for replication admin failures
> 
>
> Key: HBASE-24716
> URL: https://issues.apache.org/jira/browse/HBASE-24716
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
>
> [listPeerConfigs()|https://github.com/apache/hbase/blob/branch-1/hbase-client/src/main/java/org/apache/hadoop/hbase/client/replication/ReplicationAdmin.java#L295]
>  for getting the list of peers along with their configuration is not a 
> reliable API.
> It is not very robust to errors, logs FATAL and swallows the 
> [exceptions|[https://github.com/apache/hbase/blob/branch-1/hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationPeersZKImpl.java#L254]
>  
> Snippet:
> catch (KeeperException e) {
>  this.abortable.abort("Cannot get the list of peers ", e);
> } catch (ReplicationException e) {
>  this.abortable.abort("Cannot get the list of peers ", e);
> }
> return peers;
>  
> The abortable (connection in this case) 

[jira] [Created] (HBASE-24716) Do the error handling for replication admin failures

2020-07-10 Thread Sandeep Pal (Jira)
Sandeep Pal created HBASE-24716:
---

 Summary: Do the error handling for replication admin failures
 Key: HBASE-24716
 URL: https://issues.apache.org/jira/browse/HBASE-24716
 Project: HBase
  Issue Type: Improvement
  Components: Replication
Reporter: Sandeep Pal
Assignee: Sandeep Pal


[listPeerConfigs()|[https://git.soma.salesforce.com/bigdata-packaging/hbase/blob/1.6.0-sfdc-1/hbase-client/src/main/java/org/apache/hadoop/hbase/client/replication/ReplicationAdmin.java#L295]]
 for getting the list of peers along with their configuration is not a reliable 
API.

It is not very robust to errors, logs FATAL and swallows the 
[exceptions|[https://github.com/apache/hbase/blob/branch-1/hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationPeersZKImpl.java#L254]]
 

 

Snippet:

catch (KeeperException e) {
 this.abortable.abort("Cannot get the list of peers ", e);
} catch (ReplicationException e) {
 this.abortable.abort("Cannot get the list of peers ", e);
}
return peers;

 


The abortable (connection in this case) also doesn't abort the region server 
and just logs. This makes upstream believe that there is nothing wrong and 
proceed without any action which is not good.

 

 
{code:java}
2020-07-07 23:11:37,857 FATAL [14774961,peer_id] client.ConnectionManager$HConnectionImplementation - Cannot get the list of peers
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/replication/peers
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
  at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1549)
  at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:312)
  at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenNoWatch(ZKUtil.java:513)
  at org.apache.hadoop.hbase.replication.ReplicationPeersZKImpl.getAllPeerConfigs(ReplicationPeersZKImpl.java:249)
  at org.apache.hadoop.hbase.client.replication.ReplicationAdmin.listPeerConfigs(ReplicationAdmin.java:332)
{code}
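One possible hardening, sketched only as an illustration of the direction the report suggests (not a committed fix): propagate the ZooKeeper failure to the caller instead of routing it through an Abortable that may only log. The PeerStore interface below is a stand-in for the ZK-backed lookup:

{code:java}
import java.util.List;

import org.apache.hadoop.hbase.replication.ReplicationException;
import org.apache.zookeeper.KeeperException;

public class PeerListing {
  /** Stand-in for the ZooKeeper-backed peer lookup; illustrative only. */
  interface PeerStore {
    List<String> listPeerIdsFromZK() throws KeeperException;
  }

  /** Surfaces the failure to the admin caller so it can retry or fail loudly. */
  static List<String> listPeerIds(PeerStore store) throws ReplicationException {
    try {
      return store.listPeerIdsFromZK();
    } catch (KeeperException e) {
      throw new ReplicationException("Cannot get the list of peers", e);
    }
  }
}
{code}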
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HBASE-24543) ScheduledChore logging is too chatty, replace with metrics

2020-06-19 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal reassigned HBASE-24543:
---

Assignee: Sandeep Pal

> ScheduledChore logging is too chatty, replace with metrics
> --
>
> Key: HBASE-24543
> URL: https://issues.apache.org/jira/browse/HBASE-24543
> Project: HBase
>  Issue Type: Improvement
>  Components: metrics, Operability
>Reporter: Andrew Kyle Purtell
>Assignee: Sandeep Pal
>Priority: Minor
>
> ScheduledChore logs at DEBUG level the execution time of each chore. 
> We used to log an average execution time across all chores every five 
> minutes, which by consensus was judged to not be useful. Derived metrics like 
> averages or histograms should be calculated per chore. So we modified the 
> logging to dump the chore execution time each time it runs, to facilitate 
> such calculations with the log aggregation and searching tool of choice. Per 
> chore execution logging is more useful, in that sense, but may be too chatty. 
> This is not unexpected but let me provide my observations so we can revisit 
> this.
> On the master, for example, this is logged every second:
> {noformat}
> 2020-06-11 16:35:28,263 DEBUG 
> [master/apurtell-ltm:8100.splitLogManager..Chore.1] hbase.ScheduledChore: 
> SplitLogManager Timeout Monitor execution time: 0 ms.
> {noformat}
> Does the value of these lines outweigh the cost of 86,400 log lines per day 
> per master instance? (At least.)
> On the regionserver it is somewhat better, these are logged every 10 seconds:
> {noformat}
> 2020-06-11 16:37:57,203 DEBUG [regionserver/apurtell-ltm:8120.Chore.1] 
> hbase.ScheduledChore: CompactionChecker execution time: 0 ms.
> 2020-06-11 16:37:57,203 DEBUG [regionserver/apurtell-ltm:8120.Chore.1] 
> hbase.ScheduledChore: MemstoreFlusherChore execution time: 0 ms.
> {noformat}
> So that will be 17,280 log lines per day per regionserver. (At least.)
> These could be moved to TRACE level, perhaps. 
> I propose we replace this logging with histogram metrics. There should be a 
> separate metric for each distinct chore classname, allocated as needed. 
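As a rough sketch of the proposal's shape, using Dropwizard-style metrics as an assumption (HBase's own metrics layer may expose a different API): one histogram per chore class name, updated with each execution time instead of logging it.

{code:java}
import com.codahale.metrics.Histogram;
import com.codahale.metrics.MetricRegistry;

public class ChoreMetrics {
  private final MetricRegistry registry = new MetricRegistry();

  /** Records one chore execution, keyed by the chore's class name, instead of logging it. */
  void recordExecution(Class<?> choreClass, long executionTimeMs) {
    // histogram() creates the metric on first use and reuses it afterwards, so each
    // distinct chore class gets its own distribution of execution times.
    Histogram h = registry.histogram(
        MetricRegistry.name("ScheduledChore", choreClass.getSimpleName(), "executionTimeMs"));
    h.update(executionTimeMs);
  }
}
{code}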



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HBASE-24409) Expose a function in HBase WALKey to set the tablename

2020-06-13 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal resolved HBASE-24409.
-
Resolution: Abandoned

> Expose a function in HBase WALKey to set the tablename
> --
>
> Key: HBASE-24409
> URL: https://issues.apache.org/jira/browse/HBASE-24409
> Project: HBase
>  Issue Type: Improvement
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Minor
>
> Currently, table name in WALKey can not be changed once set. But exposing 
> this function to change the table name can be very helpful for Customized WAL 
> filters since they can flip the table name and make replication possible 
> between different table names in source and sink clusters. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work started] (HBASE-24015) Coverage for Assign and Unassign of Regions on RegionServer on failure

2020-06-13 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HBASE-24015 started by Sandeep Pal.
---
> Coverage for Assign and Unassign of Regions on RegionServer on failure
> --
>
> Key: HBASE-24015
> URL: https://issues.apache.org/jira/browse/HBASE-24015
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Reporter: Michael Stack
>Assignee: Sandeep Pal
>Priority: Major
>
> Looking at 'HBASE-23984 [Flakey Tests] TestMasterAbortAndRSGotKilled fails in 
> teardown', and at UnassignRegionHandler, AssignRegionHandler, 
> CloseRegionHandler, and the work that is done inline w/ request vs that which is 
> done to the side in executors, we need more coverage and specification of what 
> happens around the edges. This coverage would be more to see if there are holes in 
> our handling currently, in a unit test case context, before we see it out on 
> clusters.
> HBASE-23984  addresses holes where UnassignRegionHandler and 
> AssignRegionHandler could skip out w/o clearing Regions from the 
> RegionServer#regionsInTransitionInRS Map of Regions In Transition if failed 
> open or close because the RegionServer is aborting.
> Other holes seem lurking. On exception, we were leaving entries in the 
> RegionServer# submittedRegionProcedure map added by HBASE-2204; not the end 
> of the world but they should be cleared on error? HBASE-23984 adds clearing 
> from submittedRegionProcedure but then procedures even if failed get added to 
> the cache of procedures... so if we try to run the procedure again against 
> this server it won't be scheduled.
> interesting stuff.
> This issue is about adding tests that fail assign/unassign/close on the 
> RegionServer side making sure RS state is left in a good condition on fail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HBASE-24439) Replication queue recovery tool for rescuing deep queues

2020-05-26 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal reassigned HBASE-24439:
---

Assignee: Sandeep Pal

> Replication queue recovery tool for rescuing deep queues
> 
>
> Key: HBASE-24439
> URL: https://issues.apache.org/jira/browse/HBASE-24439
> Project: HBase
>  Issue Type: Brainstorming
>  Components: Replication
>Reporter: Andrew Kyle Purtell
>Assignee: Sandeep Pal
>Priority: Major
>
> In HBase cross site replication, on the source side, every regionserver 
> places its WALs into a replication queue and then drains the queue to the 
> remote sink cluster. At the source cluster every regionserver participates as 
> a source. At the sink cluster, a configurable subset of regionservers 
> volunteer to process inbound replication RPC. 
> When data is highly skewed we can take certain steps to mitigate, such as 
> pre-splitting, or manual splitting, and rebalancing. This can most 
> effectively be done at the sink, because replication RPCs are randomly 
> distributed over the set of receiving regionservers, and splitting on the 
> sink side can effectively redistribute resulting writes there. On the source 
> side we are more limited. 
> If writes are deeply unbalanced, a regionserver's source replication queue 
> may become very deep. Hotspotting can happen, despite mitigations. Unlike on 
> the sink side, once hotspotting has happened at the source, it is not 
> possible to increase parallelism or redistribute work among sources once WALs 
> have already been enqueued. Increasing parallelism on the sink side will not 
> help if there is a big rock at the source. Source side mitigations like 
> splitting and region redistribution cannot help deep queues already 
> accumulated.
> Can we redistribute source work? Yes and no. If a source regionserver fails, 
> its queues will be recovered by other regionservers. However the other rs 
> must still serve the recovered queue as an atomic entity. We can move a deep 
> queue, but we can't break it up. 
> Where time is of the essence, and ordering semantics can be allowed to break, 
> operators should have available to them a recovery tool that rescues their 
> production from the consequences of  deep source queues. A very large 
> replication queue can be split into many smaller queues. Perhaps even one new 
> queue for each WAL file. Then, these new synthetic queues can be distributed 
> to any/all source regionservers through the normal recovery queue assignment 
> protocol. This increases parallelism at the source.
> Of course this would break serial replication semantics and even in branch-1 
> which does not have that feature, it would significantly increase the 
> probability of reordering of edits. That is an unavoidable consequence of 
> breaking up the queue for more parallelism. As long as this is done by a 
> separate tool, invoked by operators, it is a valid option for emergency 
> drain, and once the drain is complete, the final state will be properly 
> ordered. Every cell in the WAL entries carries a timestamp assigned at the 
> source, and will be applied on the sink with this timestamp. When the queue 
> is drained and all edits have been persisted at the target, there will be a 
> complete and correct temporal data ordering at that time. An operator will be 
> and must be prepared to handle intermediate mis-/re-ordered states if they 
> intend to invoke this tool. In many use cases the interim states are not 
> important. The final state after all edits have transferred cross cluster and 
> persisted at this sink, after invocation of the recovery tool, is the point 
> where the operator would transition back into service.
> As a strawman we can propose these work items:
> - Add a replication admin command that can reassign a replication queue away 
> from an active source. The active source makes a new queue and continues. The 
> previously active queue can be assigned to another regionserver as a recovery 
> queue or can be left unassigned (e.g. target = null)
> - Administratively unassigned recovery queues should not be automatically 
> processed, but must be discoverable. 
> - Add a replication admin command that transitions an unassigned replication 
> queue into an active and eligible recovery queue.
> - Create a tool that uses these new APIs to take control of a (presumably 
> deep) replication queue, breaks up the queue into its constituent WAL files, 
> creates new synthetic queues according to a configurable and parameterized 
> grouping function, and uses the new APIs to make the new synthetic queues 
> eligible for recovery. The original queue retains one group as defined by the 
> grouping policy and itself is made re-eligible for recovery. 
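The grouping step of the strawman can be illustrated with a simple fixed-size grouping function; the admin APIs that would assign the resulting synthetic queues do not exist yet and are not shown.

{code:java}
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.Path;

public class QueueGrouper {
  /**
   * Splits a deep replication queue (an ordered list of WAL files) into synthetic queues
   * of at most groupSize WALs each, keeping the original order inside each group.
   * Ordering guarantees across groups are intentionally given up.
   */
  static List<List<Path>> group(List<Path> walFiles, int groupSize) {
    List<List<Path>> groups = new ArrayList<>();
    for (int i = 0; i < walFiles.size(); i += groupSize) {
      groups.add(new ArrayList<>(walFiles.subList(i, Math.min(i + groupSize, walFiles.size()))));
    }
    return groups;
  }
}
{code}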



--
This message 

[jira] [Commented] (HBASE-24409) Expose a function in HBase WALKey to set the tablename

2020-05-22 Thread Sandeep Pal (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114361#comment-17114361
 ] 

Sandeep Pal commented on HBASE-24409:
-

[~apurtell] I will update shortly on this.

> Expose a function in HBase WALKey to set the tablename
> --
>
> Key: HBASE-24409
> URL: https://issues.apache.org/jira/browse/HBASE-24409
> Project: HBase
>  Issue Type: Improvement
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Minor
>
> Currently, table name in WALKey can not be changed once set. But exposing 
> this function to change the table name can be very helpful for Customized WAL 
> filters since they can flip the table name and make replication possible 
> between different table names in source and sink clusters. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24409) Expose a function in HBase WALKey to set the tablename

2020-05-22 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-24409:

Priority: Minor  (was: Major)

> Expose a function in HBase WALKey to set the tablename
> --
>
> Key: HBASE-24409
> URL: https://issues.apache.org/jira/browse/HBASE-24409
> Project: HBase
>  Issue Type: Improvement
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Minor
>
> Currently, table name in WALKey can not be changed once set. But exposing 
> this function to change the table name can be very helpful for Customized WAL 
> filters since they can flip the table name and make replication possible 
> between different table names in source and sink clusters. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-24409) Expose a function in HBase WALKey to set the tablename

2020-05-21 Thread Sandeep Pal (Jira)
Sandeep Pal created HBASE-24409:
---

 Summary: Expose a function in HBase WALKey to set the tablename
 Key: HBASE-24409
 URL: https://issues.apache.org/jira/browse/HBASE-24409
 Project: HBase
  Issue Type: Improvement
Reporter: Sandeep Pal
Assignee: Sandeep Pal


Currently, table name in WALKey can not be changed once set. But exposing this 
function to change the table name can be very helpful for Customized WAL 
filters since they can flip the table name and make replication possible 
between different table names in source and sink clusters. 
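For context, a sketch of a custom WALEntryFilter under an assumed table name; today such a filter can only keep or drop entries per table, and rewriting the table name on the fly is exactly what would need the setter requested here:

{code:java}
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.replication.WALEntryFilter;
import org.apache.hadoop.hbase.wal.WAL.Entry;

/** Sketch of a custom per-table WAL entry filter; the table name is an assumption. */
public class SourceTableFilter implements WALEntryFilter {
  private final TableName replicated = TableName.valueOf("source_table");

  @Override
  public Entry filter(Entry entry) {
    // Returning null drops the entry from replication. Rewriting the table name so the
    // edit lands in a differently named sink table would need the WALKey setter this
    // issue asks for (or rebuilding the key by hand), which is not possible today.
    return replicated.equals(entry.getKey().getTableName()) ? entry : null;
  }
}
{code}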



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HBASE-24015) Coverage for Assign and Unassign of Regions on RegionServer on failure

2020-05-14 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal reassigned HBASE-24015:
---

Assignee: Sandeep Pal

> Coverage for Assign and Unassign of Regions on RegionServer on failure
> --
>
> Key: HBASE-24015
> URL: https://issues.apache.org/jira/browse/HBASE-24015
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Reporter: Michael Stack
>Assignee: Sandeep Pal
>Priority: Major
>
> Looking at 'HBASE-23984 [Flakey Tests] TestMasterAbortAndRSGotKilled fails in 
> teardown', and at UnassignRegionHandler, AssignRegionHandler, 
> CloseRegionHandler, and the work that is done inline w/ request vs that which is 
> done to the side in executors, we need more coverage and specification of what 
> happens around the edges. This coverage would be more to see if there are holes in 
> our handling currently, in a unit test case context, before we see it out on 
> clusters.
> HBASE-23984  addresses holes where UnassignRegionHandler and 
> AssignRegionHandler could skip out w/o clearing Regions from the 
> RegionServer#regionsInTransitionInRS Map of Regions In Transition if failed 
> open or close because the RegionServer is aborting.
> Other holes seem lurking. On exception, we were leaving entries in the 
> RegionServer# submittedRegionProcedure map added by HBASE-2204; not the end 
> of the world but they should be cleared on error? HBASE-23984 adds clearning 
> from submittedRegionProcedure but then procedures even if failed get added to 
> the cache of procedures... so if we try to run the procedure again against 
> this server it won't be scheduled.
> interesting stuff.
> This issue is about adding tests that fail assign/unassign/close on the 
> RegionServer side making sure RS state is left in a good condition on fail.
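A self-contained sketch (plain Java, not the actual handler classes) of the 
invariant such tests would pin down: whatever path the open/close handler 
takes, a failure must not leave the region behind in the regions-in-transition 
map. Names here are illustrative stand-ins:

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class RegionOpenHandlerSketch {
  // Stand-in for RegionServer#regionsInTransitionInRS.
  private final ConcurrentMap<String, Boolean> regionsInTransition =
      new ConcurrentHashMap<>();

  void handleOpen(String encodedRegionName) {
    regionsInTransition.put(encodedRegionName, Boolean.TRUE);
    try {
      openRegion(encodedRegionName); // may throw, e.g. because the server is aborting
    } finally {
      // The pattern HBASE-23984 applies: always clear, even on failure, so a
      // retried assign against this server is not silently skipped later.
      regionsInTransition.remove(encodedRegionName);
    }
  }

  private void openRegion(String encodedRegionName) {
    // Placeholder for the real open path.
  }
}
{code}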



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24226) Address other hard references to '/tmp' found in Configuration

2020-05-14 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-24226:

Description: HBASE-24175 started up cleaning hard /tmp references out of 
Configuration when tests run. I got most of them but then if it's hadoop2 or 
hadoop3 or jenkins or local, the list seems to change. Here are more...  (was: 
HBASE-24175 started up cleaning hard /tmp references out of Configuration when 
tests run. I got most of them but then if its hadoop2 or hadoop3 or jenkins or 
local, the list seems to change. Here are more...

)

> Address other hard references to '/tmp' found in Configuration
> --
>
> Key: HBASE-24226
> URL: https://issues.apache.org/jira/browse/HBASE-24226
> Project: HBase
>  Issue Type: Bug
>Reporter: Michael Stack
>Priority: Major
>
> HBASE-24175 started up cleaning hard /tmp references out of Configuration 
> when tests run. I got most of them but then if it's hadoop2 or hadoop3 or 
> jenkins or local, the list seems to change. Here are more...
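A small, hedged example of the kind of override the test setup ends up doing 
for each stray reference; hadoop.tmp.dir and hbase.tmp.dir are the usual 
/tmp-defaulting keys, but the full list this issue chases is longer and 
version-dependent, and the helper below is only an illustration:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class TestTmpDirOverrideSketch {
  public static Configuration withTestTmpDirs(Configuration conf, String testDataDir) {
    // Redirect properties that default to /tmp into the per-test data directory.
    conf.set("hadoop.tmp.dir", testDataDir + "/hadoop-tmp");
    conf.set("hbase.tmp.dir", testDataDir + "/hbase-tmp");
    return conf;
  }
}
{code}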



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HBASE-23126) IntegrationTestRSGroup is useless now

2020-05-14 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal reassigned HBASE-23126:
---

Assignee: Sandeep Pal

> IntegrationTestRSGroup is useless now
> -
>
> Key: HBASE-23126
> URL: https://issues.apache.org/jira/browse/HBASE-23126
> Project: HBase
>  Issue Type: Bug
>  Components: rsgroup
>Reporter: Duo Zhang
>Assignee: Sandeep Pal
>Priority: Major
>
> It extends TestRSGroupsBase and wants to run all the UTs defined in 
> TestRSGroupsBase, but after HBASE-21265, all the UTs have been moved to 
> subclasses...
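A toy illustration (not the real classes) of why the integration test ends up 
with nothing to run: JUnit only picks up @Test methods declared in a class or 
its superclasses, and after HBASE-21265 those methods live in sibling 
subclasses rather than in the base class itself.

{code:java}
import org.junit.Test;

class RSGroupsBaseSketch {
  // Stand-in for TestRSGroupsBase after HBASE-21265: shared setup, no @Test methods.
  void setUpCluster() { }
}

class RSGroupsBasicsSketch extends RSGroupsBaseSketch {
  // Stand-in for a subclass such as TestRSGroupsBasics: this is where the tests went.
  @Test
  public void testSomething() { }
}

class IntegrationTestRSGroupSketch extends RSGroupsBaseSketch {
  // Inherits only setUpCluster(); the runner finds no @Test methods here.
}
{code}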



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24350) HBase table level replication metrics for shippedBytes are always 0

2020-05-13 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-24350:

Affects Version/s: 2.4.0
   1.7.0
   master
   3.0.0-alpha-1

> HBase table level replication metrics for shippedBytes are always 0
> ---
>
> Key: HBASE-24350
> URL: https://issues.apache.org/jira/browse/HBASE-24350
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Affects Versions: 3.0.0-alpha-1, master, 1.7.0, 2.4.0
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
>
> It was observed during some investigations that the table-level metrics for 
> shippedBytes are consistently 0, even though data is getting shipped.
> There are two problems with table-level metrics:
>  # There are no table-level metrics for shipped bytes.
>  # Another problem is that it's using `MetricsReplicationSourceSourceImpl`, 
> which creates all the source-level metrics at the table level as well but 
> updates only ageOfLastShippedOp. This reports a lot of false/incorrect 
> replication metrics at the table level. 
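A minimal sketch, in plain Java rather than the actual metrics classes, of 
what a genuine per-table shippedBytes counter looks like: the shipper 
attributes the bytes of each acknowledged batch to the tables in it, instead 
of only bumping ageOfLastShippedOp at the table level. All names are 
illustrative:

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

public class TableLevelShippedBytesSketch {
  private final ConcurrentMap<String, LongAdder> shippedBytesByTable =
      new ConcurrentHashMap<>();

  /** Called by the shipper after a batch is acknowledged by the sink. */
  public void onBatchShipped(String tableName, long batchSizeInBytes) {
    shippedBytesByTable
        .computeIfAbsent(tableName, t -> new LongAdder())
        .add(batchSizeInBytes);
  }

  public long getShippedBytes(String tableName) {
    LongAdder adder = shippedBytesByTable.get(tableName);
    return adder == null ? 0L : adder.sum();
  }
}
{code}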



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work started] (HBASE-24350) HBase table level replication metrics for shippedBytes are always 0

2020-05-13 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HBASE-24350 started by Sandeep Pal.
---
> HBase table level replication metrics for shippedBytes are always 0
> ---
>
> Key: HBASE-24350
> URL: https://issues.apache.org/jira/browse/HBASE-24350
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
>
> It was observed during some investigations that the table-level metrics for 
> shippedBytes are consistently 0, even though data is getting shipped.
> There are two problems with table-level metrics:
>  # There are no table-level metrics for shipped bytes.
>  # Another problem is that it's using `MetricsReplicationSourceSourceImpl`, 
> which creates all the source-level metrics at the table level as well but 
> updates only ageOfLastShippedOp. This reports a lot of false/incorrect 
> replication metrics at the table level. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24350) HBase table level replication metrics for shippedBytes are always 0

2020-05-13 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-24350:

Description: 
It was observed during some investigations that the table-level metrics for 
shippedBytes are consistently 0, even though data is getting shipped.

There are two problems with table-level metrics:
 # There are no table-level metrics for shipped bytes.
 # Another problem is that it's using `MetricsReplicationSourceSourceImpl`, 
which creates all the source-level metrics at the table level as well but 
updates only ageOfLastShippedOp. This reports a lot of false/incorrect 
replication metrics at the table level. 

  was:It was observed during some investigations that table level metrics for 
shippedBytes are always 0 consistently even though data is getting shipped.


> HBase table level replication metrics for shippedBytes are always 0
> ---
>
> Key: HBASE-24350
> URL: https://issues.apache.org/jira/browse/HBASE-24350
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
>
> It was observed during some investigations that the table-level metrics for 
> shippedBytes are consistently 0, even though data is getting shipped.
> There are two problems with table-level metrics:
>  # There are no table-level metrics for shipped bytes.
>  # Another problem is that it's using `MetricsReplicationSourceSourceImpl`, 
> which creates all the source-level metrics at the table level as well but 
> updates only ageOfLastShippedOp. This reports a lot of false/incorrect 
> replication metrics at the table level. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-24350) HBase table level replication metrics for shippedBytes are always 0

2020-05-11 Thread Sandeep Pal (Jira)
Sandeep Pal created HBASE-24350:
---

 Summary: HBase table level replication metrics for shippedBytes 
are always 0
 Key: HBASE-24350
 URL: https://issues.apache.org/jira/browse/HBASE-24350
 Project: HBase
  Issue Type: Improvement
  Components: Replication
Reporter: Sandeep Pal
Assignee: Sandeep Pal


It was observed during some investigations that the table-level metrics for 
shippedBytes are consistently 0, even though data is getting shipped.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HBASE-21831) Optional store-and-forward of simple mutations for regions in transition

2020-05-11 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-21831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal reassigned HBASE-21831:
---

Assignee: (was: Sandeep Pal)

> Optional store-and-forward of simple mutations for regions in transition
> 
>
> Key: HBASE-21831
> URL: https://issues.apache.org/jira/browse/HBASE-21831
> Project: HBase
>  Issue Type: New Feature
>  Components: regionserver, rpc
>Reporter: Andrew Kyle Purtell
>Priority: Major
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.4.0
>
>
> We have an internal service built on Redis that is considering writing 
> through to HBase directly for their persistence needs. Their current 
> experience with Redis is
>  * Average write latency is ~milliseconds
>  * p999 write latencies are "a few seconds"
> They want a similar experience when writing simple values directly to HBase. 
> Infrequent exceptions to this would be acceptable. 
>  * Availability of 99.9% for writes
>  * Expect most writes to be serviced within a few milliseconds, e.g. few 
> millis at p95. Still evaluating what the requirement should be (~millis at 
> p90 vs p95 vs p99).
>  * Timeout of 2 seconds, should be rare
> There is a fallback plan considered if HBase cannot respond within 2 seconds. 
> However this fallback cannot guarantee durability. Redis or the service's 
> daemons may go down. They want HBase to provide required durability.
> Because this is a caching service, where all writes are expected to be served 
> again from cache, at least for a while, if HBase were to accept writes such 
> that they are not immediately visible, it could be fine that they are not 
> visible for 10-20 minutes in the worst case. This is relatively easy to 
> achieve as an engineering target should we consider offering a write option 
> that does not guarantee immediate visibility. (A proposal follows below.) We 
> are considering store-and-forward of simple mutations and perhaps also simple 
> deletes, although the latter is not a hard requirement. Out of order 
> processing of this subset of mutation requests is acceptable because their 
> data model ensures all values are immutable. Presumably on the HBase side the 
> timestamps of the requests would be set to the current server wall clock time 
> when received, so eventually when applied all are available with correct 
> temporal ordering (within the effective resolution of the server clocks). 
> Deletes which are not immediately applied (or failed) could cause application 
> level confusion, and although this would remain a concern for the general 
> case, for this specific use case, stale reads could be explained to and 
> tolerated by their users.
> The BigTable architecture assigns at most one server to serve a region at a 
> time. Region Replicas are an enhancement to the base BigTable architecture we 
> made in HBase which stands up two more read-only replicas for a given region, 
> meaning a client attempting a read has the option to fail very quickly over 
> from the primary to a replica for a (potentially stale) read, or distribute 
> read load over all replicas, or employ a hedged reading strategy. Enabling 
> region replicas and timeline consistency can lower the availability gap for 
> reads in the high percentiles from ~minutes to ~milliseconds. However, this 
> option will not help for write use cases wanting roughly the same thing, 
> because there can be no fail-over for writes. Writes must still go to the 
> active primary. When that region is in transition, writes must be held on the 
> client until it is redeployed. Or, if region replicas are not enabled, when 
> the sole region is in transition, again, writes must be held on the client 
> until the region is available again.
> Regions enter the in-transition state for two reasons: failures, and 
> housekeeping (splits and merges, or balancing). Time to region redeployment 
> after failures depends on a number of factors, like how long it took for us 
> to become aware of the failure, and how long it takes to split the 
> write-ahead log of the failed server and distribute the recovered edits to 
> the reopening region(s). We could in theory improve this behavior by being 
> more predictive about declaring failure, like employing a phi accrual failure 
> detector to signal to the master from clients that a regionserver is sick. 
> Other time-to-recovery issues and mitigations are discussed in a number of 
> JIRAs and blog posts and not discussed further here. Regarding housekeeping 
> activities, splits and merges typically complete in under a second. However, 
> split times up to ~30 seconds have been observed at my place of employ in 
> rare conditions. In the instances I have investigated the cause is I/O stalls 
> on the datanodes and 
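For contrast with the write side discussed above, a short sketch of the 
read-side mitigation region replicas already provide: a timeline-consistent 
Get may be served by a secondary replica (possibly stale) while the primary is 
in transition. Table and row names are made up; there is no equivalent client 
option for writes, which is the gap this issue explores.

{code:java}
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Consistency;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TimelineReadSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("cache_backing"))) {
      Get get = new Get(Bytes.toBytes("row-1"));
      get.setConsistency(Consistency.TIMELINE); // allow a (possibly stale) replica read
      Result result = table.get(get);
      System.out.println("served from replica (stale)=" + result.isStale());
    }
  }
}
{code}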

[jira] [Commented] (HBASE-21831) Optional store-and-forward of simple mutations for regions in transition

2020-03-06 Thread Sandeep Pal (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-21831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053868#comment-17053868
 ] 

Sandeep Pal commented on HBASE-21831:
-

[~apurtell] Yes, I was planning to work on this. Not started yet though. 

> Optional store-and-forward of simple mutations for regions in transition
> 
>
> Key: HBASE-21831
> URL: https://issues.apache.org/jira/browse/HBASE-21831
> Project: HBase
>  Issue Type: New Feature
>  Components: regionserver, rpc
>Reporter: Andrew Kyle Purtell
>Assignee: Sandeep Pal
>Priority: Major
> Fix For: 3.0.0, 1.7.0, 2.4.0
>
>
> We have an internal service built on Redis that is considering writing 
> through to HBase directly for their persistence needs. Their current 
> experience with Redis is
>  * Average write latency is ~milliseconds
>  * p999 write latencies are "a few seconds"
> They want a similar experience when writing simple values directly to HBase. 
> Infrequent exceptions to this would be acceptable. 
>  * Availability of 99.9% for writes
>  * Expect most writes to be serviced within a few milliseconds, e.g. few 
> millis at p95. Still evaluating what the requirement should be (~millis at 
> p90 vs p95 vs p99).
>  * Timeout of 2 seconds, should be rare
> There is a fallback plan considered if HBase cannot respond within 2 seconds. 
> However this fallback cannot guarantee durability. Redis or the service's 
> daemons may go down. They want HBase to provide required durability.
> Because this is a caching service, where all writes are expected to be served 
> again from cache, at least for a while, if HBase were to accept writes such 
> that they are not immediately visible, it could be fine that they are not 
> visible for 10-20 minutes in the worst case. This is relatively easy to 
> achieve as an engineering target should we consider offering a write option 
> that does not guarantee immediate visibility. (A proposal follows below.) We 
> are considering store-and-forward of simple mutations and perhaps also simple 
> deletes, although the latter is not a hard requirement. Out of order 
> processing of this subset of mutation requests is acceptable because their 
> data model ensures all values are immutable. Presumably on the HBase side the 
> timestamps of the requests would be set to the current server wall clock time 
> when received, so eventually when applied all are available with correct 
> temporal ordering (within the effective resolution of the server clocks). 
> Deletes which are not immediately applied (or failed) could cause application 
> level confusion, and although this would remain a concern for the general 
> case, for this specific use case, stale reads could be explained to and 
> tolerated by their users.
> The BigTable architecture assigns at most one server to serve a region at a 
> time. Region Replicas are an enhancement to the base BigTable architecture we 
> made in HBase which stands up two more read-only replicas for a given region, 
> meaning a client attempting a read has the option to fail very quickly over 
> from the primary to a replica for a (potentially stale) read, or distribute 
> read load over all replicas, or employ a hedged reading strategy. Enabling 
> region replicas and timeline consistency can lower the availability gap for 
> reads in the high percentiles from ~minutes to ~milliseconds. However, this 
> option will not help for write use cases wanting roughly the same thing, 
> because there can be no fail-over for writes. Writes must still go to the 
> active primary. When that region is in transition, writes must be held on the 
> client until it is redeployed. Or, if region replicas are not enabled, when 
> the sole region is in transition, again, writes must be held on the client 
> until the region is available again.
> Regions enter the in-transition state for two reasons: failures, and 
> housekeeping (splits and merges, or balancing). Time to region redeployment 
> after failures depends on a number of factors, like how long it took for us 
> to become aware of the failure, and how long it takes to split the 
> write-ahead log of the failed server and distribute the recovered edits to 
> the reopening region(s). We could in theory improve this behavior by being 
> more predictive about declaring failure, like employing a phi accrual failure 
> detector to signal to the master from clients that a regionserver is sick. 
> Other time-to-recovery issues and mitigations are discussed in a number of 
> JIRAs and blog posts and not discussed further here. Regarding housekeeping 
> activities, splits and merges typically complete in under a second. However, 
> split times up to ~30 seconds have been observed at my place of employ in 
> rare 
