[jira] [Commented] (HBASE-17884) Backport HBASE-16217 to branch-1

2019-04-16 Thread Gary Helmling (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-17884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16819352#comment-16819352
 ] 

Gary Helmling commented on HBASE-17884:
---

FWIW, +1 from me.

> Backport HBASE-16217 to branch-1
> 
>
> Key: HBASE-17884
> URL: https://issues.apache.org/jira/browse/HBASE-17884
> Project: HBase
>  Issue Type: Sub-task
>  Components: Coprocessors, security
>Reporter: Gary Helmling
>Assignee: Andrew Purtell
>Priority: Major
> Fix For: 1.5.0
>
> Attachments: HBASE-17884-branch-1.patch, HBASE-17884-branch-1.patch, 
> HBASE-17884.branch-1.001.patch
>
>
> The change to add calling user to ObserverContext in HBASE-16217 should also 
> be applied to branch-1 to avoid use of UserGroupInformation.doAs() for access 
> control checks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-16218) Eliminate use of UGI.doAs() in AccessController testing

2019-04-11 Thread Gary Helmling (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-16218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-16218:
--
Fix Version/s: (was: 2.3.0)
   (was: 1.5.0)
   (was: 3.0.0)

No progress from me, and, unfortunately, no bandwidth to handle this in the 
near future.  I've unscheduled it from the release branches.  Feel free to 
close it out instead.

> Eliminate use of UGI.doAs() in AccessController testing
> ---
>
> Key: HBASE-16218
> URL: https://issues.apache.org/jira/browse/HBASE-16218
> Project: HBase
>  Issue Type: Sub-task
>  Components: security
>Reporter: Gary Helmling
>Assignee: Gary Helmling
>Priority: Major
>
> Many tests for AccessController observer coprocessor hooks make use of 
> UGI.doAs() when the test user could simply be passed through.  Eliminate the 
> unnecessary use of doAs().





[jira] [Commented] (HBASE-17884) Backport HBASE-16217 to branch-1

2019-04-11 Thread Gary Helmling (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-17884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16815679#comment-16815679
 ] 

Gary Helmling commented on HBASE-17884:
---

Reattaching the patch with the correct Jira number this time.

> Backport HBASE-16217 to branch-1
> 
>
> Key: HBASE-17884
> URL: https://issues.apache.org/jira/browse/HBASE-17884
> Project: HBase
>  Issue Type: Sub-task
>  Components: Coprocessors, security
>Reporter: Gary Helmling
>Priority: Major
> Attachments: HBASE-17884.branch-1.001.patch
>
>
> The change to add calling user to ObserverContext in HBASE-16217 should also 
> be applied to branch-1 to avoid use of UserGroupInformation.doAs() for access 
> control checks.





[jira] [Updated] (HBASE-17884) Backport HBASE-16217 to branch-1

2019-04-11 Thread Gary Helmling (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-17884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-17884:
--
Attachment: HBASE-17884.branch-1.001.patch

> Backport HBASE-16217 to branch-1
> 
>
> Key: HBASE-17884
> URL: https://issues.apache.org/jira/browse/HBASE-17884
> Project: HBase
>  Issue Type: Sub-task
>  Components: Coprocessors, security
>Reporter: Gary Helmling
>Priority: Major
> Attachments: HBASE-17884.branch-1.001.patch
>
>
> The change to add calling user to ObserverContext in HBASE-16217 should also 
> be applied to branch-1 to avoid use of UserGroupInformation.doAs() for access 
> control checks.





[jira] [Updated] (HBASE-17884) Backport HBASE-16217 to branch-1

2019-04-11 Thread Gary Helmling (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-17884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-17884:
--
Attachment: (was: HBASE-16217.branch-1.001.patch)

> Backport HBASE-16217 to branch-1
> 
>
> Key: HBASE-17884
> URL: https://issues.apache.org/jira/browse/HBASE-17884
> Project: HBase
>  Issue Type: Sub-task
>  Components: Coprocessors, security
>Reporter: Gary Helmling
>Priority: Major
> Attachments: HBASE-17884.branch-1.001.patch
>
>
> The change to add calling user to ObserverContext in HBASE-16217 should also 
> be applied to branch-1 to avoid use of UserGroupInformation.doAs() for access 
> control checks.





[jira] [Updated] (HBASE-17884) Backport HBASE-16217 to branch-1

2019-04-11 Thread Gary Helmling (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-17884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-17884:
--
Fix Version/s: (was: 1.5.0)

Unscheduling for now, since it's not being actively worked on.  Feel free to 
close it out instead.

> Backport HBASE-16217 to branch-1
> 
>
> Key: HBASE-17884
> URL: https://issues.apache.org/jira/browse/HBASE-17884
> Project: HBase
>  Issue Type: Sub-task
>  Components: Coprocessors, security
>Reporter: Gary Helmling
>Priority: Major
> Attachments: HBASE-16217.branch-1.001.patch
>
>
> The change to add calling user to ObserverContext in HBASE-16217 should also 
> be applied to branch-1 to avoid use of UserGroupInformation.doAs() for access 
> control checks.





[jira] [Commented] (HBASE-17884) Backport HBASE-16217 to branch-1

2019-04-11 Thread Gary Helmling (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-17884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16815675#comment-16815675
 ] 

Gary Helmling commented on HBASE-17884:
---

Attaching an old patch I had sitting around that applies this change to 
branch-1.  It is hopelessly out of date, so it is guaranteed not to apply 
cleanly, but I'm parking it here in case anyone wants to pick it up.  
Unfortunately, I don't have any bandwidth to push this forward, so I'm also 
fine with closing this instead.

> Backport HBASE-16217 to branch-1
> 
>
> Key: HBASE-17884
> URL: https://issues.apache.org/jira/browse/HBASE-17884
> Project: HBase
>  Issue Type: Sub-task
>  Components: Coprocessors, security
>Reporter: Gary Helmling
>Priority: Major
> Fix For: 1.5.0
>
> Attachments: HBASE-16217.branch-1.001.patch
>
>
> The change to add calling user to ObserverContext in HBASE-16217 should also 
> be applied to branch-1 to avoid use of UserGroupInformation.doAs() for access 
> control checks.





[jira] [Updated] (HBASE-17884) Backport HBASE-16217 to branch-1

2019-04-11 Thread Gary Helmling (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-17884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-17884:
--
Attachment: HBASE-16217.branch-1.001.patch

> Backport HBASE-16217 to branch-1
> 
>
> Key: HBASE-17884
> URL: https://issues.apache.org/jira/browse/HBASE-17884
> Project: HBase
>  Issue Type: Sub-task
>  Components: Coprocessors, security
>Reporter: Gary Helmling
>Priority: Major
> Fix For: 1.5.0
>
> Attachments: HBASE-16217.branch-1.001.patch
>
>
> The change to add calling user to ObserverContext in HBASE-16217 should also 
> be applied to branch-1 to avoid use of UserGroupInformation.doAs() for access 
> control checks.





[jira] [Updated] (HBASE-11653) RegionObserver coprocessor cannot override KeyValue values in prePut()

2018-08-24 Thread Gary Helmling (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-11653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-11653:
--
Resolution: Won't Fix
Status: Resolved  (was: Patch Available)

Resolving since this issue is only present in 0.94 versions, which are no 
longer released.

> RegionObserver coprocessor cannot override KeyValue values in prePut()
> --
>
> Key: HBASE-11653
> URL: https://issues.apache.org/jira/browse/HBASE-11653
> Project: HBase
>  Issue Type: Bug
>  Components: Coprocessors
>Affects Versions: 0.94.21
>Reporter: Gary Helmling
>Assignee: Gary Helmling
>Priority: Minor
> Attachments: HBASE-11653_0.94.patch
>
>
> Due to a bug in {{HRegion.internalPut()}}, any modifications that a 
> {{RegionObserver}} makes to a Put's family map in the {{prePut()}} hook are 
> lost.
> This prevents coprocessors from modifying the values written by a {{Put}}.





[jira] [Updated] (HBASE-19332) DumpReplicationQueues misreports total WAL size

2017-11-22 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-19332:
--
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Pushed to branch-1.3+.  Thanks for review [~tedyu].

> DumpReplicationQueues misreports total WAL size
> ---
>
> Key: HBASE-19332
> URL: https://issues.apache.org/jira/browse/HBASE-19332
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 1.3.1
>Reporter: Gary Helmling
>Assignee: Gary Helmling
>Priority: Trivial
> Fix For: 2.0.0, 3.0.0, 1.3.2
>
> Attachments: HBASE-19332.patch
>
>
> DumpReplicationQueues uses an int to collect the total WAL size for a queue.  
> Predictably, this overflows much of the time.  Let's use a long instead.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-19332) DumpReplicationQueues misreports total WAL size

2017-11-22 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16263236#comment-16263236
 ] 

Gary Helmling commented on HBASE-19332:
---

DumpReplicationQueues was only added to branch-1 in time for 1.3, so this only 
impacts branch-1.3+.

> DumpReplicationQueues misreports total WAL size
> ---
>
> Key: HBASE-19332
> URL: https://issues.apache.org/jira/browse/HBASE-19332
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 1.3.1
>Reporter: Gary Helmling
>Assignee: Gary Helmling
>Priority: Trivial
> Fix For: 2.0.0, 3.0.0, 1.3.2
>
> Attachments: HBASE-19332.patch
>
>
> DumpReplicationQueues uses an int to collect the total WAL size for a queue.  
> Predictably, this overflows much of the time.  Let's use a long instead.





[jira] [Updated] (HBASE-19332) DumpReplicationQueues misreports total WAL size

2017-11-22 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-19332:
--
Fix Version/s: 1.3.2
   3.0.0
   2.0.0

> DumpReplicationQueues misreports total WAL size
> ---
>
> Key: HBASE-19332
> URL: https://issues.apache.org/jira/browse/HBASE-19332
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 1.3.1
>Reporter: Gary Helmling
>Assignee: Gary Helmling
>Priority: Trivial
> Fix For: 2.0.0, 3.0.0, 1.3.2
>
> Attachments: HBASE-19332.patch
>
>
> DumpReplicationQueues uses an int to collect the total WAL size for a queue.  
> Predictably, this overflows much of the time.  Let's use a long instead.





[jira] [Updated] (HBASE-19332) DumpReplicationQueues misreports total WAL size

2017-11-22 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-19332:
--
Status: Patch Available  (was: Open)

> DumpReplicationQueues misreports total WAL size
> ---
>
> Key: HBASE-19332
> URL: https://issues.apache.org/jira/browse/HBASE-19332
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 1.3.1
>Reporter: Gary Helmling
>Assignee: Gary Helmling
>Priority: Trivial
> Fix For: 2.0.0, 3.0.0, 1.3.2
>
> Attachments: HBASE-19332.patch
>
>
> DumpReplicationQueues uses an int to collect the total WAL size for a queue.  
> Predictably, this overflows much of the time.  Let's use a long instead.





[jira] [Updated] (HBASE-19332) DumpReplicationQueues misreports total WAL size

2017-11-22 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-19332:
--
Affects Version/s: 1.3.1

> DumpReplicationQueues misreports total WAL size
> ---
>
> Key: HBASE-19332
> URL: https://issues.apache.org/jira/browse/HBASE-19332
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 1.3.1
>Reporter: Gary Helmling
>Assignee: Gary Helmling
>Priority: Trivial
> Attachments: HBASE-19332.patch
>
>
> DumpReplicationQueues uses an int to collect the total WAL size for a queue.  
> Predictably, this overflows much of the time.  Let's use a long instead.





[jira] [Updated] (HBASE-19332) DumpReplicationQueues misreports total WAL size

2017-11-22 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-19332:
--
Attachment: HBASE-19332.patch

Trivial fix changing int to long.

> DumpReplicationQueues misreports total WAL size
> ---
>
> Key: HBASE-19332
> URL: https://issues.apache.org/jira/browse/HBASE-19332
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Reporter: Gary Helmling
>Assignee: Gary Helmling
>Priority: Trivial
> Attachments: HBASE-19332.patch
>
>
> DumpReplicationQueues uses an int to collect the total WAL size for a queue.  
> Predictably, this overflows much of the time.  Let's use a long instead.





[jira] [Created] (HBASE-19332) DumpReplicationQueues misreports total WAL size

2017-11-22 Thread Gary Helmling (JIRA)
Gary Helmling created HBASE-19332:
-

 Summary: DumpReplicationQueues misreports total WAL size
 Key: HBASE-19332
 URL: https://issues.apache.org/jira/browse/HBASE-19332
 Project: HBase
  Issue Type: Bug
  Components: Replication
Reporter: Gary Helmling
Assignee: Gary Helmling
Priority: Trivial


DumpReplicationQueues uses an int to collect the total WAL size for a queue.  
Predictably, this overflows much of the time.  Let's use a long instead.
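The overflow is easy to reproduce.  A minimal sketch (illustrative only, not the actual DumpReplicationQueues code): summing WAL file sizes into an int silently wraps once the total passes Integer.MAX_VALUE (about 2 GiB), which a real replication queue exceeds easily.

```java
public class WalSizeSum {
    static int sumAsInt(long[] walSizes) {
        int total = 0;
        for (long size : walSizes) {
            total += size; // compound assignment narrows long -> int, wrapping silently
        }
        return total;
    }

    static long sumAsLong(long[] walSizes) {
        long total = 0L;
        for (long size : walSizes) {
            total += size; // no overflow until totals near 2^63
        }
        return total;
    }

    public static void main(String[] args) {
        long[] walSizes = { 1_610_612_736L, 1_610_612_736L }; // two 1.5 GiB WALs
        System.out.println(sumAsInt(walSizes));  // negative: wrapped past 2^31 - 1
        System.out.println(sumAsLong(walSizes)); // 3221225472, the true total
    }
}
```

Note that `total += size` compiles even with an int accumulator, because Java's compound assignment inserts an implicit narrowing cast, so the bug produces no compiler warning.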





[jira] [Commented] (HBASE-19001) Remove StoreScanner dependency in our own CP related tests

2017-10-16 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16206506#comment-16206506
 ] 

Gary Helmling commented on HBASE-19001:
---

Agree with Anoop.  This needs a full description with context, explanation of 
what the replacement for this functionality is, and some plan on how we 
communicate this to downstream users.

I assume this was discussed in a thread on the dev list first.  Can we also 
point to that discussion?

This will break the current implementation of the Apache Tephra 
TransactionProcessor:
https://github.com/apache/incubator-tephra/blob/master/tephra-hbase-compat-1.3/src/main/java/org/apache/tephra/hbase/coprocessor/TransactionProcessor.java

so pointing to some context on why this change was made when downstream users 
come looking for it would be very helpful.



> Remove StoreScanner dependency in our own CP related tests
> --
>
> Key: HBASE-19001
> URL: https://issues.apache.org/jira/browse/HBASE-19001
> Project: HBase
>  Issue Type: Sub-task
>  Components: Coprocessors
>Reporter: Duo Zhang
>Assignee: Duo Zhang
> Fix For: 2.0.0-alpha-4
>
> Attachments: HBASE-19001.patch
>
>






[jira] [Commented] (HBASE-18786) FileNotFoundException should not be silently handled for primary region replicas

2017-10-10 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16199483#comment-16199483
 ] 

Gary Helmling commented on HBASE-18786:
---

Seems fine to remove from 1.3.  

handleFileNotFound() was introduced by HBASE-13651 to handle a situation where 
regionserver A is hosting a region and starts a compaction, enters a GC pause, 
the region is reassigned, and then regionserver A emerges from the pause and 
archives the compacted files before aborting.  If we really want to handle this 
situation, then we need to introduce fencing at the HDFS level during failed 
server processing.  The current situation with handleFileNotFound() seems worse 
than the original problem, since it can hide other problems.

> FileNotFoundException should not be silently handled for primary region 
> replicas
> 
>
> Key: HBASE-18786
> URL: https://issues.apache.org/jira/browse/HBASE-18786
> Project: HBase
>  Issue Type: Sub-task
>  Components: regionserver, Scanners
>Reporter: Ashu Pachauri
>Assignee: Andrew Purtell
> Fix For: 2.0.0, 3.0.0, 1.4.0, 1.5.0
>
> Attachments: HBASE-18786-branch-1.3.patch, 
> HBASE-18786-branch-1.patch, HBASE-18786-branch-1.patch, HBASE-18786.patch, 
> HBASE-18786.patch
>
>
> This is a follow up for HBASE-18186.
> FileNotFoundException while scanning from a primary region replica can be 
> indicative of a more severe problem. Handling them silently can cause many 
> underlying issues go undetected. We should either
> 1. Hard fail the regionserver if there is a FNFE on a primary region replica, 
> OR
> 2. Report these exceptions as some region / server level metric so that these 
> can be proactively investigated.





[jira] [Updated] (HBASE-16231) Integration tests should support client keytab login for secure clusters

2017-08-17 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-16231:
--
Release Note: Prior to this change, the integration test clients 
(IntegrationTest*) relied on the Kerberos credential cache for authentication 
against secured clusters.  This could lead to the tests failing due to 
authentication failures when the tickets in the credential cache expired.  With 
this change, the integration test clients will make use of the configuration 
properties for "hbase.client.keytab.file" and 
"hbase.client.kerberos.principal", when available.  This will perform a login 
from the configured keytab file and automatically refresh the credentials in 
the background for the process lifetime.
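For reference, the two properties named in this release note would be set in the client-side configuration roughly as follows (the property names come from the release note itself; the keytab path and principal are placeholders):

```xml
<!-- hbase-site.xml (client side): keytab login for integration test clients.
     Path and principal below are placeholder values. -->
<property>
  <name>hbase.client.keytab.file</name>
  <value>/etc/security/keytabs/hbase-client.keytab</value>
</property>
<property>
  <name>hbase.client.kerberos.principal</name>
  <value>hbase-client/_HOST@EXAMPLE.COM</value>
</property>
```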

> Integration tests should support client keytab login for secure clusters
> 
>
> Key: HBASE-16231
> URL: https://issues.apache.org/jira/browse/HBASE-16231
> Project: HBase
>  Issue Type: Improvement
>  Components: integration tests
>Reporter: Gary Helmling
>Assignee: Gary Helmling
> Fix For: 2.0.0, 1.3.0, 1.4.0
>
> Attachments: HBASE-16231.001.patch
>
>
> Integration tests currently rely on an external kerberos login for secure 
> clusters.  Elsewhere we use AuthUtil to login and refresh the credentials in 
> a background thread.  We should do the same here.





[jira] [Commented] (HBASE-18370) Master should attempt reassignment of regions in FAILED_OPEN state

2017-07-13 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16086182#comment-16086182
 ] 

Gary Helmling commented on HBASE-18370:
---

One of the problems we have with the region assignment retries in 1.3 and prior 
is the lack of backoff between retry attempts, so we burn through the retries 
quickly.  With HBASE-16209 in branch-1+, we now have a backoff policy for 
region open attempts.  If we just change the default configuration for max 
retries to Integer.MAX_VALUE, this should effectively give us "retry forever" 
for region open, which seems much better than the current behavior.

So I'm not sure we need anything more than a config change.  Are there any 
places where this would not be sufficient?  I'm not sure offhand whether we 
would give up on master failover.
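The config-only change suggested above would look something like this (assuming the branch-1 retry-count property is hbase.assignment.maximum.attempts; verify the name against your HBase version):

```xml
<!-- hbase-site.xml: sketch of the "retry forever" tweak discussed above.
     Property name is an assumption based on branch-1 assignment settings. -->
<property>
  <name>hbase.assignment.maximum.attempts</name>
  <!-- Integer.MAX_VALUE: keep retrying region open indefinitely,
       relying on the HBASE-16209 backoff between attempts. -->
  <value>2147483647</value>
</property>
```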

> Master should attempt reassignment of regions in FAILED_OPEN state
> --
>
> Key: HBASE-18370
> URL: https://issues.apache.org/jira/browse/HBASE-18370
> Project: HBase
>  Issue Type: Improvement
>Reporter: Andrew Purtell
>
> Currently once a region goes into FAILED_OPEN state this requires operator 
> intervention. With some underlying causes, this is necessary. With others, 
> the master could eventually successfully deploy the region without humans in 
> the loop. The master should optionally attempt automatic resolution of 
> FAILED_OPEN states with a strategy of: delay, unassign, reassign. 





[jira] [Commented] (HBASE-18358) Backport HBASE-18099 'FlushSnapshotSubprocedure should wait for concurrent Region#flush() to finish' to branch-1.3

2017-07-11 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083150#comment-16083150
 ] 

Gary Helmling commented on HBASE-18358:
---

+1 on patch v3 and the branch-1.3 patch, pending a test run.  Thanks for the 
fixes, Ted!

> Backport HBASE-18099 'FlushSnapshotSubprocedure should wait for concurrent 
> Region#flush() to finish' to branch-1.3
> --
>
> Key: HBASE-18358
> URL: https://issues.apache.org/jira/browse/HBASE-18358
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Critical
> Attachments: 18358.branch-1.3.patch, 18358.v2.txt, 18358.v3.txt
>
>
> HBASE-18099 was only integrated to branch-1 and above in consideration of 
> backward compatibility.
> This issue is to backport the fix to branch-1.3 and branch-1.2.
> Quoting Gary's suggestion from the tail of HBASE-18099 :
> {quote}
> Sure, don't add the method to Region, just to HRegion, check for an instance 
> of HRegion in FlushSnapshotSubprocedure and cast the instance before calling 
> the method.
> {quote}





[jira] [Commented] (HBASE-18358) Backport HBASE-18099 'FlushSnapshotSubprocedure should wait for concurrent Region#flush() to finish' to branch-1.3

2017-07-11 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083139#comment-16083139
 ] 

Gary Helmling commented on HBASE-18358:
---

[~jerryhe] as I understand it, creating a flush snapshot guarantees that the 
snapshot contains at least the writes that were acknowledged at the time the 
command was issued.  Without that, it seems pretty useless.  Am I 
misunderstanding the guarantee?

This change just fixes snapshots to guarantee that a flush was 
correctly/successfully performed on every region.

With the first implementation, we could leave memstores unflushed, meaning you 
could have as much as an hour's worth of data from that store missing from the 
snapshot, assuming a periodic flush configuration of 1 hour max.



> Backport HBASE-18099 'FlushSnapshotSubprocedure should wait for concurrent 
> Region#flush() to finish' to branch-1.3
> --
>
> Key: HBASE-18358
> URL: https://issues.apache.org/jira/browse/HBASE-18358
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Critical
> Attachments: 18358.branch-1.3.patch, 18358.v2.txt, 18358.v3.txt
>
>
> HBASE-18099 was only integrated to branch-1 and above in consideration of 
> backward compatibility.
> This issue is to backport the fix to branch-1.3 and branch-1.2.
> Quoting Gary's suggestion from the tail of HBASE-18099 :
> {quote}
> Sure, don't add the method to Region, just to HRegion, check for an instance 
> of HRegion in FlushSnapshotSubprocedure and cast the instance before calling 
> the method.
> {quote}





[jira] [Commented] (HBASE-18358) Backport HBASE-18099 'FlushSnapshotSubprocedure should wait for concurrent Region#flush() to finish' to branch-1.3

2017-07-11 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083043#comment-16083043
 ] 

Gary Helmling commented on HBASE-18358:
---

On patch v2, I think we can just get the read point once, outside of the for 
loop.  I don't see a reason to refetch it for every iteration.  We just want to 
make sure that we have flushed past whatever the read point was at the start of 
the subprocedure.

Nit: I would also pull MAX_RETRIES up into a constant with a short comment for 
clarity.

> Backport HBASE-18099 'FlushSnapshotSubprocedure should wait for concurrent 
> Region#flush() to finish' to branch-1.3
> --
>
> Key: HBASE-18358
> URL: https://issues.apache.org/jira/browse/HBASE-18358
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Critical
> Attachments: 18358.branch-1.3.patch, 18358.v2.txt
>
>
> HBASE-18099 was only integrated to branch-1 and above in consideration of 
> backward compatibility.
> This issue is to backport the fix to branch-1.3 and branch-1.2.
> Quoting Gary's suggestion from the tail of HBASE-18099 :
> {quote}
> Sure, don't add the method to Region, just to HRegion, check for an instance 
> of HRegion in FlushSnapshotSubprocedure and cast the instance before calling 
> the method.
> {quote}





[jira] [Commented] (HBASE-18358) Backport HBASE-18099 'FlushSnapshotSubprocedure should wait for concurrent Region#flush() to finish' to branch-1.3

2017-07-11 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16082802#comment-16082802
 ] 

Gary Helmling commented on HBASE-18358:
---

Thanks for putting up the patch, Ted.

Looking over the approach, I'm not sure this is sufficient, though.

HRegion::flushcache() could return CANNOT_FLUSH when only a single memstore is 
being flushed from a multi column family table.  In this case, we will wait for 
the single memstore to complete, but I think the other memstores could remain 
unflushed and would not be part of the snapshot.

I think we have a couple options:
# in FlushSnapshotSubprocedure, we could still call HRegion::waitForFlushes(), 
then retry calling HRegion::flush(true) a number of times until we get a result 
of FLUSHED_NO_COMPACTION_NEEDED | FLUSHED_COMPACTION_NEEDED |  
CANNOT_FLUSH_MEMSTORE_EMPTY.
# in FlushSnapshotSubprocedure, we could also call HRegion::getReadpoint() 
prior to the flush request and then check HRegion::getMaxFlushedSeqId() after 
calling HRegion::waitForFlushes() to see if we need to retry the call to 
HRegion::flush().  If the max flushed seq ID >= readpoint at the start, then I 
think we can guarantee that all acknowledged writes at the start of the 
snapshot have been persisted.

Sorry to expand the scope here.  This is now well beyond a backport.  Let me 
know your thoughts on these approaches.
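A rough sketch of option 2, against a stubbed region interface (the method names mirror HRegion's getReadpoint/getMaxFlushedSeqId/waitForFlushes, but this is illustrative, not the real HBase code):

```java
// Stub of the few HRegion operations the retry loop needs.
interface RegionLike {
    long getReadpoint();
    long getMaxFlushedSeqId();
    void requestFlush();    // may be a no-op if a flush is already in flight
    void waitForFlushes();  // blocks until any concurrent flush completes
}

class SnapshotFlushSketch {
    static final int MAX_RETRIES = 5; // bounded so a stuck region fails loudly

    /**
     * Capture the readpoint once up front, then flush-and-wait until
     * everything acknowledged at the start of the snapshot is persisted.
     */
    static boolean flushPastReadpoint(RegionLike region) {
        final long readpoint = region.getReadpoint();
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            if (region.getMaxFlushedSeqId() >= readpoint) {
                return true; // all writes visible at the start are on disk
            }
            region.requestFlush();
            region.waitForFlushes();
        }
        return region.getMaxFlushedSeqId() >= readpoint;
    }
}
```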

> Backport HBASE-18099 'FlushSnapshotSubprocedure should wait for concurrent 
> Region#flush() to finish' to branch-1.3
> --
>
> Key: HBASE-18358
> URL: https://issues.apache.org/jira/browse/HBASE-18358
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Critical
> Attachments: 18358.branch-1.3.patch
>
>
> HBASE-18099 was only integrated to branch-1 and above in consideration of 
> backward compatibility.
> This issue is to backport the fix to branch-1.3 and branch-1.2.
> Quoting Gary's suggestion from the tail of HBASE-18099 :
> {quote}
> Sure, don't add the method to Region, just to HRegion, check for an instance 
> of HRegion in FlushSnapshotSubprocedure and cast the instance before calling 
> the method.
> {quote}





[jira] [Commented] (HBASE-18099) FlushSnapshotSubprocedure should wait for concurrent Region#flush() to finish

2017-07-10 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081448#comment-16081448
 ] 

Gary Helmling commented on HBASE-18099:
---

bq. Is there suggestion on how backward compatibility can be kept ?

Sure, don't add the method to Region, just to HRegion, check for an instance of 
HRegion in FlushSnapshotSubprocedure and cast the instance before calling the 
method.

This is a critical issue that means that our current snapshot implementation in 
1.3 is broken.  It doesn't guarantee that all of the data that should be 
present in a snapshot is there.  I think all of the users of the impacted 
versions would be interested in a fix.

> FlushSnapshotSubprocedure should wait for concurrent Region#flush() to finish
> -
>
> Key: HBASE-18099
> URL: https://issues.apache.org/jira/browse/HBASE-18099
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Critical
> Fix For: 2.0.0, 1.4.0
>
> Attachments: 18099.v1.txt, 18099.v2.txt, 18099.v3.txt, 18099.v4.txt
>
>
> In the following thread:
> http://search-hadoop.com/m/HBase/YGbbMXkeHlI9zo
> Jacob described the scenario where data from certain region were missing in 
> the snapshot.
> Here was related region server log:
> https://pastebin.com/1ECXjhRp
> He pointed out that concurrent flush from MemStoreFlusher.1 thread was not 
> initiated from the thread pool for snapshot.
> In RegionSnapshotTask#call() method there is this:
> {code}
>   region.flush(true);
> {code}
> The return value is not checked.
> In HRegion#flushcache(), Result.CANNOT_FLUSH may be returned due to:
> {code}
>   String msg = "Not flushing since "
>   + (writestate.flushing ? "already flushing"
>   : "writes not enabled");
> {code}
> This implies that FlushSnapshotSubprocedure may incorrectly skip waiting for 
> the concurrent flush to complete.
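For context, checking that return value might look something like the following sketch (the enum is a stub mirroring HRegion's flush result codes as named in this thread; it is illustrative, not the actual HBase type):

```java
// Stub of HRegion's flush result codes (illustrative only).
enum FlushResult {
    FLUSHED_NO_COMPACTION_NEEDED,
    FLUSHED_COMPACTION_NEEDED,
    CANNOT_FLUSH_MEMSTORE_EMPTY,
    CANNOT_FLUSH;

    boolean isFlushSucceeded() {
        return this == FLUSHED_NO_COMPACTION_NEEDED
            || this == FLUSHED_COMPACTION_NEEDED;
    }
}

class RegionSnapshotTaskSketch {
    /**
     * A snapshot may proceed if the flush actually ran (or the memstore was
     * already empty).  CANNOT_FLUSH means another flush is in flight, so the
     * snapshot must wait for it instead of proceeding as if flushed.
     */
    static boolean snapshotSafe(FlushResult result) {
        return result.isFlushSucceeded()
            || result == FlushResult.CANNOT_FLUSH_MEMSTORE_EMPTY;
    }
}
```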





[jira] [Commented] (HBASE-18099) FlushSnapshotSubprocedure should wait for concurrent Region#flush() to finish

2017-07-10 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081418#comment-16081418
 ] 

Gary Helmling commented on HBASE-18099:
---

I believe this also impacts 1.3.1, as we were just investigating the same issue 
there, and possibly also 1.2.  Is there a reason this fix was not committed 
there as well?

> FlushSnapshotSubprocedure should wait for concurrent Region#flush() to finish
> -
>
> Key: HBASE-18099
> URL: https://issues.apache.org/jira/browse/HBASE-18099
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Critical
> Fix For: 2.0.0, 1.4.0
>
> Attachments: 18099.v1.txt, 18099.v2.txt, 18099.v3.txt, 18099.v4.txt
>
>
> In the following thread:
> http://search-hadoop.com/m/HBase/YGbbMXkeHlI9zo
> Jacob described the scenario where data from certain region were missing in 
> the snapshot.
> Here was related region server log:
> https://pastebin.com/1ECXjhRp
> He pointed out that concurrent flush from MemStoreFlusher.1 thread was not 
> initiated from the thread pool for snapshot.
> In RegionSnapshotTask#call() method there is this:
> {code}
>   region.flush(true);
> {code}
> The return value is not checked.
> In HRegion#flushcache(), Result.CANNOT_FLUSH may be returned due to:
> {code}
>   String msg = "Not flushing since "
>   + (writestate.flushing ? "already flushing"
>   : "writes not enabled");
> {code}
> This implies that FlushSnapshotSubprocedure may incorrectly skip waiting for 
> the concurrent flush to complete.





[jira] [Updated] (HBASE-18141) Regionserver fails to shutdown when abort triggered in RegionScannerImpl during RPC call

2017-06-08 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-18141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-18141:
--
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 1.2.7
   1.4.0
   3.0.0
   2.0.0
   Status: Resolved  (was: Patch Available)

Committed to branch-1.2+.

> Regionserver fails to shutdown when abort triggered in RegionScannerImpl 
> during RPC call
> 
>
> Key: HBASE-18141
> URL: https://issues.apache.org/jira/browse/HBASE-18141
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, security
>Affects Versions: 1.3.1
>Reporter: Gary Helmling
>Assignee: Gary Helmling
>Priority: Critical
> Fix For: 2.0.0, 3.0.0, 1.4.0, 1.3.2, 1.2.7
>
> Attachments: HBASE-18141.001.patch, HBASE-18141.branch-1.3.001.patch, 
> HBASE-18141.branch-1.3.002.patch
>
>
> When an abort is triggered within the RPC call path by 
> HRegion.RegionScannerImpl, AccessController incorrectly applies the RPC 
> caller identity in the RegionServerObserver.preStopRegionServer() hook.  This 
> leaves the regionserver in a non-responsive state, where its regions are not 
> reassigned and it returns exceptions for all requests.
> When an abort is triggered on the server side, we should not allow a 
> coprocessor to reject the abort at all.
> Here is a sample stack trace:
> {noformat}
> 17/05/25 06:10:29 FATAL regionserver.HRegionServer: RegionServer abort: 
> loaded coprocessors are: 
> [org.apache.hadoop.hbase.security.access.AccessController, 
> org.apache.hadoop.hbase.security.token.TokenProvider]
> 17/05/25 06:10:29 WARN regionserver.HRegionServer: The region server did not 
> stop
> org.apache.hadoop.hbase.security.AccessDeniedException: Insufficient 
> permissions for user 'rpcuser' (global, action=ADMIN)
> at 
> org.apache.hadoop.hbase.security.access.AccessController.requireGlobalPermission(AccessController.java:548)
> at 
> org.apache.hadoop.hbase.security.access.AccessController.requirePermission(AccessController.java:522)
> at 
> org.apache.hadoop.hbase.security.access.AccessController.preStopRegionServer(AccessController.java:2501)
> at 
> org.apache.hadoop.hbase.regionserver.RegionServerCoprocessorHost$1.call(RegionServerCoprocessorHost.java:86)
> at 
> org.apache.hadoop.hbase.regionserver.RegionServerCoprocessorHost.execShutdown(RegionServerCoprocessorHost.java:300)
> at 
> org.apache.hadoop.hbase.regionserver.RegionServerCoprocessorHost.preStop(RegionServerCoprocessorHost.java:82)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.stop(HRegionServer.java:1905)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2118)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2125)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.abortRegionServer(HRegion.java:6326)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.handleFileNotFound(HRegion.java:6319)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:5941)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:6084)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:5858)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2649)
> at 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:34950)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2320)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:123)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:188)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:168)
> {noformat}
> I haven't yet evaluated which other release branches this might apply to.
> I have a patch currently in progress, which I will post as soon as I complete 
> a test case.





[jira] [Updated] (HBASE-18141) Regionserver fails to shutdown when abort triggered in RegionScannerImpl during RPC call

2017-06-07 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-18141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-18141:
--
Attachment: HBASE-18141.branch-1.3.002.patch

Updated patch against branch-1.3 adding a file header and test category to the 
test case.

Both this and the master patch contain the same 2 fixes to the regionserver 
abort path, in slightly different forms:

# adds a flag to stop() indicating whether any exception thrown by the 
RegionServerObserver.preStopRegionServer() hook should be ignored
# ensures that abort() calls stop() using the logged-in process user, not the 
RPC caller
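
The two fixes can be sketched as below. This is a simplified, hypothetical illustration; the class and method names do not match the real HRegionServer code, and the user-switching in fix 2 is elided.

```java
// Thrown by a coprocessor hook (e.g. an AccessController-style check)
// rejecting the caller.
class CoprocessorVetoException extends RuntimeException {}

class RegionServerSketch {
    boolean stopped = false;

    void preStopHook() {
        // Models a hook that denies the stop for the current caller.
        throw new CoprocessorVetoException();
    }

    // Fix 1: stop() carries a flag saying whether a coprocessor veto may be
    // ignored.
    void stop(String why, boolean force) {
        try {
            preStopHook();
        } catch (CoprocessorVetoException e) {
            if (!force) {
                return; // user-requested stop: honor the veto
            }
            // internally triggered abort: log and ignore the veto
        }
        stopped = true;
    }

    // Fix 2: abort() always forces the stop (and, in the real patch, runs it
    // as the logged-in process user rather than the RPC caller).
    void abort(String why) {
        stop(why, true);
    }
}
```

The effect is that an externally requested stop can still be rejected by a coprocessor, while an internal abort always brings the regionserver down.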

> Regionserver fails to shutdown when abort triggered in RegionScannerImpl 
> during RPC call
> 
>
> Key: HBASE-18141
> URL: https://issues.apache.org/jira/browse/HBASE-18141
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, security
>Affects Versions: 1.3.1
>Reporter: Gary Helmling
>Assignee: Gary Helmling
>Priority: Critical
> Fix For: 1.3.2
>
> Attachments: HBASE-18141.001.patch, HBASE-18141.branch-1.3.001.patch, 
> HBASE-18141.branch-1.3.002.patch
>
>
> When an abort is triggered within the RPC call path by 
> HRegion.RegionScannerImpl, AccessController incorrectly applies the RPC 
> caller identity in the RegionServerObserver.preStopRegionServer() hook.  This 
> leaves the regionserver in a non-responsive state, where its regions are not 
> reassigned and it returns exceptions for all requests.
> When an abort is triggered on the server side, we should not allow a 
> coprocessor to reject the abort at all.
> Here is a sample stack trace:
> {noformat}
> 17/05/25 06:10:29 FATAL regionserver.HRegionServer: RegionServer abort: 
> loaded coprocessors are: 
> [org.apache.hadoop.hbase.security.access.AccessController, 
> org.apache.hadoop.hbase.security.token.TokenProvider]
> 17/05/25 06:10:29 WARN regionserver.HRegionServer: The region server did not 
> stop
> org.apache.hadoop.hbase.security.AccessDeniedException: Insufficient 
> permissions for user 'rpcuser' (global, action=ADMIN)
> at 
> org.apache.hadoop.hbase.security.access.AccessController.requireGlobalPermission(AccessController.java:548)
> at 
> org.apache.hadoop.hbase.security.access.AccessController.requirePermission(AccessController.java:522)
> at 
> org.apache.hadoop.hbase.security.access.AccessController.preStopRegionServer(AccessController.java:2501)
> at 
> org.apache.hadoop.hbase.regionserver.RegionServerCoprocessorHost$1.call(RegionServerCoprocessorHost.java:86)
> at 
> org.apache.hadoop.hbase.regionserver.RegionServerCoprocessorHost.execShutdown(RegionServerCoprocessorHost.java:300)
> at 
> org.apache.hadoop.hbase.regionserver.RegionServerCoprocessorHost.preStop(RegionServerCoprocessorHost.java:82)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.stop(HRegionServer.java:1905)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2118)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2125)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.abortRegionServer(HRegion.java:6326)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.handleFileNotFound(HRegion.java:6319)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:5941)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:6084)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:5858)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2649)
> at 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:34950)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2320)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:123)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:188)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:168)
> {noformat}
> I haven't yet evaluated which other release branches this might apply to.
> I have a patch currently in progress, which I will post as soon as I complete 
> a test case.





[jira] [Updated] (HBASE-18141) Regionserver fails to shutdown when abort triggered in RegionScannerImpl during RPC call

2017-06-07 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-18141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-18141:
--
Attachment: HBASE-18141.001.patch

Attaching a patch against master with the fixes to HRegionServer.abort() and 
HRegionServer.stop(), along with a more targeted test.

> Regionserver fails to shutdown when abort triggered in RegionScannerImpl 
> during RPC call
> 
>
> Key: HBASE-18141
> URL: https://issues.apache.org/jira/browse/HBASE-18141
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, security
>Affects Versions: 1.3.1
>Reporter: Gary Helmling
>Assignee: Gary Helmling
>Priority: Critical
> Fix For: 1.3.2
>
> Attachments: HBASE-18141.001.patch, HBASE-18141.branch-1.3.001.patch
>
>
> When an abort is triggered within the RPC call path by 
> HRegion.RegionScannerImpl, AccessController incorrectly applies the RPC 
> caller identity in the RegionServerObserver.preStopRegionServer() hook.  This 
> leaves the regionserver in a non-responsive state, where its regions are not 
> reassigned and it returns exceptions for all requests.
> When an abort is triggered on the server side, we should not allow a 
> coprocessor to reject the abort at all.
> Here is a sample stack trace:
> {noformat}
> 17/05/25 06:10:29 FATAL regionserver.HRegionServer: RegionServer abort: 
> loaded coprocessors are: 
> [org.apache.hadoop.hbase.security.access.AccessController, 
> org.apache.hadoop.hbase.security.token.TokenProvider]
> 17/05/25 06:10:29 WARN regionserver.HRegionServer: The region server did not 
> stop
> org.apache.hadoop.hbase.security.AccessDeniedException: Insufficient 
> permissions for user 'rpcuser' (global, action=ADMIN)
> at 
> org.apache.hadoop.hbase.security.access.AccessController.requireGlobalPermission(AccessController.java:548)
> at 
> org.apache.hadoop.hbase.security.access.AccessController.requirePermission(AccessController.java:522)
> at 
> org.apache.hadoop.hbase.security.access.AccessController.preStopRegionServer(AccessController.java:2501)
> at 
> org.apache.hadoop.hbase.regionserver.RegionServerCoprocessorHost$1.call(RegionServerCoprocessorHost.java:86)
> at 
> org.apache.hadoop.hbase.regionserver.RegionServerCoprocessorHost.execShutdown(RegionServerCoprocessorHost.java:300)
> at 
> org.apache.hadoop.hbase.regionserver.RegionServerCoprocessorHost.preStop(RegionServerCoprocessorHost.java:82)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.stop(HRegionServer.java:1905)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2118)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2125)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.abortRegionServer(HRegion.java:6326)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.handleFileNotFound(HRegion.java:6319)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:5941)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:6084)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:5858)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2649)
> at 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:34950)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2320)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:123)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:188)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:168)
> {noformat}
> I haven't yet evaluated which other release branches this might apply to.
> I have a patch currently in progress, which I will post as soon as I complete 
> a test case.





[jira] [Commented] (HBASE-18141) Regionserver fails to shutdown when abort triggered in RegionScannerImpl during RPC call

2017-06-02 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16034244#comment-16034244
 ] 

Gary Helmling commented on HBASE-18141:
---

The code path where I originally observed this and that TestRegionServerAbort 
exercises has been changed in master and branch-1 by HBASE-17712, so that the 
regionserver no longer aborts.  I think the changes to HRegionServer are still 
relevant, as I do not think a coprocessor should ever be able to reject an 
internally triggered abort.  However, I will need to rework the test for master.

> Regionserver fails to shutdown when abort triggered in RegionScannerImpl 
> during RPC call
> 
>
> Key: HBASE-18141
> URL: https://issues.apache.org/jira/browse/HBASE-18141
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, security
>Affects Versions: 1.3.1
>Reporter: Gary Helmling
>Assignee: Gary Helmling
>Priority: Critical
> Fix For: 1.3.2
>
> Attachments: HBASE-18141.branch-1.3.001.patch
>
>
> When an abort is triggered within the RPC call path by 
> HRegion.RegionScannerImpl, AccessController incorrectly applies the RPC 
> caller identity in the RegionServerObserver.preStopRegionServer() hook.  This 
> leaves the regionserver in a non-responsive state, where its regions are not 
> reassigned and it returns exceptions for all requests.
> When an abort is triggered on the server side, we should not allow a 
> coprocessor to reject the abort at all.
> Here is a sample stack trace:
> {noformat}
> 17/05/25 06:10:29 FATAL regionserver.HRegionServer: RegionServer abort: 
> loaded coprocessors are: 
> [org.apache.hadoop.hbase.security.access.AccessController, 
> org.apache.hadoop.hbase.security.token.TokenProvider]
> 17/05/25 06:10:29 WARN regionserver.HRegionServer: The region server did not 
> stop
> org.apache.hadoop.hbase.security.AccessDeniedException: Insufficient 
> permissions for user 'rpcuser' (global, action=ADMIN)
> at 
> org.apache.hadoop.hbase.security.access.AccessController.requireGlobalPermission(AccessController.java:548)
> at 
> org.apache.hadoop.hbase.security.access.AccessController.requirePermission(AccessController.java:522)
> at 
> org.apache.hadoop.hbase.security.access.AccessController.preStopRegionServer(AccessController.java:2501)
> at 
> org.apache.hadoop.hbase.regionserver.RegionServerCoprocessorHost$1.call(RegionServerCoprocessorHost.java:86)
> at 
> org.apache.hadoop.hbase.regionserver.RegionServerCoprocessorHost.execShutdown(RegionServerCoprocessorHost.java:300)
> at 
> org.apache.hadoop.hbase.regionserver.RegionServerCoprocessorHost.preStop(RegionServerCoprocessorHost.java:82)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.stop(HRegionServer.java:1905)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2118)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2125)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.abortRegionServer(HRegion.java:6326)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.handleFileNotFound(HRegion.java:6319)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:5941)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:6084)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:5858)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2649)
> at 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:34950)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2320)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:123)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:188)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:168)
> {noformat}
> I haven't yet evaluated which other release branches this might apply to.
> I have a patch currently in progress, which I will post as soon as I complete 
> a test case.





[jira] [Updated] (HBASE-18141) Regionserver fails to shutdown when abort triggered in RegionScannerImpl during RPC call

2017-06-01 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-18141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-18141:
--
Status: Patch Available  (was: Open)

> Regionserver fails to shutdown when abort triggered in RegionScannerImpl 
> during RPC call
> 
>
> Key: HBASE-18141
> URL: https://issues.apache.org/jira/browse/HBASE-18141
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, security
>Affects Versions: 1.3.1
>Reporter: Gary Helmling
>Assignee: Gary Helmling
>Priority: Critical
> Fix For: 1.3.2
>
> Attachments: HBASE-18141.branch-1.3.001.patch
>
>
> When an abort is triggered within the RPC call path by 
> HRegion.RegionScannerImpl, AccessController incorrectly applies the RPC 
> caller identity in the RegionServerObserver.preStopRegionServer() hook.  This 
> leaves the regionserver in a non-responsive state, where its regions are not 
> reassigned and it returns exceptions for all requests.
> When an abort is triggered on the server side, we should not allow a 
> coprocessor to reject the abort at all.
> Here is a sample stack trace:
> {noformat}
> 17/05/25 06:10:29 FATAL regionserver.HRegionServer: RegionServer abort: 
> loaded coprocessors are: 
> [org.apache.hadoop.hbase.security.access.AccessController, 
> org.apache.hadoop.hbase.security.token.TokenProvider]
> 17/05/25 06:10:29 WARN regionserver.HRegionServer: The region server did not 
> stop
> org.apache.hadoop.hbase.security.AccessDeniedException: Insufficient 
> permissions for user 'rpcuser' (global, action=ADMIN)
> at 
> org.apache.hadoop.hbase.security.access.AccessController.requireGlobalPermission(AccessController.java:548)
> at 
> org.apache.hadoop.hbase.security.access.AccessController.requirePermission(AccessController.java:522)
> at 
> org.apache.hadoop.hbase.security.access.AccessController.preStopRegionServer(AccessController.java:2501)
> at 
> org.apache.hadoop.hbase.regionserver.RegionServerCoprocessorHost$1.call(RegionServerCoprocessorHost.java:86)
> at 
> org.apache.hadoop.hbase.regionserver.RegionServerCoprocessorHost.execShutdown(RegionServerCoprocessorHost.java:300)
> at 
> org.apache.hadoop.hbase.regionserver.RegionServerCoprocessorHost.preStop(RegionServerCoprocessorHost.java:82)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.stop(HRegionServer.java:1905)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2118)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2125)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.abortRegionServer(HRegion.java:6326)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.handleFileNotFound(HRegion.java:6319)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:5941)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:6084)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:5858)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2649)
> at 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:34950)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2320)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:123)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:188)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:168)
> {noformat}
> I haven't yet evaluated which other release branches this might apply to.
> I have a patch currently in progress, which I will post as soon as I complete 
> a test case.





[jira] [Updated] (HBASE-18141) Regionserver fails to shutdown when abort triggered in RegionScannerImpl during RPC call

2017-06-01 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-18141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-18141:
--
Attachment: HBASE-18141.branch-1.3.001.patch

Attaching a patch against branch-1.3.

> Regionserver fails to shutdown when abort triggered in RegionScannerImpl 
> during RPC call
> 
>
> Key: HBASE-18141
> URL: https://issues.apache.org/jira/browse/HBASE-18141
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, security
>Affects Versions: 1.3.1
>Reporter: Gary Helmling
>Assignee: Gary Helmling
>Priority: Critical
> Fix For: 1.3.2
>
> Attachments: HBASE-18141.branch-1.3.001.patch
>
>
> When an abort is triggered within the RPC call path by 
> HRegion.RegionScannerImpl, AccessController incorrectly applies the RPC 
> caller identity in the RegionServerObserver.preStopRegionServer() hook.  This 
> leaves the regionserver in a non-responsive state, where its regions are not 
> reassigned and it returns exceptions for all requests.
> When an abort is triggered on the server side, we should not allow a 
> coprocessor to reject the abort at all.
> Here is a sample stack trace:
> {noformat}
> 17/05/25 06:10:29 FATAL regionserver.HRegionServer: RegionServer abort: 
> loaded coprocessors are: 
> [org.apache.hadoop.hbase.security.access.AccessController, 
> org.apache.hadoop.hbase.security.token.TokenProvider]
> 17/05/25 06:10:29 WARN regionserver.HRegionServer: The region server did not 
> stop
> org.apache.hadoop.hbase.security.AccessDeniedException: Insufficient 
> permissions for user 'rpcuser' (global, action=ADMIN)
> at 
> org.apache.hadoop.hbase.security.access.AccessController.requireGlobalPermission(AccessController.java:548)
> at 
> org.apache.hadoop.hbase.security.access.AccessController.requirePermission(AccessController.java:522)
> at 
> org.apache.hadoop.hbase.security.access.AccessController.preStopRegionServer(AccessController.java:2501)
> at 
> org.apache.hadoop.hbase.regionserver.RegionServerCoprocessorHost$1.call(RegionServerCoprocessorHost.java:86)
> at 
> org.apache.hadoop.hbase.regionserver.RegionServerCoprocessorHost.execShutdown(RegionServerCoprocessorHost.java:300)
> at 
> org.apache.hadoop.hbase.regionserver.RegionServerCoprocessorHost.preStop(RegionServerCoprocessorHost.java:82)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.stop(HRegionServer.java:1905)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2118)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2125)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.abortRegionServer(HRegion.java:6326)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.handleFileNotFound(HRegion.java:6319)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:5941)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:6084)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:5858)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2649)
> at 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:34950)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2320)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:123)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:188)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:168)
> {noformat}
> I haven't yet evaluated which other release branches this might apply to.
> I have a patch currently in progress, which I will post as soon as I complete 
> a test case.





[jira] [Created] (HBASE-18141) Regionserver fails to shutdown when abort triggered in RegionScannerImpl during RPC call

2017-05-31 Thread Gary Helmling (JIRA)
Gary Helmling created HBASE-18141:
-

 Summary: Regionserver fails to shutdown when abort triggered in 
RegionScannerImpl during RPC call
 Key: HBASE-18141
 URL: https://issues.apache.org/jira/browse/HBASE-18141
 Project: HBase
  Issue Type: Bug
  Components: regionserver, security
Affects Versions: 1.3.1
Reporter: Gary Helmling
Assignee: Gary Helmling
Priority: Critical
 Fix For: 1.3.2


When an abort is triggered within the RPC call path by 
HRegion.RegionScannerImpl, AccessController incorrectly applies the RPC caller 
identity in the RegionServerObserver.preStopRegionServer() hook.  This leaves 
the regionserver in a non-responsive state, where its regions are not 
reassigned and it returns exceptions for all requests.

When an abort is triggered on the server side, we should not allow a 
coprocessor to reject the abort at all.

Here is a sample stack trace:
{noformat}
17/05/25 06:10:29 FATAL regionserver.HRegionServer: RegionServer abort: loaded 
coprocessors are: [org.apache.hadoop.hbase.security.access.AccessController, 
org.apache.hadoop.hbase.security.token.TokenProvider]
17/05/25 06:10:29 WARN regionserver.HRegionServer: The region server did not 
stop
org.apache.hadoop.hbase.security.AccessDeniedException: Insufficient 
permissions for user 'rpcuser' (global, action=ADMIN)
at 
org.apache.hadoop.hbase.security.access.AccessController.requireGlobalPermission(AccessController.java:548)
at 
org.apache.hadoop.hbase.security.access.AccessController.requirePermission(AccessController.java:522)
at 
org.apache.hadoop.hbase.security.access.AccessController.preStopRegionServer(AccessController.java:2501)
at 
org.apache.hadoop.hbase.regionserver.RegionServerCoprocessorHost$1.call(RegionServerCoprocessorHost.java:86)
at 
org.apache.hadoop.hbase.regionserver.RegionServerCoprocessorHost.execShutdown(RegionServerCoprocessorHost.java:300)
at 
org.apache.hadoop.hbase.regionserver.RegionServerCoprocessorHost.preStop(RegionServerCoprocessorHost.java:82)
at 
org.apache.hadoop.hbase.regionserver.HRegionServer.stop(HRegionServer.java:1905)
at 
org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2118)
at 
org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2125)
at 
org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.abortRegionServer(HRegion.java:6326)
at 
org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.handleFileNotFound(HRegion.java:6319)
at 
org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:5941)
at 
org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:6084)
at 
org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:5858)
at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2649)
at 
org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:34950)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2320)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:123)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:188)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:168)
{noformat}

I haven't yet evaluated which other release branches this might apply to.

I have a patch currently in progress, which I will post as soon as I complete a 
test case.





[jira] [Commented] (HBASE-18095) Provide an option for clients to find the server hosting META that does not involve the ZooKeeper client

2017-05-30 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16030415#comment-16030415
 ] 

Gary Helmling commented on HBASE-18095:
---

bq. Cluster ID lookup is most easily accomplished with a new servlet on the 
HTTP(S) endpoint on the masters, serving the cluster ID as plain text

Seems like the best option.  Would this still work if SPNEGO is enabled?  If 
not, I'm not sure what else we can do short of a totally new endpoint.
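
A minimal sketch of the proposed plain-text cluster ID endpoint, using only the JDK's built-in HTTP server so it runs standalone. Everything here is hypothetical: the real implementation would be a servlet on the master's existing info port, and {{/clusterid}} and the ID value are made-up names.

```java
import com.sun.net.httpserver.HttpServer;

import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class ClusterIdEndpointSketch {
    static String serveAndFetch() throws Exception {
        String clusterId = "example-cluster-id"; // stand-in for the real UUID

        // Serve the cluster ID as plain text and unauthenticated, so a
        // delegation-token client can bootstrap without a ZooKeeper client.
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/clusterid", exchange -> {
            byte[] body = clusterId.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "text/plain");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        try {
            // Client side: a single GET, no ZK connection involved.
            URL url = new URL("http://localhost:"
                + server.getAddress().getPort() + "/clusterid");
            try (InputStream in = url.openStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serveAndFetch()); // prints example-cluster-id
    }
}
```

Whether such an endpoint can stay unauthenticated when SPNEGO protects the rest of the info server is exactly the open question above.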

> Provide an option for clients to find the server hosting META that does not 
> involve the ZooKeeper client
> 
>
> Key: HBASE-18095
> URL: https://issues.apache.org/jira/browse/HBASE-18095
> Project: HBase
>  Issue Type: New Feature
>  Components: Client
>Reporter: Andrew Purtell
>
> Clients are required to connect to ZooKeeper to find the location of the 
> regionserver hosting the meta table region. Site configuration provides the 
> client a list of ZK quorum peers and the client uses an embedded ZK client to 
> query meta location. Timeouts and retry behavior of this embedded ZK client 
> are managed orthogonally to HBase layer settings and in some cases the ZK 
> cannot manage what in theory the HBase client can, i.e. fail fast upon outage 
> or network partition.
> We should consider new configuration settings that provide a list of 
> well-known master and backup master locations, and with this information the 
> client can contact any of the master processes directly. Any master in either 
> active or passive state will track meta location and respond to requests for 
> it with its cached last known location. If this location is stale, the client 
> can ask again with a flag set that requests the master refresh its location 
> cache and return the up-to-date location. Every client interaction with the 
> cluster thus uses only HBase RPC as transport, with appropriate settings 
> applied to the connection. The configuration toggle that enables this 
> alternative meta location lookup should be false by default.
> This removes the requirement that HBase clients embed the ZK client and 
> contact the ZK service directly at the beginning of the connection lifecycle. 
> This has several benefits. ZK service need not be exposed to clients, and 
> their potential abuse, yet no benefit ZK provides the HBase server cluster is 
> compromised. Normalizing HBase client and ZK client timeout settings and 
> retry behavior - in some cases, impossible, i.e. for fail-fast - is no longer 
> necessary. 
> And, from [~ghelmling]: There is an additional complication here for 
> token-based authentication. When a delegation token is used for SASL 
> authentication, the client uses the cluster ID obtained from Zookeeper to 
> select the token identifier to use. So there would also need to be some 
> Zookeeper-less, unauthenticated way to obtain the cluster ID as well. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HBASE-18095) Provide an option for clients to find the server hosting META that does not involve the ZooKeeper client

2017-05-23 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021661#comment-16021661
 ] 

Gary Helmling commented on HBASE-18095:
---

There is an additional complication here for token-based authentication.  When 
a delegation token is used for SASL authentication, the client uses the cluster 
ID obtained from Zookeeper to select the token identifier to use.  So there 
would also need to be some Zookeeper-less, unauthenticated way to obtain the 
cluster ID as well.

> Provide an option for clients to find the server hosting META that does not 
> involve the ZooKeeper client
> 
>
> Key: HBASE-18095
> URL: https://issues.apache.org/jira/browse/HBASE-18095
> Project: HBase
>  Issue Type: New Feature
>  Components: Client
>Reporter: Andrew Purtell
>
> Clients are required to connect to ZooKeeper to find the location of the 
> regionserver hosting the meta table region. Site configuration provides the 
> client a list of ZK quorum peers and the client uses an embedded ZK client to 
> query meta location. Timeouts and retry behavior of this embedded ZK client 
> are managed orthogonally to HBase-layer settings, and in some cases the ZK 
> client cannot do what in theory the HBase client can, i.e. fail fast upon 
> outage or network partition.
> We should consider new configuration settings that provide a list of 
> well-known master and backup master locations, and with this information the 
> client can contact any of the master processes directly. Any master in either 
> active or passive state will track meta location and respond to requests for 
> it with its cached last known location. If this location is stale, the client 
> can ask again with a flag set that requests the master refresh its location 
> cache and return the up-to-date location. Every client interaction with the 
> cluster thus uses only HBase RPC as transport, with appropriate settings 
> applied to the connection. The configuration toggle that enables this 
> alternative meta location lookup should be false by default.
> This removes the requirement that HBase clients embed the ZK client and 
> contact the ZK service directly at the beginning of the connection lifecycle. 
> This has several benefits. The ZK service need not be exposed to clients and 
> their potential abuse, while no benefit ZK provides to the HBase server 
> cluster is compromised. Normalizing HBase client and ZK client timeout 
> settings and retry behavior - in some cases impossible, e.g. for fail-fast - 
> is no longer necessary. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HBASE-18072) Malformed Cell from client causes Regionserver abort on flush

2017-05-18 Thread Gary Helmling (JIRA)
Gary Helmling created HBASE-18072:
-

 Summary: Malformed Cell from client causes Regionserver abort on 
flush
 Key: HBASE-18072
 URL: https://issues.apache.org/jira/browse/HBASE-18072
 Project: HBase
  Issue Type: Bug
  Components: regionserver, rpc
Affects Versions: 1.3.0
Reporter: Gary Helmling
Assignee: Gary Helmling
Priority: Critical


When a client writes a mutation with a Cell with a corrupted value length 
field, it is possible for the corrupt cell to trigger an exception on memstore 
flush, which will trigger regionserver aborts until the region is manually 
recovered.

This boils down to a lack of validation on the client-submitted byte[] backing 
the cell.

Consider the following sequence:

1. Client creates a new Put with a cell with value of byte[16]
2. When the backing KeyValue for the Put is created, we serialize 16 for the 
value length field in the backing array
3. Client calls Table.put()
4. RpcClientImpl calls KeyValueEncoder.encode() to serialize the Cell to the 
OutputStream
5. Memory corruption in the backing array changes the serialized contents of 
the value length field from 16 to 48
6. Regionserver handling the put uses KeyValueDecoder.decode() to create a 
KeyValue with the byte[] read directly off the InputStream.  The overall length 
of the array is correct, but the integer value serialized at the value length 
offset has been corrupted from the original value of 16 to 48.
7. The corrupt KeyValue is appended to the WAL and added to the memstore
8. After some time, the memstore flushes.  As HFileWriter is writing out the 
corrupted cell, it reads the serialized int from the value length position in 
the cell's byte[] to determine the number of bytes to write for the value.  
Because value offset + 48 is greater than the length of the cell's byte[], we 
hit an IndexOutOfBoundsException:
{noformat}
java.lang.IndexOutOfBoundsException
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:151)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at 
org.apache.hadoop.hbase.io.hfile.NoOpDataBlockEncoder.encode(NoOpDataBlockEncoder.java:56)
at 
org.apache.hadoop.hbase.io.hfile.HFileBlock$Writer.write(HFileBlock.java:954)
at 
org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:284)
at 
org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(HFileWriterV3.java:87)
at 
org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:1041)
at 
org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:138)
at 
org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:75)
at 
org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:937)
at 
org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2413)
at 
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2456)
{noformat}
9. Regionserver aborts due to the failed flush
10. The regionserver WAL is split into recovered.edits files, one of these 
containing the same corrupted cell
11. A new regionserver is assigned the region with the corrupted write
12. The new regionserver replays the recovered.edits entries into memstore and 
then tries to flush the memstore to an HFile
13. The flush triggers the same IndexOutOfBoundsException, causing us to go 
back to step #8 and loop on repeat until manual intervention is taken

The corrupted cell basically becomes a poison pill that aborts regionservers 
one at a time as the region with the problem edit is passed around.  This also 
means that a malicious client could easily construct requests allowing a denial 
of service attack against regionservers hosting any tables that the client has 
write access to.

At bare minimum, I think we need to do a sanity check on all the lengths for 
Cells read off the CellScanner for incoming requests.  This would allow us to 
reject corrupt cells before we append them to the WAL and acknowledge the 
request, which would put us in an unrecoverable position.  This would only 
detect corruption of the length fields, which is what puts us in a bad state.

Whether or not Cells should carry some checksum generated at the time the Cell 
is created, which could then be validated on the server side, is a separate 
question.  This would allow detection of corruption in other parts of the 
backing cell byte[], such as within the key fields or the value field.  But 
the compute overhead of this may be too heavyweight to be practical.
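A minimal sketch of the length sanity check suggested above, over a simplified cell layout ([4-byte key length][4-byte value length][key][value]) rather than the real KeyValue wire format: the check rejects any cell whose declared lengths do not exactly cover the backing array, which would catch the 16-to-48 corruption in the sequence above before the cell reaches the WAL or memstore.

```java
import java.nio.ByteBuffer;

/**
 * Illustrative length sanity check over a simplified cell layout:
 * [4-byte key length][4-byte value length][key bytes][value bytes].
 * Not the actual KeyValue format, which carries additional fields.
 */
public class CellLengthCheck {

  /** Serialize a simplified cell with length-prefixed key and value. */
  static byte[] encode(byte[] key, byte[] value) {
    ByteBuffer buf = ByteBuffer.allocate(8 + key.length + value.length);
    buf.putInt(key.length).putInt(value.length).put(key).put(value);
    return buf.array();
  }

  /** Return true iff the declared lengths exactly cover the backing array. */
  static boolean isSane(byte[] cell) {
    if (cell.length < 8) {
      return false;                // too short to hold both length fields
    }
    ByteBuffer buf = ByteBuffer.wrap(cell);
    long keyLen = buf.getInt();
    long valueLen = buf.getInt();
    // Negative lengths or lengths running past the end of the array are corrupt.
    return keyLen >= 0 && valueLen >= 0 && 8 + keyLen + valueLen == cell.length;
  }
}
```

Run server-side at decode time, such a check lets the regionserver fail the individual request instead of accepting a poison-pill edit.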



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HBASE-17884) Backport HBASE-16217 to branch-1

2017-04-05 Thread Gary Helmling (JIRA)
Gary Helmling created HBASE-17884:
-

 Summary: Backport HBASE-16217 to branch-1
 Key: HBASE-17884
 URL: https://issues.apache.org/jira/browse/HBASE-17884
 Project: HBase
  Issue Type: Sub-task
Reporter: Gary Helmling


The change to add calling user to ObserverContext in HBASE-16217 should also be 
applied to branch-1 to avoid use of UserGroupInformation.doAs() for access 
control checks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HBASE-16217) Identify calling user in ObserverContext

2017-04-05 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-16217:
--
   Resolution: Fixed
Fix Version/s: (was: 1.4.0)
   Status: Resolved  (was: Patch Available)

This was committed to master quite a while ago and the patch against branch-1 
has gone way stale while waiting on a hibernating HadoopQA.  I'll close this 
out and open a separate JIRA for a backport.

> Identify calling user in ObserverContext
> 
>
> Key: HBASE-16217
> URL: https://issues.apache.org/jira/browse/HBASE-16217
> Project: HBase
>  Issue Type: Sub-task
>  Components: Coprocessors, security
>Reporter: Gary Helmling
>Assignee: Gary Helmling
> Fix For: 2.0.0
>
> Attachments: HBASE-16217.branch-1.001.patch, 
> HBASE-16217.master.001.patch, HBASE-16217.master.002.patch, 
> HBASE-16217.master.003.patch
>
>
> We already either explicitly pass down the relevant User instance initiating 
> an action through the call path, or it is available through 
> RpcServer.getRequestUser().  We should carry this through in the 
> ObserverContext for coprocessor upcalls and make use of it for permissions 
> checking.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (HBASE-17816) HRegion#mutateRowWithLocks should update writeRequestCount metric

2017-03-29 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling reassigned HBASE-17816:
-

Assignee: Weizhan Zeng  (was: Ashu Pachauri)

[~Weizhan Zeng], I've assigned the issue to you.  This will allow you to upload 
a patch.  From the issue, select More -> Attach Files.

> HRegion#mutateRowWithLocks should update writeRequestCount metric
> -
>
> Key: HBASE-17816
> URL: https://issues.apache.org/jira/browse/HBASE-17816
> Project: HBase
>  Issue Type: Bug
>Reporter: Ashu Pachauri
>Assignee: Weizhan Zeng
> Attachments: HBASE-17816.master.001.patch
>
>
> Currently, all the calls that use HRegion#mutateRowWithLocks miss 
> writeRequestCount metric. The mutateRowWithLocks base method should update 
> the metric.
> Examples are checkAndMutate calls through RSRpcServices#multi, 
> Region#mutateRow api , MultiRowMutationProcessor coprocessor endpoint.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HBASE-17827) Client tools relying on AuthUtil.getAuthChore() break credential cache login

2017-03-27 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943895#comment-15943895
 ] 

Gary Helmling commented on HBASE-17827:
---

bq. Are you still going to use the same chore mechanism to re-login, from 
cache, even the cache has limited lifetime?

No, for logins from the credential cache, a background thread in 
UserGroupInformation will renew the TGT up to the ticket lifetime.  So there's 
nothing for the chore to do here.  The idea is to just have getAuthChore() 
return null, same as if security is not configured, and let the normal UGI 
login from credential cache happen.  I'll put up a patch later today.

> Client tools relying on AuthUtil.getAuthChore() break credential cache login
> 
>
> Key: HBASE-17827
> URL: https://issues.apache.org/jira/browse/HBASE-17827
> Project: HBase
>  Issue Type: Bug
>  Components: canary, security
>Reporter: Gary Helmling
>Assignee: Gary Helmling
>Priority: Critical
>
> Client tools, such as Canary, which make use of keytab based logins with 
> AuthUtil.getAuthChore() do not allow any way to continue without a 
> keytab-based login when security is enabled.  Currently, when security is 
> enabled and the configuration lacks {{hbase.client.keytab.file}}, these tools 
> would fail with:
> {noformat}
> ERROR hbase.AuthUtil: Error while trying to perform the initial login: 
> Running in secure mode, but config doesn't have a keytab
> java.io.IOException: Running in secure mode, but config doesn't have a keytab
> at 
> org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:239)
> at 
> org.apache.hadoop.hbase.security.User$SecureHadoopUser.login(User.java:420)
> at org.apache.hadoop.hbase.security.User.login(User.java:258)
> at 
> org.apache.hadoop.hbase.security.UserProvider.login(UserProvider.java:197)
> at org.apache.hadoop.hbase.AuthUtil.getAuthChore(AuthUtil.java:98)
> at org.apache.hadoop.hbase.tool.Canary.run(Canary.java:589)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.hadoop.hbase.tool.Canary.main(Canary.java:1327)
> Exception in thread "main" java.io.IOException: Running in secure mode, but 
> config doesn't have a keytab
> at 
> org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:239)
> at 
> org.apache.hadoop.hbase.security.User$SecureHadoopUser.login(User.java:420)
> at org.apache.hadoop.hbase.security.User.login(User.java:258)
> at 
> org.apache.hadoop.hbase.security.UserProvider.login(UserProvider.java:197)
> at org.apache.hadoop.hbase.AuthUtil.getAuthChore(AuthUtil.java:98)
> at org.apache.hadoop.hbase.tool.Canary.run(Canary.java:589)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.hadoop.hbase.tool.Canary.main(Canary.java:1327)
> {noformat}
> These tools should still work with the default credential-cache login, at 
> least when a client keytab is not configured.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HBASE-16755) Honor flush policy under global memstore pressure

2017-03-27 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-16755:
--
  Resolution: Fixed
Release Note: Prior to this change, when the memstore low water mark is 
exceeded on a regionserver, the regionserver will force flush all stores on the 
regions selected for flushing until we drop below the low water mark.  With 
this change, the regionserver will continue to force flush regions when above 
the memstore low water mark, but will only flush the stores returned by the 
configured FlushPolicy.
  Status: Resolved  (was: Patch Available)

Committed to branch-1.3+.

[~ashu210890], please open a new issue to add safety checking in master & 
branch-1 that the flush policy has actually flushed something, as discussed in 
the comments here.
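The behavior in the release note can be sketched as follows, with illustrative names (not the actual HStore/FlushPolicy code): under global memstore pressure the flusher still force-flushes selected regions, but only the stores chosen by the configured policy, falling back to all stores so a forced flush always frees memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * Sketch of honoring a flush policy under memstore pressure.
 * Names are illustrative, not the real regionserver classes.
 */
public class PressureFlushSketch {

  interface FlushPolicy {
    /** Given store name -> memstore size, pick the stores worth flushing. */
    List<String> selectStoresToFlush(Map<String, Long> storeSizes);
  }

  /** Only flush stores above a per-store threshold (FlushLargeStoresPolicy-like). */
  static FlushPolicy largeStoresPolicy(long threshold) {
    return sizes -> {
      List<String> picked = new ArrayList<>();
      for (Map.Entry<String, Long> e : sizes.entrySet()) {
        if (e.getValue() >= threshold) {
          picked.add(e.getKey());
        }
      }
      // Safety fallback: if nothing crosses the threshold, flush everything
      // so a forced flush under pressure always frees some memory.
      return picked.isEmpty() ? new ArrayList<>(sizes.keySet()) : picked;
    };
  }

  /** Forced flush path: defer store selection to the configured policy. */
  static List<String> flushUnderPressure(FlushPolicy policy, Map<String, Long> storeSizes) {
    return policy.selectStoresToFlush(storeSizes);
  }
}
```

The fallback branch is the safety check discussed in the comments: it guarantees progress regardless of the policy implementation.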

> Honor flush policy under global memstore pressure
> -
>
> Key: HBASE-16755
> URL: https://issues.apache.org/jira/browse/HBASE-16755
> Project: HBase
>  Issue Type: Improvement
>  Components: regionserver
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
> Fix For: 1.3.1
>
> Attachments: HBASE-16755.v0.patch
>
>
> When global memstore reaches the low water mark, we pick the best flushable 
> region and flush all column families for it. This is a suboptimal approach in 
> the sense that it leads to an unnecessarily high file creation rate and IO 
> amplification due to compactions. We should still try to honor the underlying 
> FlushPolicy.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (HBASE-17827) Client tools relying on AuthUtil.getAuthChore() break credential cache login

2017-03-23 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939210#comment-15939210
 ] 

Gary Helmling edited comment on HBASE-17827 at 3/23/17 8:53 PM:


bq. counter argument (that I don't think I agree with): AuthUtil.getAuthChore 
is meant for long running applications and as such users shouldn't be deploying 
applications that are based on it using a local kinit that will inevitably fail 
once renewal lifetimes are exceeded.

Yeah, there's certainly an argument there, which is useful in thinking about 
how to approach this.  My first take is if hbase.client.keytab.file is not 
configured or is empty, to log a warning and fall back to the credential cache 
behavior.  The log would at least give an indication on what it's doing, with 
instructions on what to configure for keytab logins.

The other approach I can think of is to require a config property to be set to 
override the keytab login.  So rather than the keytab config being missing (or 
overridden) in the config, you have to set say 
hbase.client.security.ccache=true, in which case getAuthChore() could skip the 
keytab login.

My use case was wanting to run the Canary tool as a different user with a 
credential cache (and on a different host without the keytab file) in order to 
test access.  So I think either of these would work for me.

Our only internal use of AuthUtil.getAuthChore() is in IntegrationTestBase and 
Canary.  But since AuthUtil is now part of the public API, we also need to 
consider if the current behavior is something users may be relying on.  If so, 
then I think the second approach better retains that compatibility, but I'm 
open to either.


was (Author: ghelmling):
bq. counter argument (that I don't think I agree with): AuthUtil.getAuthChore 
is meant for long running applications and as such users shouldn't be deploying 
applications that are based on it using a local kinit that will inevitably fail 
once renewal lifetimes are exceeded.

Yeah, there's certainly an argument there, which is useful in thinking about 
how to approach this.  My first take is if hbase.client.keytab.file is not 
configured or is empty, to log a warning and fall back to the credential cache 
behavior.  The log would at least give an indication on what it's doing, with 
instructions on what to configure for keytab logins.

The other approach I can think of is to require a config property to be set to 
override the keytab login.  So rather than the keytab config being missing (or 
overridden) in the config, you have to set say 
hbase.client.security.ccache=true, in which case getAuthChore() could skip the 
keytab login.

My use case was wanting to run the Canary tool as a different user with a 
credential cache (and on a different host without the keytab file) in order to 
test access.  So I think either of these would work for me.

Our only internal use of AuthUtil.getAuthChore() is in IntegrationTestBase and 
Canary.  But since AuthUtil is now part of the public API, we also need to 
consider which if the current behavior is something users may be relying on.  
If so, then I think the second approach better retains that compatibility, but 
I'm open to either.

> Client tools relying on AuthUtil.getAuthChore() break credential cache login
> 
>
> Key: HBASE-17827
> URL: https://issues.apache.org/jira/browse/HBASE-17827
> Project: HBase
>  Issue Type: Bug
>  Components: canary, security
>Reporter: Gary Helmling
>Assignee: Gary Helmling
>Priority: Critical
>
> Client tools, such as Canary, which make use of keytab based logins with 
> AuthUtil.getAuthChore() do not allow any way to continue without a 
> keytab-based login when security is enabled.  Currently, when security is 
> enabled and the configuration lacks {{hbase.client.keytab.file}}, these tools 
> would fail with:
> {noformat}
> ERROR hbase.AuthUtil: Error while trying to perform the initial login: 
> Running in secure mode, but config doesn't have a keytab
> java.io.IOException: Running in secure mode, but config doesn't have a keytab
> at 
> org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:239)
> at 
> org.apache.hadoop.hbase.security.User$SecureHadoopUser.login(User.java:420)
> at org.apache.hadoop.hbase.security.User.login(User.java:258)
> at 
> org.apache.hadoop.hbase.security.UserProvider.login(UserProvider.java:197)
> at org.apache.hadoop.hbase.AuthUtil.getAuthChore(AuthUtil.java:98)
> at org.apache.hadoop.hbase.tool.Canary.run(Canary.java:589)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.hadoop.hbase.tool.Canary.main(Canary.java:1327)
> Exception in thread "main" 

[jira] [Commented] (HBASE-17827) Client tools relying on AuthUtil.getAuthChore() break credential cache login

2017-03-23 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939210#comment-15939210
 ] 

Gary Helmling commented on HBASE-17827:
---

bq. counter argument (that I don't think I agree with): AuthUtil.getAuthChore 
is meant for long running applications and as such users shouldn't be deploying 
applications that are based on it using a local kinit that will inevitably fail 
once renewal lifetimes are exceeded.

Yeah, there's certainly an argument there, which is useful in thinking about 
how to approach this.  My first take is if hbase.client.keytab.file is not 
configured or is empty, to log a warning and fall back to the credential cache 
behavior.  The log would at least give an indication on what it's doing, with 
instructions on what to configure for keytab logins.

The other approach I can think of is to require a config property to be set to 
override the keytab login.  So rather than the keytab config being missing (or 
overridden) in the config, you have to set say 
hbase.client.security.ccache=true, in which case getAuthChore() could skip the 
keytab login.

My use case was wanting to run the Canary tool as a different user with a 
credential cache (and on a different host without the keytab file) in order to 
test access.  So I think either of these would work for me.

Our only internal use of AuthUtil.getAuthChore() is in IntegrationTestBase and 
Canary.  But since AuthUtil is now part of the public API, we also need to 
consider if the current behavior is something users may be relying on.  
If so, then I think the second approach better retains that compatibility, but 
I'm open to either.
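The first fallback approach above can be sketched as a simple decision, with hypothetical helper and return semantics (the real getAuthChore() deals in ScheduledChore and UserProvider): if security is on but no client keytab is configured, warn and skip the keytab login so the normal UGI credential-cache login applies.

```java
import java.util.Map;

/**
 * Illustrative sketch of the proposed fallback: skip the keytab login
 * (and the renewal chore) when no client keytab is configured, leaving
 * the default credential-cache login in effect.
 */
public class AuthChoreSketch {

  /** Decide whether getAuthChore() should attempt a keytab login. */
  static boolean shouldUseKeytabLogin(Map<String, String> conf) {
    if (!"kerberos".equals(conf.get("hbase.security.authentication"))) {
      return false;                      // security off: nothing to do
    }
    String keytab = conf.get("hbase.client.keytab.file");
    if (keytab == null || keytab.isEmpty()) {
      // Warn and fall back to the credential cache instead of failing,
      // with a hint on what to configure for keytab-based logins.
      System.err.println("WARN: hbase.client.keytab.file not set; "
          + "falling back to credential-cache login");
      return false;
    }
    return true;
  }
}
```

With this shape, getAuthChore() would return null in the fallback case, the same as when security is not configured, and the UGI background renewal handles the TGT lifetime.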

> Client tools relying on AuthUtil.getAuthChore() break credential cache login
> 
>
> Key: HBASE-17827
> URL: https://issues.apache.org/jira/browse/HBASE-17827
> Project: HBase
>  Issue Type: Bug
>  Components: canary, security
>Reporter: Gary Helmling
>Assignee: Gary Helmling
>Priority: Critical
>
> Client tools, such as Canary, which make use of keytab based logins with 
> AuthUtil.getAuthChore() do not allow any way to continue without a 
> keytab-based login when security is enabled.  Currently, when security is 
> enabled and the configuration lacks {{hbase.client.keytab.file}}, these tools 
> would fail with:
> {noformat}
> ERROR hbase.AuthUtil: Error while trying to perform the initial login: 
> Running in secure mode, but config doesn't have a keytab
> java.io.IOException: Running in secure mode, but config doesn't have a keytab
> at 
> org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:239)
> at 
> org.apache.hadoop.hbase.security.User$SecureHadoopUser.login(User.java:420)
> at org.apache.hadoop.hbase.security.User.login(User.java:258)
> at 
> org.apache.hadoop.hbase.security.UserProvider.login(UserProvider.java:197)
> at org.apache.hadoop.hbase.AuthUtil.getAuthChore(AuthUtil.java:98)
> at org.apache.hadoop.hbase.tool.Canary.run(Canary.java:589)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.hadoop.hbase.tool.Canary.main(Canary.java:1327)
> Exception in thread "main" java.io.IOException: Running in secure mode, but 
> config doesn't have a keytab
> at 
> org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:239)
> at 
> org.apache.hadoop.hbase.security.User$SecureHadoopUser.login(User.java:420)
> at org.apache.hadoop.hbase.security.User.login(User.java:258)
> at 
> org.apache.hadoop.hbase.security.UserProvider.login(UserProvider.java:197)
> at org.apache.hadoop.hbase.AuthUtil.getAuthChore(AuthUtil.java:98)
> at org.apache.hadoop.hbase.tool.Canary.run(Canary.java:589)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.hadoop.hbase.tool.Canary.main(Canary.java:1327)
> {noformat}
> These tools should still work with the default credential-cache login, at 
> least when a client keytab is not configured.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HBASE-17827) Client tools relying on AuthUtil.getAuthChore() break credential cache login

2017-03-23 Thread Gary Helmling (JIRA)
Gary Helmling created HBASE-17827:
-

 Summary: Client tools relying on AuthUtil.getAuthChore() break 
credential cache login
 Key: HBASE-17827
 URL: https://issues.apache.org/jira/browse/HBASE-17827
 Project: HBase
  Issue Type: Bug
  Components: canary, security
Reporter: Gary Helmling
Assignee: Gary Helmling


Client tools, such as Canary, which make use of keytab based logins with 
AuthUtil.getAuthChore() do not allow any way to continue without a keytab-based 
login when security is enabled.  Currently, when security is enabled and the 
configuration lacks {{hbase.client.keytab.file}}, these tools would fail with:

{noformat}
ERROR hbase.AuthUtil: Error while trying to perform the initial login: Running 
in secure mode, but config doesn't have a keytab
java.io.IOException: Running in secure mode, but config doesn't have a keytab
at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:239)
at 
org.apache.hadoop.hbase.security.User$SecureHadoopUser.login(User.java:420)
at org.apache.hadoop.hbase.security.User.login(User.java:258)
at 
org.apache.hadoop.hbase.security.UserProvider.login(UserProvider.java:197)
at org.apache.hadoop.hbase.AuthUtil.getAuthChore(AuthUtil.java:98)
at org.apache.hadoop.hbase.tool.Canary.run(Canary.java:589)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.hbase.tool.Canary.main(Canary.java:1327)
Exception in thread "main" java.io.IOException: Running in secure mode, but 
config doesn't have a keytab
at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:239)
at 
org.apache.hadoop.hbase.security.User$SecureHadoopUser.login(User.java:420)
at org.apache.hadoop.hbase.security.User.login(User.java:258)
at 
org.apache.hadoop.hbase.security.UserProvider.login(UserProvider.java:197)
at org.apache.hadoop.hbase.AuthUtil.getAuthChore(AuthUtil.java:98)
at org.apache.hadoop.hbase.tool.Canary.run(Canary.java:589)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.hbase.tool.Canary.main(Canary.java:1327)
{noformat}

These tools should still work with the default credential-cache login, at least 
when a client keytab is not configured.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HBASE-16755) Honor flush policy under global memstore pressure

2017-03-20 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15933867#comment-15933867
 ] 

Gary Helmling commented on HBASE-16755:
---

[~enis], [~Apache9] -- how about the following approach:

* branch-1.3: commit the current patch.  All of our existing FlushPolicy 
implementations will flush something, but this won't enforce that something has 
been flushed by any possible FlushPolicy implementation.
* branch-1 & master: in addition, make one of the changes Ashu described to 
allow the flusher to enforce that something has always flushed, regardless of 
the FlushPolicy implementation.

Does this seem reasonable, or are there concerns about the current patch going 
in to 1.3.1?

> Honor flush policy under global memstore pressure
> -
>
> Key: HBASE-16755
> URL: https://issues.apache.org/jira/browse/HBASE-16755
> Project: HBase
>  Issue Type: Improvement
>  Components: regionserver
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
> Fix For: 1.3.1
>
> Attachments: HBASE-16755.v0.patch
>
>
> When global memstore reaches the low water mark, we pick the best flushable 
> region and flush all column families for it. This is a suboptimal approach in 
> the sense that it leads to an unnecessarily high file creation rate and IO 
> amplification due to compactions. We should still try to honor the underlying 
> FlushPolicy.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (HBASE-12579) Move obtainAuthTokenForJob() methods out of User

2017-03-20 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling resolved HBASE-12579.
---
Resolution: Duplicate

The methods were deprecated and the existing usage was removed as part of 
HBASE-12493.  I guess I left this open for the final removal of the deprecated 
methods from the next major release.  The removal was done as part of 
HBASE-14208.

> Move obtainAuthTokenForJob() methods out of User
> 
>
> Key: HBASE-12579
> URL: https://issues.apache.org/jira/browse/HBASE-12579
> Project: HBase
>  Issue Type: Improvement
>  Components: security
>Reporter: Gary Helmling
>
> The {{User}} class currently contains some utility methods to obtain HBase 
> authentication tokens for the given user.  However, these methods initiate an 
> RPC to the {{TokenProvider}} coprocessor endpoint, an action which should not 
> be part of the User class' responsibilities.
> This leads to a couple of problems:
> # The way the methods are currently structured, it is impossible to integrate 
> them with normal connection management for the cluster (the TokenUtil class 
> constructs its own HTable instance internally).
> # The User class is logically part of the hbase-common module, but uses the 
> TokenUtil class (part of hbase-server, though it should probably be moved to 
> hbase-client) through reflection, leading to a hidden dependency.
> The {{obtainAuthTokenForJob()}} methods should be deprecated and the process 
> of obtaining authentication tokens should be moved to use the normal 
> connection lifecycle.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HBASE-15429) Add a split policy for busy regions

2017-03-09 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-15429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-15429:
--
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed the branch-1 patch to branch-1 and branch-1.3.

[~mantonov] I pulled this into 1.3 since it's completely self-contained and 
not used unless enabled.  If you see an issue, let me know and I can revert.

> Add a split policy for busy regions
> ---
>
> Key: HBASE-15429
> URL: https://issues.apache.org/jira/browse/HBASE-15429
> Project: HBase
>  Issue Type: Improvement
>  Components: regionserver
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
> Fix For: 2.0.0, 1.4.0, 1.3.1
>
> Attachments: HBASE-15429.branch-1.patch, HBASE-15429.patch, 
> HBASE-15429-V1.patch, HBASE-15429-V2.patch
>
>
> Currently, all region split policies are based on size. However, in certain 
> cases, it is a wise choice to make a split decision based on number of 
> requests to the region and split busy regions.
> A crude metric is that if a region blocks writes often and throws 
> RegionTooBusyException, it's probably a good idea to split it.
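The crude metric above can be sketched as follows; this is an illustrative heuristic, not the actual BusyRegionSplitPolicy code. Track the fraction of write requests that blocked and request a split once enough traffic has been seen and too much of it blocked.

```java
/**
 * Illustrative "busy region" split heuristic: split when the fraction of
 * blocked writes crosses a threshold after a minimum amount of traffic.
 */
public class BusySplitSketch {
  private long totalWrites;
  private long blockedWrites;
  private final double blockedFractionThreshold;
  private final long minWritesForDecision;

  BusySplitSketch(double blockedFractionThreshold, long minWritesForDecision) {
    this.blockedFractionThreshold = blockedFractionThreshold;
    this.minWritesForDecision = minWritesForDecision;
  }

  /** Record one write request and whether it blocked (e.g. RegionTooBusyException). */
  void recordWrite(boolean blocked) {
    totalWrites++;
    if (blocked) {
      blockedWrites++;
    }
  }

  /** True when enough traffic has been seen and too much of it blocked. */
  boolean shouldSplit() {
    return totalWrites >= minWritesForDecision
        && (double) blockedWrites / totalWrites >= blockedFractionThreshold;
  }
}
```

A production policy would evaluate this over a sliding time window and combine it with the existing size-based checks; the sketch keeps only the request-rate signal.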



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HBASE-16977) VerifyReplication should log a printable representation of the row keys

2017-03-09 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-16977:
--
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to branch-1.3+.

> VerifyReplication should log a printable representation of the row keys
> ---
>
> Key: HBASE-16977
> URL: https://issues.apache.org/jira/browse/HBASE-16977
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
>Priority: Minor
> Fix For: 2.0.0, 1.4.0, 1.3.1
>
> Attachments: HBASE-16977.branch-1.3.001.patch, 
> HBASE-16977.master.001.patch, HBASE-16977.V1.patch
>
>
> VerifyReplication prints out the row keys for offending rows in the task logs 
> for the MR job. However, the log is useless if the row key contains 
> non-printable characters.





[jira] [Updated] (HBASE-17579) Backport HBASE-16302 to 1.3.1

2017-03-09 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-17579:
--
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to branch-1.3.  [~ashu210890], can you add a release note here?

> Backport HBASE-16302 to 1.3.1
> -
>
> Key: HBASE-17579
> URL: https://issues.apache.org/jira/browse/HBASE-17579
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
> Fix For: 1.3.1
>
> Attachments: HBASE-17579.branch-1.3.001.patch, 
> HBASE-17579.branch-1.3.002.patch, HBASE-17579.branch-1.3.003.patch
>
>
> This is a simple enough change to be included in 1.3.1, and replication 
> monitoring essentially breaks without this change.





[jira] [Commented] (HBASE-17579) Backport HBASE-16302 to 1.3.1

2017-03-09 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903781#comment-15903781
 ] 

Gary Helmling commented on HBASE-17579:
---

I manually triggered a pre-commit build to get some test results.

> Backport HBASE-16302 to 1.3.1
> -
>
> Key: HBASE-17579
> URL: https://issues.apache.org/jira/browse/HBASE-17579
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
> Fix For: 1.3.1
>
> Attachments: HBASE-17579.branch-1.3.001.patch, 
> HBASE-17579.branch-1.3.002.patch, HBASE-17579.branch-1.3.003.patch
>
>
> This is a simple enough change to be included in 1.3.1, and replication 
> monitoring essentially breaks without this change.





[jira] [Commented] (HBASE-16977) VerifyReplication should log a printable representation of the row keys

2017-03-09 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903666#comment-15903666
 ] 

Gary Helmling commented on HBASE-16977:
---

Should have committed this one a long time ago, but it's gone stale.  The last 
3 instances in the patch have been fixed in master, but the first has not.  
[~ashu210890], do you mind rebasing?

> VerifyReplication should log a printable representation of the row keys
> ---
>
> Key: HBASE-16977
> URL: https://issues.apache.org/jira/browse/HBASE-16977
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
>Priority: Minor
> Fix For: 2.0.0, 1.4.0, 1.3.1
>
> Attachments: HBASE-16977.V1.patch
>
>
> VerifyReplication prints out the row keys for offending rows in the task logs 
> for the MR job. However, the log is useless if the row key contains 
> non-printable characters.





[jira] [Updated] (HBASE-17579) Backport HBASE-16302 to 1.3.1

2017-03-09 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-17579:
--
Status: Open  (was: Patch Available)

> Backport HBASE-16302 to 1.3.1
> -
>
> Key: HBASE-17579
> URL: https://issues.apache.org/jira/browse/HBASE-17579
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
> Fix For: 1.3.1
>
> Attachments: HBASE-17579.branch-1.3.001.patch, 
> HBASE-17579.branch-1.3.002.patch, HBASE-17579.branch-1.3.003.patch
>
>
> This is a simple enough change to be included in 1.3.1, and replication 
> monitoring essentially breaks without this change.





[jira] [Updated] (HBASE-17579) Backport HBASE-16302 to 1.3.1

2017-03-09 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-17579:
--
Status: Patch Available  (was: Open)

> Backport HBASE-16302 to 1.3.1
> -
>
> Key: HBASE-17579
> URL: https://issues.apache.org/jira/browse/HBASE-17579
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
> Fix For: 1.3.1
>
> Attachments: HBASE-17579.branch-1.3.001.patch, 
> HBASE-17579.branch-1.3.002.patch, HBASE-17579.branch-1.3.003.patch
>
>
> This is a simple enough change to be included in 1.3.1, and replication 
> monitoring essentially breaks without this change.





[jira] [Commented] (HBASE-16755) Honor flush policy under global memstore pressure

2017-03-03 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894773#comment-15894773
 ] 

Gary Helmling commented on HBASE-16755:
---

bq. This seems like an improvement, and not a bug. And the risk is non-trivial 
as per above. So, I would say it should not go to 1.3.1 anyway. It can be 1.4, 
and 2.0.

I would argue it is a bug that we're not respecting the configured flush policy 
under global memstore pressure.  And the performance degradation that can occur 
with tables with multiple column families with divergent write rates can be 
pretty severe.  I don't know why we wouldn't want to continue to allow 
FlushLargeStoresPolicy to choose the right stores to flush in this situation.  
It does still fall back to all stores in the case where no stores meet the 
flush threshold.

If we don't fix this in 1.3.1, then we're not giving users any recourse if they 
run across this, since even writing your own FlushPolicy can't override the 
behavior.

> Honor flush policy under global memstore pressure
> -
>
> Key: HBASE-16755
> URL: https://issues.apache.org/jira/browse/HBASE-16755
> Project: HBase
>  Issue Type: Improvement
>  Components: regionserver
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
> Fix For: 1.3.1
>
> Attachments: HBASE-16755.v0.patch
>
>
> When global memstore reaches the low water mark, we pick the best flushable 
> region and flush all column families for it. This is a suboptimal approach in 
> the  sense that it leads to an unnecessarily high file creation rate and IO 
> amplification due to compactions. We should still try to honor the underlying 
> FlushPolicy.





[jira] [Comment Edited] (HBASE-17579) Backport HBASE-16302 to 1.3.1

2017-03-02 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893578#comment-15893578
 ] 

Gary Helmling edited comment on HBASE-17579 at 3/3/17 2:20 AM:
---

You also need to change gaugesMap from Map to ConcurrentMap, as 
Map.putIfAbsent() is only present in Java 8 and HBase 1.3 supports both 7 & 8.  
That's the reason for the compilation failure in the 1.7 builds above.


was (Author: ghelmling):
You also need to change gaugesMap from Map to ConcurrentMap, as 
Map.putIfAbsent() is only present in Java 8 and HBase 1.3 supports both 7 & 8.

> Backport HBASE-16302 to 1.3.1
> -
>
> Key: HBASE-17579
> URL: https://issues.apache.org/jira/browse/HBASE-17579
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
> Fix For: 1.3.1
>
> Attachments: HBASE-17579.branch-1.3.001.patch, 
> HBASE-17579.branch-1.3.002.patch
>
>
> This is a simple enough change to be included in 1.3.1, and replication 
> monitoring essentially breaks without this change.





[jira] [Commented] (HBASE-17579) Backport HBASE-16302 to 1.3.1

2017-03-02 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893578#comment-15893578
 ] 

Gary Helmling commented on HBASE-17579:
---

You also need to change gaugesMap from Map to ConcurrentMap, as 
Map.putIfAbsent() is only present in Java 8 and HBase 1.3 supports both 7 & 8.
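The compatibility point above can be sketched in a few lines; this is an illustrative stand-in (the field name gaugesMap comes from the comment, while the value type and the register() helper are hypothetical):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class PutIfAbsentCompat {
    // Declared as ConcurrentMap rather than Map so the putIfAbsent() call
    // below compiles under Java 7: Map.putIfAbsent is a default method added
    // in Java 8, whereas ConcurrentMap has declared it since Java 5.
    static final ConcurrentMap<String, Long> gaugesMap = new ConcurrentHashMap<>();

    // Hypothetical helper: register a gauge value once and read it back.
    public static long register(String name) {
        gaugesMap.putIfAbsent(name, 0L);
        return gaugesMap.get(name);
    }

    public static void main(String[] args) {
        System.out.println(register("sink.ageOfLastAppliedOp"));  // prints 0
    }
}
```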

> Backport HBASE-16302 to 1.3.1
> -
>
> Key: HBASE-17579
> URL: https://issues.apache.org/jira/browse/HBASE-17579
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
> Fix For: 1.3.1
>
> Attachments: HBASE-17579.branch-1.3.001.patch, 
> HBASE-17579.branch-1.3.002.patch
>
>
> This is a simple enough change to be included in 1.3.1, and replication 
> monitoring essentially breaks without this change.





[jira] [Commented] (HBASE-17579) Backport HBASE-16302 to 1.3.1

2017-03-02 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893559#comment-15893559
 ] 

Gary Helmling commented on HBASE-17579:
---

I tested this locally again and confirmed that patch v2 does retain the 
existing "source.ageOfLastShippedOp" and "sink.ageOfLastAppliedOp" metric names 
for backwards compatibility.

One comment on the patch: there's a race in CompatibilityRegistry.getGauge().  
The method should return the value from gaugesMap.putIfAbsent() if it is 
non-null, instead of gauge.  Otherwise the gauge reference it returns will not 
be the instance actually stored in the map.

Otherwise the patch looks good to me.
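The fix for that race is the standard putIfAbsent() idiom: keep the winner's instance. A minimal sketch, assuming a simplified registry (the long[] gauge cell is a placeholder for the real gauge type):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class GaugeRegistryDemo {
    static final ConcurrentMap<String, long[]> gaugesMap = new ConcurrentHashMap<>();

    public static long[] getGauge(String name) {
        long[] gauge = gaugesMap.get(name);
        if (gauge == null) {
            gauge = new long[1];
            // If another thread beat us to the put, putIfAbsent returns the
            // existing mapping; returning our local instance instead would
            // hand back a gauge that the map never references.
            long[] existing = gaugesMap.putIfAbsent(name, gauge);
            if (existing != null) {
                gauge = existing;
            }
        }
        return gauge;
    }

    public static void main(String[] args) {
        long[] a = getGauge("source.ageOfLastShippedOp");
        long[] b = getGauge("source.ageOfLastShippedOp");
        System.out.println(a == b);  // prints true: both callers share one gauge
    }
}
```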

> Backport HBASE-16302 to 1.3.1
> -
>
> Key: HBASE-17579
> URL: https://issues.apache.org/jira/browse/HBASE-17579
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
> Fix For: 1.3.1
>
> Attachments: HBASE-17579.branch-1.3.001.patch, 
> HBASE-17579.branch-1.3.002.patch
>
>
> This is a simple enough change to be included in 1.3.1, and replication 
> monitoring essentially breaks without this change.





[jira] [Commented] (HBASE-17704) Regions stuck in FAILED_OPEN when HDFS blocks are missing

2017-03-02 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892873#comment-15892873
 ] 

Gary Helmling commented on HBASE-17704:
---

Just to be clear, I'd also be in favor of changing the default for this config 
to Integer.MAX_VALUE for 1.4.0 and 2.0.0.  The current situation, where 
FAILED_OPEN is a terminal state requiring operator intervention, is pretty bad 
and seems unnecessary.

It could be that I'm missing something else that's necessary, but that seems 
like an appropriate fix for this issue.

> Regions stuck in FAILED_OPEN when HDFS blocks are missing
> -
>
> Key: HBASE-17704
> URL: https://issues.apache.org/jira/browse/HBASE-17704
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 1.1.8
>Reporter: Mathias Herberts
>
> We recently experienced the loss of a whole rack (6 DNs + RS) in a 120 node 
> cluster. This led to the regions that were present on the 6 RSs, which became 
> unavailable, being reassigned to live RSs. When attempting to open some of 
> the reassigned regions, some RSs encountered missing blocks and issued "No 
> live nodes contain current block Block locations", putting the regions in 
> state FAILED_OPEN.
> Once the disappeared DNs went back online, the regions were left in 
> FAILED_OPEN, needing a restart of all the affected RSs to solve the problem.





[jira] [Commented] (HBASE-17704) Regions stuck in FAILED_OPEN when HDFS blocks are missing

2017-03-01 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15891281#comment-15891281
 ] 

Gary Helmling commented on HBASE-17704:
---

So HBASE-16209 added a backoff policy for retries of region open, without which 
regions would go into FAILED_OPEN quickly.  So maybe all that's needed is to 
bump up the configuration for maximum attempts 
("hbase.assignment.maximum.attempts") to Integer.MAX_VALUE?
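Under that approach, the override would go in hbase-site.xml; a hedged sketch (the property name is quoted from the comment above, and raising it this far is the suggestion under discussion, not a shipped default):

```xml
<!-- Retry region opens indefinitely instead of parking the region in
     FAILED_OPEN after a fixed number of attempts. -->
<property>
  <name>hbase.assignment.maximum.attempts</name>
  <value>2147483647</value> <!-- Integer.MAX_VALUE -->
</property>
```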

> Regions stuck in FAILED_OPEN when HDFS blocks are missing
> -
>
> Key: HBASE-17704
> URL: https://issues.apache.org/jira/browse/HBASE-17704
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 1.1.8
>Reporter: Mathias Herberts
>
> We recently experienced the loss of a whole rack (6 DNs + RS) in a 120 node 
> cluster. This led to the regions that were present on the 6 RSs, which became 
> unavailable, being reassigned to live RSs. When attempting to open some of 
> the reassigned regions, some RSs encountered missing blocks and issued "No 
> live nodes contain current block Block locations", putting the regions in 
> state FAILED_OPEN.
> Once the disappeared DNs went back online, the regions were left in 
> FAILED_OPEN, needing a restart of all the affected RSs to solve the problem.





[jira] [Commented] (HBASE-16755) Honor flush policy under global memstore pressure

2017-03-01 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890901#comment-15890901
 ] 

Gary Helmling commented on HBASE-16755:
---

[~enis], yes, all of our current flush policies will fall back to all stores if 
none of the stores meets the threshold.

> Honor flush policy under global memstore pressure
> -
>
> Key: HBASE-16755
> URL: https://issues.apache.org/jira/browse/HBASE-16755
> Project: HBase
>  Issue Type: Improvement
>  Components: regionserver
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
> Fix For: 1.3.1
>
> Attachments: HBASE-16755.v0.patch
>
>
> When global memstore reaches the low water mark, we pick the best flushable 
> region and flush all column families for it. This is a suboptimal approach in 
> the  sense that it leads to an unnecessarily high file creation rate and IO 
> amplification due to compactions. We should still try to honor the underlying 
> FlushPolicy.





[jira] [Commented] (HBASE-16755) Honor flush policy under global memstore pressure

2017-03-01 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890899#comment-15890899
 ] 

Gary Helmling commented on HBASE-16755:
---

[~Apache9], yes, currently with either FlushAllLargeStoresPolicy or 
FlushNonSloppyStoresFirstPolicy, we still will fall back to all stores in the 
case that none of the stores meets the flush threshold.  So we will still 
ensure that something is always flushed.
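That fallback behavior can be sketched as follows; this is an illustrative reduction, not HBase's actual FlushPolicy code (the store sizes and threshold values are stand-ins):

```java
import java.util.ArrayList;
import java.util.List;

public class FlushPolicyDemo {
    // Flush only the stores whose memstore size meets the per-store
    // threshold; if none qualifies, fall back to flushing all stores so
    // that a flush always frees some memory.
    public static List<Long> selectStoresToFlush(List<Long> storeSizes, long threshold) {
        List<Long> selected = new ArrayList<>();
        for (long size : storeSizes) {
            if (size >= threshold) {
                selected.add(size);
            }
        }
        return selected.isEmpty() ? storeSizes : selected;
    }

    public static void main(String[] args) {
        // One large store: flush just that one.
        System.out.println(selectStoresToFlush(List.of(10L, 200L, 5L), 100L));  // prints [200]
        // No store meets the threshold: fall back to all stores.
        System.out.println(selectStoresToFlush(List.of(10L, 20L, 5L), 100L));   // prints [10, 20, 5]
    }
}
```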

Thanks for taking a look, I'll go ahead and commit.

> Honor flush policy under global memstore pressure
> -
>
> Key: HBASE-16755
> URL: https://issues.apache.org/jira/browse/HBASE-16755
> Project: HBase
>  Issue Type: Improvement
>  Components: regionserver
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
> Fix For: 1.3.1
>
> Attachments: HBASE-16755.v0.patch
>
>
> When global memstore reaches the low water mark, we pick the best flushable 
> region and flush all column families for it. This is a suboptimal approach in 
> the  sense that it leads to an unnecessarily high file creation rate and IO 
> amplification due to compactions. We should still try to honor the underlying 
> FlushPolicy.





[jira] [Commented] (HBASE-16755) Honor flush policy under global memstore pressure

2017-02-27 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887056#comment-15887056
 ] 

Gary Helmling commented on HBASE-16755:
---

+1 on the patch.

Since we have both FlushLargeStoresPolicy and FlushAllStoresPolicy (and more 
options in master), makes sense to me to honor that policy, even under global 
memstore pressure.

[~Apache9] any thoughts on this change, since you ported over the per-store 
flush decisions originally?  We've seen situations where, due to hitting global 
memstore pressure, we constantly flush lots of small files, even when stores 
are eligible to flush with FlushLargeStoresPolicy.

> Honor flush policy under global memstore pressure
> -
>
> Key: HBASE-16755
> URL: https://issues.apache.org/jira/browse/HBASE-16755
> Project: HBase
>  Issue Type: Improvement
>  Components: regionserver
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
> Fix For: 1.3.1
>
> Attachments: HBASE-16755.v0.patch
>
>
> When global memstore reaches the low water mark, we pick the best flushable 
> region and flush all column families for it. This is a suboptimal approach in 
> the  sense that it leads to an unnecessarily high file creation rate and IO 
> amplification due to compactions. We should still try to honor the underlying 
> FlushPolicy.





[jira] [Updated] (HBASE-17057) Minor compactions should also drop page cache behind reads

2017-02-24 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-17057:
--
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to branch-1.3+.

[~ashu210890], can you add a release note for this?

> Minor compactions should also drop page cache behind reads
> --
>
> Key: HBASE-17057
> URL: https://issues.apache.org/jira/browse/HBASE-17057
> Project: HBase
>  Issue Type: Improvement
>  Components: Compaction
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
> Fix For: 1.3.1
>
> Attachments: HBASE-17057.V1.patch, HBASE-17057.V2.patch
>
>
> Long compactions currently drop cache behind reads/writes so that they don't 
> pollute the page cache, but short compactions don't do that. The bulk of the 
> data is actually read during minor compactions instead of major compactions, 
> and this thrashes the page cache since the data is mostly not needed again. 
> We should drop page cache behind minor compactions too.





[jira] [Commented] (HBASE-17582) Drop page cache hint is broken

2017-02-22 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879392#comment-15879392
 ] 

Gary Helmling commented on HBASE-17582:
---

+1 on the patch.

> Drop page cache hint is broken
> --
>
> Key: HBASE-17582
> URL: https://issues.apache.org/jira/browse/HBASE-17582
> Project: HBase
>  Issue Type: Bug
>  Components: Compaction, io
>Affects Versions: 2.0.0
>Reporter: Ashu Pachauri
>Assignee: Appy
>Priority: Critical
> Attachments: HBASE-17582.master.001.patch
>
>
> We pass a boolean for pass page cache drop hint while creating store file 
> scanners and writers. 
> The hint is not passed on down the stack by StoreFileWriter and 
> StoreFileScanner in the master branch.





[jira] [Updated] (HBASE-17590) Drop cache hint should work for StoreFile write path

2017-02-22 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-17590:
--
   Resolution: Fixed
Fix Version/s: 1.3.1
   1.4.0
   2.0.0
   Status: Resolved  (was: Patch Available)

Committed to branch-1.3+.

> Drop cache hint should work for StoreFile write path
> 
>
> Key: HBASE-17590
> URL: https://issues.apache.org/jira/browse/HBASE-17590
> Project: HBase
>  Issue Type: Bug
>Reporter: Appy
>Assignee: Ashu Pachauri
> Fix For: 2.0.0, 1.4.0, 1.3.1
>
> Attachments: HBASE-17590.master.001.patch
>
>
> We have this in the code right now.
> {noformat}
> public Builder withShouldDropCacheBehind(boolean 
> shouldDropCacheBehind/*NOT USED!!*/) {
>   // TODO: HAS NO EFFECT!!! FIX!!
>   return this;
> }
> {noformat}
> Creating jira to track it.





[jira] [Commented] (HBASE-17590) Drop cache hint should work for StoreFile write path

2017-02-22 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879185#comment-15879185
 ] 

Gary Helmling commented on HBASE-17590:
---

+1.  I'll commit shortly.

> Drop cache hint should work for StoreFile write path
> 
>
> Key: HBASE-17590
> URL: https://issues.apache.org/jira/browse/HBASE-17590
> Project: HBase
>  Issue Type: Bug
>Reporter: Appy
>Assignee: Ashu Pachauri
> Attachments: HBASE-17590.master.001.patch
>
>
> We have this in the code right now.
> {noformat}
> public Builder withShouldDropCacheBehind(boolean 
> shouldDropCacheBehind/*NOT USED!!*/) {
>   // TODO: HAS NO EFFECT!!! FIX!!
>   return this;
> }
> {noformat}
> Creating jira to track it.





[jira] [Updated] (HBASE-17627) Active workers metric for thrift

2017-02-15 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-17627:
--
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 1.3.1
   1.4.0
   2.0.0
   Status: Resolved  (was: Patch Available)

Pushed to branch-1.3+.  Thanks for the patch!

> Active workers metric for thrift
> 
>
> Key: HBASE-17627
> URL: https://issues.apache.org/jira/browse/HBASE-17627
> Project: HBase
>  Issue Type: Improvement
>  Components: Thrift
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
> Fix For: 2.0.0, 1.4.0, 1.3.1
>
> Attachments: HBASE-17627.branch-1.001.patch, 
> HBASE-17627.master.001.patch
>
>
> It would be good to have a metric for the number of active handlers on thrift 
> servers, which gives a good indication of how busy a thrift server is.





[jira] [Commented] (HBASE-17627) Active workers metric for thrift

2017-02-14 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866560#comment-15866560
 ] 

Gary Helmling commented on HBASE-17627:
---

+1 on the patch from review board.

Can you check the test failure and see if it is related?

The way we currently expose the number of active handlers, here and in the 
RpcServer, is a bit flawed.  Since it is a point-in-time metric, it can miss 
spikes in activity which clear before metrics collection happens.  Ultimately 
it might be better to report something like "number of handler milliseconds 
consumed" (the millisecond duration of each call, summed up), but that is 
really a separate discussion.

Anyway, this change is much better than what we currently have, or rather 
don't have.
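The "handler milliseconds consumed" idea above could be accumulated like this; a minimal sketch with hypothetical names, not the Thrift server's actual metrics code:

```java
import java.util.concurrent.atomic.AtomicLong;

public class HandlerTimeMetric {
    // Instead of sampling the active-handler count at collection time (which
    // can miss short spikes), accumulate the total handler time consumed.
    private final AtomicLong handlerMillis = new AtomicLong();

    // Called once per completed request with its wall-clock bounds.
    public void recordCall(long startMillis, long endMillis) {
        handlerMillis.addAndGet(endMillis - startMillis);
    }

    public long totalHandlerMillis() {
        return handlerMillis.get();
    }

    public static void main(String[] args) {
        HandlerTimeMetric m = new HandlerTimeMetric();
        m.recordCall(0, 40);   // a 40 ms call
        m.recordCall(10, 25);  // a concurrent 15 ms call
        System.out.println(m.totalHandlerMillis());  // prints 55
    }
}
```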

> Active workers metric for thrift
> 
>
> Key: HBASE-17627
> URL: https://issues.apache.org/jira/browse/HBASE-17627
> Project: HBase
>  Issue Type: Improvement
>  Components: Thrift
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
> Attachments: HBASE-17627.master.001.patch
>
>
> It would be good to have a metric for the number of active handlers on thrift 
> servers, which gives a good indication of how busy a thrift server is.





[jira] [Updated] (HBASE-17611) Thrift 2 per-call latency metrics are capped at ~ 2 seconds

2017-02-13 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-17611:
--
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 1.4.0
   2.0.0
   Status: Resolved  (was: Patch Available)

Committed to branch-1.3+.  Thanks for the review, [~tedyu].

> Thrift 2 per-call latency metrics are capped at ~ 2 seconds
> ---
>
> Key: HBASE-17611
> URL: https://issues.apache.org/jira/browse/HBASE-17611
> Project: HBase
>  Issue Type: Bug
>  Components: metrics, Thrift
>Reporter: Gary Helmling
>Assignee: Gary Helmling
> Fix For: 2.0.0, 1.4.0, 1.3.1
>
> Attachments: HBASE-17611.001.patch, HBASE-17611.002.patch
>
>
> Thrift 2 latency metrics are measured in nanoseconds.  However, the duration 
> used for per-method latencies is cast to an int, meaning the values are 
> capped at 2.147 seconds.  Let's use a long instead.
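The truncation described in this issue is easy to reproduce; a minimal sketch (method names are illustrative, not from the Thrift 2 code):

```java
public class LatencyCapDemo {
    // Mirrors the bug: a nanosecond duration cast to int silently truncates
    // once the call takes longer than Integer.MAX_VALUE ns (~2.147 s).
    public static int buggyLatency(long durationNanos) {
        return (int) durationNanos;
    }

    // The fix: keep the duration as a long all the way through.
    public static long fixedLatency(long durationNanos) {
        return durationNanos;
    }

    public static void main(String[] args) {
        long threeSeconds = 3_000_000_000L;  // 3 s expressed in nanoseconds
        System.out.println(buggyLatency(threeSeconds));  // prints -1294967296
        System.out.println(fixedLatency(threeSeconds));  // prints 3000000000
    }
}
```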





[jira] [Updated] (HBASE-17611) Thrift 2 per-call latency metrics are capped at ~ 2 seconds

2017-02-10 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-17611:
--
Attachment: HBASE-17611.002.patch

Thanks for taking a look [~tedyu].

The "testMetricsPrecision" table will get cleaned up with the rest of the test 
data anyway, but here's a new patch which deletes the table at the end of the 
test.

TestThriftServerCmdLine passes for me locally.  Let's see if it reappears in 
this run.



> Thrift 2 per-call latency metrics are capped at ~ 2 seconds
> ---
>
> Key: HBASE-17611
> URL: https://issues.apache.org/jira/browse/HBASE-17611
> Project: HBase
>  Issue Type: Bug
>  Components: metrics, Thrift
>Reporter: Gary Helmling
>Assignee: Gary Helmling
> Fix For: 1.3.1
>
> Attachments: HBASE-17611.001.patch, HBASE-17611.002.patch
>
>
> Thrift 2 latency metrics are measured in nanoseconds.  However, the duration 
> used for per-method latencies is cast to an int, meaning the values are 
> capped at 2.147 seconds.  Let's use a long instead.





[jira] [Commented] (HBASE-17579) Backport HBASE-16302 to 1.3.1

2017-02-09 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860173#comment-15860173
 ] 

Gary Helmling commented on HBASE-17579:
---

Tested this and it replaces the existing metric key with new metrics using the 
histogram suffixes.  We'll need to add a compatibility shim to get this into 
1.3: in addition to the histogram metrics, report the histogram max under the 
old metric key.  This should work for backwards compatibility.

> Backport HBASE-16302 to 1.3.1
> -
>
> Key: HBASE-17579
> URL: https://issues.apache.org/jira/browse/HBASE-17579
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
> Fix For: 1.3.1
>
> Attachments: HBASE-17579.branch-1.3.001.patch
>
>
> This is a simple enough change to be included in 1.3.1, and replication 
> monitoring essentially breaks without this change.





[jira] [Commented] (HBASE-17579) Backport HBASE-16302 to 1.3.1

2017-02-09 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860149#comment-15860149
 ] 

Gary Helmling commented on HBASE-17579:
---

So, to confirm, this does not change the key for the existing 
ageOfLastShippedOp metric, but just changes it to use the histogram max 
internally, and adds metrics for the histogram percentiles?

> Backport HBASE-16302 to 1.3.1
> -
>
> Key: HBASE-17579
> URL: https://issues.apache.org/jira/browse/HBASE-17579
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
> Fix For: 1.3.1
>
> Attachments: HBASE-17579.branch-1.3.001.patch
>
>
> This is a simple enough change to be included in 1.3.1, and replication 
> monitoring essentially breaks without this change.





[jira] [Updated] (HBASE-17590) Drop cache hint should work for StoreFile write path

2017-02-09 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-17590:
--
Status: Patch Available  (was: Open)

Trigger Hadoop QA run.

> Drop cache hint should work for StoreFile write path
> 
>
> Key: HBASE-17590
> URL: https://issues.apache.org/jira/browse/HBASE-17590
> Project: HBase
>  Issue Type: Bug
>Reporter: Appy
>Assignee: Ashu Pachauri
> Attachments: HBASE-17590.master.001.patch
>
>
> We have this in the code right now.
> {noformat}
> public Builder withShouldDropCacheBehind(boolean 
> shouldDropCacheBehind/*NOT USED!!*/) {
>   // TODO: HAS NO EFFECT!!! FIX!!
>   return this;
> }
> {noformat}
> Creating jira to track it.





[jira] [Updated] (HBASE-17604) Backport HBASE-15437 (fix request and response size metrics) to branch-1

2017-02-08 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-17604:
--
   Resolution: Fixed
Fix Version/s: 1.3.1
   1.4.0
   Status: Resolved  (was: Patch Available)

Committed to branch-1.3 and branch-1.  Thanks for taking a look [~stack]!

> Backport HBASE-15437 (fix request and response size metrics) to branch-1
> 
>
> Key: HBASE-17604
> URL: https://issues.apache.org/jira/browse/HBASE-17604
> Project: HBase
>  Issue Type: Bug
>  Components: IPC/RPC, metrics
>Reporter: Gary Helmling
>Assignee: Gary Helmling
> Fix For: 1.4.0, 1.3.1
>
> Attachments: HBASE-17604.branch-1.001.patch
>
>
> HBASE-15437 fixed request and response size metrics in master.  We should 
> apply the same to branch-1 and related release branches.
> Prior to HBASE-15437, request and response size metrics were only calculated 
> based on the protobuf message serialized size.  This isn't correct when the 
> cell scanner payload is in use.





[jira] [Updated] (HBASE-17611) Thrift 2 per-call latency metrics are capped at ~ 2 seconds

2017-02-08 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-17611:
--
Attachment: HBASE-17611.001.patch

Attaching a simple fix and test for coverage.

> Thrift 2 per-call latency metrics are capped at ~ 2 seconds
> ---
>
> Key: HBASE-17611
> URL: https://issues.apache.org/jira/browse/HBASE-17611
> Project: HBase
>  Issue Type: Bug
>  Components: metrics, Thrift
>Reporter: Gary Helmling
>Assignee: Gary Helmling
> Fix For: 1.3.1
>
> Attachments: HBASE-17611.001.patch
>
>
> Thrift 2 latency metrics are measured in nanoseconds.  However, the duration 
> used for per-method latencies is cast to an int, meaning the values are 
> capped at 2.147 seconds.  Let's use a long instead.





[jira] [Updated] (HBASE-17611) Thrift 2 per-call latency metrics are capped at ~ 2 seconds

2017-02-08 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-17611:
--
Status: Patch Available  (was: Open)

> Thrift 2 per-call latency metrics are capped at ~ 2 seconds
> ---
>
> Key: HBASE-17611
> URL: https://issues.apache.org/jira/browse/HBASE-17611
> Project: HBase
>  Issue Type: Bug
>  Components: metrics, Thrift
>Reporter: Gary Helmling
>Assignee: Gary Helmling
> Fix For: 1.3.1
>
> Attachments: HBASE-17611.001.patch
>
>
> Thrift 2 latency metrics are measured in nanoseconds.  However, the duration 
> used for per-method latencies is cast to an int, meaning the values are 
> capped at 2.147 seconds.  Let's use a long instead.





[jira] [Updated] (HBASE-17381) ReplicationSourceWorkerThread can die due to unhandled exceptions

2017-02-07 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-17381:
--
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 1.2.5
   1.3.1
   1.4.0
   2.0.0
   Status: Resolved  (was: Patch Available)

Committed to branch-1.2+.  This would require some substantial rework for 
branch-1.1.

Thanks for the fix [~openinx]!

> ReplicationSourceWorkerThread can die due to unhandled exceptions
> -
>
> Key: HBASE-17381
> URL: https://issues.apache.org/jira/browse/HBASE-17381
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Reporter: Gary Helmling
>Assignee: huzheng
> Fix For: 2.0.0, 1.4.0, 1.3.1, 1.2.5
>
> Attachments: HBASE-17381.patch, HBASE-17381.v1.patch, 
> HBASE-17381.v2.patch, HBASE-17381.v3.patch
>
>
> If a ReplicationSourceWorkerThread encounters an unexpected exception in the 
> run() method (for example failure to allocate direct memory for the DFS 
> client), the exception will be logged by the UncaughtExceptionHandler, but 
> the thread will also die and the replication queue will back up indefinitely 
> until the Regionserver is restarted.
> We should make sure the worker thread is resilient to all exceptions that it 
> can actually handle.  For those that it really can't, it seems better to 
> abort the regionserver rather than just allow replication to stop with 
> minimal signal.
> Here is a sample exception:
> {noformat}
> ERROR regionserver.ReplicationSource: Unexpected exception in 
> ReplicationSourceWorkerThread, 
> currentPath=hdfs://.../hbase/WALs/XXXwalfilenameXXX
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.nio.Bits.reserveMemory(Bits.java:693)
> at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
> at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:96)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:113)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:108)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.createStreamPair(DataTransferSaslUtil.java:344)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:490)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:391)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
> at 
> org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:92)
> at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3444)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:778)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:695)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:356)
> at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:673)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:882)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:308)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {noformat}





[jira] [Created] (HBASE-17611) Thrift 2 per-call latency metrics are capped at ~ 2 seconds

2017-02-07 Thread Gary Helmling (JIRA)
Gary Helmling created HBASE-17611:
-

 Summary: Thrift 2 per-call latency metrics are capped at ~ 2 
seconds
 Key: HBASE-17611
 URL: https://issues.apache.org/jira/browse/HBASE-17611
 Project: HBase
  Issue Type: Bug
  Components: metrics, Thrift
Reporter: Gary Helmling
Assignee: Gary Helmling
 Fix For: 1.3.1


Thrift 2 latency metrics are measured in nanoseconds.  However, the duration 
used for per-method latencies is cast to an int, meaning the values are capped 
at 2.147 seconds.  Let's use a long instead.
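The int-cast overflow is easy to demonstrate outside HBase. The sketch below (illustrative only, not the actual Thrift handler code) shows a nanosecond duration past Integer.MAX_VALUE ns wrapping to a negative value when narrowed to int, while a long preserves it:

```java
public class LatencyCastDemo {
    // Narrowing a nanosecond duration to int keeps only the low 32 bits,
    // so anything past Integer.MAX_VALUE ns (~2.147 s) is corrupted;
    // 2.5 s, for example, wraps to a negative value.
    static int latencyAsInt(long durationNanos) {
        return (int) durationNanos;
    }

    // Keeping the duration as a long preserves the value (the proposed fix).
    static long latencyAsLong(long durationNanos) {
        return durationNanos;
    }

    public static void main(String[] args) {
        long twoAndAHalfSeconds = 2_500_000_000L;
        System.out.println(latencyAsInt(twoAndAHalfSeconds));  // -1794967296
        System.out.println(latencyAsLong(twoAndAHalfSeconds)); // 2500000000
    }
}
```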





[jira] [Commented] (HBASE-17381) ReplicationSourceWorkerThread can die due to unhandled exceptions

2017-02-07 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15856716#comment-15856716
 ] 

Gary Helmling commented on HBASE-17381:
---

+1 on patch v3.  Thanks for the fix!  I'll commit shortly.

> ReplicationSourceWorkerThread can die due to unhandled exceptions
> -
>
> Key: HBASE-17381
> URL: https://issues.apache.org/jira/browse/HBASE-17381
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Reporter: Gary Helmling
>Assignee: huzheng
> Attachments: HBASE-17381.patch, HBASE-17381.v1.patch, 
> HBASE-17381.v2.patch, HBASE-17381.v3.patch
>
>
> If a ReplicationSourceWorkerThread encounters an unexpected exception in the 
> run() method (for example failure to allocate direct memory for the DFS 
> client), the exception will be logged by the UncaughtExceptionHandler, but 
> the thread will also die and the replication queue will back up indefinitely 
> until the Regionserver is restarted.
> We should make sure the worker thread is resilient to all exceptions that it 
> can actually handle.  For those that it really can't, it seems better to 
> abort the regionserver rather than just allow replication to stop with 
> minimal signal.
> Here is a sample exception:
> {noformat}
> ERROR regionserver.ReplicationSource: Unexpected exception in 
> ReplicationSourceWorkerThread, 
> currentPath=hdfs://.../hbase/WALs/XXXwalfilenameXXX
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.nio.Bits.reserveMemory(Bits.java:693)
> at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
> at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:96)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:113)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:108)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.createStreamPair(DataTransferSaslUtil.java:344)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:490)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:391)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
> at 
> org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:92)
> at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3444)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:778)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:695)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:356)
> at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:673)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:882)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:308)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {noformat}





[jira] [Updated] (HBASE-17604) Backport HBASE-15437 (fix request and response size metrics) to branch-1

2017-02-06 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-17604:
--
Attachment: HBASE-17604.branch-1.001.patch

> Backport HBASE-15437 (fix request and response size metrics) to branch-1
> 
>
> Key: HBASE-17604
> URL: https://issues.apache.org/jira/browse/HBASE-17604
> Project: HBase
>  Issue Type: Bug
>  Components: IPC/RPC, metrics
>Reporter: Gary Helmling
>Assignee: Gary Helmling
> Attachments: HBASE-17604.branch-1.001.patch
>
>
> HBASE-15437 fixed request and response size metrics in master.  We should 
> apply the same to branch-1 and related release branches.
> Prior to HBASE-15437, request and response size metrics were only calculated 
> based on the protobuf message serialized size.  This isn't correct when the 
> cell scanner payload is in use.





[jira] [Updated] (HBASE-17604) Backport HBASE-15437 (fix request and response size metrics) to branch-1

2017-02-06 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-17604:
--
Status: Patch Available  (was: Open)

> Backport HBASE-15437 (fix request and response size metrics) to branch-1
> 
>
> Key: HBASE-17604
> URL: https://issues.apache.org/jira/browse/HBASE-17604
> Project: HBase
>  Issue Type: Bug
>  Components: IPC/RPC, metrics
>Reporter: Gary Helmling
>Assignee: Gary Helmling
> Attachments: HBASE-17604.branch-1.001.patch
>
>
> HBASE-15437 fixed request and response size metrics in master.  We should 
> apply the same to branch-1 and related release branches.
> Prior to HBASE-15437, request and response size metrics were only calculated 
> based on the protobuf message serialized size.  This isn't correct when the 
> cell scanner payload is in use.





[jira] [Assigned] (HBASE-17604) Backport HBASE-15437 (fix request and response size metrics) to branch-1

2017-02-06 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling reassigned HBASE-17604:
-

Assignee: Gary Helmling

> Backport HBASE-15437 (fix request and response size metrics) to branch-1
> 
>
> Key: HBASE-17604
> URL: https://issues.apache.org/jira/browse/HBASE-17604
> Project: HBase
>  Issue Type: Bug
>  Components: IPC/RPC, metrics
>Reporter: Gary Helmling
>Assignee: Gary Helmling
>
> HBASE-15437 fixed request and response size metrics in master.  We should 
> apply the same to branch-1 and related release branches.
> Prior to HBASE-15437, request and response size metrics were only calculated 
> based on the protobuf message serialized size.  This isn't correct when the 
> cell scanner payload is in use.





[jira] [Created] (HBASE-17604) Backport HBASE-15437 (fix request and response size metrics) to branch-1

2017-02-06 Thread Gary Helmling (JIRA)
Gary Helmling created HBASE-17604:
-

 Summary: Backport HBASE-15437 (fix request and response size 
metrics) to branch-1
 Key: HBASE-17604
 URL: https://issues.apache.org/jira/browse/HBASE-17604
 Project: HBase
  Issue Type: Bug
  Components: IPC/RPC, metrics
Reporter: Gary Helmling


HBASE-15437 fixed request and response size metrics in master.  We should apply 
the same to branch-1 and related release branches.

Prior to HBASE-15437, request and response size metrics were only calculated 
based on the protobuf message serialized size.  This isn't correct when the 
cell scanner payload is in use.
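The undercounting can be sketched as follows (method names here are hypothetical stand-ins, not the HBASE-15437 patch): when a request carries a cell block, counting only the serialized protobuf header misses most of the bytes actually transferred.

```java
// Hypothetical sketch of the size accounting; not the actual HBase RPC code.
public class RequestSizeDemo {
    // Old behavior: only the serialized protobuf message is counted.
    static long headerOnlySize(long protobufSize, long cellBlockSize) {
        return protobufSize;
    }

    // Fixed behavior: the cell scanner payload is counted as well.
    static long totalSize(long protobufSize, long cellBlockSize) {
        return protobufSize + cellBlockSize;
    }

    public static void main(String[] args) {
        long protobufSize = 256;               // small request header
        long cellBlockSize = 4L * 1024 * 1024; // 4 MiB of cells in the cell block
        System.out.println(headerOnlySize(protobufSize, cellBlockSize)); // 256
        System.out.println(totalSize(protobufSize, cellBlockSize));      // 4194560
    }
}
```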





[jira] [Updated] (HBASE-17578) Thrift per-method metrics should still update in the case of exceptions

2017-02-06 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-17578:
--
Release Note: In prior versions, the HBase Thrift handlers failed to 
increment per-method metrics when an exception was encountered.  These metrics 
will now always be incremented, whether an exception is encountered or not.  
This change also adds exception-type metrics, similar to those exposed in 
regionservers, for individual exceptions which are received by the Thrift 
handlers.

> Thrift per-method metrics should still update in the case of exceptions
> ---
>
> Key: HBASE-17578
> URL: https://issues.apache.org/jira/browse/HBASE-17578
> Project: HBase
>  Issue Type: Bug
>  Components: Thrift
>Reporter: Gary Helmling
>Assignee: Gary Helmling
> Fix For: 2.0.0, 1.4.0, 1.3.1
>
> Attachments: HBASE-17578.001.patch, HBASE-17578.002.patch, 
> HBASE-17578.002.patch
>
>
> Currently, the InvocationHandler used to update per-method metrics in the 
> Thrift server fails to update metrics if an exception occurs.  This causes us 
> to miss outliers.  We should include exceptional cases in per-method 
> latencies, and also look at adding specific exception rate metrics.





[jira] [Updated] (HBASE-17578) Thrift per-method metrics should still update in the case of exceptions

2017-02-06 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-17578:
--
   Resolution: Fixed
Fix Version/s: 1.4.0
   2.0.0
   Status: Resolved  (was: Patch Available)

Applied to branch-1.3+.

I took a brief look at branch-1.2, but there are quite a few differences in the 
metrics classes, so I'm leaving that branch out.  If there is interest in a more 
targeted fix for earlier branches, we can create a separate issue for a 
backport.

> Thrift per-method metrics should still update in the case of exceptions
> ---
>
> Key: HBASE-17578
> URL: https://issues.apache.org/jira/browse/HBASE-17578
> Project: HBase
>  Issue Type: Bug
>  Components: Thrift
>Reporter: Gary Helmling
>Assignee: Gary Helmling
> Fix For: 2.0.0, 1.4.0, 1.3.1
>
> Attachments: HBASE-17578.001.patch, HBASE-17578.002.patch, 
> HBASE-17578.002.patch
>
>
> Currently, the InvocationHandler used to update per-method metrics in the 
> Thrift server fails to update metrics if an exception occurs.  This causes us 
> to miss outliers.  We should include exceptional cases in per-method 
> latencies, and also look at adding specific exception rate metrics.





[jira] [Updated] (HBASE-17578) Thrift per-method metrics should still update in the case of exceptions

2017-02-02 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-17578:
--
Attachment: HBASE-17578.002.patch

Make findbugs happy with equals and hashCode.

TestThriftHttpServer is flaky.

> Thrift per-method metrics should still update in the case of exceptions
> ---
>
> Key: HBASE-17578
> URL: https://issues.apache.org/jira/browse/HBASE-17578
> Project: HBase
>  Issue Type: Bug
>  Components: Thrift
>Reporter: Gary Helmling
>Assignee: Gary Helmling
> Fix For: 1.3.1
>
> Attachments: HBASE-17578.001.patch, HBASE-17578.002.patch
>
>
> Currently, the InvocationHandler used to update per-method metrics in the 
> Thrift server fails to update metrics if an exception occurs.  This causes us 
> to miss outliers.  We should include exceptional cases in per-method 
> latencies, and also look at adding specific exception rate metrics.





[jira] [Commented] (HBASE-17582) Drop page cache hint is broken

2017-02-02 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15850546#comment-15850546
 ] 

Gary Helmling commented on HBASE-17582:
---

On the write side, from what I can see, this was never fully hooked up.  The 
value set in the Builder doesn't look like it was ever passed through to the 
Writer constructor.  So the boolean needs to be passed through from the Writer 
Builder -> StoreFileWriter constructor -> WriterFactory.  If it is piped 
through, it will be used in WriterFactory.create().

> Drop page cache hint is broken
> --
>
> Key: HBASE-17582
> URL: https://issues.apache.org/jira/browse/HBASE-17582
> Project: HBase
>  Issue Type: Bug
>  Components: Compaction, io
>Affects Versions: 2.0.0
>Reporter: Ashu Pachauri
>Assignee: Appy
>Priority: Critical
> Attachments: HBASE-17582.master.001.patch
>
>
> We pass a boolean for the drop-page-cache hint while creating store file 
> scanners and writers. 
> The hint is not passed down the stack by StoreFileWriter and 
> StoreFileScanner in the master branch.





[jira] [Commented] (HBASE-17582) Drop page cache hint is broken

2017-02-02 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15850347#comment-15850347
 ] 

Gary Helmling commented on HBASE-17582:
---

I took a look at the HBASE-15118 change to StoreFile.  It did remove the 
shouldDropCacheBehind private variable and replace it with the comment.  But it 
appears that the variable was unused even at that point.

On testing this, I don't really know.  It would be pretty difficult.  You could 
provide a way of overriding the HFileSystem used so that you can check that 
FSDataInputStream.setDropBehind(boolean) is eventually called with the right 
value.  But that would be pretty intrusive to do.  I'm not sure it's worth the 
effort unless anyone can see an easier way of validating this.

> Drop page cache hint is broken
> --
>
> Key: HBASE-17582
> URL: https://issues.apache.org/jira/browse/HBASE-17582
> Project: HBase
>  Issue Type: Bug
>  Components: Compaction, io
>Affects Versions: 2.0.0
>Reporter: Ashu Pachauri
>Assignee: Appy
>Priority: Critical
> Attachments: HBASE-17582.master.001.patch
>
>
> We pass a boolean for the drop-page-cache hint while creating store file 
> scanners and writers. 
> The hint is not passed down the stack by StoreFileWriter and 
> StoreFileScanner in the master branch.





[jira] [Commented] (HBASE-17582) Drop page cache hint is broken

2017-02-01 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15849270#comment-15849270
 ] 

Gary Helmling commented on HBASE-17582:
---

On the reader side, it looks like this was broken in the commit for HBASE-15236.

In the changes to StoreFileScanner, we no longer pass through canUseDrop to 
StoreFile.createReader():
{noformat}
@@ -115,11 +136,13 @@ public class StoreFileScanner implements KeyValueScanner {
   ScanQueryMatcher matcher, long readPt, boolean isPrimaryReplica) throws 
IOException {
 List scanners = new ArrayList(
 files.size());
-for (StoreFile file : files) {
-  StoreFileReader r = file.createReader(canUseDrop);
+List sorted_files = new ArrayList<>(files);
+Collections.sort(sorted_files, StoreFile.Comparators.SEQ_ID);
+for (int i = 0; i < sorted_files.size(); i++) {
+  StoreFileReader r = sorted_files.get(i).createReader();
   r.setReplicaStoreFile(isPrimaryReplica);
   StoreFileScanner scanner = r.getStoreFileScanner(cacheBlocks, usePread,
-  isCompaction, readPt);
+  isCompaction, readPt, i);
   scanner.setScanQueryMatcher(matcher);
   scanners.add(scanner);
 }
{noformat}

Following the code through, it looks like just passing along canUseDrop to 
createReader() should fix it.


[~saint@gmail.com], [~apeksharma]: any idea if this change was intentional 
or an oversight?  It looks like it's also broken for StoreFileWriter.

> Drop page cache hint is broken
> --
>
> Key: HBASE-17582
> URL: https://issues.apache.org/jira/browse/HBASE-17582
> Project: HBase
>  Issue Type: Bug
>  Components: Compaction, io
>Affects Versions: 2.0.0
>Reporter: Ashu Pachauri
>
> We pass a boolean for the drop-page-cache hint while creating store file 
> scanners and writers. 
> The hint is not passed down the stack by StoreFileWriter and 
> StoreFileScanner in the master branch.





[jira] [Commented] (HBASE-17057) Minor compactions should also drop page cache behind reads

2017-02-01 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15849235#comment-15849235
 ] 

Gary Helmling commented on HBASE-17057:
---

Since this is now based on configuration rather than overloading the compaction 
policy throttle threshold, the configuration keys need to be added to 
hbase-default.xml with a description.

It's also worth pointing out that defaulting to true for this on minor 
compactions is a change in behavior, which is questionable for a point release 
(if this goes into 1.3.1), though I'm not sure when you would want the old 
behavior.

Can you double check that this actually is effective on master?  Looking at 
StoreFileScanner.getScannersForStoreFiles (line 126), I see canUseDrop as a 
param, but don't see it used anywhere...

> Minor compactions should also drop page cache behind reads
> --
>
> Key: HBASE-17057
> URL: https://issues.apache.org/jira/browse/HBASE-17057
> Project: HBase
>  Issue Type: Improvement
>  Components: Compaction
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
> Fix For: 1.3.1
>
> Attachments: HBASE-17057.V1.patch
>
>
> Long compactions currently drop cache behind reads/writes so that they don't 
> pollute the page cache but short compactions don't do that. The bulk of the 
> data is actually read during minor compactions instead of major compactions,  
> and thrashes the page cache since it's mostly not needed. 
> We should drop page cache behind minor compactions too. 





[jira] [Commented] (HBASE-17346) Add coprocessor service support

2017-02-01 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15849208#comment-15849208
 ] 

Gary Helmling commented on HBASE-17346:
---

Took a look after the fact.  +1

Looks good to me.  At first glance, I thought that passing a stubMaker function 
for every call would be a little cumbersome, but in practice I think it works 
well.  It's more intentional and not much longer than just giving the service 
class, and allows more flexibility if needed.

> Add coprocessor service support
> ---
>
> Key: HBASE-17346
> URL: https://issues.apache.org/jira/browse/HBASE-17346
> Project: HBase
>  Issue Type: Sub-task
>  Components: asyncclient, Client, Coprocessors
>Affects Versions: 2.0.0
>Reporter: Duo Zhang
>Assignee: Duo Zhang
> Fix For: 2.0.0
>
> Attachments: 17346.suggestion.txt, HBASE-17346.patch, 
> HBASE-17346-v1.patch, HBASE-17346-v2.patch, HBASE-17346-v3.patch, 
> HBASE-17346-v4.patch, HBASE-17346-v5.patch
>
>
> I think we need to redesign the API.





[jira] [Commented] (HBASE-17381) ReplicationSourceWorkerThread can die due to unhandled exceptions

2017-02-01 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15849077#comment-15849077
 ] 

Gary Helmling commented on HBASE-17381:
---

[~openinx] on the current patches:

* since we are only aborting/stopping the regionserver, we can continue to use 
UncaughtExceptionHandler for this purpose.  Since we already create and attach 
an UncaughtExceptionHandler in ReplicationSourceWorkerThread.startup(), this 
seems like the right place to fix it.
* in the case of an OOME (as checked for in your initial patch), it seems fine 
to use Runtime.halt().  However, this is pretty extreme in any other case
* for other uncaught exceptions, it would be better to use 
Stoppable.stop(String reason).  A Stoppable instance (the regionserver) is 
passed through to ReplicationSourceManager.  We can use this instance to create 
a UEH that calls Stoppable.stop() if the exception we encounter is not an OOME.  
This will give regions a chance to close cleanly, etc. and speed recovery.
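A minimal sketch of the handler behavior described above (the Stoppable interface here is a stand-in for HBase's, and the wiring is illustrative rather than any committed patch):

```java
// Illustrative sketch: hard-exit on OOME, otherwise ask the server to stop
// cleanly so regions can close and recovery is fast.
public class ReplicationUehDemo {
    interface Stoppable {            // stand-in for org.apache.hadoop.hbase.Stoppable
        void stop(String reason);
    }

    static Thread.UncaughtExceptionHandler handlerFor(Stoppable server) {
        return (thread, error) -> {
            if (error instanceof OutOfMemoryError) {
                // Nothing can safely run after an OOME; halt the JVM.
                Runtime.getRuntime().halt(1);
            } else {
                // Clean stop gives regions a chance to close before shutdown.
                server.stop("Uncaught exception in " + thread.getName() + ": " + error);
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        final String[] stopReason = {null};
        Thread worker = new Thread(() -> { throw new IllegalStateException("boom"); });
        worker.setUncaughtExceptionHandler(handlerFor(reason -> stopReason[0] = reason));
        worker.start();
        worker.join();
        System.out.println(stopReason[0]); // stop() ran with the failure details
    }
}
```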

> ReplicationSourceWorkerThread can die due to unhandled exceptions
> -
>
> Key: HBASE-17381
> URL: https://issues.apache.org/jira/browse/HBASE-17381
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Reporter: Gary Helmling
>Assignee: huzheng
> Attachments: HBASE-17381.patch, HBASE-17381.v1.patch, 
> HBASE-17381.v2.patch
>
>
> If a ReplicationSourceWorkerThread encounters an unexpected exception in the 
> run() method (for example failure to allocate direct memory for the DFS 
> client), the exception will be logged by the UncaughtExceptionHandler, but 
> the thread will also die and the replication queue will back up indefinitely 
> until the Regionserver is restarted.
> We should make sure the worker thread is resilient to all exceptions that it 
> can actually handle.  For those that it really can't, it seems better to 
> abort the regionserver rather than just allow replication to stop with 
> minimal signal.
> Here is a sample exception:
> {noformat}
> ERROR regionserver.ReplicationSource: Unexpected exception in 
> ReplicationSourceWorkerThread, 
> currentPath=hdfs://.../hbase/WALs/XXXwalfilenameXXX
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.nio.Bits.reserveMemory(Bits.java:693)
> at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
> at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:96)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:113)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:108)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.createStreamPair(DataTransferSaslUtil.java:344)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:490)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:391)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
> at 
> org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:92)
> at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3444)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:778)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:695)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:356)
> at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:673)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:882)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:308)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
> at 
> 

[jira] [Updated] (HBASE-17578) Thrift per-method metrics should still update in the case of exceptions

2017-02-01 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-17578:
--
Attachment: HBASE-17578.001.patch

In addition to fixing the metrics InvocationHandlers for thrift 1 & 2 to always 
increment latencies, even in the case of exceptions, the attached patch also 
adds common exception handling and reporting for the thrift server 
implementations.  The metrics reporting for common exception types is now 
shared between regionservers, and thrift 1 & 2.
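The fix can be sketched with a JDK dynamic proxy that records latency in a finally block and counts exception types before rethrowing. Names below are hypothetical stand-ins, not the HBase thrift classes:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: record per-method latency whether or not the call throws.
public class MeteredProxyDemo {
    static final Map<String, Long> LATENCY_NANOS = new ConcurrentHashMap<>();
    static final Map<String, Long> EXCEPTION_COUNTS = new ConcurrentHashMap<>();

    interface Handler { String get(String row); }   // hypothetical thrift-style handler

    @SuppressWarnings("unchecked")
    static <T> T meter(Class<T> iface, T target) {
        InvocationHandler handler = (proxy, method, args) -> {
            long start = System.nanoTime();
            try {
                return method.invoke(target, args);
            } catch (InvocationTargetException e) {
                // Count the underlying exception type, then rethrow it.
                EXCEPTION_COUNTS.merge(e.getCause().getClass().getSimpleName(), 1L, Long::sum);
                throw e.getCause();
            } finally {
                // Runs on both success and failure, so outliers are never dropped.
                LATENCY_NANOS.put(method.getName(), System.nanoTime() - start);
            }
        };
        return (T) Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[]{iface}, handler);
    }

    public static void main(String[] args) {
        Handler metered = meter(Handler.class, row -> {
            if (row.isEmpty()) throw new IllegalArgumentException("empty row");
            return "value-for-" + row;
        });
        System.out.println(metered.get("r1"));
        try { metered.get(""); } catch (IllegalArgumentException expected) { }
        System.out.println(LATENCY_NANOS.containsKey("get"));              // true even after the exception
        System.out.println(EXCEPTION_COUNTS.get("IllegalArgumentException")); // 1
    }
}
```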

> Thrift per-method metrics should still update in the case of exceptions
> ---
>
> Key: HBASE-17578
> URL: https://issues.apache.org/jira/browse/HBASE-17578
> Project: HBase
>  Issue Type: Bug
>  Components: Thrift
>Reporter: Gary Helmling
>Assignee: Gary Helmling
> Fix For: 1.3.1
>
> Attachments: HBASE-17578.001.patch
>
>
> Currently, the InvocationHandler used to update per-method metrics in the 
> Thrift server fails to update metrics if an exception occurs.  This causes us 
> to miss outliers.  We should include exceptional cases in per-method 
> latencies, and also look at adding specific exception rate metrics.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HBASE-17578) Thrift per-method metrics should still update in the case of exceptions

2017-02-01 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-17578:
--
Status: Patch Available  (was: Open)

> Thrift per-method metrics should still update in the case of exceptions
> ---
>
> Key: HBASE-17578
> URL: https://issues.apache.org/jira/browse/HBASE-17578
> Project: HBase
>  Issue Type: Bug
>  Components: Thrift
>Reporter: Gary Helmling
>Assignee: Gary Helmling
> Fix For: 1.3.1
>
> Attachments: HBASE-17578.001.patch
>
>
> Currently, the InvocationHandler used to update per-method metrics in the 
> Thrift server fails to update metrics if an exception occurs.  This causes us 
> to miss outliers.  We should include exceptional cases in per-method 
> latencies, and also look at adding specific exception rate metrics.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HBASE-17578) Thrift per-method metrics should still update in the case of exceptions

2017-02-01 Thread Gary Helmling (JIRA)
Gary Helmling created HBASE-17578:
-

 Summary: Thrift per-method metrics should still update in the case 
of exceptions
 Key: HBASE-17578
 URL: https://issues.apache.org/jira/browse/HBASE-17578
 Project: HBase
  Issue Type: Bug
  Components: Thrift
Reporter: Gary Helmling
Assignee: Gary Helmling
 Fix For: 1.3.1


Currently, the InvocationHandler used to update per-method metrics in the 
Thrift server fails to update metrics if an exception occurs.  This causes us 
to miss outliers.  We should include exceptional cases in per-method latencies, 
and also look at adding specific exception rate metrics.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HBASE-17381) ReplicationSourceWorkerThread can die due to unhandled exceptions

2017-01-26 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15840344#comment-15840344
 ] 

Gary Helmling commented on HBASE-17381:
---

I sent a message to the dev list to see if others have strong opinions.  I 
generally favor aborting for any unhandled exception, but maybe there are cases 
I'm missing.



> ReplicationSourceWorkerThread can die due to unhandled exceptions
> -
>
> Key: HBASE-17381
> URL: https://issues.apache.org/jira/browse/HBASE-17381
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Reporter: Gary Helmling
>Assignee: huzheng
> Attachments: HBASE-17381.patch, HBASE-17381.v1.patch, 
> HBASE-17381.v2.patch
>
>
> If a ReplicationSourceWorkerThread encounters an unexpected exception in the 
> run() method (for example failure to allocate direct memory for the DFS 
> client), the exception will be logged by the UncaughtExceptionHandler, but 
> the thread will also die and the replication queue will back up indefinitely 
> until the Regionserver is restarted.
> We should make sure the worker thread is resilient to all exceptions that it 
> can actually handle.  For those that it really can't, it seems better to 
> abort the regionserver rather than just allow replication to stop with 
> minimal signal.
> Here is a sample exception:
> {noformat}
> ERROR regionserver.ReplicationSource: Unexpected exception in 
> ReplicationSourceWorkerThread, 
> currentPath=hdfs://.../hbase/WALs/XXXwalfilenameXXX
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.nio.Bits.reserveMemory(Bits.java:693)
> at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
> at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:96)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:113)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:108)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.createStreamPair(DataTransferSaslUtil.java:344)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:490)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:391)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
> at 
> org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:92)
> at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3444)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:778)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:695)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:356)
> at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:673)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:882)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:308)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-17381) ReplicationSourceWorkerThread can die due to unhandled exceptions

2017-01-19 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15830926#comment-15830926
 ] 

Gary Helmling commented on HBASE-17381:
---

[~openinx]  Thanks for the patch.  This is an improvement.

Do you think it ever makes sense to let the ReplicationSourceWorkerThread die 
and not abort the regionserver?  In this case, replication will simply stop 
with little visibility into the failure.

It seems better to me to just always abort in the case of an unexpected 
Throwable thrown from runLoop().  Anything recoverable should be handled in a 
try/catch in runLoop itself.
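The proposed control flow can be sketched as follows. This is a simplified illustration, not the actual ReplicationSource code: `Abortable`, `runLoop()`, and the retry handling are hypothetical stand-ins, but the shape matches the suggestion above — recoverable conditions are caught inside runLoop, and anything that escapes it aborts the server instead of silently killing the thread:

```java
public class WorkerSketch {
  public interface Abortable { void abort(String why, Throwable t); }

  static class Worker implements Runnable {
    private final Abortable server;
    private int attempts = 0;

    Worker(Abortable server) { this.server = server; }

    @Override
    public void run() {
      try {
        runLoop();
      } catch (Throwable t) {
        // Anything that escapes runLoop() is unexpected: abort loudly rather
        // than letting the thread die and replication silently stall.
        server.abort("Unexpected exception in replication worker", t);
      }
    }

    void runLoop() {
      while (attempts < 3) {
        try {
          attempts++;
          doOneBatch();
          return;
        } catch (java.io.IOException recoverable) {
          // Recoverable conditions are handled here with retry/backoff.
        }
      }
      throw new IllegalStateException("retries exhausted");
    }

    void doOneBatch() throws java.io.IOException {
      throw new java.io.IOException("transient read failure");
    }
  }

  public static void main(String[] args) {
    final boolean[] aborted = {false};
    Worker w = new Worker((why, t) -> aborted[0] = true);
    w.run();
    System.out.println(aborted[0]);  // prints true: the failure was surfaced
  }
}
```

With this structure there is no window where an unhandled Throwable leaves the queue backing up with only an UncaughtExceptionHandler log line as evidence.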

> ReplicationSourceWorkerThread can die due to unhandled exceptions
> -
>
> Key: HBASE-17381
> URL: https://issues.apache.org/jira/browse/HBASE-17381
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Reporter: Gary Helmling
>Assignee: huzheng
> Attachments: HBASE-17381.patch
>
>
> If a ReplicationSourceWorkerThread encounters an unexpected exception in the 
> run() method (for example failure to allocate direct memory for the DFS 
> client), the exception will be logged by the UncaughtExceptionHandler, but 
> the thread will also die and the replication queue will back up indefinitely 
> until the Regionserver is restarted.
> We should make sure the worker thread is resilient to all exceptions that it 
> can actually handle.  For those that it really can't, it seems better to 
> abort the regionserver rather than just allow replication to stop with 
> minimal signal.
> Here is a sample exception:
> {noformat}
> ERROR regionserver.ReplicationSource: Unexpected exception in 
> ReplicationSourceWorkerThread, 
> currentPath=hdfs://.../hbase/WALs/XXXwalfilenameXXX
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.nio.Bits.reserveMemory(Bits.java:693)
> at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
> at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:96)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:113)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:108)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.createStreamPair(DataTransferSaslUtil.java:344)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:490)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:391)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
> at 
> org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:92)
> at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3444)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:778)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:695)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:356)
> at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:673)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:882)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:308)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-17439) Make authentication Token retrieval amenable to coprocessor

2017-01-10 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15816931#comment-15816931
 ] 

Gary Helmling commented on HBASE-17439:
---

That doesn't explain why the remote call needs to be done as the end user.  In 
general, we should not be using UGI.doAs() to carry through user identities.  
It is better to pass them explicitly.
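The contrast can be illustrated with a small sketch in plain Java. Hadoop's actual `UserGroupInformation` API is not used here; the `User` class and both check methods are hypothetical. The point is that a doAs-style call carries identity in ambient thread context, while the preferred shape passes it as an ordinary, auditable argument:

```java
public class ExplicitUser {
  static final class User {
    final String name;
    User(String name) { this.name = name; }
  }

  // Discouraged shape: identity is ambient, carried by thread-local context,
  // analogous to what UGI.doAs() does under the hood.
  static final ThreadLocal<User> CURRENT = new ThreadLocal<>();
  static boolean checkAccessAmbient() {
    return "alice".equals(CURRENT.get().name);
  }

  // Preferred shape: identity is an explicit parameter; no hidden state,
  // and it survives hops across executor threads without extra plumbing.
  static boolean checkAccess(User caller) {
    return "alice".equals(caller.name);
  }

  public static void main(String[] args) {
    User alice = new User("alice");

    // doAs-style: set ambient identity, run, restore.  Easy to get wrong.
    CURRENT.set(alice);
    boolean ambientOk = checkAccessAmbient();
    CURRENT.remove();

    // explicit: the caller is just another argument.
    boolean explicitOk = checkAccess(alice);

    System.out.println(ambientOk + " " + explicitOk);  // prints "true true"
  }
}
```

Both paths reach the same decision, but the explicit form makes the acting user visible at every call site, which is the property being argued for here.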

> Make authentication Token retrieval amenable to coprocessor
> ---
>
> Key: HBASE-17439
> URL: https://issues.apache.org/jira/browse/HBASE-17439
> Project: HBase
>  Issue Type: Improvement
>  Components: Coprocessors, security
>Reporter: Ted Yu
>
> Here is snippet of stack trace from HBASE-17435:
> {code}
> at 
> org.apache.hadoop.hbase.backup.BackupObserver.preCommitStoreFile(BackupObserver.java:89)
> at 
> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$61.call(RegionCoprocessorHost.java:1494)
> at 
> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1660)
> at 
> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1734)
> at 
> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1692)
> at 
> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.preCommitStoreFile(RegionCoprocessorHost.java:1490)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(HRegion.java:5512)
> at 
> org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint$1.run(SecureBulkLoadEndpoint.java:293)
> at 
> org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint$1.run(SecureBulkLoadEndpoint.java:276)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:356)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1704)
> at 
> org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint.secureBulkLoadHFiles(SecureBulkLoadEndpoint.java:276)
> {code}
> The ugi obtained from RPC on the server side does not contain the required 
> Kerberos credentials to access the hbase table. Hence the need to pass the 
> authentication Token from the region server onto the ugi.
> In the course of solving HBASE-17435, [~jerryhe] and I noticed that it is 
> cumbersome for other coprocessor (such as SecureBulkLoadEndpoint) to retrieve 
> authentication Token from region server.
> Currently a Connection is needed to communicate with TokenProvider. Care is 
> needed not to introduce dead lock on the server side.
> This JIRA is to investigate feasibility of bypassing Connection / 
> TokenProvider in the retrieval of authentication Token for custom 
> coprocessor. This involves some refactoring around 
> AuthenticationTokenSecretManager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-17439) Make authentication Token retrieval amenable to coprocessor

2017-01-10 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15816875#comment-15816875
 ] 

Gary Helmling commented on HBASE-17439:
---

Why does HRegion.bulkLoadHFiles() need to run as the end user?  What is 
BackupObserver.preCommitStoreFile() doing and why does it need an auth token?  
Is it doing a remote call?

> Make authentication Token retrieval amenable to coprocessor
> ---
>
> Key: HBASE-17439
> URL: https://issues.apache.org/jira/browse/HBASE-17439
> Project: HBase
>  Issue Type: Improvement
>  Components: Coprocessors, security
>Reporter: Ted Yu
>
> Here is snippet of stack trace from HBASE-17435:
> {code}
> at 
> org.apache.hadoop.hbase.backup.BackupObserver.preCommitStoreFile(BackupObserver.java:89)
> at 
> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$61.call(RegionCoprocessorHost.java:1494)
> at 
> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1660)
> at 
> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1734)
> at 
> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1692)
> at 
> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.preCommitStoreFile(RegionCoprocessorHost.java:1490)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(HRegion.java:5512)
> at 
> org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint$1.run(SecureBulkLoadEndpoint.java:293)
> at 
> org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint$1.run(SecureBulkLoadEndpoint.java:276)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:356)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1704)
> at 
> org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint.secureBulkLoadHFiles(SecureBulkLoadEndpoint.java:276)
> {code}
> The ugi obtained from RPC on the server side does not contain the required 
> Kerberos credentials to access the hbase table. Hence the need to pass the 
> authentication Token from the region server onto the ugi.
> In the course of solving HBASE-17435, [~jerryhe] and I noticed that it is 
> cumbersome for other coprocessor (such as SecureBulkLoadEndpoint) to retrieve 
> authentication Token from region server.
> Currently a Connection is needed to communicate with TokenProvider. Care is 
> needed not to introduce dead lock on the server side.
> This JIRA is to investigate feasibility of bypassing Connection / 
> TokenProvider in the retrieval of authentication Token for custom 
> coprocessor. This involves some refactoring around 
> AuthenticationTokenSecretManager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-17439) Make authentication Token retrieval amenable to coprocessor

2017-01-10 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15815871#comment-15815871
 ] 

Gary Helmling commented on HBASE-17439:
---

Can you explain a bit the use case for why the coprocessor needs an 
authentication token?  The coprocessor is already running in process with the 
regionserver, meaning it has the regionserver's Kerberos credentials.  What is 
the authentication token used for?

> Make authentication Token retrieval amenable to coprocessor
> ---
>
> Key: HBASE-17439
> URL: https://issues.apache.org/jira/browse/HBASE-17439
> Project: HBase
>  Issue Type: Improvement
>  Components: Coprocessors, security
>Reporter: Ted Yu
>
> In the course of solving HBASE-17435, [~jerryhe] and I noticed that it is 
> cumbersome for other coprocessor (such as SecureBulkLoadEndpoint) to retrieve 
> authentication Token from region server.
> Currently a Connection is needed to communicate with TokenProvider. Care is 
> needed not to introduce dead lock on the server side.
> This JIRA is to investigate feasibility of bypassing Connection / 
> TokenProvider in the retrieval of authentication Token for custom 
> coprocessor. This involves some refactoring around 
> AuthenticationTokenSecretManager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


  1   2   3   4   5   6   7   8   9   10   >