[jira] [Commented] (HBASE-22457) Harden the HBase HFile reader reference counting

2019-12-08 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16990932#comment-16990932
 ] 

Viraj Jasani commented on HBASE-22457:
--

{quote} # detect run-away refCount by comparing any reader's refCount with the 
actual number of open scanners (which we track for each HRegion) if the 
refCount is larger we know we have a problem.
 # (a variation) when we attempt to archive an HFile that has refCount, check 
if there're any open scanners, if not archive anyway.

For #1 at least we could enhance the logging and include the number of 
currently scanners in the log (where we say that we cannot archive an HFile)
{quote}
For #1, since open scanners are tracked at HRegion(RegionScannerImpl) and not 
at Store level, we might not be able to compare refCount with open scanners? 
Also, #2 might also not be true due to open scanners at region level 
(non-compacted store files) and not at store level? I was thinking if we can 
also track no of open scanners at store level.

I was just going though comments here while looking into HBASE-23349 (refCount 
1 preventing archival of compacted store files)

> Harden the HBase HFile reader reference counting
> 
>
> Key: HBASE-22457
> URL: https://issues.apache.org/jira/browse/HBASE-22457
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Priority: Major
> Attachments: 22457-random-1.5.txt
>
>
> The problem that any coprocessor hook that replaces a passed scanner without 
> closing it can cause an incorrect reference count.
> This was bad and wrong before of course, but now it has pretty bad 
> consequences, since an incorrect reference could will prevent HFiles from 
> being archived indefinitely.
> All hooks that are passed a scanner and return a scanner are suspect, since 
> the returned scanner may or may not close the passed scanner:
> * preCompact
> * preCompactScannerOpen
> * preFlush
> * preFlushScannerOpen
> * preScannerOpen
> * preStoreScannerOpen
> * preStoreFileReaderOpen...? (not sure about this one, it could mess with the 
> reader)
> I sampled the Phoenix and also Tephra code, and found a few instances where 
> this is happening.
> And for those I filed issued: TEPHRA-300, PHOENIX-5291
> (We're not using Tephra)
> The Phoenix ones should be rare. In our case we are seeing readers with 
> refCount > 1000.
> Perhaps there are other issues, a path where not all exceptions are caught 
> and scanner is left open that way perhaps. (Generally I am not a fan of 
> reference counting in complex systems - it's too easy to miss something. But 
> that's a different discussion. :) ).
> Let's brainstorm some way in which we can harden this.
> [~ram_krish], [~anoop.hbase], [~apurtell]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-22457) Harden the HBase HFile reader reference counting

2019-05-24 Thread Andrew Purtell (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16847806#comment-16847806
 ] 

Andrew Purtell commented on HBASE-22457:


[~lhofhansl] do you want to really submit that patch? put this in PA status etc?

> Harden the HBase HFile reader reference counting
> 
>
> Key: HBASE-22457
> URL: https://issues.apache.org/jira/browse/HBASE-22457
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Attachments: 22457-random-1.5.txt
>
>
> The problem that any coprocessor hook that replaces a passed scanner without 
> closing it can cause an incorrect reference count.
> This was bad and wrong before of course, but now it has pretty bad 
> consequences, since an incorrect reference could will prevent HFiles from 
> being archived indefinitely.
> All hooks that are passed a scanner and return a scanner are suspect, since 
> the returned scanner may or may not close the passed scanner:
> * preCompact
> * preCompactScannerOpen
> * preFlush
> * preFlushScannerOpen
> * preScannerOpen
> * preStoreScannerOpen
> * preStoreFileReaderOpen...? (not sure about this one, it could mess with the 
> reader)
> I sampled the Phoenix and also Tephra code, and found a few instances where 
> this is happening.
> And for those I filed issued: TEPHRA-300, PHOENIX-5291
> (We're not using Tephra)
> The Phoenix ones should be rare. In our case we are seeing readers with 
> refCount > 1000.
> Perhaps there are other issues, a path where not all exceptions are caught 
> and scanner is left open that way perhaps. (Generally I am not a fan of 
> reference counting in complex systems - it's too easy to miss something. But 
> that's a different discussion. :) ).
> Let's brainstorm some way in which we can harden this.
> [~ram_krish], [~anoop.hbase], [~apurtell]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22457) Harden the HBase HFile reader reference counting

2019-05-22 Thread Andrew Purtell (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846265#comment-16846265
 ] 

Andrew Purtell commented on HBASE-22457:


I opened subtask HBASE-22460 (Reopen a region if store reader references may 
have leaked)

> Harden the HBase HFile reader reference counting
> 
>
> Key: HBASE-22457
> URL: https://issues.apache.org/jira/browse/HBASE-22457
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Attachments: 22457-random-1.5.txt
>
>
> The problem that any coprocessor hook that replaces a passed scanner without 
> closing it can cause an incorrect reference count.
> This was bad and wrong before of course, but now it has pretty bad 
> consequences, since an incorrect reference could will prevent HFiles from 
> being archived indefinitely.
> All hooks that are passed a scanner and return a scanner are suspect, since 
> the returned scanner may or may not close the passed scanner:
> * preCompact
> * preCompactScannerOpen
> * preFlush
> * preFlushScannerOpen
> * preScannerOpen
> * preStoreScannerOpen
> * preStoreFileReaderOpen...? (not sure about this one, it could mess with the 
> reader)
> I sampled the Phoenix and also Tephra code, and found a few instances where 
> this is happening.
> And for those I filed issued: TEPHRA-300, PHOENIX-5291
> (We're not using Tephra)
> The Phoenix ones should be rare. In our case we are seeing readers with 
> refCount > 1000.
> Perhaps there are other issues, a path where not all exceptions are caught 
> and scanner is left open that way perhaps. (Generally I am not a fan of 
> reference counting in complex systems - it's too easy to miss something. But 
> that's a different discussion. :) ).
> Let's brainstorm some way in which we can harden this.
> [~ram_krish], [~anoop.hbase], [~apurtell]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22457) Harden the HBase HFile reader reference counting

2019-05-22 Thread Andrew Purtell (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846262#comment-16846262
 ] 

Andrew Purtell commented on HBASE-22457:


Java offers Closable and AutoClosable interfaces that will trigger some 
checking behavior in the compiler, but offhand I think the static check is only 
useful if the resource is acquired and should be closed in the same method 
block.

> Harden the HBase HFile reader reference counting
> 
>
> Key: HBASE-22457
> URL: https://issues.apache.org/jira/browse/HBASE-22457
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Attachments: 22457-random-1.5.txt
>
>
> The problem that any coprocessor hook that replaces a passed scanner without 
> closing it can cause an incorrect reference count.
> This was bad and wrong before of course, but now it has pretty bad 
> consequences, since an incorrect reference could will prevent HFiles from 
> being archived indefinitely.
> All hooks that are passed a scanner and return a scanner are suspect, since 
> the returned scanner may or may not close the passed scanner:
> * preCompact
> * preCompactScannerOpen
> * preFlush
> * preFlushScannerOpen
> * preScannerOpen
> * preStoreScannerOpen
> * preStoreFileReaderOpen...? (not sure about this one, it could mess with the 
> reader)
> I sampled the Phoenix and also Tephra code, and found a few instances where 
> this is happening.
> And for those I filed issued: TEPHRA-300, PHOENIX-5291
> (We're not using Tephra)
> The Phoenix ones should be rare. In our case we are seeing readers with 
> refCount > 1000.
> Perhaps there are other issues, a path where not all exceptions are caught 
> and scanner is left open that way perhaps. (Generally I am not a fan of 
> reference counting in complex systems - it's too easy to miss something. But 
> that's a different discussion. :) ).
> Let's brainstorm some way in which we can harden this.
> [~ram_krish], [~anoop.hbase], [~apurtell]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22457) Harden the HBase HFile reader reference counting

2019-05-22 Thread Lars Hofhansl (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846258#comment-16846258
 ] 

Lars Hofhansl commented on HBASE-22457:
---

> No, but this is no more or less fast then the close then open we do for 
> 'alter' processing. It would be implemented the same way, ideally. 

We came to that conclusion as well in a discussion in the office. Just alter 
some minor thing on the Table/ColumnDescriptor so that all regions are 
closed/reopened resulting "Wouldn't it be nice if we had a tool that could do 
that without forcing us to change something." :)

> Scanner wrapping is a key thing. Without it I don't think Phoenix works. 

Oh totally agree. Though perhaps it is structurally somehow possible to ensure 
that passed scanner is either wrapped or closed. I can't think of anything, 
though.

> Harden the HBase HFile reader reference counting
> 
>
> Key: HBASE-22457
> URL: https://issues.apache.org/jira/browse/HBASE-22457
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Attachments: 22457-random-1.5.txt
>
>
> The problem that any coprocessor hook that replaces a passed scanner without 
> closing it can cause an incorrect reference count.
> This was bad and wrong before of course, but now it has pretty bad 
> consequences, since an incorrect reference could will prevent HFiles from 
> being archived indefinitely.
> All hooks that are passed a scanner and return a scanner are suspect, since 
> the returned scanner may or may not close the passed scanner:
> * preCompact
> * preCompactScannerOpen
> * preFlush
> * preFlushScannerOpen
> * preScannerOpen
> * preStoreScannerOpen
> * preStoreFileReaderOpen...? (not sure about this one, it could mess with the 
> reader)
> I sampled the Phoenix and also Tephra code, and found a few instances where 
> this is happening.
> And for those I filed issued: TEPHRA-300, PHOENIX-5291
> (We're not using Tephra)
> The Phoenix ones should be rare. In our case we are seeing readers with 
> refCount > 1000.
> Perhaps there are other issues, a path where not all exceptions are caught 
> and scanner is left open that way perhaps. (Generally I am not a fan of 
> reference counting in complex systems - it's too easy to miss something. But 
> that's a different discussion. :) ).
> Let's brainstorm some way in which we can harden this.
> [~ram_krish], [~anoop.hbase], [~apurtell]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22457) Harden the HBase HFile reader reference counting

2019-05-22 Thread Andrew Purtell (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846253#comment-16846253
 ] 

Andrew Purtell commented on HBASE-22457:


bq. I do like your idea. Can we do a fast close without flushing the memstore? 
(Otherwise it might not be "fast" )

No, but this is no more or less fast then the close then open we do for 'alter' 
processing. It would be implemented the same way, ideally. 

> Harden the HBase HFile reader reference counting
> 
>
> Key: HBASE-22457
> URL: https://issues.apache.org/jira/browse/HBASE-22457
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Attachments: 22457-random-1.5.txt
>
>
> The problem that any coprocessor hook that replaces a passed scanner without 
> closing it can cause an incorrect reference count.
> This was bad and wrong before of course, but now it has pretty bad 
> consequences, since an incorrect reference could will prevent HFiles from 
> being archived indefinitely.
> All hooks that are passed a scanner and return a scanner are suspect, since 
> the returned scanner may or may not close the passed scanner:
> * preCompact
> * preCompactScannerOpen
> * preFlush
> * preFlushScannerOpen
> * preScannerOpen
> * preStoreScannerOpen
> * preStoreFileReaderOpen...? (not sure about this one, it could mess with the 
> reader)
> I sampled the Phoenix and also Tephra code, and found a few instances where 
> this is happening.
> And for those I filed issued: TEPHRA-300, PHOENIX-5291
> (We're not using Tephra)
> The Phoenix ones should be rare. In our case we are seeing readers with 
> refCount > 1000.
> Perhaps there are other issues, a path where not all exceptions are caught 
> and scanner is left open that way perhaps. (Generally I am not a fan of 
> reference counting in complex systems - it's too easy to miss something. But 
> that's a different discussion. :) ).
> Let's brainstorm some way in which we can harden this.
> [~ram_krish], [~anoop.hbase], [~apurtell]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22457) Harden the HBase HFile reader reference counting

2019-05-22 Thread Andrew Purtell (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846252#comment-16846252
 ] 

Andrew Purtell commented on HBASE-22457:


bq. What I'm really looking for is a structural fix where a coprocessor cannot 
mess things. Perhaps that's not possible without severely limiting what 
coprocessors are allowed to do.

Scanner wrapping is a key thing. Without it I don't think Phoenix works. 

> Harden the HBase HFile reader reference counting
> 
>
> Key: HBASE-22457
> URL: https://issues.apache.org/jira/browse/HBASE-22457
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Attachments: 22457-random-1.5.txt
>
>
> The problem that any coprocessor hook that replaces a passed scanner without 
> closing it can cause an incorrect reference count.
> This was bad and wrong before of course, but now it has pretty bad 
> consequences, since an incorrect reference could will prevent HFiles from 
> being archived indefinitely.
> All hooks that are passed a scanner and return a scanner are suspect, since 
> the returned scanner may or may not close the passed scanner:
> * preCompact
> * preCompactScannerOpen
> * preFlush
> * preFlushScannerOpen
> * preScannerOpen
> * preStoreScannerOpen
> * preStoreFileReaderOpen...? (not sure about this one, it could mess with the 
> reader)
> I sampled the Phoenix and also Tephra code, and found a few instances where 
> this is happening.
> And for those I filed issued: TEPHRA-300, PHOENIX-5291
> (We're not using Tephra)
> The Phoenix ones should be rare. In our case we are seeing readers with 
> refCount > 1000.
> Perhaps there are other issues, a path where not all exceptions are caught 
> and scanner is left open that way perhaps. (Generally I am not a fan of 
> reference counting in complex systems - it's too easy to miss something. But 
> that's a different discussion. :) ).
> Let's brainstorm some way in which we can harden this.
> [~ram_krish], [~anoop.hbase], [~apurtell]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22457) Harden the HBase HFile reader reference counting

2019-05-22 Thread Andrew Purtell (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846249#comment-16846249
 ] 

Andrew Purtell commented on HBASE-22457:


Random patch lgtm

One thought:
{code}
+  if (rsh.s != scanner) {
+LOG.warn("Scanner name " + scannerName + " does not match the actual 
scanner used!");
+rsh.s.close();
+  }
{code}
Worth implementing toString on scanner impl classes so we have potentially more 
than scannerName to go on when debugging this?

> Harden the HBase HFile reader reference counting
> 
>
> Key: HBASE-22457
> URL: https://issues.apache.org/jira/browse/HBASE-22457
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Attachments: 22457-random-1.5.txt
>
>
> The problem that any coprocessor hook that replaces a passed scanner without 
> closing it can cause an incorrect reference count.
> This was bad and wrong before of course, but now it has pretty bad 
> consequences, since an incorrect reference could will prevent HFiles from 
> being archived indefinitely.
> All hooks that are passed a scanner and return a scanner are suspect, since 
> the returned scanner may or may not close the passed scanner:
> * preCompact
> * preCompactScannerOpen
> * preFlush
> * preFlushScannerOpen
> * preScannerOpen
> * preStoreScannerOpen
> * preStoreFileReaderOpen...? (not sure about this one, it could mess with the 
> reader)
> I sampled the Phoenix and also Tephra code, and found a few instances where 
> this is happening.
> And for those I filed issued: TEPHRA-300, PHOENIX-5291
> (We're not using Tephra)
> The Phoenix ones should be rare. In our case we are seeing readers with 
> refCount > 1000.
> Perhaps there are other issues, a path where not all exceptions are caught 
> and scanner is left open that way perhaps. (Generally I am not a fan of 
> reference counting in complex systems - it's too easy to miss something. But 
> that's a different discussion. :) ).
> Let's brainstorm some way in which we can harden this.
> [~ram_krish], [~anoop.hbase], [~apurtell]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22457) Harden the HBase HFile reader reference counting

2019-05-22 Thread Lars Hofhansl (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846172#comment-16846172
 ] 

Lars Hofhansl commented on HBASE-22457:
---

Some random things that didn't quite look right.

> Harden the HBase HFile reader reference counting
> 
>
> Key: HBASE-22457
> URL: https://issues.apache.org/jira/browse/HBASE-22457
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Priority: Major
> Attachments: 22457-random-1.5.txt
>
>
> The problem that any coprocessor hook that replaces a passed scanner without 
> closing it can cause an incorrect reference count.
> This was bad and wrong before of course, but now it has pretty bad 
> consequences, since an incorrect reference could will prevent HFiles from 
> being archived indefinitely.
> All hooks that are passed a scanner and return a scanner are suspect, since 
> the returned scanner may or may not close the passed scanner:
> * preCompact
> * preCompactScannerOpen
> * preFlush
> * preFlushScannerOpen
> * preScannerOpen
> * preStoreScannerOpen
> * preStoreFileReaderOpen...? (not sure about this one, it could mess with the 
> reader)
> I sampled the Phoenix and also Tephra code, and found a few instances where 
> this is happening.
> And for those I filed issued: TEPHRA-300, PHOENIX-5291
> (We're not using Tephra)
> The Phoenix ones should be rare. In our case we are seeing readers with 
> refCount > 1000.
> Perhaps there are other issues, a path where not all exceptions are caught 
> and scanner is left open that way perhaps. (Generally I am not a fan of 
> reference counting in complex systems - it's too easy to miss something. But 
> that's a different discussion. :) ).
> Let's brainstorm some way in which we can harden this.
> [~ram_krish], [~anoop.hbase], [~apurtell]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22457) Harden the HBase HFile reader reference counting

2019-05-22 Thread Lars Hofhansl (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846161#comment-16846161
 ] 

Lars Hofhansl commented on HBASE-22457:
---

Oh, it's a hack for sure and, as you point out, might hide the actual problem.

I do like your idea. Can we do a fast close without flushing the memstore? 
(Otherwise it might not be "fast" :))

Another hacks/ideas:
# detect run-away refCount by comparing any reader's refCount with the actual 
number of open scanners (which we track for each HRegion) if the refCount is 
larger we know we have a problem.
# (a variation) when we attempt to archive an HFile that has refCount, check if 
there're any open scanners, if not archive anyway.

For #1 at least we could enhance the logging and include the number of 
currently scanners in the log (where we say that we cannot archive an HFile)

What I'm really looking for is a structural fix where a coprocessor cannot mess 
things. Perhaps that's not possible without severely limiting what coprocessors 
are allowed to do.


> Harden the HBase HFile reader reference counting
> 
>
> Key: HBASE-22457
> URL: https://issues.apache.org/jira/browse/HBASE-22457
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Priority: Major
>
> The problem that any coprocessor hook that replaces a passed scanner without 
> closing it can cause an incorrect reference count.
> This was bad and wrong before of course, but now it has pretty bad 
> consequences, since an incorrect reference could will prevent HFiles from 
> being archived indefinitely.
> All hooks that are passed a scanner and return a scanner are suspect, since 
> the returned scanner may or may not close the passed scanner:
> * preCompact
> * preCompactScannerOpen
> * preFlush
> * preFlushScannerOpen
> * preScannerOpen
> * preStoreScannerOpen
> * preStoreFileReaderOpen...? (not sure about this one, it could mess with the 
> reader)
> I sampled the Phoenix and also Tephra code, and found a few instances where 
> this is happening.
> And for those I filed issued: TEPHRA-300, PHOENIX-5291
> (We're not using Tephra)
> The Phoenix ones should be rare. In our case we are seeing readers with 
> refCount > 1000.
> Perhaps there are other issues, a path where not all exceptions are caught 
> and scanner is left open that way perhaps. (Generally I am not a fan of 
> reference counting in complex systems - it's too easy to miss something. But 
> that's a different discussion. :) ).
> Let's brainstorm some way in which we can harden this.
> [~ram_krish], [~anoop.hbase], [~apurtell]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22457) Harden the HBase HFile reader reference counting

2019-05-22 Thread Andrew Purtell (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846109#comment-16846109
 ] 

Andrew Purtell commented on HBASE-22457:


[~lhofhansl] 

bq. Are there any "safepoints"? I.e. points where we know what the reference 
could should be, so we can reset it? An obvious one is when there are no 
scanners running at all; in that case we could reset all refCounts to 0.

Interesting idea. Debating if it's a bit of a hack. In a refcounting system the 
number of resource acquisitions should equal the number of releases or else 
it's a bug. By hiding this class of bugs by behind the scenes twiddling with 
the counter would we encourage sloppy code? 

If considering hacks, a decent mitigation for leaks would be a fast close and 
reopen of the region on the same server (initiated by the RS) will release all 
resources, like the refcount, leases, etc. The clients should gracefully ride 
over this like any other region transition. If the refcount is over some 
ridiculous threshold this mitigation could be triggered along with a fat WARN 
in the logs. This is orthogonal to hardening though so if we want to pursue the 
idea I could open a subtask.

> Harden the HBase HFile reader reference counting
> 
>
> Key: HBASE-22457
> URL: https://issues.apache.org/jira/browse/HBASE-22457
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Priority: Major
>
> The problem that any coprocessor hook that replaces a passed scanner without 
> closing it can cause an incorrect reference count.
> This was bad and wrong before of course, but now it has pretty bad 
> consequences, since an incorrect reference could will prevent HFiles from 
> being archived indefinitely.
> All hooks that are passed a scanner and return a scanner are suspect, since 
> the returned scanner may or may not close the passed scanner:
> * preCompact
> * preCompactScannerOpen
> * preFlush
> * preFlushScannerOpen
> * preScannerOpen
> * preStoreScannerOpen
> * preStoreFileReaderOpen...? (not sure about this one, it could mess with the 
> reader)
> I sampled the Phoenix and also Tephra code, and found a few instances where 
> this is happening.
> And for those I filed issued: TEPHRA-300, PHOENIX-5291
> (We're not using Tephra)
> The Phoenix ones should be rare. In our case we are seeing readers with 
> refCount > 1000.
> Perhaps there are other issues, a path where not all exceptions are caught 
> and scanner is left open that way perhaps. (Generally I am not a fan of 
> reference counting in complex systems - it's too easy to miss something. But 
> that's a different discussion. :) ).
> Let's brainstorm some way in which we can harden this.
> [~ram_krish], [~anoop.hbase], [~apurtell]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22457) Harden the HBase HFile reader reference counting

2019-05-22 Thread Andrew Purtell (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846100#comment-16846100
 ] 

Andrew Purtell commented on HBASE-22457:


bq. The Phoenix ones should be rare. In our case we are seeing readers with 
refCount > 1000.

This seems to be happening in conjunction with exceptions thrown when clients 
disconnect during the scan. So one action item here is to review exception 
handling in all scanner related paths in HBase and Phoenix and ensure 
references are released at one point in every exception chain. 

> Harden the HBase HFile reader reference counting
> 
>
> Key: HBASE-22457
> URL: https://issues.apache.org/jira/browse/HBASE-22457
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Priority: Major
>
> The problem that any coprocessor hook that replaces a passed scanner without 
> closing it can cause an incorrect reference count.
> This was bad and wrong before of course, but now it has pretty bad 
> consequences, since an incorrect reference could will prevent HFiles from 
> being archived indefinitely.
> All hooks that are passed a scanner and return a scanner are suspect, since 
> the returned scanner may or may not close the passed scanner:
> * preCompact
> * preCompactScannerOpen
> * preFlush
> * preFlushScannerOpen
> * preScannerOpen
> * preStoreScannerOpen
> * preStoreFileReaderOpen...? (not sure about this one, it could mess with the 
> reader)
> I sampled the Phoenix and also Tephra code, and found a few instances where 
> this is happening.
> And for those I filed issued: TEPHRA-300, PHOENIX-5291
> (We're not using Tephra)
> The Phoenix ones should be rare. In our case we are seeing readers with 
> refCount > 1000.
> Perhaps there are other issues, a path where not all exceptions are caught 
> and scanner is left open that way perhaps. (Generally I am not a fan of 
> reference counting in complex systems - it's too easy to miss something. But 
> that's a different discussion. :) ).
> Let's brainstorm some way in which we can harden this.
> [~ram_krish], [~anoop.hbase], [~apurtell]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22457) Harden the HBase HFile reader reference counting

2019-05-22 Thread Lars Hofhansl (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846075#comment-16846075
 ] 

Lars Hofhansl commented on HBASE-22457:
---

One idea is: Are there any "safepoints"? I.e. points where we know what the 
reference could should be, so we can reset it? An obvious one is when there are 
no scanners running at all; in that case we could reset all refCounts to 0.


> Harden the HBase HFile reader reference counting
> 
>
> Key: HBASE-22457
> URL: https://issues.apache.org/jira/browse/HBASE-22457
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Priority: Major
>
> The problem that any coprocessor hook that replaces a passed scanner without 
> closing it can cause an incorrect reference count.
> This was bad and wrong before of course, but now it has pretty bad 
> consequences, since an incorrect reference could will prevent HFiles from 
> being archived indefinitely.
> All hooks that are passed a scanner and return a scanner are suspect, since 
> the returned scanner may or may not close the passed scanner:
> * preCompact
> * preCompactScannerOpen
> * preFlush
> * preFlushScannerOpen
> * preScannerOpen
> * preStoreScannerOpen
> * preStoreFileReaderOpen...? (not sure about this one, it could mess with the 
> reader)
> I sampled the Phoenix and also Tephra code, and found a few instances where 
> this is happening.
> And for those I filed issued: TEPHRA-300, PHOENIX-5291
> (We're not using Tephra)
> The Phoenix ones should be rare. In our case we are seeing readers with 
> refCount > 1000.
> Perhaps there are other issues, a path where not all exceptions are caught 
> and scanner is left open that way perhaps. (Generally I am not a fan of 
> reference counting in complex systems - it's too easy to miss something. But 
> that's a different discussion. :) ).
> Let's brainstorm some way in which we can harden this.
> [~ram_krish], [~anoop.hbase], [~apurtell]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)