Re: [DISCUSS] Separate Git Repository for HBCK2

2018-07-25 Thread Stack
On Tue, Jul 24, 2018 at 10:27 AM Andrew Purtell  wrote:

> If we do this can we also move out hbck version 1? It would be really weird
> in my opinion to have v2 in a separate repo but v1 shipping with the 1.x
> releases. That would be a source of understandable confusion.
>
>
My sense is that hbck1 is not externalizable; we'd not be able to move it
out of core because it does dirty tricks all over the shop. But let's see...
S



> I believe our compatibility guidelines allow us to upgrade interface
> annotations from private to LP or Public and from LP to Public. These are
> not changes that impact source or binary compatibility. They only change
> the promises we make going forward about their stability. I believe we can
> allow these in new minors, so we could potentially move hbck out in a
> 1.5.0.
>
>
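For illustration, the kind of promotion being described might look like the
below. The class is hypothetical, and the annotation package differs between
branch-1 and 2.x (the org.apache.yetus location used by 2.x is shown here):

{code}
import org.apache.hadoop.hbase.HBaseInterfaceAudience;
import org.apache.yetus.audience.InterfaceAudience;

// Previously annotated @InterfaceAudience.Private. Promoting it to
// LimitedPrivate(TOOLS) changes neither source nor binary compatibility;
// it only strengthens the stability promise made going forward.
@InterfaceAudience.LimitedPrivate(HBaseInterfaceAudience.TOOLS)
public final class SomeSharedUtility {
  private SomeSharedUtility() {
  }
}
{code}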
> On Mon, Jul 23, 2018 at 4:46 PM Stack  wrote:
>
> > On Thu, Jul 19, 2018 at 2:09 PM Umesh Agashe
>  > >
> > wrote:
> >
> > > Hi,
> > >
> > > I've had the opportunity to talk about HBCK2 with a few of you. One of
> > > the suggestions is to have a separate git repository for HBCK2. Let's
> > > discuss it.
> > >
> > > In the past, when bugs were found in hbck, there was no easy way to
> > > release a patched version of just hbck (without patching HBase). If
> > > HBCK2 has a separate git repo, HBCK2 versions will not be tightly
> > > related to HBase versions. Fixing and releasing hbck2 may not require
> > > patching HBase. Though the tight coupling will be somewhat loosened,
> > > HBCK2 will still depend on HBase APIs/code. Caution will be required
> > > going forward regarding compatibility.
> > >
> > > What do you all think?
> > >
> > >
> > I think this is the way to go.
> >
> > We'd make a new hbase-hbck2 repo as we did for hbase-thirdparty?
> >
> > We'd use the hbase JIRA for hbase-hbck2 issues?
> >
> > We'd make hbase-hbck2 releases on occasion that the PMC voted on?
> >
> > Sounds great!
> > St.Ack
> >
> > Thanks,
> > > Umesh
> > >
> > > JIRA:  https://issues.apache.org/jira/browse/HBASE-19121.
> > > Doc:
> > >
> > >
> >
> https://docs.google.com/document/d/1NxSFu4TKQ6lY-9J5qsCcJb9kZOnkfX66KMYsiVxBy0Y/edit?usp=sharing
> > >
> >
>
>
> --
> Best regards,
> Andrew
>
> Words like orphans lost among the crosstalk, meaning torn from truth's
> decrepit hands
>- A23, Crosstalk
>


Re: [DISCUSS] Separate Git Repository for HBCK2

2018-07-25 Thread Stack
On Tue, Jul 24, 2018 at 8:53 AM Josh Elser  wrote:

> (-cc user as this is getting purely into code development topics)
>
> First off, thanks for working on an hbck2, Umesh!
>
> I like the idea of having a separate repository for tracking HBCK and
> the flexibility it gives us for making releases at a cadence of our
> choosing.
>
> There are two worries that come to mind immediately:
>
> * How often does HBCK make decisions on how to implement a correction
> based on some known functionality (e.g. a bug) in a specific version(s)
> of HBase. Concretely, would HBCK need to make corrections to an HBase
> installation that are specific to a subset of HBase 2.x.y versions that
> may not be valid for other 2.x.y versions?
>


hbck2 should be able to do this -- execute a fix ONLY if the version matches.
I'll add your suggestion to Umesh's attached doc.
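To make that concrete, a minimal sketch of version-gating a repair -- the fix
and the version pattern are made up, and how hbck2 learns the cluster's
version (e.g. asking the Master) is assumed rather than shown:

{code}
import java.util.regex.Pattern;

/** Illustration only: run a repair only when the cluster version matches. */
public final class VersionGatedFix {

  // Hypothetical: this particular repair only applies to 2.0.x clusters.
  private static final Pattern APPLIES_TO = Pattern.compile("2\\.0\\.\\d+.*");

  private VersionGatedFix() {
  }

  /** @param clusterVersion version string reported by the cluster, e.g. "2.0.1" */
  public static void maybeRunFix(String clusterVersion, Runnable fix) {
    if (!APPLIES_TO.matcher(clusterVersion).matches()) {
      System.out.println("Skipping: fix does not apply to HBase " + clusterVersion);
      return;
    }
    fix.run();
  }

  public static void main(String[] args) {
    maybeRunFix("2.0.1", () -> System.out.println("running version-specific repair..."));
  }
}
{code}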
S



> * How often does HBCK need to re-use methods and constants from code in
> hbase-common, hbase-server, etc?
>  - Related: Is it a goal to firm up API stability around this shared
> code or are you planning to just copy needed code to the HBCK2 repo? I
> think you are saying that this *is* a goal -- could/should we introduce
> some new level of InterfaceAudience to assert that we don't
> inadvertently break HBCK2?
>
> Thanks!
>
> On 7/19/18 5:09 PM, Umesh Agashe wrote:
> > Hi,
> >
> > I've had the opportunity to talk about HBCK2 with a few of you. One of
> > the suggestions is to have a separate git repository for HBCK2. Let's
> > discuss it.
> >
> > In the past, when bugs were found in hbck, there was no easy way to release
> > a patched version of just hbck (without patching HBase). If HBCK2 has a
> > separate git repo, HBCK2 versions will not be tightly related to HBase
> > versions. Fixing and releasing hbck2 may not require patching HBase.
> > Though the tight coupling will be somewhat loosened, HBCK2 will still
> > depend on HBase APIs/code. Caution will be required going forward regarding
> > compatibility.
> >
> > What do you all think?
> >
> > Thanks,
> > Umesh
> >
> > JIRA:  https://issues.apache.org/jira/browse/HBASE-19121.
> > Doc:
> >
> https://docs.google.com/document/d/1NxSFu4TKQ6lY-9J5qsCcJb9kZOnkfX66KMYsiVxBy0Y/edit?usp=sharing
> >
>


Re: [DISCUSS] Kafka Connection, HBASE-15320

2018-07-25 Thread Stack
On Tue, Jul 24, 2018 at 10:01 PM Misty Linville  wrote:

> I like the idea of a separate connectors repo/release vehicle, but I'm a
> little concerned about the need to release all together to update just one
> of the connectors. How would that work? What kind of compatibility
> guarantees are we signing up for?
>
>
I hate responses that begin "Good question" -- so fawning -- but, ahem,
good question Misty (in the literal, not flattering, sense).

I think hbase-connectors will be like hbase-thirdparty. The latter includes
netty, pb, guava and a few other bits and pieces, so yeah, sometimes a netty
upgrade or an improvement to our patch on pb will require us to release
everything even though we are fixing only one lib. Usually, if bothering to
make a release, we'll check for fixes or updates we can do in the other
bundled components.

On the rate of releases, I foresee a flurry of activity around launch as we
fill in missing bits and address critical bug fixes, but then it will settle
down to be boring, with just the occasional update. Thrift and REST have been
stable for a good while now (not saying this is a good thing). Our Sean just
suggested moving mapreduce to connectors too -- an interesting idea -- and it
has also been stable (at least until recently, with the shading work). We
should talk about the Spark connector when the time comes. It might not be as
stable as the others.

On the compatibility guarantees, we'll semver it: if there is an incompatible
change in a connector, or if the connectors have to change to match a new
version of hbase, we'll make sure the hbase-connector version number is bumped
appropriately. On the backend, what Mike says: connectors use HBase Public
APIs (else they can't be moved to the hbase-connector repo).

S






> On Tue, Jul 24, 2018, 9:41 PM Stack  wrote:
>
> > Grand. I filed https://issues.apache.org/jira/browse/HBASE-20934. Let me
> > have a go at making the easy one work first (the kafka proxy). Lets see
> how
> > it goes. I'll report back here.
> > S
> >
> > On Tue, Jul 24, 2018 at 2:43 PM Sean Busbey  wrote:
> >
> > > Key functionality for the project's adoption should be in the project.
> > > Please do not suggest we donate things to Bahir.
> > >
> > > I apologize if this is brisk. I have had previous negative experiences
> > > with folks that span our communities trying to move work I spent a lot
> > > of time contributing to within HBase over to Bahir in an attempt to
> > > bypass an agreed upon standard of quality.
> > >
> > > On Tue, Jul 24, 2018 at 3:38 PM, Artem Ervits 
> > > wrote:
> > > > Why not just donating the connector to http://bahir.apache.org/ ?
> > > >
> > > > On Tue, Jul 24, 2018, 12:51 PM Lars Francke 
> > > wrote:
> > > >
> > > >> I'd love to have the Kafka Connector included.
> > > >>
> > > >> @Mike thanks so much for the contribution (and your planned ones)
> > > >>
> > > >> I'm +1 on adding it to the core but I'm also +1 on having a separate
> > > >> repository under Apache governance
> > > >>
> > > >> On Tue, Jul 24, 2018 at 6:01 PM, Josh Elser 
> > wrote:
> > > >>
> > > >> > +1 to the great point by Duo about use of non-IA.Public classes
> > > >> >
> > > >> > +1 for Apache for the governance (although, I wouldn't care if we
> > use
> > > >> > Github PRs to try to encourage more folks to contribute), a repo
> > with
> > > the
> > > >> > theme of "connectors" (to include Thrift, REST, and the like).
> Spark
> > > too
> > > >> --
> > > >> > I think we had suggested that prior, but it could be a mental
> > > invention
> > > >> of
> > > >> > mine..
> > > >> >
> > > >> >
> > > >> > On 7/24/18 10:16 AM, Hbase Janitor wrote:
> > > >> >
> > > >> >> Hi everyone,
> > > >> >>
> > > >> >> I'm the author of the patch.  A separate repo for all the
> > connectors
> > > is
> > > >> a
> > > >> >> great idea! I can make whatever changes necessary to the patch to
> > > help.
> > > >> >>
> > > >> >> I have several other integration type projects like this planned.
> > > >> >>
> > > >> >> Mike
> > > >> >>
> > > >> >>
> > > >> >> On Tue, Jul 24, 2018, 00:03 Mike Drob  wrote:
> > > >> >>
> > > >> >> I would be ok with all of the connectors in a single repo. Doing
> a
> > > repo
> > > >> >>> per
> > > >> >>> connector seems like a large amount of overhead work.
> > > >> >>>
> > > >> >>> On Mon, Jul 23, 2018, 9:12 PM Clay B.  wrote:
> > > >> >>>
> > > >> >>> [Non-binding]
> > > >> 
> > > >>  I am all for the Kafka Connect(er) as indeed it makes HBase
> "more
> > > >>  relevant" and generates buzz to help me sell HBase adoption in
> my
> > > >>  endeavors.
> > > >> 
> > > >>  Also, I would like to see a connectors repo a lot as I would
> > > expect it
> > > >> 
> > > >> >>> can
> > > >> >>>
> > > >>  make the HBase source and releases more obvious in what is
> > > changing.
> > > >> Not
> > > >>  to distract from Kafka, but Spark has in the past been a
> hang-up
> > > and
> > > >> 
> > > >> >>> seems
> > > >> >>>
> > > >>  a good 

Re: [DISCUSS] Expanded "work items" for HBase-in-the-Cloud doc

2018-07-25 Thread Stack
On Wed, Jul 25, 2018 at 11:55 AM Josh Elser  wrote:

> ...
> My biggest take-away is that I complicated this document by tying it too
> closely with "HBase on Cloud", treating the WAL+Ratis LogService as the
> only/biggest thing to figure out. This was inaccurate and overly bold of
> me: I apologize. I think this complicated discussion on a number of
> points, and ate a good bit of some of your's time.
>
>
No need of apology.

There was healthy back and forth. You read the feedback and took it on
board.

(See below).



> My goal was to present this as an important part of a transition to the
> "cloud", giving justification to what WAL+Ratis helps HBase achieve. I
> did not want this document to be a step-by-step guide to a perfect HBase
> on Cloud design. I need to do a better job with this in the future; sorry.

That said, my feeling is that, on the whole, folks are in support of the
> proposed changes/architecture described for the WAL+Ratis work (tl;dr
> revisit WAL API, plug in current WAL implementation to any API
> modification, build new Ratis-backed WAL impl). There were some concerns
> which still need immediate action that I am aware of:
>
> * Sync with Ram and Anoop re: in-memory WAL [1]
> * Where is Ratis LogService metadata kept? How do we know what
> LogStreams were being used/maintained by a RS? How does this tie into
> recovery?
>
> There are also long-term concerns which I don't think I have an answer
> for yet (for either reasons out of my control or a lack of technical
> understanding):
>
> * Maturity of the Ratis community
> * Required performance by HBase and the ability of the LogService to
> provide that perf (Areas already mentioned: gRPC perf, fsyncs bogging
> down disks, ability to scale RAFT quorums).
> * Continue with WAL-per-RS or move to WAL-per-Region? Related to perf,
> dependent upon Ratis scalability.
> * I/O amplification on WAL retention for backup and replication
> ("logstream export")
> * Ensure that LogStreams can be exported to a dist-filesystem in a
> manner which requires no additional metadata/handling (avoid more
> storage/mgmt complexity)
> * Ability to build krb5 authn into Ratis (really, gRPC)
>
> I will continue the two immediate action items. I think the latter
> concerns are some that will require fingers-on-keyboard -- I don't know
> enough about runtime characteristics without seeing it for myself.
>
> All this said, I'd like to start moving toward the point where we start
> breaking out this work into a feature-branch off of master and start
> building code. My hope is that this is amenable to everyone, with the
> acknowledge that the Ratis work is considered "experimental" and not an
> attempt to make all of HBase use Ratis-backed WALs.
>
>

Go for it.

The branch would have WAL API changes only or would it include Ratis WAL
dev? (If the latter, would that be better done over on Ratis project?).
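For the "WAL API changes only" case, a very rough sketch of the seam being
discussed -- the names below are hypothetical and are not the existing
org.apache.hadoop.hbase.wal.WAL interface; both the current filesystem-backed
WAL and a Ratis LogService-backed one would sit behind something like it:

{code}
import java.io.IOException;

/** Hypothetical minimal write-ahead-log abstraction, for discussion only. */
public interface WriteAheadLogApi extends AutoCloseable {

  /** Append an edit; returns a sequence id the caller can later sync up to. */
  long append(byte[] encodedRegionName, byte[] edit) throws IOException;

  /** Block until everything up to the given sequence id is durable. */
  void sync(long sequenceId) throws IOException;

  @Override
  void close() throws IOException;
}
{code}

A branch carrying only an API along these lines, plus an adapter over the
current implementation, could land independently of any Ratis work.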

S


> Finally, I do *not* want this message to be interpreted as me squashing
> anyone's concerns. My honest opinion is that discussion has died down,
> but I will be the first to apologize if I have missed any outstanding
> concerns. Please, please, please ping me if I am negligent.
>
> Thanks once again for everyone's participation.
>
> [1]
>
> https://docs.google.com/document/d/1Su5py_T5Ytfh9RoTTX2s20KbSJwBHVxbO7ge5ORqbCk/edit?disco=CBm3RLM
>
> On 2018/07/13 20:15:45, Josh Elser  wrote: > Hi all,
> >
> > A long time ago, I shared a document about a (I'll call it..) "vision"
> > where we make some steps towards decoupling HBase from HDFS in an effort
> > to make deploying HBase on Cloud IaaS providers a bit easier
> > (operational simplicity, effective use of common IaaS paradigms, etc).
> >
> >
> https://docs.google.com/document/d/1Su5py_T5Ytfh9RoTTX2s20KbSJwBHVxbO7ge5ORqbCk/edit?usp=sharing
> >
> > A good ask from our Stack back then was: "[can you break down this
> > work]?" The original document was very high-level, and asking for some
> > more details make a lot of sense. Months later, I'd like to share that
> > I've updated the original document with some new content at the bottom
> > (as well as addressed some comments which went unanswered by me --
> sorry!)
> >
> > Based on a discussion I had earlier this week (and some discussions
> > during HBaseCon in California in June), I've tried to add a brief
> > "refresher" on what some of the big goals for this effort are. Please
> > check it out at your leisure and let me know what you think. Would like
> > to start getting some fingers behind this all and pump out some code :)
> >
> >
> https://docs.google.com/document/d/1Su5py_T5Ytfh9RoTTX2s20KbSJwBHVxbO7ge5ORqbCk/edit#bookmark=id.fml9ynrqagk
> >
> > - Josh
> >
>


[jira] [Created] (HBASE-20943) Add offline/online region count into metrics

2018-07-25 Thread Tianying Chang (JIRA)
Tianying Chang created HBASE-20943:
--

 Summary: Add offline/online region count into metrics
 Key: HBASE-20943
 URL: https://issues.apache.org/jira/browse/HBASE-20943
 Project: HBase
  Issue Type: Improvement
  Components: metrics
Affects Versions: 1.2.6.1, 2.0.0
Reporter: Tianying Chang


We use metrics intensively to monitor the health of our HBase production
cluster. We have seen some regions of a table get stuck and fail to be brought
online due to an AWS issue which corrupted some log files. It would be good if
we could catch this early. Although the WebUI has this information, it is not
useful for automated monitoring. By adding this metric, we can easily monitor
these counts with our monitoring system.
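Once such a metric exists, automated monitoring could read it over JMX; a
minimal sketch follows (the JMX port, bean name, and attribute name below are
placeholders for whatever the patch ends up exposing):

{code}
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

/** Illustration only: poll a hypothetical offline-region-count attribute over JMX. */
public final class RegionCountProbe {
  public static void main(String[] args) throws Exception {
    // Placeholder endpoint; point this at the Master's JMX port.
    JMXServiceURL url =
        new JMXServiceURL("service:jmx:rmi:///jndi/rmi://master-host:10101/jmxrmi");
    JMXConnector connector = JMXConnectorFactory.connect(url);
    try {
      MBeanServerConnection mbs = connector.getMBeanServerConnection();
      // Hypothetical bean and attribute names for the proposed counts.
      ObjectName bean =
          new ObjectName("Hadoop:service=HBase,name=Master,sub=AssignmentManager");
      Object offline = mbs.getAttribute(bean, "offlineRegionCount");
      System.out.println("offlineRegionCount=" + offline);
    } finally {
      connector.close();
    }
  }
}
{code}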



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20942) Make RpcServer trace log length configurable

2018-07-25 Thread Mike Drob (JIRA)
Mike Drob created HBASE-20942:
-

 Summary: Make RpcServer trace log length configurable
 Key: HBASE-20942
 URL: https://issues.apache.org/jira/browse/HBASE-20942
 Project: HBase
  Issue Type: Task
Reporter: Esteban Gutierrez


We truncate RpcServer output to 1000 characters for trace logging. It would be
better if that value were configurable.
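A hedged sketch of the shape such a change could take -- the property name and
helper below are assumptions for illustration, not what the eventual patch
will use:

{code}
import org.apache.hadoop.conf.Configuration;

/** Illustration only: truncate RPC trace output to a configurable length. */
public final class TraceLogTruncation {

  // Hypothetical property name; 1000 matches the current hard-coded limit.
  static final String TRACE_LOG_MAX_LENGTH_KEY = "hbase.ipc.trace.log.max.length";
  static final int DEFAULT_TRACE_LOG_MAX_LENGTH = 1000;

  private TraceLogTruncation() {
  }

  static String truncateForTrace(Configuration conf, String message) {
    int max = conf.getInt(TRACE_LOG_MAX_LENGTH_KEY, DEFAULT_TRACE_LOG_MAX_LENGTH);
    return message.length() > max ? message.substring(0, max) + "..." : message;
  }
}
{code}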

Esteban mentioned this to me earlier, so I'm crediting him as the reporter.

cc: [~elserj]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS] Expanded "work items" for HBase-in-the-Cloud doc

2018-07-25 Thread Josh Elser

Thanks, Zach!

I like your suggestion about project updates. I sincerely hope that this
can be something transparent enough that folks who want to follow on and
participate in the implementation can do so. Let me think about how to drive
this better.


On 7/25/18 3:55 PM, Zach York wrote:

+1 to starting the work. I think most of the concerns can be figured out on
the JIRAs and we can have a project update every X weeks if enough people
are interested.

I also agree to frame the feature correctly. Decoupling from a HDFS WAL or
WAL on Ratis would be more appropriate names that would better convey the
scope. I think there are a number of projects necessary to complete "HBase
on Cloud" with this being one of those.


Thanks for driving this initiative!

Zach


On Wed, Jul 25, 2018 at 11:55 AM, Josh Elser  wrote:


Let me give an update on-list for everyone:

First and foremost, thank you very much to everyone who took the time to
read this, with an extra thanks to those who participated in discussion.
There were lots of great points raised. Some about things that were unclear
in the doc, and others shining light onto subjects I hadn't considered yet.

My biggest take-away is that I complicated this document by tying it too
closely with "HBase on Cloud", treating the WAL+Ratis LogService as the
only/biggest thing to figure out. This was inaccurate and overly bold of
me: I apologize. I think this complicated discussion on a number of points,
and ate a good bit of some of your time.

My goal was to present this as an important part of a transition to the
"cloud", giving justification to what WAL+Ratis helps HBase achieve. I did
not want this document to be a step-by-step guide to a perfect HBase on
Cloud design. I need to do a better job with this in the future; sorry.

That said, my feeling is that, on the whole, folks are in support of the
proposed changes/architecture described for the WAL+Ratis work (tl;dr
revisit WAL API, plug in current WAL implementation to any API
modification, build new Ratis-backed WAL impl). There were some concerns
which still need immediate action that I am aware of:

* Sync with Ram and Anoop re: in-memory WAL [1]
* Where is Ratis LogService metadata kept? How do we know what LogStreams
were being used/maintained by a RS? How does this tie into recovery?

There are also long-term concerns which I don't think I have an answer for
yet (for either reasons out of my control or a lack of technical
understanding):

* Maturity of the Ratis community
* Required performance by HBase and the ability of the LogService to
provide that perf (Areas already mentioned: gRPC perf, fsyncs bogging down
disks, ability to scale RAFT quorums).
* Continue with WAL-per-RS or move to WAL-per-Region? Related to perf,
dependent upon Ratis scalability.
* I/O amplification on WAL retention for backup and replication
("logstream export")
* Ensure that LogStreams can be exported to a dist-filesystem in a manner
which requires no additional metadata/handling (avoid more storage/mgmt
complexity)
* Ability to build krb5 authn into Ratis (really, gRPC)

I will continue the two immediate action items. I think the latter
concerns are some that will require fingers-on-keyboard -- I don't know
enough about runtime characteristics without seeing it for myself.

All this said, I'd like to start moving toward the point where we start
breaking out this work into a feature-branch off of master and start
building code. My hope is that this is amenable to everyone, with the
acknowledgment that the Ratis work is considered "experimental" and not an
attempt to make all of HBase use Ratis-backed WALs.

Finally, I do *not* want this message to be interpreted as me squashing
anyone's concerns. My honest opinion is that discussion has died down, but
I will be the first to apologize if I have missed any outstanding concerns.
Please, please, please ping me if I am negligent.

Thanks once again for everyone's participation.

[1] https://docs.google.com/document/d/1Su5py_T5Ytfh9RoTTX2s20Kb
SJwBHVxbO7ge5ORqbCk/edit?disco=CBm3RLM

On 2018/07/13 20:15:45, Josh Elser wrote:

Hi all,



A long time ago, I shared a document about a (I'll call it..) "vision"
where we make some steps towards decoupling HBase from HDFS in an effort to
make deploying HBase on Cloud IaaS providers a bit easier (operational
simplicity, effective use of common IaaS paradigms, etc).

https://docs.google.com/document/d/1Su5py_T5Ytfh9RoTTX2s20Kb
SJwBHVxbO7ge5ORqbCk/edit?usp=sharing

A good ask from our Stack back then was: "[can you break down this
work]?" The original document was very high-level, and asking for some more
details make a lot of sense. Months later, I'd like to share that I've
updated the original document with some new content at the bottom (as well
as addressed some comments which went unanswered by me -- sorry!)

Based on a discussion I had earlier this week (and some discussions
during HBaseCon in California in June), I've tried to add a brief

Re: [DISCUSS] Expanded "work items" for HBase-in-the-Cloud doc

2018-07-25 Thread Josh Elser
Thanks, Andrew. I was really upset that I was butting heads with you 
when I would have previously thought that I had a design which was in 
line with something you would have called "good".


I will wholly take the blame for not having an as-clear-as-possible 
design doc. I am way down in the weeds and didn't bring myself up for 
air before trying to write something consumable for everyone else.


Making a good API is my biggest goal for the HBase side, and my hope is 
that it will support this experiment, enable others who want to try out 
other systems, and simplify our existing WAL implementations.


Thanks for the reply.

On 7/25/18 3:50 PM, Andrew Purtell wrote:

My biggest take-away is that I complicated this document by tying it too
closely with "HBase on Cloud", treating the WAL+Ratis LogService as the
only/biggest thing to figure out.

Understanding this now helps a lot in understanding the positions taken
in the doc. At first glance it read as an initially interesting document
that quickly went to a weird place where there was a preconceived solution
working backward toward a problem, engineering run in reverse. I think it's
perfectly fine if the Ratis podling and those associated with it want to
drive development and/or adoption by finding candidate use cases in other
ecosystem projects. As long as we have good interfaces which don't leak
internals, no breaking core changes, no hard dependencies on incubating
artifacts, and at least a potential path forward to alternate
implementations it's all good!

On Wed, Jul 25, 2018 at 11:55 AM Josh Elser  wrote:


Let me give an update on-list for everyone:

First and foremost, thank you very much to everyone who took the time to
read this, with an extra thanks to those who participated in discussion.
There were lots of great points raised. Some about things that were
unclear in the doc, and others shining light onto subjects I hadn't
considered yet.

My biggest take-away is that I complicated this document by tying it too
closely with "HBase on Cloud", treating the WAL+Ratis LogService as the
only/biggest thing to figure out. This was inaccurate and overly bold of
me: I apologize. I think this complicated discussion on a number of
points, and ate a good bit of some of your's time.

My goal was to present this as an important part of a transition to the
"cloud", giving justification to what WAL+Ratis helps HBase achieve. I
did not want this document to be a step-by-step guide to a perfect HBase
on Cloud design. I need to do a better job with this in the future; sorry.

That said, my feeling is that, on the whole, folks are in support of the
proposed changes/architecture described for the WAL+Ratis work (tl;dr
revisit WAL API, plug in current WAL implementation to any API
modification, build new Ratis-backed WAL impl). There were some concerns
which still need immediate action that I am aware of:

* Sync with Ram and Anoop re: in-memory WAL [1]
* Where is Ratis LogService metadata kept? How do we know what
LogStreams were being used/maintained by a RS? How does this tie into
recovery?

There are also long-term concerns which I don't think I have an answer
for yet (for either reasons out of my control or a lack of technical
understanding):

* Maturity of the Ratis community
* Required performance by HBase and the ability of the LogService to
provide that perf (Areas already mentioned: gRPC perf, fsyncs bogging
down disks, ability to scale RAFT quorums).
* Continue with WAL-per-RS or move to WAL-per-Region? Related to perf,
dependent upon Ratis scalability.
* I/O amplification on WAL retention for backup and replication
("logstream export")
* Ensure that LogStreams can be exported to a dist-filesystem in a
manner which requires no additional metadata/handling (avoid more
storage/mgmt complexity)
* Ability to build krb5 authn into Ratis (really, gRPC)

I will continue the two immediate action items. I think the latter
concerns are some that will require fingers-on-keyboard -- I don't know
enough about runtime characteristics without seeing it for myself.

All this said, I'd like to start moving toward the point where we start
breaking out this work into a feature-branch off of master and start
building code. My hope is that this is amenable to everyone, with the
acknowledge that the Ratis work is considered "experimental" and not an
attempt to make all of HBase use Ratis-backed WALs.

Finally, I do *not* want this message to be interpreted as me squashing
anyone's concerns. My honest opinion is that discussion has died down,
but I will be the first to apologize if I have missed any outstanding
concerns. Please, please, please ping me if I am negligent.

Thanks once again for everyone's participation.

[1]

https://docs.google.com/document/d/1Su5py_T5Ytfh9RoTTX2s20KbSJwBHVxbO7ge5ORqbCk/edit?disco=CBm3RLM

On 2018/07/13 20:15:45, Josh Elser  wrote: > Hi all,


A long time ago, I shared a document about a (I'll call it..) "vision"
where we make 

Re: [DISCUSS] Expanded "work items" for HBase-in-the-Cloud doc

2018-07-25 Thread Zach York
+1 to starting the work. I think most of the concerns can be figured out on
the JIRAs and we can have a project update every X weeks if enough people
are interested.

I also agree we should frame the feature correctly. "Decoupling from an HDFS
WAL" or "WAL on Ratis" would be more appropriate names that would better
convey the scope. I think there are a number of projects necessary to complete
"HBase on Cloud", with this being one of them.


Thanks for driving this initiative!

Zach


On Wed, Jul 25, 2018 at 11:55 AM, Josh Elser  wrote:

> Let me give an update on-list for everyone:
>
> First and foremost, thank you very much to everyone who took the time to
> read this, with an extra thanks to those who participated in discussion.
> There were lots of great points raised. Some about things that were unclear
> in the doc, and others shining light onto subjects I hadn't considered yet.
>
> My biggest take-away is that I complicated this document by tying it too
> closely with "HBase on Cloud", treating the WAL+Ratis LogService as the
> only/biggest thing to figure out. This was inaccurate and overly bold of
> me: I apologize. I think this complicated discussion on a number of points,
> and ate a good bit of some of your's time.
>
> My goal was to present this as an important part of a transition to the
> "cloud", giving justification to what WAL+Ratis helps HBase achieve. I did
> not want this document to be a step-by-step guide to a perfect HBase on
> Cloud design. I need to do a better job with this in the future; sorry.
>
> That said, my feeling is that, on the whole, folks are in support of the
> proposed changes/architecture described for the WAL+Ratis work (tl;dr
> revisit WAL API, plug in current WAL implementation to any API
> modification, build new Ratis-backed WAL impl). There were some concerns
> which still need immediate action that I am aware of:
>
> * Sync with Ram and Anoop re: in-memory WAL [1]
> * Where is Ratis LogService metadata kept? How do we know what LogStreams
> were being used/maintained by a RS? How does this tie into recovery?
>
> There are also long-term concerns which I don't think I have an answer for
> yet (for either reasons out of my control or a lack of technical
> understanding):
>
> * Maturity of the Ratis community
> * Required performance by HBase and the ability of the LogService to
> provide that perf (Areas already mentioned: gRPC perf, fsyncs bogging down
> disks, ability to scale RAFT quorums).
> * Continue with WAL-per-RS or move to WAL-per-Region? Related to perf,
> dependent upon Ratis scalability.
> * I/O amplification on WAL retention for backup and replication
> ("logstream export")
> * Ensure that LogStreams can be exported to a dist-filesystem in a manner
> which requires no additional metadata/handling (avoid more storage/mgmt
> complexity)
> * Ability to build krb5 authn into Ratis (really, gRPC)
>
> I will continue the two immediate action items. I think the latter
> concerns are some that will require fingers-on-keyboard -- I don't know
> enough about runtime characteristics without seeing it for myself.
>
> All this said, I'd like to start moving toward the point where we start
> breaking out this work into a feature-branch off of master and start
> building code. My hope is that this is amenable to everyone, with the
> acknowledge that the Ratis work is considered "experimental" and not an
> attempt to make all of HBase use Ratis-backed WALs.
>
> Finally, I do *not* want this message to be interpreted as me squashing
> anyone's concerns. My honest opinion is that discussion has died down, but
> I will be the first to apologize if I have missed any outstanding concerns.
> Please, please, please ping me if I am negligent.
>
> Thanks once again for everyone's participation.
>
> [1] https://docs.google.com/document/d/1Su5py_T5Ytfh9RoTTX2s20Kb
> SJwBHVxbO7ge5ORqbCk/edit?disco=CBm3RLM
>
> On 2018/07/13 20:15:45, Josh Elser  wrote: > Hi all,
>
>>
>> A long time ago, I shared a document about a (I'll call it..) "vision"
>> where we make some steps towards decoupling HBase from HDFS in an effort to
>> make deploying HBase on Cloud IaaS providers a bit easier (operational
>> simplicity, effective use of common IaaS paradigms, etc).
>>
>> https://docs.google.com/document/d/1Su5py_T5Ytfh9RoTTX2s20Kb
>> SJwBHVxbO7ge5ORqbCk/edit?usp=sharing
>>
>> A good ask from our Stack back then was: "[can you break down this
>> work]?" The original document was very high-level, and asking for some more
>> details make a lot of sense. Months later, I'd like to share that I've
>> updated the original document with some new content at the bottom (as well
>> as addressed some comments which went unanswered by me -- sorry!)
>>
>> Based on a discussion I had earlier this week (and some discussions
>> during HBaseCon in California in June), I've tried to add a brief
>> "refresher" on what some of the big goals for this effort are. Please check
>> it out at your leisure and let me know what 

Re: [DISCUSS] Expanded "work items" for HBase-in-the-Cloud doc

2018-07-25 Thread Andrew Purtell
> My biggest take-away is that I complicated this document by tying it too
> closely with "HBase on Cloud", treating the WAL+Ratis LogService as the
> only/biggest thing to figure out.

Understanding this now helps a lot in understanding the positions taken
in the doc. At first glance it read as an initially interesting document
that quickly went to a weird place where there was a preconceived solution
working backward toward a problem, engineering run in reverse. I think it's
perfectly fine if the Ratis podling and those associated with it want to
drive development and/or adoption by finding candidate use cases in other
ecosystem projects. As long as we have good interfaces which don't leak
internals, no breaking core changes, no hard dependencies on incubating
artifacts, and at least a potential path forward to alternate
implementations it's all good!

On Wed, Jul 25, 2018 at 11:55 AM Josh Elser  wrote:

> Let me give an update on-list for everyone:
>
> First and foremost, thank you very much to everyone who took the time to
> read this, with an extra thanks to those who participated in discussion.
> There were lots of great points raised. Some about things that were
> unclear in the doc, and others shining light onto subjects I hadn't
> considered yet.
>
> My biggest take-away is that I complicated this document by tying it too
> closely with "HBase on Cloud", treating the WAL+Ratis LogService as the
> only/biggest thing to figure out. This was inaccurate and overly bold of
> me: I apologize. I think this complicated discussion on a number of
> points, and ate a good bit of some of your's time.
>
> My goal was to present this as an important part of a transition to the
> "cloud", giving justification to what WAL+Ratis helps HBase achieve. I
> did not want this document to be a step-by-step guide to a perfect HBase
> on Cloud design. I need to do a better job with this in the future; sorry.
>
> That said, my feeling is that, on the whole, folks are in support of the
> proposed changes/architecture described for the WAL+Ratis work (tl;dr
> revisit WAL API, plug in current WAL implementation to any API
> modification, build new Ratis-backed WAL impl). There were some concerns
> which still need immediate action that I am aware of:
>
> * Sync with Ram and Anoop re: in-memory WAL [1]
> * Where is Ratis LogService metadata kept? How do we know what
> LogStreams were being used/maintained by a RS? How does this tie into
> recovery?
>
> There are also long-term concerns which I don't think I have an answer
> for yet (for either reasons out of my control or a lack of technical
> understanding):
>
> * Maturity of the Ratis community
> * Required performance by HBase and the ability of the LogService to
> provide that perf (Areas already mentioned: gRPC perf, fsyncs bogging
> down disks, ability to scale RAFT quorums).
> * Continue with WAL-per-RS or move to WAL-per-Region? Related to perf,
> dependent upon Ratis scalability.
> * I/O amplification on WAL retention for backup and replication
> ("logstream export")
> * Ensure that LogStreams can be exported to a dist-filesystem in a
> manner which requires no additional metadata/handling (avoid more
> storage/mgmt complexity)
> * Ability to build krb5 authn into Ratis (really, gRPC)
>
> I will continue the two immediate action items. I think the latter
> concerns are some that will require fingers-on-keyboard -- I don't know
> enough about runtime characteristics without seeing it for myself.
>
> All this said, I'd like to start moving toward the point where we start
> breaking out this work into a feature-branch off of master and start
> building code. My hope is that this is amenable to everyone, with the
> acknowledge that the Ratis work is considered "experimental" and not an
> attempt to make all of HBase use Ratis-backed WALs.
>
> Finally, I do *not* want this message to be interpreted as me squashing
> anyone's concerns. My honest opinion is that discussion has died down,
> but I will be the first to apologize if I have missed any outstanding
> concerns. Please, please, please ping me if I am negligent.
>
> Thanks once again for everyone's participation.
>
> [1]
>
> https://docs.google.com/document/d/1Su5py_T5Ytfh9RoTTX2s20KbSJwBHVxbO7ge5ORqbCk/edit?disco=CBm3RLM
>
> On 2018/07/13 20:15:45, Josh Elser  wrote: > Hi all,
> >
> > A long time ago, I shared a document about a (I'll call it..) "vision"
> > where we make some steps towards decoupling HBase from HDFS in an effort
> > to make deploying HBase on Cloud IaaS providers a bit easier
> > (operational simplicity, effective use of common IaaS paradigms, etc).
> >
> >
> https://docs.google.com/document/d/1Su5py_T5Ytfh9RoTTX2s20KbSJwBHVxbO7ge5ORqbCk/edit?usp=sharing
> >
> > A good ask from our Stack back then was: "[can you break down this
> > work]?" The original document was very high-level, and asking for some
> > more details make a lot of sense. Months later, I'd like to share that
> > I've 

[jira] [Created] (HBASE-20941) Cre

2018-07-25 Thread Umesh Agashe (JIRA)
Umesh Agashe created HBASE-20941:


 Summary: Cre
 Key: HBASE-20941
 URL: https://issues.apache.org/jira/browse/HBASE-20941
 Project: HBase
  Issue Type: Sub-task
Reporter: Umesh Agashe
Assignee: Umesh Agashe


Create HbckService in the master and implement the following methods:
 # purgeProcedure/s(): some procedures do not support abort at every step. When
these procedures get stuck, they cannot be aborted or make further progress.
The corrective action is to purge these procedures from the ProcWAL. Provide an
option to purge sub-procedures as well.
 # setTable/RegionState(): if table/region states are inconsistent with the
actions/procedures working on them, manipulating their states in meta sometimes
fixes things.
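For discussion, a rough sketch of the surface described above -- method and
type names are illustrative only; the real thing would be an RPC service
defined in protobuf:

{code}
import java.io.IOException;
import java.util.List;

/** Illustrative-only shape for the proposed master-side Hbck service. */
public interface HbckServiceSketch {

  /**
   * Purge stuck procedures from the procedure WAL.
   * @param procIds ids of the procedures to purge
   * @param includeSubProcedures whether to also purge their sub-procedures
   */
  void purgeProcedures(List<Long> procIds, boolean includeSubProcedures) throws IOException;

  /** Force a table's state in meta (e.g. "ENABLED", "DISABLED"). */
  void setTableState(String tableName, String state) throws IOException;

  /** Force a region's state in meta (e.g. "OPEN", "CLOSED", "OFFLINE"). */
  void setRegionState(String encodedRegionName, String state) throws IOException;
}
{code}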



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS] Kafka Connection, HBASE-15320

2018-07-25 Thread Hbase Janitor
Hi Misty,

As long as the connectors use a public API, we can be flexible.  We get the
same guarantees app programmers get.
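To illustrate, a bare-bones sketch of a connector writing through nothing but
IA.Public client classes (the table, family and row names are placeholders):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

/** Writes one record using only the public HBase client API. */
public final class PublicApiSinkExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("kafka_events"))) {
      Put put = new Put(Bytes.toBytes("row-1"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("hello"));
      table.put(put);
    }
  }
}
{code}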

Mike

On Wed, Jul 25, 2018, 01:01 Misty Linville  wrote:

> I like the idea of a separate connectors repo/release vehicle, but I'm a
> little concerned about the need to release all together to update just one
> of the connectors. How would that work? What kind of compatibility
> guarantees are we signing up for?
>
>
>


Re: [DISCUSS] Separate Git Repository for HBCK2

2018-07-25 Thread Umesh Agashe
bq. Seems like you're saying it's not a problem now, but
you're not sure if it would become a problem. Regardless of that, it's a
goal to not be version-specific (and thus, we can have generic hbck-v1
and hbck-v2 tools). LMK if I misread, please :)

That's right.

On Wed, Jul 25, 2018 at 11:11 AM Josh Elser  wrote:

> Thanks, Umesh. Seems like you're saying it's not a problem now, but
> you're not sure if it would become a problem. Regardless of that, it's a
> goal to not be version-specific (and thus, we can have generic hbck-v1
> and hbck-v2 tools). LMK if I misread, please :)
>
> One more thought, it would be nice to name this repository as
> "operator-tools" or similar (instead of hbck). A separate repo on its
> own release cadence is a nice vehicle for random sorts of recovery,
> slice-and-dice, one-off tools. I think HBCK is one example of
> administrator/operator tooling we provide (certainly, the most used),
> but we have the capacity to provide more than just that.
>
> On 7/24/18 5:55 PM, Umesh Agashe wrote:
> > Thanks Stack, Josh and Andrew for your suggestions and concerns.
> >
> > I share Stack's suggestions. This would be similar to hbase-thirdparty.
> The
> > new repo could be hbase-hbck/hbase-hbck2. As this tool will be used by
> > hbase users/ developers, hbase JIRA can be used for hbck issues.
> >
> > bq. How often does HBCK need to re-use methods and constants from code
> > in hbase-common, hbase-server, etc?
> > bq. Is it a goal to firm up API stability around this shared code.
> >
> > bq. If we do this can we also move out hbck version 1?
> >
> > As HBCK2 tool will be freshly written, we can try to achieve this goal. I
> > think its great idea to move hbck1 to new repo as well. Though I think
> its
> > more involved with hbck1 as the existing code already uses what it can
> from
> > hbase-common and hbase-server etc. modules.
> >
> > bq. How often does HBCK make decisions on how to implement a correction
> > based on some known functionality (e.g. a bug) in a specific version(s)
> > of HBase. Concretely, would HBCK need to make corrections to an HBase
> > installation that are specific to a subset of HBase 2.x.y versions that
> > may not be valid for other 2.x.y versions?
> >
> > I see if this happens too often, compatibility metrics will be
> complicated.
> >
> > Thanks,
> > Umesh
> >
> >
> > On Tue, Jul 24, 2018 at 10:27 AM Andrew Purtell 
> wrote:
> >
> >> If we do this can we also move out hbck version 1? It would be really
> weird
> >> in my opinion to have v2 in a separate repo but v1 shipping with the 1.x
> >> releases. That would be a source of understandable confusion.
> >>
> >> I believe our compatibility guidelines allow us to upgrade interface
> >> annotations from private to LP or Public and from LP to Public. These
> are
> >> not changes that impact source or binary compatibility. They only change
> >> the promises we make going forward about their stability. I believe we
> can
> >> allow these in new minors, so we could potentially move hbck out in a
> >> 1.5.0.
> >>
> >>
> >> On Mon, Jul 23, 2018 at 4:46 PM Stack  wrote:
> >>
> >>> On Thu, Jul 19, 2018 at 2:09 PM Umesh Agashe
> >>  
> >>> wrote:
> >>>
>  Hi,
> 
>  I've had the opportunity to talk about HBCK2 with a few of you. One of
> >>> the
>  suggestions is to to have a separate git repository for HBCK2. Lets
> >>> discuss
>  about it.
> 
>  In the past when bugs were found in hbck, there is no easy way to
> >> release
>  patched version of just hbck (without patching HBase). If HBCK2 has a
>  separate git repo, HBCK2 versions will not be tightly related to HBase
>  versions. Fixing and releasing hbck2, may not require patching HBase.
>  Though tight coupling will be somewhat loosened, HBCK2 will still
> >> depend
> >>> on
>  HBase APIs/ code. Caution will be required going forward regarding
>  compatibility.
> 
>  What you all think?
> 
> 
> >>> I think this the way to go.
> >>>
> >>> We'd make a new hbase-hbck2 repo as we did for hbase-thirdparty?
> >>>
> >>> We'd use the hbase JIRA for hbase-hbck2 issues?
> >>>
> >>> We'd make hbase-hbck2 releases on occasion that the PMC voted on?
> >>>
> >>> Sounds great!
> >>> St.Ack
> >>>
> >>> Thanks,
>  Umesh
> 
>  JIRA:  https://issues.apache.org/jira/browse/HBASE-19121.
>  Doc:
> 
> 
> >>>
> >>
> https://docs.google.com/document/d/1NxSFu4TKQ6lY-9J5qsCcJb9kZOnkfX66KMYsiVxBy0Y/edit?usp=sharing
> 
> >>>
> >>
> >>
> >> --
> >> Best regards,
> >> Andrew
> >>
> >> Words like orphans lost among the crosstalk, meaning torn from truth's
> >> decrepit hands
> >> - A23, Crosstalk
> >>
> >
>


Re: [DISCUSS] Separate Git Repository for HBCK2

2018-07-25 Thread Umesh Agashe
Thanks Josh! A separate 'operator-tools' repo for hbase tools is a great
suggestion. We can work towards it, starting with hbck2. Each existing tool
needs to be looked at in detail regarding how much code it shares with HBase.

On Wed, Jul 25, 2018 at 11:11 AM Josh Elser  wrote:

> Thanks, Umesh. Seems like you're saying it's not a problem now, but
> you're not sure if it would become a problem. Regardless of that, it's a
> goal to not be version-specific (and thus, we can have generic hbck-v1
> and hbck-v2 tools). LMK if I misread, please :)
>
> One more thought, it would be nice to name this repository as
> "operator-tools" or similar (instead of hbck). A separate repo on its
> own release cadence is a nice vehicle for random sorts of recovery,
> slice-and-dice, one-off tools. I think HBCK is one example of
> administrator/operator tooling we provide (certainly, the most used),
> but we have the capacity to provide more than just that.
>
> On 7/24/18 5:55 PM, Umesh Agashe wrote:
> > Thanks Stack, Josh and Andrew for your suggestions and concerns.
> >
> > I share Stack's suggestions. This would be similar to hbase-thirdparty.
> The
> > new repo could be hbase-hbck/hbase-hbck2. As this tool will be used by
> > hbase users/ developers, hbase JIRA can be used for hbck issues.
> >
> > bq. How often does HBCK need to re-use methods and constants from code
> > in hbase-common, hbase-server, etc?
> > bq. Is it a goal to firm up API stability around this shared code.
> >
> > bq. If we do this can we also move out hbck version 1?
> >
> > As HBCK2 tool will be freshly written, we can try to achieve this goal. I
> > think its great idea to move hbck1 to new repo as well. Though I think
> its
> > more involved with hbck1 as the existing code already uses what it can
> from
> > hbase-common and hbase-server etc. modules.
> >
> > bq. How often does HBCK make decisions on how to implement a correction
> > based on some known functionality (e.g. a bug) in a specific version(s)
> > of HBase. Concretely, would HBCK need to make corrections to an HBase
> > installation that are specific to a subset of HBase 2.x.y versions that
> > may not be valid for other 2.x.y versions?
> >
> > I see if this happens too often, compatibility metrics will be
> complicated.
> >
> > Thanks,
> > Umesh
> >
> >
> > On Tue, Jul 24, 2018 at 10:27 AM Andrew Purtell 
> wrote:
> >
> >> If we do this can we also move out hbck version 1? It would be really
> weird
> >> in my opinion to have v2 in a separate repo but v1 shipping with the 1.x
> >> releases. That would be a source of understandable confusion.
> >>
> >> I believe our compatibility guidelines allow us to upgrade interface
> >> annotations from private to LP or Public and from LP to Public. These
> are
> >> not changes that impact source or binary compatibility. They only change
> >> the promises we make going forward about their stability. I believe we
> can
> >> allow these in new minors, so we could potentially move hbck out in a
> >> 1.5.0.
> >>
> >>
> >> On Mon, Jul 23, 2018 at 4:46 PM Stack  wrote:
> >>
> >>> On Thu, Jul 19, 2018 at 2:09 PM Umesh Agashe
> >>  
> >>> wrote:
> >>>
>  Hi,
> 
>  I've had the opportunity to talk about HBCK2 with a few of you. One of
> >>> the
>  suggestions is to to have a separate git repository for HBCK2. Lets
> >>> discuss
>  about it.
> 
>  In the past when bugs were found in hbck, there is no easy way to
> >> release
>  patched version of just hbck (without patching HBase). If HBCK2 has a
>  separate git repo, HBCK2 versions will not be tightly related to HBase
>  versions. Fixing and releasing hbck2, may not require patching HBase.
>  Though tight coupling will be somewhat loosened, HBCK2 will still
> >> depend
> >>> on
>  HBase APIs/ code. Caution will be required going forward regarding
>  compatibility.
> 
>  What you all think?
> 
> 
> >>> I think this the way to go.
> >>>
> >>> We'd make a new hbase-hbck2 repo as we did for hbase-thirdparty?
> >>>
> >>> We'd use the hbase JIRA for hbase-hbck2 issues?
> >>>
> >>> We'd make hbase-hbck2 releases on occasion that the PMC voted on?
> >>>
> >>> Sounds great!
> >>> St.Ack
> >>>
> >>> Thanks,
>  Umesh
> 
>  JIRA:  https://issues.apache.org/jira/browse/HBASE-19121.
>  Doc:
> 
> 
> >>>
> >>
> https://docs.google.com/document/d/1NxSFu4TKQ6lY-9J5qsCcJb9kZOnkfX66KMYsiVxBy0Y/edit?usp=sharing
> 
> >>>
> >>
> >>
> >> --
> >> Best regards,
> >> Andrew
> >>
> >> Words like orphans lost among the crosstalk, meaning torn from truth's
> >> decrepit hands
> >> - A23, Crosstalk
> >>
> >
>


Re: [DISCUSS] Expanded "work items" for HBase-in-the-Cloud doc

2018-07-25 Thread Josh Elser

Let me give an update on-list for everyone:

First and foremost, thank you very much to everyone who took the time to 
read this, with an extra thanks to those who participated in discussion. 
There were lots of great points raised. Some about things that were 
unclear in the doc, and others shining light onto subjects I hadn't 
considered yet.


My biggest take-away is that I complicated this document by tying it too 
closely with "HBase on Cloud", treating the WAL+Ratis LogService as the 
only/biggest thing to figure out. This was inaccurate and overly bold of 
me: I apologize. I think this complicated discussion on a number of 
points, and ate a good bit of some of your time.


My goal was to present this as an important part of a transition to the 
"cloud", giving justification to what WAL+Ratis helps HBase achieve. I 
did not want this document to be a step-by-step guide to a perfect HBase 
on Cloud design. I need to do a better job with this in the future; sorry.


That said, my feeling is that, on the whole, folks are in support of the 
proposed changes/architecture described for the WAL+Ratis work (tl;dr 
revisit WAL API, plug in current WAL implementation to any API 
modification, build new Ratis-backed WAL impl). There were some concerns 
which still need immediate action that I am aware of:


* Sync with Ram and Anoop re: in-memory WAL [1]
* Where is Ratis LogService metadata kept? How do we know what 
LogStreams were being used/maintained by a RS? How does this tie into 
recovery?


There are also long-term concerns which I don't think I have an answer 
for yet (for either reasons out of my control or a lack of technical 
understanding):


* Maturity of the Ratis community
* Required performance by HBase and the ability of the LogService to 
provide that perf (Areas already mentioned: gRPC perf, fsyncs bogging 
down disks, ability to scale RAFT quorums).
* Continue with WAL-per-RS or move to WAL-per-Region? Related to perf, 
dependent upon Ratis scalability.
* I/O amplification on WAL retention for backup and replication 
("logstream export")
* Ensure that LogStreams can be exported to a dist-filesystem in a 
manner which requires no additional metadata/handling (avoid more 
storage/mgmt complexity)

* Ability to build krb5 authn into Ratis (really, gRPC)

I will continue the two immediate action items. I think the latter 
concerns are some that will require fingers-on-keyboard -- I don't know 
enough about runtime characteristics without seeing it for myself.


All this said, I'd like to start moving toward the point where we start 
breaking out this work into a feature-branch off of master and start 
building code. My hope is that this is amenable to everyone, with the
acknowledgment that the Ratis work is considered "experimental" and not an
attempt to make all of HBase use Ratis-backed WALs.


Finally, I do *not* want this message to be interpreted as me squashing 
anyone's concerns. My honest opinion is that discussion has died down, 
but I will be the first to apologize if I have missed any outstanding 
concerns. Please, please, please ping me if I am negligent.


Thanks once again for everyone's participation.

[1] 
https://docs.google.com/document/d/1Su5py_T5Ytfh9RoTTX2s20KbSJwBHVxbO7ge5ORqbCk/edit?disco=CBm3RLM


On 2018/07/13 20:15:45, Josh Elser wrote:

Hi all,


A long time ago, I shared a document about a (I'll call it..) "vision" 
where we make some steps towards decoupling HBase from HDFS in an effort 
to make deploying HBase on Cloud IaaS providers a bit easier 
(operational simplicity, effective use of common IaaS paradigms, etc).


https://docs.google.com/document/d/1Su5py_T5Ytfh9RoTTX2s20KbSJwBHVxbO7ge5ORqbCk/edit?usp=sharing

A good ask from our Stack back then was: "[can you break down this
work]?" The original document was very high-level, and asking for some
more details made a lot of sense. Months later, I'd like to share that
I've updated the original document with some new content at the bottom 
(as well as addressed some comments which went unanswered by me -- sorry!)


Based on a discussion I had earlier this week (and some discussions 
during HBaseCon in California in June), I've tried to add a brief 
"refresher" on what some of the big goals for this effort are. Please 
check it out at your leisure and let me know what you think. Would like 
to start getting some fingers behind this all and pump out some code :)


https://docs.google.com/document/d/1Su5py_T5Ytfh9RoTTX2s20KbSJwBHVxbO7ge5ORqbCk/edit#bookmark=id.fml9ynrqagk

- Josh



Prep for release candidate 1.5.0 RC0

2018-07-25 Thread Andrew Purtell
I would like to put up the first release candidate for 1.5.0 by the end of
August. To that end over the next couple of weeks I will be evaluating test
stability, cluster stability under chaos testing, and performance
differences (if any) with the latest 1.2, 1.3, and 1.4 releases as measured
by the open source benchmarking tools at our disposal, PE, LTT, and YCSB.

If you have any backport work to branch-1 pending, please consider
finishing it up and getting it in within the next couple of weeks. However,
if the changes are likely to have a significant impact (for example, it
conforms to compatibility guidelines for a minor release, but not a patch
release) then you might want to hold off until after branch-1.5 has been
branched, so it can go into branch-1 for a 1.6.0 release toward the end of
the year. Use your best judgement is all I ask.

-- 
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk


[VOTE] The first HBase 1.4.6 release candidate (RC0) is available

2018-07-25 Thread Andrew Purtell
The first HBase 1.4.6 release candidate (RC0) is available for download at
https://dist.apache.org/repos/dist/dev/hbase/hbase-1.4.6RC0/ and Maven
artifacts are available in the temporary repository
https://repository.apache.org/content/repositories/orgapachehbase-1226/ .

The git tag corresponding to the candidate is '1.4.6RC0' (a55bcbd4fc).

A detailed source and binary compatibility report for this release is
available for your review at
https://dist.apache.org/repos/dist/dev/hbase/hbase-1.4.6RC0/compat-check-report.html
. There is an added method to the LimitedPrivate interface ReplicationPeer
which will not cause binary compatibility issues but will require source
changes at recompilation. This type of additive change is allowed. The
internal utility class Base64 has been made private and so the related
changes are allowed.
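For context, the kind of additive change described above looks roughly like
the following (names are hypothetical): already-compiled implementations keep
linking, but an implementation recompiled against the new release must add the
method.

{code}
/** Illustration only: an interface that gained a method in a new release. */
public interface ExampleLimitedPrivateInterface {

  /** Existed before the release. */
  String getId();

  /**
   * Newly added: binary-compatible for existing jars, but source changes are
   * needed when an implementing class is recompiled.
   */
  long getNewValue();
}
{code}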

A list of the 34 issues resolved in this release can be found at
https://s.apache.org/kolm .

Please try out the candidate and vote +1/0/-1.

This vote will be open for at least 72 hours. Unless there are objections, I
will try to close it on Monday, July 30, 2018 if we have sufficient votes.

Prior to making this announcement I made the following preflight checks:

RAT check passes (7u80)
Unit test suite passes (7u80)
LTT load 1M rows with 100% verification and 20% updates (8u172)
ITBLL Loop 1 500M rows with serverKilling monkey (8u172)


-- 
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk


[jira] [Reopened] (HBASE-20893) Data loss if splitting region while ServerCrashProcedure executing

2018-07-25 Thread stack (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-20893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack reopened HBASE-20893:
---

Reopening to look at these logs I see when running this patch on a cluster (it's
great that it detected recovered.edits... but it looks like the patch causes us
to hit a CODE-BUG... though we seem to be ok... minimally it will freak out an
operator):

{code}
2018-07-25 06:46:56,692 ERROR [PEWorker-3] assignment.SplitTableRegionProcedure: Error trying to split region 2cb977a87bc6bdf90ef7fc71320d7b50 in the table IntegrationTestBigLinkedList (in state=SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS)
java.io.IOException: Recovered.edits are found in Region: {ENCODED => 2cb977a87bc6bdf90ef7fc71320d7b50, NAME => 'IntegrationTestBigLinkedList,z\xAA;\xC7M\x1Bf8\x85\xB5\x07\xD5\x9B#\xCD\xCC,1531911202047.2cb977a87bc6bdf90ef7fc71320d7b50.', STARTKEY => 'z\xAA;\xC7M\x1Bf8\x85\xB5\x07\xD5\x9B#\xCD\xCC', ENDKEY => '{\x8D\xF2?'}, abort split to prevent data loss
  at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.checkClosedRegion(SplitTableRegionProcedure.java:151)
  at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.executeFromState(SplitTableRegionProcedure.java:259)
  at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.executeFromState(SplitTableRegionProcedure.java:92)
  at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:184)
  at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:850)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1472)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1240)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:75)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1760)

2018-07-25 06:46:56,934 INFO  [PEWorker-3] procedure.MasterProcedureScheduler: pid=4106, ppid=4105, state=SUCCESS; UnassignProcedure table=IntegrationTestBigLinkedList, region=2cb977a87bc6bdf90ef7fc71320d7b50, server=ve0540.halxg.cloudera.com,16020,1532501580658 checking lock on 2cb977a87bc6bdf90ef7fc71320d7b50

2018-07-25 06:46:56,934 ERROR [PEWorker-3] procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=4106, ppid=4105, state=SUCCESS; UnassignProcedure table=IntegrationTestBigLinkedList, region=2cb977a87bc6bdf90ef7fc71320d7b50, server=ve0540.halxg.cloudera.com,16020,1532501580658
java.lang.UnsupportedOperationException: Unhandled state REGION_TRANSITION_FINISH; there is no rollback for assignment unless we cancel the operation by dropping/disabling the table
  at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.rollback(RegionTransitionProcedure.java:412)
  at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.rollback(RegionTransitionProcedure.java:95)
  at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:864)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1372)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1328)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1197)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:75)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1760)
2018-07-25 06:46:57,088 ERROR [PEWorker-3] procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=4106, ppid=4105, state=SUCCESS; UnassignProcedure table=IntegrationTestBigLinkedList, region=2cb977a87bc6bdf90ef7fc71320d7b50, server=ve0540.halxg.cloudera.com,16020,1532501580658
java.lang.UnsupportedOperationException:

Re: [DISCUSS] Separate Git Repository for HBCK2

2018-07-25 Thread Andrew Purtell
Yes, and in that vein also VerifyReplication and tools of that nature.


On Wed, Jul 25, 2018 at 11:11 AM Josh Elser  wrote:

> Thanks, Umesh. Seems like you're saying it's not a problem now, but
> you're not sure if it would become a problem. Regardless of that, it's a
> goal to not be version-specific (and thus, we can have generic hbck-v1
> and hbck-v2 tools). LMK if I misread, please :)
>
> One more thought, it would be nice to name this repository as
> "operator-tools" or similar (instead of hbck). A separate repo on its
> own release cadence is a nice vehicle for random sorts of recovery,
> slice-and-dice, one-off tools. I think HBCK is one example of
> administrator/operator tooling we provide (certainly, the most used),
> but we have the capacity to provide more than just that.
>
> On 7/24/18 5:55 PM, Umesh Agashe wrote:
> > Thanks Stack, Josh and Andrew for your suggestions and concerns.
> >
> > I share Stack's suggestions. This would be similar to hbase-thirdparty.
> The
> > new repo could be hbase-hbck/hbase-hbck2. As this tool will be used by
> > hbase users/ developers, hbase JIRA can be used for hbck issues.
> >
> > bq. How often does HBCK need to re-use methods and constants from code
> > in hbase-common, hbase-server, etc?
> > bq. Is it a goal to firm up API stability around this shared code.
> >
> > bq. If we do this can we also move out hbck version 1?
> >
> > As HBCK2 tool will be freshly written, we can try to achieve this goal. I
> > think its great idea to move hbck1 to new repo as well. Though I think
> its
> > more involved with hbck1 as the existing code already uses what it can
> from
> > hbase-common and hbase-server etc. modules.
> >
> > bq. How often does HBCK make decisions on how to implement a correction
> > based on some known functionality (e.g. a bug) in a specific version(s)
> > of HBase. Concretely, would HBCK need to make corrections to an HBase
> > installation that are specific to a subset of HBase 2.x.y versions that
> > may not be valid for other 2.x.y versions?
> >
> > I see if this happens too often, compatibility metrics will be
> complicated.
> >
> > Thanks,
> > Umesh
> >
> >
> > On Tue, Jul 24, 2018 at 10:27 AM Andrew Purtell 
> wrote:
> >
> >> If we do this can we also move out hbck version 1? It would be really
> weird
> >> in my opinion to have v2 in a separate repo but v1 shipping with the 1.x
> >> releases. That would be a source of understandable confusion.
> >>
> >> I believe our compatibility guidelines allow us to upgrade interface
> >> annotations from private to LP or Public and from LP to Public. These
> are
> >> not changes that impact source or binary compatibility. They only change
> >> the promises we make going forward about their stability. I believe we
> can
> >> allow these in new minors, so we could potentially move hbck out in a
> >> 1.5.0.
> >>
> >>
> >> On Mon, Jul 23, 2018 at 4:46 PM Stack  wrote:
> >>
> >>> On Thu, Jul 19, 2018 at 2:09 PM Umesh Agashe
> >>  
> >>> wrote:
> >>>
>  Hi,
> 
>  I've had the opportunity to talk about HBCK2 with a few of you. One of
> >>> the
>  suggestions is to to have a separate git repository for HBCK2. Lets
> >>> discuss
>  about it.
> 
>  In the past when bugs were found in hbck, there is no easy way to
> >> release
>  patched version of just hbck (without patching HBase). If HBCK2 has a
>  separate git repo, HBCK2 versions will not be tightly related to HBase
>  versions. Fixing and releasing hbck2, may not require patching HBase.
>  Though tight coupling will be somewhat loosened, HBCK2 will still
> >> depend
> >>> on
>  HBase APIs/ code. Caution will be required going forward regarding
>  compatibility.
> 
>  What you all think?
> 
> 
> >>> I think this the way to go.
> >>>
> >>> We'd make a new hbase-hbck2 repo as we did for hbase-thirdparty?
> >>>
> >>> We'd use the hbase JIRA for hbase-hbck2 issues?
> >>>
> >>> We'd make hbase-hbck2 releases on occasion that the PMC voted on?
> >>>
> >>> Sounds great!
> >>> St.Ack
> >>>
> >>> Thanks,
>  Umesh
> 
>  JIRA:  https://issues.apache.org/jira/browse/HBASE-19121.
>  Doc:
> 
> 
> >>>
> >>
> https://docs.google.com/document/d/1NxSFu4TKQ6lY-9J5qsCcJb9kZOnkfX66KMYsiVxBy0Y/edit?usp=sharing
> 
> >>>
> >>
> >>
> >> --
> >> Best regards,
> >> Andrew
> >>
> >> Words like orphans lost among the crosstalk, meaning torn from truth's
> >> decrepit hands
> >> - A23, Crosstalk
> >>
> >
>


-- 
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk


Re: [DISCUSS] Separate Git Repository for HBCK2

2018-07-25 Thread Josh Elser
Thanks, Umesh. Seems like you're saying it's not a problem now, but 
you're not sure if it would become a problem. Regardless of that, it's a 
goal to not be version-specific (and thus, we can have generic hbck-v1 
and hbck-v2 tools). LMK if I misread, please :)


One more thought, it would be nice to name this repository as 
"operator-tools" or similar (instead of hbck). A separate repo on its 
own release cadence is a nice vehicle for random sorts of recovery, 
slice-and-dice, one-off tools. I think HBCK is one example of 
administrator/operator tooling we provide (certainly, the most used), 
but we have the capacity to provide more than just that.


On 7/24/18 5:55 PM, Umesh Agashe wrote:

Thanks Stack, Josh and Andrew for your suggestions and concerns.

I share Stack's suggestions. This would be similar to hbase-thirdparty. The
new repo could be hbase-hbck/hbase-hbck2. As this tool will be used by
hbase users/ developers, hbase JIRA can be used for hbck issues.

bq. How often does HBCK need to re-use methods and constants from code
in hbase-common, hbase-server, etc?
bq. Is it a goal to firm up API stability around this shared code.

bq. If we do this can we also move out hbck version 1?

As HBCK2 tool will be freshly written, we can try to achieve this goal. I
think its great idea to move hbck1 to new repo as well. Though I think its
more involved with hbck1 as the existing code already uses what it can from
hbase-common and hbase-server etc. modules.

bq. How often does HBCK make decisions on how to implement a correction
based on some known functionality (e.g. a bug) in a specific version(s)
of HBase. Concretely, would HBCK need to make corrections to an HBase
installation that are specific to a subset of HBase 2.x.y versions that
may not be valid for other 2.x.y versions?

I see if this happens too often, compatibility metrics will be complicated.

Thanks,
Umesh


On Tue, Jul 24, 2018 at 10:27 AM Andrew Purtell  wrote:


If we do this can we also move out hbck version 1? It would be really weird
in my opinion to have v2 in a separate repo but v1 shipping with the 1.x
releases. That would be a source of understandable confusion.

I believe our compatibility guidelines allow us to upgrade interface
annotations from private to LP or Public and from LP to Public. These are
not changes that impact source or binary compatibility. They only change
the promises we make going forward about their stability. I believe we can
allow these in new minors, so we could potentially move hbck out in a
1.5.0.


On Mon, Jul 23, 2018 at 4:46 PM Stack  wrote:


On Thu, Jul 19, 2018 at 2:09 PM Umesh Agashe




wrote:


Hi,

I've had the opportunity to talk about HBCK2 with a few of you. One of

the

suggestions is to to have a separate git repository for HBCK2. Lets

discuss

about it.

In the past when bugs were found in hbck, there is no easy way to

release

patched version of just hbck (without patching HBase). If HBCK2 has a
separate git repo, HBCK2 versions will not be tightly related to HBase
versions. Fixing and releasing hbck2, may not require patching HBase.
Though tight coupling will be somewhat loosened, HBCK2 will still

depend

on

HBase APIs/ code. Caution will be required going forward regarding
compatibility.

What you all think?



I think this the way to go.

We'd make a new hbase-hbck2 repo as we did for hbase-thirdparty?

We'd use the hbase JIRA for hbase-hbck2 issues?

We'd make hbase-hbck2 releases on occasion that the PMC voted on?

Sounds great!
St.Ack

Thanks,

Umesh

JIRA:  https://issues.apache.org/jira/browse/HBASE-19121.
Doc:





https://docs.google.com/document/d/1NxSFu4TKQ6lY-9J5qsCcJb9kZOnkfX66KMYsiVxBy0Y/edit?usp=sharing







--
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
- A23, Crosstalk





Re: HBase nightly job failing forever

2018-07-25 Thread Allen Wittenauer


> On Jul 25, 2018, at 10:48 AM, Chris Lambertus  wrote:
> 
> On-demand resources are certainly being considered (and we had these in the 
> past,) but I will point out that ephemeral (“on-demand”) cloud builds are in 
> direct opposition to some of the points brought up by Allen in the other 
> jenkins storage thread, in that they tend to rely on persistent object 
> storage in their workspaces to improve the efficiency of their builds. 
> Perhaps this would be less of an issue with an on-demand instance which would 
> theoretically have no resource contention?

Likely. 

A lot of work went into greatly reducing the amount of time Hadoop 
spent in the build queue and running on the nodes. It was “the big one” but I 
feel like that’s not so true, or at least harder to prove, anymore. I estimate 
we shaved days off of the queue compared with 5 years ago. Part of that was keeping 
caches, since the ‘Hadoop’ queue nodes were large. But I feel like 
significantly more work went into “reducing the stupidity” in the CI jobs, 
though.

Two examples:

* For source changes, only build and unit test the relevant parts 
of a patch, e.g., a patch that changes code in module A should only see module 
A’s unit tests run. Let the nightlies sort out any inter-module brokenness 
post-commit.

* If a patch is for documentation, only run mvn site. If a patch is 
for shell code, only run shellcheck and relevant unit tests. Running the Java 
unit tests is pointless.

Building everything every time is a waste of time for modularized 
source trees.

Combined with the walls put up around the docker containers (e.g., 
limiting how many processes can be launched at one time, memory limits, etc), I 
personally felt much better that, other than disk space, the Hadoop jobs were 
being exemplary citizens vs. pre-Yetus.

Re: HBase nightly job failing forever

2018-07-25 Thread Chris Lambertus


> On Jul 25, 2018, at 10:34 AM, Andrew Purtell  wrote:
> 

> public clouds instead. I'm not sure if the ASF is set up to manage on
> demand billing for test resources but this could be advantageous. It would
> track actual usage not fixed costs. To avoid budget overrun there would be
> caps and limits. Eventually demand would hit this new ceiling but the


On-demand resources are certainly being considered (and we had these in the 
past,) but I will point out that ephemeral (“on-demand”) cloud builds are in 
direct opposition to some of the points brought up by Allen in the other 
jenkins storage thread, in that they tend to rely on persistent object storage 
in their workspaces to improve the efficiency of their builds. Perhaps this 
would be less of an issue with an on-demand instance which would theoretically 
have no resource contention?


-Chris
ASF Infra


> --
> Best regards,
> Andrew






Re: HBase nightly job failing forever

2018-07-25 Thread Andrew Purtell
Thanks Joan and Bertrand.

> The number of failed builds in our stream that are directly related to
this "tragedy of the commons" far exceeds the number of successful builds
at this point, and unfortunately Travis CI is having parallel capacity
issues that prevent us from moving to them wholesale as well.

This has been my experience. So at one point years ago I moved my work off
the shared pool at the ASF as an individual contributor and have been
funding the testing I personally do up on EC2 out of pocket. This isn't a
general solution for our project, though, as it depends on my time and
ability to contribute, and focuses only on what I'm doing at the moment,
which may not be what the project would like to see happen most.

I will look into targeted donation at my employer but am not optimistic.

It might be better to look at decommissioning some if not most of the
overutilized fixed test resources and use on-demand executors launched on
public clouds instead. I'm not sure if the ASF is set up to manage
on-demand billing for test resources, but this could be advantageous. It would
track actual usage not fixed costs. To avoid budget overrun there would be
caps and limits. Eventually demand would hit this new ceiling but the
impact would be longer queue waiting times not job failures due to
environmental stress, so that would be an improvement. Each job would run
in its own virtual server or container so would be free of many of the
environmental issues we see now. Or to get the same improvement on the
resources we have now limit executor parallelism. Better to have a job wait
in queue than to run and fail anyway because the host environment is under
stress. For what it's worth.


On Wed, Jul 25, 2018 at 10:20 AM Joan Touzet  wrote:

> I'll speak to CouchDB - the donation is directly in the form of a Jenkins
> build agent with our tag, no money is changed hands. The donator received
> a letter from fundraising@a.o allowing for tax deduction on the equivalent
> amount that the ASF leasing the machine would have cost for a year's
> donation. We have 24x7 support on the node from the provider, who performs
> all sysadmin (rather than burdening Infra with having to run puppet on our
> build machine). This was arranged so we could have a FreeBSD node in the
> build array.
>
> We have another donator in the wings who will be adding a build node for
> us; at that point, we expect to move all of our builds to our own Jenkins
> build agents and won't be in the common pool any longer. The number of
> failed builds in our stream that are directly related to this "tragedy of
> the commons" far exceeds the number of successful builds at this point,
> and unfortunately Travis CI is having parallel capacity issues that prevent
> us from moving to them wholesale as well.
>
> -Joan
>
> - Original Message -
> From: "Andrew Purtell" 
> To: ipv6g...@gmail.com
> Cc: "Andrew Purtell" , "dev" ,
> bui...@apache.org
> Sent: Wednesday, July 25, 2018 12:22:08 PM
> Subject: Re: HBase nightly job failing forever
>
> How does a targeted hardware donation work? I was under the impression that
> targeted donations are not accepted by the ASF. Maybe it is different in
> infrastructure, but this is the first time I've heard of it. Who does the
> donation on those projects? DataStax for Cassandra? Who for CouchDB? Google
> for Beam? By what process are the donations made and how are they audited
> to confirm the donation is spent on the desired resources? Can we get a
> contact for one of them for testimonial regarding this process? Is this
> process documented?
>
>
>
>
> On Tue, Jul 24, 2018 at 4:27 PM Gav  wrote:
>
> > Hi Andrew,
> >
> > On Wed, Jul 25, 2018 at 3:21 AM Andrew Purtell 
> > wrote:
> >
> >> Thanks for this note.
> >>
> >> I'm release managing the 1.4 release. I have been running the unit test
> >> suite on reasonably endowed EC2 instances and there are no observed
> always
> >> failing tests. A few can be flaky. In comparison the Apache test
> resources
> >> have been heavily resource constrained for years and frequently suffer
> from
> >> environmental effects like botched settings, disk space issues, and
> >> contention with other test executors.
> >>
> >
> > Our Jenkins nodes are configured via puppet these days and are pretty
> > stable, to which settings do you know of that might (still) be botched?
> > Yes, resources are shared and on occasion run to capacity. This is one
> > reason for my initial mail - these HBase builds are consuming 10 or more
> > executors
> > -at the same time- and are starving executors for other builds. The fact
> > these tests have been failing for well over a month and that you mention
> > below  will be
> > ignoring them does not make for good cross ASF community spirit, we are
> > all in this together and every little bit helps. This is not a target at
> > one project, others
> > will be getting a similar note and I hope we can come to a resolution
> > suitable for all.
> > Disk space 

Re: HBase nightly job failing forever

2018-07-25 Thread Joan Touzet
I'll speak to CouchDB - the donation is directly in the form of a Jenkins
build agent with our tag; no money changes hands. The donator received
a letter from fundraising@a.o allowing for a tax deduction on the equivalent
amount that leasing the machine would have cost the ASF for a year's
donation. We have 24x7 support on the node from the provider, who performs
all sysadmin (rather than burdening Infra with having to run puppet on our
build machine). This was arranged so we could have a FreeBSD node in the
build array.

We have another donator in the wings who will be adding a build node for
us; at that point, we expect to move all of our builds to our own Jenkins
build agents and won't be in the common pool any longer. The number of
failed builds in our stream that are directly related to this "tragedy of
the commons" far exceeds the number of successful builds at this point,
and unfortunately Travis CI is having parallel capacity issues that prevent
us from moving to them wholesale as well.

-Joan

- Original Message -
From: "Andrew Purtell" 
To: ipv6g...@gmail.com
Cc: "Andrew Purtell" , "dev" , 
bui...@apache.org
Sent: Wednesday, July 25, 2018 12:22:08 PM
Subject: Re: HBase nightly job failing forever

How does a targeted hardware donation work? I was under the impression that
targeted donations are not accepted by the ASF. Maybe it is different in
infrastructure, but this is the first time I've heard of it. Who does the
donation on those projects? DataStax for Cassandra? Who for CouchDB? Google
for Beam? By what process are the donations made and how are they audited
to confirm the donation is spent on the desired resources? Can we get a
contact for one of them for testimonial regarding this process? Is this
process documented?




On Tue, Jul 24, 2018 at 4:27 PM Gav  wrote:

> Hi Andrew,
>
> On Wed, Jul 25, 2018 at 3:21 AM Andrew Purtell 
> wrote:
>
>> Thanks for this note.
>>
>> I'm release managing the 1.4 release. I have been running the unit test
>> suite on reasonably endowed EC2 instances and there are no observed always
>> failing tests. A few can be flaky. In comparison the Apache test resources
>> have been heavily resource constrained for years and frequently suffer from
>> environmental effects like botched settings, disk space issues, and
>> contention with other test executors.
>>
>
> Our Jenkins nodes are configured via puppet these days and are pretty
> stable, to which settings do you know of that might (still) be botched?
> Yes, resources are shared and on occasion run to capacity. This is one
> reason for my initial mail - these HBase builds are consuming 10 or more
> executors
> -at the same time- and are starving executors for other builds. The fact
> these tests have been failing for well over a month and that you mention
> below  will be
> ignoring them does not make for good cross ASF community spirit, we are
> all in this together and every little bit helps. This is not a target at
> one project, others
> will be getting a similar note and I hope we can come to a resolution
> suitable for all.
> Disk space issues , yes, not on most of the Hadoop and related projects
> nodes - H0-H12 do not have disk space issues. As a Hadoop related project
> HBase should really be concentrating its builds there.
>
>
>> I think a 1.4 release will happen regardless of the job test results on
>> Apache infrastructure. I tend to ignore them as noisy and low signal.
>> Others in the HBase community don't necessarily feel the same, so please
>> don't take my viewpoint as particularly representative. We could try Alan's
>> suggestion first, before ignoring them outright.
>>
>
> No problem
>
>
>> Has anyone given thought toward expanding the pool of test build
>> resources? Or roping in cloud instances on demand? Jenkins has support for
>> that.
>>
>
> We have currently 19 Hadoop specific nodes available H0-H19 and another 28
> or so general use 'ubuntu' nodes for all to use. In addition we have
> projects
> that have targetted donated resources and the likes of Cassandra, CouchDB
> and Beam all have multiple nodes on which they have priority. I'll throw an
> idea
> out there than perhaps HBase could do something similar to increase our
> node pool and at the same time have priority on a few nodes f their own via
> a targeted
> hardware donation.
> Cloud on demand has been tried a year or two ago, we will revisit this
> also soon.
>
> Summary then, we currently have over 80 nodes connected to our Jenkins
> master - what figure did you have in mind when you say 'expanding the pool
> of test build resources' ?
>
> Thanks
>
> Gav...
>
>
>>
>> On Tue, Jul 24, 2018 at 9:16 AM Allen Wittenauer
>>  wrote:
>>
>>> I suspect the bigger issue is that the hbase tests are running
>>> on the ‘ubuntu’ machines. Since they only have ~300GB for workspaces, the
>>> hbase tests are eating a significant majority of it and likely could be
>>> dying randomly due to space issues.  [All the hbase 

Re: HBase nightly job failing forever

2018-07-25 Thread Bertrand Delacretaz
Hi,

On Wed, Jul 25, 2018 at 6:22 PM Andrew Purtell  wrote:
> ...How does a targeted hardware donation work? I was under the impression that
> targeted donations are not accepted by the ASF

This has changed, last year IIRC - there's a bit of information at
https://www.apache.org/foundation/contributing under "targeted sponsor
program".

I suppose fundraising@a.o is best for more specific questions.

Targeted sponsors are listed at http://www.apache.org/foundation/thanks.html

-Bertrand


Re: HBase nightly job failing forever

2018-07-25 Thread Andrew Purtell
How does a targeted hardware donation work? I was under the impression that
targeted donations are not accepted by the ASF. Maybe it is different in
infrastructure, but this is the first time I've heard of it. Who does the
donation on those projects? DataStax for Cassandra? Who for CouchDB? Google
for Beam? By what process are the donations made and how are they audited
to confirm the donation is spent on the desired resources? Can we get a
contact for one of them for testimonial regarding this process? Is this
process documented?




On Tue, Jul 24, 2018 at 4:27 PM Gav  wrote:

> Hi Andrew,
>
> On Wed, Jul 25, 2018 at 3:21 AM Andrew Purtell 
> wrote:
>
>> Thanks for this note.
>>
>> I'm release managing the 1.4 release. I have been running the unit test
>> suite on reasonably endowed EC2 instances and there are no observed always
>> failing tests. A few can be flaky. In comparison the Apache test resources
>> have been heavily resource constrained for years and frequently suffer from
>> environmental effects like botched settings, disk space issues, and
>> contention with other test executors.
>>
>
> Our Jenkins nodes are configured via puppet these days and are pretty
> stable, to which settings do you know of that might (still) be botched?
> Yes, resources are shared and on occasion run to capacity. This is one
> reason for my initial mail - these HBase builds are consuming 10 or more
> executors
> -at the same time- and are starving executors for other builds. The fact
> these tests have been failing for well over a month and that you mention
> below  will be
> ignoring them does not make for good cross ASF community spirit, we are
> all in this together and every little bit helps. This is not a target at
> one project, others
> will be getting a similar note and I hope we can come to a resolution
> suitable for all.
> Disk space issues , yes, not on most of the Hadoop and related projects
> nodes - H0-H12 do not have disk space issues. As a Hadoop related project
> HBase should really be concentrating its builds there.
>
>
>> I think a 1.4 release will happen regardless of the job test results on
>> Apache infrastructure. I tend to ignore them as noisy and low signal.
>> Others in the HBase community don't necessarily feel the same, so please
>> don't take my viewpoint as particularly representative. We could try Alan's
>> suggestion first, before ignoring them outright.
>>
>
> No problem
>
>
>> Has anyone given thought toward expanding the pool of test build
>> resources? Or roping in cloud instances on demand? Jenkins has support for
>> that.
>>
>
> We have currently 19 Hadoop specific nodes available H0-H19 and another 28
> or so general use 'ubuntu' nodes for all to use. In addition we have
> projects
> that have targetted donated resources and the likes of Cassandra, CouchDB
> and Beam all have multiple nodes on which they have priority. I'll throw an
> idea
> out there than perhaps HBase could do something similar to increase our
> node pool and at the same time have priority on a few nodes f their own via
> a targeted
> hardware donation.
> Cloud on demand has been tried a year or two ago, we will revisit this
> also soon.
>
> Summary then, we currently have over 80 nodes connected to our Jenkins
> master - what figure did you have in mind when you say 'expanding the pool
> of test build resources' ?
>
> Thanks
>
> Gav...
>
>
>>
>> On Tue, Jul 24, 2018 at 9:16 AM Allen Wittenauer
>>  wrote:
>>
>>> I suspect the bigger issue is that the hbase tests are running
>>> on the ‘ubuntu’ machines. Since they only have ~300GB for workspaces, the
>>> hbase tests are eating a significant majority of it and likely could be
>>> dying randomly due to space issues.  [All the hbase workspace directories +
>>> the yetus-m2 shared mvn cache dirs easily consume 20%+ of the space.
>>> Significantly more than the 50 or so other jobs that run on those
>>> machines.]
>>>
>>> By comparison, most of the ‘Hadoop’ nodes have 2-3TB for the big
>>> jobs to consume….
>>>
>>>
>>> > On Jul 24, 2018, at 8:58 AM, Josh Elser  wrote:
>>> >
>>> > Yep, sadly this is a very long tent-pole for us. There are many
>>> involved who have invested countless hours in making this better.
>>> >
>>> > Specific to that job you linked earlier, 3 test failures out of our
>>> total 4958 tests (0.06% failure rate) is all but "green" in my mind. I
>>> would ask that you keep that in mind, too.
>>> >
>>> > To that extent, others have also built another job specifically to
>>> find tests which are failing intermittently:
>>> https://builds.apache.org/job/HBase-Find-Flaky-Tests/25513/artifact/dashboard.html.
>>> I mention this as evidence to prove to you that this is not a baseless
>>> request from the HBase PMC ;)
>>> >
>>> > On 7/24/18 3:14 AM, Gav wrote:
>>> >> Ok, good enough, will wait, please also note 'master' branch and a few
>>> >> others have been failing for over a month also.
>>> >> I will check in again next month to see 

[jira] [Resolved] (HBASE-20746) Release 2.1.0

2018-07-25 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-20746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang resolved HBASE-20746.
---
Resolution: Fixed

> Release 2.1.0
> -
>
> Key: HBASE-20746
> URL: https://issues.apache.org/jira/browse/HBASE-20746
> Project: HBase
>  Issue Type: Umbrella
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
>
> After HBASE-20708 I do not think we will have unresolvable problems for the 2.1.0 
> release any more. So let's create an issue to track the release process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS] test-for-tests in precommit

2018-07-25 Thread Sean Busbey
Circling back on this: be aware that precommit has been updated so
that test-for-tests won't vote -1 on a contribution now. If the plugin
can't find tests, it'll give an advisory -0.

On Fri, Jul 13, 2018 at 10:28 AM, Sean Busbey  wrote:
> Hi folks!
>
> Given how often we end up accepting contributions despite
> test-for-tests complaining about lack of changed or new tests would
> anyone be opposed to me changing its vote from -1 to -0?
>
> The rationale for discounting its -1 looked reasonable in the issues I
> sampled. It usually was either some change that fixes a problem we
> can't test due to limitations in our test suite or  an optimization
> that's covered by existing tests.
>
> Maybe in the future if we get to a point where we're including nightly
> feature-specific cluster tests we could update it to recognize changes
> to that and then turn it back to having a vote that can fail the
> precommit test run.
>
> --
> Sean


[jira] [Created] (HBASE-20940) HStore.cansplit should not allow a split to happen if it has references

2018-07-25 Thread Vishal Khandelwal (JIRA)
Vishal Khandelwal created HBASE-20940:
-

 Summary: HStore.cansplit should not allow a split to happen if it 
has references
 Key: HBASE-20940
 URL: https://issues.apache.org/jira/browse/HBASE-20940
 Project: HBase
  Issue Type: Bug
Affects Versions: 1.3.2
Reporter: Vishal Khandelwal
Assignee: Vishal Khandelwal


When a split happens and immediately another split happens, it may result in a 
split of a region that still has references to its parent. More details about 
the scenario can be found in HBASE-20933.

HStore.hasReferences should check the store files on the filesystem rather than 
the in-memory objects.
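
A rough sketch of what a filesystem-based check could look like (assumptions: the
column-family directory layout, FileSystem.listStatus, and
StoreFileInfo.isReference(Path); this is not the actual patch):

{code}
// Sketch only: the family-directory layout and method placement are assumptions,
// not the actual HBASE-20940 change.
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.regionserver.StoreFileInfo;

public class HasReferencesSketch {
  /**
   * Returns true if any file under the given column-family directory is a
   * reference (half) file left behind by a previous split, based on what is
   * actually on the filesystem rather than the in-memory store file list.
   */
  static boolean hasReferences(FileSystem fs, Path familyDir) throws IOException {
    for (FileStatus status : fs.listStatus(familyDir)) {
      if (StoreFileInfo.isReference(status.getPath())) {
        return true;
      }
    }
    return false;
  }
}
{code}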



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20939) There will be race when we call suspendIfNotReady and then throw ProcedureSuspendedException

2018-07-25 Thread Duo Zhang (JIRA)
Duo Zhang created HBASE-20939:
-

 Summary: There will be race when we call suspendIfNotReady and 
then throw ProcedureSuspendedException
 Key: HBASE-20939
 URL: https://issues.apache.org/jira/browse/HBASE-20939
 Project: HBase
  Issue Type: Sub-task
Reporter: Duo Zhang


This is a very typical usage pattern in our procedure implementation. For example, in 
AssignProcedure we will call AM.queueAssign and then suspend ourselves to wait 
until the AM finishes processing our assign request.
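
In code, the suspend pattern looks roughly like the sketch below. Only
suspendIfNotReady and ProcedureSuspendedException come from the description here;
the exact signatures are from memory and the rest of the names are hypothetical
stand-ins, not the real AssignProcedure:

{code}
// Sketch only: class and method layout here are hypothetical stand-ins.
import org.apache.hadoop.hbase.procedure2.Procedure;
import org.apache.hadoop.hbase.procedure2.ProcedureEvent;
import org.apache.hadoop.hbase.procedure2.ProcedureSuspendedException;

public class SuspendPatternSketch {
  /** Called from a procedure's execute step, after queueing the assign request. */
  static void suspendUntilAssignDone(ProcedureEvent<?> assignDoneEvent,
      Procedure<?> self) throws ProcedureSuspendedException {
    // Park this procedure on the event. If the event is not ready yet,
    // suspendIfNotReady returns true and we throw so the executor suspends us.
    if (assignDoneEvent.suspendIfNotReady(self)) {
      // The race described below happens in the window between this call
      // returning true and the ProcedureExecutor persisting the suspended state.
      throw new ProcedureSuspendedException();
    }
    // The event was already ready: fall through and keep executing.
  }
}
{code}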

But there could be races. Think of this:
1. We call suspendIfNotReady on an event, and it returns true, so we need to wait.
2. The event is woken up, and the procedure is added back to the 
scheduler.
3. A worker picks up the procedure and finishes it.
4. We finally throw ProcedureSuspendedException, and the ProcedureExecutor suspends 
us and stores the state in the procedure store.

So we have a half-done procedure in the procedure store forever... This may 
cause assertion failures when loading procedures. And maybe the worker cannot finish 
the procedure, as when suspending we need to restore some state, for example, 
add something to RootProcedureState. But anyway, it will still lead to 
assertion failures or other unexpected errors.

And this cannot be fixed by simply adding a lock in the procedure, as most 
of the work is done in the ProcedureExecutor after we throw 
ProcedureSuspendedException.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: HBase nightly job failing forever

2018-07-25 Thread Greg Stein
On Wed, Jul 25, 2018 at 2:36 AM Robert Munteanu  wrote:

> Hi,
>
> On Wed, 2018-07-25 at 09:27 +1000, Gav wrote:
> > Disk space issues , yes, not on most of the Hadoop and related
> > projects
> > nodes - H0-H12 do not have disk space issues. As a Hadoop related
> > project
> > HBase should really be concentrating its builds there.
>
> A suggestion from the sidelines. We could add a 'large-disk-space'
> label to the jobs that use a lot of disk space and then also attach it
> to the executors that offer a lot of disk space.
>

Sure ... though I think it is the small disk that is the odd man out. The
nodes usually have lots of space, but a few don't. And we're also
provisioning with much larger disks nowadays.

Cheers,
-g


Re: HBase nightly job failing forever

2018-07-25 Thread Robert Munteanu
Hi,

On Wed, 2018-07-25 at 09:27 +1000, Gav wrote:
> Disk space issues , yes, not on most of the Hadoop and related
> projects
> nodes - H0-H12 do not have disk space issues. As a Hadoop related
> project
> HBase should really be concentrating its builds there.

A suggestion from the sidelines. We could add a 'large-disk-space'
label to the jobs that use a lot of disk space and then also attach it
to the executors that offer a lot of disk space.

Robert



[jira] [Created] (HBASE-20938) Set version to 2.1.1-SNAPSHOT for branch-2.1

2018-07-25 Thread Duo Zhang (JIRA)
Duo Zhang created HBASE-20938:
-

 Summary: Set version to 2.1.1-SNAPSHOT for branch-2.1
 Key: HBASE-20938
 URL: https://issues.apache.org/jira/browse/HBASE-20938
 Project: HBase
  Issue Type: Sub-task
  Components: build
Reporter: Duo Zhang
Assignee: Duo Zhang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (HBASE-20746) Release 2.1.0

2018-07-25 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-20746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang reopened HBASE-20746:
---

> Release 2.1.0
> -
>
> Key: HBASE-20746
> URL: https://issues.apache.org/jira/browse/HBASE-20746
> Project: HBase
>  Issue Type: Umbrella
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
>
> After HBASE-20708 I do not think we will have unresolvable problems for the 2.1.0 
> release any more. So let's create an issue to track the release process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20937) Update the support matrix in our ref guide about the recent hadoop releases

2018-07-25 Thread Duo Zhang (JIRA)
Duo Zhang created HBASE-20937:
-

 Summary: Update the support matrix in our ref guide about the 
recent hadoop releases
 Key: HBASE-20937
 URL: https://issues.apache.org/jira/browse/HBASE-20937
 Project: HBase
  Issue Type: Task
  Components: documentation
Reporter: Duo Zhang
 Fix For: 3.0.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)