Re: [openstack-dev] [trove] Adding support for HBase in Trove
While I applaud raising the issue on the mailing list to get more folks to weigh in, I think part of the problem may be the lack of a [sahara] tag on the subject. The thread is still tagged as a Trove-centric conversation. All respondents, please consider adding [sahara] to the subject.

Thanks,
Kevin

From: Amrith Kumar [amr...@tesora.com]
Sent: Thursday, January 07, 2016 1:59 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [trove] Adding support for HBase in Trove

> -----Original Message-----
> From: michael mccune [mailto:m...@redhat.com]
> Sent: Thursday, January 07, 2016 3:12 PM
> To: openstack-dev@lists.openstack.org
> Subject: Re: [openstack-dev] [trove] Adding support for HBase in Trove
>
> On 01/07/2016 11:59 AM, Amrith Kumar wrote:
> > From the things that you and Pete (Peter MacKinnon) are saying, I don't understand why there is an objection to accepting the currently proposed implementation which is clearly for single node deployments? Both Standalone and Pseudo-Distributed are by definition, explicitly, necessarily, absolutely, positively, definitely single node. I can't be more explicit about that. That's all that is being proposed at this time. See more comments below.
>
> i didn't think i explicitly objected to the spec, if it seems that way then i apologize. after reading the spec and the comments, it seemed that there was some question about engagement with the sahara team. i wanted to help bring some light to the issues surrounding deploying hbase and thought it would be good to participate in the discussion.

You are correct Michael. There was a suggestion that we should engage with the Sahara team (in the Trove team meeting yesterday) and that is what prompted this email thread. So I appreciate your participation as one who is a member of the Sahara team.

> > Further, the current proposal also chooses an implementation strategy that makes it much easier to handle fully-distributed in a different way in the future. Consider this, Trove could equally well have dealt with HBase using a single datastore for all operating modes. In the current implementation, one would create a HBase standalone instance using a command that included:
> >
> > --datastore hbase-standalone
> >
> > And a pseudo-distributed instance by including
> >
> > --datastore hbase-pseudo-distributed.
>
> and this delineation sounds reasonable to me
>
> > Trove could equally well function by having a single datastore (hbase) but this would make hbase-fully-distributed harder to do in a different way in the future. I consciously eschewed that path, for this very specific reason; it would limit choice in the future.
>
> agreed
>
> > Now, the implementation behind hbase-fully-distributed could be a custom Trove guest agent that could (if we decided to go that route) interact with Sahara. However, an alternative implementation of hbase-fully-distributed could orchestrate everything natively in Trove. There is much flexibility in the current proposal, and I submit to you that this is being lost in your reading of the specification and the current implementation as proposed.
>
> i don't think your characterization of my reading comprehension is fair. as i stated earlier, i wanted to participate in the discussion surrounding deploying a technology that sahara currently deploys. fwiw, i agree with what you are saying here, but i also think it is axiomatic, the trove team can choose whichever path it would like for implementation.
>
> >> i think this sounds reasonable, as long as we are limiting it to standalone mode. if the deployments start to take on a larger scope i agree it would be useful to leverage sahara for provisioning and scaling.
> >
> > Why only standalone? The current proposal explicitly covers only standalone and pseudo-distributed which are both valid strictly (add other adjectives here to taste) single node topologies and the currently submitted specification specifically carves out fully-distributed operation as requiring further thought and contemplation.
>
> i think starting with standalone mode (and not pseudo-distributed) is a more conservative approach to this. my reason for suggesting limiting this to standalone is that even in pseudo-distributed mode the need for managing hdfs and zookeeper is present, i wanted to highlight some of the overlap and the issues that will start to creep in surrounding this deployment.

The current code (submitted for review) provides both standalone and pseudo-distributed
Re: [openstack-dev] [trove] Adding support for HBase in Trove
On 1/6/16 8:20 PM, Amrith Kumar wrote:

Kevin Fox writes:

as far as that plugin ever should go. If you need scale up/down, etc, then you're starting to reimplement large swaths of Sahara, and like the Cinder plugin for Nova, there could be a plugin that works identically to the stand alone one that converts the same api over to a Sahara compatible one. You then farm the work over to Sahara.

I believe that this is not the case. The entire framework for integration with Cinder, Nova etc., already exists in Trove. Recall that trove already deals with about a dozen databases, several of which have support for clusters. The code to add HBase support to trove doesn't have to implement all of this framework that already exists. All that is being implemented is (literally) a Trove 'plugin' for HBase and a mechanism to build a HBase guest image.

-amrith

Right, I think that's the concern. A plugin for integration with a standalone/pseudo-distributed HBase deployment has arguably a reasonable scale to be managed by a Trove guestagent. That agent would also fire up the client RPC services necessary for an end user to interact with HBase remotely. But even the HBase project views standalone mode as a devel/test capability only. The fully distributed model gets orders of magnitude more complex. Is the agent plugin just wiring into an existing multi-node HBase deployment somewhere? Is it spawning/growing/shrinking HDFS endpoints itself? The "we already have cluster support in Trove" argument doesn't really track in a production Hadoop space, IMHO. That's why Sahara was developed.

My $0.02,
\Pete

-----Original Message-----
From: Fox, Kevin M [mailto:kevin@pnnl.gov]
Sent: Wednesday, January 06, 2016 7:32 PM
To: OpenStack Development Mailing List (not for usage questions) <openstack-dev@lists.openstack.org>
Subject: Re: [openstack-dev] [trove] Adding support for HBase in Trove

just my 2 cents... I think you can do both. The great thing about Trove is that it's providing an abstract api so users just deal with provisioning db's, scaling db's, etc. Having a simple plugin that doesn't depend on all of Sahara, for the case a user only wants a single node HBase does make sense. It's much easier for an Op to support that case if that's all their users ever want. But, that's probably as far as that plugin ever should go. If you need scale up/down, etc, then you're starting to reimplement large swaths of Sahara, and like the Cinder plugin for Nova, there could be a plugin that works identically to the stand alone one that converts the same api over to a Sahara compatible one. You then farm the work over to Sahara. Then, it's up to the ops to choose features and the overhead of supporting Sahara, or not, and you don't have to support implementing a whole cluster management system for Trove that already exists.

Thanks,
Kevin

From: Amrith Kumar [amr...@tesora.com]
Sent: Wednesday, January 06, 2016 3:15 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: [openstack-dev] [trove] Adding support for HBase in Trove

TL;DR

Should Trove treat HBase as a special database because one use case is as part of a large multi-node Hadoop cluster, and therefore either not support it at all, or necessarily use Sahara to provision and manage a cluster? There are pros and cons, and it is argued that the cons outweigh the pros; a blueprint/specification and an implementation for basic Trove support for HBase independent of Sahara have been submitted for review. See [3], [4] and [5]. The benefits include the ability to provide the commonly used (in development) standalone mode operation, and the elimination of the dependency on an additional OpenStack project, thereby simplifying deployment. Comments and feedback are welcome on the implementation, as well as the specification and the approach.

The long version follows below.

The OpenStack Trove mission is to provide scalable and reliable Cloud Database as a Service provisioning functionality for both relational and non-relational database engines, and to continue to improve its fully-featured and extensible open source framework [1].

An important aspect of the Trove value proposition is that it provides a common control plane, a common API, and a common set of abstractions that are used to manage a number of different relational and non-relational database technologies. The common API contains primitives to create database instances and clusters of a number of databases including MySQL (MariaDB, Percona too), PostgreSQL, MongoDB, Cassandra, CouchDB, Couchbase, IBM DB2, Vertica, and Redis. Cluster support is also available for a number of databases including MongoDB, Percona XtraDB cluster and Vertica, with more to come imminently.

In effect, Trove is a framework for provisioning and managing the lifecycle of a number of different database technologies; it provides only the control plane. Users ca
Re: [openstack-dev] [trove] Adding support for HBase in Trove
Michael, Pete, please see comments interspersed below.

From the things that you and Pete (Peter MacKinnon) are saying, I don't understand why there is an objection to accepting the currently proposed implementation which is clearly for single node deployments? Both Standalone and Pseudo-Distributed are by definition, explicitly, necessarily, absolutely, positively, definitely single node. I can't be more explicit about that. That's all that is being proposed at this time. See more comments below.

Further, the current proposal also chooses an implementation strategy that makes it much easier to handle fully-distributed in a different way in the future. Consider this, Trove could equally well have dealt with HBase using a single datastore for all operating modes. In the current implementation, one would create a HBase standalone instance using a command that included:

--datastore hbase-standalone

And a pseudo-distributed instance by including

--datastore hbase-pseudo-distributed.

Trove could equally well function by having a single datastore (hbase) but this would make hbase-fully-distributed harder to do in a different way in the future. I consciously eschewed that path, for this very specific reason; it would limit choice in the future.

Now, the implementation behind hbase-fully-distributed could be a custom Trove guest agent that could (if we decided to go that route) interact with Sahara. However, an alternative implementation of hbase-fully-distributed could orchestrate everything natively in Trove. There is much flexibility in the current proposal, and I submit to you that this is being lost in your reading of the specification and the current implementation as proposed.
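
For illustration only, the per-mode datastore split described above could be sketched roughly as follows. The two datastore names come from this message; the helper name, the instance name, the flavor id and the volume size are hypothetical, and the positional `trove create <name> <flavor>` form with `--size` is assumed from the python-troveclient command line rather than taken from the proposed patch. The function only assembles the argument list and does not run anything.

```python
# Hypothetical helper: maps an HBase run mode to the per-mode datastore name
# proposed in this thread and assembles an argument list for the trove CLI.
DATASTORE_BY_MODE = {
    "standalone": "hbase-standalone",
    "pseudo-distributed": "hbase-pseudo-distributed",
}


def build_create_command(name, flavor, size_gb, mode):
    """Return the argv for creating a single-node HBase instance."""
    try:
        datastore = DATASTORE_BY_MODE[mode]
    except KeyError:
        # fully-distributed is deliberately not covered by the current proposal
        raise ValueError("unsupported HBase run mode: %s" % mode)
    return [
        "trove", "create", name, str(flavor),
        "--size", str(size_gb),
        "--datastore", datastore,
    ]


if __name__ == "__main__":
    # Instance name, flavor id and size are made-up example values.
    print(" ".join(build_create_command("hbase-dev", 7, 5, "standalone")))
```

Because each run mode is its own datastore, a later hbase-fully-distributed entry could be added to the mapping and backed by an entirely different implementation without touching the two single-node modes.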
-amrith

> -----Original Message-----
> From: michael mccune [mailto:m...@redhat.com]
> Sent: Thursday, January 07, 2016 11:18 AM
> To: openstack-dev@lists.openstack.org
> Subject: Re: [openstack-dev] [trove] Adding support for HBase in Trove
>
> thanks for bringing this up Amrith,
>
> On 01/06/2016 07:31 PM, Fox, Kevin M wrote:
> > Having a simple plugin that doesn't depend on all of Sahara, for the case a user only wants a single node HBase does make sense. It's much easier for an Op to support that case if that's all their users ever want. But, that's probably as far as that plugin ever should go. If you need scale up/down, etc, then you're starting to reimplement large swaths of Sahara, and like the Cinder plugin for Nova, there could be a plugin that works identically to the stand alone one that converts the same api over to a Sahara compatible one. You then farm the work over to Sahara.
>
> i think this sounds reasonable, as long as we are limiting it to standalone mode. if the deployments start to take on a larger scope i agree it would be useful to leverage sahara for provisioning and scaling.

Why only standalone? The current proposal explicitly covers only standalone and pseudo-distributed which are both valid strictly (add other adjectives here to taste) single node topologies and the currently submitted specification specifically carves out fully-distributed operation as requiring further thought and contemplation.

> as the hbase installation grows beyond the standalone mode there will necessarily need to be hdfs and zookeeper support to allow for a proper production deployment. this also brings up questions of allowing the end-users to supply configurations for the hdfs and zookeeper processes, not to mention enabling support for high availability hdfs.

These are things that Trove already addresses, albeit in a different way than Sahara. Users can, as it turns out, specify configuration groups which can then be used to launch new instances, and can also be associated with groups of instances.

> i can envision a scenario where trove could use sahara to provision and manage the clusters for hbase/hdfs/zk. this does pose some questions as we'd have to determine how the trove guest agent would be installed on the nodes, if there will need to be custom configurations used by trove, and if sahara will need to provide a plugin for bare (meaning no data processing framework) hbase/hdfs/zk clusters. but, i think these could be solved by either using custom images or a plugin in sahara that would install the necessary agents/configurations.

Let us not underestimate the effort for an end user to now deploy one more project. To a user already using Trove for a myriad of databases, requiring Sahara for supporting HBase Standalone sounds (to put it bluntly) like a burden. Requiring it for Fully-Distributed mode may have some development benefits but it remains to be seen whether those benefits are really worth the contortions that Trove would have to go through. And in the Trove architecture,
Re: [openstack-dev] [trove] Adding support for HBase in Trove
I don't work on Sahara, but I do work on a similar closed-source project. FWIW, I agree with Kevin here. Standalone and pseudo-distributed HBase are only intended for HBase developers to test code without having to spin up a cluster; they are not meant for operators or users to actually use as a database. HBase is designed to run on HDFS and relies on ZooKeeper for coordination as well. Unless trove is going to re-implement half of Sahara, having it there makes no sense, and will ultimately only lead to confusion among users who see HBase and think they're getting something useful when they are in fact not.

My $0.02
Greg

On 1/7/16, 12:19 PM, "Fox, Kevin M" <kevin@pnnl.gov> wrote:

>Oh. And I'd suggest having this conversation with the Sahara team. They may have some interesting insight into the issue.
>
>Thanks,
>Kevin
>
>From: Fox, Kevin M
>Sent: Thursday, January 07, 2016 9:44 AM
>To: OpenStack Development Mailing List (not for usage questions)
>Subject: Re: [openstack-dev] [trove] Adding support for HBase in Trove
>
>the whole hadoopish stack is unusual though. I suspect users often want to slice and dice all the components that run together on the cluster, where HBase is just one component of the shared cluster. I can totally envision users walking up to my door saying, I provisioned this HBase system with Trove, and now I want to run such and such job on the cluster... Building on top of Sahara enables that kind of thing. If trove wants to do the clustering all itself, then that's either out of the picture, or you end up having to add lots of sahara-like functionality in the end to get its functionality back up to where users will want it.
>
>Thanks,
>Kevin
>
>From: michael mccune [m...@redhat.com]
>Sent: Thursday, January 07, 2016 8:17 AM
>To: openstack-dev@lists.openstack.org
>Subject: Re: [openstack-dev] [trove] Adding support for HBase in Trove
>
>thanks for bringing this up Amrith,
>
>On 01/06/2016 07:31 PM, Fox, Kevin M wrote:
>> Having a simple plugin that doesn't depend on all of Sahara, for the case a user only wants a single node HBase does make sense. It's much easier for an Op to support that case if that's all their users ever want. But, that's probably as far as that plugin ever should go. If you need scale up/down, etc, then you're starting to reimplement large swaths of Sahara, and like the Cinder plugin for Nova, there could be a plugin that works identically to the stand alone one that converts the same api over to a Sahara compatible one. You then farm the work over to Sahara.
>
>i think this sounds reasonable, as long as we are limiting it to standalone mode. if the deployments start to take on a larger scope i agree it would be useful to leverage sahara for provisioning and scaling.
>
>as the hbase installation grows beyond the standalone mode there will necessarily need to be hdfs and zookeeper support to allow for a proper production deployment. this also brings up questions of allowing the end-users to supply configurations for the hdfs and zookeeper processes, not to mention enabling support for high availability hdfs.
>
>i can envision a scenario where trove could use sahara to provision and manage the clusters for hbase/hdfs/zk. this does pose some questions as we'd have to determine how the trove guest agent would be installed on the nodes, if there will need to be custom configurations used by trove, and if sahara will need to provide a plugin for bare (meaning no data processing framework) hbase/hdfs/zk clusters. but, i think these could be solved by either using custom images or a plugin in sahara that would install the necessary agents/configurations.
>
>of course, this does add a layer of complexity as operators who wish this type of deployment will need to have both trove and sahara, but imo this would be easier than replicating the work that sahara has done with these technologies.
>
>regards,
>mike
>
>__
>OpenStack Development Mailing List (not for usage questions)
>Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
>http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [trove] Adding support for HBase in Trove
Oh. And I'd suggest having this conversation with the Sahara team. They may have some interesting insight into the issue.

Thanks,
Kevin

From: Fox, Kevin M
Sent: Thursday, January 07, 2016 9:44 AM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [trove] Adding support for HBase in Trove

the whole hadoopish stack is unusual though. I suspect users often want to slice and dice all the components that run together on the cluster, where HBase is just one component of the shared cluster. I can totally envision users walking up to my door saying, I provisioned this HBase system with Trove, and now I want to run such and such job on the cluster... Building on top of Sahara enables that kind of thing. If trove wants to do the clustering all itself, then that's either out of the picture, or you end up having to add lots of sahara-like functionality in the end to get its functionality back up to where users will want it.

Thanks,
Kevin

From: michael mccune [m...@redhat.com]
Sent: Thursday, January 07, 2016 8:17 AM
To: openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [trove] Adding support for HBase in Trove

thanks for bringing this up Amrith,

On 01/06/2016 07:31 PM, Fox, Kevin M wrote:
> Having a simple plugin that doesn't depend on all of Sahara, for the case a user only wants a single node HBase does make sense. It's much easier for an Op to support that case if that's all their users ever want. But, that's probably as far as that plugin ever should go. If you need scale up/down, etc, then you're starting to reimplement large swaths of Sahara, and like the Cinder plugin for Nova, there could be a plugin that works identically to the stand alone one that converts the same api over to a Sahara compatible one. You then farm the work over to Sahara.

i think this sounds reasonable, as long as we are limiting it to standalone mode. if the deployments start to take on a larger scope i agree it would be useful to leverage sahara for provisioning and scaling.

as the hbase installation grows beyond the standalone mode there will necessarily need to be hdfs and zookeeper support to allow for a proper production deployment. this also brings up questions of allowing the end-users to supply configurations for the hdfs and zookeeper processes, not to mention enabling support for high availability hdfs.

i can envision a scenario where trove could use sahara to provision and manage the clusters for hbase/hdfs/zk. this does pose some questions as we'd have to determine how the trove guest agent would be installed on the nodes, if there will need to be custom configurations used by trove, and if sahara will need to provide a plugin for bare (meaning no data processing framework) hbase/hdfs/zk clusters. but, i think these could be solved by either using custom images or a plugin in sahara that would install the necessary agents/configurations.

of course, this does add a layer of complexity as operators who wish this type of deployment will need to have both trove and sahara, but imo this would be easier than replicating the work that sahara has done with these technologies.

regards,
mike
Re: [openstack-dev] [trove] Adding support for HBase in Trove
> -----Original Message-----
> From: michael mccune [mailto:m...@redhat.com]
> Sent: Thursday, January 07, 2016 3:12 PM
> To: openstack-dev@lists.openstack.org
> Subject: Re: [openstack-dev] [trove] Adding support for HBase in Trove
>
> On 01/07/2016 11:59 AM, Amrith Kumar wrote:
> > From the things that you and Pete (Peter MacKinnon) are saying, I don't understand why there is an objection to accepting the currently proposed implementation which is clearly for single node deployments? Both Standalone and Pseudo-Distributed are by definition, explicitly, necessarily, absolutely, positively, definitely single node. I can't be more explicit about that. That's all that is being proposed at this time. See more comments below.
>
> i didn't think i explicitly objected to the spec, if it seems that way then i apologize. after reading the spec and the comments, it seemed that there was some question about engagement with the sahara team. i wanted to help bring some light to the issues surrounding deploying hbase and thought it would be good to participate in the discussion.

You are correct Michael. There was a suggestion that we should engage with the Sahara team (in the Trove team meeting yesterday) and that is what prompted this email thread. So I appreciate your participation as one who is a member of the Sahara team.

> > Further, the current proposal also chooses an implementation strategy that makes it much easier to handle fully-distributed in a different way in the future. Consider this, Trove could equally well have dealt with HBase using a single datastore for all operating modes. In the current implementation, one would create a HBase standalone instance using a command that included:
> >
> > --datastore hbase-standalone
> >
> > And a pseudo-distributed instance by including
> >
> > --datastore hbase-pseudo-distributed.
>
> and this delineation sounds reasonable to me
>
> > Trove could equally well function by having a single datastore (hbase) but this would make hbase-fully-distributed harder to do in a different way in the future. I consciously eschewed that path, for this very specific reason; it would limit choice in the future.
>
> agreed
>
> > Now, the implementation behind hbase-fully-distributed could be a custom Trove guest agent that could (if we decided to go that route) interact with Sahara. However, an alternative implementation of hbase-fully-distributed could orchestrate everything natively in Trove. There is much flexibility in the current proposal, and I submit to you that this is being lost in your reading of the specification and the current implementation as proposed.
>
> i don't think your characterization of my reading comprehension is fair. as i stated earlier, i wanted to participate in the discussion surrounding deploying a technology that sahara currently deploys. fwiw, i agree with what you are saying here, but i also think it is axiomatic, the trove team can choose whichever path it would like for implementation.
>
> >> i think this sounds reasonable, as long as we are limiting it to standalone mode. if the deployments start to take on a larger scope i agree it would be useful to leverage sahara for provisioning and scaling.
> >
> > Why only standalone? The current proposal explicitly covers only standalone and pseudo-distributed which are both valid strictly (add other adjectives here to taste) single node topologies and the currently submitted specification specifically carves out fully-distributed operation as requiring further thought and contemplation.
>
> i think starting with standalone mode (and not pseudo-distributed) is a more conservative approach to this. my reason for suggesting limiting this to standalone is that even in pseudo-distributed mode the need for managing hdfs and zookeeper is present, i wanted to highlight some of the overlap and the issues that will start to creep in surrounding this deployment.

The current code (submitted for review) provides both standalone and pseudo-distributed support. You will observe that the standalone and pseudo-distributed implementations do install zookeeper. As you are no doubt aware, one of the recommended ways to force the HBase Master server to always bind to a well-known port, rather than to ephemeral ports, is to stipulate hbase.cluster.distributed is True (see https://review.openstack.org/#/c/262048/5/scripts/files/elements/ubuntu-hbase-standalone/install.d/20-install-hbase line 121). So, as it turns out, the code to deploy hdfs and zookeeper is already part of the proposed implementation.

> >> as
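
As a rough illustration of the hbase.cluster.distributed setting mentioned above, the sketch below renders an hbase-site.xml fragment from a dictionary of properties. The property names hbase.cluster.distributed and hbase.rootdir are standard HBase settings; the rootdir path and the helper name are assumptions made for this example, and nothing here is copied from the install script under review.

```python
import xml.etree.ElementTree as ET


def render_hbase_site(properties):
    """Render a dict of HBase properties as hbase-site.xml text."""
    root = ET.Element("configuration")
    for name, value in sorted(properties.items()):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Assumed values for illustration; the rootdir path is hypothetical.
SINGLE_NODE_PROPERTIES = {
    "hbase.cluster.distributed": "true",
    "hbase.rootdir": "file:///var/lib/hbase/data",
}

if __name__ == "__main__":
    print(render_hbase_site(SINGLE_NODE_PROPERTIES))
```

A guest image element could write text like this into the HBase configuration directory at build time; whether the proposed elements do it this way is not stated in the thread.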
[openstack-dev] [trove] Adding support for HBase in Trove
TL;DR

Should Trove treat HBase as a special database because one use case is as part of a large multi-node Hadoop cluster, and therefore either not support it at all, or necessarily use Sahara to provision and manage a cluster? There are pros and cons, and it is argued that the cons outweigh the pros; a blueprint/specification and an implementation for basic Trove support for HBase independent of Sahara have been submitted for review. See [3], [4] and [5]. The benefits include the ability to provide the commonly used (in development) standalone mode operation, and the elimination of the dependency on an additional OpenStack project, thereby simplifying deployment. Comments and feedback are welcome on the implementation, as well as the specification and the approach.

The long version follows below.

The OpenStack Trove mission is to provide scalable and reliable Cloud Database as a Service provisioning functionality for both relational and non-relational database engines, and to continue to improve its fully-featured and extensible open source framework [1].

An important aspect of the Trove value proposition is that it provides a common control plane, a common API, and a common set of abstractions that are used to manage a number of different relational and non-relational database technologies. The common API contains primitives to create database instances and clusters of a number of databases including MySQL (MariaDB, Percona too), PostgreSQL, MongoDB, Cassandra, CouchDB, Couchbase, IBM DB2, Vertica, and Redis. Cluster support is also available for a number of databases including MongoDB, Percona XtraDB cluster and Vertica, with more to come imminently.

In effect, Trove is a framework for provisioning and managing the lifecycle of a number of different database technologies; it provides only the control plane. Users can do things like provisioning instances and clusters, resizing them, taking backups and creating new instances and clusters from previous backups, and establishing and managing complex topologies including replication and clustering. Trove does not interfere with the data plane; the applications interact directly with the database using the native APIs for each database technology.

Users of OpenStack look to Trove to provide a consistent set of interfaces for managing their database resources in a variety of use-cases ranging from small-scale prototyping, development, testing, and all the way through production.

Apache HBase is an open-source, distributed, versioned, non-relational database [2] and users of HBase face many of the challenges that Trove addresses for other databases. Therefore adding support for HBase in Trove seems not only reasonable, but also consistent with the goal of the (Trove) project.

A spec proposing the addition of HBase support for Trove was submitted [3] and a first phase of code implementing this HBase support has also been submitted for review [4], [5]. The process that has been followed is consistent with other Trove datastores; add basic support and then progressively augment it in subsequent releases. The code submitted allows you to provision an HBase instance (which will launch on a Nova instance), build an HBase guest image using the elements provided, resize the storage and the instance, take a "backup" of the instance and store that backup on Swift, and at a later time you can launch a new instance from that "backup".

One can operate HBase with or without HDFS; in fact HBase documents the standalone mode of operation [6] where HBase is completely operational on a single node and data is stored on the local file system. This standalone mode provides a very useful construct for development and testing, and at a later stage an application can be seamlessly migrated to work with an HBase installation of some other "run mode" like "Fully Distributed".

Code submitted in [4] and [5] as described in [3] implements support for two modes of operation, namely "Standalone" and "Pseudo-Distributed". At a later stage, support will be added for "Fully Distributed" consistent with the way in which clustering support was delivered for other datastores like MySQL and MongoDB.

Some have opined that Trove should not directly get into the business of orchestrating Hadoop Clusters or anything to do with HBase, arguing that this is something that Sahara already does, and should remain the sole domain of Sahara. I believe that since HBase is perfectly operable without HDFS, it seems inappropriate to tightly couple HBase with Sahara whose primary motivation is to provision 'data-intensive application clusters' [7]. Furthermore, as we have found with other datastores, it is my belief that having a common implementation model across multiple deployment topologies is a benefit for Trove. Other considerations such as similarity to other databases supported by Trove motivated a choice as
Re: [openstack-dev] [trove] Adding support for HBase in Trove
Just my 2 cents... I think you can do both. The great thing about Trove is that it's providing an abstract API so users just deal with provisioning DBs, scaling DBs, etc.

Having a simple plugin that doesn't depend on all of Sahara, for the case where a user only wants a single-node HBase, does make sense. It's much easier for an op to support that case if that's all their users ever want. But that's probably as far as that plugin should ever go. If you need scale up/down, etc., then you're starting to reimplement large swaths of Sahara. Like the Cinder plugin for Nova, there could be a plugin that works identically to the standalone one but converts the same API over to a Sahara-compatible one. You then farm the work over to Sahara.

Then it's up to the ops to choose features and the overhead of supporting Sahara, or not, and you don't have to support implementing a whole cluster management system for Trove that already exists.

Thanks,
Kevin

From: Amrith Kumar [amr...@tesora.com]
Sent: Wednesday, January 06, 2016 3:15 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: [openstack-dev] [trove] Adding support for HBase in Trove

TL;DR Should Trove treat HBase as a special database because one use case is as part of a large multi-node Hadoop cluster, and therefore either not support it at all, or necessarily use Sahara to provision and manage a cluster? There are pros and cons, and it is argued that the cons outweigh the pros; a blueprint/specification and an implementation for basic Trove support for HBase independent of Sahara have been submitted for review. See [3], [4] and [5]. The benefits include the ability to provide the commonly used (in development) standalone mode of operation, and to eliminate the dependency on an additional OpenStack project, thereby simplifying deployment. Comments and feedback are welcome on the implementation, as well as the specification and the approach.

The long version follows below.
The OpenStack Trove mission is to provide scalable and reliable Cloud Database as a Service provisioning functionality for both relational and non-relational database engines, and to continue to improve its fully-featured and extensible open source framework [1].

An important aspect of the Trove value proposition is that it provides a common control plane, a common API, and a common set of abstractions that are used to manage a number of different relational and non-relational database technologies. The common API contains primitives to create database instances and clusters of a number of databases including MySQL (MariaDB, Percona too), PostgreSQL, MongoDB, Cassandra, CouchDB, Couchbase, IBM DB2, Vertica, and Redis.

Cluster support is also available for a number of databases including MongoDB, Percona XtraDB Cluster and Vertica, with more to come imminently.

In effect, Trove is a framework for provisioning and managing the lifecycle of a number of different database technologies; it provides only the control plane. Users can provision instances and clusters, resize them, take backups and create new instances and clusters from previous backups, and establish and manage complex topologies including replication and clustering.

Trove does not interfere with the data plane; applications interact directly with the database using the native APIs for each database technology.

Users of OpenStack look to Trove to provide a consistent set of interfaces for managing their database resources in a variety of use-cases ranging from small-scale prototyping, development, and testing all the way through production.

Apache HBase is an open-source, distributed, versioned, non-relational database [2] and users of HBase face many of the challenges that Trove addresses for other databases. Therefore, adding support for HBase in Trove seems not only reasonable, but also consistent with the goal of the (Trove) project.
A spec proposing the addition of HBase support for Trove was submitted [3] and a first phase of code implementing this HBase support has also been submitted for review [4], [5]. The process that has been followed is consistent with other Trove datastores; add basic support and then progressively augment it in subsequent releases. The code submitted allows you to provision an HBase instance (which will launch on a Nova instance), build an HBase guest image using the elements provided, resize the storage and the instance, take a "backup" of the instance and store that backup on Swift, and at a later time you can launch a new instance from that "backup". One can operate HBase with or without HDFS; in fact HBase documents the standalone mode of operation [6] where HBase is completely operational on a single node and data is stored on the local file system. This standalone mode provides a very useful const
Re: [openstack-dev] [trove] Adding support for HBase in Trove
Kevin Fox writes:

> as far as that plugin ever should go. If you need scale up/down, etc, then
> your starting to reimplement large swaths of Sahara, and like the Cinder
> plugin for Nova, there could be a plugin that works identically to the stand
> alone one that converts the same api over to a Sahara compatible one. You
> then farm the work over to Sahara.

I believe that this is not the case. The entire framework for integration with Cinder, Nova, etc. already exists in Trove. Recall that Trove already deals with about a dozen databases, several of which have support for clusters. The code to add HBase support to Trove doesn't have to implement all of this framework that already exists. All that is being implemented is (literally) a Trove 'plugin' for HBase and a mechanism to build an HBase guest image.

-amrith

> -Original Message-
> From: Fox, Kevin M [mailto:kevin@pnnl.gov]
> Sent: Wednesday, January 06, 2016 7:32 PM
> To: OpenStack Development Mailing List (not for usage questions)
> <openstack-dev@lists.openstack.org>
> Subject: Re: [openstack-dev] [trove] Adding support for HBase in Trove
>
> just my 2 cents... I think you can do both. The great thing about Trove is that
> its providing an abstract api so users just deal with provisioning db's, scaling
> db's, etc.
>
> Having a simple plugin that doesn't depend on all of Sahara, for the case a
> user only wants a single node HBase does make sense. Its much easier for an
> Op to support that case if thats all their users ever want. But, thats probably
> as far as that plugin ever should go. If you need scale up/down, etc, then
> your starting to reimplement large swaths of Sahara, and like the Cinder
> plugin for Nova, there could be a plugin that works identically to the stand
> alone one that converts the same api over to a Sahara compatible one. You
> then farm the work over to Sahara.
>
> Then, its up to the ops to choose features and the overhead of supporting
> Sahara, or not, and you don't have to support implementing a whole cluster
> management system for Trove that already exists.
>
> Thanks,
> Kevin
>
> From: Amrith Kumar [amr...@tesora.com]
> Sent: Wednesday, January 06, 2016 3:15 PM
> To: OpenStack Development Mailing List (not for usage questions)
> Subject: [openstack-dev] [trove] Adding support for HBase in Trove
>
> TL;DR Should Trove treat HBase as a special database because one use case is
> as part of a large multi-node Hadoop cluster, and therefore either not
> support it at all, or necessarily use Sahara to provision and manage a cluster?
> There are pro's and con's and it is argued that the con's outweigh the pro's
> and a blueprint/specification, and an implementation for basic Trove support
> for HBase independent of Sahara has been submitted for review. See [3], [4]
> and [5]. The benefits include the ability to provide the commonly used (in
> development) standalone mode operation, and eliminate the dependency
> on an additional OpenStack project thereby simplifying deployment.
> Comments and feedback are welcome on the implementation, as well as the
> specification and the approach.
>
> The long version follows below.
>
> The OpenStack Trove mission is to provide scalable and reliable Cloud
> Database as a Service provisioning functionality for both relational and
> non-relational database engines, and to continue to improve its fully-featured
> and extensible open source framework [1].
>
> An important aspect of the Trove value proposition is that it provides a
> common control plane, a common API, and a common set of abstractions are
> used to manage a number of different relational, and non-relational
> database technologies.
> The common API contains primitives to create
> database instances and clusters of a number of databases including MySQL
> (MariaDB, Percona too), PostgreSQL, MongoDB, Cassandra, CouchDB,
> Couchbase, IBM DB2, Vertica, and Redis.
>
> Cluster support is also available for a number of databases including
> MongoDB, Percona XtraDB cluster and Vertica, with more to come
> imminently.
>
> In effect, Trove is a framework for provisioning and managing the lifecycle of
> a number of different database technologies; it provides only the control
> plane. Users can do things like provisioning instances and clusters, resizing
> them, taking backups and creating new instances and clusters from previous
> backups, establish and manage complex topologies including replication and
> clustering, and resize instances and clusters.
>
> Trove does not interfere with the data plane, the applications interact directly
> with the database using the native API's for each database technol