I guess the old adage is true: if you only have a hammer, then every problem looks like a nail. As an architect, it's your role to find the right tools to solve the problem in the most efficient and effective manner. So the first question you need to ask is whether HBase is the right tool.
The OP's project isn't one that should be put into HBase. Velocity? Volume? Variety? These are the three aspects of Big Data, and they can also be used to test whether a problem should be solved using HBase. You don't need all three, but a good candidate should have at least two of the three. The other thing to consider is how you plan on using the data. If you're not using M/R or HDFS, then you don't want to use HBase in production. And as a good architect, you want to take the inverse of the problem and ask why not a relational database, or an existing hierarchical database. (Both technologies have been around 30+ years.) And it turns out that you can use one.

The OP's problem lacks the volume. It also lacks the variety. So if we ask the simple question of how to use an RDBMS to handle this... it's pretty straightforward. Store the medical record(s) in either XML or JSON format. On ingestion, copy out only the fields required to identify a unique record. That's your base record storage. Indexing could be done one of two ways: 1) you could use an inverted table, or 2) you could copy out the field to be used in the index as a column and then index that column. If you use an inverted table, your schema design would translate into HBase. Then when you access the data, you use the index to find the result set, and for each record you have the JSON object that you can use as a whole or just by components. The pattern of storing the record in a single column as a text LOB and then creating indexes to identify and locate the records isn't new; I used it at a client over 15 years ago for an ODS implementation.

In terms of HBase... stability depends on the hardware, the admin, and the use cases. It's still relatively unstable, in most cases nowhere near four nines. Consider that there are also regulatory compliance issues, e.g. security. This alone will rule HBase out in a standalone situation, and again, even with Kerberos implemented, you may not meet your security requirements.
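The LOB-plus-index pattern above can be sketched with any embedded RDBMS. The following is a minimal illustration, not the original ODS implementation; SQLite is used only for convenience, and all table and field names (`patient_id`, `diagnosis`, etc.) are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Base record storage: the whole medical record as a JSON text LOB, with
# only the field needed to identify a unique record copied out on ingestion.
conn.execute("""
    CREATE TABLE records (
        record_id   INTEGER PRIMARY KEY,
        patient_id  TEXT NOT NULL,     -- option 2: copied-out, indexed column
        record_json TEXT NOT NULL      -- the full record as a text LOB
    )
""")
conn.execute("CREATE INDEX idx_patient ON records (patient_id)")

# Option 1: an inverted table mapping an indexed field value back to record
# ids -- the shape that would translate into an HBase schema design.
conn.execute("CREATE TABLE patient_inverted (patient_id TEXT, record_id INTEGER)")

def ingest(record: dict) -> int:
    """Store the JSON LOB and copy out the identifying field on ingestion."""
    cur = conn.execute(
        "INSERT INTO records (patient_id, record_json) VALUES (?, ?)",
        (record["patient_id"], json.dumps(record)),
    )
    conn.execute(
        "INSERT INTO patient_inverted VALUES (?, ?)",
        (record["patient_id"], cur.lastrowid),
    )
    return cur.lastrowid

ingest({"patient_id": "P-001", "date": "2015-03-13", "diagnosis": "flu"})

# Access path: use the index to find the result set, then work with the
# JSON object as a whole or pull out individual components.
row = conn.execute(
    "SELECT record_json FROM records WHERE patient_id = ?", ("P-001",)
).fetchone()
print(json.loads(row[0])["diagnosis"])  # prints: flu
```

The same access pattern holds for either indexing choice: the index narrows the result set, and the JSON column carries the record itself.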
Bottom line, the OP is going to do what he's going to do. All I can do is tell him it's not a good idea, and why. This email thread is great column fodder for a blog, as well as for a presentation on why/why not HBase and Hadoop. It's something that should be included in a design lecture or lectures, but unfortunately most of the larger conferences are driven by vendors who have their own agendas and slots that they want to fill with marketing talks.

BTW, I am really curious: if the OP is using a standalone instance of HBase, how does the immature HDFS encryption help secure his data? ;-)

HTH

-Mike

> On Mar 13, 2015, at 3:44 PM, Sean Busbey <bus...@cloudera.com> wrote:
>
> On Fri, Mar 13, 2015 at 2:41 PM, Michael Segel <michael_se...@hotmail.com>
> wrote:
>
>>
>> In stand-alone, you're writing to local disk. You lose the disk, you lose
>> the data, unless of course you've RAIDed your drives.
>> Then when you lose the node, you lose the data because it's not being
>> replicated. While this may not be a major issue or concern, you have to be
>> aware of its potential.
>>
>
> It sounds like he has this issue covered via VM imaging.
>
>
>> The other issue when it comes to security: HBase relies on the cluster's
>> security.
>> To be clear, HBase relies on the cluster and the use of Kerberos to help
>> with authentication, so that only those who have the rights to see the
>> data can actually have access to it.
>>
>
> He can get around this by relying on the Thrift or REST services to act as
> an arbitrator, or he could make his own. So long as he separates access to
> the underlying cluster / HBase APIs from whatever exposes the data,
> this shouldn't be a problem.
>
>
>> Then you have to worry about auditing. With respect to HBase, out of the
>> box, you don't have any auditing.
>>
>
> HBase has auditing. By default it is disabled, and it certainly could use
> some improvement. Documentation would be a good start.
> I'm sure the
> community would be happy to work with Joseph to close whatever gap he needs.
>
>
>> You also don't have built-in encryption.
>> You can do it, but then you have a bit of work ahead of you.
>> Cell-level encryption? Accumulo?
>>
>
> HBase has had encryption since the 0.98 line. It is stable now in the
> 1.0 release line. HDFS also supports encryption, though I'm sure using it
> with the LocalFileSystem would benefit from testing. There are vendors that
> can help with integration with proper key servers, if that is something
> Joseph needs and doesn't want to do on his own.
>
> Accumulo does not do cell-level encryption.
>
>
>> There's definitely more to it.
>>
>> But the one killer thing: you need to be HIPAA compliant, and the simplest
>> way to do this is to use a real RDBMS. If you need extensibility, look at
>> IDS from IBM (IBM bought Informix ages ago.)
>>
>> I think based on the size of your data you can get away with the free
>> version, and even if not, IBM does do discounts with universities and could
>> even sponsor research projects.
>>
>> I don't know your data, but 10^6 rows is still small.
>>
>> The point I'm trying to make is that based on what you've said, HBase is
>> definitely not the right database for you.
>>
>
> We haven't heard what the target data set size is. If Joseph has reason to
> believe that it will be big enough to warrant something like HBase (e.g.
> 10s of billions of rows), I think there's merit to his argument for
> starting with HBase. Single-node use cases are definitely not something
> we've covered well to date, but it would probably help our overall
> usability story to do so.
>
> --
> Sean

The opinions expressed here are mine; while they may reflect a cognitive thought, that is purely accidental. Use at your own risk.

Michael Segel
michael_segel (AT) hotmail.com