Michael,

Thanks for your concern. Let me ask a few questions, since you’re implying
that HDFS is the only way to reduce risk and ensure security, which is not
the assumption I’ve been working under.
A brief rundown of our problem’s characteristics, since I haven’t really
described what we’re doing:

* We’re read-heavy, write-light. We’ll likely do one large import of the
  data and then update less than 0.1% of it per day.
* The dataset isn’t huge at the moment (it will likely become huge in the
  future). If I were to go the RDBMS route, I’d guess it could all fit on a
  dual-core i5 machine with 2 GB of memory and a quarter-terabyte disk, and
  even that might be over-specced. What we’re doing is functional and solves
  a real problem, but it is also a prototype for a much larger dataset.
* We do need security, you’re absolutely right: the data is subject to HIPAA.
* Availability should be good but we don’t have to go overboard. A couple of
  nines would be just fine.
* We plan on running this on a fairly small VM, backed up nightly.

So, with that in mind, let me make sure I’ve got this right. Your main
points were data loss and security.

As I understand it, HDFS might be the right choice at the dozens-of-terabytes
to petabyte scale, where a clean backup becomes effectively impossible: the
odds of an undetected, hardware-level error during replication are not
insignificant, even if you can find enough space. But we’re talking gigabytes,
which are easily and reliably replicated (I do it on my home machine all the
time). And since HBase’s on-disk files are stable once mutations have been
committed, disabling mutations, taking a backup, and re-enabling mutations
seems like a fine choice. Do you see a hole in this approach?

As for security, as I understand it, HBase’s security model (both cell
tagging and encryption) is built into the database layer, not HDFS. We very
much want cell-level security with roles (because HIPAA) and encryption
(also because HIPAA), but I don’t think either has anything to do with the
underlying filesystem. Again, is there something here I’ve missed?

When we get to 10^6+ rows we will probably build out a small cluster.
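To make the backup idea concrete, here is a minimal Java sketch of the
quiesce/copy/resume cycle against a plain local directory. It illustrates the
procedure only, not the HBase API: the class, method names, and paths are all
made up, and in practice the quiesce step would be HBase’s own disable/snapshot
machinery rather than a flag.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.concurrent.atomic.AtomicBoolean;

/** Sketch of quiesce -> copy -> resume for a single-node store whose
 *  files live under one directory. Illustrative only, not HBase API. */
public class QuiesceBackup {
    private final AtomicBoolean mutationsEnabled = new AtomicBoolean(true);

    /** Write one "row" as a file; refused while a backup is running. */
    public void put(Path dataDir, String row, String value) throws IOException {
        if (!mutationsEnabled.get()) {
            throw new IllegalStateException("mutations disabled for backup");
        }
        Files.write(dataDir.resolve(row), value.getBytes("UTF-8"));
    }

    /** Disable writes, copy the now-stable files, then re-enable writes. */
    public void backup(Path dataDir, Path backupDir) throws IOException {
        mutationsEnabled.set(false);            // quiesce: no new edits
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dataDir)) {
            Files.createDirectories(backupDir);
            for (Path f : files) {
                Files.copy(f, backupDir.resolve(f.getFileName()),
                           StandardCopyOption.REPLACE_EXISTING);
            }
        } finally {
            mutationsEnabled.set(true);         // always resume mutations
        }
    }

    public static void main(String[] args) throws IOException {
        Path data = Files.createTempDirectory("data");
        Path bak = Files.createTempDirectory("bak");
        QuiesceBackup store = new QuiesceBackup();
        store.put(data, "row1", "value1");
        store.backup(data, bak);
        System.out.println(new String(Files.readAllBytes(bak.resolve("row1")),
                                      "UTF-8"));
    }
}
```

The point is only the ordering: stop mutations, copy a stable set of files,
and re-enable mutations in a finally block so a failed copy can’t leave the
store read-only.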
We’re well below that threshold at the moment but will get there soon enough.

-j

On 3/13/15, 1:46 PM, "Michael Segel" <[email protected]> wrote:

>Guys,
>
>More than just needing some love.
>No HDFS… means data at risk.
>No HDFS… means that stand alone will have security issues.
>
>Patient Data? HINT: HIPAA.
>
>Please think your design through and if you go w/ HBase… you will want to
>build out a small cluster.
>
>> On Mar 10, 2015, at 6:16 PM, Nick Dimiduk <[email protected]> wrote:
>>
>> As Stack and Andrew said, just wanted to give you fair warning that this
>> mode may need some love. Likewise, there are probably alternatives that
>> run a bit lighter weight, though you flatter us with the reminder of the
>> long feature list.
>>
>> I have no problem with helping to fix and committing fixes to bugs that
>> crop up in local mode operations. Bring 'em on!
>>
>> -n
>>
>> On Tue, Mar 10, 2015 at 3:56 PM, Alex Baranau <[email protected]>
>> wrote:
>>
>>> On:
>>>
>>> - Future investment in a design that scales better
>>>
>>> Indeed, designing against a key-value store is different from designing
>>> against an RDBMS.
>>>
>>> I wonder if you explored an option to abstract the storage layer and
>>> use a "single node purposed" store until you grow enough to switch to
>>> another one?
>>>
>>> E.g. you could use LevelDB [1], which is pretty fast (and there's a
>>> Java rewrite of it, if you need Java APIs [2]). We use it in CDAP [3]
>>> in the standalone version to make the development environment (SDK)
>>> lighter. We swap it with HBase in distributed mode without changing the
>>> application code. It doesn't have coprocessors and other HBase-specific
>>> features you are talking about, though. But you can figure out how to
>>> bridge client APIs with an abstraction layer (e.g. we have a common
>>> Table interface [4]). You can even add versions on cells (see [5] for
>>> an example of how we do it).
>>>
>>> Also, you could use an RDBMS behind the key-value abstraction to start
>>> with, while keeping your app design clean of RDBMS specifics.
>>>
>>> Alex Baranau
>>>
>>> [1] https://github.com/google/leveldb
>>> [2] https://github.com/dain/leveldb
>>> [3] http://cdap.io
>>> [4] https://github.com/caskdata/cdap/blob/develop/cdap-api/src/main/java/co/cask/cdap/api/dataset/table/Table.java
>>> [5] https://github.com/caskdata/cdap/blob/develop/cdap-data-fabric/src/main/java/co/cask/cdap/data2/dataset2/lib/table/leveldb/LevelDBTableCore.java
>>>
>>> --
>>> http://cdap.io - open source framework to build and run data
>>> applications on Hadoop & HBase
>>>
>>> On Tue, Mar 10, 2015 at 8:42 AM, Rose, Joseph <
>>> [email protected]> wrote:
>>>
>>>> Sorry, I never answered your question about versions. I have the 1.0.0
>>>> version of HBase, which has hadoop-common 2.5.1 in its lib folder.
>>>>
>>>> -j
>>>>
>>>> On 3/10/15, 11:36 AM, "Rose, Joseph" <[email protected]>
>>>> wrote:
>>>>
>>>>> I tried it and it does work now. It looks like the interface for
>>>>> hadoop.fs.Syncable changed in March 2012 to remove the deprecated
>>>>> sync() method and define only hsync() instead. The same committer did
>>>>> the right thing and removed sync() from FSDataOutputStream at the
>>>>> same time. The remaining hsync() method calls flush() if the
>>>>> underlying stream doesn't implement Syncable.
>>>>>
>>>>> -j
>>>>>
>>>>> On 3/6/15, 5:24 PM, "Stack" <[email protected]> wrote:
>>>>>
>>>>>> On Fri, Mar 6, 2015 at 1:50 PM, Rose, Joseph <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> I think the final issue with hadoop-common (re: unimplemented sync
>>>>>>> for local filesystems) is the one showstopper for us. We have to
>>>>>>> have assured durability. I'm willing to devote some cycles to get
>>>>>>> it done, so maybe I'm the one that says this problem is worthwhile.
>>>>>>>
>>>>>> I remember that was once the case, but looking in the codebase now,
>>>>>> sync calls through to ProtobufLogWriter which does a 'flush' on
>>>>>> output (though the comment says this is a noop). The output stream
>>>>>> is an instance of FSDataOutputStream made with a RawLOS.
>>>>>> The flush should come out here:
>>>>>>
>>>>>> 220 public void flush() throws IOException { fos.flush(); }
>>>>>>
>>>>>> ... where fos is an instance of FileOutputStream.
>>>>>>
>>>>>> In sync we go on to call hflush, which looks like it calls flush
>>>>>> again.
>>>>>>
>>>>>> What hadoop/hbase versions are we talking about? HADOOP-8861 added
>>>>>> the above behavior for hadoop 1.2.
>>>>>>
>>>>>> Try it, I'd say.
>>>>>>
>>>>>> St.Ack
>
>The opinions expressed here are mine; while they may reflect a cognitive
>thought, that is purely accidental.
>Use at your own risk.
>Michael Segel
>michael_segel (AT) hotmail.com
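For anyone following the flush()-vs-sync() part of the thread above, here is
a small self-contained Java demo of the distinction being discussed: flush()
only moves bytes out of user-space buffers (and is a no-op on a bare
FileOutputStream, which has no such buffer), while FileDescriptor.sync() is
the actual fsync-style durability barrier. The class and file names are
illustrative, not anything from the Hadoop codebase.

```java
import java.io.*;

/** Demo: flush() vs. FileDescriptor.sync() for durability. */
public class FlushVsSync {
    /** Write one record, force it to disk, and read it back. */
    public static String writeDurably(File f, String record) throws IOException {
        try (FileOutputStream fos = new FileOutputStream(f)) {
            fos.write(record.getBytes("UTF-8"));
            fos.flush();          // no-op here: FileOutputStream is unbuffered
            fos.getFD().sync();   // the real durability barrier (fsync)
        }
        try (BufferedReader r = new BufferedReader(new FileReader(f))) {
            return r.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("wal-demo", ".log");
        f.deleteOnExit();
        System.out.println(writeDurably(f, "edit-1"));
    }
}
```

A write-ahead log that stops at flush() can lose acknowledged edits on power
failure; only the sync() step asks the kernel to push the bytes to stable
storage, which is why the hsync()-falls-back-to-flush() behavior matters for
assured durability.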
