Chuck,

With regard to "the strategy to handle user queries against the corpus," we have implemented the secondary indexing concepts of D4M <http://www.mit.edu/~kepner/pubs/D4Mschema_HPEC2013_Paper.pdf> and have introduced distributed SQL queries on top of Accumulo. Here is a presentation from the recent Accumulo Summit: http://www.slideshare.net/AccumuloSummit/14-25-navruzyan
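
For a rough flavor of the schema, here is a minimal sketch of the D4M-style "table plus transpose" indexing idea (illustration only, not our production code; the table names follow the D4M paper's conventions and the Accumulo 1.x client calls are assumed, and the SQL layer is not shown):

    // Sketch of a D4M-style forward table + transpose index in Accumulo.
    // Assumes an Accumulo 1.x Connector and pre-created tables Tedge / TedgeT.
    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;

    public class D4mIndexSketch {
      static final Value ONE = new Value("1".getBytes());

      // Write one record field both ways:
      //   Tedge:  row = recordId,     column = field|value   (lookup by record)
      //   TedgeT: row = field|value,  column = recordId      (lookup by value)
      public static void index(Connector conn, String recordId,
                               String field, String fieldValue) throws Exception {
        BatchWriter edge  = conn.createBatchWriter("Tedge",  new BatchWriterConfig());
        BatchWriter edgeT = conn.createBatchWriter("TedgeT", new BatchWriterConfig());
        String col = field + "|" + fieldValue;   // D4M "exploded" column key

        Mutation m = new Mutation(recordId);
        m.put("", col, ONE);
        edge.addMutation(m);

        Mutation mt = new Mutation(col);
        mt.put("", recordId, ONE);
        edgeT.addMutation(mt);

        edge.close();
        edgeT.close();
      }
    }

In practice the writers would be long-lived (or a MultiTableBatchWriter), but the transpose table is what makes lookups by value fast without scanning the whole corpus.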
Jeremy's research and benchmarks were instrumental to our efforts.

Arshak Navruzyan | VP Product | Argyle Data, Inc.

On Thu, Jul 10, 2014 at 8:33 AM, Chuck Adams <[email protected]> wrote:

> Dr. Kepner,
>
> Who cares how fast you can load data into a non-indexed HBase or Accumulo
> database? What is the strategy to handle user queries against this corpus?
> Run some real-world tests which include simultaneous queries, index
> maintenance, and information life cycle management during the initial and
> incremental data loads.
>
> The tests you are running do not appear to have any of the real-world
> bottlenecks that occur for production systems that users rely on for
> their business.
>
> Respectfully,
> Chuck Adams
>
> Vice President, Technical Leadership Team
> Oracle National Security Group
> 1910 Oracle Way
> Reston, VA 20190
>
> Cell: 301.529.9396
> Email: [email protected]
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> -----Original Message-----
> From: Jeremy Kepner [mailto:[email protected]]
> Sent: Thursday, July 10, 2014 11:08 AM
> To: [email protected]
> Subject: Re: How does Accumulo compare to HBase
>
> If you repeated the experiments you did for the Accumulo-only portion
> with HBase, what would you expect the results to be?
>
> On Thu, Jul 10, 2014 at 10:02:00AM -0500, Mike Drob wrote:
> > At the risk of derailing the original thread, I'll take a moment to
> > explain my methodology. Since each entry can be thought of as a
> > 6-dimensional vector (r, cf, cq, vis, ts, val), there's a lot of room
> > for fiddling with the specifics. YCSB gives you several knobs, but
> > unfortunately, not absolutely everything was tunable.
> >
> > The things that are configurable:
> > - # of rows
> > - # of column qualifiers
> > - length of value
> > - number of operations per client
> > - number of clients
> >
> > Things that are not configurable:
> > - row length (it's a long int)
> > - # of column families (constant at one per row)
> > - length of column qualifier (basically a one-up counter per row)
> > - visibilities
> >
> > In all of my experiments, the goal was to keep data size constant.
> > This can be approximated by (number of entries * entry size). Number
> > of entries is intuitively rows (configurable) * column families (1) *
> > column qualifiers per family (configurable), while entry size is key
> > overhead (about 40 bytes) + configured length of value. So to keep
> > total size constant, we have three easy knobs. However, tweaking
> > three values at a time produces really messy data where you're not
> > always going to be sure where the causality arrow lies. Even doing
> > two at a time can cause issues, but then the choice is between
> > tweaking two properties of the data, or one property of the data and
> > the total size (which is also a relevant attribute).
> >
> > Whew. So why did I use two different independent variables between
> > the two halves?
> >
> > Partly because I'm not comparing the two tests to each other, so they
> > don't have to be duplicative. I ran them on different hardware, with
> > different numbers of clients, disks, cores, etc. There are no
> > meaningful comparisons to be drawn, so I wanted to remove the
> > temptation to compare results against each other. I'll admit that I
> > might be wrong in this regard.
> >
> > The graphs are not my complete data sets. For the Accumulo v. Accumulo
> > tests, we have about ten more data points varying rows and data size
> > as well.
> > Trying to show three independent variables on a graph was pretty
> > painful, so they didn't make it into the presentation. The short
> > version of the story is that nothing scaled linearly (some things
> > were better, some things were worse), but the general trend lines
> > were approximately what you would expect.
> >
> > Let me know if you have more questions, but we can probably start a
> > new thread for future search posterity! (This applies to everybody.)
> >
> > Mike
> >
> > On Thu, Jul 10, 2014 at 9:26 AM, Kepner, Jeremy - 0553 - MITLL <
> > [email protected]> wrote:
> >
> > > Mike Drob put together a great talk at the Accumulo Summit
> > > (http://www.slideshare.net/AccumuloSummit/10-30-drob) discussing
> > > Accumulo performance and HBase performance. This is exactly the
> > > kind of work the entire Hadoop community needs in order to continue
> > > to move forward.
> > >
> > > I had one question about the talk which I was wondering if someone
> > > might be able to shed light on. In the Accumulo part of the talk
> > > the experiments varied #rows while keeping #cols fixed, while in
> > > the Accumulo/HBase part of the talk the experiments varied #cols
> > > while keeping #rows fixed?
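
For readers following Mike's sizing arithmetic quoted above, a minimal worked sketch of the calculation (the specific row counts and value lengths below are hypothetical; only the roughly 40-byte key overhead and the one-column-family-per-row constraint come from the thread):

    // Back-of-the-envelope data-size arithmetic from the quoted discussion:
    //   total bytes ~= entries * entry size, where
    //   entries    = rows * columnFamilies(1) * columnQualifiersPerRow
    //   entry size = key overhead (~40 bytes) + value length
    public class SizingSketch {
      static final long KEY_OVERHEAD_BYTES = 40;   // approximate, per the thread

      static long totalBytes(long rows, long cqPerRow, long valueLen) {
        long entries = rows * 1 * cqPerRow;        // one column family per row
        return entries * (KEY_OVERHEAD_BYTES + valueLen);
      }

      public static void main(String[] args) {
        // Two hypothetical configurations that keep total size constant
        // while trading #rows against value length (two of the "three easy knobs").
        System.out.println(totalBytes(1_000_000, 10, 100));  // 1,400,000,000 bytes
        System.out.println(totalBytes(2_000_000, 10,  30));  // 1,400,000,000 bytes
      }
    }

Holding this product constant while moving one knob at a time is exactly the trade-off Mike describes; moving two or three knobs at once makes it hard to attribute any change in throughput to a single variable.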
