If you repeated the experiments you did for the Accumulo-only portion with HBase, what would you expect the results to be?

On Thu, Jul 10, 2014 at 10:02:00AM -0500, Mike Drob wrote:
> At the risk of derailing the original thread, I'll take a moment to explain
> my methodology. Since each entry can be thought of as a 6-dimensional
> vector (r, cf, cq, vis, ts, val) there's a lot of room for fiddling with
> the specifics of it. YCSB gives you several knobs, but unfortunately, not
> absolutely everything was tunable.
>
> The things that are configurable:
> - # of rows
> - # of column qualifiers
> - length of value
> - number of operations per client
> - number of clients
>
> Things that are not configurable:
> - row length (it's a long int)
> - # of column families (constant at one per row)
> - length of column qualifier (basically a one-up counter per row)
> - visibilities
>
> In all of my experiments, the goal was to keep data size constant. This can
> be approximated by (number of entries * entry size). Number of entries is
> intuitively rows (configurable) * column families (1) * column qualifiers
> per family (configurable), while entry size is key overhead (about 40
> bytes) + configured length of value. So to keep total size constant, we
> have three easy knobs. However, tweaking three values at a time produces
> really messy data where you're not always going to be sure where the
> causality arrow lies. Even doing two at a time can cause issues but then
> the choice is between tweaking two properties of the data, or one property
> of the data and the total size (which is also a relevant attribute).
>
> Whew. So why did I use two different independent variables between the two
> halves?
>
> Partly, because I'm not comparing the two tests to each other, so they
> don't have to be duplicative. I ran them on different hardware from each
> other, with different number of clients, disks, cores, etc. There are no
> meaningful comparisons to be drawn, so I wanted to remove the temptation to
> compare results against each other. I'll admit that I might be wrong in
> this regard.
>
> The graphs are not my complete data sets. For the Accumulo v Accumulo
> tests, we have about ten more data points varying rows and data size as
> well. Trying to show three independent variables on a graph was pretty
> painful, so they didn't make it into the presentation. The short version of
> the story is that nothing scaled linearly (some things were better, some
> things were worse) but the general trend lines were approximately what you
> would expect.
>
> Let me know if you have more questions, but we can probably start a new
> thread for future search posterity! (This applies to everybody).
>
> Mike
>
>
> On Thu, Jul 10, 2014 at 9:26 AM, Kepner, Jeremy - 0553 - MITLL <
> [email protected]> wrote:
>
> > Mike Drob put together a great talk at the Accumulo Summit (
> > http://www.slideshare.net/AccumuloSummit/10-30-drob) discussing Accumulo
> > performance and HBase performance. This is exactly the kind of work the
> > entire Hadoop community needs to continue to move forward.
> >
> > I had one question about the talk which I was wondering if someone might
> > be able to shed light on. In the Accumulo part of the talk the experiments
> > varied #rows while keeping the #cols fixed, while in the Accumulo/HBase
> > part of the experiments varied #cols while keeping #rows fixed?
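
For anyone following along, the sizing approximation in Mike's message (number of entries * entry size) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the example numbers are hypothetical, and the 40-byte key overhead is the rough figure quoted in the thread, not a measured value.

```python
# Rough data-size model from the thread:
#   total size ~= (rows * column families * qualifiers per family)
#                 * (key overhead + value length)
# KEY_OVERHEAD_BYTES is the "about 40 bytes" estimate from the message.
KEY_OVERHEAD_BYTES = 40


def approx_data_size(rows, qualifiers_per_family, value_length,
                     families_per_row=1, key_overhead=KEY_OVERHEAD_BYTES):
    """Approximate total bytes written for a given workload shape."""
    entries = rows * families_per_row * qualifiers_per_family
    entry_size = key_overhead + value_length
    return entries * entry_size


def rows_for_constant_size(target_bytes, qualifiers_per_family, value_length,
                           families_per_row=1,
                           key_overhead=KEY_OVERHEAD_BYTES):
    """Row count (rounded down) that keeps total size near target_bytes
    after the number of qualifiers or the value length has been changed."""
    bytes_per_row = approx_data_size(1, qualifiers_per_family, value_length,
                                     families_per_row, key_overhead)
    return target_bytes // bytes_per_row


if __name__ == "__main__":
    # Hypothetical baseline: 1000 rows, 10 qualifiers, 60-byte values.
    baseline = approx_data_size(1000, 10, 60)
    # Doubling the qualifiers means halving the rows to hold size constant.
    print(baseline, rows_for_constant_size(baseline, 20, 60))
```

The second helper shows the trade-off described in the thread: changing one knob (qualifiers or value length) forces a compensating change in another (rows) if total size is to stay roughly fixed.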
