Chuck,

With regard to "the strategy to handle user queries against the corpus," we have implemented the secondary indexing concepts of D4M <http://www.mit.edu/~kepner/pubs/D4Mschema_HPEC2013_Paper.pdf> and have introduced distributed SQL queries on top of Accumulo. Here is a presentation from the recent Accumulo Summit: http://www.slideshare.net/AccumuloSummit/14-25-navruzyan
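
For a rough flavor of the schema, here is a minimal sketch of the D4M-style "table plus transpose" indexing idea (illustration only, not our production code; the table names follow the D4M paper's conventions and the Accumulo 1.x client calls are assumed, and the SQL layer is not shown):

    // Sketch of a D4M-style forward table + transpose index in Accumulo.
    // Assumes an Accumulo 1.x Connector and pre-created tables Tedge / TedgeT.
    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;

    public class D4mIndexSketch {
      static final Value ONE = new Value("1".getBytes());

      // Write one record field both ways:
      //   Tedge:  row = recordId,     column = field|value   (lookup by record)
      //   TedgeT: row = field|value,  column = recordId      (lookup by value)
      public static void index(Connector conn, String recordId,
                               String field, String fieldValue) throws Exception {
        BatchWriter edge  = conn.createBatchWriter("Tedge",  new BatchWriterConfig());
        BatchWriter edgeT = conn.createBatchWriter("TedgeT", new BatchWriterConfig());
        String col = field + "|" + fieldValue;   // D4M "exploded" column key

        Mutation m = new Mutation(recordId);
        m.put("", col, ONE);
        edge.addMutation(m);

        Mutation mt = new Mutation(col);
        mt.put("", recordId, ONE);
        edgeT.addMutation(mt);

        edge.close();
        edgeT.close();
      }
    }

In practice the writers would be long-lived (or a MultiTableBatchWriter), but the transpose table is what makes lookups by value fast without scanning the whole corpus.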
Jeremy's research and benchmarks were instrumental to our efforts.

Arshak Navruzyan | VP Product | Argyle Data, Inc.

On Thu, Jul 10, 2014 at 8:33 AM, Chuck Adams <[email protected]> wrote:

> Dr. Kepner,
>
> Who cares how fast you can load data into a non-indexed HBase or Accumulo
> database? What is the strategy to handle user queries against this corpus?
> Run some real-world tests which include simultaneous queries, index
> maintenance, and information life cycle management during the initial and
> incremental data loads.
>
> The tests you are running do not appear to have any of the real-world
> bottlenecks that occur for production systems that users rely on for
> their business.
>
> Respectfully,
> Chuck Adams
>
> Vice President, Technical Leadership Team
> Oracle National Security Group
> 1910 Oracle Way
> Reston, VA 20190
>
> Cell: 301.529.9396
> Email: [email protected]
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> -----Original Message-----
> From: Jeremy Kepner [mailto:[email protected]]
> Sent: Thursday, July 10, 2014 11:08 AM
> To: [email protected]
> Subject: Re: How does Accumulo compare to HBase
>
> If you repeated the experiments you did for the Accumulo-only portion
> with HBase, what would you expect the results to be?
>
> On Thu, Jul 10, 2014 at 10:02:00AM -0500, Mike Drob wrote:
> > At the risk of derailing the original thread, I'll take a moment to
> > explain my methodology. Since each entry can be thought of as a
> > 6-dimensional vector (r, cf, cq, vis, ts, val), there's a lot of room
> > for fiddling with the specifics. YCSB gives you several knobs, but
> > unfortunately, not absolutely everything was tunable.
> >
> > The things that are configurable:
> > - # of rows
> > - # of column qualifiers
> > - length of value
> > - number of operations per client
> > - number of clients
> >
> > Things that are not configurable:
> > - row length (it's a long int)
> > - # of column families (constant at one per row)
> > - length of column qualifier (basically a one-up counter per row)
> > - visibilities
> >
> > In all of my experiments, the goal was to keep data size constant.
> > This can be approximated by (number of entries * entry size). Number
> > of entries is intuitively rows (configurable) * column families (1) *
> > column qualifiers per family (configurable), while entry size is key
> > overhead (about 40 bytes) + configured length of value. So to keep
> > total size constant, we have three easy knobs. However, tweaking
> > three values at a time produces really messy data where you're not
> > always going to be sure where the causality arrow lies. Even doing
> > two at a time can cause issues, but then the choice is between
> > tweaking two properties of the data, or one property of the data and
> > the total size (which is also a relevant attribute).
> >
> > Whew. So why did I use two different independent variables between
> > the two halves?
> >
> > Partly because I'm not comparing the two tests to each other, so they
> > don't have to be duplicative. I ran them on different hardware, with
> > different numbers of clients, disks, cores, etc. There are no
> > meaningful comparisons to be drawn, so I wanted to remove the
> > temptation to compare results against each other. I'll admit that I
> > might be wrong in this regard.
> >
> > The graphs are not my complete data sets. For the Accumulo v. Accumulo
> > tests, we have about ten more data points varying rows and data size
> > as well.
> > Trying to show three independent variables on a graph was pretty
> > painful, so they didn't make it into the presentation. The short
> > version of the story is that nothing scaled linearly (some things
> > were better, some things were worse), but the general trend lines
> > were approximately what you would expect.
> >
> > Let me know if you have more questions, but we can probably start a
> > new thread for future search posterity! (This applies to everybody.)
> >
> > Mike
> >
> > On Thu, Jul 10, 2014 at 9:26 AM, Kepner, Jeremy - 0553 - MITLL <
> > [email protected]> wrote:
> >
> > > Mike Drob put together a great talk at the Accumulo Summit
> > > (http://www.slideshare.net/AccumuloSummit/10-30-drob) discussing
> > > Accumulo performance and HBase performance. This is exactly the
> > > kind of work the entire Hadoop community needs in order to continue
> > > to move forward.
> > >
> > > I had one question about the talk which I was wondering if someone
> > > might be able to shed light on. In the Accumulo part of the talk
> > > the experiments varied #rows while keeping #cols fixed, while in
> > > the Accumulo/HBase part of the talk the experiments varied #cols
> > > while keeping #rows fixed?
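
For readers following Mike's sizing arithmetic quoted above, a minimal worked sketch of the calculation (the specific row counts and value lengths below are hypothetical; only the roughly 40-byte key overhead and the one-column-family-per-row constraint come from the thread):

    // Back-of-the-envelope data-size arithmetic from the quoted discussion:
    //   total bytes ~= entries * entry size, where
    //   entries    = rows * columnFamilies(1) * columnQualifiersPerRow
    //   entry size = key overhead (~40 bytes) + value length
    public class SizingSketch {
      static final long KEY_OVERHEAD_BYTES = 40;   // approximate, per the thread

      static long totalBytes(long rows, long cqPerRow, long valueLen) {
        long entries = rows * 1 * cqPerRow;        // one column family per row
        return entries * (KEY_OVERHEAD_BYTES + valueLen);
      }

      public static void main(String[] args) {
        // Two hypothetical configurations that keep total size constant
        // while trading #rows against value length (two of the "three easy knobs").
        System.out.println(totalBytes(1_000_000, 10, 100));  // 1,400,000,000 bytes
        System.out.println(totalBytes(2_000_000, 10,  30));  // 1,400,000,000 bytes
      }
    }

Holding this product constant while moving one knob at a time is exactly the trade-off Mike describes; moving two or three knobs at once makes it hard to attribute any change in throughput to a single variable.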
