I finally decided to make the storage layer agnostic and wrote an HBase adapter as well.
I'd say that Accumulo's API makes more sense than HBase's. One example is scan: in Accumulo, a scan takes a Range, which is bounded by a start Key and an end Key, and a Key is the combination of RK+CF+CQ+CV+TS. This worked perfectly for me, and it really frees the CQ to act as a "magic" field in schema design. In HBase, a scan can only take a startRow and an endRow, which both refer to the RK part; scanning over a range of column qualifiers requires a ColumnRangeFilter, AFAIK. (I've also heard that HBase filter performance is poor, though I'm not sure why.)
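Here's a rough sketch of the difference as I see it. The table name, row, and qualifier values are all made up, and I'm going from memory on the Java client APIs, so treat it as illustrative rather than authoritative:

import java.io.IOException;
import java.util.Map;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.ColumnRangeFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class CqRangeScans {

  // Accumulo: the Range is built from full Keys, so the CQ (or CF,
  // visibility, timestamp) can bound the scan directly, not just the row.
  static void accumuloScan(Connector conn) throws TableNotFoundException {
    Scanner scanner = conn.createScanner("mytable", Authorizations.EMPTY);
    Key start = new Key(new Text("row1"), new Text("cf"), new Text("cq-aaa"));
    Key stop  = new Key(new Text("row1"), new Text("cf"), new Text("cq-mmm"));
    scanner.setRange(new Range(start, true, stop, false)); // [start, stop)
    for (Map.Entry<Key, Value> e : scanner) {
      System.out.println(e.getKey() + " -> " + e.getValue());
    }
  }

  // HBase: the Scan itself only bounds rows; narrowing the qualifier
  // range takes a server-side ColumnRangeFilter layered on top.
  static void hbaseScan(HTable table) throws IOException {
    // stop row "row1\0" limits the scan to exactly row1
    Scan scan = new Scan(Bytes.toBytes("row1"), Bytes.toBytes("row1\0"));
    scan.setFilter(new ColumnRangeFilter(
        Bytes.toBytes("cq-aaa"), true,    // min qualifier, inclusive
        Bytes.toBytes("cq-mmm"), false)); // max qualifier, exclusive
    ResultScanner rs = table.getScanner(scan);
    try {
      for (Result r : rs) {
        System.out.println(r);
      }
    } finally {
      rs.close();
    }
  }
}

In the Accumulo version the CQ bounds are part of the Range itself, so the tserver can seek straight to them; in the HBase version the Scan only bounds rows and the qualifier restriction has to be expressed as a filter.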
Any HBase experts disagree? :)

Jianshi


On Fri, Jul 11, 2014 at 9:56 PM, Kepner, Jeremy - 0553 - MITLL <
[email protected]> wrote:

> In many communities, real applications can be hard to come by. These
> communities often sponsor the development of application/technology
> agnostic benchmarks as a surrogate. YCSB is one example; Graph500 is
> another. This allows a technology developer to focus on their own
> technology and avoid the awkward position of having to benchmark
> multiple technologies simultaneously. It's not a perfect system, but in
> general a few mediocre benchmarks tend to be a lot better than no
> benchmarks (too many benchmarks is also a problem). The benchmarks also
> help with communicating the work, because you can just reference the
> benchmark.
>
> Now the tricky part is educating the customer base in how to interpret
> the benchmarks. To a certain degree this is simply the burden of being
> an informed consumer, but we should do as much as we can to help them.
> Using standard benchmarks and showing how they correlate with some
> applications and don't correlate with others is our obligation.
>
> On Jul 11, 2014, at 1:05 AM, Josh Elser <[email protected]> wrote:
>
> > It's important to remember that YCSB is a benchmark designed to test
> > a specific workload across database systems.
> >
> > No benchmark is going to be representative of every "real-life"
> > workload that can be thought of. Understand what the benchmark is
> > showing and draw realistic conclusions about what the results show
> > WRT the problem(s) you care about.
> >
> > On 7/10/14, 11:33 AM, Chuck Adams wrote:
> >> Dr. Kepner,
> >>
> >> Who cares how fast you can load data into a non-indexed HBase or
> >> Accumulo database? What is the strategy to handle user queries
> >> against this corpus? Run some real-world tests which include
> >> simultaneous queries, index maintenance, and information life cycle
> >> management during the initial and incremental data loads.
> >>
> >> The tests you are running do not appear to have any of the
> >> real-world bottlenecks that occur in production systems that users
> >> rely on for their business.
> >>
> >> Respectfully,
> >> Chuck Adams
> >>
> >> Vice President, Technical Leadership Team
> >> Oracle National Security Group
> >> 1910 Oracle Way
> >> Reston, VA 20190
> >>
> >> Cell: 301.529.9396
> >> Email: [email protected]
> >> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>
> >>
> >> -----Original Message-----
> >> From: Jeremy Kepner [mailto:[email protected]]
> >> Sent: Thursday, July 10, 2014 11:08 AM
> >> To: [email protected]
> >> Subject: Re: How does Accumulo compare to HBase
> >>
> >> If you repeated the experiments you did for the Accumulo-only
> >> portion with HBase, what would you expect the results to be?
> >>
> >> On Thu, Jul 10, 2014 at 10:02:00AM -0500, Mike Drob wrote:
> >>> At the risk of derailing the original thread, I'll take a moment to
> >>> explain my methodology.
> >>>
> >>> Since each entry can be thought of as a 6-dimensional vector
> >>> (r, cf, cq, vis, ts, val), there's a lot of room for fiddling with
> >>> the specifics. YCSB gives you several knobs, but unfortunately not
> >>> absolutely everything is tunable.
> >>>
> >>> The things that are configurable:
> >>> - # of rows
> >>> - # of column qualifiers
> >>> - length of value
> >>> - number of operations per client
> >>> - number of clients
> >>>
> >>> Things that are not configurable:
> >>> - row length (it's a long int)
> >>> - # of column families (constant at one per row)
> >>> - length of column qualifier (basically a one-up counter per row)
> >>> - visibilities
> >>>
> >>> In all of my experiments, the goal was to keep data size constant.
> >>> This can be approximated by (number of entries * entry size).
> >>> Number of entries is intuitively rows (configurable) * column
> >>> families (1) * column qualifiers per family (configurable), while
> >>> entry size is key overhead (about 40 bytes) + configured length of
> >>> value. So to keep total size constant, we have three easy knobs.
> >>> However, tweaking three values at a time produces really messy data
> >>> where you're not always going to be sure where the causality arrow
> >>> lies. Even doing two at a time can cause issues, but then the
> >>> choice is between tweaking two properties of the data, or one
> >>> property of the data and the total size (which is also a relevant
> >>> attribute).
> >>>
> >>> Whew. So why did I use two different independent variables between
> >>> the two halves?
> >>>
> >>> Partly because I'm not comparing the two tests to each other, so
> >>> they don't have to be duplicative. I ran them on different hardware
> >>> from each other, with different numbers of clients, disks, cores,
> >>> etc. There are no meaningful comparisons to be drawn, so I wanted
> >>> to remove the temptation to compare results against each other.
> >>> I'll admit that I might be wrong in this regard.
> >>>
> >>> The graphs are not my complete data sets. For the Accumulo v.
> >>> Accumulo tests, we have about ten more data points varying rows and
> >>> data size as well. Trying to show three independent variables on a
> >>> graph was pretty painful, so they didn't make it into the
> >>> presentation. The short version of the story is that nothing scaled
> >>> linearly (some things were better, some were worse), but the
> >>> general trend lines were approximately what you would expect.
> >>>
> >>> Let me know if you have more questions, but we can probably start a
> >>> new thread for future search posterity! (This applies to
> >>> everybody.)
> >>>
> >>> Mike
> >>>
> >>>
> >>> On Thu, Jul 10, 2014 at 9:26 AM, Kepner, Jeremy - 0553 - MITLL <
> >>> [email protected]> wrote:
> >>>
> >>>> Mike Drob put together a great talk at the Accumulo Summit (
> >>>> http://www.slideshare.net/AccumuloSummit/10-30-drob) discussing
> >>>> Accumulo performance and HBase performance. This is exactly the
> >>>> kind of work the entire Hadoop community needs to continue to move
> >>>> forward.
> >>>>
> >>>> I had one question about the talk which I was wondering if someone
> >>>> might be able to shed light on. In the Accumulo part of the talk,
> >>>> the experiments varied #rows while keeping #cols fixed, while in
> >>>> the Accumulo/HBase part of the talk, the experiments varied #cols
> >>>> while keeping #rows fixed?


-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
