W.r.t. HBase filter performance, can you let us know the source of that information? Is performance bad for all filters, or only for some types of filters?
Cheers

On Jul 16, 2014, at 11:48 PM, Jianshi Huang <[email protected]> wrote:

> I finally decided to make the Storage part agnostic and wrote an HBase adapter as well.
>
> I'd like to say that Accumulo's API makes more sense than HBase's.
>
> One example is scan. In Accumulo, a scan can take a Range, which can take a startKey and an endKey; a Key is a combination of RK+CF+CQ+CV+TS. This worked perfectly and really frees up the CQ to be a magic field in schema design.
>
> In HBase, a scan can only take a startRow and an endRow... which both indicate the RK part. Scanning through a range of columns requires a ColumnRangeFilter AFAIK. (I heard HBase filters' performance sucks, not sure why.)
>
> Any HBase expert disagrees? :)
>
> Jianshi
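For anyone following along, here is roughly what the difference Jianshi describes looks like in the two Java client APIs. The class and method names of the wrapper (ColumnRangeScans, accumuloColumnRange, hbaseColumnRange) and the table/row/family/qualifier values are made up; the library calls themselves are from the Accumulo 1.x and HBase 0.98-era clients as I remember them, so treat this as a sketch rather than a tested snippet:

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.ColumnRangeFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class ColumnRangeScans {

  // Accumulo: the Range is built from full Keys, so the column qualifier
  // participates directly in the scan bounds.
  static Scanner accumuloColumnRange(Connector conn) throws TableNotFoundException {
    Scanner s = conn.createScanner("mytable", Authorizations.EMPTY);
    Key start = new Key(new Text("row1"), new Text("cf"), new Text("cq-aaa"));
    Key stop  = new Key(new Text("row1"), new Text("cf"), new Text("cq-zzz"));
    s.setRange(new Range(start, stop));
    return s;
  }

  // HBase: startRow/stopRow only bound the row key; a qualifier range has to
  // be expressed as a ColumnRangeFilter layered on top of the Scan.
  static ResultScanner hbaseColumnRange(HTable table) throws java.io.IOException {
    Scan scan = new Scan(Bytes.toBytes("row1"), Bytes.toBytes("row2"));
    scan.addFamily(Bytes.toBytes("cf"));
    scan.setFilter(new ColumnRangeFilter(
        Bytes.toBytes("cq-aaa"), true,    // min qualifier, inclusive
        Bytes.toBytes("cq-zzz"), false)); // max qualifier, exclusive
    return table.getScanner(scan);
  }
}

The API shapes really do differ: in Accumulo the qualifier bounds are part of the Range the scanner seeks over, while in HBase they are applied as a Filter on top of the row-bounded scan. Whether that explains the filter-performance concern is a separate question.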
> On Fri, Jul 11, 2014 at 9:56 PM, Kepner, Jeremy - 0553 - MITLL <[email protected]> wrote:
>
>> In many communities real applications can be hard to come by. These communities often sponsor the development of application/technology-agnostic benchmarks as a surrogate. YCSB is one example. Graph500 is another example. This allows a technology developer to focus on their own technology and not have to be in the awkward position of having to also benchmark multiple technologies simultaneously. It's not a perfect system, but in general a few mediocre benchmarks tend to be a lot better than no benchmarks (too many benchmarks is also a problem). The benchmarks also help with communicating the work because you can just reference the benchmark.
>>
>> Now the tricky part is educating the customer base in how to interpret the benchmarks. To a certain degree this is simply the burden of being an informed consumer, but we can do as much as we can to help them. Using standard benchmarks and showing how they correlate with some applications and don't correlate with other applications is our obligation.
>>
>> On Jul 11, 2014, at 1:05 AM, Josh Elser <[email protected]> wrote:
>>
>> > It's important to remember that YCSB is a benchmark designed to test a specific workload across database systems.
>> >
>> > No benchmark is going to be representative of every "real-life" workload that can be thought of. Understand what the benchmark is showing and draw realistic conclusions about what the results are showing you WRT the problem(s) you care about.
>> >
>> > On 7/10/14, 11:33 AM, Chuck Adams wrote:
>> >> Dr. Kepner,
>> >>
>> >> Who cares how fast you can load data into a non-indexed HBase or Accumulo database? What is the strategy to handle user queries against this corpus? Run some real-world tests which include simultaneous queries, index maintenance, and information life cycle management during the initial and incremental data loads.
>> >>
>> >> The tests you are running do not appear to have any of the real-world bottlenecks that occur for production systems that users rely on for their business.
>> >>
>> >> Respectfully,
>> >> Chuck Adams
>> >>
>> >> Vice President Technical Leadership Team
>> >> Oracle National Security Group
>> >> 1910 Oracle Way
>> >> Reston, VA 20190
>> >>
>> >> Cell: 301.529.9396
>> >> Email: [email protected]
>> >> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> >>
>> >> -----Original Message-----
>> >> From: Jeremy Kepner [mailto:[email protected]]
>> >> Sent: Thursday, July 10, 2014 11:08 AM
>> >> To: [email protected]
>> >> Subject: Re: How does Accumulo compare to HBase
>> >>
>> >> If you repeated the experiments you did for the Accumulo-only portion with HBase, what would you expect the results to be?
>> >>
>> >> On Thu, Jul 10, 2014 at 10:02:00AM -0500, Mike Drob wrote:
>> >>> At the risk of derailing the original thread, I'll take a moment to explain my methodology. Since each entry can be thought of as a 6-dimensional vector (r, cf, cq, vis, ts, val), there's a lot of room for fiddling with the specifics. YCSB gives you several knobs, but unfortunately not absolutely everything is tunable.
>> >>>
>> >>> The things that are configurable:
>> >>> - # of rows
>> >>> - # of column qualifiers
>> >>> - length of value
>> >>> - number of operations per client
>> >>> - number of clients
>> >>>
>> >>> Things that are not configurable:
>> >>> - row length (it's a long int)
>> >>> - # of column families (constant at one per row)
>> >>> - length of column qualifier (basically a one-up counter per row)
>> >>> - visibilities
>> >>>
>> >>> In all of my experiments, the goal was to keep data size constant. This can be approximated by (number of entries * entry size). Number of entries is intuitively rows (configurable) * column families (1) * column qualifiers per family (configurable), while entry size is key overhead (about 40 bytes) + configured length of value. So to keep total size constant, we have three easy knobs. However, tweaking three values at a time produces really messy data where you're not always going to be sure where the causality arrow lies. Even doing two at a time can cause issues, but then the choice is between tweaking two properties of the data, or one property of the data and the total size (which is also a relevant attribute).
>> >>>
>> >>> Whew. So why did I use two different independent variables between the two halves?
>> >>>
>> >>> Partly because I'm not comparing the two tests to each other, so they don't have to be duplicative. I ran them on different hardware, with different numbers of clients, disks, cores, etc. There are no meaningful comparisons to be drawn, so I wanted to remove the temptation to compare results against each other. I'll admit that I might be wrong in this regard.
>> >>>
>> >>> The graphs are not my complete data sets. For the Accumulo v Accumulo tests, we have about ten more data points varying rows and data size as well. Trying to show three independent variables on a graph was pretty painful, so they didn't make it into the presentation. The short version of the story is that nothing scaled linearly (some things were better, some things were worse), but the general trend lines were approximately what you would expect.
>> >>>
>> >>> Let me know if you have more questions, but we can probably start a new thread for future search posterity! (This applies to everybody.)
>> >>>
>> >>> Mike
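A quick back-of-the-envelope version of the sizing arithmetic Mike describes above. The ~40-byte key overhead is his figure; the class name and the knob values in main() are made-up examples chosen only to show two settings that land on the same total:

public class YcsbSizeEstimate {

  static final long KEY_OVERHEAD_BYTES = 40; // approximate per-entry key overhead

  // total bytes ~= number of entries * entry size, with one column family per row
  static long approxTotalBytes(long rows, long qualifiersPerRow, long valueLength) {
    long entries = rows * 1 /* column families */ * qualifiersPerRow;
    return entries * (KEY_OVERHEAD_BYTES + valueLength);
  }

  public static void main(String[] args) {
    // Two hypothetical settings that target the same ~1 GB of data:
    System.out.println(approxTotalBytes(1_000_000, 10, 60));  // 10M entries * 100 B = 1e9 B
    System.out.println(approxTotalBytes(100_000, 100, 60));   // same total, different shape
  }
}

To hold that product constant you have to move at least two of the three knobs (rows, qualifiers per row, value length) at once, which is exactly the messiness Mike is pointing at.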
>> >>> On Thu, Jul 10, 2014 at 9:26 AM, Kepner, Jeremy - 0553 - MITLL <[email protected]> wrote:
>> >>>
>> >>>> Mike Drob put together a great talk at the Accumulo Summit (http://www.slideshare.net/AccumuloSummit/10-30-drob) discussing Accumulo performance and HBase performance. This is exactly the kind of work the entire Hadoop community needs to continue to move forward.
>> >>>>
>> >>>> I had one question about the talk which I was wondering if someone might be able to shed light on. In the Accumulo part of the talk the experiments varied #rows while keeping #cols fixed, while in the Accumulo/HBase part of the talk the experiments varied #cols while keeping #rows fixed?
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
