It's important to remember that YCSB is a benchmark designed to test a specific workload across database systems.

No benchmark is going to be representative of every conceivable "real-life" workload. Understand what the benchmark is actually measuring and draw realistic conclusions about what the results tell you with respect to the problem(s) you care about.

On 7/10/14, 11:33 AM, Chuck Adams wrote:
Dr. Kepner,

Who cares how fast you can load data into a non-indexed HBase or Accumulo
database?  What is the strategy for handling user queries against this corpus?
Run some real-world tests that include simultaneous queries, index
maintenance, and information life-cycle management during the initial and
incremental data loads.

The tests you are running do not appear to include any of the real-world
bottlenecks that occur in production systems that users rely on for their
business.

Respectfully,
Chuck Adams

Vice President Technical Leadership Team
Oracle National Security Group
1910 Oracle Way
Reston, VA 20190

Cell:     301.529.9396
Email: [email protected]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


-----Original Message-----
From: Jeremy Kepner [mailto:[email protected]]
Sent: Thursday, July 10, 2014 11:08 AM
To: [email protected]
Subject: Re: How does Accumulo compare to HBase

If you repeated the experiments you did for the Accumulo-only portion with
HBase, what would you expect the results to be?

On Thu, Jul 10, 2014 at 10:02:00AM -0500, Mike Drob wrote:
At the risk of derailing the original thread, I'll take a moment to
explain my methodology. Since each entry can be thought of as a
6-dimensional vector (r, cf, cq, vis, ts, val), there's a lot of room for
fiddling with the specifics. YCSB gives you several knobs, but
unfortunately not everything is tunable (a sample workload file exercising
the configurable knobs is sketched after the two lists below).
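
Spelled out, that 6-tuple is just row, column family, column qualifier,
visibility, timestamp, and value; a rough Python sketch, with placeholder
values that only illustrate the shape:

    from collections import namedtuple

    # One entry: row, column family, column qualifier, visibility,
    # timestamp, value. All values below are placeholders.
    Entry = namedtuple("Entry", ["r", "cf", "cq", "vis", "ts", "val"])

    e = Entry(r=123456789, cf="family", cq="field0",
              vis="", ts=1404993780000, val=b"x" * 100)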

The things that are configurable:
- # of rows
- # of column qualifiers
- length of value
- number of operations per client
- number of clients

Things that are not configurable:
- row length (it's a long int)
- # of column families (constant at one per row)
- length of column qualifier (basically a one-up counter per row)
- visibilities
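
For concreteness, here is a rough sketch of a YCSB workload file exercising
those configurable knobs. The property names are standard YCSB CoreWorkload
properties; the numbers are purely illustrative, not the settings behind the
graphs, and the mapping of fields to column qualifiers is how I understand
the Accumulo binding to behave:

    # Sketch only: generate a YCSB workload file with illustrative values.
    knobs = {
        "workload": "com.yahoo.ycsb.workloads.CoreWorkload",
        "recordcount": 1000000,      # number of rows
        "fieldcount": 10,            # column qualifiers per row
        "fieldlength": 100,          # length of each value, in bytes
        "operationcount": 5000000,   # operations issued by this client process
        "insertproportion": 1.0,     # pure load phase
        "readproportion": 0.0,
    }

    # Hypothetical file name, used only for this example.
    with open("workloads/accumulo_vs_hbase", "w") as f:
        for name, value in knobs.items():
            f.write("%s=%s\n" % (name, value))

    # The number of clients is not a workload property; per process it is
    # set with the -threads option on the ycsb command line.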

In all of my experiments, the goal was to keep data size constant.
Total size can be approximated by (number of entries * entry size). Number
of entries is intuitively rows (configurable) * column families (1) *
column qualifiers per family (configurable), while entry size is key
overhead (about 40 bytes) + the configured length of the value. So to keep
total size constant, we have three easy knobs. However, tweaking three
values at a time produces really messy data where you're not always sure
where the causality arrow lies. Even doing two at a time can cause issues,
but then the choice is between tweaking two properties of the data, or one
property of the data and the total size (which is also a relevant
attribute).
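
As a quick worked example of that constraint (the 40-byte key overhead is
the same approximation as above; the knob settings are invented):

    KEY_OVERHEAD = 40  # approximate key bytes per entry, as above

    def total_bytes(rows, quals_per_row, value_len):
        entries = rows * 1 * quals_per_row           # one column family per row
        return entries * (KEY_OVERHEAD + value_len)

    # Doubling rows while halving qualifiers keeps the entry count, and
    # therefore the total size, unchanged, but it changes two knobs at once.
    print(total_bytes(rows=1000000, quals_per_row=10, value_len=100))  # 1400000000
    print(total_bytes(rows=2000000, quals_per_row=5,  value_len=100))  # 1400000000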

Whew. So why did I use two different independent variables between the
two halves?

Partly because I'm not comparing the two tests to each other, so they
don't have to mirror each other. I ran them on different hardware, with
different numbers of clients, disks, cores, etc. There are no meaningful
comparisons to be drawn, so I wanted to remove the temptation to compare
the results against each other. I'll admit that I might be wrong in this
regard.

The graphs are not my complete data sets. For the Accumulo vs. Accumulo
tests, we have about ten more data points varying rows and data size as
well. Trying to show three independent variables on a graph was pretty
painful, so they didn't make it into the presentation. The short version
of the story is that nothing scaled linearly (some things were better,
some things were worse), but the general trend lines were approximately
what you would expect.

Let me know if you have more questions, but we can probably start a
new thread for future search posterity! (This applies to everybody).

Mike


On Thu, Jul 10, 2014 at 9:26 AM, Kepner, Jeremy - 0553 - MITLL <
[email protected]> wrote:

Mike Drob put together a great talk at the Accumulo Summit (
http://www.slideshare.net/AccumuloSummit/10-30-drob) discussing
Accumulo performance and HBase performance.  This is exactly the kind
of work the entire Hadoop community needs to continue moving forward.

I had one question about the talk which I was wondering if someone
might be able to shed light on.  In the Accumulo part of the talk,
the experiments varied #rows while keeping #cols fixed, while in
the Accumulo/HBase part of the talk the experiments varied #cols while
keeping #rows fixed?
