I finally decided to make the storage layer agnostic and wrote an HBase adapter as well.
I'd say that Accumulo's API makes more sense than HBase's. One example is scan: in Accumulo, a scan takes a Range, which is bounded by a start Key and an end Key, and a Key is the combination of RK+CF+CQ+CV+TS. This worked perfectly for me, and it really frees the CQ to act as a "magic" field in schema design. In HBase, a scan can only take a startRow and an endRow, which both refer to the RK part; scanning over a range of column qualifiers requires a ColumnRangeFilter, AFAIK. (I've also heard that HBase filter performance is poor, though I'm not sure why.)
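Here's a rough sketch of the difference as I see it. The table name, row, and qualifier values are all made up, and I'm going from memory on the Java client APIs, so treat it as illustrative rather than authoritative:

import java.io.IOException;
import java.util.Map;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.ColumnRangeFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class CqRangeScans {

  // Accumulo: the Range is built from full Keys, so the CQ (or CF,
  // visibility, timestamp) can bound the scan directly, not just the row.
  static void accumuloScan(Connector conn) throws TableNotFoundException {
    Scanner scanner = conn.createScanner("mytable", Authorizations.EMPTY);
    Key start = new Key(new Text("row1"), new Text("cf"), new Text("cq-aaa"));
    Key stop  = new Key(new Text("row1"), new Text("cf"), new Text("cq-mmm"));
    scanner.setRange(new Range(start, true, stop, false)); // [start, stop)
    for (Map.Entry<Key, Value> e : scanner) {
      System.out.println(e.getKey() + " -> " + e.getValue());
    }
  }

  // HBase: the Scan itself only bounds rows; narrowing the qualifier
  // range takes a server-side ColumnRangeFilter layered on top.
  static void hbaseScan(HTable table) throws IOException {
    // stop row "row1\0" limits the scan to exactly row1
    Scan scan = new Scan(Bytes.toBytes("row1"), Bytes.toBytes("row1\0"));
    scan.setFilter(new ColumnRangeFilter(
        Bytes.toBytes("cq-aaa"), true,    // min qualifier, inclusive
        Bytes.toBytes("cq-mmm"), false)); // max qualifier, exclusive
    ResultScanner rs = table.getScanner(scan);
    try {
      for (Result r : rs) {
        System.out.println(r);
      }
    } finally {
      rs.close();
    }
  }
}

In the Accumulo version the CQ bounds are part of the Range itself, so the tserver can seek straight to them; in the HBase version the Scan only bounds rows and the qualifier restriction has to be expressed as a filter.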
Any HBase experts disagree? :)

Jianshi


On Fri, Jul 11, 2014 at 9:56 PM, Kepner, Jeremy - 0553 - MITLL <
[email protected]> wrote:

> In many communities, real applications can be hard to come by. These
> communities often sponsor the development of application/technology
> agnostic benchmarks as a surrogate. YCSB is one example; Graph500 is
> another. This allows a technology developer to focus on their own
> technology and avoid the awkward position of having to benchmark
> multiple technologies simultaneously. It's not a perfect system, but in
> general a few mediocre benchmarks tend to be a lot better than no
> benchmarks (too many benchmarks is also a problem). The benchmarks also
> help with communicating the work, because you can just reference the
> benchmark.
>
> Now the tricky part is educating the customer base in how to interpret
> the benchmarks. To a certain degree this is simply the burden of being
> an informed consumer, but we should do as much as we can to help them.
> Using standard benchmarks and showing how they correlate with some
> applications and don't correlate with others is our obligation.
>
> On Jul 11, 2014, at 1:05 AM, Josh Elser <[email protected]> wrote:
>
> > It's important to remember that YCSB is a benchmark designed to test
> > a specific workload across database systems.
> >
> > No benchmark is going to be representative of every "real-life"
> > workload that can be thought of. Understand what the benchmark is
> > showing and draw realistic conclusions about what the results show
> > WRT the problem(s) you care about.
> >
> > On 7/10/14, 11:33 AM, Chuck Adams wrote:
> >> Dr. Kepner,
> >>
> >> Who cares how fast you can load data into a non-indexed HBase or
> >> Accumulo database? What is the strategy to handle user queries
> >> against this corpus? Run some real-world tests which include
> >> simultaneous queries, index maintenance, and information life cycle
> >> management during the initial and incremental data loads.
> >>
> >> The tests you are running do not appear to have any of the
> >> real-world bottlenecks that occur in production systems that users
> >> rely on for their business.
> >>
> >> Respectfully,
> >> Chuck Adams
> >>
> >> Vice President, Technical Leadership Team
> >> Oracle National Security Group
> >> 1910 Oracle Way
> >> Reston, VA 20190
> >>
> >> Cell: 301.529.9396
> >> Email: [email protected]
> >> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>
> >>
> >> -----Original Message-----
> >> From: Jeremy Kepner [mailto:[email protected]]
> >> Sent: Thursday, July 10, 2014 11:08 AM
> >> To: [email protected]
> >> Subject: Re: How does Accumulo compare to HBase
> >>
> >> If you repeated the experiments you did for the Accumulo-only
> >> portion with HBase, what would you expect the results to be?
> >>
> >> On Thu, Jul 10, 2014 at 10:02:00AM -0500, Mike Drob wrote:
> >>> At the risk of derailing the original thread, I'll take a moment to
> >>> explain my methodology.
> >>>
> >>> Since each entry can be thought of as a 6-dimensional vector
> >>> (r, cf, cq, vis, ts, val), there's a lot of room for fiddling with
> >>> the specifics. YCSB gives you several knobs, but unfortunately not
> >>> absolutely everything is tunable.
> >>>
> >>> The things that are configurable:
> >>> - # of rows
> >>> - # of column qualifiers
> >>> - length of value
> >>> - number of operations per client
> >>> - number of clients
> >>>
> >>> Things that are not configurable:
> >>> - row length (it's a long int)
> >>> - # of column families (constant at one per row)
> >>> - length of column qualifier (basically a one-up counter per row)
> >>> - visibilities
> >>>
> >>> In all of my experiments, the goal was to keep data size constant.
> >>> This can be approximated by (number of entries * entry size).
> >>> Number of entries is intuitively rows (configurable) * column
> >>> families (1) * column qualifiers per family (configurable), while
> >>> entry size is key overhead (about 40 bytes) + configured length of
> >>> value. So to keep total size constant, we have three easy knobs.
> >>> However, tweaking three values at a time produces really messy data
> >>> where you're not always going to be sure where the causality arrow
> >>> lies. Even doing two at a time can cause issues, but then the
> >>> choice is between tweaking two properties of the data, or one
> >>> property of the data and the total size (which is also a relevant
> >>> attribute).
> >>>
> >>> Whew. So why did I use two different independent variables between
> >>> the two halves?
> >>>
> >>> Partly because I'm not comparing the two tests to each other, so
> >>> they don't have to be duplicative. I ran them on different hardware
> >>> from each other, with different numbers of clients, disks, cores,
> >>> etc. There are no meaningful comparisons to be drawn, so I wanted
> >>> to remove the temptation to compare results against each other.
> >>> I'll admit that I might be wrong in this regard.
> >>>
> >>> The graphs are not my complete data sets. For the Accumulo v.
> >>> Accumulo tests, we have about ten more data points varying rows and
> >>> data size as well. Trying to show three independent variables on a
> >>> graph was pretty painful, so they didn't make it into the
> >>> presentation. The short version of the story is that nothing scaled
> >>> linearly (some things were better, some were worse), but the
> >>> general trend lines were approximately what you would expect.
> >>>
> >>> Let me know if you have more questions, but we can probably start a
> >>> new thread for future search posterity! (This applies to
> >>> everybody.)
> >>>
> >>> Mike
> >>>
> >>>
> >>> On Thu, Jul 10, 2014 at 9:26 AM, Kepner, Jeremy - 0553 - MITLL <
> >>> [email protected]> wrote:
> >>>
> >>>> Mike Drob put together a great talk at the Accumulo Summit (
> >>>> http://www.slideshare.net/AccumuloSummit/10-30-drob) discussing
> >>>> Accumulo performance and HBase performance. This is exactly the
> >>>> kind of work the entire Hadoop community needs to continue to move
> >>>> forward.
> >>>>
> >>>> I had one question about the talk which I was wondering if someone
> >>>> might be able to shed light on. In the Accumulo part of the talk,
> >>>> the experiments varied #rows while keeping #cols fixed, while in
> >>>> the Accumulo/HBase part of the talk, the experiments varied #cols
> >>>> while keeping #rows fixed?


-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
