Last year, I used Accumulo's rapid-ingest capability to join two data silos into one dataset. Every field was fully indexed. Having all of the data in one place made it possible to run cross-referencing queries that, for various reasons, were not possible with the existing technology. The rapid ingest mattered because a fresh copy of the data silos was pulled every night.
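In case it helps to see what "every field was fully indexed" means concretely, here is a minimal sketch of the dual-write ingest pattern: each record is written to a forward table, and every field value also gets an entry in an index table. The table names, record layout, and helper method are illustrative assumptions rather than the actual production code; the client calls are the standard Accumulo 1.x BatchWriter API.

import java.nio.charset.StandardCharsets;
import java.util.Map;

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;

public class IndexedIngest {
    private static final Value EMPTY = new Value(new byte[0]);

    // Writes one record plus one index entry per field, so every field
    // is queryable by value. Table names here are hypothetical.
    static void ingest(Connector conn, String recordId, Map<String, String> fields)
            throws Exception {
        BatchWriter records = conn.createBatchWriter("records", new BatchWriterConfig());
        BatchWriter index = conn.createBatchWriter("index", new BatchWriterConfig());
        try {
            Mutation rec = new Mutation(recordId);
            for (Map.Entry<String, String> f : fields.entrySet()) {
                // Forward table: row = record id, cf = field name, value = field value.
                rec.put(f.getKey(), "", new Value(f.getValue().getBytes(StandardCharsets.UTF_8)));
                // Index table: row = field value, cf = field name, cq = record id.
                Mutation idx = new Mutation(f.getValue());
                idx.put(f.getKey(), recordId, EMPTY);
                index.addMutation(idx);
            }
            records.addMutation(rec);
        } finally {
            records.close();
            index.close();
        }
    }
}

With that layout, finding every record that contains a given value is a single scan over one row of the index table, which is what made the cross-referencing queries fast enough to be useful.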
On Thu, Jul 10, 2014 at 1:55 PM, Sean Busbey <[email protected]> wrote:
>
> Chuck,
>
> It would help the community and my own benchmarking efforts if you could
> describe how you think a benchmark might incorporate representations of
> real-world bottlenecks.
>
> Do you think YCSB sufficiently covers the kind of testing you'd prefer?
>
>
> Marc,
>
> Similarly, it would help if you could describe the use case(s) behind your
> statement of interest.
>
> -Sean
>
>
> On Thu, Jul 10, 2014 at 12:11 PM, Marc Parisi <[email protected]> wrote:
>
>> I care
>>
>>
>> On Thu, Jul 10, 2014 at 11:33 AM, Chuck Adams <[email protected]>
>> wrote:
>>
>>> Dr. Kepner,
>>>
>>> Who cares how fast you can load data into a non-indexed HBase or
>>> Accumulo database? What is the strategy to handle user queries against
>>> this corpus? Run some real-world tests that include simultaneous
>>> queries, index maintenance, and information life-cycle management
>>> during the initial and incremental data loads.
>>>
>>> The tests you are running do not appear to have any of the real-world
>>> bottlenecks that occur for production systems that users rely on for
>>> their business.
>>>
>>> Respectfully,
>>> Chuck Adams
>>>
>>> Vice President, Technical Leadership Team
>>> Oracle National Security Group
>>> 1910 Oracle Way
>>> Reston, VA 20190
>>>
>>> Cell: 301.529.9396
>>> Email: [email protected]
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>
>>>
>>> -----Original Message-----
>>> From: Jeremy Kepner [mailto:[email protected]]
>>> Sent: Thursday, July 10, 2014 11:08 AM
>>> To: [email protected]
>>> Subject: Re: How does Accumulo compare to HBase
>>>
>>> If you repeated the experiments you did for the Accumulo-only portion
>>> with HBase, what would you expect the results to be?
>>>
>>> On Thu, Jul 10, 2014 at 10:02:00AM -0500, Mike Drob wrote:
>>> > At the risk of derailing the original thread, I'll take a moment to
>>> > explain my methodology. Since each entry can be thought of as a
>>> > six-dimensional vector (r, cf, cq, vis, ts, val), there's a lot of
>>> > room for fiddling with the specifics. YCSB gives you several knobs,
>>> > but unfortunately, not absolutely everything was tunable.
>>> >
>>> > The things that are configurable:
>>> > - # of rows
>>> > - # of column qualifiers
>>> > - length of value
>>> > - number of operations per client
>>> > - number of clients
>>> >
>>> > Things that are not configurable:
>>> > - row length (it's a long int)
>>> > - # of column families (constant at one per row)
>>> > - length of column qualifier (basically a one-up counter per row)
>>> > - visibilities
>>> >
>>> > In all of my experiments, the goal was to keep data size constant.
>>> > This can be approximated by (number of entries * entry size). Number
>>> > of entries is intuitively rows (configurable) * column families (1) *
>>> > column qualifiers per family (configurable), while entry size is key
>>> > overhead (about 40 bytes) + the configured length of value. So to
>>> > keep total size constant, we have three easy knobs. However, tweaking
>>> > three values at a time produces really messy data where you're not
>>> > always going to be sure where the causality arrow lies. Even doing
>>> > two at a time can cause issues, but then the choice is between
>>> > tweaking two properties of the data, or one property of the data and
>>> > the total size (which is also a relevant attribute).
>>> >
>>> > Whew. So why did I use two different independent variables between
>>> > the two halves?
>>> >
>>> > Partly because I'm not comparing the two tests to each other, so they
>>> > don't have to be duplicative. I ran them on different hardware, with
>>> > different numbers of clients, disks, cores, etc. There are no
>>> > meaningful comparisons to be drawn, so I wanted to remove the
>>> > temptation to compare the results against each other. I'll admit that
>>> > I might be wrong in this regard.
>>> >
>>> > The graphs are not my complete data sets. For the Accumulo vs.
>>> > Accumulo tests, we have about ten more data points varying rows and
>>> > data size as well. Trying to show three independent variables on a
>>> > graph was pretty painful, so they didn't make it into the
>>> > presentation. The short version of the story is that nothing scaled
>>> > linearly (some things were better, some things were worse), but the
>>> > general trend lines were approximately what you would expect.
>>> >
>>> > Let me know if you have more questions, but we can probably start a
>>> > new thread for future search posterity! (This applies to everybody.)
>>> >
>>> > Mike
>>> >
>>> >
>>> > On Thu, Jul 10, 2014 at 9:26 AM, Kepner, Jeremy - 0553 - MITLL <
>>> > [email protected]> wrote:
>>> >
>>> > > Mike Drob put together a great talk at the Accumulo Summit (
>>> > > http://www.slideshare.net/AccumuloSummit/10-30-drob) discussing
>>> > > Accumulo performance and HBase performance. This is exactly the
>>> > > kind of work the entire Hadoop community needs to keep moving
>>> > > forward.
>>> > >
>>> > > I had one question about the talk that I was hoping someone might
>>> > > be able to shed light on. In the Accumulo part of the talk, the
>>> > > experiments varied #rows while keeping #cols fixed, while in the
>>> > > Accumulo/HBase part of the talk, the experiments varied #cols while
>>> > > keeping #rows fixed?
>>>
>>
>
> --
> Sean
>
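For future searchers: Mike's constant-data-size arithmetic above is easy to sanity check with a few lines of code. The class and method names below are invented for illustration; the roughly 40-byte key overhead and the entries-times-entry-size approximation are taken directly from his description.

public final class SizeModel {
    // Approximate per-entry key overhead from Mike's description, in bytes.
    static final long KEY_OVERHEAD = 40;

    // totalBytes ~ rows * column families (1) * qualifiers per row
    //              * (key overhead + value length)
    static long totalBytes(long rows, long cqPerRow, long valueLength) {
        long entries = rows * cqPerRow; // one column family per row
        return entries * (KEY_OVERHEAD + valueLength);
    }

    public static void main(String[] args) {
        long a = totalBytes(1_000_000, 10, 100); // 1M rows, 10 qualifiers, 100-byte values
        long b = totalBytes(500_000, 20, 100);   // halve rows, double qualifiers
        System.out.println(a + " " + b); // both 1400000000: total size held constant
    }
}

That is the sense in which the three knobs trade off against each other: any change to one must be compensated by another to hold the total constant, which is why varying more than one at a time muddies the causality.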
