W.r.t. HBase filter performance, can you let us know the source of that information? Is performance bad for all filters, or only for some types of filters?
Cheers

On Jul 16, 2014, at 11:48 PM, Jianshi Huang <[email protected]> wrote:

> I finally decided to make the Storage part agnostic and wrote an HBase adapter as well.
>
> I'd like to say that Accumulo's API makes more sense than HBase's.
>
> One example is scan. In Accumulo, a scan can take a Range, which can take a startKey and an endKey; a Key is a combination of RK+CF+CQ+CV+TS. This worked perfectly and really frees up the CQ to be a magic field in schema design.
>
> In HBase, a scan can only take a startRow and an endRow... which both indicate the RK part. Scanning through a range of columns requires a ColumnRangeFilter AFAIK. (I heard HBase filters' performance sucks, not sure why.)
>
> Any HBase expert disagrees? :)
>
> Jianshi
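For anyone following along, here is roughly what the difference Jianshi describes looks like in the two Java client APIs. The class and method names of the wrapper (ColumnRangeScans, accumuloColumnRange, hbaseColumnRange) and the table/row/family/qualifier values are made up; the library calls themselves are from the Accumulo 1.x and HBase 0.98-era clients as I remember them, so treat this as a sketch rather than a tested snippet:

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.ColumnRangeFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class ColumnRangeScans {

  // Accumulo: the Range is built from full Keys, so the column qualifier
  // participates directly in the scan bounds.
  static Scanner accumuloColumnRange(Connector conn) throws TableNotFoundException {
    Scanner s = conn.createScanner("mytable", Authorizations.EMPTY);
    Key start = new Key(new Text("row1"), new Text("cf"), new Text("cq-aaa"));
    Key stop  = new Key(new Text("row1"), new Text("cf"), new Text("cq-zzz"));
    s.setRange(new Range(start, stop));
    return s;
  }

  // HBase: startRow/stopRow only bound the row key; a qualifier range has to
  // be expressed as a ColumnRangeFilter layered on top of the Scan.
  static ResultScanner hbaseColumnRange(HTable table) throws java.io.IOException {
    Scan scan = new Scan(Bytes.toBytes("row1"), Bytes.toBytes("row2"));
    scan.addFamily(Bytes.toBytes("cf"));
    scan.setFilter(new ColumnRangeFilter(
        Bytes.toBytes("cq-aaa"), true,    // min qualifier, inclusive
        Bytes.toBytes("cq-zzz"), false)); // max qualifier, exclusive
    return table.getScanner(scan);
  }
}

The API shapes really do differ: in Accumulo the qualifier bounds are part of the Range the scanner seeks over, while in HBase they are applied as a Filter on top of the row-bounded scan. Whether that explains the filter-performance concern is a separate question.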
> On Fri, Jul 11, 2014 at 9:56 PM, Kepner, Jeremy - 0553 - MITLL <[email protected]> wrote:
>
>> In many communities real applications can be hard to come by. These communities often sponsor the development of application/technology-agnostic benchmarks as a surrogate. YCSB is one example. Graph500 is another example. This allows a technology developer to focus on their own technology and not have to be in the awkward position of having to also benchmark multiple technologies simultaneously. It's not a perfect system, but in general a few mediocre benchmarks tend to be a lot better than no benchmarks (too many benchmarks is also a problem). The benchmarks also help with communicating the work because you can just reference the benchmark.
>>
>> Now the tricky part is educating the customer base in how to interpret the benchmarks. To a certain degree this is simply the burden of being an informed consumer, but we can do as much as we can to help them. Using standard benchmarks and showing how they correlate with some applications and don't correlate with other applications is our obligation.
>>
>> On Jul 11, 2014, at 1:05 AM, Josh Elser <[email protected]> wrote:
>>
>> > It's important to remember that YCSB is a benchmark designed to test a specific workload across database systems.
>> >
>> > No benchmark is going to be representative of every "real-life" workload that can be thought of. Understand what the benchmark is showing and draw realistic conclusions about what the results are showing you WRT the problem(s) you care about.
>> >
>> > On 7/10/14, 11:33 AM, Chuck Adams wrote:
>> >> Dr. Kepner,
>> >>
>> >> Who cares how fast you can load data into a non-indexed HBase or Accumulo database? What is the strategy to handle user queries against this corpus? Run some real-world tests which include simultaneous queries, index maintenance, and information life cycle management during the initial and incremental data loads.
>> >>
>> >> The tests you are running do not appear to have any of the real-world bottlenecks that occur for production systems that users rely on for their business.
>> >>
>> >> Respectfully,
>> >> Chuck Adams
>> >>
>> >> Vice President Technical Leadership Team
>> >> Oracle National Security Group
>> >> 1910 Oracle Way
>> >> Reston, VA 20190
>> >>
>> >> Cell: 301.529.9396
>> >> Email: [email protected]
>> >> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> >>
>> >> -----Original Message-----
>> >> From: Jeremy Kepner [mailto:[email protected]]
>> >> Sent: Thursday, July 10, 2014 11:08 AM
>> >> To: [email protected]
>> >> Subject: Re: How does Accumulo compare to HBase
>> >>
>> >> If you repeated the experiments you did for the Accumulo-only portion with HBase, what would you expect the results to be?
>> >>
>> >> On Thu, Jul 10, 2014 at 10:02:00AM -0500, Mike Drob wrote:
>> >>> At the risk of derailing the original thread, I'll take a moment to explain my methodology. Since each entry can be thought of as a 6-dimensional vector (r, cf, cq, vis, ts, val), there's a lot of room for fiddling with the specifics. YCSB gives you several knobs, but unfortunately not absolutely everything is tunable.
>> >>>
>> >>> The things that are configurable:
>> >>> - # of rows
>> >>> - # of column qualifiers
>> >>> - length of value
>> >>> - number of operations per client
>> >>> - number of clients
>> >>>
>> >>> Things that are not configurable:
>> >>> - row length (it's a long int)
>> >>> - # of column families (constant at one per row)
>> >>> - length of column qualifier (basically a one-up counter per row)
>> >>> - visibilities
>> >>>
>> >>> In all of my experiments, the goal was to keep data size constant. This can be approximated by (number of entries * entry size). Number of entries is intuitively rows (configurable) * column families (1) * column qualifiers per family (configurable), while entry size is key overhead (about 40 bytes) + configured length of value. So to keep total size constant, we have three easy knobs. However, tweaking three values at a time produces really messy data where you're not always going to be sure where the causality arrow lies. Even doing two at a time can cause issues, but then the choice is between tweaking two properties of the data, or one property of the data and the total size (which is also a relevant attribute).
>> >>>
>> >>> Whew. So why did I use two different independent variables between the two halves?
>> >>>
>> >>> Partly because I'm not comparing the two tests to each other, so they don't have to be duplicative. I ran them on different hardware, with different numbers of clients, disks, cores, etc. There are no meaningful comparisons to be drawn, so I wanted to remove the temptation to compare results against each other. I'll admit that I might be wrong in this regard.
>> >>>
>> >>> The graphs are not my complete data sets. For the Accumulo v Accumulo tests, we have about ten more data points varying rows and data size as well. Trying to show three independent variables on a graph was pretty painful, so they didn't make it into the presentation. The short version of the story is that nothing scaled linearly (some things were better, some things were worse), but the general trend lines were approximately what you would expect.
>> >>>
>> >>> Let me know if you have more questions, but we can probably start a new thread for future search posterity! (This applies to everybody.)
>> >>>
>> >>> Mike
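A quick back-of-the-envelope version of the sizing arithmetic Mike describes above. The ~40-byte key overhead is his figure; the class name and the knob values in main() are made-up examples chosen only to show two settings that land on the same total:

public class YcsbSizeEstimate {

  static final long KEY_OVERHEAD_BYTES = 40; // approximate per-entry key overhead

  // total bytes ~= number of entries * entry size, with one column family per row
  static long approxTotalBytes(long rows, long qualifiersPerRow, long valueLength) {
    long entries = rows * 1 /* column families */ * qualifiersPerRow;
    return entries * (KEY_OVERHEAD_BYTES + valueLength);
  }

  public static void main(String[] args) {
    // Two hypothetical settings that target the same ~1 GB of data:
    System.out.println(approxTotalBytes(1_000_000, 10, 60));  // 10M entries * 100 B = 1e9 B
    System.out.println(approxTotalBytes(100_000, 100, 60));   // same total, different shape
  }
}

To hold that product constant you have to move at least two of the three knobs (rows, qualifiers per row, value length) at once, which is exactly the messiness Mike is pointing at.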
>> >>> On Thu, Jul 10, 2014 at 9:26 AM, Kepner, Jeremy - 0553 - MITLL <[email protected]> wrote:
>> >>>
>> >>>> Mike Drob put together a great talk at the Accumulo Summit (http://www.slideshare.net/AccumuloSummit/10-30-drob) discussing Accumulo performance and HBase performance. This is exactly the kind of work the entire Hadoop community needs to continue to move forward.
>> >>>>
>> >>>> I had one question about the talk which I was wondering if someone might be able to shed light on. In the Accumulo part of the talk the experiments varied #rows while keeping #cols fixed, while in the Accumulo/HBase part of the talk the experiments varied #cols while keeping #rows fixed?
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
