Cool, thanks for the report, Ben. For what it's worth, I think there's still some low-hanging fruit in the Spark connector for Kudu (for example, I believe locality on reads is currently broken). So, you can expect performance to continue to improve in future versions. I'd also be interested to see results on Kudu for a much larger dataset - my guess is a lot of the 6 seconds you're seeing is constant overhead from Spark job setup, etc., given that the performance didn't seem to get slower as you went from 700K rows to 13M rows.
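A rough way to check the constant-overhead guess against the timings in Ben's quoted report: this is only a sketch, with the numbers copied by hand from that report, "m" read as minutes, and the names `speedups` and `growth` made up for illustration. It prints the per-run HBase/Kudu ratio for each operation, then how much the Kudu times grew relative to how much the table grew.

```python
# Timings copied from Ben's report and converted to seconds.
# Assumption: "m" in the report means minutes.
# ROWS holds the table sizes at read time for the three runs.
ROWS = [715_000, 5_900_000, 12_700_000]

HBASE = {
    "write": [2.7 * 60, 11 * 60, 9.6 * 60],
    "read": [0.1, 8, 10],
    "count": [3.8, 3 * 60, 3 * 60],
    "aggregate": [61, 46, 44],
}
KUDU = {
    "write": [18, 33, 27],
    "read": [0.2, 0.2, 0.2],
    "count": [18, 16, 16],
    "aggregate": [5, 6, 6],
}


def speedups(hbase, kudu):
    """Per-run ratio of HBase time to Kudu time (above 1 means Kudu was faster)."""
    return {op: [h / k for h, k in zip(hbase[op], kudu[op])] for op in hbase}


def growth(times, rows):
    """Growth in time from first to last run, relative to growth in table size."""
    return (times[-1] / times[0]) / (rows[-1] / rows[0])


if __name__ == "__main__":
    for op, ratios in speedups(HBASE, KUDU).items():
        print(op, [round(r, 1) for r in ratios])
    # A value near 0 means the time barely moved while the table grew,
    # which is what a fixed job-setup cost would look like.
    for op in ("read", "count", "aggregate"):
        print(op, round(growth(KUDU[op], ROWS), 3))
```

The ratios differ quite a bit from run to run, so a single "Nx" figure per operation hides a lot; a larger dataset would separate the fixed cost from the per-row cost much more clearly.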
-Todd

On Tue, Jun 28, 2016 at 3:03 PM, Benjamin Kim <bbuil...@gmail.com> wrote:

> FYI.
>
> I did a quick-n-dirty performance test.
>
> First, the setup:
> QA cluster:
>
>    - 15 data nodes
>       - 64GB memory each
>       - HBase is using 4GB of memory
>       - Kudu is using 1GB of memory
>    - 1 HBase/Kudu master node
>       - 64GB memory
>       - HBase/Kudu master is using 1GB of memory each
>    - 10Gb Ethernet
>
>
> Using Spark on both to load/read events data (84 columns per row), I was
> able to record performance for each. On the HBase side, I used the Phoenix
> 4.7 Spark plugin where DataFrames can be used directly. On the Kudu side, I
> used the Spark connector. I created an events table in Phoenix using the
> CREATE TABLE statement and created the equivalent in Kudu using the Spark
> method based off of a DataFrame schema.
>
> Here are the numbers for Phoenix/HBase.
>
> 1st run:
>    715k rows
>    - write: 2.7 min
>
>    715k rows in HBase table
>    - read: 0.1s
>    - count: 3.8s
>    - aggregate: 61s
>
> 2nd run:
>    5.2M rows
>    - write: 11 min
>    * had 4 region servers go down, had to retry the 5.2M row write
>
>    5.9M rows in HBase table
>    - read: 8s
>    - count: 3 min
>    - aggregate: 46s
>
> 3rd run:
>    6.8M rows
>    - write: 9.6 min
>
>    12.7M rows
>    - read: 10s
>    - count: 3 min
>    - aggregate: 44s
>
>
> Here are the numbers for Kudu.
>
> 1st run:
>    715k rows
>    - write: 18s
>
>    715k rows in Kudu table
>    - read: 0.2s
>    - count: 18s
>    - aggregate: 5s
>
> 2nd run:
>    5.2M rows
>    - write: 33s
>
>    5.9M rows in Kudu table
>    - read: 0.2s
>    - count: 16s
>    - aggregate: 6s
>
> 3rd run:
>    6.8M rows
>    - write: 27s
>
>    12.7M rows in Kudu table
>    - read: 0.2s
>    - count: 16s
>    - aggregate: 6s
>
> The Kudu results are impressive if you take these numbers as-is. Kudu is
> close to 18x faster at writing (UPSERT). Kudu is 30x faster at reading
> (HBase times increase as data size grows). Kudu is 7x faster at full row
> counts. Lastly, Kudu is 3x faster doing an aggregate query (count distinct
> event_ids per user_id).
> *Remember that this is a small cluster, times are still respectable for
> both systems, HBase could have been configured better, and the HBase table
> could have been better tuned.
>
> Cheers,
> Ben
>
>
> On Jun 15, 2016, at 10:13 AM, Dan Burkert <d...@cloudera.com> wrote:
>
> Adding partition splits when range partitioning is done via the
> CreateTableOptions.addSplitRow
> <http://getkudu.io/apidocs/org/kududb/client/CreateTableOptions.html#addSplitRow-org.kududb.client.PartialRow->
> method. You can find more about the different partitioning options in the
> schema design guide
> <http://getkudu.io/docs/schema_design.html#data-distribution>.
> We generally recommend sticking to hash partitioning if possible, since you
> don't have to determine your own split rows.
>
> - Dan
>
> On Wed, Jun 15, 2016 at 9:17 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> Todd,
>>
>> I don't think locality applies to our setup. We have the compute cluster
>> with Spark, YARN, etc. on its own, and we have the storage cluster with
>> HBase, Kudu, etc. on another. We beefed up the hardware specs on the
>> compute cluster and beefed up storage capacity on the storage cluster. We
>> got this setup idea from the Databricks folks. I do have a question. I
>> created the table to use range partitioning on columns. I see that if I
>> use hash partitioning I can set the number of splits, but how do I do that
>> using range (50 nodes * 10 = 500 splits)?
>>
>> Thanks,
>> Ben
>>
>>
>> On Jun 15, 2016, at 9:11 AM, Todd Lipcon <t...@cloudera.com> wrote:
>>
>> Awesome use case. One thing to keep in mind is that Spark parallelism
>> will be limited by the number of tablets. So, you might want to split into
>> 10 or so buckets per node to get the best query throughput.
>>
>> Usually if you run top on some machines while running the query you can
>> see if it is fully utilizing the cores.
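To put rough numbers on the split question above: with range partitioning, N tablets take N - 1 split rows, so 500 tablets would mean 499 addSplitRow calls. Below is a sketch of picking evenly spaced split values. It assumes, hypothetically, an integer range key such as an epoch-seconds timestamp; the function names and the June 2016 bounds are made up for illustration, and even spacing only balances the tablets if the keys are spread roughly uniformly.

```python
def tablets_for(nodes, per_node=10):
    """Target tablet count, using the '10 or so per node' rule of thumb."""
    return nodes * per_node


def range_split_points(lo, hi, tablets):
    """Evenly spaced integer split keys strictly between lo and hi.

    N tablets need N - 1 split points; each value returned here would
    become one split row.
    """
    if tablets < 1 or hi <= lo:
        raise ValueError("need tablets >= 1 and hi > lo")
    step = (hi - lo) / tablets
    points = {lo + round(step * i) for i in range(1, tablets)}
    # Drop anything that landed on a boundary and return in key order.
    return sorted(p for p in points if lo < p < hi)


if __name__ == "__main__":
    # Hypothetical key range: June 2016 in epoch seconds (UTC).
    june_start, july_start = 1464739200, 1467331200
    splits = range_split_points(june_start, july_start, tablets_for(50))
    print(len(splits), splits[:3])
```

Each value would go into one PartialRow passed to addSplitRow. With hash partitioning, as Dan suggests, none of this is needed because you only pick the bucket count.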
>>
>> Another known issue right now is that Spark locality isn't working
>> properly on replicated tables so you will use a lot of network traffic. For
>> a perf test you might want to try a table with replication count 1.
>> On Jun 15, 2016 5:26 PM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>>
>> Hi Todd,
>>
>> I did a simple test of our ad events. We stream using Spark Streaming
>> directly into HBase, and the Data Analysts/Scientists do some
>> insight/discovery work plus some reports generation. For the reports, we
>> use SQL, and for the deeper stuff, we use Spark. In Spark, our main data
>> currency store of choice is DataFrames.
>>
>> The schema is around 83 columns wide where most are of the string data
>> type.
>>
>> "event_type", "timestamp", "event_valid", "event_subtype", "user_ip",
>> "user_id", "mappable_id",
>> "cookie_status", "profile_status", "user_status", "previous_timestamp",
>> "user_agent", "referer",
>> "host_domain", "uri", "request_elapsed", "browser_languages", "acamp_id",
>> "creative_id",
>> "location_id", "pcamp_id",
>> "pdomain_id", "continent_code", "country", "region", "dma", "city",
>> "zip", "isp", "line_speed",
>> "gender", "year_of_birth", "behaviors_read", "behaviors_written",
>> "key_value_pairs", "acamp_candidates",
>> "tag_format", "optimizer_name", "optimizer_version", "optimizer_ip",
>> "pixel_id", "video_id",
>> "video_network_id", "video_time_watched", "video_percentage_watched",
>> "video_media_type",
>> "video_player_iframed", "video_player_in_view", "video_player_width",
>> "video_player_height",
>> "conversion_valid_sale", "conversion_sale_amount",
>> "conversion_commission_amount", "conversion_step",
>> "conversion_currency", "conversion_attribution", "conversion_offer_id",
>> "custom_info", "frequency",
>> "recency_seconds", "cost", "revenue", "optimizer_acamp_id",
>> "optimizer_creative_id", "optimizer_ecpm", "impression_id",
>> "diagnostic_data",
>> "user_profile_mapping_source", "latitude", "longitude",
>> "area_code", "gmt_offset", "in_dst",
>> "proxy_type", "mobile_carrier", "pop", "hostname", "profile_expires",
>> "timestamp_iso", "reference_id",
>> "identity_organization", "identity_method"
>>
>> Most queries are like counts of how many users use what browser, how many
>> are unique users, etc. The part that scares most users is when it comes to
>> joining this data with other dimension/3rd party events tables because of
>> the sheer size of it.
>>
>> We do what most companies do, similar to what I saw in earlier
>> presentations of Kudu. We dump data out of HBase into partitioned Parquet
>> tables to make query performance manageable.
>>
>> I will coordinate with a data scientist today to do some tests. He is
>> working on identity matching/record linking of users from 2 domains: US and
>> Singapore, using probabilistic deduping algorithms. I will load the data
>> from ad events from both countries, and let him run his process against
>> this data in Kudu. I hope this will “wow” the team.
>>
>> Thanks,
>> Ben
>>
>> On Jun 15, 2016, at 12:47 AM, Todd Lipcon <t...@cloudera.com> wrote:
>>
>> Hi Benjamin,
>>
>> What workload are you using for benchmarks? Using Spark or something more
>> custom? RDD or DataFrame or SQL, etc.? Maybe you can share the schema and
>> some queries.
>>
>> Todd
>> On Jun 15, 2016 8:10 AM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>>
>>> Hi Todd,
>>>
>>> Now that Kudu 0.9.0 is out, I have done some tests. Already, I am
>>> impressed. Compared to HBase, read and write performance are better. Write
>>> performance has the greatest improvement (> 4x), while read is > 1.5x.
>>> Albeit, these are only preliminary tests. Do you know of a way to really do
>>> some conclusive tests? I want to see if I can match your results on my 50
>>> node cluster.
>>>
>>> Thanks,
>>> Ben
>>>
>>> On May 30, 2016, at 10:33 AM, Todd Lipcon <t...@cloudera.com> wrote:
>>>
>>> On Sat, May 28, 2016 at 7:12 AM, Benjamin Kim <bbuil...@gmail.com>
>>> wrote:
>>>
>>>> Todd,
>>>>
>>>> It sounds like Kudu can possibly top or match those numbers put out by
>>>> Aerospike. Do you have any performance statistics published or any
>>>> instructions on how to measure them myself as a good way to test? In
>>>> addition, this will be a test using Spark, so should I wait for Kudu
>>>> version 0.9.0 where support will be built in?
>>>>
>>>
>>> We don't have a lot of benchmarks published yet, especially on the write
>>> side. I've found that thorough cross-system benchmarks are very difficult
>>> to do fairly and accurately, and oftentimes users end up misguided if they
>>> pay too much attention to them :) So, given a finite number of developers
>>> working on Kudu, I think we've tended to spend more time on the project
>>> itself and less time focusing on "competition". I'm sure there are use
>>> cases where Kudu will beat out Aerospike, and probably use cases where
>>> Aerospike will beat Kudu as well.
>>>
>>> From my perspective, it would be great if you can share some details of
>>> your workload, especially if there are some areas you're finding Kudu
>>> lacking. Maybe we can spot some easy code changes we could make to improve
>>> performance, or suggest a tuning variable you could change.
>>>
>>> -Todd
>>>
>>>
>>>> On May 27, 2016, at 9:19 PM, Todd Lipcon <t...@cloudera.com> wrote:
>>>>
>>>> On Fri, May 27, 2016 at 8:20 PM, Benjamin Kim <bbuil...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Mike,
>>>>>
>>>>> First of all, thanks for the link. It looks like an interesting read.
>>>>> I checked that Aerospike is currently at version 3.8.2.3, and in the
>>>>> article, they are evaluating version 3.5.4. The main thing that impressed
>>>>> me was their claim that they can beat Cassandra and HBase by 8x for
>>>>> writing and 25x for reading.
>>>>> Their big claim to fame is that Aerospike can write 1M records per
>>>>> second with only 50 nodes. I wanted to see if this is real.
>>>>>
>>>>
>>>> 1M records per second on 50 nodes is pretty doable by Kudu as well,
>>>> depending on the size of your records and the insertion order. I've been
>>>> playing with a ~70 node cluster recently and seen 1M+ writes/second
>>>> sustained, and bursting above 4M. These are 1KB rows with 11 columns, and
>>>> with pretty old HDD-only nodes. I think newer flash-based nodes could do
>>>> better.
>>>>
>>>>
>>>>>
>>>>> To answer your questions, we have a DMP with user profiles with many
>>>>> attributes. We create segmentation information off of these attributes to
>>>>> classify them. Then, we can target advertising appropriately for our sales
>>>>> department. Much of the data processing is for applying models on all, if
>>>>> not most, of every profile’s attributes to find similarities (nearest
>>>>> neighbor/clustering) over a large number of rows when batch processing or
>>>>> a small subset of rows for quick online scoring. So, our use case is a
>>>>> typical advanced analytics scenario. We have tried HBase, but it doesn’t
>>>>> work well for these types of analytics.
>>>>>
>>>>> I read in the Aerospike release notes that they made many improvements
>>>>> for batch and scan operations.
>>>>>
>>>>> I wonder what your thoughts are for using Kudu for this.
>>>>>
>>>>
>>>> Sounds like a good Kudu use case to me. I've heard great things about
>>>> Aerospike for the low latency random access portion, but I've also heard
>>>> that it's _very_ expensive, and not particularly suited to the columnar
>>>> scan workload. Lastly, I think the Apache license of Kudu is much more
>>>> appealing than the AGPL3 used by Aerospike.
>>>> But, that's not really a direct answer to the performance question :)
>>>>
>>>>
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>>
>>>>> On May 27, 2016, at 6:21 PM, Mike Percy <mpe...@cloudera.com> wrote:
>>>>>
>>>>> Have you considered whether you have a scan heavy or a random access
>>>>> heavy workload? Have you considered whether you always access / update a
>>>>> whole row vs only a partial row? Kudu is a column store so it has some
>>>>> awesome performance characteristics when you are doing a lot of scanning
>>>>> of just a couple of columns.
>>>>>
>>>>> I don't know the answer to your question but if your concern is
>>>>> performance then I would be interested in seeing comparisons from a perf
>>>>> perspective on certain workloads.
>>>>>
>>>>> Finally, a year ago Aerospike did quite poorly in a Jepsen test:
>>>>> https://aphyr.com/posts/324-jepsen-aerospike
>>>>>
>>>>> I wonder if they have addressed any of those issues.
>>>>>
>>>>> Mike
>>>>>
>>>>> On Friday, May 27, 2016, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>
>>>>>> I am just curious. How will Kudu compare with Aerospike (
>>>>>> http://www.aerospike.com)? I went to a Spark Roadshow and found out
>>>>>> about this piece of software. It appears to fit our use case perfectly
>>>>>> since we are an ad-tech company trying to leverage our user profiles
>>>>>> data. Plus, it already has a Spark connector and has a SQL-like client.
>>>>>> The tables can be accessed using Spark SQL DataFrames and, also, made
>>>>>> into SQL tables for direct use with Spark SQL ODBC/JDBC Thriftserver. I
>>>>>> see from the work done here http://gerrit.cloudera.org:8080/#/c/2992/
>>>>>> that the Spark integration is well underway and, from the looks of it
>>>>>> lately, almost complete. I would prefer to use Kudu since we are already
>>>>>> a Cloudera shop, and Kudu is easy to deploy and configure using Cloudera
>>>>>> Manager.
>>>>>> I also hope that some of Aerospike’s speed optimization techniques can
>>>>>> make it into Kudu in the future, if they have not already been thought
>>>>>> of or included.
>>>>>>
>>>>>> Just some thoughts…
>>>>>>
>>>>>> Cheers,
>>>>>> Ben
>>>>>
>>>>>
>>>>> --
>>>>> Mike Percy
>>>>> Software Engineer, Cloudera
>>>>>
>>>>
>>>> --
>>>> Todd Lipcon
>>>> Software Engineer, Cloudera
>>>>
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>>
>>
>

--
Todd Lipcon
Software Engineer, Cloudera