Cool, thanks for the report, Ben. For what it's worth, I think there's
still some low-hanging fruit in the Spark connector for Kudu (for example,
I believe locality on reads is currently broken). So, you can expect
performance to continue to improve in future versions. I'd also be
interested to see results on Kudu for a much larger dataset - my guess is
that a lot of the 6 seconds you're seeing is constant overhead from Spark
job setup, etc., given that the performance didn't seem to get slower as
you went from 700K rows to 13M rows.

-Todd

On Tue, Jun 28, 2016 at 3:03 PM, Benjamin Kim <bbuil...@gmail.com> wrote:

> FYI.
>
> I did a quick-n-dirty performance test.
>
> First, the setup:
> QA cluster:
>
>    - 15 data nodes
>       - 64GB memory each
>       - HBase is using 4GB of memory
>       - Kudu is using 1GB of memory
>    - 1 HBase/Kudu master node
>       - 64GB memory
>       - HBase/Kudu master is using 1GB of memory each
>    - 10Gb Ethernet
>
>
> Using Spark on both to load/read events data (84 columns per row), I was
> able to record performance for each. On the HBase side, I used the Phoenix
> 4.7 Spark plugin, where DataFrames can be used directly. On the Kudu side, I
> used the Spark connector. I created an events table in Phoenix using a
> CREATE TABLE statement and created the equivalent in Kudu using the Spark
> connector's create-table method based on a DataFrame schema.
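>
> For reference, the read and write paths on each side looked roughly like
> the sketch below (master/ZooKeeper addresses and table names are
> placeholders, eventsDF stands in for the source DataFrame, and the
> kudu-spark calls follow the connector API as I understand it - method
> names may differ slightly in the 0.9-era org.kududb packages):
>
>   import org.apache.spark.sql.SaveMode
>
>   // Phoenix 4.7 Spark plugin: DataFrames read and written directly
>   val hbaseDF = sqlContext.read
>     .format("org.apache.phoenix.spark")
>     .options(Map("table" -> "EVENTS", "zkUrl" -> "zk1:2181"))
>     .load()
>   eventsDF.write
>     .format("org.apache.phoenix.spark")
>     .mode(SaveMode.Overwrite)  // the plugin issues UPSERTs under Overwrite
>     .options(Map("table" -> "EVENTS", "zkUrl" -> "zk1:2181"))
>     .save()
>
>   // Kudu Spark connector: read as a DataFrame, write via KuduContext
>   import org.kududb.spark.kudu._
>   val kuduDF = sqlContext.read
>     .options(Map("kudu.master" -> "kudu-master:7051", "kudu.table" -> "events"))
>     .kudu
>   val kuduContext = new KuduContext("kudu-master:7051")
>   kuduContext.upsertRows(eventsDF, "events")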
>
> Here are the numbers for Phoenix/HBase.
> 1st run:
> > 715k rows
> - write: 2.7m
>
> > 715k rows in HBase table
> - read: 0.1s
> - count: 3.8s
> - aggregate: 61s
>
> 2nd run:
> > 5.2M rows
> - write: 11m
> * had 4 region servers go down, had to retry the 5.2M row write
>
> > 5.9M rows in HBase table
> - read: 8s
> - count: 3m
> - aggregate: 46s
>
> 3rd run:
> > 6.8M rows
> - write: 9.6m
>
> > 12.7M rows in HBase table
> - read: 10s
> - count: 3m
> - aggregate: 44s
>
>
> Here are the numbers for Kudu.
> 1st run:
> > 715k rows
> - write: 18s
>
> > 715k rows in Kudu table
> - read: 0.2s
> - count: 18s
> - aggregate: 5s
>
> 2nd run:
> > 5.2M rows
> - write: 33s
>
> > 5.9M rows in Kudu table
> - read: 0.2s
> - count: 16s
> - aggregate: 6s
>
> 3rd run:
> > 6.8M rows
> - write: 27s
>
> > 12.7M rows in Kudu table
> - read: 0.2s
> - count: 16s
> - aggregate: 6s
>
> The Kudu results are impressive if you take these numbers as-is. Kudu is
> close to 18x faster at writing (UPSERT). Kudu is 30x faster at reading
> (HBase times increase as the data size grows). Kudu is 7x faster at full row
> counts. Lastly, Kudu is 3x faster doing an aggregate query (count of distinct
> event_id’s per user_id). *Remember that this is a small cluster, times are
> still respectable for both systems, HBase could have been configured
> better, and the HBase table could have been better tuned.
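>
> In DataFrame terms, that aggregate is essentially the following (DataFrame
> and column names are illustrative):
>
>   import org.apache.spark.sql.functions.countDistinct
>
>   // distinct event_ids per user_id over the full events DataFrame
>   val agg = eventsDF
>     .groupBy("user_id")
>     .agg(countDistinct("event_id").as("distinct_events"))
>   agg.count()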
>
> Cheers,
> Ben
>
>
> On Jun 15, 2016, at 10:13 AM, Dan Burkert <d...@cloudera.com> wrote:
>
> When range partitioning is used, partition splits are added via the
> CreateTableOptions.addSplitRow
> <http://getkudu.io/apidocs/org/kududb/client/CreateTableOptions.html#addSplitRow-org.kududb.client.PartialRow->
>  method.
> You can find more about the different partitioning options in the schema
> design guide <http://getkudu.io/docs/schema_design.html#data-distribution>.
> We generally recommend sticking to hash partitioning if possible, since you
> don't have to determine your own split rows.
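>
> As a rough sketch with the Java client (0.9-era org.kududb packages, called
> here from Scala; exact method names may vary slightly by version), the two
> approaches look like this:
>
>   import scala.collection.JavaConverters._
>   import org.kududb.{ColumnSchema, Schema, Type}
>   import org.kududb.client.{CreateTableOptions, KuduClient}
>
>   val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
>   val schema = new Schema(List(
>     new ColumnSchema.ColumnSchemaBuilder("user_id", Type.STRING).key(true).build(),
>     new ColumnSchema.ColumnSchemaBuilder("event_type", Type.STRING).build()
>   ).asJava)
>
>   // Range partitioning: you supply the split rows yourself.
>   val rangeOpts = new CreateTableOptions()
>     .setRangePartitionColumns(List("user_id").asJava)
>   Seq("g", "n", "t").foreach { boundary =>  // arbitrary example boundaries
>     val split = schema.newPartialRow()
>     split.addString("user_id", boundary)
>     rangeOpts.addSplitRow(split)
>   }
>   client.createTable("events_by_range", schema, rangeOpts)
>
>   // Hash partitioning: just pick a bucket count.
>   val hashOpts = new CreateTableOptions()
>     .addHashPartitions(List("user_id").asJava, 32)
>   client.createTable("events_by_hash", schema, hashOpts)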
>
> - Dan
>
> On Wed, Jun 15, 2016 at 9:17 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> Todd,
>>
>> I think locality doesn't apply to our setup. We have the compute cluster
>> with Spark, YARN, etc. on its own, and we have the storage cluster with
>> HBase, Kudu, etc. on another. We beefed up the hardware specs on the
>> compute cluster and the storage capacity on the storage cluster. We
>> got this setup idea from the Databricks folks. I do have a question. I
>> created the table to use range partitioning on columns. I see that if I use
>> hash partitioning I can set the number of buckets, but how do I do that
>> using range partitioning (50 nodes * 10 = 500 splits)?
>>
>> Thanks,
>> Ben
>>
>>
>> On Jun 15, 2016, at 9:11 AM, Todd Lipcon <t...@cloudera.com> wrote:
>>
>> Awesome use case. One thing to keep in mind is that Spark parallelism
>> will be limited by the number of tablets. So, you might want to split into
>> 10 or so buckets per node to get the best query throughput.
>>
>> Usually, if you run top on some machines while running the query, you can
>> see if it is fully utilizing the cores.
>>
>> Another known issue right now is that Spark locality isn't working
>> properly on replicated tables, so you will use a lot of network traffic. For
>> a perf test you might want to try a table with replication count 1.
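>>
>> Concretely, that would be something like the following sketch (placeholder
>> names, and assuming a kudu-spark build that includes KuduContext; the exact
>> signatures may differ slightly by version):
>>
>>   import scala.collection.JavaConverters._
>>   import org.kududb.client.CreateTableOptions
>>   import org.kududb.spark.kudu.KuduContext
>>
>>   val kuduContext = new KuduContext("kudu-master:7051")
>>
>>   // ~10 hash buckets per tablet server, single replica for a pure perf test
>>   val numTabletServers = 50
>>   val options = new CreateTableOptions()
>>     .addHashPartitions(List("user_id", "timestamp").asJava, numTabletServers * 10)
>>     .setNumReplicas(1)
>>
>>   // eventsDF is the DataFrame whose schema the table should mirror
>>   kuduContext.createTable("events_perf", eventsDF.schema,
>>     Seq("user_id", "timestamp"), options)
>>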
>> On Jun 15, 2016 5:26 PM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>>
>> Hi Todd,
>>
>> I did a simple test of our ad events. We stream using Spark Streaming
>> directly into HBase, and the Data Analysts/Scientists do some
>> insight/discovery work plus some report generation. For the reports, we
>> use SQL, and for the deeper stuff, we use Spark. In Spark, our main data
>> currency of choice is DataFrames.
>>
>> The schema is around 83 columns wide, where most are of the string data
>> type.
>>
>> "event_type", "timestamp", "event_valid", "event_subtype", "user_ip",
>> "user_id", "mappable_id",
>> "cookie_status", "profile_status", "user_status", "previous_timestamp",
>> "user_agent", "referer",
>> "host_domain", "uri", "request_elapsed", "browser_languages", "acamp_id",
>> "creative_id",
>> "location_id", “pcamp_id",
>> "pdomain_id", "continent_code", "country", "region", "dma", "city",
>> "zip", "isp", "line_speed",
>> "gender", "year_of_birth", "behaviors_read", "behaviors_written",
>> "key_value_pairs", "acamp_candidates",
>> "tag_format", "optimizer_name", "optimizer_version", "optimizer_ip",
>> "pixel_id", “video_id",
>> "video_network_id", "video_time_watched", "video_percentage_watched",
>> "video_media_type",
>> "video_player_iframed", "video_player_in_view", "video_player_width",
>> "video_player_height",
>> "conversion_valid_sale", "conversion_sale_amount",
>> "conversion_commission_amount", "conversion_step",
>> "conversion_currency", "conversion_attribution", "conversion_offer_id",
>> "custom_info", "frequency",
>> "recency_seconds", "cost", "revenue", “optimizer_acamp_id",
>> "optimizer_creative_id", "optimizer_ecpm", "impression_id",
>> "diagnostic_data",
>> "user_profile_mapping_source", "latitude", "longitude", "area_code",
>> "gmt_offset", "in_dst",
>> "proxy_type", "mobile_carrier", "pop", "hostname", "profile_expires",
>> "timestamp_iso", "reference_id",
>> "identity_organization", "identity_method"
>>
>> Most queries are like counts of how many users use what browser, how many
>> are unique users, etc. The part that scares most users is when it comes to
>> joining this data with other dimension/3rd-party events tables because of
>> the sheer size of it.
>>
>> We do what most companies do, similar to what I saw in earlier
>> presentations of Kudu. We dump data out of HBase into partitioned Parquet
>> tables to make query performance manageable.
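>>
>> In Spark terms, that dump step is basically the following (column and path
>> names are illustrative):
>>
>>   // write the HBase extract out as Parquet, partitioned for pruning
>>   eventsDF.write
>>     .mode("append")
>>     .partitionBy("event_date")
>>     .parquet("hdfs:///warehouse/events_parquet")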
>>
>> I will coordinate with a data scientist today to do some tests. He is
>> working on identity matching/record linkage of users from 2 domains, US and
>> Singapore, using probabilistic deduping algorithms. I will load the ad
>> events data from both countries and let him run his process against
>> this data in Kudu. I hope this will “wow” the team.
>>
>> Thanks,
>> Ben
>>
>> On Jun 15, 2016, at 12:47 AM, Todd Lipcon <t...@cloudera.com> wrote:
>>
>> Hi Benjamin,
>>
>> What workload are you using for benchmarks? Using Spark or something more
>> custom? RDDs, DataFrames, or SQL, etc.? Maybe you can share the schema and
>> some queries.
>>
>> Todd
>>
>> On Jun 15, 2016 8:10 AM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>>
>>> Hi Todd,
>>>
>>> Now that Kudu 0.9.0 is out, I have done some tests. Already, I am
>>> impressed. Compared to HBase, read and write performance are better. Write
>>> performance shows the greatest improvement (> 4x), while read is > 1.5x
>>> faster. Admittedly, these are only preliminary tests. Do you know of a way
>>> to really do some conclusive tests? I want to see if I can match your
>>> results on my 50-node cluster.
>>>
>>> Thanks,
>>> Ben
>>>
>>> On May 30, 2016, at 10:33 AM, Todd Lipcon <t...@cloudera.com> wrote:
>>>
>>> On Sat, May 28, 2016 at 7:12 AM, Benjamin Kim <bbuil...@gmail.com>
>>> wrote:
>>>
>>>> Todd,
>>>>
>>>> It sounds like Kudu can possibly match or top those numbers put out by
>>>> Aerospike. Do you have any performance statistics published, or any
>>>> instructions on how to measure them myself as a good way to test? In
>>>> addition, this will be a test using Spark, so should I wait for Kudu
>>>> version 0.9.0, where Spark support will be built in?
>>>>
>>>
>>> We don't have a lot of benchmarks published yet, especially on the write
>>> side. I've found that thorough cross-system benchmarks are very difficult
>>> to do fairly and accurately, and oftentimes users end up misguided if they
>>> pay too much attention to them :) So, given a finite number of developers
>>> working on Kudu, I think we've tended to spend more time on the project
>>> itself and less time focusing on "competition". I'm sure there are use
>>> cases where Kudu will beat out Aerospike, and probably use cases where
>>> Aerospike will beat Kudu as well.
>>>
>>> From my perspective, it would be great if you can share some details of
>>> your workload, especially if there are some areas you're finding Kudu
>>> lacking. Maybe we can spot some easy code changes we could make to improve
>>> performance, or suggest a tuning variable you could change.
>>>
>>> -Todd
>>>
>>>
>>>> On May 27, 2016, at 9:19 PM, Todd Lipcon <t...@cloudera.com> wrote:
>>>>
>>>> On Fri, May 27, 2016 at 8:20 PM, Benjamin Kim <bbuil...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Mike,
>>>>>
>>>>> First of all, thanks for the link. It looks like an interesting read.
>>>>> I checked that Aerospike is currently at version 3.8.2.3, and in the
>>>>> article, they are evaluating version 3.5.4. The main thing that impressed
>>>>> me was their claim that they can beat Cassandra and HBase by 8x for writing
>>>>> and 25x for reading. Their big claim to fame is that Aerospike can write 1M
>>>>> records per second with only 50 nodes. I wanted to see if this is real.
>>>>>
>>>>
>>>> 1M records per second on 50 nodes is pretty doable by Kudu as well,
>>>> depending on the size of your records and the insertion order. I've been
>>>> playing with a ~70 node cluster recently and seen 1M+ writes/second
>>>> sustained, and bursting above 4M. These are 1KB rows with 11 columns, and
>>>> with pretty old HDD-only nodes. I think newer flash-based nodes could do
>>>> better.
>>>>
>>>>
>>>>>
>>>>> To answer your questions, we have a DMP with user profiles with many
>>>>> attributes. We create segmentation information from these attributes to
>>>>> classify them. Then, we can target advertising appropriately for our sales
>>>>> department. Much of the data processing is for applying models to all, if
>>>>> not most, of every profile’s attributes to find similarities (nearest
>>>>> neighbor/clustering) over a large number of rows when batch processing, or a
>>>>> small subset of rows for quick online scoring. So, our use case is a
>>>>> typical advanced analytics scenario. We have tried HBase, but it doesn’t
>>>>> work well for these types of analytics.
>>>>>
>>>>> I read in the Aerospike release notes that they did make many
>>>>> improvements for batch and scan operations.
>>>>>
>>>>> I wonder what your thoughts are for using Kudu for this.
>>>>>
>>>>
>>>> Sounds like a good Kudu use case to me. I've heard great things about
>>>> Aerospike for the low latency random access portion, but I've also heard
>>>> that it's _very_ expensive, and not particularly suited to the columnar
>>>> scan workload. Lastly, I think the Apache license of Kudu is much more
>>>> appealing than the AGPL3 used by Aerospike. But, that's not really a direct
>>>> answer to the performance question :)
>>>>
>>>>
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>>
>>>>> On May 27, 2016, at 6:21 PM, Mike Percy <mpe...@cloudera.com> wrote:
>>>>>
>>>>> Have you considered whether you have a scan-heavy or a random-access-heavy
>>>>> workload? Have you considered whether you always access / update a
>>>>> whole row vs. only a partial row? Kudu is a column store, so it has some
>>>>> awesome performance characteristics when you are doing a lot of scanning of
>>>>> just a couple of columns.
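>>>>>
>>>>> For example, with a columnar-aware connector like kudu-spark, a query that
>>>>> selects only a couple of columns should only have to scan those columns
>>>>> (illustrative names; kuduDF is a Kudu-backed DataFrame):
>>>>>
>>>>>   // projection pushdown: only user_id and event_type are scanned
>>>>>   kuduDF.select("user_id", "event_type")
>>>>>     .groupBy("event_type")
>>>>>     .count()
>>>>>     .show()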
>>>>>
>>>>> I don't know the answer to your question but if your concern is
>>>>> performance then I would be interested in seeing comparisons from a perf
>>>>> perspective on certain workloads.
>>>>>
>>>>> Finally, a year ago Aerospike did quite poorly in a Jepsen test:
>>>>> https://aphyr.com/posts/324-jepsen-aerospike
>>>>>
>>>>> I wonder if they have addressed any of those issues.
>>>>>
>>>>> Mike
>>>>>
>>>>> On Friday, May 27, 2016, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>
>>>>>> I am just curious. How will Kudu compare with Aerospike (
>>>>>> http://www.aerospike.com)? I went to a Spark Roadshow and found out
>>>>>> about this piece of software. It appears to fit our use case perfectly,
>>>>>> since we are an ad-tech company trying to leverage our user profile data.
>>>>>> Plus, it already has a Spark connector and a SQL-like client. The tables
>>>>>> can be accessed using Spark SQL DataFrames and also made into SQL tables
>>>>>> for direct use with the Spark SQL ODBC/JDBC Thriftserver. I see from the
>>>>>> work done here http://gerrit.cloudera.org:8080/#/c/2992/ that the Spark
>>>>>> integration is well underway and, from the looks of it lately, almost
>>>>>> complete. I would prefer to use Kudu since we are already a Cloudera shop,
>>>>>> and Kudu is easy to deploy and configure using Cloudera Manager. I also
>>>>>> hope that some of Aerospike’s speed optimization techniques can make it
>>>>>> into Kudu in the future, if they have not been already thought of or
>>>>>> included.
>>>>>>
>>>>>> Just some thoughts…
>>>>>>
>>>>>> Cheers,
>>>>>> Ben
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> --
>>>>> Mike Percy
>>>>> Software Engineer, Cloudera
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Todd Lipcon
>>>> Software Engineer, Cloudera
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>>
>>>
>>>
>>
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera
