Thanks for the explanation, Todd. Now waiting for the 0.10.0 parcels to test.
On Fri, Aug 19, 2016 at 2:11 AM +0800, "Todd Lipcon" <[email protected]> wrote:

Hey Ben and Jacky,

Apologies for my late response. As you might have seen on the Kudu blog, a lot of the contributors have been busy wrapping up the 0.10.0 release this week. Answers inline.

On Aug 16, 2016, at 6:05 PM, [email protected] wrote:

Thanks Todd. The Kudu cluster is running on CentOS 7.2, each tablet node has 40 cores, and the test table is about 140GB with 3 replicas, partitioned by hash bucket; I have tried 24 and 120 hash buckets.

I ran one test:
1. Stop all ingestion to the cluster.
2. Randomly upsert 3000 rows once. An upsert is either a new row or an update to an existing row (updating the whole row, not just one or more columns).
3. From the CDH monitoring dashboard, I see the cluster's disk I/O rise from ~300Mb/s to ~1.5Gb/s, then go back to ~300Mb/s 30 minutes or more later.

I checked some of the tablet node INFO logs; they are always doing compaction, compacting anywhere from one to hundreds of thousands of rows.

My questions:

1. Is the maintenance manager rewriting the whole table? Will a single upsert of 3000 rows trigger a rewrite of the whole table?

If those 3000 rows are spread across the whole key space, then yes, it currently will. If, on the other hand, you had a table something like:

  CREATE TABLE t (
    ts INTEGER PRIMARY KEY,
    other_data string,
    ...
  ) DISTRIBUTE BY HASH(ts) INTO 120 BUCKETS

and your inserts were for a small range of time (e.g. concentrated around "now"), then the compaction would only rewrite the portion of the key space that has new rows (or updates) affecting it.

2. Does the background I/O have an impact on scan performance?

Of course it has some. However, it is restricted by default to a single thread, so it should use only a small percentage of the machine's total capacity. I worked on a patch a few years ago to use ioprio_set to mark these I/Os as "low priority", but didn't have enough time to validate that it helped with any workload, and thus it didn't get committed. In case anyone's interested in trying it, I posted the (very old) diff to https://gist.github.com/toddlipcon/faaf8e74b4dae93e668e8bda1118b58a . It will probably need some conflict resolution to apply.

3. About the number of hash partition buckets: I partitioned the table into 24 or 120 buckets. What's the difference in upsert and scan performance, and what is the best practice?

For write performance, I'd recommend several (5-10) tablets per tablet server being actively written to. Individual tablets can handle multi-threaded writes pretty well, though there are some bottlenecks at very high throughputs, so having just one per tablet server would not get peak performance.

On the read side, it's worth noting that _currently_ both the Spark and Impala integrations start only one scanner thread per tablet. So, the number of tablets limits scan parallelism. Given that, I expect you would see better performance with 120 buckets. This is, however, a temporary limitation: we intend to add an API at some point to make it easier for query engines to divide the scan into smaller chunks which can be spread across more threads, regardless of the number of tablets.

4. What is the recommended setting for the tablet server memory hard limit?

It depends how much you want to devote to other applications :) On the write-buffering side, I have seen diminishing returns after 10GB or so. However, you can use a much bigger memory limit and devote an arbitrary amount to the block cache, which will improve both write performance (since you can get a 100% cache hit rate on bloom filters) and read performance (since you will hit the cache more on reads). Of course, it's worth noting that if you stick to a small memory limit for Kudu, the Linux block cache is still used, and you can still get many reads serviced from memory.
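For concreteness, the 120-bucket layout discussed in answers 1 and 3 can also be created through the Kudu Java client, roughly as sketched below. The master address, table name, and column set are placeholders, and (if I remember the packaging right) the 0.9/0.10 parcels ship the client under org.kududb rather than the org.apache.kudu package used here.

  import java.util.Arrays;
  import org.apache.kudu.ColumnSchema;
  import org.apache.kudu.Schema;
  import org.apache.kudu.Type;
  import org.apache.kudu.client.CreateTableOptions;
  import org.apache.kudu.client.KuduClient;

  public class CreateHashPartitionedTable {
    public static void main(String[] args) throws Exception {
      // "kudu-master-1:7051" is a placeholder for your master address(es).
      KuduClient client = new KuduClient.KuduClientBuilder("kudu-master-1:7051").build();
      try {
        Schema schema = new Schema(Arrays.asList(
            new ColumnSchema.ColumnSchemaBuilder("ts", Type.INT64).key(true).build(),
            new ColumnSchema.ColumnSchemaBuilder("other_data", Type.STRING).build()));

        // 120 hash buckets on the key column and 3 replicas, matching the
        // CREATE TABLE ... DISTRIBUTE BY HASH(ts) INTO 120 BUCKETS example above.
        CreateTableOptions options = new CreateTableOptions()
            .addHashPartitions(Arrays.asList("ts"), 120)
            .setNumReplicas(3);

        client.createTable("t", schema, options);
      } finally {
        client.shutdown();
      }
    }
  }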
On Tue, Aug 16, 2016 at 6:09 PM, Benjamin Kim <[email protected]> wrote:

This could be a problem… If this is a bad byproduct brought over from HBase, then this is a common issue for all HBase users. It would be too bad if this also exists in Kudu. We HBase users have been trying to eradicate this for a long time.

The main difference between compaction in Kudu and in HBase is that all of our compaction is "incremental". We also have a design where compaction is always running at the same speed. As you've noticed, you should expect that in a live Kudu cluster there is always some background work consuming I/O.

This has the downside that you're always seeing some performance impact due to it. It also has the huge (IMHO) upside that you are _always_ seeing the performance impact -- in other words, you will never be "surprised" by a compaction starting. This is unlike the design in HBase (and many other LSM-tree designs), where there is a distinct "trigger point" at which compaction starts. In those systems, you can have something performing well during testing and then all of a sudden reach some threshold where the performance profile drastically changes (possibly resulting in getting paged in the middle of the night). Personally, as an operator, I would always pick a system which is _consistently_ a bit slower over one which is sometimes faster but at arbitrary times goes into a slow mode.

One thing we should probably consider is having some sort of very low threshold below which we don't trigger a compaction. The example of a large table with a few thousand inserts is a good one - we are probably better off just waiting until the situation is a little bit worse before starting compaction. We just don't want to wait until it's out of control.

Hope that reasoning makes sense to you, and that your experience testing Kudu has borne it out.

-Todd

[email protected]

From: Todd Lipcon
Date: 2016-08-17 01:58
To: user
Subject: Re: abnormal high disk I/O rate when upsert into kudu table?

Hi Jacky,

Answers inline below.

On Tue, Aug 16, 2016 at 8:13 AM, [email protected] <[email protected]> wrote:

Dear Kudu Developers,

I am a new tester of Kudu. Our Kudu cluster has 3+12 nodes: 3 separate master nodes and 12 tablet nodes. Each node has 128GB of memory, 1 SSD for the WAL, and 6 1TB SAS disks for data. We are using CDH 5.7.0 with the impala-kudu 2.7.0 and kudu 0.9.1 parcels, and we set a 16GB memory hard limit for each tablet node.

Sounds like a good cluster setup. Thanks for providing the details.

One of our test tables has about 80-100 columns plus 1 key column. With the Java client, we can insert/upsert into the Kudu table at about 100,000 rows/s. The table has 300m rows and about 300,000 row updates per day; we also use the Java client upsert API to update the rows. We found that the Kudu cluster encounters an abnormally high disk I/O rate, about 1.5-2.0Gb/s, even when we update only 1,000~10,000 rows/s. I would like to know: with our row update frequency, is this high disk I/O rate normal or not?
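For readers who haven't used it, the "java client upsert API" mentioned above boils down to roughly the pattern below. Again just a sketch: the table name, column names, master address, and batch size are placeholders, and the same package-name caveat as the earlier sketch applies.

  import org.apache.kudu.client.KuduClient;
  import org.apache.kudu.client.KuduSession;
  import org.apache.kudu.client.KuduTable;
  import org.apache.kudu.client.PartialRow;
  import org.apache.kudu.client.SessionConfiguration;
  import org.apache.kudu.client.Upsert;

  public class UpsertSketch {
    public static void main(String[] args) throws Exception {
      KuduClient client = new KuduClient.KuduClientBuilder("kudu-master-1:7051").build();
      try {
        KuduTable table = client.openTable("t");
        KuduSession session = client.newSession();
        // Batch mutations on the client and flush explicitly for throughput.
        session.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH);
        session.setMutationBufferSpace(4096); // enough room for this sketch's 3000 ops

        for (long ts = 0; ts < 3000; ts++) {
          // An upsert inserts the row if the key is new, or overwrites it if the key exists.
          Upsert upsert = table.newUpsert();
          PartialRow row = upsert.getRow();
          row.addLong("ts", ts);
          row.addString("other_data", "value-" + ts);
          session.apply(upsert);
        }
        session.flush();
        session.close();
      } finally {
        client.shutdown();
      }
    }
  }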
Are your upserts randomly spread across the range of rows in the table? If so, then when the updates flush, they'll trigger compactions of the updates and inserted rows into the existing data. This will cause, over time, a rewrite of the whole table in order to incorporate the updates.

This background I/O is run by the "maintenance manager". You can visit http://tablet-server:8050/maintenance-manager to see a dashboard of the currently running maintenance operations, such as compactions. The maintenance manager runs a preset number of threads, so the amount of background I/O you're experiencing won't increase if you increase the number of upserts.

I'm curious: is the background I/O causing an issue, or is it just unexpected?

Thanks,
-Todd

--
Todd Lipcon
Software Engineer, Cloudera
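As a footnote for anyone tuning this: the memory hard limit, block cache size, and maintenance manager thread count discussed in this thread are tablet server gflags. To the best of my knowledge the relevant flags are the ones below, with only illustrative values; double-check the names against your version (kudu-tserver --helpfull lists them).

  --memory_limit_hard_bytes=17179869184    (the 16GB hard limit used on this cluster)
  --block_cache_capacity_mb=4096           (block cache size, in MB)
  --maintenance_manager_num_threads=1      (the "preset number of threads" doing background maintenance; a single thread by default)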
