Alexey Serbin created KUDU-3565:
-----------------------------------

             Summary: Avoid scheduling multiple IO-heavy maintenance operations 
at once for the same tablet
                 Key: KUDU-3565
                 URL: https://issues.apache.org/jira/browse/KUDU-3565
             Project: Kudu
          Issue Type: Improvement
          Components: master, tserver
    Affects Versions: 1.17.0, 1.16.0, 1.15.0, 1.14.0, 1.13.0, 1.11.1, 1.12.0, 
1.11.0, 1.10.1, 1.10.0, 1.9.0, 1.7.1, 1.8.0, 1.7.0, 1.6.0, 1.5.0, 1.4.0, 1.3.1, 
1.3.0, 1.2.0, 1.1.0, 1.0.1, 1.0.0
            Reporter: Alexey Serbin


The current implementation of scheduling maintenance operations inadvertently 
induces much more disk IO demand than necessary.  When there is more than one 
maintenance worker thread, the maintenance manager thread can schedule multiple 
IO-heavy operations _on the same tablet_ to run concurrently.  This sometimes 
creates bottlenecks by saturating disk IO throughput, since maintenance worker 
threads working on the same tablet access data backed by the same set of drives.

Unsurprisingly, such behavior doesn't scale well as the number of maintenance 
worker threads increases, especially when data directories are backed by 
spinning drives (HDD).

With four maintenance threads, I saw log messages like the ones below.  Notice 
that when four operations are scheduled to run almost at the same time, each 
takes more than 3 seconds to complete.  Other instances of similar flush 
operations on the same tablet take less than 100ms when no other flush 
operation on that tablet is running.  Notice also that not only is 
fdatasync_us higher for the concurrently running operations, but some of them 
are now waiting on synchronization primitives, blocking the maintenance threads 
from doing other useful work, e.g. compacting/flushing data of other active 
tablets.  In the case of MRS/DMS flushes, the flusher acquires the corresponding 
synchronization primitives as a writer, blocking concurrently running 
insert/update/delete operations, which acquire the same primitives as readers 
when checking for a primary key's presence.  So, long-running flush operations 
add extra latency to concurrently running write workloads.
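
To illustrate the last point, below is a minimal, self-contained sketch (this is 
not Kudu's actual code; the TabletState, ApplyWrite() and FlushMemStore() names 
are hypothetical) of how a flush that grabs a reader-writer lock as a writer 
stalls concurrent write operations that grab the same lock as readers:

{code:cpp}
#include <mutex>
#include <shared_mutex>

// Hypothetical stand-in for per-tablet state guarded by a reader-writer lock.
struct TabletState {
  std::shared_mutex component_lock;
  // ... in-memory row/delta stores, etc.
};

// Write path: looks up the primary key and applies an insert/update/delete
// under a shared (reader) lock.
void ApplyWrite(TabletState& tablet) {
  std::shared_lock<std::shared_mutex> guard(tablet.component_lock);
  // Check the key's presence and apply the mutation.
  // Stalls here for as long as a flush holds the lock exclusively.
}

// Flush path: takes the same lock exclusively (as a writer).  If this critical
// section becomes slow, e.g. because the drives are saturated by other flushes
// running on the same tablet, every ApplyWrite() call queued behind it
// inherits that latency.
void FlushMemStore(TabletState& tablet) {
  std::unique_lock<std::shared_mutex> guard(tablet.component_lock);
  // Swap in a fresh in-memory store and write the old one out.
}
{code}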

It's necessary to improve the scheduling algorithm for maintenance operations, 
differentiating operations not only by their score, but also taking into account 
their IO impact and the dependencies between them.  As the simplest approach, it 
would be enough to add an option enabling the following behavior:
* an IO-heavy operation (such as a flush, a merge compaction, etc.) is not 
scheduled while another one is already scheduled or running on the same tablet 
(see the sketch below)
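
A rough sketch of such a guard in the scheduler could look like the following 
(hypothetical names throughout: the MaintenanceOp fields, PickNextOp(), and the 
serialize_io_heavy_per_tablet option do not exist in this form in the real 
maintenance manager), assuming each registered operation can report which tablet 
it touches and whether it is IO-heavy:

{code:cpp}
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified view of a registered maintenance operation.
struct MaintenanceOp {
  std::string tablet_id;
  bool io_heavy;       // flush, merge compaction, etc.
  bool running;        // already launched on a worker thread
  double perf_score;
};

// Picks the best-scoring runnable operation.  When the (hypothetical)
// serialize_io_heavy_per_tablet option is on, an IO-heavy operation is skipped
// if another IO-heavy operation is already running on the same tablet.
const MaintenanceOp* PickNextOp(const std::vector<MaintenanceOp>& ops,
                                bool serialize_io_heavy_per_tablet) {
  std::unordered_set<std::string> busy_tablets;
  if (serialize_io_heavy_per_tablet) {
    for (const auto& op : ops) {
      if (op.running && op.io_heavy) {
        busy_tablets.insert(op.tablet_id);
      }
    }
  }
  const MaintenanceOp* best = nullptr;
  for (const auto& op : ops) {
    if (op.running) {
      continue;
    }
    if (serialize_io_heavy_per_tablet && op.io_heavy &&
        busy_tablets.count(op.tablet_id) > 0) {
      continue;  // another IO-heavy op already owns this tablet's disk budget
    }
    if (best == nullptr || op.perf_score > best->perf_score) {
      best = &op;
    }
  }
  return best;
}
{code}

With such a guard, the four FlushDeltaMemStoresOp instances in the log below 
would be launched one after another instead of piling up on the same tablet's 
drives.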

{noformat}
I0328 14:36:18.732758 1279870 maintenance_manager.cc:382] P db78de4070874e75be1482b02b5161e6: Scheduling FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035118
I0328 14:36:18.816906 1278984 maintenance_manager.cc:603] P db78de4070874e75be1482b02b5161e6: FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real 0.084s     user 0.017s     sys 0.005s Metrics: {"bytes_written":33228,"delete_count":0,"fdatasync":3,"fdatasync_us":56207,"lbm_write_time_us":115,"lbm_writes_lt_1ms":6,"reinsert_count":0,"update_count":7500}
...
I0328 14:36:18.848975 1279870 maintenance_manager.cc:382] P db78de4070874e75be1482b02b5161e6: Scheduling FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035150
I0328 14:36:18.962837 1279870 maintenance_manager.cc:382] P db78de4070874e75be1482b02b5161e6: Scheduling FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035182
I0328 14:36:19.073328 1279870 maintenance_manager.cc:382] P db78de4070874e75be1482b02b5161e6: Scheduling FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035214
I0328 14:36:19.182384 1279870 maintenance_manager.cc:382] P db78de4070874e75be1482b02b5161e6: Scheduling FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035244
...
I0328 14:36:22.128937 1278984 maintenance_manager.cc:603] P db78de4070874e75be1482b02b5161e6: FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real 3.280s     user 0.021s     sys 0.005s Metrics: {"bytes_written":35275,"delete_count":0,"fdatasync":3,"fdatasync_us":167953,"lbm_write_time_us":124,"lbm_writes_lt_1ms":7,"reinsert_count":0,"update_count":7896}
I0328 14:36:22.241256 1278985 maintenance_manager.cc:603] P db78de4070874e75be1482b02b5161e6: FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real 3.278s     user 0.021s     sys 0.000s Metrics: {"bytes_written":28910,"delete_count":0,"fdatasync":3,"fdatasync_us":99705,"lbm_write_time_us":123,"lbm_writes_lt_1ms":6,"mutex_wait_us":3081560,"reinsert_count":0,"update_count":6381}
...
I0328 14:36:22.292019 1278986 maintenance_manager.cc:603] P db78de4070874e75be1482b02b5161e6: FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real 3.219s     user 0.018s     sys 0.003s Metrics: {"bytes_written":30826,"delete_count":0,"fdatasync":3,"fdatasync_us":84332,"lbm_write_time_us":65,"lbm_writes_lt_1ms":6,"mutex_wait_us":3106656,"reinsert_count":0,"update_count":6915}
I0328 14:36:22.341326 1278983 maintenance_manager.cc:603] P db78de4070874e75be1482b02b5161e6: FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real 3.159s     user 0.015s     sys 0.002s Metrics: {"bytes_written":34786,"delete_count":0,"fdatasync":3,"fdatasync_us":326326,"lbm_write_time_us":109,"lbm_writes_lt_1ms":7,"mutex_wait_us":2807569,"reinsert_count":0,"update_count":7866}
{noformat}


