[ https://issues.apache.org/jira/browse/KUDU-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexey Serbin updated KUDU-3565:
--------------------------------
    Description: 
The current implementation of scheduling maintenance operations inadvertently induces much more disk IO demand than necessary: the maintenance manager thread can schedule multiple IO-heavy operations _on the same tablet_ to run concurrently when there is more than one maintenance worker thread. This sometimes creates bottlenecks by saturating disk IO throughput, since the maintenance worker threads access data backed by the same set of drives when working on the same tablet. Apparently, such behavior doesn't scale well when increasing the number of maintenance worker threads, especially when data directories are backed by spinning drives (HDD).

With four maintenance threads, I saw logs like the ones below. Notice that when four operations are scheduled to run almost at the same time, each takes more than 3 seconds to complete, while other instances of similar flush operations on the same tablet take less than 100ms when no other flush operation on that tablet is running. Notice also that not only is fdatasync_us higher for the concurrently running operations, but some of them are now waiting on synchronization primitives, blocking the maintenance threads from doing other useful work, e.g. compacting/flushing data of other active tablets. In the case of MRS/DMS flushes, the flusher acquires the corresponding synchronization primitives as a writer, blocking concurrently running insert/update/delete operations, since those acquire the same primitives as readers when checking for a primary key's presence. So, long-running flush operations induce extra latency into concurrently running write workloads.

It's necessary to improve the scheduling algorithm for maintenance operations, differentiating operations not only by their score, but also taking into account their IO impact and the dependencies between them.
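For illustration only, the kind of per-tablet guard proposed here could be sketched as below. This is a minimal standalone sketch, not Kudu's actual MaintenanceManager API; all type and method names (OpCandidate, Scheduler::PickOp, etc.) are hypothetical:

```cpp
// Hypothetical sketch of a per-tablet guard for IO-heavy maintenance ops.
// Names are illustrative and do not match Kudu's actual implementation.
#include <mutex>
#include <set>
#include <string>
#include <vector>

struct OpCandidate {
  std::string tablet_id;
  bool io_heavy;      // e.g. an MRS/DMS flush or a merge compaction
  double perf_score;  // higher score means more urgent
};

class Scheduler {
 public:
  // Returns the best-scored runnable op, skipping IO-heavy candidates
  // whose tablet already has an IO-heavy op scheduled or running.
  const OpCandidate* PickOp(const std::vector<OpCandidate>& candidates) {
    std::lock_guard<std::mutex> l(lock_);
    const OpCandidate* best = nullptr;
    for (const auto& c : candidates) {
      if (c.io_heavy && busy_tablets_.count(c.tablet_id) > 0) {
        continue;  // another IO-heavy op is already active on this tablet
      }
      if (best == nullptr || c.perf_score > best->perf_score) {
        best = &c;
      }
    }
    if (best != nullptr && best->io_heavy) {
      busy_tablets_.insert(best->tablet_id);
    }
    return best;
  }

  // Called by a worker thread once its IO-heavy op completes.
  void OpFinished(const std::string& tablet_id) {
    std::lock_guard<std::mutex> l(lock_);
    busy_tablets_.erase(tablet_id);
  }

 private:
  std::mutex lock_;
  std::set<std::string> busy_tablets_;  // tablets with an active IO-heavy op
};
```

With this guard, a second flush on the same tablet would simply not be picked until the first one reports completion, so worker threads stay free to serve other tablets.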
As a simplest approach, it's necessary to add an option to enable the following behavior:
* an IO-heavy operation (such as a flush, a merge compaction, etc.) should not be scheduled when there is another one scheduled to run or already running on the same tablet

{noformat}
I0328 14:36:18.732758 1279870 maintenance_manager.cc:382] P db78de4070874e75be1482b02b5161e6: Scheduling FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035118
I0328 14:36:18.816906 1278984 maintenance_manager.cc:603] P db78de4070874e75be1482b02b5161e6: FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real 0.084s user 0.017s sys 0.005s Metrics: {"bytes_written":33228,"delete_count":0,"fdatasync":3,"fdatasync_us":56207,"lbm_write_time_us":115,"lbm_writes_lt_1ms":6,"reinsert_count":0,"update_count":7500}
...
I0328 14:36:18.848975 1279870 maintenance_manager.cc:382] P db78de4070874e75be1482b02b5161e6: Scheduling FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035150
I0328 14:36:18.962837 1279870 maintenance_manager.cc:382] P db78de4070874e75be1482b02b5161e6: Scheduling FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035182
I0328 14:36:19.073328 1279870 maintenance_manager.cc:382] P db78de4070874e75be1482b02b5161e6: Scheduling FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035214
I0328 14:36:19.182384 1279870 maintenance_manager.cc:382] P db78de4070874e75be1482b02b5161e6: Scheduling FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035244
...
I0328 14:36:22.128937 1278984 maintenance_manager.cc:603] P db78de4070874e75be1482b02b5161e6: FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real 3.280s user 0.021s sys 0.005s Metrics: {"bytes_written":35275,"delete_count":0,"fdatasync":3,"fdatasync_us":167953,"lbm_write_time_us":124,"lbm_writes_lt_1ms":7,"reinsert_count":0,"update_count":7896}
I0328 14:36:22.241256 1278985 maintenance_manager.cc:603] P db78de4070874e75be1482b02b5161e6: FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real 3.278s user 0.021s sys 0.000s Metrics: {"bytes_written":28910,"delete_count":0,"fdatasync":3,"fdatasync_us":99705,"lbm_write_time_us":123,"lbm_writes_lt_1ms":6,"mutex_wait_us":3081560,"reinsert_count":0,"update_count":6381}
...
I0328 14:36:22.292019 1278986 maintenance_manager.cc:603] P db78de4070874e75be1482b02b5161e6: FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real 3.219s user 0.018s sys 0.003s Metrics: {"bytes_written":30826,"delete_count":0,"fdatasync":3,"fdatasync_us":84332,"lbm_write_time_us":65,"lbm_writes_lt_1ms":6,"mutex_wait_us":3106656,"reinsert_count":0,"update_count":6915}
I0328 14:36:22.341326 1278983 maintenance_manager.cc:603] P db78de4070874e75be1482b02b5161e6: FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real 3.159s user 0.015s sys 0.002s Metrics: {"bytes_written":34786,"delete_count":0,"fdatasync":3,"fdatasync_us":326326,"lbm_write_time_us":109,"lbm_writes_lt_1ms":7,"mutex_wait_us":2807569,"reinsert_count":0,"update_count":7866}
{noformat}


> Avoid scheduling multiple IO-heavy maintenance operations at once for the same tablet
> -------------------------------------------------------------------------------------
>
>                 Key: KUDU-3565
>                 URL: https://issues.apache.org/jira/browse/KUDU-3565
>             Project: Kudu
>          Issue Type: Improvement
>          Components: master, tserver
>    Affects Versions: 1.0.0, 1.0.1, 1.1.0, 1.2.0, 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.12.0, 1.11.1, 1.13.0, 1.14.0, 1.15.0, 1.16.0, 1.17.0
>            Reporter: Alexey Serbin
>            Priority: Major
>


-- 
This message was sent by Atlassian Jira
(v8.20.10#820010)