[ https://issues.apache.org/jira/browse/KUDU-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Serbin updated KUDU-3565:
--------------------------------
    Description: 
The current implementation of scheduling maintenance operations inadvertently 
induces much more disk IO demand than necessary.  The maintenance manager thread 
can schedule multiple IO-heavy operations _on the same tablet_ to run 
concurrently when there is more than one maintenance worker thread.  This 
sometimes creates bottlenecks by saturating disk IO throughput, since the 
maintenance worker threads access data backed by the same set of drives when 
working on the same tablet.

Apparently, such behavior doesn't scale well as the number of maintenance 
worker threads increases, especially when data directories are backed by 
spinning drives (HDDs).

With four maintenance threads, I saw logs like the ones below.  Notice that 
when four operations are scheduled to run almost at the same time, each takes 
more than 3 seconds to run.  Other instances of similar flush operations on the 
same tablet take less than 100ms to complete when no other flush operation on 
the same tablet is running.  Notice that not only is fdatasync_us higher for 
concurrently running operations, but some of them are now waiting on 
synchronization primitives, blocking the maintenance threads from doing other 
useful work, e.g. compacting/flushing data of other active tablets.  In the 
case of MRS/DMS flushes, the flusher that acquires the corresponding 
synchronization primitives as a writer blocks concurrently running 
insert/update/delete operations, since those acquire the same synchronization 
primitives as readers when checking for a primary key's presence.  So, 
long-running flush operations introduce extra latency into concurrently running 
write workloads.
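
To illustrate the locking pattern described above, here is a minimal, 
self-contained C++ sketch (the names are hypothetical; this is not Kudu's 
actual locking code): the flusher holds the synchronization primitive in 
exclusive mode for the duration of the IO-bound flush, while write operations 
only need it in shared mode for the primary key presence check.

{noformat}
#include <mutex>
#include <shared_mutex>

// Hypothetical per-memstore lock; it only illustrates the writer/reader
// pattern described above, not Kudu's actual implementation.
std::shared_mutex component_lock;

void FlushMemStore() {
  // The flusher takes the lock in exclusive (writer) mode; if the flush is
  // slow because several flushes saturate the same drives, the lock stays
  // held for the whole multi-second flush.
  std::unique_lock<std::shared_mutex> guard(component_lock);
  // ... write out the memstore contents and fdatasync() ...
}

void ApplyWrite() {
  // Each insert/update/delete takes the same lock in shared (reader) mode
  // while checking for the primary key's presence, so it stalls behind a
  // long-running flush, adding latency to the write workload.
  std::shared_lock<std::shared_mutex> guard(component_lock);
  // ... probe the key and apply the mutation ...
}
{noformat}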

It's necessary to improve the scheduling algorithm for maintenance operations, 
differentiating operations not only by their score, but also taking into 
account their IO impact and the dependencies between them.  As the simplest 
approach, it's necessary to add an option enabling the following behavior (see 
the sketch after the log excerpt below):
* an IO-heavy operation (such as a flush, merge compaction, etc.) should not be 
scheduled when another one is already scheduled or running on the same tablet

{noformat}
I0328 14:36:18.732758 1279870 maintenance_manager.cc:382] P 
db78de4070874e75be1482b02b5161e6: Scheduling 
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035118
I0328 14:36:18.816906 1278984 maintenance_manager.cc:603] P 
db78de4070874e75be1482b02b5161e6: 
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real 
0.084s     user 0.017s     sys 0.005s Metrics: 
{"bytes_written":33228,"delete_count":0,"fdatasync":3,"fdatasync_us":56207,"lbm_write_time_us":115,",lbm_writes_lt_1ms":6,"reinsert_count":0,"update_count":7500}
...
I0328 14:36:18.848975 1279870 maintenance_manager.cc:382] P 
db78de4070874e75be1482b02b5161e6: Scheduling 
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035150
I0328 14:36:18.962837 1279870 maintenance_manager.cc:382] P 
db78de4070874e75be1482b02b5161e6: Scheduling 
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035182
I0328 14:36:19.073328 1279870 maintenance_manager.cc:382] P 
db78de4070874e75be1482b02b5161e6: Scheduling 
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035214
I0328 14:36:19.182384 1279870 maintenance_manager.cc:382] P 
db78de4070874e75be1482b02b5161e6: Scheduling 
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035244
...
I0328 14:36:22.128937 1278984 maintenance_manager.cc:603] P 
db78de4070874e75be1482b02b5161e6: 
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real 
3.280s     user 0.021s     sys 0.005s Metrics: 
{"bytes_written":35275,"delete_count":0,"fdatasync":3,"fdatasync_us":167953,"lbm_write_time_us":124,"lbm_writes_lt_1ms":7,"reinsert_count":0,"update_count":7896}
I0328 14:36:22.241256 1278985 maintenance_manager.cc:603] P 
db78de4070874e75be1482b02b5161e6: 
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real 
3.278s     user 0.021s     sys 0.000s Metrics: 
{"bytes_written":28910,"delete_count":0,"fdatasync":3,"fdatasync_us":99705,"lbm_write_time_us":123,"lbm_writes_lt_1ms":6,"mutex_wait_us":3081560,"reinsert_count":0,"update_count":6381}
...
I0328 14:36:22.292019 1278986 maintenance_manager.cc:603] P 
db78de4070874e75be1482b02b5161e6: 
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real 
3.219s     user 0.018s     sys 0.003s Metrics: 
{"bytes_written":30826,"delete_count":0,"fdatasync":3,"fdatasync_us":84332,"lbm_write_time_us":65,"lbm_writes_lt_1ms":6,"mutex_wait_us":3106656,"reinsert_count":0,"update_count":6915}
I0328 14:36:22.341326 1278983 maintenance_manager.cc:603] P 
db78de4070874e75be1482b02b5161e6: 
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real 
3.159s     user 0.015s     sys 0.002s Metrics: 
{"bytes_written":34786,"delete_count":0,"fdatasync":3,"fdatasync_us":326326,"lbm_write_time_us":109,"lbm_writes_lt_1ms":7,"mutex_wait_us":2807569,"reinsert_count":0,"update_count":7866}
{noformat}
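
Below is a minimal sketch of the proposed scheduling guard (all type, field, 
and flag names here are hypothetical, not Kudu's existing API): when the option 
is enabled, the scheduler skips an IO-heavy candidate whose tablet already has 
another IO-heavy operation scheduled or running, and picks the best remaining 
candidate by perf score instead.

{noformat}
#include <string>
#include <unordered_set>
#include <vector>

// Minimal, hypothetical model of the scheduling decision; the types and
// names below are illustrative only, not Kudu's actual classes.
struct OpInfo {
  std::string tablet_id;
  bool io_heavy;      // e.g. an MRS/DMS flush or a merge compaction
  double perf_score;  // same notion as the "perf score" in the log above
};

// Tablets that already have an IO-heavy operation scheduled or running.
// The real maintenance manager would track this under its own lock.
using InFlightSet = std::unordered_set<std::string>;

// Pick the highest-scoring runnable op, skipping IO-heavy ops whose tablet
// already has another IO-heavy op in flight (the proposed optional behavior).
const OpInfo* PickNextOp(const std::vector<OpInfo>& candidates,
                         const InFlightSet& io_heavy_in_flight,
                         bool serialize_io_heavy_per_tablet) {
  const OpInfo* best = nullptr;
  for (const OpInfo& op : candidates) {
    if (serialize_io_heavy_per_tablet && op.io_heavy &&
        io_heavy_in_flight.count(op.tablet_id) > 0) {
      continue;  // would contend for the same drives; leave it for later
    }
    if (best == nullptr || op.perf_score > best->perf_score) {
      best = &op;
    }
  }
  return best;  // nullptr if nothing is currently schedulable
}
{noformat}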


> Avoid scheduling multiple IO-heavy maintenance operations at once for the 
> same tablet
> -------------------------------------------------------------------------------------
>
>                 Key: KUDU-3565
>                 URL: https://issues.apache.org/jira/browse/KUDU-3565
>             Project: Kudu
>          Issue Type: Improvement
>          Components: master, tserver
>    Affects Versions: 1.0.0, 1.0.1, 1.1.0, 1.2.0, 1.3.0, 1.3.1, 1.4.0, 1.5.0, 
> 1.6.0, 1.7.0, 1.8.0, 1.7.1, 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.12.0, 1.11.1, 
> 1.13.0, 1.14.0, 1.15.0, 1.16.0, 1.17.0
>            Reporter: Alexey Serbin
>            Priority: Major
>


