[kudu-CR] KUDU-2693. Buffer blocks while flushing rowsets
helifu has posted comments on this change. ( http://gerrit.cloudera.org:8080/12425 )

Change subject: KUDU-2693. Buffer blocks while flushing rowsets
..

Patch Set 1: (1 comment)

http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG@18
PS1, Line 18: By consolidating the writes into a small number of containers, the
           : number of disk seeks and fsync calls can also be substantially reduced.

1) For the number of fsync calls: it would be only one per DRS if DRS-colocated data were written into one container ("row groups"), which is a little radical.
2) For the number of disk seeks: it is indeed reduced for writing, but not for reading.

In general, though, it's a very good idea: disk I/O is mitigated while writing, and that will help improve reading (scans).

--
To view, visit http://gerrit.cloudera.org:8080/12425
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacc662ba812ece8e68b0ef28f4ccdf0b7475fbc0
Gerrit-Change-Number: 12425
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon
Gerrit-Reviewer: Adar Dembo
Gerrit-Reviewer: Andrew Wong
Gerrit-Reviewer: Andrew Wong
Gerrit-Reviewer: Grant Henke
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy
Gerrit-Reviewer: Todd Lipcon
Gerrit-Reviewer: helifu
Gerrit-Comment-Date: Tue, 19 Mar 2019 09:11:27 +0000
Gerrit-HasComments: Yes
[kudu-CR] KUDU-2693. Buffer blocks while flushing rowsets
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/12425 )

Change subject: KUDU-2693. Buffer blocks while flushing rowsets
..

Patch Set 1: (2 comments)

http://gerrit.cloudera.org:8080/#/c/12425/1/src/kudu/fs/block_manager-stress-test.cc
File src/kudu/fs/block_manager-stress-test.cc:

http://gerrit.cloudera.org:8080/#/c/12425/1/src/kudu/fs/block_manager-stress-test.cc@317
PS1, Line 317: CHECK_OK(bm_->CreateBlock(CreateBlockOptions().with_tablet_id(test_tablet_name_), ));
Would also be interesting to choose some blocks at random to buffer.

http://gerrit.cloudera.org:8080/#/c/12425/1/src/kudu/fs/block_manager-stress-test.cc@350
PS1, Line 350: CHECK_OK(dirty_block->Finalize());
And maybe randomly select some blocks NOT to finalize.

--
Gerrit-Comment-Date: Wed, 13 Feb 2019 17:39:29 +0000
Gerrit-HasComments: Yes
[kudu-CR] KUDU-2693. Buffer blocks while flushing rowsets
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/12425 )

Change subject: KUDU-2693. Buffer blocks while flushing rowsets
..

Patch Set 1: (1 comment)

http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG@13
PS1, Line 13: This means that, instead of each flush
           : requiring its own LBM container, many of the output blocks can be
           : consolidated into a small number of new containers (typically one per
           : disk).
> Yea, the skepticism there is that, even if we put the blocks for a column i

We talked about this on Slack; I'm posting a summary for posterity.

The flush itself preserves key ordering within the rowset. As such, ordering rowsets by "time of flush" could yield a sequential scan (provided it's UNORDERED). However, that flush-time ordering is discarded when a new rowset enters the RowSetTree: when queried, the tree will yield intersecting rowsets in start-key order. If we cared about sequential block access within a container, we'd have to reorder the rowsets by rowset ID before scanning.

Sequential scanning vs. buffering of written data presents a read/write tradeoff, so even long term it may make sense to support both and allow users to choose on a per-table basis.

Could you include a brief summary of this discussion in your commit message?

--
Gerrit-Comment-Date: Wed, 13 Feb 2019 07:47:41 +0000
Gerrit-HasComments: Yes
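The reordering step described in the comment above could be sketched as follows. This is a minimal illustration, not Kudu's actual types: the RowSet pair and the function name are hypothetical.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// A RowSetTree query yields intersecting rowsets in start-key order, but
// sequential access within a container follows flush (rowset-ID) order,
// so we'd re-sort the hits by rowset ID before scanning.
using RowSet = std::pair<int, std::string>;  // {rowset ID, start key}

std::vector<int> IdsInContainerOrder(std::vector<RowSet> hits) {
  std::sort(hits.begin(), hits.end());  // pairs compare by rowset ID first
  std::vector<int> ids;
  ids.reserve(hits.size());
  for (const auto& rs : hits) ids.push_back(rs.first);
  return ids;
}
```

For example, a query yielding rowsets {3, "a"}, {1, "m"}, {2, "t"} (start-key order) would be scanned in container order as IDs 1, 2, 3.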
[kudu-CR] KUDU-2693. Buffer blocks while flushing rowsets
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/12425 )

Change subject: KUDU-2693. Buffer blocks while flushing rowsets
..

Patch Set 1: (5 comments)

http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG@13
PS1, Line 13: This means that, instead of each flush
           : requiring its own LBM container, many of the output blocks can be
           : consolidated into a small number of new containers (typically one per
           : disk).
> While certainly an improvement over the random block-to-container assignmen

Yea, the skepticism there is that, even if we put the blocks for a column in one container, we might not be flushing in exactly key order, so doing so still doesn't allow you to just do a big sequential read of that column.

The "row-oriented"-ness here is sorta half true. What we end up with is typically referred to as "PAX" -- essentially you divide into blocks of rows, and within the block, it's columnar. Parquet calls these blocks "row groups"; we call them "DRS". Whether we lay out the column chunks in one container or across a bunch isn't really too different, right? (Parquet always puts them in a single file.)

http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG@27
PS1, Line 27: up to 2x overhead from buffer size-doubling
> Not following this bit; there's just one buffer per writable block, so wher

Just because append()ing 10 bytes to a full 32KB buffer will produce a 64KB buffer. In the limit, a faststring may have a capacity twice its current length. I'll try to clarify. We could improve this max overhead by using an iovec-like chain of smaller buffers, of course, but I was lazy, so I went with the quickest prototype for a buffer.

http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG@42
PS1, Line 42: "data dirs.queue_time_us" : 1847,
           : "data dirs.run_cpu_time_us" : 409,
           : "data dirs.run_wall_time_us" : 21185,
> Were these edited out of "with the change"? If so, could drop them from here

Hmm, I actually don't recall if I edited them out or they just weren't there.

http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG@66
PS1, Line 66: "lbm_writes_lt_1ms" : 1419
> This is almost an order of magnitude less than before the change. Any thoug

The total number of writes is lower because we now do only one write when we flush the buffer, whereas before, each block->Append() call did a write directly to disk.

http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG@73
PS1, Line 73: On a spinning disk the reduction from 450
           : fsyncs to 5 fsyncs is likely to be even more impactful in terms of wall
           : clock.
> Yeah I'd love to see some numbers on a spinning disk.

I think Grant was going to try this out on a real workload on spinning disks?

--
Gerrit-Comment-Date: Wed, 13 Feb 2019 07:15:28 +0000
Gerrit-HasComments: Yes
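The size-doubling overhead discussed in the thread can be illustrated with a toy growth policy. This is a sketch of doubling growth in general, not Kudu's actual faststring implementation, and the function name is hypothetical:

```cpp
#include <cstddef>

// Doubling growth policy: when an append doesn't fit, capacity doubles
// until it does. In the limit, capacity can approach 2x the buffered
// length, which is where the "up to 2x overhead" estimate comes from.
std::size_t GrownCapacity(std::size_t cap, std::size_t needed) {
  if (cap == 0) cap = 32;  // arbitrary initial capacity for this sketch
  while (cap < needed) cap *= 2;
  return cap;
}
```

Appending 10 bytes to a full 32 KB buffer (so 32778 bytes are needed) grows capacity to 64 KB, roughly double the live data.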
[kudu-CR] KUDU-2693. Buffer blocks while flushing rowsets
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/12425 )

Change subject: KUDU-2693. Buffer blocks while flushing rowsets
..

Patch Set 1: (11 comments)

http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG@9
PS1, Line 9: This adds the ability to the LogBlockManager to write a block that is
          : fully buffered in memory before eventually being flushed.
Worth noting that if this behavior becomes the default, the LogWritableBlock finalize vs. close code could be drastically simplified (in effect, there'd be no difference between the two, because an in-memory writable block does not occupy a container).

http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG@13
PS1, Line 13: This means that, instead of each flush
          : requiring its own LBM container, many of the output blocks can be
          : consolidated into a small number of new containers (typically one per
          : disk).
While certainly an improvement over the random block-to-container assignment we have going on now, I don't know how this will work when considering the longer-term vision of "put all blocks belonging to one column in one container" (i.e. to reduce the number of seeks needed to scan a particular column's data). I know you've voiced skepticism about this in the past; could you remind me what your concerns were?

It's also interesting to consider this in light of what you write in the next paragraph, as storing DRS-colocated data in one container is decidedly "row-oriented" storage.

http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG@24
PS1, Line 24: --diskrowset-writer-buffer-data
Nit: although valid, we still prefer "--foo_bar_baz" as the canonical gflag name to make searching easier, and that ought to apply to searching commit history too.

http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG@27
PS1, Line 27: up to 2x overhead from buffer size-doubling
Not following this bit; there's just one buffer per writable block, so where's the doubling effect come from?

http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG@42
PS1, Line 42: "data dirs.queue_time_us" : 1847,
          : "data dirs.run_cpu_time_us" : 409,
          : "data dirs.run_wall_time_us" : 21185,
Were these edited out of "with the change"? If so, could drop them from here too.

I'm also curious as to what this actually was. FlushMRSOp doesn't delete blocks, so there's no hole punching. I can't see how else the flush path might do some work on the data dir threadpools.

http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG@66
PS1, Line 66: "lbm_writes_lt_1ms" : 1419
This is almost an order of magnitude less than before the change. Any thoughts on why?

http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG@73
PS1, Line 73: On a spinning disk the reduction from 450
          : fsyncs to 5 fsyncs is likely to be even more impactful in terms of wall
          : clock.
Yeah, I'd love to see some numbers on a spinning disk.

http://gerrit.cloudera.org:8080/#/c/12425/1/src/kudu/fs/log_block_manager-test.cc
File src/kudu/fs/log_block_manager-test.cc:

http://gerrit.cloudera.org:8080/#/c/12425/1/src/kudu/fs/log_block_manager-test.cc@517
PS1, Line 517: // TODO(todd) add tests for aborting buffered blocks and making sure the metrics
            : // are appropriate
Looks like you did this below.

http://gerrit.cloudera.org:8080/#/c/12425/1/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

http://gerrit.cloudera.org:8080/#/c/12425/1/src/kudu/fs/log_block_manager.cc@20
PS1, Line 20: #include <errno.h>
Let's use cerrno instead.

http://gerrit.cloudera.org:8080/#/c/12425/1/src/kudu/fs/log_block_manager.cc@1667
PS1, Line 1667: RETURN_NOT_OK(manager_->GetOrCreateContainer(opts_, _));
What does it mean for this to fail?

Before this change, creating a container was tightly coupled to block creation; if the former failed, so did the latter. That's no longer the case. What happens if we run out of disk space trying to create a new container for a block with some data in its buffer?

http://gerrit.cloudera.org:8080/#/c/12425/1/src/kudu/tablet/delta_compaction.cc
File src/kudu/tablet/delta_compaction.cc:

http://gerrit.cloudera.org:8080/#/c/12425/1/src/kudu/tablet/delta_compaction.cc@278
PS1, Line 278: Status MajorDeltaCompaction::OpenRedoDeltaFileWriter() {
Why not extend buffering to the redo/undo blocks here? They're logically grouped into a transaction with the new base data, so why not buffer and colocate them too?

--
Gerrit-Change-Number: 12425
[kudu-CR] KUDU-2693. Buffer blocks while flushing rowsets
Hello Mike Percy, Andrew Wong, Adar Dembo, Grant Henke,

I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/12425 to review the following change.

Change subject: KUDU-2693. Buffer blocks while flushing rowsets
..

KUDU-2693. Buffer blocks while flushing rowsets

This adds the ability to the LogBlockManager to write a block that is
fully buffered in memory before eventually being flushed. By using this
in the DiskRowSet writing code path, we're able to write out all column
blocks in parallel into memory buffers and then flush them serially when
we roll the DRS writer. This means that, instead of each flush requiring
its own LBM container, many of the output blocks can be consolidated
into a small number of new containers (typically one per disk).

By consolidating the writes into a small number of containers, the
number of disk seeks and fsync calls can also be substantially reduced.
This is likely to also reduce LBM fragmentation if the DRS is later
deleted (e.g. compacted away), since all of the data for the given DRS
should be contiguous in a container.

The feature is enabled by a new flag '--diskrowset-writer-buffer-data'.
It is disabled by default because there is a slight memory cost
associated: likely up to 64MB per maintenance thread, given the default
32MB rolling threshold and up to 2x overhead from buffer size-doubling.
We should benchmark and validate the improvement on a real workload
before enabling it by default.

I tested this by setting up a local TS with two data dirs and writing a
table with 201 columns. I then looked at the trace metrics from
FlushMRSOp.

Without the new flag enabled, I saw metrics like the following (edited
to retain only the relevant ones):

Timing: real 7.492s user 6.039s sys 0.267s
{
  "bytes_written" : 267764138,
  "cfile_init" : 7,
  "data dirs.queue_time_us" : 1847,
  "data dirs.run_cpu_time_us" : 409,
  "data dirs.run_wall_time_us" : 21185,
  "drs_written" : 7,
  "fdatasync" : 451,           <--- lots of fsyncs
  "fdatasync_us" : 903636,
  "lbm_writes_1-10_ms" : 1,
  "lbm_writes_lt_1ms" : 9988,
  "lbm_write_time_us" : 160774,
  "rows_written" : 643667,
  "threads_started" : 8,
  "thread_start_us" : 296,
}

With the change:

Timing: real 6.719s user 6.335s sys 0.192s
{
  "bytes_written" : 270056800,
  "cfile_init" : 7,
  "drs_written" : 7,
  "fdatasync" : 5,             <--- big reduction
  "fdatasync_us" : 88812,
  "lbm_writes_1-10_ms" : 2,
  "lbm_writes_lt_1ms" : 1419,
  "lbm_write_time_us" : 131746,
  "rows_written" : 649506,
  "threads_started" : 1,
  "thread_start_us" : 34,
}

This test was on my local SSD. On a spinning disk, the reduction from
450 fsyncs to 5 fsyncs is likely to be even more impactful in terms of
wall clock.
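The buffer-then-flush path described in the commit message could be sketched roughly as follows. The class and function names here are hypothetical, not Kudu's actual LBM API:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Sketch of a fully buffered writable block: appends accumulate in
// memory, and flushing emits one consolidated write, so a whole rowset's
// blocks can later be drained serially into a small number of containers
// with only a handful of fsyncs.
class BufferedBlock {
 public:
  void Append(const std::string& data) { buf_ += data; }
  const std::string& buffer() const { return buf_; }

 private:
  std::string buf_;
};

// Many small appends become a single write of the combined size,
// instead of one disk write per Append() call.
std::size_t ConsolidatedWriteSize(const std::vector<std::string>& appends) {
  BufferedBlock block;
  for (const auto& a : appends) block.Append(a);
  return block.buffer().size();
}
```

This mirrors the fdatasync numbers above: one large write per block at flush time rather than one disk write per append.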
Change-Id: Iacc662ba812ece8e68b0ef28f4ccdf0b7475fbc0
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
M src/kudu/fs/data_dirs-test.cc
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tablet/delta_compaction.cc
M src/kudu/tablet/delta_tracker.cc
M src/kudu/tablet/diskrowset.cc
M src/kudu/tablet/multi_column_writer.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/util/env_posix.cc
13 files changed, 263 insertions(+), 62 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/12425/1

--
Gerrit-MessageType: newchange