[kudu-CR] KUDU-2693. Buffer blocks while flushing rowsets

2019-03-19 Thread helifu (Code Review)
helifu has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12425 )

Change subject: KUDU-2693. Buffer blocks while flushing rowsets
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG@18
PS1, Line 18: By consolidating the writes into a small number of containers, the
: number of disk seeks and fsync calls can also be substantially reduced.
1) For the number of fsync calls: it would be only one per DRS if the
DRS-collocated data were written into a single container ("row groups"),
which is a little bit radical;
2) For the number of disk seeks: it's indeed reduced for writing, but not
for reading.

But in general, it's a very good idea, because disk I/O will be mitigated
while writing, and that will help improve reading (scans).



--
To view, visit http://gerrit.cloudera.org:8080/12425
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacc662ba812ece8e68b0ef28f4ccdf0b7475fbc0
Gerrit-Change-Number: 12425
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Grant Henke 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: helifu 
Gerrit-Comment-Date: Tue, 19 Mar 2019 09:11:27 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2693. Buffer blocks while flushing rowsets

2019-02-13 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12425 )

Change subject: KUDU-2693. Buffer blocks while flushing rowsets
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/12425/1/src/kudu/fs/block_manager-stress-test.cc
File src/kudu/fs/block_manager-stress-test.cc:

http://gerrit.cloudera.org:8080/#/c/12425/1/src/kudu/fs/block_manager-stress-test.cc@317
PS1, Line 317: CHECK_OK(bm_->CreateBlock(CreateBlockOptions().with_tablet_id(test_tablet_name_),
: &block));
Would also be interesting to choose some blocks at random to buffer.


http://gerrit.cloudera.org:8080/#/c/12425/1/src/kudu/fs/block_manager-stress-test.cc@350
PS1, Line 350: CHECK_OK(dirty_block->Finalize());
And maybe randomly select some blocks NOT to finalize.



--
To view, visit http://gerrit.cloudera.org:8080/12425
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacc662ba812ece8e68b0ef28f4ccdf0b7475fbc0
Gerrit-Change-Number: 12425
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Grant Henke 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 13 Feb 2019 17:39:29 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2693. Buffer blocks while flushing rowsets

2019-02-12 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12425 )

Change subject: KUDU-2693. Buffer blocks while flushing rowsets
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG@13
PS1, Line 13: This means that, instead of each flush
: requiring its own LBM container, many of the output blocks can be
: consolidated into a small number of new containers (typically one per
: disk).
> Yea, the skepticism there is that, even if we put the blocks for a column i
We talked about this on Slack; I'm posting a summary for posterity.

The flush itself preserves key ordering within the rowset. As such, ordering 
rowsets by "time of flush" could yield a sequential scan (provided it's 
UNORDERED). However, that flush time ordering is discarded when a new rowset 
enters the RowSetTree: when queried, the tree will yield intersecting rowsets 
in start key order. If we cared about sequential block access within a 
container, we'd have to reorder the rowsets by rowset ID before scanning. 
Sequential scanning vs. buffering of written data presents a read/write
tradeoff, so even long term it may make sense to support both and allow users
to choose on a per-table basis.
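
To make the reordering idea concrete, here is a minimal sketch (purely
illustrative, not part of this patch; the RowSet type and id() accessor below
are hypothetical stand-ins):

  #include <algorithm>
  #include <cstdint>
  #include <memory>
  #include <vector>

  // Hypothetical rowset interface; ids are assumed to be assigned in flush
  // order, so sorting by id approximates time-of-flush order, which matches
  // the on-disk layout when a flush's blocks share one container.
  struct RowSet {
    virtual ~RowSet() = default;
    virtual int64_t id() const = 0;
  };

  void OrderByFlushTime(std::vector<std::shared_ptr<RowSet>>* rowsets) {
    std::sort(rowsets->begin(), rowsets->end(),
              [](const std::shared_ptr<RowSet>& a,
                 const std::shared_ptr<RowSet>& b) {
                return a->id() < b->id();
              });
  }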

Could you include a brief summary of this discussion in your commit message?



--
To view, visit http://gerrit.cloudera.org:8080/12425
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacc662ba812ece8e68b0ef28f4ccdf0b7475fbc0
Gerrit-Change-Number: 12425
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Grant Henke 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 13 Feb 2019 07:47:41 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2693. Buffer blocks while flushing rowsets

2019-02-12 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12425 )

Change subject: KUDU-2693. Buffer blocks while flushing rowsets
..


Patch Set 1:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG@13
PS1, Line 13: This means that, instead of each flush
: requiring its own LBM container, many of the output blocks can be
: consolidated into a small number of new containers (typically one 
per
: disk).
> While certainly an improvement over the random block-to-container assignmen
Yea, the skepticism there is that, even if we put the blocks for a column in 
one container, we might not be flushing in exactly key order, so doing so still 
doesn't allow you to just do a big sequential read of that column.

The "row-oriented"-ness here is sorta half true. What we end up with is 
typically referred to as "PAX" -- essentially you divide into blocks of rows, 
and within the block, its' columnar. Parquet calls these blocks "row groups". 
We call them "DRS". Whether we lay out the column chunks in one container or 
across a bunch isn't really too different, right? (Parquet always puts them in 
a single file)
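
For illustration, a toy PAX layout in C++ (names are mine, not Kudu's actual
structures):

  #include <cstdint>
  #include <string>
  #include <vector>

  // A "block of rows" (Parquet: row group; Kudu: DRS). Within the group,
  // storage is columnar: all values of one column sit together.
  struct RowGroup {
    std::vector<int64_t> col_a;      // column a for this group's rows
    std::vector<std::string> col_b;  // column b for this group's rows
  };

  // The table is row-oriented at the granularity of groups, columnar within.
  using Table = std::vector<RowGroup>;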


http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG@27
PS1, Line 27: up to 2x overhead from buffer size-doubling
> Not following this bit; there's just one buffer per writable block, so wher
Just because append()ing 10 bytes to a full 32KB buffer will produce a 64KB
buffer. In the limit, a faststring may have a capacity twice its current
length. I'll try to clarify.

We could improve this max overhead by using an iovec-like chain of smaller
buffers, of course, but I was lazy, so I went with the quickest prototype for
a buffer.
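
As a toy model of that worst case (my own sketch, not the actual faststring
code):

  #include <algorithm>
  #include <cstddef>
  #include <vector>

  class DoublingBuffer {
   public:
    void Append(const char* data, size_t len) {
      if (buf_.size() + len > buf_.capacity()) {
        // Geometric growth: appending 10 bytes to a full 32KB buffer
        // reserves 64KB, so capacity can approach 2x the stored length.
        buf_.reserve(std::max(buf_.capacity() * 2, buf_.size() + len));
      }
      buf_.insert(buf_.end(), data, data + len);
    }
    size_t length() const { return buf_.size(); }
    size_t capacity() const { return buf_.capacity(); }

   private:
    std::vector<char> buf_;
  };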


http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG@42
PS1, Line 42:"data dirs.queue_time_us" : 1847,
:"data dirs.run_cpu_time_us" : 409,
:"data dirs.run_wall_time_us" : 21185,
> Were these edited out of "with the change"? If so, could drop them from her
hmm I actually don't recall if I edited them out or they just weren't there.


http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG@66
PS1, Line 66:"lbm_writes_lt_1ms" : 1419
> This is almost an order of magnitude less than before the change. Any thoug
The total number of writes is lower because we only do one write when we
flush the buffer, whereas before, each call to block->Append() would do a
write directly to the disk.
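
Roughly like this toy model (illustrative only; WriteToDisk() stands in for
the real LBM I/O path):

  #include <string>

  class ToyBufferedBlock {
   public:
    // Pre-change, every Append() issued a file write, so N appends meant N
    // lbm writes. With buffering, appends only accumulate in memory.
    void Append(const std::string& data) { buf_ += data; }

    // One write (and eventually one fsync) per block at flush time.
    void FlushBuffer() {
      WriteToDisk(buf_);
      buf_.clear();
    }

   private:
    void WriteToDisk(const std::string& data) { (void)data; /* real I/O elsewhere */ }
    std::string buf_;
  };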


http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG@73
PS1, Line 73: On a spinning disk the reduction from 450
: fsyncs to 5 fsyncs is likely to be even more impactful in terms of wall
: clock.
> Yeah I'd love to see some numbers on a spinning disk.
I think Grant was going to try this out on a real workload on spinning disks?



--
To view, visit http://gerrit.cloudera.org:8080/12425
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacc662ba812ece8e68b0ef28f4ccdf0b7475fbc0
Gerrit-Change-Number: 12425
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Grant Henke 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 13 Feb 2019 07:15:28 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2693. Buffer blocks while flushing rowsets

2019-02-12 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12425 )

Change subject: KUDU-2693. Buffer blocks while flushing rowsets
..


Patch Set 1:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG@9
PS1, Line 9: This adds the ability to the LogBlockManager to write a block that is
: fully buffered in memory before eventually being flushed.
Worth noting that if this behavior becomes the default, the LogWritableBlock 
finalize vs. close code could be drastically simplified (in effect, there'd be 
no difference between the two because an in-memory writable block does not 
occupy a container).


http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG@13
PS1, Line 13: This means that, instead of each flush
: requiring its own LBM container, many of the output blocks can be
: consolidated into a small number of new containers (typically one per
: disk).
While certainly an improvement over the random block-to-container assignment we 
have going on now, I don't know how this will work when considering the longer 
term vision of "put all blocks belonging to one column in one container" (i.e. 
to reduce the number of seeks needed to scan a particular column's data).

I know you've voiced skepticism about this in the past; could you remind me 
what your concerns were? It's also interesting to consider this in light of 
what you write in the next paragraph, as storing DRS-colocated data in one 
container is decidedly "row-oriented" storage.


http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG@24
PS1, Line 24: --diskrowset-writer-buffer-data
Nit: although valid, we still prefer "--foo_bar_baz" as the canonical gflag 
name to make searching easier, and that ought to apply to searching commit 
history too.
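
I.e. something like this (sketch only; the help text is mine, and gflags
still accepts dashes on the command line either way):

  #include <gflags/gflags.h>

  DEFINE_bool(diskrowset_writer_buffer_data, false,
              "Whether to fully buffer DiskRowSet column blocks in memory "
              "before flushing them to the block manager.");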


http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG@27
PS1, Line 27: up to 2x overhead from buffer size-doubling
Not following this bit; there's just one buffer per writable block, so where's 
the doubling effect come from?


http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG@42
PS1, Line 42:"data dirs.queue_time_us" : 1847,
:"data dirs.run_cpu_time_us" : 409,
:"data dirs.run_wall_time_us" : 21185,
Were these edited out of "with the change"? If so, could drop them from here 
too.

I'm also curious as to what this actually was. FlushMRSOp doesn't delete 
blocks, so there's no hole punching. I can't see how else the flush path might 
do some work on the data dir threadpools.


http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG@66
PS1, Line 66:"lbm_writes_lt_1ms" : 1419
This is almost an order of magnitude less than before the change. Any thoughts 
on why?


http://gerrit.cloudera.org:8080/#/c/12425/1//COMMIT_MSG@73
PS1, Line 73: On a spinning disk the reduction from 450
: fsyncs to 5 fsyncs is likely to be even more impactful in terms of wall
: clock.
Yeah I'd love to see some numbers on a spinning disk.


http://gerrit.cloudera.org:8080/#/c/12425/1/src/kudu/fs/log_block_manager-test.cc
File src/kudu/fs/log_block_manager-test.cc:

http://gerrit.cloudera.org:8080/#/c/12425/1/src/kudu/fs/log_block_manager-test.cc@517
PS1, Line 517: // TODO(todd) add tests for aborting buffered blocks and making sure the metrics
: // are appropriate
Looks like you did this below.


http://gerrit.cloudera.org:8080/#/c/12425/1/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

http://gerrit.cloudera.org:8080/#/c/12425/1/src/kudu/fs/log_block_manager.cc@20
PS1, Line 20: #include <errno.h>
Let's use cerrno instead.


http://gerrit.cloudera.org:8080/#/c/12425/1/src/kudu/fs/log_block_manager.cc@1667
PS1, Line 1667: RETURN_NOT_OK(manager_->GetOrCreateContainer(opts_, &container_));
What does it mean for this to fail? Before this change, creating a container 
was tightly coupled to block creation; if the former failed, so did the latter. 
That's no longer the case. What happens if we run out of disk space trying to 
create a new container for a block with some data in its buffer?


http://gerrit.cloudera.org:8080/#/c/12425/1/src/kudu/tablet/delta_compaction.cc
File src/kudu/tablet/delta_compaction.cc:

http://gerrit.cloudera.org:8080/#/c/12425/1/src/kudu/tablet/delta_compaction.cc@278
PS1, Line 278: Status MajorDeltaCompaction::OpenRedoDeltaFileWriter() {
Why not extend buffering to the redo/undo blocks here? They're logically 
grouped into a transaction with the new base data, so why not buffer and 
colocate them too?



--
To view, visit http://gerrit.cloudera.org:8080/12425
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacc662ba812ece8e68b0ef28f4ccdf0b7475fbc0
Gerrit-Change-Number: 12425

[kudu-CR] KUDU-2693. Buffer blocks while flushing rowsets

2019-02-09 Thread Todd Lipcon (Code Review)
Hello Mike Percy, Andrew Wong, Adar Dembo, Grant Henke,

I'd like you to do a code review. Please visit

http://gerrit.cloudera.org:8080/12425

to review the following change.


Change subject: KUDU-2693. Buffer blocks while flushing rowsets
..

KUDU-2693. Buffer blocks while flushing rowsets

This adds the ability to the LogBlockManager to write a block that is
fully buffered in memory before eventually being flushed. By using this
in the DiskRowSet writing code path, we're able to write out all column
blocks in parallel into memory buffers and then flush them serially
when we roll the DRS writer. This means that, instead of each flush
requiring its own LBM container, many of the output blocks can be
consolidated into a small number of new containers (typically one per
disk).

By consolidating the writes into a small number of containers, the
number of disk seeks and fsync calls can also be substantially reduced.
This is likely to also reduce LBM fragmentation if the DRS is later
deleted (eg compacted away) since all of the data for the given DRS
should be contiguous in a container.

The feature is enabled by a new flag '--diskrowset-writer-buffer-data'.
It is disabled by default because there is a slight memory cost
associated: likely up to 64MB per maintenance thread given the default
32MB rolling threshold and up to 2x overhead from buffer size-doubling.
We should benchmark and validate the improvement on a real workload
before enabling by default.

I tested this by setting up a local TS with two data dirs and writing a
table with 201 columns. I then looked at the trace metrics from
FlushMRSOp.

Without the new flag enabled, I saw metrics like the following (edited
to retain only the relevant ones):

Timing: real 7.492s user 6.039s sys 0.267s
{
   "bytes_written" : 267764138,
   "cfile_init" : 7,
   "data dirs.queue_time_us" : 1847,
   "data dirs.run_cpu_time_us" : 409,
   "data dirs.run_wall_time_us" : 21185,
   "drs_written" : 7,
   "fdatasync" : 451,< lots of fsyncs
   "fdatasync_us" : 903636,
   "lbm_writes_1-10_ms" : 1,
   "lbm_writes_lt_1ms" : 9988,
   "lbm_write_time_us" : 160774,
   "rows_written" : 643667,
   "threads_started" : 8,
   "thread_start_us" : 296,
}

With the change:

Timing: real 6.719s user 6.335s sys 0.192s
{
   "bytes_written" : 270056800,
   "cfile_init" : 7,
   "drs_written" : 7,
   "fdatasync" : 5,   <--- big reduction
   "fdatasync_us" : 88812,
   "lbm_writes_1-10_ms" : 2,
   "lbm_writes_lt_1ms" : 1419
   "lbm_write_time_us" : 131746,
   "rows_written" : 649506,
   "threads_started" : 1,
   "thread_start_us" : 34,
}

This test was on my local SSD. On a spinning disk the reduction from 450
fsyncs to 5 fsyncs is likely to be even more impactful in terms of wall
clock.

Change-Id: Iacc662ba812ece8e68b0ef28f4ccdf0b7475fbc0
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
M src/kudu/fs/data_dirs-test.cc
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tablet/delta_compaction.cc
M src/kudu/tablet/delta_tracker.cc
M src/kudu/tablet/diskrowset.cc
M src/kudu/tablet/multi_column_writer.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/util/env_posix.cc
13 files changed, 263 insertions(+), 62 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/12425/1
--
To view, visit http://gerrit.cloudera.org:8080/12425
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Iacc662ba812ece8e68b0ef28f4ccdf0b7475fbc0
Gerrit-Change-Number: 12425
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Grant Henke 
Gerrit-Reviewer: Mike Percy