Also, keep in mind that when the MRS flushes, it flushes into a bunch of separate RowSets, not 1:1. It "rolls" to a new RowSet every N MB (N=32 by default). This is controlled by --budgeted_compaction_target_rowset_size.
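As a rough mental model of that rolling behavior (an illustrative sketch only, not Kudu's actual C++ implementation; the helper name is made up), a flush is a sorted stream being chunked by a size budget:

```python
# Illustrative sketch only -- not Kudu code. Shows how one flush of a
# sorted in-memory rowset "rolls" into multiple size-bounded RowSets.
ROLL_SIZE = 32 * 1024 * 1024  # default --budgeted_compaction_target_rowset_size

def roll_flush(sorted_rows, row_size, roll_size=ROLL_SIZE):
    """Split one sorted row stream into non-overlapping, size-bounded chunks."""
    rowsets, current, current_bytes = [], [], 0
    for row in sorted_rows:
        current.append(row)
        current_bytes += row_size
        if current_bytes >= roll_size:  # budget reached: roll to a new RowSet
            rowsets.append(current)
            current, current_bytes = [], 0
    if current:
        rowsets.append(current)
    return rowsets
```

Because the input stream is sorted, each emitted chunk covers a disjoint slice of the primary-key space.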
However, increasing this size isn't likely to decrease the number of compactions, because each of these 32MB RowSets is non-overlapping. In other words, if your MRS contains rows A-Z, the output RowSets will include [A-C], [D-G], [H-P], [Q-Z]. Since these ranges do not overlap, they will never need to be compacted with each other. The net result here is that compaction becomes more fine-grained and only needs to operate on sub-ranges of the tablet where there is a lot of overlap.

You can read more about this in docs/design-docs/compaction-policy.md, in particular the section "Limiting RowSet Sizes".

Hope that helps
-Todd

On Fri, Jun 15, 2018 at 8:26 AM, William Berkeley <wdberke...@gmail.com> wrote:

> The op seen in the logs is a rowset compaction, which takes existing
> diskrowsets and rewrites them. It's not a flush, which writes data in
> memory to disk, so I don't think flush_threshold_mb is relevant. Rowset
> compaction is done to reduce the amount of overlap of rowsets in primary
> key space, i.e. to reduce the number of rowsets that might need to be
> checked to enforce the primary key constraint or to find a row. Having
> lots of rowset compaction indicates that rows are being written in a
> somewhat random order w.r.t. the primary key order. Kudu will perform
> much better as writes scale when rows are inserted roughly in increasing
> order per tablet.
>
> Also, because you are using the log block manager (the default and the
> only one suitable for production deployments), there isn't a 1:1
> relationship between cfiles or diskrowsets and files on the filesystem.
> Many cfiles and diskrowsets will be put together in a container file.
>
> Config parameters that might be relevant here:
> --maintenance_manager_num_threads
> --fs_data_dirs (how many)
> --fs_wal_dir (is it shared on a device with the data dirs?)
>
> The metrics from the compact row sets op indicate the time is spent in
> fdatasync and in reading (likely reading the original rowsets).
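The A-Z example above can be checked in a few lines: a compaction policy only ever needs to consider rowsets whose primary-key ranges intersect. A minimal sketch (hypothetical helper names, not Kudu's actual policy code):

```python
# Illustrative sketch -- not Kudu's compaction policy. Rowsets whose
# key ranges do not overlap never need to be compacted together.

def overlaps(a, b):
    """True if key ranges a=(lo, hi) and b=(lo, hi) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

def compaction_candidates(rowsets):
    """Return the pairs of rowsets whose key ranges overlap."""
    pairs = []
    for i in range(len(rowsets)):
        for j in range(i + 1, len(rowsets)):
            if overlaps(rowsets[i], rowsets[j]):
                pairs.append((rowsets[i], rowsets[j]))
    return pairs

# The rolled output of one MRS flush is non-overlapping: nothing to compact.
flush_output = [("A", "C"), ("D", "G"), ("H", "P"), ("Q", "Z")]
print(compaction_candidates(flush_output))  # []

# Randomly ordered inserts across flushes produce overlapping rowsets:
print(compaction_candidates([("A", "M"), ("C", "R")]))
```

This is also why inserting roughly in key order per tablet (as Will notes below) avoids most rowset compaction: successive flushes land in disjoint key ranges.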
> The overall compaction time is kinda long but not crazy long. What's the
> performance you are seeing and what is the performance you would like to
> see?
>
> -Will
>
> On Fri, Jun 15, 2018 at 7:52 AM, Quanlong Huang <huang_quanl...@126.com> wrote:
>
>> Hi all,
>>
>> I'm running kudu 1.6.0-cdh5.14.2. When looking into the logs of the
>> tablet server, I find most of the compactions are compacting small
>> files (~40MB each). For example:
>>
>> I0615 07:22:42.637351 30614 tablet.cc:1661] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction: stage 1 complete, picked 4 rowsets to compact
>> I0615 07:22:42.637385 30614 compaction.cc:903] Selected 4 rowsets to compact:
>> I0615 07:22:42.637393 30614 compaction.cc:906] RowSet(343)(current size on disk: ~40666600 bytes)
>> I0615 07:22:42.637401 30614 compaction.cc:906] RowSet(1563)(current size on disk: ~34720852 bytes)
>> I0615 07:22:42.637408 30614 compaction.cc:906] RowSet(1645)(current size on disk: ~29914833 bytes)
>> I0615 07:22:42.637415 30614 compaction.cc:906] RowSet(1870)(current size on disk: ~29007249 bytes)
>> I0615 07:22:42.637428 30614 tablet.cc:1447] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction: entering phase 1 (flushing snapshot).
>> Phase 1 snapshot: MvccSnapshot[committed={T|T < 6263071556616208384 or (T in {6263071556616208384})}]
>> I0615 07:22:42.641582 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
>> I0615 07:22:43.875396 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
>> I0615 07:22:44.418421 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
>> I0615 07:22:45.114389 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
>> I0615 07:22:54.762563 30614 tablet.cc:1532] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction: entering phase 2 (starting to duplicate updates in new rowsets)
>> I0615 07:22:54.773572 30614 tablet.cc:1587] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction Phase 2: carrying over any updates which arrived during Phase 1
>> I0615 07:22:54.773599 30614 tablet.cc:1589] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Phase 2 snapshot: MvccSnapshot[committed={T|T < 6263071556616208384 or (T in {6263071556616208384})}]
>> I0615 07:22:55.189757 30614 tablet.cc:1631] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction successful on 82987 rows (123387929 bytes)
>> I0615 07:22:55.191426 30614 maintenance_manager.cc:491] Time spent running CompactRowSetsOp(6bdefb8c27764a0597dcf98ee1b450ba): real 12.628s user 1.460s sys 0.410s
>> I0615 07:22:55.191484 30614 maintenance_manager.cc:497] P 70f3e54fe0f3490cbf0371a6830a33a7: CompactRowSetsOp(6bdefb8c27764a0597dcf98ee1b450ba) metrics: {"cfile_cache_hit":812,"cfile_cache_hit_bytes":16840376,"cfile_cache_miss":2730,"cfile_cache_miss_bytes":251298442,"cfile_init":496,"data dirs.queue_time_us":6646,"data dirs.run_cpu_time_us":2188,"data dirs.run_wall_time_us":101717,"fdatasync":315,"fdatasync_us":9617174,"lbm_read_time_us":1288971,"lbm_reads_1-10_ms":32,"lbm_reads_10-100_ms":41,"lbm_reads_lt_1ms":4641,"lbm_write_time_us":122520,"lbm_writes_lt_1ms":2799,"mutex_wait_us":25,"spinlock_wait_cycles":155264,"tcmalloc_contention_cycles":768,"thread_start_us":677,"threads_started":14,"wal-append.queue_time_us":300}
>>
>> The flush_threshold_mb is set to the default value (1024). Wouldn't the
>> flushed file size be ~1GB?
>>
>> I think increasing the initial RowSet size could reduce compactions and
>> thus reduce the impact on other ongoing operations. It might also improve
>> flush performance. Is that right? If so, how can I increase the RowSet
>> size?
>>
>> I'd be grateful if someone could clarify these points for me!
>>
>> Thanks,
>> Quanlong

--
Todd Lipcon
Software Engineer, Cloudera
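To make the fdatasync observation concrete: the metrics blob in the CompactRowSetsOp log line above is JSON, so the sync share of the reported 12.628s wall time can be computed directly. A small sketch with the values hardcoded from the log (only a subset of keys is used):

```python
# Parse a subset of the op metrics from the log above and compute how
# much of the 12.628s wall time went to fdatasync vs. block reads.
import json

metrics = json.loads('{"fdatasync":315,"fdatasync_us":9617174,'
                     '"lbm_read_time_us":1288971,"lbm_write_time_us":122520}')

wall_s = 12.628  # "real" time reported for CompactRowSetsOp
fdatasync_s = metrics["fdatasync_us"] / 1e6
read_s = metrics["lbm_read_time_us"] / 1e6

print(f"fdatasync: {fdatasync_s:.1f}s ({fdatasync_s / wall_s:.0%} of wall)")
print(f"block-manager reads: {read_s:.1f}s")
# fdatasync accounts for roughly three quarters of the wall time,
# which matches Will's reading that the op is sync- and read-bound.
```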