Hi,
I am trying to establish some baselines for capacity planning. The approach I
took was to insert an increasing number of rows into a replica of the table
to be sized, watch the size of the "data" directory (after running nodetool
flush and nodetool compact), and calculate the average size per row (total
directory size / row count). Is this a valid approach for extrapolating
future data growth?
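For reference, here is a minimal sketch of the measurement loop I'm running
(the keyspace/table names, data path, and batch sizes are placeholders for my
setup, and the actual row inserts go through the driver, omitted here):

import os
import subprocess

KS, TABLE = "myks", "abc"                      # placeholder names
DATA_DIR = "/var/lib/cassandra/data/myks/abc"  # placeholder data directory

def dir_size(path):
    # total on-disk size of every file under path
    return sum(os.path.getsize(os.path.join(root, f))
               for root, _, files in os.walk(path)
               for f in files)

total_rows = 0
for batch in (100000, 400000, 500000):  # grow the table in steps
    # ... insert `batch` new rows via the driver (omitted) ...
    total_rows += batch
    subprocess.check_call(["nodetool", "flush", KS, TABLE])
    subprocess.check_call(["nodetool", "compact", KS, TABLE])
    print("%d rows: %.1f bytes/row"
          % (total_rows, dir_size(DATA_DIR) / float(total_rows)))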
Related to this, is there any information we can gather from the
partition-size output of cfhistograms (snipped output for my table below)?
Partition Size (bytes)
642 bytes: 221
770 bytes: 2328
924 bytes: 328858
..
8239 bytes: 153178
...
24601 bytes: 16973
29521 bytes: 10805
...
219342 bytes: 23
263210 bytes: 6
315852 bytes: 4
The sizes in the cfhistograms output vary widely around the value calculated
with the approach described above (avg ~2 KB/row). Could this difference be
due to compression, or are there other factors at play? And what would be the
typical use/interpretation of the "partition size" metric?
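As a cross-check, I also computed a rough weighted mean over the buckets
shown above (just the snipped buckets, so the full output would shift the
number somewhat):

buckets = [  # (bucket offset in bytes, partition count) from cfhistograms
    (642, 221), (770, 2328), (924, 328858),
    (8239, 153178), (24601, 16973), (29521, 10805),
    (219342, 23), (263210, 6), (315852, 4),
]
total_bytes = sum(size * count for size, count in buckets)
total_parts = sum(count for _, count in buckets)
print("weighted mean: %.0f bytes/partition"
      % (total_bytes / float(total_parts)))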
The table definition is as follows:
CREATE TABLE abc (
  key1 text,
  col1 text,
  PRIMARY KEY ((key1))
) WITH
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.100000 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=0.000000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='99.0PERCENTILE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'sstable_size_in_mb': '50', 'class': 'LeveledCompactionStrategy'} AND
  compression={'sstable_compression': 'LZ4Compressor'};
Thanks,
Joseph