Someone please correct me if I am wrong.
I've looked into this recently because of some performance issues with my tables
in a production environment. Like the reference guide says, I don't recommend
keeping this many versions around unless you really need them. Telling HBase to
keep a very large number of versions doesn't waste space by itself; it's just a
value in the table descriptor, so I wouldn't worry about that. The problems come
in when you actually write out those versions.
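To make the "just a value in the table descriptor" point concrete, here's a
minimal sketch using the HBase 1.x Java client (the table name "my_table" and
family "cf" are placeholders I made up). Setting a huge max-versions value is
only metadata on the column family; nothing is stored until versions are written.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateTableWithManyVersions {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("my_table"));
            HColumnDescriptor family = new HColumnDescriptor("cf");
            // The max-versions setting is just metadata on the column family;
            // it costs nothing on disk until old versions are actually written.
            family.setMaxVersions(100000);
            table.addFamily(family);
            admin.createTable(table);
        }
    }
}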
My tables currently have max_versions set well above the default, and roughly
40% of the data in them is historical versions. One table in particular is
around 25 TB. I don't have a need to keep this many versions, so I am working on
changing max versions back to the default of 3 (some cells are hundreds or
thousands of versions deep). The issue you'll run into is scan performance on
the region servers, which have to scan over all that data you may not need
(because of the large store files). This can lead to increased scan times and
potentially scanner timeouts, depending on how large your batch size is set on
the scan.
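As a rough illustration (again the HBase 1.x Java client, with placeholder
table/family names), limiting the versions a scan asks for and keeping
caching/batch modest helps on the client side, but the region server still has
to read past all those extra cells sitting in the store files:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class ScanLatestVersionOnly {
    public static void main(String[] args) throws Exception {
        Scan scan = new Scan();
        scan.setMaxVersions(1); // only ask for the newest version of each cell
        scan.setCaching(100);   // rows fetched per RPC; keep modest on deep tables
        scan.setBatch(50);      // cap on cells per Result, guards against huge rows
        try (Connection connection =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("my_table"));
             ResultScanner scanner = table.getScanner(scan)) {
            for (Result result : scanner) {
                // process each row here; the server still reads over the old
                // versions in the store files even though they aren't returned
            }
        }
    }
}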
I assume this also has some performance impact on compactions, both minor and
major, and potentially on the write path, but I didn't investigate either of
those.
Changing the number of versions after the table has been created doesn't have a
performance impact, since it's just a metadata change. The table will need to be
disabled, altered, and re-enabled. If this is done through a script, the table
could be offline for only a couple of seconds. The only concern around that is
other users of the table: if they have scheduled jobs that hit the table, those
jobs will break if they try to read from it while it is disabled. The only
performance impact I can think of around this change would be the next major
compaction of the table, but even that shouldn't be an issue.
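For what it's worth, here's roughly what that disable/alter/enable cycle looks
like with the HBase 1.x Java Admin API (the shell's disable/alter/enable does
the same thing; "my_table" and "cf" below are placeholders):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class ReduceMaxVersions {
    public static void main(String[] args) throws Exception {
        TableName tableName = TableName.valueOf("my_table");
        try (Connection connection =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = connection.getAdmin()) {
            HTableDescriptor desc = admin.getTableDescriptor(tableName);
            HColumnDescriptor family = desc.getFamily(Bytes.toBytes("cf"));
            family.setMaxVersions(3); // back to the default
            admin.disableTable(tableName);         // table is briefly offline here
            admin.modifyColumn(tableName, family); // metadata-only change
            admin.enableTable(tableName);
            // versions beyond 3 are actually purged at the next major compaction
        }
    }
}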
_____________________________
From: Daniel <[email protected]>
Sent: Sunday, February 21, 2016 9:22 AM
Subject: Two questions about the maximum number of versions of a column family
To: user <[email protected]>
Hi, I have two questions about the maximum number of versions of a column
family:
(1) Is it OK to set a very large (>100,000) maximum number of versions for a
column family?
The reference guide says "It is not recommended setting the number of max
versions to an exceedingly high level (e.g., hundreds or more) unless those old
values are very dear to you because this will greatly increase StoreFile size."
(Chapter 36.1)
I'm new to the Hadoop ecosystem, and have no idea about the consequences of a
very large StoreFile size.
Furthermore, is it OK to set a large maximum number of versions but insert only
a few versions? Does that waste space?
(2) How much performance overhead does it cause to increase the maximum number
of versions of a column family after an enormous number of rows (e.g., billions)
have been inserted?
Regards,
Daniel