Hi all,
I'm not sure I have correctly understood the purpose of major compaction
and how to handle it in a busy system.
As I understand it, it is important to run major compaction when we have
a lot of deleted data, because it removes the delete markers (the
"marked as deleted" flags).
There are also the "flush" and "minor compaction" operations associated
with writing to disk. I understand that in a minor compaction several
files produced by flush operations are rewritten into a single file.
What is not clear to me is whether major compaction does the same thing
(and so can be skipped if there are no deletes in the system), or
whether it also performs some operation that minor compaction does not,
so that skipping it may affect performance or data volume.
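To make the question concrete, here is a toy model of how I currently
picture the two operations. It is plain Java, not HBase code: the class
CompactionSketch, the Cell type and the methods minorCompact and
majorCompact are names I made up, and the behavior (minor keeps
everything, major drops delete markers and the cells they cover) is only
my assumption, which may well be wrong.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CompactionSketch {

    // Hypothetical cell: a row key, a timestamp, and either a value or a delete marker.
    static final class Cell {
        final String row;
        final long ts;
        final String value;          // null for a delete marker
        final boolean deleteMarker;

        Cell(String row, long ts, String value, boolean deleteMarker) {
            this.row = row;
            this.ts = ts;
            this.value = value;
            this.deleteMarker = deleteMarker;
        }
    }

    // Assumed minor compaction: merge the given files into one sorted file
    // (by row, newest timestamp first) and keep every cell, markers included.
    static List<Cell> minorCompact(List<List<Cell>> files) {
        List<Cell> merged = new ArrayList<>();
        for (List<Cell> file : files) {
            merged.addAll(file);
        }
        merged.sort((a, b) -> {
            int byRow = a.row.compareTo(b.row);
            return byRow != 0 ? byRow : Long.compare(b.ts, a.ts);
        });
        return merged;
    }

    // Assumed major compaction: merge all files, then drop the delete markers
    // and every cell of the same row that is not newer than the newest marker.
    static List<Cell> majorCompact(List<List<Cell>> files) {
        List<Cell> merged = minorCompact(files);
        Map<String, Long> newestDelete = new HashMap<>();
        for (Cell c : merged) {
            if (c.deleteMarker) {
                newestDelete.merge(c.row, c.ts, Math::max);
            }
        }
        List<Cell> kept = new ArrayList<>();
        for (Cell c : merged) {
            Long deletedUpTo = newestDelete.get(c.row);
            boolean masked = deletedUpTo != null && c.ts <= deletedUpTo;
            if (!c.deleteMarker && !masked) {
                kept.add(c);
            }
        }
        return kept;
    }

    // Two made-up flush files: the second one deletes row1 from the first.
    static List<List<Cell>> sampleFiles() {
        List<Cell> first = List.of(
                new Cell("row1", 1L, "a", false),
                new Cell("row2", 1L, "b", false));
        List<Cell> second = List.of(
                new Cell("row1", 2L, null, true),
                new Cell("row3", 2L, "c", false));
        return List.of(first, second);
    }

    public static void main(String[] args) {
        System.out.println("cells after minor: " + minorCompact(sampleFiles()).size());
        System.out.println("cells after major: " + majorCompact(sampleFiles()).size());
    }
}
```

If this picture is right, the only difference between the two is the
cleanup of deleted data, and that is the part I would like confirmed.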
Another thing I'd like your help clarifying is whether a major
compaction of the whole dataset is simply the sum of the major
compactions of all its regions. If so, it should be possible to major
compact only some regions at a time, and the other regions at another
time. I also don't understand whether the system can merge a region
holding little data with another region and, if it can, which of the
operations mentioned above might disturb the system's normal behavior
(i.e. what NOT to do).
The last point concerns the files in HDFS (this might affect the
volume). When is the data deleted from HDFS (in minor and major
compaction)? Are the files deleted when a compaction is performed, or
are they only marked as deleted?
Thank you,
Iulia
--
Iulia Zidaru
Java Developer
1&1 Internet AG - Bucharest/Romania - Web Components Romania
18 Mircea Eliade St
Sect 1, Bucharest
RO Bucharest, 012015
[email protected]
0040 31 223 9153