Hi all,

I'm not sure I have correctly understood the purpose of major compaction, or how to handle it on a busy system. As I understand it, running a major compaction is important when a lot of data has been deleted, because it removes the delete markers (the "marked as deleted" flags). There are also the "flush" and "minor compaction" operations associated with writing to disk. I understand that a minor compaction rewrites several files produced by flush operations into a single file. What is not clear to me is whether major compaction does just the same thing (in which case it could be skipped when there are no deletes in the system), or whether it also does something that minor compaction does not, so that skipping it could affect performance or data volume.
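
To make my current mental model concrete, here is a toy sketch in plain Java. It is not HBase code: every name in it (CompactionModel, TOMBSTONE, sampleFiles and so on) is made up for illustration, each "store file" is just a sorted map with one version per row, and the model itself may be wrong, which is exactly what I am asking about.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only; none of these names are HBase classes.
class CompactionModel {

    // Stand-in for a delete marker written by a delete operation.
    static final String TOMBSTONE = "<deleted>";

    // Merge files given oldest first, so newer entries overwrite older ones.
    // Delete markers are dropped only when dropDeletes is true.
    static Map<String, String> merge(List<Map<String, String>> files, boolean dropDeletes) {
        Map<String, String> out = new TreeMap<>();
        for (Map<String, String> file : files) {
            out.putAll(file);
        }
        if (dropDeletes) {
            out.values().removeIf(TOMBSTONE::equals);
        }
        return out;
    }

    // Minor compaction (as I picture it): merge only files [from, to) into one.
    // Markers are kept, because an older file outside the range may still hold the row.
    static List<Map<String, String>> minorCompact(List<Map<String, String>> files, int from, int to) {
        List<Map<String, String>> result = new ArrayList<>(files.subList(0, from));
        result.add(merge(files.subList(from, to), false));
        result.addAll(files.subList(to, files.size()));
        return result;
    }

    // Major compaction (as I picture it): merge all files into one and drop the markers.
    static List<Map<String, String>> majorCompact(List<Map<String, String>> files) {
        List<Map<String, String>> result = new ArrayList<>();
        result.add(merge(files, true));
        return result;
    }

    // Three files as if produced by three flushes; row "b" is deleted in the second one.
    static List<Map<String, String>> sampleFiles() {
        Map<String, String> f1 = new TreeMap<>();
        f1.put("a", "1");
        f1.put("b", "2");
        Map<String, String> f2 = new TreeMap<>();
        f2.put("b", TOMBSTONE);
        f2.put("c", "3");
        Map<String, String> f3 = new TreeMap<>();
        f3.put("d", "4");
        return new ArrayList<>(Arrays.asList(f1, f2, f3));
    }

    public static void main(String[] args) {
        List<Map<String, String>> files = sampleFiles();
        // prints [{a=1, b=2}, {b=<deleted>, c=3, d=4}]
        System.out.println(minorCompact(files, 1, 3));
        // prints [{a=1, c=3, d=4}]
        System.out.println(majorCompact(files));
    }
}
```

In this toy, the minor compaction leaves the old value of "b" on disk in the first file, hidden only by the marker, and the major compaction is the only step that actually frees the space. If that picture is right, my question is whether anything else differs between the two.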

Another thing I'd like your help clarifying is whether a major compaction of the whole dataset is simply the sum of the major compactions of all its regions. If so, is it possible to major compact only some regions at a time, and the other regions at another time? I also don't understand whether the system can merge a region holding little data with another region and, if it can, which of the operations mentioned above might harm the system's behavior (i.e. what NOT to do).
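
If per-region major compaction is possible, what I have in mind is roughly the following sketch in plain Java. It assumes nothing about the HBase client API: StaggeredCompaction, batches and compactBatch are hypothetical names, the region names are placeholders, and the callback stands in for whatever admin call would trigger a major compaction of a single region.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper; the callback stands in for a real per-region admin call.
class StaggeredCompaction {

    // Split the region names into consecutive batches of at most batchSize.
    static List<List<String>> batches(List<String> regions, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < regions.size(); i += batchSize) {
            int end = Math.min(i + batchSize, regions.size());
            out.add(new ArrayList<>(regions.subList(i, end)));
        }
        return out;
    }

    // Request a major compaction for every region of one batch.
    // The caller decides when the next batch runs (for example, the next quiet period).
    static void compactBatch(List<String> batch, Consumer<String> majorCompactRegion) {
        for (String region : batch) {
            majorCompactRegion.accept(region);
        }
    }

    public static void main(String[] args) {
        // Placeholder region names, not real HBase region names.
        List<String> regions = Arrays.asList("region-1", "region-2", "region-3", "region-4", "region-5");
        List<List<String>> plan = batches(regions, 2);
        // Only the first batch is compacted now; the others would wait for a later window.
        compactBatch(plan.get(0), r -> System.out.println("would major compact " + r));
    }
}
```

The idea would be to run one batch per quiet period, so that the I/O cost of rewriting the files is spread over time instead of hitting all regions at once.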

The last point concerns the files in HDFS (this might affect the volume). When is the data actually deleted from HDFS, in minor and in major compaction? Are the files deleted when a compaction is performed, or are they only marked as deleted?

Thank you,
Iulia



--
Iulia Zidaru
Java Developer

1&1 Internet AG - Bucharest/Romania - Web Components Romania
18 Mircea Eliade St
Sect 1, Bucharest
RO Bucharest, 012015
[email protected]
0040 31 223 9153
