(Disclaimer: my previous message did not involve verification in code or turning up test cases to prove my assertions. For example, the documentation claims that we retain versions beyond the configured max when we do minor compactions, but I do not see in the code how that is done. Perhaps this is how it used to be. Need to dig more.)
On Mon, Jun 19, 2017 at 8:27 AM, Dave Latham <[email protected]> wrote:

> And for any of the cases - if not, then why not? Because that hasn't been
> implemented, or there's an actual reason that HBase would not want to do
> it?

Being able to delete in minor compaction would be an improvement; we are reading the data anyway. Traditionally, the spoke in the wheel is the fact that we allow edits to come in in any order -- clients can write an edit into the past or into the future -- so we can't be sure at compaction time that we see edits in their insert order. If sequenceid were a first-class attribute of Cells, always present, we could rely on it to figure out order. Absent sequenceid, the files in a minor compaction are always an adjacent (according to the order in which they were flushed) subset of all files in the store; with this precept, we know we can safely remove versions if in our subset we've encountered > configured max versions.

> With reads for a custom time range, it's possible to still read data that
> is waiting to be GCed from one of the above mechanisms and will disappear
> after that happens. Doing the GC during minor compactions as well as major
> ones would change that visibility window, but doesn't seem to change that
> odd behavior that is there to begin with.

Should we support retaining deletes even on major compactions for some user-configured period?

Thanks D,
St.Ack

P.S. This section needs a tuneup: http://hbase.apache.org/book.html#compaction

> On Wed, Jun 14, 2017 at 5:51 PM, Dave Latham <[email protected]> wrote:
>
> > What cells, if any, are removed during minor compactions?
> >
> > Cells that
> > (a) are beyond the TTL?
> > (b) are shadowed by a delete marker? (from the files compacted)
> > (c) are shadowed by newer versions? (assuming numVersions configured < num versions of the cell found)
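
To make the max-versions point concrete, here is a rough Java sketch of the counting rule only. It is not HBase code: `MinorCompactionSketch`, `Version`, and `compactColumn` are hypothetical names made up for illustration, and it assumes a single row/column, ignores delete markers, TTL, and sequenceid, and takes on faith that the input files are an adjacent subset of the store's files.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration only; none of these names exist in HBase. */
final class MinorCompactionSketch {

    /** One version of a single row/column: a timestamp plus a value. */
    static final class Version {
        final long timestamp;
        final String value;

        Version(long timestamp, String value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /**
     * Merges the versions of one column found in the files selected for a
     * compaction and keeps only the newest maxVersions by timestamp.
     * Assumption: the files passed in are adjacent in flush order.
     */
    static List<Version> compactColumn(List<List<Version>> adjacentFiles, int maxVersions) {
        List<Version> merged = new ArrayList<>();
        for (List<Version> file : adjacentFiles) {
            merged.addAll(file);
        }
        // Newest timestamp first, as a scan would return versions.
        merged.sort(Comparator.comparingLong((Version v) -> v.timestamp).reversed());
        // Anything past the configured max in this subset is dropped.
        int keep = Math.min(Math.max(maxVersions, 0), merged.size());
        return new ArrayList<>(merged.subList(0, keep));
    }

    public static void main(String[] args) {
        List<List<Version>> files = new ArrayList<>();
        List<Version> olderFile = new ArrayList<>();
        olderFile.add(new Version(30L, "written into the future"));
        olderFile.add(new Version(10L, "a"));
        List<Version> newerFile = new ArrayList<>();
        newerFile.add(new Version(20L, "b"));
        files.add(olderFile);
        files.add(newerFile);

        for (Version v : compactColumn(files, 2)) {
            System.out.println(v.timestamp + " -> " + v.value);
        }
    }
}
```

Note the sketch orders purely by timestamp, which is exactly why edits written into the past or future make this tricky without sequenceid: the file flushed first can hold the newest timestamp.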
