Keith Turner created ACCUMULO-1266:
--------------------------------------

             Summary: Automatically determine when a full major compaction 
would benefit scans
                 Key: ACCUMULO-1266
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1266
             Project: Accumulo
          Issue Type: New Feature
            Reporter: Keith Turner


For the following situation, there is a tipping point where it becomes 
beneficial to do a full major compaction.

 * a tablet is frequently scanned
 * scan-time iterators suppress a lot of data
 * a full major compaction would also suppress that data

Examples of this are tablets with lots of deletes, versions that are 
suppressed, data that's combined, and data that's filtered.

If tablet servers kept track of statistics about scans, could this be used to 
determine when it's beneficial to automatically compact?  In the following 
simple example, it seems obvious that a major compaction would be beneficial. 
In this example, scans over the last hour have had to examine and throw away 20 
million unneeded keys.  A lot of scan work could have been saved by doing a 
major compaction.

 * all scans over tabletA within the last hour have read 30 million keys and 
returned 10 million keys 
 * tabletA has 3 million keys
 * a major compaction would reduce tabletA to 1 million keys and result in 
future scans returning all keys read.
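
The arithmetic behind this example can be spelled out in a few lines. This is 
only an illustrative sketch: the class and method names are hypothetical, and 
it assumes scans cover the tablet evenly, so that the same scans would read 
only the surviving keys after a compaction.

```java
// Worked arithmetic for the tabletA example.
// All names here are hypothetical; nothing below is Accumulo API.
final class ScanWasteExample {

    /** Keys that scans read but scan-time iterators suppressed. */
    static long droppedKeys(long keysRead, long keysReturned) {
        return keysRead - keysReturned;
    }

    /** How many times, on average, the whole tablet was read by scans. */
    static double fullPassesOverTablet(long keysRead, long tabletKeys) {
        return (double) keysRead / tabletKeys;
    }

    public static void main(String[] args) {
        long keysRead = 30_000_000L;           // read by scans in the last hour
        long keysReturned = 10_000_000L;       // returned to users
        long tabletKeys = 3_000_000L;          // keys currently in tabletA
        long keysAfterCompaction = 1_000_000L; // keys left after a full compaction

        long dropped = droppedKeys(keysRead, keysReturned);
        double passes = fullPassesOverTablet(keysRead, tabletKeys);

        // Assumption: scans touch the tablet evenly, so after a compaction the
        // same scans would read only the keys that survived it.
        long readAfterCompaction = (long) (passes * keysAfterCompaction);

        System.out.println("keys dropped by scans:        " + dropped);
        System.out.println("passes over the tablet:       " + passes);
        System.out.println("keys read if compacted first: " + readAfterCompaction);
    }
}
```

Under these assumptions, one pass over 3 million keys by the compaction would 
have saved scans from reading and discarding 20 million keys.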

One complicating factor is that major compactions may have a different set of 
iterators configured than scans.  Therefore it's possible that scans may filter 
a lot of data while major compactions do not.  Tablet servers could possibly 
keep track of the ratio of data dropped by compactions and the ratio of data 
dropped by scans.  These ratios could be used when deciding if a major 
compaction should be done to improve scan performance.
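
One way such bookkeeping might look is sketched below. Every name in it is 
hypothetical and it is not taken from Accumulo's code; it is also not 
thread-safe, which a real tablet server would need to address.

```java
// Hypothetical sketch of per-tablet drop-ratio bookkeeping.
// None of these names come from Accumulo's code base.
final class DropRatioTracker {
    private long scanRead;
    private long scanDropped;
    private long majcRead;
    private long majcDropped;

    /** Record one scan: keys read before iterators ran, keys returned after. */
    void recordScan(long keysRead, long keysReturned) {
        scanRead += keysRead;
        scanDropped += keysRead - keysReturned;
    }

    /** Record one major compaction: keys read from inputs, keys written out. */
    void recordMajorCompaction(long keysRead, long keysWritten) {
        majcRead += keysRead;
        majcDropped += keysRead - keysWritten;
    }

    /** Fraction of keys read by scans that scan-time iterators dropped. */
    double scanDropRatio() {
        return scanRead == 0 ? 0.0 : (double) scanDropped / scanRead;
    }

    /** Fraction of keys read by major compactions that were dropped. */
    double majcDropRatio() {
        return majcRead == 0 ? 0.0 : (double) majcDropped / majcRead;
    }

    public static void main(String[] args) {
        DropRatioTracker stats = new DropRatioTracker();
        stats.recordScan(30_000_000L, 10_000_000L);
        // A compaction whose iterators dropped nothing.
        stats.recordMajorCompaction(3_000_000L, 3_000_000L);
        System.out.println("scan drop ratio: " + stats.scanDropRatio());
        System.out.println("majc drop ratio: " + stats.majcDropRatio());
    }
}
```

A high scan drop ratio paired with a major compaction drop ratio near zero 
would be a hint that compacting is unlikely to help scans.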

What other situations could cause unnecessary major compactions and need to be 
defended against?

In the case where a compaction of just the data in memory would benefit scans, 
ACCUMULO-519 may solve the problem that this ticket is looking to solve.

So what should the formula be?  

{code:java}
  // k/v           : key values
  // recentlyRead  : total number of k/v read before applying iterators by
  //                 recent scans (recentlyRead - recentlyDropped equals # of
  //                 k/v returned to users)
  // majcDropRatio : ratio of k/v dropped by recent major compactions
  // R             : a user configurable ratio, like the current major
  //                 compaction ratio that is based on files
  if (recentlyRead * majcDropRatio > R * totalKeyValues) {
    doFullMajorCompaction();
    resetScanStats();
  }
{code}

The example formula above has an issue: it may initiate a major compaction when 
scans are only reading a part of the tablet that does not drop data.  The 
formula below tries to remedy this.

{code:java}
  // k/v             : key values
  // recentlyDropped : number of k/v dropped by recent scans
  // recentlyRead    : total number of k/v read before applying iterators by
  //                   recent scans (recentlyRead - recentlyDropped equals # of
  //                   k/v returned to users)
  // majcDropRatio   : ratio of k/v dropped by recent major compactions
  // R               : a user configurable ratio, like the current major
  //                   compaction ratio that is based on files
  if (recentlyDropped > R * totalKeyValues
      && recentlyRead * majcDropRatio > R * totalKeyValues) {
    doFullMajorCompaction();
    resetScanStats();
  }
{code}
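
To see how the two formulas differ, the sketch below evaluates both as plain 
predicates. The class and method names are hypothetical. The inputs reuse the 
tabletA numbers, with two added assumptions: R = 3.0 is an arbitrary 
illustrative value, and majcDropRatio = 2/3 assumes recent major compactions 
dropped the same fraction as the 3 million to 1 million reduction in the 
example.

```java
// Hypothetical sketch comparing the two proposed formulas.
// Names, R = 3.0, and majcDropRatio = 2/3 are assumptions for illustration.
final class ScanCompactionCheck {

    /** First formula: looks only at how much scans read. */
    static boolean firstFormula(long recentlyRead, double majcDropRatio,
                                double r, long totalKeyValues) {
        return recentlyRead * majcDropRatio > r * totalKeyValues;
    }

    /** Second formula: additionally requires that scans really dropped data. */
    static boolean secondFormula(long recentlyDropped, long recentlyRead,
                                 double majcDropRatio, double r,
                                 long totalKeyValues) {
        return recentlyDropped > r * totalKeyValues
            && recentlyRead * majcDropRatio > r * totalKeyValues;
    }

    public static void main(String[] args) {
        double r = 3.0;                  // assumed user-configured ratio
        double majcDropRatio = 2.0 / 3.0;
        long totalKeyValues = 3_000_000L;
        long recentlyRead = 30_000_000L;

        // tabletA: scans dropped 20 million of the 30 million keys read.
        System.out.println("tabletA, first:  "
            + firstFormula(recentlyRead, majcDropRatio, r, totalKeyValues));
        System.out.println("tabletA, second: "
            + secondFormula(20_000_000L, recentlyRead, majcDropRatio, r, totalKeyValues));

        // Same read volume, but scans hit only data that nothing suppresses.
        System.out.println("no drops, first:  "
            + firstFormula(recentlyRead, majcDropRatio, r, totalKeyValues));
        System.out.println("no drops, second: "
            + secondFormula(0L, recentlyRead, majcDropRatio, r, totalKeyValues));
    }
}
```

With these inputs both formulas trigger for tabletA, while only the first 
triggers when scans drop nothing, which is the case the second formula is 
meant to guard against.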
