[ https://issues.apache.org/jira/browse/IMPALA-6024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexander Behm resolved IMPALA-6024.
------------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 2.12.0

commit 22d9ac08937f348b21075b276d487f4b1ba3524c
Author: Alex Behm <alex.b...@cloudera.com>
Date:   Mon Jan 22 23:07:25 2018 -0800

    IMPALA-6024: Min sample bytes for COMPUTE STATS TABLESAMPLE

    Adds a new query option COMPUTE_STATS_MIN_SAMPLE_SIZE
    which is the minimum number of bytes that will be scanned
    in COMPUTE STATS TABLESAMPLE, regardless of the user-supplied
    sampling percent. The motivation is to prevent sampling for
    very small tables where accurate stats can be obtained cheaply
    without sampling.

    This patch changes COMPUTE STATS TABLESAMPLE to run the regular
    COMPUTE STATS if the effective sampling percent is 0% or 100%.
    For a 100% sampling rate, the sampling-based stats queries
    are more expensive and produce less accurate stats than the
    regular COMPUTE STATS.

    Default: COMPUTE_STATS_MIN_SAMPLE_SIZE=1GB

    Testing:
    - added new unit tests and ran them locally

    Change-Id: I2cb91a40bec50b599875109c2f7c5bf6f41c2400
    Reviewed-on: http://gerrit.cloudera.org:8080/9113
    Reviewed-by: Alex Behm <alex.b...@cloudera.com>
    Tested-by: Impala Public Jenkins

> Add minimum sample size for COMPUTE STATS TABLESAMPLE
> -----------------------------------------------------
>
>                 Key: IMPALA-6024
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6024
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Frontend
>   Affects Versions: Impala 2.10.0, Impala 2.11.0
>            Reporter: Alexander Behm
>            Assignee: Alexander Behm
>            Priority: Major
>             Fix For: Impala 2.12.0
>
>
> We should introduce a minimum sample size in bytes for COMPUTE STATS
> TABLESAMPLE. Reasons:
> * For small tables sampling does not make sense. Accurate stats can be
> obtained cheaply without sampling.
> * Very small sample sizes mostly do not make sense - some minimum of data is
> required to get meaningful stats.
> I think a 1GB minimum might be a good choice and ideally this minimum sample
> size would be configurable.
> Many other DBMS have stats collection with sampling and in many cases a
> minimum sample size is required to get any meaningful stats.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
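
For readers following the commit message, here is a rough Python sketch of how a byte minimum could interact with the user-supplied sampling percent. It is not taken from the patch: the helper names (effective_sample_fraction, choose_stats_mode), the reading of 1GB as 1024**3 bytes, and the rule "use the larger of the requested fraction and the fraction needed to reach the minimum" are all assumptions made for illustration. The actual frontend samples whole files, so its effective percent will generally differ from this idealized arithmetic.

```python
GB = 1024 ** 3  # assumption: 1GB is treated as 2**30 bytes


def effective_sample_fraction(table_bytes, sample_percent, min_sample_bytes=GB):
    """Hypothetical helper: fraction of the table that would be scanned.

    Takes the larger of the requested fraction and the fraction needed
    to reach min_sample_bytes, capped at 1.0 (the whole table).
    """
    if not 0 <= sample_percent <= 100:
        raise ValueError("sample_percent must be between 0 and 100")
    if table_bytes <= 0:
        return 0.0
    requested = sample_percent / 100.0
    needed_for_minimum = min(1.0, min_sample_bytes / table_bytes)
    return max(requested, needed_for_minimum)


def choose_stats_mode(table_bytes, sample_percent, min_sample_bytes=GB):
    """Hypothetical helper: fall back to regular COMPUTE STATS at 0% or 100%."""
    fraction = effective_sample_fraction(
        table_bytes, sample_percent, min_sample_bytes
    )
    if fraction <= 0.0 or fraction >= 1.0:
        return "regular"
    return "sampled"


if __name__ == "__main__":
    # A 100MB table is smaller than the 1GB minimum, so sampling is skipped.
    print(choose_stats_mode(100 * 1024 ** 2, 10))
    # A 100GB table sampled at 10% already exceeds the minimum.
    print(choose_stats_mode(100 * GB, 10))
```

Under these assumptions, a 4GB table with a 10% request would be scanned at 25%, because 10% alone would cover only about 0.4GB and fall short of the 1GB minimum.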