[ 
https://issues.apache.org/jira/browse/IMPALA-6024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm resolved IMPALA-6024.
------------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 2.12.0

commit 22d9ac08937f348b21075b276d487f4b1ba3524c
Author: Alex Behm <alex.b...@cloudera.com>
Date:   Mon Jan 22 23:07:25 2018 -0800

    IMPALA-6024: Min sample bytes for COMPUTE STATS TABLESAMPLE
    
    Adds a new query option COMPUTE_STATS_MIN_SAMPLE_SIZE
    which is the minimum number of bytes that will be scanned
    in COMPUTE STATS TABLESAMPLE, regardless of the user-supplied
    sampling percent.
    
    The motivation is to prevent sampling for very small tables
    where accurate stats can be obtained cheaply without sampling.
    
    This patch changes COMPUTE STATS TABLESAMPLE to run the regular
    COMPUTE STATS if the effective sampling percent is 0% or 100%.
    For a 100% sampling rate, the sampling-based stats queries
    are more expensive and produce less accurate stats than the
    regular COMPUTE STATS.
    
    Default: COMPUTE_STATS_MIN_SAMPLE_SIZE=1GB
    
    Testing:
    - added new unit tests and ran them locally
    
    Change-Id: I2cb91a40bec50b599875109c2f7c5bf6f41c2400
    Reviewed-on: http://gerrit.cloudera.org:8080/9113
    Reviewed-by: Alex Behm <alex.b...@cloudera.com>
    Tested-by: Impala Public Jenkins
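
The decision described in the commit message can be sketched roughly as follows. This is an illustration only, not Impala's frontend code: the function names, the max/clamp formula, and the reading of 1GB as 1024^3 bytes are assumptions; the commit message only states that a minimum number of bytes is scanned and that an effective rate of 0% or 100% falls back to the regular COMPUTE STATS.

```python
# Hypothetical sketch; names and formula are assumptions, not Impala's code.
GIB = 1024 ** 3  # assumed interpretation of the 1GB default


def effective_sample_fraction(table_bytes, user_percent, min_sample_bytes=GIB):
    """Return the fraction of the table to scan, in the range [0.0, 1.0]."""
    if table_bytes <= 0:
        return 0.0
    requested = user_percent / 100.0
    # Raise the rate so that at least min_sample_bytes would be scanned.
    floor = min_sample_bytes / table_bytes
    return min(1.0, max(requested, floor))


def plan_compute_stats(table_bytes, user_percent, min_sample_bytes=GIB):
    """Pick 'regular' when the effective rate is 0% or 100%, else 'sampled'."""
    fraction = effective_sample_fraction(table_bytes, user_percent, min_sample_bytes)
    if fraction <= 0.0 or fraction >= 1.0:
        return "regular"
    return "sampled"
```

Under these assumptions, a 512MB table with a 10% request ends up as a regular COMPUTE STATS because the 1GB floor exceeds the table size, while a 100GB table with the same request stays sampled at 10%.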


> Add minimum sample size for COMPUTE STATS TABLESAMPLE
> -----------------------------------------------------
>
>                 Key: IMPALA-6024
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6024
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Frontend
>    Affects Versions: Impala 2.10.0, Impala 2.11.0
>            Reporter: Alexander Behm
>            Assignee: Alexander Behm
>            Priority: Major
>             Fix For: Impala 2.12.0
>
>
> We should introduce a minimum sample size in bytes for COMPUTE STATS 
> TABLESAMPLE. Reasons:
> * For small tables sampling does not make sense. Accurate stats can be 
> obtained cheaply without sampling.
> * Very small samples rarely make sense: some minimum amount of data is 
> required to get meaningful stats. 
> I think a 1GB minimum might be a good choice, and ideally this minimum sample 
> size would be configurable.
> Many other DBMSs collect stats with sampling, and in many cases a 
> minimum sample size is required to get any meaningful stats.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
