Re: [DISCUSS] suggest using granularityNumber in ColumnStats

2022-06-04 Thread Yuxia Luo
Recently, I have been working on getting statistics for Hive's partitioned tables[1], and I would like to share my experience as a developer. I have to admit that ndv really confused me at first glance, but I could easily find what it means with a web search using a keyword like "ndv statistic".

Re: [DISCUSS] suggest using granularityNumber in ColumnStats

2022-06-04 Thread Becket Qin
Hi Jing, Hmm, granularity and ndv still don't seem to mean the same thing to me. Granularity basically means how detailed the data is, in other words, whether a field / column can be further divided. For example, a field like "age" cannot be further divided so it is quite granular. In contrast, an

Re: [DISCUSS] suggest using granularityNumber in ColumnStats

2022-06-03 Thread Jark Wu
Hi Jing, I agree with you that "NDV is more SQL-oriented(implementation) and granularity is more data analytics-oriented". As you said, "granularity" may be commonly used for data modeling and business-related. However, TableStats is not used for data modeling but is an implementation detail for

Re: [DISCUSS] suggest using granularityNumber in ColumnStats

2022-06-02 Thread Jing Ge
Thanks all for your feedback! It is very informative. to Becket: At the beginning, I chose the same word because we used it in daily work. Before I started this discussion, to make sure it is the right one, I did some checking and it turns out that *cardinality* has a very different (also very

Re: [DISCUSS] suggest using granularityNumber in ColumnStats

2022-06-02 Thread Jingsong Li
Hi, +1, NDV (number of distinct values) is a widely used term in table statistics. I've also seen it called `distinctCount`. This name can be found in databases like Oracle too. [1] So it is not good to change to a completely different name. [1]

Re: [DISCUSS] suggest using granularityNumber in ColumnStats

2022-06-02 Thread luoyu...@alumni.sjtu.edu.cn
rg/jira/browse/FLINK-27597 From: Jing Ge Date: 2022-06-02 00:21 Subject: [DISCUSS] suggest using granularityNumber in ColumnStats To: dev Hi Dev, I am not really sure if it is feasible to start this discussion. According to the contribution guidelines, dev ml is the right place to reach consensus. In Co

Re: [DISCUSS] suggest using granularityNumber in ColumnStats

2022-06-02 Thread Jark Wu
Hi Jing, I can see there might be developers who don't understand the meaning at first glance. However, NDV is a widely used term in table statistics, see [1][2][3]. If we use another name, it may confuse developers who are familiar with stats and optimization. I think at least, the

Re: [DISCUSS] suggest using granularityNumber in ColumnStats

2022-06-02 Thread Becket Qin
Hi Jing, While I do agree that NDV is a little confusing at first sight, it seems quite concise once I got the meaning. So personally I am OK with keeping it as is, but proper documentation would be helpful. If we really want to replace it with a more professional name, *cardinality* might be a

[DISCUSS] suggest using granularityNumber in ColumnStats

2022-06-01 Thread Jing Ge
Hi Dev, I am not really sure if it is feasible to start this discussion. According to the contribution guidelines, dev ml is the right place to reach consensus. In ColumnStats, ndv, which stands for "number of distinct values", is currently used. First of all, it is difficult to understand the
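
For readers unfamiliar with the term under discussion, here is a minimal Java sketch of what "number of distinct values" measures for a single column. The class `NdvSketch` and the method `ndv` are hypothetical names made up for this illustration and are not part of Flink's ColumnStats API. Skipping nulls is also an assumption of this sketch, not a statement about how Flink or Hive count them.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NdvSketch {

    /**
     * Counts the distinct non-null values in a column.
     * Hypothetical helper for illustration only; null handling is an assumption.
     */
    public static long ndv(List<?> column) {
        Set<Object> seen = new HashSet<>();
        for (Object value : column) {
            if (value != null) {
                seen.add(value); // duplicates are ignored by the set
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Six rows, but only three distinct non-null ages: 30, 41, 25.
        List<Integer> ages = Arrays.asList(30, 41, 30, null, 25, 41);
        System.out.println("rows=" + ages.size() + " ndv=" + ndv(ages));
    }
}
```

The row count (6) and the NDV (3) differ here, which is the distinction the statistic is meant to capture for an optimizer.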