[jira] [Updated] (HIVE-20660) Group by statistics estimation could be improved by bounding the total number of rows to source table

2018-10-14 Thread Vineet Garg (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Garg updated HIVE-20660:
---
   Resolution: Fixed
Fix Version/s: 4.0.0
   Status: Resolved  (was: Patch Available)

Pushed to master.

> Group by statistics estimation could be improved by bounding the total number 
> of rows to source table
> -
>
> Key: HIVE-20660
> URL: https://issues.apache.org/jira/browse/HIVE-20660
> Project: Hive
>  Issue Type: Improvement
>  Components: Statistics
>Affects Versions: 4.0.0
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-20660.1.patch, HIVE-20660.2.patch, 
> HIVE-20660.3.patch, HIVE-20660.4.patch, HIVE-20660.5.patch, 
> HIVE-20660.6.patch, HIVE-20660.7.patch
>
>
> Currently the stats for group by is estimated by taking product of NDVs of 
> all the keys and bounding it by the number of rows of its input. This bound 
> could be improved by using the source table instead of immediate input, the 
> insight in this case is that cardinality/ndvs of a table can not go beyond 
> the original (outer joins will only add NULLs thereby increasing the 
> cardinality by 1).
> Note that the assumption here is that group by keys all belong to the same 
> source table/input.
> This will improve the estimation in situations where group by is executed 
> after joins wherein Hive could end up estimating the number of rows.
> *Reproducer*
> {code:sql}
> set hive.stats.fetch.column.stats=true;
> create table t1(i int, j int);
> alter table t1 update statistics set('numRows'='1', 
> 'rawDataSize'='18000');
> alter table t1 update statistics for column i 
> set('numDVs'='2500','numNulls'='50','highValue'='1000','lowValue'='0');
> alter table t1 update statistics for column j 
> set('numDVs'='500','numNulls'='30','highValue'='100','lowValue'='50');
> create table t2(i2 int, j2 int);
> alter table t2 update statistics set('numRows'='1', 
> 'rawDataSize'='1');
> alter table t2 update statistics for column i2 
> set('numDVs'='1000','numNulls'='0','highValue'='8000','lowValue'='0');
> alter table t2 update statistics for column j2 
> set('numDVs'='10','numNulls'='0','highValue'='800','lowValue'='-1');
> explain select count (1) from t1,t2
>   where t1.j=t2.i2 
> group by t1.i, t1.j;
> {code}
> {code:sql}
> Reducer 2
> Reduce Operator Tree:
>   Merge Join Operator
> condition map:
>  Inner Join 0 to 1
> keys:
>   0 _col1 (type: int)
>   1 _col0 (type: int)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 99700 Data size: 797288 Basic stats: 
> COMPLETE Column stats: COMPLETE
> Group By Operator
>   aggregations: count()
>   keys: _col0 (type: int), _col1 (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1, _col2
>   Statistics: Num rows: 49850 Data size: 797448 Basic stats: 
> COMPLETE Column stats: COMPLETE <==
>   Reduce Output Operator
> key expressions: _col0 (type: int), _col1 (type: int)
> sort order: ++
> Map-reduce partition columns: _col0 (type: int), _col1 
> (type: int)
> Statistics: Num rows: 49850 Data size: 797448 Basic 
> stats: COMPLETE Column stats: COMPLETE
> value expressions: _col2 (type: bigint)
>   .
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20660) Group by statistics estimation could be improved by bounding the total number of rows to source table

2018-10-14 Thread Vineet Garg (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Garg updated HIVE-20660:
---
Status: Patch Available  (was: Open)

> Group by statistics estimation could be improved by bounding the total number 
> of rows to source table
> -
>
> Key: HIVE-20660
> URL: https://issues.apache.org/jira/browse/HIVE-20660
> Project: Hive
>  Issue Type: Improvement
>  Components: Statistics
>Affects Versions: 4.0.0
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
> Attachments: HIVE-20660.1.patch, HIVE-20660.2.patch, 
> HIVE-20660.3.patch, HIVE-20660.4.patch, HIVE-20660.5.patch, 
> HIVE-20660.6.patch, HIVE-20660.7.patch
>
>
> Currently the stats for group by is estimated by taking product of NDVs of 
> all the keys and bounding it by the number of rows of its input. This bound 
> could be improved by using the source table instead of immediate input, the 
> insight in this case is that cardinality/ndvs of a table can not go beyond 
> the original (outer joins will only add NULLs thereby increasing the 
> cardinality by 1).
> Note that the assumption here is that group by keys all belong to the same 
> source table/input.
> This will improve the estimation in situations where group by is executed 
> after joins wherein Hive could end up estimating the number of rows.
> *Reproducer*
> {code:sql}
> set hive.stats.fetch.column.stats=true;
> create table t1(i int, j int);
> alter table t1 update statistics set('numRows'='1', 
> 'rawDataSize'='18000');
> alter table t1 update statistics for column i 
> set('numDVs'='2500','numNulls'='50','highValue'='1000','lowValue'='0');
> alter table t1 update statistics for column j 
> set('numDVs'='500','numNulls'='30','highValue'='100','lowValue'='50');
> create table t2(i2 int, j2 int);
> alter table t2 update statistics set('numRows'='1', 
> 'rawDataSize'='1');
> alter table t2 update statistics for column i2 
> set('numDVs'='1000','numNulls'='0','highValue'='8000','lowValue'='0');
> alter table t2 update statistics for column j2 
> set('numDVs'='10','numNulls'='0','highValue'='800','lowValue'='-1');
> explain select count (1) from t1,t2
>   where t1.j=t2.i2 
> group by t1.i, t1.j;
> {code}
> {code:sql}
> Reducer 2
> Reduce Operator Tree:
>   Merge Join Operator
> condition map:
>  Inner Join 0 to 1
> keys:
>   0 _col1 (type: int)
>   1 _col0 (type: int)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 99700 Data size: 797288 Basic stats: 
> COMPLETE Column stats: COMPLETE
> Group By Operator
>   aggregations: count()
>   keys: _col0 (type: int), _col1 (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1, _col2
>   Statistics: Num rows: 49850 Data size: 797448 Basic stats: 
> COMPLETE Column stats: COMPLETE <==
>   Reduce Output Operator
> key expressions: _col0 (type: int), _col1 (type: int)
> sort order: ++
> Map-reduce partition columns: _col0 (type: int), _col1 
> (type: int)
> Statistics: Num rows: 49850 Data size: 797448 Basic 
> stats: COMPLETE Column stats: COMPLETE
> value expressions: _col2 (type: bigint)
>   .
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20660) Group by statistics estimation could be improved by bounding the total number of rows to source table

2018-10-14 Thread Vineet Garg (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Garg updated HIVE-20660:
---
Status: Open  (was: Patch Available)

> Group by statistics estimation could be improved by bounding the total number 
> of rows to source table
> -
>
> Key: HIVE-20660
> URL: https://issues.apache.org/jira/browse/HIVE-20660
> Project: Hive
>  Issue Type: Improvement
>  Components: Statistics
>Affects Versions: 4.0.0
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
> Attachments: HIVE-20660.1.patch, HIVE-20660.2.patch, 
> HIVE-20660.3.patch, HIVE-20660.4.patch, HIVE-20660.5.patch, 
> HIVE-20660.6.patch, HIVE-20660.7.patch
>
>
> Currently the stats for group by is estimated by taking product of NDVs of 
> all the keys and bounding it by the number of rows of its input. This bound 
> could be improved by using the source table instead of immediate input, the 
> insight in this case is that cardinality/ndvs of a table can not go beyond 
> the original (outer joins will only add NULLs thereby increasing the 
> cardinality by 1).
> Note that the assumption here is that group by keys all belong to the same 
> source table/input.
> This will improve the estimation in situations where group by is executed 
> after joins wherein Hive could end up estimating the number of rows.
> *Reproducer*
> {code:sql}
> set hive.stats.fetch.column.stats=true;
> create table t1(i int, j int);
> alter table t1 update statistics set('numRows'='1', 
> 'rawDataSize'='18000');
> alter table t1 update statistics for column i 
> set('numDVs'='2500','numNulls'='50','highValue'='1000','lowValue'='0');
> alter table t1 update statistics for column j 
> set('numDVs'='500','numNulls'='30','highValue'='100','lowValue'='50');
> create table t2(i2 int, j2 int);
> alter table t2 update statistics set('numRows'='1', 
> 'rawDataSize'='1');
> alter table t2 update statistics for column i2 
> set('numDVs'='1000','numNulls'='0','highValue'='8000','lowValue'='0');
> alter table t2 update statistics for column j2 
> set('numDVs'='10','numNulls'='0','highValue'='800','lowValue'='-1');
> explain select count (1) from t1,t2
>   where t1.j=t2.i2 
> group by t1.i, t1.j;
> {code}
> {code:sql}
> Reducer 2
> Reduce Operator Tree:
>   Merge Join Operator
> condition map:
>  Inner Join 0 to 1
> keys:
>   0 _col1 (type: int)
>   1 _col0 (type: int)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 99700 Data size: 797288 Basic stats: 
> COMPLETE Column stats: COMPLETE
> Group By Operator
>   aggregations: count()
>   keys: _col0 (type: int), _col1 (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1, _col2
>   Statistics: Num rows: 49850 Data size: 797448 Basic stats: 
> COMPLETE Column stats: COMPLETE <==
>   Reduce Output Operator
> key expressions: _col0 (type: int), _col1 (type: int)
> sort order: ++
> Map-reduce partition columns: _col0 (type: int), _col1 
> (type: int)
> Statistics: Num rows: 49850 Data size: 797448 Basic 
> stats: COMPLETE Column stats: COMPLETE
> value expressions: _col2 (type: bigint)
>   .
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20660) Group by statistics estimation could be improved by bounding the total number of rows to source table

2018-10-14 Thread Vineet Garg (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Garg updated HIVE-20660:
---
Attachment: HIVE-20660.7.patch

> Group by statistics estimation could be improved by bounding the total number 
> of rows to source table
> -
>
> Key: HIVE-20660
> URL: https://issues.apache.org/jira/browse/HIVE-20660
> Project: Hive
>  Issue Type: Improvement
>  Components: Statistics
>Affects Versions: 4.0.0
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
> Attachments: HIVE-20660.1.patch, HIVE-20660.2.patch, 
> HIVE-20660.3.patch, HIVE-20660.4.patch, HIVE-20660.5.patch, 
> HIVE-20660.6.patch, HIVE-20660.7.patch
>
>
> Currently the stats for group by is estimated by taking product of NDVs of 
> all the keys and bounding it by the number of rows of its input. This bound 
> could be improved by using the source table instead of immediate input, the 
> insight in this case is that cardinality/ndvs of a table can not go beyond 
> the original (outer joins will only add NULLs thereby increasing the 
> cardinality by 1).
> Note that the assumption here is that group by keys all belong to the same 
> source table/input.
> This will improve the estimation in situations where group by is executed 
> after joins wherein Hive could end up estimating the number of rows.
> *Reproducer*
> {code:sql}
> set hive.stats.fetch.column.stats=true;
> create table t1(i int, j int);
> alter table t1 update statistics set('numRows'='1', 
> 'rawDataSize'='18000');
> alter table t1 update statistics for column i 
> set('numDVs'='2500','numNulls'='50','highValue'='1000','lowValue'='0');
> alter table t1 update statistics for column j 
> set('numDVs'='500','numNulls'='30','highValue'='100','lowValue'='50');
> create table t2(i2 int, j2 int);
> alter table t2 update statistics set('numRows'='1', 
> 'rawDataSize'='1');
> alter table t2 update statistics for column i2 
> set('numDVs'='1000','numNulls'='0','highValue'='8000','lowValue'='0');
> alter table t2 update statistics for column j2 
> set('numDVs'='10','numNulls'='0','highValue'='800','lowValue'='-1');
> explain select count (1) from t1,t2
>   where t1.j=t2.i2 
> group by t1.i, t1.j;
> {code}
> {code:sql}
> Reducer 2
> Reduce Operator Tree:
>   Merge Join Operator
> condition map:
>  Inner Join 0 to 1
> keys:
>   0 _col1 (type: int)
>   1 _col0 (type: int)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 99700 Data size: 797288 Basic stats: 
> COMPLETE Column stats: COMPLETE
> Group By Operator
>   aggregations: count()
>   keys: _col0 (type: int), _col1 (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1, _col2
>   Statistics: Num rows: 49850 Data size: 797448 Basic stats: 
> COMPLETE Column stats: COMPLETE <==
>   Reduce Output Operator
> key expressions: _col0 (type: int), _col1 (type: int)
> sort order: ++
> Map-reduce partition columns: _col0 (type: int), _col1 
> (type: int)
> Statistics: Num rows: 49850 Data size: 797448 Basic 
> stats: COMPLETE Column stats: COMPLETE
> value expressions: _col2 (type: bigint)
>   .
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20660) Group by statistics estimation could be improved by bounding the total number of rows to source table

2018-10-09 Thread Vineet Garg (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Garg updated HIVE-20660:
---
Status: Patch Available  (was: Open)

> Group by statistics estimation could be improved by bounding the total number 
> of rows to source table
> -
>
> Key: HIVE-20660
> URL: https://issues.apache.org/jira/browse/HIVE-20660
> Project: Hive
>  Issue Type: Improvement
>  Components: Statistics
>Affects Versions: 4.0.0
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
> Attachments: HIVE-20660.1.patch, HIVE-20660.2.patch, 
> HIVE-20660.3.patch, HIVE-20660.4.patch, HIVE-20660.5.patch, HIVE-20660.6.patch
>
>
> Currently the stats for group by is estimated by taking product of NDVs of 
> all the keys and bounding it by the number of rows of its input. This bound 
> could be improved by using the source table instead of immediate input, the 
> insight in this case is that cardinality/ndvs of a table can not go beyond 
> the original (outer joins will only add NULLs thereby increasing the 
> cardinality by 1).
> Note that the assumption here is that group by keys all belong to the same 
> source table/input.
> This will improve the estimation in situations where group by is executed 
> after joins wherein Hive could end up estimating the number of rows.
> *Reproducer*
> {code:sql}
> set hive.stats.fetch.column.stats=true;
> create table t1(i int, j int);
> alter table t1 update statistics set('numRows'='1', 
> 'rawDataSize'='18000');
> alter table t1 update statistics for column i 
> set('numDVs'='2500','numNulls'='50','highValue'='1000','lowValue'='0');
> alter table t1 update statistics for column j 
> set('numDVs'='500','numNulls'='30','highValue'='100','lowValue'='50');
> create table t2(i2 int, j2 int);
> alter table t2 update statistics set('numRows'='1', 
> 'rawDataSize'='1');
> alter table t2 update statistics for column i2 
> set('numDVs'='1000','numNulls'='0','highValue'='8000','lowValue'='0');
> alter table t2 update statistics for column j2 
> set('numDVs'='10','numNulls'='0','highValue'='800','lowValue'='-1');
> explain select count (1) from t1,t2
>   where t1.j=t2.i2 
> group by t1.i, t1.j;
> {code}
> {code:sql}
> Reducer 2
> Reduce Operator Tree:
>   Merge Join Operator
> condition map:
>  Inner Join 0 to 1
> keys:
>   0 _col1 (type: int)
>   1 _col0 (type: int)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 99700 Data size: 797288 Basic stats: 
> COMPLETE Column stats: COMPLETE
> Group By Operator
>   aggregations: count()
>   keys: _col0 (type: int), _col1 (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1, _col2
>   Statistics: Num rows: 49850 Data size: 797448 Basic stats: 
> COMPLETE Column stats: COMPLETE <==
>   Reduce Output Operator
> key expressions: _col0 (type: int), _col1 (type: int)
> sort order: ++
> Map-reduce partition columns: _col0 (type: int), _col1 
> (type: int)
> Statistics: Num rows: 49850 Data size: 797448 Basic 
> stats: COMPLETE Column stats: COMPLETE
> value expressions: _col2 (type: bigint)
>   .
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20660) Group by statistics estimation could be improved by bounding the total number of rows to source table

2018-10-09 Thread Vineet Garg (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Garg updated HIVE-20660:
---
Attachment: HIVE-20660.6.patch

> Group by statistics estimation could be improved by bounding the total number 
> of rows to source table
> -
>
> Key: HIVE-20660
> URL: https://issues.apache.org/jira/browse/HIVE-20660
> Project: Hive
>  Issue Type: Improvement
>  Components: Statistics
>Affects Versions: 4.0.0
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
> Attachments: HIVE-20660.1.patch, HIVE-20660.2.patch, 
> HIVE-20660.3.patch, HIVE-20660.4.patch, HIVE-20660.5.patch, HIVE-20660.6.patch
>
>
> Currently the stats for group by is estimated by taking product of NDVs of 
> all the keys and bounding it by the number of rows of its input. This bound 
> could be improved by using the source table instead of immediate input, the 
> insight in this case is that cardinality/ndvs of a table can not go beyond 
> the original (outer joins will only add NULLs thereby increasing the 
> cardinality by 1).
> Note that the assumption here is that group by keys all belong to the same 
> source table/input.
> This will improve the estimation in situations where group by is executed 
> after joins wherein Hive could end up estimating the number of rows.
> *Reproducer*
> {code:sql}
> set hive.stats.fetch.column.stats=true;
> create table t1(i int, j int);
> alter table t1 update statistics set('numRows'='1', 
> 'rawDataSize'='18000');
> alter table t1 update statistics for column i 
> set('numDVs'='2500','numNulls'='50','highValue'='1000','lowValue'='0');
> alter table t1 update statistics for column j 
> set('numDVs'='500','numNulls'='30','highValue'='100','lowValue'='50');
> create table t2(i2 int, j2 int);
> alter table t2 update statistics set('numRows'='1', 
> 'rawDataSize'='1');
> alter table t2 update statistics for column i2 
> set('numDVs'='1000','numNulls'='0','highValue'='8000','lowValue'='0');
> alter table t2 update statistics for column j2 
> set('numDVs'='10','numNulls'='0','highValue'='800','lowValue'='-1');
> explain select count (1) from t1,t2
>   where t1.j=t2.i2 
> group by t1.i, t1.j;
> {code}
> {code:sql}
> Reducer 2
> Reduce Operator Tree:
>   Merge Join Operator
> condition map:
>  Inner Join 0 to 1
> keys:
>   0 _col1 (type: int)
>   1 _col0 (type: int)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 99700 Data size: 797288 Basic stats: 
> COMPLETE Column stats: COMPLETE
> Group By Operator
>   aggregations: count()
>   keys: _col0 (type: int), _col1 (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1, _col2
>   Statistics: Num rows: 49850 Data size: 797448 Basic stats: 
> COMPLETE Column stats: COMPLETE <==
>   Reduce Output Operator
> key expressions: _col0 (type: int), _col1 (type: int)
> sort order: ++
> Map-reduce partition columns: _col0 (type: int), _col1 
> (type: int)
> Statistics: Num rows: 49850 Data size: 797448 Basic 
> stats: COMPLETE Column stats: COMPLETE
> value expressions: _col2 (type: bigint)
>   .
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20660) Group by statistics estimation could be improved by bounding the total number of rows to source table

2018-10-09 Thread Vineet Garg (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Garg updated HIVE-20660:
---
Status: Open  (was: Patch Available)

> Group by statistics estimation could be improved by bounding the total number 
> of rows to source table
> -
>
> Key: HIVE-20660
> URL: https://issues.apache.org/jira/browse/HIVE-20660
> Project: Hive
>  Issue Type: Improvement
>  Components: Statistics
>Affects Versions: 4.0.0
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
> Attachments: HIVE-20660.1.patch, HIVE-20660.2.patch, 
> HIVE-20660.3.patch, HIVE-20660.4.patch, HIVE-20660.5.patch, HIVE-20660.6.patch
>
>
> Currently the stats for group by is estimated by taking product of NDVs of 
> all the keys and bounding it by the number of rows of its input. This bound 
> could be improved by using the source table instead of immediate input, the 
> insight in this case is that cardinality/ndvs of a table can not go beyond 
> the original (outer joins will only add NULLs thereby increasing the 
> cardinality by 1).
> Note that the assumption here is that group by keys all belong to the same 
> source table/input.
> This will improve the estimation in situations where group by is executed 
> after joins wherein Hive could end up estimating the number of rows.
> *Reproducer*
> {code:sql}
> set hive.stats.fetch.column.stats=true;
> create table t1(i int, j int);
> alter table t1 update statistics set('numRows'='1', 
> 'rawDataSize'='18000');
> alter table t1 update statistics for column i 
> set('numDVs'='2500','numNulls'='50','highValue'='1000','lowValue'='0');
> alter table t1 update statistics for column j 
> set('numDVs'='500','numNulls'='30','highValue'='100','lowValue'='50');
> create table t2(i2 int, j2 int);
> alter table t2 update statistics set('numRows'='1', 
> 'rawDataSize'='1');
> alter table t2 update statistics for column i2 
> set('numDVs'='1000','numNulls'='0','highValue'='8000','lowValue'='0');
> alter table t2 update statistics for column j2 
> set('numDVs'='10','numNulls'='0','highValue'='800','lowValue'='-1');
> explain select count (1) from t1,t2
>   where t1.j=t2.i2 
> group by t1.i, t1.j;
> {code}
> {code:sql}
> Reducer 2
> Reduce Operator Tree:
>   Merge Join Operator
> condition map:
>  Inner Join 0 to 1
> keys:
>   0 _col1 (type: int)
>   1 _col0 (type: int)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 99700 Data size: 797288 Basic stats: 
> COMPLETE Column stats: COMPLETE
> Group By Operator
>   aggregations: count()
>   keys: _col0 (type: int), _col1 (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1, _col2
>   Statistics: Num rows: 49850 Data size: 797448 Basic stats: 
> COMPLETE Column stats: COMPLETE <==
>   Reduce Output Operator
> key expressions: _col0 (type: int), _col1 (type: int)
> sort order: ++
> Map-reduce partition columns: _col0 (type: int), _col1 
> (type: int)
> Statistics: Num rows: 49850 Data size: 797448 Basic 
> stats: COMPLETE Column stats: COMPLETE
> value expressions: _col2 (type: bigint)
>   .
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20660) Group by statistics estimation could be improved by bounding the total number of rows to source table

2018-10-06 Thread Vineet Garg (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Garg updated HIVE-20660:
---
Attachment: HIVE-20660.5.patch

> Group by statistics estimation could be improved by bounding the total number 
> of rows to source table
> -
>
> Key: HIVE-20660
> URL: https://issues.apache.org/jira/browse/HIVE-20660
> Project: Hive
>  Issue Type: Improvement
>  Components: Statistics
>Affects Versions: 4.0.0
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
> Attachments: HIVE-20660.1.patch, HIVE-20660.2.patch, 
> HIVE-20660.3.patch, HIVE-20660.4.patch, HIVE-20660.5.patch
>
>
> Currently the stats for group by is estimated by taking product of NDVs of 
> all the keys and bounding it by the number of rows of its input. This bound 
> could be improved by using the source table instead of immediate input, the 
> insight in this case is that cardinality/ndvs of a table can not go beyond 
> the original (outer joins will only add NULLs thereby increasing the 
> cardinality by 1).
> Note that the assumption here is that group by keys all belong to the same 
> source table/input.
> This will improve the estimation in situations where group by is executed 
> after joins wherein Hive could end up estimating the number of rows.
> *Reproducer*
> {code:sql}
> set hive.stats.fetch.column.stats=true;
> create table t1(i int, j int);
> alter table t1 update statistics set('numRows'='1', 
> 'rawDataSize'='18000');
> alter table t1 update statistics for column i 
> set('numDVs'='2500','numNulls'='50','highValue'='1000','lowValue'='0');
> alter table t1 update statistics for column j 
> set('numDVs'='500','numNulls'='30','highValue'='100','lowValue'='50');
> create table t2(i2 int, j2 int);
> alter table t2 update statistics set('numRows'='1', 
> 'rawDataSize'='1');
> alter table t2 update statistics for column i2 
> set('numDVs'='1000','numNulls'='0','highValue'='8000','lowValue'='0');
> alter table t2 update statistics for column j2 
> set('numDVs'='10','numNulls'='0','highValue'='800','lowValue'='-1');
> explain select count (1) from t1,t2
>   where t1.j=t2.i2 
> group by t1.i, t1.j;
> {code}
> {code:sql}
> Reducer 2
> Reduce Operator Tree:
>   Merge Join Operator
> condition map:
>  Inner Join 0 to 1
> keys:
>   0 _col1 (type: int)
>   1 _col0 (type: int)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 99700 Data size: 797288 Basic stats: 
> COMPLETE Column stats: COMPLETE
> Group By Operator
>   aggregations: count()
>   keys: _col0 (type: int), _col1 (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1, _col2
>   Statistics: Num rows: 49850 Data size: 797448 Basic stats: 
> COMPLETE Column stats: COMPLETE <==
>   Reduce Output Operator
> key expressions: _col0 (type: int), _col1 (type: int)
> sort order: ++
> Map-reduce partition columns: _col0 (type: int), _col1 
> (type: int)
> Statistics: Num rows: 49850 Data size: 797448 Basic 
> stats: COMPLETE Column stats: COMPLETE
> value expressions: _col2 (type: bigint)
>   .
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20660) Group by statistics estimation could be improved by bounding the total number of rows to source table

2018-10-06 Thread Vineet Garg (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Garg updated HIVE-20660:
---
Status: Patch Available  (was: Open)

> Group by statistics estimation could be improved by bounding the total number 
> of rows to source table
> -
>
> Key: HIVE-20660
> URL: https://issues.apache.org/jira/browse/HIVE-20660
> Project: Hive
>  Issue Type: Improvement
>  Components: Statistics
>Affects Versions: 4.0.0
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
> Attachments: HIVE-20660.1.patch, HIVE-20660.2.patch, 
> HIVE-20660.3.patch, HIVE-20660.4.patch, HIVE-20660.5.patch
>
>
> Currently the stats for group by is estimated by taking product of NDVs of 
> all the keys and bounding it by the number of rows of its input. This bound 
> could be improved by using the source table instead of immediate input, the 
> insight in this case is that cardinality/ndvs of a table can not go beyond 
> the original (outer joins will only add NULLs thereby increasing the 
> cardinality by 1).
> Note that the assumption here is that group by keys all belong to the same 
> source table/input.
> This will improve the estimation in situations where group by is executed 
> after joins wherein Hive could end up estimating the number of rows.
> *Reproducer*
> {code:sql}
> set hive.stats.fetch.column.stats=true;
> create table t1(i int, j int);
> alter table t1 update statistics set('numRows'='1', 
> 'rawDataSize'='18000');
> alter table t1 update statistics for column i 
> set('numDVs'='2500','numNulls'='50','highValue'='1000','lowValue'='0');
> alter table t1 update statistics for column j 
> set('numDVs'='500','numNulls'='30','highValue'='100','lowValue'='50');
> create table t2(i2 int, j2 int);
> alter table t2 update statistics set('numRows'='1', 
> 'rawDataSize'='1');
> alter table t2 update statistics for column i2 
> set('numDVs'='1000','numNulls'='0','highValue'='8000','lowValue'='0');
> alter table t2 update statistics for column j2 
> set('numDVs'='10','numNulls'='0','highValue'='800','lowValue'='-1');
> explain select count (1) from t1,t2
>   where t1.j=t2.i2 
> group by t1.i, t1.j;
> {code}
> {code:sql}
> Reducer 2
> Reduce Operator Tree:
>   Merge Join Operator
> condition map:
>  Inner Join 0 to 1
> keys:
>   0 _col1 (type: int)
>   1 _col0 (type: int)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 99700 Data size: 797288 Basic stats: 
> COMPLETE Column stats: COMPLETE
> Group By Operator
>   aggregations: count()
>   keys: _col0 (type: int), _col1 (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1, _col2
>   Statistics: Num rows: 49850 Data size: 797448 Basic stats: 
> COMPLETE Column stats: COMPLETE <==
>   Reduce Output Operator
> key expressions: _col0 (type: int), _col1 (type: int)
> sort order: ++
> Map-reduce partition columns: _col0 (type: int), _col1 
> (type: int)
> Statistics: Num rows: 49850 Data size: 797448 Basic 
> stats: COMPLETE Column stats: COMPLETE
> value expressions: _col2 (type: bigint)
>   .
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20660) Group by statistics estimation could be improved by bounding the total number of rows to source table

2018-10-06 Thread Vineet Garg (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Garg updated HIVE-20660:
---
Status: Open  (was: Patch Available)

> Group by statistics estimation could be improved by bounding the total number 
> of rows to source table
> -
>
> Key: HIVE-20660
> URL: https://issues.apache.org/jira/browse/HIVE-20660
> Project: Hive
>  Issue Type: Improvement
>  Components: Statistics
>Affects Versions: 4.0.0
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
> Attachments: HIVE-20660.1.patch, HIVE-20660.2.patch, 
> HIVE-20660.3.patch, HIVE-20660.4.patch
>
>
> Currently the stats for group by is estimated by taking product of NDVs of 
> all the keys and bounding it by the number of rows of its input. This bound 
> could be improved by using the source table instead of immediate input, the 
> insight in this case is that cardinality/ndvs of a table can not go beyond 
> the original (outer joins will only add NULLs thereby increasing the 
> cardinality by 1).
> Note that the assumption here is that group by keys all belong to the same 
> source table/input.
> This will improve the estimation in situations where group by is executed 
> after joins wherein Hive could end up estimating the number of rows.
> *Reproducer*
> {code:sql}
> set hive.stats.fetch.column.stats=true;
> create table t1(i int, j int);
> alter table t1 update statistics set('numRows'='1', 
> 'rawDataSize'='18000');
> alter table t1 update statistics for column i 
> set('numDVs'='2500','numNulls'='50','highValue'='1000','lowValue'='0');
> alter table t1 update statistics for column j 
> set('numDVs'='500','numNulls'='30','highValue'='100','lowValue'='50');
> create table t2(i2 int, j2 int);
> alter table t2 update statistics set('numRows'='1', 
> 'rawDataSize'='1');
> alter table t2 update statistics for column i2 
> set('numDVs'='1000','numNulls'='0','highValue'='8000','lowValue'='0');
> alter table t2 update statistics for column j2 
> set('numDVs'='10','numNulls'='0','highValue'='800','lowValue'='-1');
> explain select count (1) from t1,t2
>   where t1.j=t2.i2 
> group by t1.i, t1.j;
> {code}
> {code:sql}
> Reducer 2
> Reduce Operator Tree:
>   Merge Join Operator
> condition map:
>  Inner Join 0 to 1
> keys:
>   0 _col1 (type: int)
>   1 _col0 (type: int)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 99700 Data size: 797288 Basic stats: 
> COMPLETE Column stats: COMPLETE
> Group By Operator
>   aggregations: count()
>   keys: _col0 (type: int), _col1 (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1, _col2
>   Statistics: Num rows: 49850 Data size: 797448 Basic stats: 
> COMPLETE Column stats: COMPLETE <==
>   Reduce Output Operator
> key expressions: _col0 (type: int), _col1 (type: int)
> sort order: ++
> Map-reduce partition columns: _col0 (type: int), _col1 
> (type: int)
> Statistics: Num rows: 49850 Data size: 797448 Basic 
> stats: COMPLETE Column stats: COMPLETE
> value expressions: _col2 (type: bigint)
>   .
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20660) Group by statistics estimation could be improved by bounding the total number of rows to source table

2018-10-04 Thread Vineet Garg (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Garg updated HIVE-20660:
---
Attachment: HIVE-20660.4.patch

> Group by statistics estimation could be improved by bounding the total number 
> of rows to source table
> -
>
> Key: HIVE-20660
> URL: https://issues.apache.org/jira/browse/HIVE-20660
> Project: Hive
>  Issue Type: Improvement
>  Components: Statistics
>Affects Versions: 4.0.0
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
> Attachments: HIVE-20660.1.patch, HIVE-20660.2.patch, 
> HIVE-20660.3.patch, HIVE-20660.4.patch
>
>
> Currently the stats for group by is estimated by taking product of NDVs of 
> all the keys and bounding it by the number of rows of its input. This bound 
> could be improved by using the source table instead of immediate input, the 
> insight in this case is that cardinality/ndvs of a table can not go beyond 
> the original (outer joins will only add NULLs thereby increasing the 
> cardinality by 1).
> Note that the assumption here is that group by keys all belong to the same 
> source table/input.
> This will improve the estimation in situations where group by is executed 
> after joins wherein Hive could end up estimating the number of rows.
> *Reproducer*
> {code:sql}
> set hive.stats.fetch.column.stats=true;
> create table t1(i int, j int);
> alter table t1 update statistics set('numRows'='1', 
> 'rawDataSize'='18000');
> alter table t1 update statistics for column i 
> set('numDVs'='2500','numNulls'='50','highValue'='1000','lowValue'='0');
> alter table t1 update statistics for column j 
> set('numDVs'='500','numNulls'='30','highValue'='100','lowValue'='50');
> create table t2(i2 int, j2 int);
> alter table t2 update statistics set('numRows'='1', 
> 'rawDataSize'='1');
> alter table t2 update statistics for column i2 
> set('numDVs'='1000','numNulls'='0','highValue'='8000','lowValue'='0');
> alter table t2 update statistics for column j2 
> set('numDVs'='10','numNulls'='0','highValue'='800','lowValue'='-1');
> explain select count (1) from t1,t2
>   where t1.j=t2.i2 
> group by t1.i, t1.j;
> {code}
> {code:sql}
> Reducer 2
> Reduce Operator Tree:
>   Merge Join Operator
> condition map:
>  Inner Join 0 to 1
> keys:
>   0 _col1 (type: int)
>   1 _col0 (type: int)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 99700 Data size: 797288 Basic stats: 
> COMPLETE Column stats: COMPLETE
> Group By Operator
>   aggregations: count()
>   keys: _col0 (type: int), _col1 (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1, _col2
>   Statistics: Num rows: 49850 Data size: 797448 Basic stats: 
> COMPLETE Column stats: COMPLETE <==
>   Reduce Output Operator
> key expressions: _col0 (type: int), _col1 (type: int)
> sort order: ++
> Map-reduce partition columns: _col0 (type: int), _col1 
> (type: int)
> Statistics: Num rows: 49850 Data size: 797448 Basic 
> stats: COMPLETE Column stats: COMPLETE
> value expressions: _col2 (type: bigint)
>   .
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20660) Group by statistics estimation could be improved by bounding the total number of rows to source table

2018-10-04 Thread Vineet Garg (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Garg updated HIVE-20660:
---
Status: Open  (was: Patch Available)

> Group by statistics estimation could be improved by bounding the total number 
> of rows to source table
> -
>
> Key: HIVE-20660
> URL: https://issues.apache.org/jira/browse/HIVE-20660
> Project: Hive
>  Issue Type: Improvement
>  Components: Statistics
>Affects Versions: 4.0.0
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
> Attachments: HIVE-20660.1.patch, HIVE-20660.2.patch, 
> HIVE-20660.3.patch, HIVE-20660.4.patch
>
>
> Currently the stats for group by is estimated by taking product of NDVs of 
> all the keys and bounding it by the number of rows of its input. This bound 
> could be improved by using the source table instead of immediate input, the 
> insight in this case is that cardinality/ndvs of a table can not go beyond 
> the original (outer joins will only add NULLs thereby increasing the 
> cardinality by 1).
> Note that the assumption here is that group by keys all belong to the same 
> source table/input.
> This will improve the estimation in situations where group by is executed 
> after joins wherein Hive could end up estimating the number of rows.
> *Reproducer*
> {code:sql}
> set hive.stats.fetch.column.stats=true;
> create table t1(i int, j int);
> alter table t1 update statistics set('numRows'='1', 
> 'rawDataSize'='18000');
> alter table t1 update statistics for column i 
> set('numDVs'='2500','numNulls'='50','highValue'='1000','lowValue'='0');
> alter table t1 update statistics for column j 
> set('numDVs'='500','numNulls'='30','highValue'='100','lowValue'='50');
> create table t2(i2 int, j2 int);
> alter table t2 update statistics set('numRows'='1', 
> 'rawDataSize'='1');
> alter table t2 update statistics for column i2 
> set('numDVs'='1000','numNulls'='0','highValue'='8000','lowValue'='0');
> alter table t2 update statistics for column j2 
> set('numDVs'='10','numNulls'='0','highValue'='800','lowValue'='-1');
> explain select count (1) from t1,t2
>   where t1.j=t2.i2 
> group by t1.i, t1.j;
> {code}
> {code:sql}
> Reducer 2
> Reduce Operator Tree:
>   Merge Join Operator
> condition map:
>  Inner Join 0 to 1
> keys:
>   0 _col1 (type: int)
>   1 _col0 (type: int)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 99700 Data size: 797288 Basic stats: 
> COMPLETE Column stats: COMPLETE
> Group By Operator
>   aggregations: count()
>   keys: _col0 (type: int), _col1 (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1, _col2
>   Statistics: Num rows: 49850 Data size: 797448 Basic stats: 
> COMPLETE Column stats: COMPLETE <==
>   Reduce Output Operator
> key expressions: _col0 (type: int), _col1 (type: int)
> sort order: ++
> Map-reduce partition columns: _col0 (type: int), _col1 
> (type: int)
> Statistics: Num rows: 49850 Data size: 797448 Basic 
> stats: COMPLETE Column stats: COMPLETE
> value expressions: _col2 (type: bigint)
>   .
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20660) Group by statistics estimation could be improved by bounding the total number of rows to source table

2018-10-04 Thread Vineet Garg (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Garg updated HIVE-20660:
---
Status: Patch Available  (was: Open)

> Group by statistics estimation could be improved by bounding the total number 
> of rows to source table
> -
>
> Key: HIVE-20660
> URL: https://issues.apache.org/jira/browse/HIVE-20660
> Project: Hive
>  Issue Type: Improvement
>  Components: Statistics
>Affects Versions: 4.0.0
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
> Attachments: HIVE-20660.1.patch, HIVE-20660.2.patch, 
> HIVE-20660.3.patch, HIVE-20660.4.patch
>
>
> Currently the stats for group by is estimated by taking product of NDVs of 
> all the keys and bounding it by the number of rows of its input. This bound 
> could be improved by using the source table instead of immediate input, the 
> insight in this case is that cardinality/ndvs of a table can not go beyond 
> the original (outer joins will only add NULLs thereby increasing the 
> cardinality by 1).
> Note that the assumption here is that group by keys all belong to the same 
> source table/input.
> This will improve the estimation in situations where group by is executed 
> after joins wherein Hive could end up estimating the number of rows.
> *Reproducer*
> {code:sql}
> set hive.stats.fetch.column.stats=true;
> create table t1(i int, j int);
> alter table t1 update statistics set('numRows'='1', 
> 'rawDataSize'='18000');
> alter table t1 update statistics for column i 
> set('numDVs'='2500','numNulls'='50','highValue'='1000','lowValue'='0');
> alter table t1 update statistics for column j 
> set('numDVs'='500','numNulls'='30','highValue'='100','lowValue'='50');
> create table t2(i2 int, j2 int);
> alter table t2 update statistics set('numRows'='1', 
> 'rawDataSize'='1');
> alter table t2 update statistics for column i2 
> set('numDVs'='1000','numNulls'='0','highValue'='8000','lowValue'='0');
> alter table t2 update statistics for column j2 
> set('numDVs'='10','numNulls'='0','highValue'='800','lowValue'='-1');
> explain select count (1) from t1,t2
>   where t1.j=t2.i2 
> group by t1.i, t1.j;
> {code}
> {code:sql}
> Reducer 2
> Reduce Operator Tree:
>   Merge Join Operator
> condition map:
>  Inner Join 0 to 1
> keys:
>   0 _col1 (type: int)
>   1 _col0 (type: int)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 99700 Data size: 797288 Basic stats: 
> COMPLETE Column stats: COMPLETE
> Group By Operator
>   aggregations: count()
>   keys: _col0 (type: int), _col1 (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1, _col2
>   Statistics: Num rows: 49850 Data size: 797448 Basic stats: 
> COMPLETE Column stats: COMPLETE <==
>   Reduce Output Operator
> key expressions: _col0 (type: int), _col1 (type: int)
> sort order: ++
> Map-reduce partition columns: _col0 (type: int), _col1 
> (type: int)
> Statistics: Num rows: 49850 Data size: 797448 Basic 
> stats: COMPLETE Column stats: COMPLETE
> value expressions: _col2 (type: bigint)
>   .
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20660) Group by statistics estimation could be improved by bounding the total number of rows to source table

2018-10-03 Thread Vineet Garg (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Garg updated HIVE-20660:
---
Attachment: HIVE-20660.3.patch

> Group by statistics estimation could be improved by bounding the total number 
> of rows to source table
> -
>
> Key: HIVE-20660
> URL: https://issues.apache.org/jira/browse/HIVE-20660
> Project: Hive
>  Issue Type: Improvement
>  Components: Statistics
>Affects Versions: 4.0.0
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
> Attachments: HIVE-20660.1.patch, HIVE-20660.2.patch, 
> HIVE-20660.3.patch
>
>
> Currently the stats for group by is estimated by taking product of NDVs of 
> all the keys and bounding it by the number of rows of its input. This bound 
> could be improved by using the source table instead of immediate input, the 
> insight in this case is that cardinality/ndvs of a table can not go beyond 
> the original (outer joins will only add NULLs thereby increasing the 
> cardinality by 1).
> Note that the assumption here is that group by keys all belong to the same 
> source table/input.
> This will improve the estimation in situations where group by is executed 
> after joins wherein Hive could end up estimating the number of rows.
> *Reproducer*
> {code:sql}
> set hive.stats.fetch.column.stats=true;
> create table t1(i int, j int);
> alter table t1 update statistics set('numRows'='1', 
> 'rawDataSize'='18000');
> alter table t1 update statistics for column i 
> set('numDVs'='2500','numNulls'='50','highValue'='1000','lowValue'='0');
> alter table t1 update statistics for column j 
> set('numDVs'='500','numNulls'='30','highValue'='100','lowValue'='50');
> create table t2(i2 int, j2 int);
> alter table t2 update statistics set('numRows'='1', 
> 'rawDataSize'='1');
> alter table t2 update statistics for column i2 
> set('numDVs'='1000','numNulls'='0','highValue'='8000','lowValue'='0');
> alter table t2 update statistics for column j2 
> set('numDVs'='10','numNulls'='0','highValue'='800','lowValue'='-1');
> explain select count (1) from t1,t2
>   where t1.j=t2.i2 
> group by t1.i, t1.j;
> {code}
> {code:sql}
> Reducer 2
> Reduce Operator Tree:
>   Merge Join Operator
> condition map:
>  Inner Join 0 to 1
> keys:
>   0 _col1 (type: int)
>   1 _col0 (type: int)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 99700 Data size: 797288 Basic stats: 
> COMPLETE Column stats: COMPLETE
> Group By Operator
>   aggregations: count()
>   keys: _col0 (type: int), _col1 (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1, _col2
>   Statistics: Num rows: 49850 Data size: 797448 Basic stats: 
> COMPLETE Column stats: COMPLETE <==
>   Reduce Output Operator
> key expressions: _col0 (type: int), _col1 (type: int)
> sort order: ++
> Map-reduce partition columns: _col0 (type: int), _col1 
> (type: int)
> Statistics: Num rows: 49850 Data size: 797448 Basic 
> stats: COMPLETE Column stats: COMPLETE
> value expressions: _col2 (type: bigint)
>   .
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20660) Group by statistics estimation could be improved by bounding the total number of rows to source table

2018-10-03 Thread Vineet Garg (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Garg updated HIVE-20660:
---
Status: Open  (was: Patch Available)

> Group by statistics estimation could be improved by bounding the total number 
> of rows to source table
> -
>
> Key: HIVE-20660
> URL: https://issues.apache.org/jira/browse/HIVE-20660
> Project: Hive
>  Issue Type: Improvement
>  Components: Statistics
>Affects Versions: 4.0.0
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
> Attachments: HIVE-20660.1.patch, HIVE-20660.2.patch, 
> HIVE-20660.3.patch
>
>
> Currently the stats for group by is estimated by taking product of NDVs of 
> all the keys and bounding it by the number of rows of its input. This bound 
> could be improved by using the source table instead of immediate input, the 
> insight in this case is that cardinality/ndvs of a table can not go beyond 
> the original (outer joins will only add NULLs thereby increasing the 
> cardinality by 1).
> Note that the assumption here is that group by keys all belong to the same 
> source table/input.
> This will improve the estimation in situations where group by is executed 
> after joins wherein Hive could end up estimating the number of rows.
> *Reproducer*
> {code:sql}
> set hive.stats.fetch.column.stats=true;
> create table t1(i int, j int);
> alter table t1 update statistics set('numRows'='1', 
> 'rawDataSize'='18000');
> alter table t1 update statistics for column i 
> set('numDVs'='2500','numNulls'='50','highValue'='1000','lowValue'='0');
> alter table t1 update statistics for column j 
> set('numDVs'='500','numNulls'='30','highValue'='100','lowValue'='50');
> create table t2(i2 int, j2 int);
> alter table t2 update statistics set('numRows'='1', 
> 'rawDataSize'='1');
> alter table t2 update statistics for column i2 
> set('numDVs'='1000','numNulls'='0','highValue'='8000','lowValue'='0');
> alter table t2 update statistics for column j2 
> set('numDVs'='10','numNulls'='0','highValue'='800','lowValue'='-1');
> explain select count (1) from t1,t2
>   where t1.j=t2.i2 
> group by t1.i, t1.j;
> {code}
> {code:sql}
> Reducer 2
> Reduce Operator Tree:
>   Merge Join Operator
> condition map:
>  Inner Join 0 to 1
> keys:
>   0 _col1 (type: int)
>   1 _col0 (type: int)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 99700 Data size: 797288 Basic stats: 
> COMPLETE Column stats: COMPLETE
> Group By Operator
>   aggregations: count()
>   keys: _col0 (type: int), _col1 (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1, _col2
>   Statistics: Num rows: 49850 Data size: 797448 Basic stats: 
> COMPLETE Column stats: COMPLETE <==
>   Reduce Output Operator
> key expressions: _col0 (type: int), _col1 (type: int)
> sort order: ++
> Map-reduce partition columns: _col0 (type: int), _col1 
> (type: int)
> Statistics: Num rows: 49850 Data size: 797448 Basic 
> stats: COMPLETE Column stats: COMPLETE
> value expressions: _col2 (type: bigint)
>   .
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20660) Group by statistics estimation could be improved by bounding the total number of rows to source table

2018-10-03 Thread Vineet Garg (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Garg updated HIVE-20660:
---
Status: Patch Available  (was: Open)

> Group by statistics estimation could be improved by bounding the total number 
> of rows to source table
> -
>
> Key: HIVE-20660
> URL: https://issues.apache.org/jira/browse/HIVE-20660
> Project: Hive
>  Issue Type: Improvement
>  Components: Statistics
>Affects Versions: 4.0.0
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
> Attachments: HIVE-20660.1.patch, HIVE-20660.2.patch, 
> HIVE-20660.3.patch
>
>
> Currently the stats for group by is estimated by taking product of NDVs of 
> all the keys and bounding it by the number of rows of its input. This bound 
> could be improved by using the source table instead of immediate input, the 
> insight in this case is that cardinality/ndvs of a table can not go beyond 
> the original (outer joins will only add NULLs thereby increasing the 
> cardinality by 1).
> Note that the assumption here is that group by keys all belong to the same 
> source table/input.
> This will improve the estimation in situations where group by is executed 
> after joins wherein Hive could end up estimating the number of rows.
> *Reproducer*
> {code:sql}
> set hive.stats.fetch.column.stats=true;
> create table t1(i int, j int);
> alter table t1 update statistics set('numRows'='1', 
> 'rawDataSize'='18000');
> alter table t1 update statistics for column i 
> set('numDVs'='2500','numNulls'='50','highValue'='1000','lowValue'='0');
> alter table t1 update statistics for column j 
> set('numDVs'='500','numNulls'='30','highValue'='100','lowValue'='50');
> create table t2(i2 int, j2 int);
> alter table t2 update statistics set('numRows'='1', 
> 'rawDataSize'='1');
> alter table t2 update statistics for column i2 
> set('numDVs'='1000','numNulls'='0','highValue'='8000','lowValue'='0');
> alter table t2 update statistics for column j2 
> set('numDVs'='10','numNulls'='0','highValue'='800','lowValue'='-1');
> explain select count (1) from t1,t2
>   where t1.j=t2.i2 
> group by t1.i, t1.j;
> {code}
> {code:sql}
> Reducer 2
> Reduce Operator Tree:
>   Merge Join Operator
> condition map:
>  Inner Join 0 to 1
> keys:
>   0 _col1 (type: int)
>   1 _col0 (type: int)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 99700 Data size: 797288 Basic stats: 
> COMPLETE Column stats: COMPLETE
> Group By Operator
>   aggregations: count()
>   keys: _col0 (type: int), _col1 (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1, _col2
>   Statistics: Num rows: 49850 Data size: 797448 Basic stats: 
> COMPLETE Column stats: COMPLETE <==
>   Reduce Output Operator
> key expressions: _col0 (type: int), _col1 (type: int)
> sort order: ++
> Map-reduce partition columns: _col0 (type: int), _col1 
> (type: int)
> Statistics: Num rows: 49850 Data size: 797448 Basic 
> stats: COMPLETE Column stats: COMPLETE
> value expressions: _col2 (type: bigint)
>   .
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20660) Group by statistics estimation could be improved by bounding the total number of rows to source table

2018-10-02 Thread Vineet Garg (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Garg updated HIVE-20660:
---
Status: Patch Available  (was: Open)

> Group by statistics estimation could be improved by bounding the total number 
> of rows to source table
> -
>
> Key: HIVE-20660
> URL: https://issues.apache.org/jira/browse/HIVE-20660
> Project: Hive
>  Issue Type: Improvement
>  Components: Statistics
>Affects Versions: 4.0.0
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
> Attachments: HIVE-20660.1.patch, HIVE-20660.2.patch
>
>
> Currently the stats for group by is estimated by taking product of NDVs of 
> all the keys and bounding it by the number of rows of its input. This bound 
> could be improved by using the source table instead of immediate input, the 
> insight in this case is that cardinality/ndvs of a table can not go beyond 
> the original (outer joins will only add NULLs thereby increasing the 
> cardinality by 1).
> Note that the assumption here is that group by keys all belong to the same 
> source table/input.
> This will improve the estimation in situations where group by is executed 
> after joins wherein Hive could end up estimating the number of rows.
> *Reproducer*
> {code:sql}
> set hive.stats.fetch.column.stats=true;
> create table t1(i int, j int);
> alter table t1 update statistics set('numRows'='1', 
> 'rawDataSize'='18000');
> alter table t1 update statistics for column i 
> set('numDVs'='2500','numNulls'='50','highValue'='1000','lowValue'='0');
> alter table t1 update statistics for column j 
> set('numDVs'='500','numNulls'='30','highValue'='100','lowValue'='50');
> create table t2(i2 int, j2 int);
> alter table t2 update statistics set('numRows'='1', 
> 'rawDataSize'='1');
> alter table t2 update statistics for column i2 
> set('numDVs'='1000','numNulls'='0','highValue'='8000','lowValue'='0');
> alter table t2 update statistics for column j2 
> set('numDVs'='10','numNulls'='0','highValue'='800','lowValue'='-1');
> explain select count (1) from t1,t2
>   where t1.j=t2.i2 
> group by t1.i, t1.j;
> {code}
> {code:sql}
> Reducer 2
> Reduce Operator Tree:
>   Merge Join Operator
> condition map:
>  Inner Join 0 to 1
> keys:
>   0 _col1 (type: int)
>   1 _col0 (type: int)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 99700 Data size: 797288 Basic stats: 
> COMPLETE Column stats: COMPLETE
> Group By Operator
>   aggregations: count()
>   keys: _col0 (type: int), _col1 (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1, _col2
>   Statistics: Num rows: 49850 Data size: 797448 Basic stats: 
> COMPLETE Column stats: COMPLETE <==
>   Reduce Output Operator
> key expressions: _col0 (type: int), _col1 (type: int)
> sort order: ++
> Map-reduce partition columns: _col0 (type: int), _col1 
> (type: int)
> Statistics: Num rows: 49850 Data size: 797448 Basic 
> stats: COMPLETE Column stats: COMPLETE
> value expressions: _col2 (type: bigint)
>   .
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20660) Group by statistics estimation could be improved by bounding the total number of rows to source table

2018-10-02 Thread Vineet Garg (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Garg updated HIVE-20660:
---
Status: Open  (was: Patch Available)

> Group by statistics estimation could be improved by bounding the total number 
> of rows to source table
> -
>
> Key: HIVE-20660
> URL: https://issues.apache.org/jira/browse/HIVE-20660
> Project: Hive
>  Issue Type: Improvement
>  Components: Statistics
>Affects Versions: 4.0.0
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
> Attachments: HIVE-20660.1.patch, HIVE-20660.2.patch
>
>
> Currently the stats for group by is estimated by taking product of NDVs of 
> all the keys and bounding it by the number of rows of its input. This bound 
> could be improved by using the source table instead of immediate input, the 
> insight in this case is that cardinality/ndvs of a table can not go beyond 
> the original (outer joins will only add NULLs thereby increasing the 
> cardinality by 1).
> Note that the assumption here is that group by keys all belong to the same 
> source table/input.
> This will improve the estimation in situations where group by is executed 
> after joins wherein Hive could end up estimating the number of rows.
> *Reproducer*
> {code:sql}
> set hive.stats.fetch.column.stats=true;
> create table t1(i int, j int);
> alter table t1 update statistics set('numRows'='1', 
> 'rawDataSize'='18000');
> alter table t1 update statistics for column i 
> set('numDVs'='2500','numNulls'='50','highValue'='1000','lowValue'='0');
> alter table t1 update statistics for column j 
> set('numDVs'='500','numNulls'='30','highValue'='100','lowValue'='50');
> create table t2(i2 int, j2 int);
> alter table t2 update statistics set('numRows'='1', 
> 'rawDataSize'='1');
> alter table t2 update statistics for column i2 
> set('numDVs'='1000','numNulls'='0','highValue'='8000','lowValue'='0');
> alter table t2 update statistics for column j2 
> set('numDVs'='10','numNulls'='0','highValue'='800','lowValue'='-1');
> explain select count (1) from t1,t2
>   where t1.j=t2.i2 
> group by t1.i, t1.j;
> {code}
> {code:sql}
> Reducer 2
> Reduce Operator Tree:
>   Merge Join Operator
> condition map:
>  Inner Join 0 to 1
> keys:
>   0 _col1 (type: int)
>   1 _col0 (type: int)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 99700 Data size: 797288 Basic stats: 
> COMPLETE Column stats: COMPLETE
> Group By Operator
>   aggregations: count()
>   keys: _col0 (type: int), _col1 (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1, _col2
>   Statistics: Num rows: 49850 Data size: 797448 Basic stats: 
> COMPLETE Column stats: COMPLETE <==
>   Reduce Output Operator
> key expressions: _col0 (type: int), _col1 (type: int)
> sort order: ++
> Map-reduce partition columns: _col0 (type: int), _col1 
> (type: int)
> Statistics: Num rows: 49850 Data size: 797448 Basic 
> stats: COMPLETE Column stats: COMPLETE
> value expressions: _col2 (type: bigint)
>   .
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20660) Group by statistics estimation could be improved by bounding the total number of rows to source table

2018-10-02 Thread Vineet Garg (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Garg updated HIVE-20660:
---
Attachment: HIVE-20660.2.patch

> Group by statistics estimation could be improved by bounding the total number 
> of rows to source table
> -
>
> Key: HIVE-20660
> URL: https://issues.apache.org/jira/browse/HIVE-20660
> Project: Hive
>  Issue Type: Improvement
>  Components: Statistics
>Affects Versions: 4.0.0
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
> Attachments: HIVE-20660.1.patch, HIVE-20660.2.patch
>
>
> Currently the stats for group by is estimated by taking product of NDVs of 
> all the keys and bounding it by the number of rows of its input. This bound 
> could be improved by using the source table instead of immediate input, the 
> insight in this case is that cardinality/ndvs of a table can not go beyond 
> the original (outer joins will only add NULLs thereby increasing the 
> cardinality by 1).
> Note that the assumption here is that group by keys all belong to the same 
> source table/input.
> This will improve the estimation in situations where group by is executed 
> after joins wherein Hive could end up estimating the number of rows.
> *Reproducer*
> {code:sql}
> set hive.stats.fetch.column.stats=true;
> create table t1(i int, j int);
> alter table t1 update statistics set('numRows'='1', 
> 'rawDataSize'='18000');
> alter table t1 update statistics for column i 
> set('numDVs'='2500','numNulls'='50','highValue'='1000','lowValue'='0');
> alter table t1 update statistics for column j 
> set('numDVs'='500','numNulls'='30','highValue'='100','lowValue'='50');
> create table t2(i2 int, j2 int);
> alter table t2 update statistics set('numRows'='1', 
> 'rawDataSize'='1');
> alter table t2 update statistics for column i2 
> set('numDVs'='1000','numNulls'='0','highValue'='8000','lowValue'='0');
> alter table t2 update statistics for column j2 
> set('numDVs'='10','numNulls'='0','highValue'='800','lowValue'='-1');
> explain select count (1) from t1,t2
>   where t1.j=t2.i2 
> group by t1.i, t1.j;
> {code}
> {code:sql}
> Reducer 2
> Reduce Operator Tree:
>   Merge Join Operator
> condition map:
>  Inner Join 0 to 1
> keys:
>   0 _col1 (type: int)
>   1 _col0 (type: int)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 99700 Data size: 797288 Basic stats: 
> COMPLETE Column stats: COMPLETE
> Group By Operator
>   aggregations: count()
>   keys: _col0 (type: int), _col1 (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1, _col2
>   Statistics: Num rows: 49850 Data size: 797448 Basic stats: 
> COMPLETE Column stats: COMPLETE <==
>   Reduce Output Operator
> key expressions: _col0 (type: int), _col1 (type: int)
> sort order: ++
> Map-reduce partition columns: _col0 (type: int), _col1 
> (type: int)
> Statistics: Num rows: 49850 Data size: 797448 Basic 
> stats: COMPLETE Column stats: COMPLETE
> value expressions: _col2 (type: bigint)
>   .
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20660) Group by statistics estimation could be improved by bounding the total number of rows to source table

2018-09-30 Thread Vineet Garg (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Garg updated HIVE-20660:
---
Attachment: HIVE-20660.1.patch

> Group by statistics estimation could be improved by bounding the total number 
> of rows to source table
> -
>
> Key: HIVE-20660
> URL: https://issues.apache.org/jira/browse/HIVE-20660
> Project: Hive
>  Issue Type: Improvement
>  Components: Statistics
>Affects Versions: 4.0.0
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
> Attachments: HIVE-20660.1.patch
>
>
> Currently the stats for group by is estimated by taking product of NDVs of 
> all the keys and bounding it by the number of rows of its input. This bound 
> could be improved by using the source table instead of immediate input, the 
> insight in this case is that cardinality/ndvs of a table can not go beyond 
> the original (outer joins will only add NULLs thereby increasing the 
> cardinality by 1).
> Note that the assumption here is that group by keys all belong to the same 
> source table/input.
> This will improve the estimation in situations where group by is executed 
> after joins wherein Hive could end up estimating the number of rows.
> *Reproducer*
> {code:sql}
> set hive.stats.fetch.column.stats=true;
> create table t1(i int, j int);
> alter table t1 update statistics set('numRows'='1', 
> 'rawDataSize'='18000');
> alter table t1 update statistics for column i 
> set('numDVs'='2500','numNulls'='50','highValue'='1000','lowValue'='0');
> alter table t1 update statistics for column j 
> set('numDVs'='500','numNulls'='30','highValue'='100','lowValue'='50');
> create table t2(i2 int, j2 int);
> alter table t2 update statistics set('numRows'='1', 
> 'rawDataSize'='1');
> alter table t2 update statistics for column i2 
> set('numDVs'='1000','numNulls'='0','highValue'='8000','lowValue'='0');
> alter table t2 update statistics for column j2 
> set('numDVs'='10','numNulls'='0','highValue'='800','lowValue'='-1');
> explain select count (1) from t1,t2
>   where t1.j=t2.i2 
> group by t1.i, t1.j;
> {code}
> {code:sql}
> Reducer 2
> Reduce Operator Tree:
>   Merge Join Operator
> condition map:
>  Inner Join 0 to 1
> keys:
>   0 _col1 (type: int)
>   1 _col0 (type: int)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 99700 Data size: 797288 Basic stats: 
> COMPLETE Column stats: COMPLETE
> Group By Operator
>   aggregations: count()
>   keys: _col0 (type: int), _col1 (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1, _col2
>   Statistics: Num rows: 49850 Data size: 797448 Basic stats: 
> COMPLETE Column stats: COMPLETE <==
>   Reduce Output Operator
> key expressions: _col0 (type: int), _col1 (type: int)
> sort order: ++
> Map-reduce partition columns: _col0 (type: int), _col1 
> (type: int)
> Statistics: Num rows: 49850 Data size: 797448 Basic 
> stats: COMPLETE Column stats: COMPLETE
> value expressions: _col2 (type: bigint)
>   .
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20660) Group by statistics estimation could be improved by bounding the total number of rows to source table

2018-09-30 Thread Vineet Garg (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Garg updated HIVE-20660:
---
Status: Patch Available  (was: Open)

> Group by statistics estimation could be improved by bounding the total number 
> of rows to source table
> -
>
> Key: HIVE-20660
> URL: https://issues.apache.org/jira/browse/HIVE-20660
> Project: Hive
>  Issue Type: Improvement
>  Components: Statistics
>Affects Versions: 4.0.0
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
> Attachments: HIVE-20660.1.patch
>
>
> Currently the stats for group by is estimated by taking product of NDVs of 
> all the keys and bounding it by the number of rows of its input. This bound 
> could be improved by using the source table instead of immediate input, the 
> insight in this case is that cardinality/ndvs of a table can not go beyond 
> the original (outer joins will only add NULLs thereby increasing the 
> cardinality by 1).
> Note that the assumption here is that group by keys all belong to the same 
> source table/input.
> This will improve the estimation in situations where group by is executed 
> after joins wherein Hive could end up estimating the number of rows.
> *Reproducer*
> {code:sql}
> set hive.stats.fetch.column.stats=true;
> create table t1(i int, j int);
> alter table t1 update statistics set('numRows'='1', 
> 'rawDataSize'='18000');
> alter table t1 update statistics for column i 
> set('numDVs'='2500','numNulls'='50','highValue'='1000','lowValue'='0');
> alter table t1 update statistics for column j 
> set('numDVs'='500','numNulls'='30','highValue'='100','lowValue'='50');
> create table t2(i2 int, j2 int);
> alter table t2 update statistics set('numRows'='1', 
> 'rawDataSize'='1');
> alter table t2 update statistics for column i2 
> set('numDVs'='1000','numNulls'='0','highValue'='8000','lowValue'='0');
> alter table t2 update statistics for column j2 
> set('numDVs'='10','numNulls'='0','highValue'='800','lowValue'='-1');
> explain select count (1) from t1,t2
>   where t1.j=t2.i2 
> group by t1.i, t1.j;
> {code}
> {code:sql}
> Reducer 2
> Reduce Operator Tree:
>   Merge Join Operator
> condition map:
>  Inner Join 0 to 1
> keys:
>   0 _col1 (type: int)
>   1 _col0 (type: int)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 99700 Data size: 797288 Basic stats: 
> COMPLETE Column stats: COMPLETE
> Group By Operator
>   aggregations: count()
>   keys: _col0 (type: int), _col1 (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1, _col2
>   Statistics: Num rows: 49850 Data size: 797448 Basic stats: 
> COMPLETE Column stats: COMPLETE <==
>   Reduce Output Operator
> key expressions: _col0 (type: int), _col1 (type: int)
> sort order: ++
> Map-reduce partition columns: _col0 (type: int), _col1 
> (type: int)
> Statistics: Num rows: 49850 Data size: 797448 Basic 
> stats: COMPLETE Column stats: COMPLETE
> value expressions: _col2 (type: bigint)
>   .
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)