On Thu, Apr 27, 2017 at 12:38 AM, 기준 <[email protected]> wrote:
> Hi Jeszy!
>
> First of all, thanks for your kind reply.
>
> It took me some times to read the article you suggested,
> and tried to apply my environment.
>
> Question 1. Why column statistics have inaccurate value?
I think Jeszy's answer addressed why there can be inaccuracies.
> Question 2. After command `alter table analysis_data set
> tblproperties('numRows'='1001000000',
> 'STATS_GENERATED_VIA_STATS_TASK'='true');`
> Still '# Rows' has -1 value.
>
> +--------+-----------+----------+--------------------------------------+------------+
> | # Rows | Start Key | Stop Key | Leader Replica
> | # Replicas |
> | -1 | | 00000001 | ${host1} | 3 |
> | -1 | 00000001 | 00000002 | ${host2} | 3 |
> | -1 | 00000002 | 00000003 | ${host3} | 3 |
>
>
> And another command `alter table partitioned_data partition(year=2009,
> month=4) set tblproperties ('numRows'='30000',
> 'STATS_GENERATED_VIA_STATS_TASK'='true');`
> does not applied kudu table.
These are indicating that there are no _per-partition_ statistics. You
set the table statistics, which is different and should have been
applied. There is not yet a way to get per-partition statistics for
Kudu -- for now you'll always see -1 there.
Take a look at 'describe extended tbl' to see the num_rows property.
Also show column stats should work as well.
For example:
[localhost:21000] > compute stats tdata;
Query: compute stats tdata
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 1 partition(s) and 9 column(s). |
+-----------------------------------------+
Fetched 1 row(s) in 0.51s
[localhost:21000] > show column stats tdata;
Query: show column stats tdata
+--------+-----------+------------------+--------+----------+----------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+-----------+------------------+--------+----------+----------+
| id | INT | 1494 | -1 | 4 | 4 |
| valf | FLOAT | 1 | -1 | 4 | 4 |
| vali | BIGINT | 1 | -1 | 8 | 8 |
| valv | STRING | 1 | -1 | 3 | 3 |
| valb | BOOLEAN | 2 | -1 | 1 | 1 |
| valt | TINYINT | 1 | -1 | 1 | 1 |
| vals | SMALLINT | 1 | -1 | 2 | 2 |
| vald | DOUBLE | 1 | -1 | 8 | 8 |
| ts | TIMESTAMP | 1 | -1 | 16 | 16 |
+--------+-----------+------------------+--------+----------+----------+
[localhost:21000] > describe extended tdata;
Query: describe extended tdata
+------------------------------+-------------------------------------------------+-------------------------------------------+
| name | type
| comment |
+------------------------------+-------------------------------------------------+-------------------------------------------+
| # col_name | data_type
| comment |
| | NULL
| NULL |
...
| | NULL
| NULL |
| # Detailed Table Information | NULL
| NULL |
| Database: | default
| NULL |
...
| Table Parameters: | NULL
| NULL |
| | DO_NOT_UPDATE_STATS
| true |
| | STATS_GENERATED_VIA_STATS_TASK
| true |
| | kudu.master_addresses
| 127.0.0.1 |
| | kudu.table_name
| impala::default.tdata |
| ==========> | numRows |
1500 |
...
>
> Is there any command set rows information into kudu's stats?
You did. You set the number of rows on the table. As I said there is
no way to set partition stats, nor would it make any difference to the
planner yet. You can reference IMPALA-2830 to track work related to
Impala-Kudu statistics in the future.
>
> Thanks! Have a nice day :)
>
>
>
>
>
> 2017-04-25 20:41 GMT+09:00 Jeszy <[email protected]>:
>> Hey,
>>
>> The difference in the distinct values is expected, the estimation that
>> the NDV function
>> (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_ndv.html)
>> gives is good enough, and the execution is much faster. You can set
>> both the table and the column stats manually as described here:
>> https://www.cloudera.com/documentation/enterprise/latest/topics/impala_perf_stats.html#perf_table_stats_manual.
>>
>> From your question it seems you expected Impala to return the column
>> statistics-stored value for the distinct count. This is not possible
>> currently, the intent with these statistics is to help planning by
>> allowing Impala to come up with a more educated guess on join ordering
>> and cardinality - if you look at a summary like the one you pasted,
>> the estimated values (both #rows and peak mem) is influenced by stats.
>> There is no way to tell whether the stats are stale (only whether they
>> are computed or not).
>>
>> Does this answer your questions?
>>
>> 2017-04-25 13:26 GMT+02:00 기준 <[email protected]>:
>>> Hi!
>>>
>>> I'm using impala-kudu currently.
>>>
>>> impala's version is v2.7.0
>>> kudu's version is 1.3
>>>
>>> I found out table statistics hint few days ago.
>>>
>>> So i tried compute statistics using command `compute stats`.
>>>
>>> After short time no errors shown my screen, but all the rows was -1.
>>>
>>> So i searched about this, then i could find this one.
>>> https://issues.apache.org/jira/browse/IMPALA-2830
>>>
>>> Question 1. Can i manually set rows?
>>>
>>> And i found column statistics computed with wrong value.
>>>
>>> For example, some column's actual distinct value was 5092153,
>>> but command `show column stats ${table}` shows 5405440.
>>> (Similar other columns too)
>>>
>>> Question 2. Why this difference happens? Also, can i set value manually?
>>>
>>> And after all i'm not clear impala use this information during query
>>> processing.
>>>
>>> For example,
>>> Issued `SELECT COUNT(DISTINCT ${column}) FROM ${table}`,
>>> and i found impala scan from kudu using `summary` command.
>>>
>>> +--------------+--------+----------+----------+---------+------------+-----------+---------------+---------------+
>>> | Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak
>>> Mem | Est. Peak Mem | Detail |
>>> +--------------+--------+----------+----------+---------+------------+-----------+---------------+---------------+
>>> | 06:AGGREGATE | 1 | 121.09us | 121.09us | 1 | 1 | 64.00
>>> KB | -1 B | FINALIZE |
>>> | 05:EXCHANGE | 1 | 61.82us | 61.82us | 12 | 1 | 0 B
>>> | -1 B | UNPARTITIONED |
>>> | 02:AGGREGATE | 12 | 3.71ms | 5.53ms | 12 | 1 | 16.00
>>> KB | 10.00 MB | |
>>> | 04:AGGREGATE | 12 | 171.00ms | 181.15ms | 5.09M | 5.41M |
>>> 154.58 MB | 11.57 MB | |
>>> | 03:EXCHANGE | 12 | 12.85ms | 14.27ms | 7.93M | 5.41M | 0 B
>>> | 0 B | HASH(c) |
>>> | 01:AGGREGATE | 12 | 2.72s | 4.80s | 7.93M | 5.41M |
>>> 170.08 MB | 138.88 MB | STREAMING |
>>> | 00:SCAN KUDU | 12 | 991.95ms | 5.38s | 963.00M | 963.00M | 2.30
>>> MB | 0 B | |
>>> +--------------+--------+----------+----------+---------+------------+-----------+---------------+---------------+
>>>
>>> Why impala did not used column statistics information?
>>>
>>> Question 3. If i can set statistics value manually, can impala understands
>>> that?
>>> it seems impala do not use computed statistics information.
>>>
>>> I working on this, but it's hard to know more.
>>>
>>> Thanks! Have a nice day.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>