Re: Enable Impala-kudu table, column stats.

Matthew Jacobs Thu, 11 May 2017 12:55:18 -0700

On Thu, Apr 27, 2017 at 12:38 AM, 기준 <[email protected]> wrote:
> Hi Jeszy!
>
> First of all, thanks for your kind reply.
>
> It took me some times to read the article you suggested,
> and tried to apply my environment.
>
> Question 1. Why column statistics have inaccurate value?


I think Jeszy's answer addressed why there can be inaccuracies.

> Question 2. After command `alter table analysis_data set
> tblproperties('numRows'='1001000000',
> 'STATS_GENERATED_VIA_STATS_TASK'='true');`
> Still '# Rows' has -1 value.
>
> +--------+-----------+----------+--------------------------------------+------------+
> | # Rows | Start Key | Stop Key | Leader Replica
> | # Replicas |
> | -1     |           | 00000001 | ${host1} | 3          |
> | -1     | 00000001  | 00000002 | ${host2} | 3          |
> | -1     | 00000002  | 00000003 | ${host3} | 3          |
>
>
> And another command `alter table partitioned_data partition(year=2009,
> month=4) set tblproperties ('numRows'='30000',
> 'STATS_GENERATED_VIA_STATS_TASK'='true');`
> does not applied kudu table.

These are indicating that there are no _per-partition_ statistics. You
set the table statistics, which is different and should have been
applied. There is not yet a way to get per-partition statistics for
Kudu -- for now you'll always see -1 there.

Take a look at 'describe extended tbl' to see the num_rows property.
Also show column stats should work as well.

For example:
[localhost:21000] > compute stats tdata;
Query: compute stats tdata
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 9 column(s). |
+-----------------------------------------+
Fetched 1 row(s) in 0.51s
[localhost:21000] > show column stats tdata;
Query: show column stats tdata
+--------+-----------+------------------+--------+----------+----------+
| Column | Type      | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+-----------+------------------+--------+----------+----------+
| id     | INT       | 1494             | -1     | 4        | 4        |
| valf   | FLOAT     | 1                | -1     | 4        | 4        |
| vali   | BIGINT    | 1                | -1     | 8        | 8        |
| valv   | STRING    | 1                | -1     | 3        | 3        |
| valb   | BOOLEAN   | 2                | -1     | 1        | 1        |
| valt   | TINYINT   | 1                | -1     | 1        | 1        |
| vals   | SMALLINT  | 1                | -1     | 2        | 2        |
| vald   | DOUBLE    | 1                | -1     | 8        | 8        |
| ts     | TIMESTAMP | 1                | -1     | 16       | 16       |
+--------+-----------+------------------+--------+----------+----------+

[localhost:21000] > describe extended tdata;
Query: describe extended tdata
+------------------------------+-------------------------------------------------+-------------------------------------------+
| name                         | type
          | comment                                   |
+------------------------------+-------------------------------------------------+-------------------------------------------+
| # col_name                   | data_type
          | comment                                   |
|                              | NULL
          | NULL                                      |
...
|                              | NULL
          | NULL                                      |
| # Detailed Table Information | NULL
          | NULL                                      |
| Database:                    | default
          | NULL                                      |
...
| Table Parameters:            | NULL
          | NULL                                      |
|                              | DO_NOT_UPDATE_STATS
          | true                                      |
|                              | STATS_GENERATED_VIA_STATS_TASK
          | true                                      |
|                              | kudu.master_addresses
          | 127.0.0.1                                 |
|                              | kudu.table_name
          | impala::default.tdata                     |
|  ==========>     | numRows                                         |
1500                                      |
...


>
> Is there any command set rows information into kudu's stats?

You did. You set the number of rows on the table. As I said there is
no way to set partition stats, nor would it make any difference to the
planner yet. You can reference IMPALA-2830 to track work related to
Impala-Kudu statistics in the future.

>
> Thanks! Have a nice day :)
>
>
>
>
>
> 2017-04-25 20:41 GMT+09:00 Jeszy <[email protected]>:
>> Hey,
>>
>> The difference in the distinct values is expected, the estimation that
>> the NDV function
>> (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_ndv.html)
>> gives is good enough, and the execution is much faster.  You can set
>> both the table and the column stats manually as described here:
>> https://www.cloudera.com/documentation/enterprise/latest/topics/impala_perf_stats.html#perf_table_stats_manual.
>>
>> From your question it seems you expected Impala to return the column
>> statistics-stored value for the distinct count. This is not possible
>> currently, the intent with these statistics is to help planning by
>> allowing Impala to come up with a more educated guess on join ordering
>> and cardinality - if you look at a summary like the one you pasted,
>> the estimated values (both #rows and peak mem) is influenced by stats.
>> There is no way to tell whether the stats are stale (only whether they
>> are computed or not).
>>
>> Does this answer your questions?
>>
>> 2017-04-25 13:26 GMT+02:00 기준 <[email protected]>:
>>> Hi!
>>>
>>> I'm using impala-kudu currently.
>>>
>>> impala's version is v2.7.0
>>> kudu's version is 1.3
>>>
>>> I found out table statistics hint few days ago.
>>>
>>> So i tried compute statistics using command `compute stats`.
>>>
>>> After short time no errors shown my screen, but all the rows was -1.
>>>
>>> So i searched about this, then i could find this one.
>>> https://issues.apache.org/jira/browse/IMPALA-2830
>>>
>>> Question 1. Can i manually set rows?
>>>
>>> And i found column statistics computed with wrong value.
>>>
>>> For example, some column's actual distinct value was 5092153,
>>> but command `show column stats ${table}` shows 5405440.
>>> (Similar other columns too)
>>>
>>> Question 2. Why this difference happens? Also, can i set value manually?
>>>
>>> And after all i'm not clear impala use this information during query
>>> processing.
>>>
>>> For example,
>>> Issued `SELECT COUNT(DISTINCT ${column}) FROM ${table}`,
>>> and i found impala scan from kudu using `summary` command.
>>>
>>> +--------------+--------+----------+----------+---------+------------+-----------+---------------+---------------+
>>> | Operator     | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak
>>> Mem  | Est. Peak Mem | Detail        |
>>> +--------------+--------+----------+----------+---------+------------+-----------+---------------+---------------+
>>> | 06:AGGREGATE | 1      | 121.09us | 121.09us | 1       | 1          | 64.00
>>> KB  | -1 B          | FINALIZE      |
>>> | 05:EXCHANGE  | 1      | 61.82us  | 61.82us  | 12      | 1          | 0 B
>>> | -1 B          | UNPARTITIONED |
>>> | 02:AGGREGATE | 12     | 3.71ms   | 5.53ms   | 12      | 1          | 16.00
>>> KB  | 10.00 MB      |               |
>>> | 04:AGGREGATE | 12     | 171.00ms | 181.15ms | 5.09M   | 5.41M      |
>>> 154.58 MB | 11.57 MB      |               |
>>> | 03:EXCHANGE  | 12     | 12.85ms  | 14.27ms  | 7.93M   | 5.41M      | 0 B
>>> | 0 B           | HASH(c)       |
>>> | 01:AGGREGATE | 12     | 2.72s    | 4.80s    | 7.93M   | 5.41M      |
>>> 170.08 MB | 138.88 MB     | STREAMING     |
>>> | 00:SCAN KUDU | 12     | 991.95ms | 5.38s    | 963.00M | 963.00M    | 2.30
>>> MB   | 0 B           |               |
>>> +--------------+--------+----------+----------+---------+------------+-----------+---------------+---------------+
>>>
>>> Why impala did not used column statistics information?
>>>
>>> Question 3. If i can set statistics value manually, can impala understands
>>> that?
>>> it seems impala do not use computed statistics information.
>>>
>>> I working on this, but it's hard to know more.
>>>
>>> Thanks! Have a nice day.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>

Re: Enable Impala-kudu table, column stats.

Reply via email to