[ 
https://issues.apache.org/jira/browse/IMPALA-8458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844229#comment-16844229
 ] 

Todd Lipcon edited comment on IMPALA-8458 at 5/20/19 7:21 PM:
--------------------------------------------------------------

I paged this code back into my head and remember why we had the weird 
workaround. The hack there was to deal with our odd handling of boolean stats. 
The LocalCatalog flow is:

- catalogd fetches stats from Hive, and converts them to our own internal 
ColumnStats object via ColumnStats.update:

{code}
          BooleanColumnStatsData boolStats = statsData.getBooleanStats();
          numNulls_ = boolStats.getNumNulls();
          numDistinctValues_ = (numNulls_ > 0) ? 3 : 2;
{code}

- impalad fetches stats from catalogd in CatalogdMetaProvider. This interface 
was originally built for the "fetch directly from HMS" code path, so the wire 
protocol requires catalogd to send back the Hive ColumnStatisticsObj type. So, 
we call ColumnStats.createHiveColStatsData() to convert the bool stats back to 
the Hive type:
{code}
      case BOOLEAN:
        colStatsData.setBooleanStats(new BooleanColumnStatsData(1, -1, numNulls));
        break;
{code}

When this Hive object gets to the impalad, it gets converted _back_ to Impala's 
ColumnStats type with the first code snippet above.


This Hive->Impala->Hive->Impala round trip is somewhat lossy, particularly for 
bools, since Hive stores numTrues/numFalses whereas we want an NDV. I think we 
also end up with "lossiness" in the case that we didn't find compatible stats 
in the HMS, since we don't really have a clear distinction between "we have 
stats with unknown NDV" and "we don't have stats at all".
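
To make the round trip concrete, here is a minimal, self-contained sketch. It does not use the real Hive or Impala classes: HiveBoolStats and ImpalaBoolStats are hypothetical stand-ins, and fromHive/toHive are illustrative names that only mirror the two snippets quoted above. It assumes the BooleanColumnStatsData constructor arguments are (numTrues, numFalses, numNulls) and that -1 means "unknown" on the Impala side.

```java
// Hypothetical stand-in for Hive's BooleanColumnStatsData.
final class HiveBoolStats {
  final long numTrues;
  final long numFalses;
  final long numNulls;

  HiveBoolStats(long numTrues, long numFalses, long numNulls) {
    this.numTrues = numTrues;
    this.numFalses = numFalses;
    this.numNulls = numNulls;
  }
}

// Hypothetical stand-in for Impala's ColumnStats, reduced to two fields.
final class ImpalaBoolStats {
  final long numNulls;
  final long numDistinctValues;

  ImpalaBoolStats(long numNulls, long numDistinctValues) {
    this.numNulls = numNulls;
    this.numDistinctValues = numDistinctValues;
  }

  @Override
  public String toString() {
    return "numNulls=" + numNulls + ", ndv=" + numDistinctValues;
  }
}

final class BoolStatsRoundTrip {
  // Mirrors the ColumnStats.update snippet: NDV is derived from numNulls only.
  static ImpalaBoolStats fromHive(HiveBoolStats h) {
    long ndv = (h.numNulls > 0) ? 3 : 2;
    return new ImpalaBoolStats(h.numNulls, ndv);
  }

  // Mirrors the createHiveColStatsData snippet: the true/false counts are
  // replaced by the placeholders 1 and -1; only numNulls is carried over.
  static HiveBoolStats toHive(ImpalaBoolStats s) {
    return new HiveBoolStats(1, -1, s.numNulls);
  }

  public static void main(String[] args) {
    // Case 1: real stats in HMS. The true/false counts do not survive the wire.
    HiveBoolStats fromHms = new HiveBoolStats(700, 300, 0);
    HiveBoolStats onWire = toHive(fromHive(fromHms));
    System.out.println("wire numTrues=" + onWire.numTrues
        + ", numFalses=" + onWire.numFalses);

    // Case 2: no usable stats on catalogd (both fields unknown, i.e. -1).
    // After the round trip the impalad derives a concrete NDV anyway.
    ImpalaBoolStats unknown = new ImpalaBoolStats(-1, -1);
    ImpalaBoolStats onImpalad = fromHive(toHive(unknown));
    System.out.println("impalad sees: " + onImpalad);
  }
}
```

In this model, the second case shows why "unknown NDV" and "no stats at all" are hard to tell apart after the round trip: the unknown NDV on the catalogd side comes back as a concrete value on the impalad side. Whether the real code paths behave exactly this way depends on details not shown in the snippets above.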

I'll see if I can clean this up. Perhaps the easiest route is to have the 
wire protocol for fetch-from-catalogd just use the Impala-internal stats object.




> Can't set numNull/maxSize/avgSize column stats with local catalog without 
> also setting NDV
> ------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-8458
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8458
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog
>    Affects Versions: Impala 3.3.0
>            Reporter: Tim Armstrong
>            Assignee: Todd Lipcon
>            Priority: Critical
>
> Repro:
> {noformat}
> [tarmstrong-box2.ca.cloudera.com:21000] default> create table test_stats2(s string);
> +-------------------------+
> | summary                 |
> +-------------------------+
> | Table has been created. |
> +-------------------------+
> Fetched 1 row(s) in 0.36s
> [tarmstrong-box2.ca.cloudera.com:21000] default> show column stats test_stats2;
> +--------+--------+------------------+--------+----------+----------+
> | Column | Type   | #Distinct Values | #Nulls | Max Size | Avg Size |
> +--------+--------+------------------+--------+----------+----------+
> | s      | STRING | -1               | -1     | -1       | -1       |
> +--------+--------+------------------+--------+----------+----------+
> Fetched 1 row(s) in 0.02s
> [tarmstrong-box2.ca.cloudera.com:21000] default> alter table test_stats2 set column stats s('avgSize'='1234');
> +-----------------------------------------+
> | summary                                 |
> +-----------------------------------------+
> | Updated 0 partition(s) and 1 column(s). |
> +-----------------------------------------+
> Fetched 1 row(s) in 0.14s
> [tarmstrong-box2.ca.cloudera.com:21000] default> show column stats test_stats2;
> +--------+--------+------------------+--------+----------+----------+
> | Column | Type   | #Distinct Values | #Nulls | Max Size | Avg Size |
> +--------+--------+------------------+--------+----------+----------+
> | s      | STRING | -1               | -1     | -1       | -1       |
> +--------+--------+------------------+--------+----------+----------+
> Fetched 1 row(s) in 0.02s
> [tarmstrong-box2.ca.cloudera.com:21000] default> alter table test_stats2 set column stats s('maxSize'='1234');
> +-----------------------------------------+
> | summary                                 |
> +-----------------------------------------+
> | Updated 0 partition(s) and 1 column(s). |
> +-----------------------------------------+
> Fetched 1 row(s) in 0.10s
> [tarmstrong-box2.ca.cloudera.com:21000] default> show column stats test_stats2;
> +--------+--------+------------------+--------+----------+----------+
> | Column | Type   | #Distinct Values | #Nulls | Max Size | Avg Size |
> +--------+--------+------------------+--------+----------+----------+
> | s      | STRING | -1               | -1     | -1       | -1       |
> +--------+--------+------------------+--------+----------+----------+
> Fetched 1 row(s) in 0.02s
> [tarmstrong-box2.ca.cloudera.com:21000] default> invalidate metadata test_stats2;
> Fetched 0 row(s) in 0.03s
> [tarmstrong-box2.ca.cloudera.com:21000] default> show column stats test_stats2;
> Query: show column stats test_stats2
> +--------+--------+------------------+--------+----------+----------+
> | Column | Type   | #Distinct Values | #Nulls | Max Size | Avg Size |
> +--------+--------+------------------+--------+----------+----------+
> | s      | STRING | -1               | -1     | -1       | -1       |
> +--------+--------+------------------+--------+----------+----------+
> Fetched 1 row(s) in 0.07s
> {noformat}
> I expected that the updates would take effect. Weirdly, the problem doesn't 
> occur when setting NDV and NULLS:
> {noformat}
> [tarmstrong-box2.ca.cloudera.com:21000] default> alter table test_stats2 set column stats s('numDVs'='1234','numNulls'='12345');
> Query: alter table test_stats2 set column stats s('numDVs'='1234','numNulls'='12345')
> +-----------------------------------------+
> | summary                                 |
> +-----------------------------------------+
> | Updated 0 partition(s) and 1 column(s). |
> +-----------------------------------------+
> Fetched 1 row(s) in 0.12s
> [tarmstrong-box2.ca.cloudera.com:21000] default> show column stats test_stats2;
> Query: show column stats test_stats2
> +--------+--------+------------------+--------+----------+----------+
> | Column | Type   | #Distinct Values | #Nulls | Max Size | Avg Size |
> +--------+--------+------------------+--------+----------+----------+
> | s      | STRING | 1234             | 12345  | -1       | -1       |
> +--------+--------+------------------+--------+----------+----------+
> Fetched 1 row(s) in 0.02s
> {noformat}


