[jira] [Commented] (IMPALA-7604) In AggregationNode.computeStats, handle cardinality overflow better

2018-09-20 Thread Paul Rogers (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16623051#comment-16623051
 ] 

Paul Rogers commented on IMPALA-7604:
-

[~tarmstrong], in my experience, using planner memory estimates for resource 
planning is quite difficult. Cardinality estimates may be off by an order of 
magnitude or more. (Consider a query with {{a > 3000}}; what is the reduction 
factor? Or, what is the reduction factor for {{b LIKE '%A%'}}?).

And, of course, the planner currently throws away both of those predicates when 
computing selectivity since it can compute selectivity only for equality. (I 
filed a bug for this.) This means that existing estimates are generally 
overstated -- but we don't know by how much.

Planner cardinalities are best when used as a cost estimate to choose between 
alternative plans. In this case, the noise in the signal affects both sides, 
and the numbers are generally orders of magnitude apart, so some jiggle is not 
a problem. This is why some planners use {{double}} for the cardinality 
estimate: since it is just a rough guess, the slight inaccuracy of a floating 
point number is no problem, but the number can scale to huge values.
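As a toy illustration of that trade-off (the numbers here are hypothetical, not taken from Impala):

```java
// A 64-bit row count wraps on overflow and can even turn negative,
// while a double loses row-level precision but keeps scaling.
public class CardinalityDemo {
    // Product of two row-count estimates in 64-bit integer arithmetic; may wrap.
    static long longProduct(long a, long b) { return a * b; }

    // Same product in floating point: approximate, but it never wraps.
    static double doubleProduct(double a, double b) { return a * b; }

    public static void main(String[] args) {
        long card = 4_000_000_000L;  // hypothetical ~4-billion-row input
        System.out.println(longProduct(card, card));    // wrapped, negative
        System.out.println(doubleProduct(card, card));  // 1.6E19
    }
}
```

The integer product of two ~4B row counts exceeds {{Long.MAX_VALUE}} and wraps to a negative value, while the double stays a usable (if approximate) magnitude.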

I wonder: do we use the cardinality to set a budget, and can we then spill if 
it turns out that our estimates are wrong? (The {{a = 100}} condition turns out 
to exclude only a few records instead of, say, 90%.) Do we read records into 
batches up to a memory size limit, or do we read a fixed number of records per 
batch computed from the average row width? If based on row width, then we also 
have noise issues, since columns can vary around the average, and knowing the 
width of the output of some string functions is very difficult. How do we 
handle this?

> In AggregationNode.computeStats, handle cardinality overflow better
> ---
>
> Key: IMPALA-7604
> URL: https://issues.apache.org/jira/browse/IMPALA-7604
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Affects Versions: Impala 2.12.0
>Reporter: Paul Rogers
>Assignee: Tim Armstrong
>Priority: Minor
>
> Consider the cardinality overflow logic in 
> [{{AggregationNode.computeStats()}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/AggregationNode.java].
>  Current code:
> {noformat}
> // if we ended up with an overflow, the estimate is certain to be wrong
> if (cardinality_ < 0) cardinality_ = -1;
> {noformat}
> This code has a number of issues.
> * The check is done after looping over all conjuncts. It could be that, as a 
> result, the number overflowed twice. The check should be done after each 
> multiplication.
> * Since we know that the number overflowed, a better estimate of the total 
> count is {{Long.MAX_VALUE}}.
> * The code later checks for the -1 value and, if found, uses the cardinality 
> of the first child. This is a worse estimate than using the max value, since 
> the first child might have a low cardinality (it could be the later children 
> that caused the overflow.)
> * If we really do expect overflow, then we are dealing with very large 
> numbers. Being accurate to the row is not needed. Better to use a {{double}} 
> which can handle the large values.
> Since overflow probably seldom occurs, this is not an urgent issue. Still, if 
> overflow does occur, the query is huge, and having at least some estimate of 
> its hugeness is better than none. Also, it seems this code evolved over time; 
> this newbie is looking at it fresh and seeing that the accumulated fixes 
> could be tidied up.
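A minimal sketch of the clamp-after-each-multiplication idea from the list above (the method name is hypothetical; this is not the actual Impala code):

```java
public class SaturatingCardinality {
    // Multiply two non-negative cardinality estimates, clamping to
    // Long.MAX_VALUE instead of letting the product wrap negative.
    // Applying this after *each* multiplication avoids a double overflow
    // wrapping back into positive territory and going undetected.
    static long saturatingMultiply(long a, long b) {
        try {
            return Math.multiplyExact(a, b);
        } catch (ArithmeticException e) {
            return Long.MAX_VALUE;  // overflow: report "huge", not garbage
        }
    }

    public static void main(String[] args) {
        long card = saturatingMultiply(4_000_000_000L, 4_000_000_000L);
        card = saturatingMultiply(card, 2L);  // stays clamped
        System.out.println(card);             // 9223372036854775807
    }
}
```

Once clamped, every further multiplication keeps the estimate pinned at {{Long.MAX_VALUE}}, so the final cardinality is at least "huge" rather than the first child's count.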



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-7367) Pack StringValue, CollectionValue and TimestampValue slots

2018-09-20 Thread Tim Armstrong (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622962#comment-16622962
 ] 

Tim Armstrong commented on IMPALA-7367:
---

That's a huge win. I can see it might make a difference for larger scale 
factors, since more data will fit in the various levels of cache. Just to 
check, we ran end-to-end tests, right? I.e., there's not a bug or something.

> Pack StringValue, CollectionValue and TimestampValue slots
> --
>
> Key: IMPALA-7367
> URL: https://issues.apache.org/jira/browse/IMPALA-7367
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Reporter: Tim Armstrong
>Assignee: Pooja Nilangekar
>Priority: Major
>  Labels: perfomance
> Attachments: 0001-WIP.patch
>
>
> This is a follow-on to finish up the work from IMPALA-2789. IMPALA-2789 
> didn't actually fully pack the memory layout because StringValue, 
> TimestampValue and CollectionValue still occupy 16 bytes but only have 12 
> bytes of actual data. This results in a higher memory footprint, which leads 
> to higher memory requirements and worse performance. We don't get any benefit 
> from the padding since the majority of tuples are not actually aligned in 
> memory anyway.
> I did a quick version of the change for StringValue only which improves TPC-H 
> performance.
> {noformat}
> Report Generated on 2018-07-30
> Run Description: "b5608264b4552e44eb73ded1e232a8775c3dba6b vs 
> f1e401505ac20c0400eec819b9196f7f506fb927"
> Cluster Name: UNKNOWN
> Lab Run Info: UNKNOWN
> Impala Version:  impalad version 3.1.0-SNAPSHOT RELEASE ()
> Baseline Impala Version: impalad version 3.1.0-SNAPSHOT RELEASE (2018-07-27)
> +----------+-----------------------+---------+------------+------------+----------------+
> | Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
> +----------+-----------------------+---------+------------+------------+----------------+
> | TPCH(10) | parquet / none / none | 2.69    | -4.78%     | 2.09       | -3.11%         |
> +----------+-----------------------+---------+------------+------------+----------------+
> +----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+
> | Workload | Query    | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Num Clients | Iters |
> +----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+
> | TPCH(10) | TPCH-Q22 | parquet / none / none | 0.94   | 0.93        | +0.75%     | 3.37%     | 2.84%          | 1           | 30    |
> | TPCH(10) | TPCH-Q13 | parquet / none / none | 3.32   | 3.32        | +0.13%     | 1.74%     | 2.09%          | 1           | 30    |
> | TPCH(10) | TPCH-Q11 | parquet / none / none | 0.99   | 0.99        | -0.02%     | 3.74%     | 3.16%          | 1           | 30    |
> | TPCH(10) | TPCH-Q5  | parquet / none / none | 2.30   | 2.33        | -0.96%     | 2.15%     | 2.45%          | 1           | 30    |
> | TPCH(10) | TPCH-Q2  | parquet / none / none | 1.55   | 1.57        | -1.45%     | 1.65%     | 1.49%          | 1           | 30    |
> | TPCH(10) | TPCH-Q8  | parquet / none / none | 2.89   | 2.93        | -1.51%     | 2.69%     | 1.34%          | 1           | 30    |
> | TPCH(10) | TPCH-Q9  | parquet / none / none | 5.96   | 6.06        | -1.63%     | 1.34%     | 1.82%          | 1           | 30    |
> | TPCH(10) | TPCH-Q20 | parquet / none / none | 1.58   | 1.61        | -1.85%     | 2.28%     | 2.16%          | 1           | 30    |
> | TPCH(10) | TPCH-Q16 | parquet / none / none | 1.18   | 1.21        | -2.11%     | 3.68%     | 4.72%          | 1           | 30    |
> | TPCH(10) | TPCH-Q3  | parquet / none / none | 2.13   | 2.18        | -2.31%     | 2.09%     | 1.92%          | 1           | 30    |
> | TPCH(10) | TPCH-Q15 | parquet / none / none | 1.86   | 1.90        | -2.52%     | 2.06%     | 2.22%          | 1           | 30    |
> | TPCH(10) | TPCH-Q17 | parquet / none / none | 1.85   | 1.90        | -2.86%     | 10.00%    | 8.02%          | 1           | 30    |
> | TPCH(10) | TPCH-Q10 | parquet / none / none | 2.58   | 2.66        | -2.93%     | 1.68%     | 6.49%          | 1           | 30    |
> | TPCH(10) | TPCH-Q14 | parquet / none / none | 1.37   | 1.42        | -3.22%     | 3.35%     | 6.24%          | 1           | 30    |
> | TPCH(10) | TPCH-Q18 | parquet / none / none | 4.99   | 5.17        | -3.38%     | 1.75%     | 3.82%          | 1           | 30    |
> | TPCH(10) | TPCH-Q6  | parquet / none / none | 0.66   | 0.69        | -3.73%     | 5.04%     |

[jira] [Updated] (IMPALA-7604) In AggregationNode.computeStats, handle cardinality overflow better

2018-09-20 Thread Tim Armstrong (JIRA)


 [ 
https://issues.apache.org/jira/browse/IMPALA-7604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-7604:
--
Target Version: Impala 3.1.0

> In AggregationNode.computeStats, handle cardinality overflow better
> ---
>
> Key: IMPALA-7604
> URL: https://issues.apache.org/jira/browse/IMPALA-7604
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Affects Versions: Impala 2.12.0
>Reporter: Paul Rogers
>Assignee: Tim Armstrong
>Priority: Minor
>






[jira] [Assigned] (IMPALA-7604) In AggregationNode.computeStats, handle cardinality overflow better

2018-09-20 Thread Tim Armstrong (JIRA)


 [ 
https://issues.apache.org/jira/browse/IMPALA-7604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong reassigned IMPALA-7604:
-

Assignee: Tim Armstrong

> In AggregationNode.computeStats, handle cardinality overflow better
> ---
>
> Key: IMPALA-7604
> URL: https://issues.apache.org/jira/browse/IMPALA-7604
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Affects Versions: Impala 2.12.0
>Reporter: Paul Rogers
>Assignee: Tim Armstrong
>Priority: Minor
>






[jira] [Commented] (IMPALA-7604) In AggregationNode.computeStats, handle cardinality overflow better

2018-09-20 Thread Tim Armstrong (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622959#comment-16622959
 ] 

Tim Armstrong commented on IMPALA-7604:
---

This is potentially impactful for memory estimates. I think we can get huge 
estimates here easily with "SELECT DISTINCT *" since it multiplies the 
cardinality. I'd like to take a look. I'll assign to myself now.

> In AggregationNode.computeStats, handle cardinality overflow better
> ---
>
> Key: IMPALA-7604
> URL: https://issues.apache.org/jira/browse/IMPALA-7604
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Affects Versions: Impala 2.12.0
>Reporter: Paul Rogers
>Priority: Minor
>






[jira] [Updated] (IMPALA-7604) In AggregationNode.computeStats, handle cardinality overflow better

2018-09-20 Thread Tim Armstrong (JIRA)


 [ 
https://issues.apache.org/jira/browse/IMPALA-7604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-7604:
--
Component/s: Frontend

> In AggregationNode.computeStats, handle cardinality overflow better
> ---
>
> Key: IMPALA-7604
> URL: https://issues.apache.org/jira/browse/IMPALA-7604
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Affects Versions: Impala 2.12.0
>Reporter: Paul Rogers
>Priority: Minor
>






[jira] [Created] (IMPALA-7604) In AggregationNode.computeStats, handle cardinality overflow better

2018-09-20 Thread Paul Rogers (JIRA)
Paul Rogers created IMPALA-7604:
---

 Summary: In AggregationNode.computeStats, handle cardinality 
overflow better
 Key: IMPALA-7604
 URL: https://issues.apache.org/jira/browse/IMPALA-7604
 Project: IMPALA
  Issue Type: Improvement
Affects Versions: Impala 2.12.0
Reporter: Paul Rogers


Consider the cardinality overflow logic in 
[{{AggregationNode.computeStats()}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/AggregationNode.java].
 Current code:

{noformat}
// if we ended up with an overflow, the estimate is certain to be wrong
if (cardinality_ < 0) cardinality_ = -1;
{noformat}

This code has a number of issues.

* The check is done after looping over all conjuncts. It could be that, as a 
result, the number overflowed twice. The check should be done after each 
multiplication.
* Since we know that the number overflowed, a better estimate of the total 
count is {{Long.MAX_VALUE}}.
* The code later checks for the -1 value and, if found, uses the cardinality of 
the first child. This is a worse estimate than using the max value, since the 
first child might have a low cardinality (it could be the later children that 
caused the overflow.)
* If we really do expect overflow, then we are dealing with very large numbers. 
Being accurate to the row is not needed. Better to use a {{double}} which can 
handle the large values.

Since overflow probably seldom occurs, this is not an urgent issue. Still, if 
overflow does occur, the query is huge, and having at least some estimate of 
its hugeness is better than none. Also, it seems this code evolved over time; 
this newbie is looking at it fresh and seeing that the accumulated fixes could 
be tidied up.








[jira] [Comment Edited] (IMPALA-7310) Compute Stats not computing NULLs as a distinct value causing wrong estimates

2018-09-20 Thread Paul Rogers (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622856#comment-16622856
 ] 

Paul Rogers edited comment on IMPALA-7310 at 9/21/18 12:04 AM:
---

The planner uses NDVs to make binary decisions: do I do x or y? (Do I put t1 on 
the build side of a join, or do I put it on the probe side?) In most cases, the 
values being compared are orders of magnitude apart, and so fine nuances of 
value are not important. We simply need some reasonable non-zero number so that 
the calculations can play out.

The simplest fix is to handle the non-stats case for a 
[{{ColumnStats}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java]
 instance.

The current code in {{initColStats()}} initializes NDV to -1 (undefined). 
Suggested alternatives:
 * If type is BOOLEAN, NDV = 2.
 * If type is TINYINT, NDV = 256.
 * If type is anything else, assume NDV = some constant, say 1000.
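A rough sketch of those defaults (the enum and the fallback constant of 1000 are illustrative stand-ins, not Impala's actual types or API):

```java
public class NdvDefaults {
    // Simplified stand-in for Impala's column type enum.
    enum ColType { BOOLEAN, TINYINT, OTHER }

    // Default NDV guess for a column that has no computed stats,
    // per the suggested rules above.
    static long defaultNdv(ColType type) {
        switch (type) {
            case BOOLEAN: return 2;    // only true/false possible
            case TINYINT: return 256;  // all possible 8-bit values
            default:      return 1000; // arbitrary "reasonable non-zero" guess
        }
    }

    public static void main(String[] args) {
        System.out.println(defaultNdv(ColType.BOOLEAN));  // 2
        System.out.println(defaultNdv(ColType.OTHER));    // 1000
    }
}
```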

Note that a variation of the above logic actually already exists in 
{{createHiveColStatsData()}}, where it is used to bound the NDV value. So, we 
just reuse it. That code also suggests we could cap NDV at the row count. But, 
since our guesses are small, the row count might not add much value. Or, our 
NDV guess might be some fraction of the row count, if we want to be fancy.

{{ColumnStats}} already has a {{hasStats()}} method which checks whether NDV is 
other than -1. Since NDV will now always be set to some value, change this 
method to check only {{numNulls_}}, which will continue to be -1 without stats.

Finally, in {{createHiveColStatsData}}, set a floor on NDV of 1 to account for 
the fact that an all-null column has NDV=0. Or, to be conservative, if NDV <= 
10, add one to NDV to account for nulls. (Do this always, since a column that 
claims to be non-null can eventually become null as the result of an outer 
join.)

Next, modify {{update()}} to use the defaults (to be set in {{initColStats()}}) 
for the "incompatible" case.

As a result, when the plan nodes ask for NDV, they won't get a 0 value if we 
have no data, nor will they get 0 if a column is all nulls.

Add or modify unit tests to verify the above logic, especially the defaults 
case and how the defaults propagate up the plan tree.

The risk is that some plans will change. We hope they change to favor getting 
the correct plan more often. But, there will be some use case for which the 
old, wrong, values produced a more accurate plan than the new estimates. This 
is always a risk.


was (Author: paul.rogers):
The planner uses NDVs to make binary decisions: do I do x or y? (Do I put t1 on 
the build side of a join, or to I put it on the probe site?) In most cases, the 
values being compared are order-of-magnitude different, and so fine nuances of 
value are not important. We simply need some reasonable non-zero number so that 
the calcs can play out.

The simplest fix is to handle the non-stats case for a 
[{{ColumnStats}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java]
 instance.

The current code in {{initColStats()}} initializes NDV to -1 (undefined). 
Suggested alternatives:
 * If type is Boolean, NDV = 2
 * If type is TINYINT, NDV = 256.
 * If type is anything else, assume NDV = some constant, say 1000.

Note that the above logic actually already exists in 
{{createHiveColStatsData()}} where it is used to bound the NDV value. So, we 
just reuse it. That code also suggests we can NDV at row count. But, since our 
guesses are small, the row count might not add much value. Or, our NDV guess 
might be some fraction of row count, if we want to be fancy.

{{ColumnStats}} already has a {{hasStats()}} method which checks if NDV is 
other than -1. Since NDV will always not be some value, change this method to 
check only {{numNulls_}}, which will continue to be -1 without stats.

Finally, in {{createHiveColStatsData}}, set a floor on NDV at 1 to account for 
the fact that an all-null column has HDV=0. Or, to be conservative, if NDV < 
10, add one to NDV to account for nulls.

Next, modify {{update()}} to use the defaults (to be set in {{initColStats()}} 
for the "incompatible" case.

As a result, when the plan nodes ask for NDV, they won't get a 0 value if we 
have no data, nor will they get 0 if a column is all nulls.

Add or modify unit tests to verify the above logic, especially the defaults 
case and how the defaults propagate up the plan tree.

The risk is that some plans will change. We hope they change to favor getting 
the correct plan more often. But, there will be some use case for which the 
old, wrong, values produced a more accurate plan than the new estimates. This 
is always a risk.

> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -
>
>   

[jira] [Comment Edited] (IMPALA-7367) Pack StringValue, CollectionValue and TimestampValue slots

2018-09-20 Thread Pooja Nilangekar (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622879#comment-16622879
 ] 

Pooja Nilangekar edited comment on IMPALA-7367 at 9/20/18 11:49 PM:


I ran TPCH with a scale factor of 60 on a minicluster with a patch for 
StringValue and CollectionValue slots. Here is the summary of results: 

{noformat}
+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH(60) | parquet / none / none | 12.45   | -29.84%    | 8.63       | -11.30%        |
+----------+-----------------------+---------+------------+------------+----------------+
{noformat}


The queries which showed significant performance gains did use strings or 
timestamps stored as strings. I can understand that we should see an 
improvement; however, I am not sure about the magnitude.

Also, there were only 2 queries which showed a regression > 1%. In both cases, 
the absolute difference was less than 5 ms while the query took a few seconds 
to run, so this could just be system noise.


was (Author: poojanilangekar):
I ran TPCH with a scale factor of 60 on a minicluster with a patch for 
StringValue and CollectionValue slots. Here is the summary of results: 

+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH(60) | parquet / none / none | 12.45   | -29.84%    | 8.63       | -11.30%        |
+----------+-----------------------+---------+------------+------------+----------------+

The queries which showed significant performance gain did use strings or 
timestamps stored as strings. I can understand that we should see an 
improvement, however I am not sure about the magnitude. 

Also there were only 2 queries which showed a regression > 1 %. In both cases, 
the absolute difference was less than 5ms while the query took a few seconds to 
run. So this could just be system noise. 

> Pack StringValue, CollectionValue and TimestampValue slots
> --
>
> Key: IMPALA-7367
> URL: https://issues.apache.org/jira/browse/IMPALA-7367
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Reporter: Tim Armstrong
>Assignee: Pooja Nilangekar
>Priority: Major
>  Labels: perfomance
> Attachments: 0001-WIP.patch
>
>

[jira] [Commented] (IMPALA-7367) Pack StringValue, CollectionValue and TimestampValue slots

2018-09-20 Thread Pooja Nilangekar (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622879#comment-16622879
 ] 

Pooja Nilangekar commented on IMPALA-7367:
--

I ran TPCH with a scale factor of 60 on a minicluster with a patch for 
StringValue and CollectionValue slots. Here is the summary of results: 

+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH(60) | parquet / none / none | 12.45   | -29.84%    | 8.63       | -11.30%        |
+----------+-----------------------+---------+------------+------------+----------------+

The queries which showed significant performance gains did use strings or 
timestamps stored as strings. I can understand that we should see an 
improvement; however, I am not sure about the magnitude.

Also, there were only 2 queries which showed a regression > 1%. In both cases, 
the absolute difference was less than 5 ms while the query took a few seconds 
to run, so this could just be system noise.

> Pack StringValue, CollectionValue and TimestampValue slots
> --
>
> Key: IMPALA-7367
> URL: https://issues.apache.org/jira/browse/IMPALA-7367
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Reporter: Tim Armstrong
>Assignee: Pooja Nilangekar
>Priority: Major
>  Labels: perfomance
> Attachments: 0001-WIP.patch
>
>
> This is a follow-on to finish up the work from IMPALA-2789. IMPALA-2789 
> didn't actually fully pack the memory layout because StringValue, 
> TimestampValue and CollectionValue still occupy 16 bytes but only have 12 
> bytes of actual data. This results in a higher memory footprint, which leads 
> to higher memory requirements and worse performance. We don't get any benefit 
> from the padding since the majority of tuples are not actually aligned in 
> memory anyway.
> I did a quick version of the change for StringValue only which improves TPC-H 
> performance.
> {noformat}
> Report Generated on 2018-07-30
> Run Description: "b5608264b4552e44eb73ded1e232a8775c3dba6b vs 
> f1e401505ac20c0400eec819b9196f7f506fb927"
> Cluster Name: UNKNOWN
> Lab Run Info: UNKNOWN
> Impala Version:  impalad version 3.1.0-SNAPSHOT RELEASE ()
> Baseline Impala Version: impalad version 3.1.0-SNAPSHOT RELEASE (2018-07-27)
> +----------+-----------------------+---------+------------+------------+----------------+
> | Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
> +----------+-----------------------+---------+------------+------------+----------------+
> | TPCH(10) | parquet / none / none | 2.69    | -4.78%     | 2.09       | -3.11%         |
> +----------+-----------------------+---------+------------+------------+----------------+
> +----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+
> | Workload | Query    | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Num Clients | Iters |
> +----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+
> | TPCH(10) | TPCH-Q22 | parquet / none / none | 0.94   | 0.93        | +0.75%     | 3.37%     | 2.84%          | 1           | 30    |
> | TPCH(10) | TPCH-Q13 | parquet / none / none | 3.32   | 3.32        | +0.13%     | 1.74%     | 2.09%          | 1           | 30    |
> | TPCH(10) | TPCH-Q11 | parquet / none / none | 0.99   | 0.99        | -0.02%     | 3.74%     | 3.16%          | 1           | 30    |
> | TPCH(10) | TPCH-Q5  | parquet / none / none | 2.30   | 2.33        | -0.96%     | 2.15%     | 2.45%          | 1           | 30    |
> | TPCH(10) | TPCH-Q2  | parquet / none / none | 1.55   | 1.57        | -1.45%     | 1.65%     | 1.49%          | 1           | 30    |
> | TPCH(10) | TPCH-Q8  | parquet / none / none | 2.89   | 2.93        | -1.51%     | 2.69%     | 1.34%          | 1           | 30    |
> | TPCH(10) | TPCH-Q9  | parquet / none / none | 5.96   | 6.06        | -1.63%     | 1.34%     | 1.82%          | 1           | 30    |
> | TPCH(10) | TPCH-Q20 | parquet / none / none | 1.58   | 1.61        | -1.85%     | 2.28%     | 2.16%          | 1           | 30    |
> | TPCH(10) | TPCH-Q16 | parquet / none / none | 1.18   | 1.21        | -2.11%     | 3.68%     | 4.72%          | 1           | 30    |
> | TPCH(10) | TPCH-Q3  | parquet / none / none | 2.13   | 2.18

[jira] [Comment Edited] (IMPALA-7310) Compute Stats not computing NULLs as a distinct value causing wrong estimates

2018-09-20 Thread Paul Rogers (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16622856#comment-16622856
 ] 

Paul Rogers edited comment on IMPALA-7310 at 9/20/18 11:36 PM:
---

The planner uses NDVs to make binary decisions: do I do x or y? (Do I put t1 on 
the build side of a join, or do I put it on the probe side?) In most cases, the 
values being compared differ by orders of magnitude, so fine nuances of value 
are not important. We simply need some reasonable non-zero number so that the 
calcs can play out.

The simplest fix is to handle the non-stats case for a 
[{{ColumnStats}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java]
 instance.

The current code in {{initColStats()}} initializes NDV to -1 (undefined). 
Suggested alternatives:
 * If type is Boolean, NDV = 2
 * If type is TINYINT, NDV = 256.
 * If type is anything else, assume NDV = some constant, say 1000.

Note that the above logic actually already exists in 
{{createHiveColStatsData()}} where it is used to bound the NDV value. So, we 
just reuse it. That code also suggests we can cap NDV at the row count. But, 
since our guesses are small, the row count might not add much value. Or, our 
NDV guess might be some fraction of row count, if we want to be fancy.

{{ColumnStats}} already has a {{hasStats()}} method which checks if NDV is 
other than -1. Since NDV will now always be set to some value, change this 
method to check only {{numNulls_}}, which will continue to be -1 without stats.
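
The suggested defaults and the row-count cap can be sketched as follows 
(function and type names are illustrative, not Impala's actual API):

```python
def default_ndv(col_type, row_count=None):
    """A-priori NDV guess for a column with no stats (illustrative defaults)."""
    if col_type == "BOOLEAN":
        ndv = 2
    elif col_type == "TINYINT":
        ndv = 256
    else:
        ndv = 1000  # arbitrary constant for all other types
    # Bound the guess by the row count when it is known, mirroring the
    # bounding already done for real stats in createHiveColStatsData().
    if row_count is not None:
        ndv = min(ndv, row_count)
    return ndv
```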

Finally, in {{createHiveColStatsData}}, set a floor on NDV at 1 to account for 
the fact that an all-null column has NDV=0. Or, to be conservative, if NDV < 
10, add one to NDV to account for nulls.
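
One way to combine both suggestions (the floor and the conservative bump) is 
sketched below; this is illustrative, not the actual patch:

```python
def bound_ndv_for_nulls(ndv):
    """Floor NDV at 1 so an all-null column (NDV=0) cannot zero out
    cardinality products; for small NDVs, conservatively count NULL
    as one more distinct value."""
    ndv = max(ndv, 1)
    if ndv < 10:
        ndv += 1  # conservative null adjustment for low-cardinality columns
    return ndv
```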

Next, modify {{update()}} to use the defaults (to be set in {{initColStats()}} 
for the "incompatible" case.

As a result, when the plan nodes ask for NDV, they won't get a 0 value if we 
have no data, nor will they get 0 if a column is all nulls.

Add or modify unit tests to verify the above logic, especially the defaults 
case and how the defaults propagate up the plan tree.

The risk is that some plans will change. We hope they change to favor getting 
the correct plan more often. But, there will be some use case for which the 
old, wrong, values produced a more accurate plan than the new estimates. This 
is always a risk.


was (Author: paul.rogers):
The planner uses NDVs to make binary decisions: do I do x or y? (Do I put t1 on 
the build side of a join, or do I put it on the probe side?) In most cases, the 
values being compared differ by orders of magnitude, so fine nuances of value 
are not important. We simply need some reasonable non-zero number so that the 
calcs can play out.

The simplest fix is to handle the non-stats case for a 
[{{ColumnStats}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java]
 instance.

The current code in {{initColStats()}} initializes NDV to -1 (undefined). 
Suggested alternatives:

 * If type is Boolean, NDV = 2
 * If type is TINYINT, NDV = 256.
 * If type is anything else, assume NDV = some constant, say 1000.

Note that the above logic actually already exists in 
{{createHiveColStatsData()}} where it is used to bound the NDV value. So, we 
just reuse it. That code also suggests we can cap NDV at the row count. But, 
since our guesses are small, the row count might not add much value. Or, our 
NDV guess might be some fraction of row count, if we want to be fancy.

Then, add a flag (if some useful field does not exist) to indicate if the NDV 
is from stats or estimates. (Will be useful in computing selectivity later.)

Finally, in {{createHiveColStatsData}}, set a floor on NDV at 1 to account for 
the fact that an all-null column has NDV=0. Or, to be conservative, if NDV < 
10, add one to NDV to account for nulls.

Next, modify {{update()}} to use the defaults (to be set in {{initColStats()}} 
for the "incompatible" case.

As a result, when the plan nodes ask for NDV, they won't get a 0 value if we 
have no data, nor will they get 0 if a column is all nulls.

Add or modify unit tests to verify the above logic, especially the defaults 
case and how the defaults propagate up the plan tree.

The risk is that some plans will change. We hope they change to favor getting 
the correct plan more often. But, there will be some use case for which the 
old, wrong, values produced a more accurate plan than the new estimates. This 
is always a risk.

> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -
>
> Key: IMPALA-7310
> URL: https://issues.apache.org/jira/browse/IMPALA-7310
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: 

[jira] [Commented] (IMPALA-7603) Incorrect NDV expression for col1 op col2

2018-09-20 Thread Paul Rogers (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16622779#comment-16622779
 ] 

Paul Rogers commented on IMPALA-7603:
-

Turns out that a similar limitation exists for functions. Consider the 
following test:

{noformat}
verifyNdv("nullValue(id)", 2);
{noformat}

This test fails: the actual value is 7300, which is the NDV of the {{id}} 
column. So, the code computes a wrong result.

The {{nullValue()}} function returns a {{Boolean}}, so it can have only two 
values. But, we use a generic formula of

{noformat}
NDV(f(x)) = NDV(x)
{noformat}

Though it is probably not that important, we could cap the NDV at the smaller 
of the argument's NDV and the number of values the return type can hold (in 
this case {{Boolean}}, which is 2).

> Incorrect NDV expression for col1 op col2
> -
>
> Key: IMPALA-7603
> URL: https://issues.apache.org/jira/browse/IMPALA-7603
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Reporter: Paul Rogers
>Priority: Minor
>
> Consider the 
> [{{ExprNdvTest}}|https://github.com/apache/impala/blob/master/fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java]
>  test case. The code contains tests for the CASE expression. Add tests for 
> simple arithmetic expressions:
> {noformat}
> verifyNdv("id + 2", 7300);
> verifyNdv("id * 2", 7300);
> {noformat}
> The above suggests that the NDV of a column op const is
> {noformat}
> max(NDV(column), NDV(const)) =
> max(NDV(column), 1) = NDV(column)
> {noformat}
> This is good and as expected.
> Now try two columns:
> {noformat}
> verifyNdv("id + int_col", 7300);
> verifyNdv("id * int_col", 7300);
> {noformat}
> This is *not* expected. Though the two columns are from the same table, they 
> are not correlated: there is no reason to believe that the value of "id" 
> determines the value of "int_col" in the general case. (Perhaps the table is 
> the Cartesian product of the two fields.)
> In this case, the calculation should be:
> {noformat}
> NDV(a op b) = NDV(a) * NDV(b)
> {noformat}
> There might be some back-off to account for overlapping results. Could not 
> readily find a reference for these calcs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Updated] (IMPALA-7603) Incorrect NDV expression for col1 op col2

2018-09-20 Thread Paul Rogers (JIRA)


 [ 
https://issues.apache.org/jira/browse/IMPALA-7603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated IMPALA-7603:

Description: 
Consider the 
[{{ExprNdvTest}}|https://github.com/apache/impala/blob/master/fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java]
 test case. The code contains tests for the CASE expression. Add tests for 
simple arithmetic expressions:

{noformat}
verifyNdv("id + 2", 7300);
verifyNdv("id * 2", 7300);
{noformat}

The above suggests that the NDV of a column op const is

{noformat}
max(NDV(column), NDV(const)) =
max(NDV(column), 1) = NDV(column)
{noformat}

This is good and as expected.

Now try two columns:

{noformat}
verifyNdv("id + int_col", 7300);
verifyNdv("id * int_col", 7300);
{noformat}

This is *not* expected. Though the two columns are from the same table, they 
are not correlated: there is no reason to believe that the value of "id" 
determines the value of "int_col" in the general case. (Perhaps the table is 
the Cartesian product of the two fields.)

In this case, the calculation should be:

{noformat}
NDV(a op b) = NDV(a) * NDV(b)
{noformat}

There might be some back-off to account for overlapping results. Could not 
readily find a reference for these calcs.
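
A sketch of the proposed estimate under the independence assumption, with an 
optional cap at the table's row count (no back-off for overlapping results is 
modeled here):

```python
def ndv_binary_op(ndv_a, ndv_b, row_count=None):
    """NDV(a op b) = NDV(a) * NDV(b) for uncorrelated columns.
    The result can never exceed the number of rows, so cap it when known."""
    est = ndv_a * ndv_b
    if row_count is not None:
        est = min(est, row_count)
    return est
```

For a column op constant this degenerates to the existing behavior, since 
NDV(const) = 1.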

  was:
Consider the 
[[{{ExprNdvTest}}|https://github.com/apache/impala/blob/master/fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java]
 test case. The code contains tests for the CASE expression. Add tests for 
simple arithmetic expressions:

{noformat}
verifyNdv("id + 2", 7300);
verifyNdv("id * 2", 7300);
{noformat}

The above suggests that the NDV of a column op const is

{noformat}
max(NDV(column), NDV(const)) =
max(NDV(column), 1) = NDV(column)
{noformat}

This is good and as expected.

Now try two columns:

{noformat}
verifyNdv("id + int_col", 7300);
verifyNdv("id * int_col", 7300);
{noformat}

This is *not* expected. Though the two columns are from the same table, they 
are not correlated: there is no reason to believe that the value of "id" 
determines the value of "int_col" in the general case. (Perhaps the table is 
the Cartesian product of the two fields.)

In this case, the calculation should be:

{noformat}
NDV(a op b) = NDV(a) * NDV(b)
{noformat}

There might be some back-off to account for overlapping results. Could not 
readily find a reference for these calcs.


> Incorrect NDV expression for col1 op col2
> -
>
> Key: IMPALA-7603
> URL: https://issues.apache.org/jira/browse/IMPALA-7603
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Reporter: Paul Rogers
>Priority: Minor
>
> Consider the 
> [{{ExprNdvTest}}|https://github.com/apache/impala/blob/master/fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java]
>  test case. The code contains tests for the CASE expression. Add tests for 
> simple arithmetic expressions:
> {noformat}
> verifyNdv("id + 2", 7300);
> verifyNdv("id * 2", 7300);
> {noformat}
> The above suggests that the NDV of a column op const is
> {noformat}
> max(NDV(column), NDV(const)) =
> max(NDV(column), 1) = NDV(column)
> {noformat}
> This is good and as expected.
> Now try two columns:
> {noformat}
> verifyNdv("id + int_col", 7300);
> verifyNdv("id * int_col", 7300);
> {noformat}
> This is *not* expected. Though the two columns are from the same table, they 
> are not correlated: there is no reason to believe that the value of "id" 
> determines the value of "int_col" in the general case. (Perhaps the table is 
> the Cartesian product of the two fields.)
> In this case, the calculation should be:
> {noformat}
> NDV(a op b) = NDV(a) * NDV(b)
> {noformat}
> There might be some back-off to account for overlapping results. Could not 
> readily find a reference for these calcs.






[jira] [Created] (IMPALA-7603) Incorrect NDV expression for col1 op col2

2018-09-20 Thread Paul Rogers (JIRA)
Paul Rogers created IMPALA-7603:
---

 Summary: Incorrect NDV expression for col1 op col2
 Key: IMPALA-7603
 URL: https://issues.apache.org/jira/browse/IMPALA-7603
 Project: IMPALA
  Issue Type: Bug
  Components: Frontend
Reporter: Paul Rogers


Consider the {{ExprNdvTest}} test case. The code contains tests for the 
CASE expression. Add tests for simple arithmetic expressions:

{noformat}
verifyNdv("id + 2", 7300);
verifyNdv("id * 2", 7300);
{noformat}

The above suggests that the NDV of a column op const is

{noformat}
max(NDV(column), NDV(const)) =
max(NDV(column), 1) = NDV(column)
{noformat}

This is good and as expected.

Now try two columns:

{noformat}
verifyNdv("id + int_col", 7300);
verifyNdv("id * int_col", 7300);
{noformat}

This is *not* expected. Though the two columns are from the same table, they 
are not correlated: there is no reason to believe that the value of "id" 
determines the value of "int_col" in the general case. (Perhaps the table is 
the Cartesian product of the two fields.)

In this case, the calculation should be:

{noformat}
NDV(a op b) = NDV(a) * NDV(b)
{noformat}








[jira] [Comment Edited] (IMPALA-7310) Compute Stats not computing NULLs as a distinct value causing wrong estimates

2018-09-20 Thread Paul Rogers (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16622712#comment-16622712
 ] 

Paul Rogers edited comment on IMPALA-7310 at 9/20/18 9:08 PM:
--

Per the suggestion of [~jeszyb], created IMPALA-7601 to describe the general 
issue, allowing this ticket to focus on the specific issue of handling a column 
full of NULLS when doing join planning. The general solution can solve this, or 
we can also solve this ticket with a specific fix for this one case.


was (Author: paul.rogers):
Per the suggestion of [~jeszyb], created IMPALA-7601 to describe the general 
issue, allowing this ticket to focus on the specific issue of handling a column 
full of NULLS. The general solution can solve this, or we can solve this with a 
specific fix for this one case.

> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -
>
> Key: IMPALA-7310
> URL: https://issues.apache.org/jira/browse/IMPALA-7310
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, 
> Impala 2.11.0, Impala 3.0, Impala 2.12.0
>Reporter: Zsombor Fedor
>Assignee: Paul Rogers
>Priority: Major
>
> As seen in other DBMSs,
> {code:java}
> NDV(col){code}
> does not count NULL as a distinct value. The same also applies to
> {code:java}
> COUNT(DISTINCT col){code}
> This is working as intended, but when computing column statistics it can 
> cause some anomalies (e.g. bad join order) as compute stats uses NDV() to 
> determine columns' NDVs.
>  
> For example when aggregating more columns, the estimated cardinality is 
> [counted as the product of the columns' number of distinct 
> values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669]
>  If there is a column full of NULLs the whole product will be 0.
>  
> There are two possible fixes for this.
> Either we should count NULLs as a distinct value when Computing Stats in the 
> query:
> {code:java}
> SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1 
> as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
> instead of
> {code:java}
> SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
>  
>  
> Or we should change the planner 
> [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169]
>  to take care of this bug.
>  






[jira] [Created] (IMPALA-7602) Definition of NDV differs between planner and stats mechanism

2018-09-20 Thread Paul Rogers (JIRA)
Paul Rogers created IMPALA-7602:
---

 Summary: Definition of NDV differs between planner and stats 
mechanism
 Key: IMPALA-7602
 URL: https://issues.apache.org/jira/browse/IMPALA-7602
 Project: IMPALA
  Issue Type: Improvement
  Components: Frontend
Reporter: Paul Rogers


See IMPALA-7310 which says that the Impala NDV function is implemented as 
"number of non-null distinct values." IMPALA-7310 also says that the stats 
gathering mechanism uses the same definition.

Down in the comments, we point to 
[{{ExprNdvTest}}|https://github.com/apache/impala/blob/master/fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java]
 which shows that, in the planner itself, when working with constant 
expressions, NULL is considered a distinct value.

In the case described in IMPALA-7310, this means that a column of only nulls 
has an NDV=0 if stats are used, NDV=1 if constants are used.

This is a minor point, but it would be good to use a single definition 
everywhere. That way, if we use the "number of non-null distinct values" rule, 
the 
"adjusted NDV" is always one more than the "raw" NDV. As it is now, we can't be 
sure when to add the null adjustment because we don't know if it is already 
included.
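
Under that single "non-null distinct values" definition, the null adjustment 
becomes a mechanical +1; a sketch:

```python
def adjusted_ndv(raw_ndv, num_nulls):
    """Raw NDV counts only non-null distinct values; add one distinct
    value when the column actually contains NULLs."""
    return raw_ndv + (1 if num_nulls > 0 else 0)
```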







[jira] [Commented] (IMPALA-7310) Compute Stats not computing NULLs as a distinct value causing wrong estimates

2018-09-20 Thread Paul Rogers (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16622712#comment-16622712
 ] 

Paul Rogers commented on IMPALA-7310:
-

Per the suggestion of [~jeszyb], created IMPALA-7601 to describe the general 
issue, allowing this ticket to focus on the specific issue of handling a column 
full of NULLS. The general solution can solve this, or we can solve this with a 
specific fix for this one case.

> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -
>
> Key: IMPALA-7310
> URL: https://issues.apache.org/jira/browse/IMPALA-7310
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, 
> Impala 2.11.0, Impala 3.0, Impala 2.12.0
>Reporter: Zsombor Fedor
>Assignee: Paul Rogers
>Priority: Major
>
> As seen in other DBMSs,
> {code:java}
> NDV(col){code}
> does not count NULL as a distinct value. The same also applies to
> {code:java}
> COUNT(DISTINCT col){code}
> This is working as intended, but when computing column statistics it can 
> cause some anomalies (e.g. bad join order) as compute stats uses NDV() to 
> determine columns' NDVs.
>  
> For example when aggregating more columns, the estimated cardinality is 
> [counted as the product of the columns' number of distinct 
> values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669]
>  If there is a column full of NULLs the whole product will be 0.
>  
> There are two possible fixes for this.
> Either we should count NULLs as a distinct value when Computing Stats in the 
> query:
> {code:java}
> SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1 
> as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
> instead of
> {code:java}
> SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
>  
>  
> Or we should change the planner 
> [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169]
>  to take care of this bug.
>  






[jira] [Commented] (IMPALA-7601) Define a-priori selectivity and NDV values

2018-09-20 Thread Paul Rogers (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16622709#comment-16622709
 ] 

Paul Rogers commented on IMPALA-7601:
-

Please see [~tarmstr...@cloudera.com]'s comment in IMPALA-7310 for a bit of 
history.

To address the points there:

1. Having defaults makes testing *easier* because tests can be written using 
the defaults. Such tests focus on verifying that values propagate up the plan 
tree as expected. Such tests need no external data.
 2. Tests with external data (real stats) then can focus on whether the numbers 
are applied where expected, rather than using the external data to validate 
planning calcs.
 3. Queries perform adequately even in the absence of stats. This takes 
pressure off needing to recalculate stats over and over.
 4. Reduction of complexity is a good thing. Having adequate unit tests to 
cover all paths is better. This then allows us to implement (and test) both the 
default and with-stats code paths.

Along these lines, the changes proposed here only make sense if:

1. Unit tests are created (or available) to verify the calcs work as expected.
 2. A mechanism is available to assess the impact of the changes in plans that 
will result.

These are non-trivial efforts, so this is more of a longer term suggestion than 
an immediate fix.

> Define a-priori selectivity and NDV values
> --
>
> Key: IMPALA-7601
> URL: https://issues.apache.org/jira/browse/IMPALA-7601
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Affects Versions: Impala 2.12.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
>
> Impala makes extensive use of table stats during query planning. For example, 
> the NDV (number of distinct values) is used to compute _selectivity_, the 
> degree of reduction (also called the _reduction factor_) provided by a 
> predicate. For example:
> {noformat}
> SELECT * FROM t WHERE t.a = 10
> {noformat}
> If we know that {{t.a}} has an NDV=100, then we can predict (given a uniform 
> distribution of values), that the above query will pick out one of these 100 
> values, and that the reduction factor is 1/100 = 0.01. Thus the selectivity 
> of the predicate {{t.a = 10}} is 0.01.
> h4. Selectivity Without Stats
> All this is good. But, what happens if statistics are not available for table 
> {{t}}? How are we to know the selectivity of the predicate?
> It could be that {{t.a}} contains nothing but the value 10, so there is no 
> reduction at all. It could be that {{t.a}} contains no values of 10, so the 
> reduction is total, no rows are returned. The classic solution is to assume 
> that the user put the predicate in the query for the purpose of subsetting 
> the data. The classic value, shown in the [Ramakrishnan and Gehrke 
> book|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html]
>  is to assume a 90% reduction, or a selectivity of 0.1. Indeed this value is 
> seen in 
> [Impala|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/analysis/Expr.java]:
> {noformat}
>   // To be used where we cannot come up with a
>   // better estimate (selectivity_ is -1).
>   public static double DEFAULT_SELECTIVITY = 0.1;
> {noformat}
> As it turns out, however, the actual implementation is a bit more complex, as 
> hinted at by the above comment. Impala relies on stats. Given stats, 
> specifically the NDV, we compute selectivity as:
> {noformat}
> selectivity = 1 / ndv
> {noformat}
> What happens if there is no available NDV? In that case, we skip the 
> selectivity calculation and leave it at a special default value of -1.0, 
> which seems to indicate "unknown". See 
> [{{BinaryPredicate.analyzeImpl()}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java].
> Later, when we use the selectivity to calculate reduction factors, we simply 
> skip any node with a selectivity of -1. You can see that in 
> [{{PlanNode.computeCombinedSelectivity()}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/PlanNode.java].
> The result is that Impala is a bit more strict than classic DB optimizers. If 
> stats are present, they are used. If stats are not present, Impala assumes 
> that predicates have no effect.
> This ticket proposal a number of interrelated changes to add a-priori (before 
> observation) defaults for selectivity and NDV based on classic DB practice.
> h4. Proposal: Add A-priori Selectivity Values
> But, we said earlier that users include a predicate because they expect it to 
> do something. So, we are actually discarding the (albeit vague) information 
> that the user provided.
> This is why many optimizers go ahead and assume a default 0.1 reduction 
> factor 

[jira] [Created] (IMPALA-7601) Define a-priori selectivity and NDV values

2018-09-20 Thread Paul Rogers (JIRA)
Paul Rogers created IMPALA-7601:
---

 Summary: Define a-priori selectivity and NDV values
 Key: IMPALA-7601
 URL: https://issues.apache.org/jira/browse/IMPALA-7601
 Project: IMPALA
  Issue Type: Improvement
  Components: Frontend
Affects Versions: Impala 2.12.0
Reporter: Paul Rogers
Assignee: Paul Rogers


Impala makes extensive use of table stats during query planning. For example, 
the NDV (number of distinct values) is used to compute _selectivity_, the 
degree of reduction (also called the _reduction factor_) provided by a 
predicate. For example:

{noformat}
SELECT * FROM t WHERE t.a = 10
{noformat}

If we know that {{t.a}} has an NDV=100, then we can predict (given a uniform 
distribution of values), that the above query will pick out one of these 100 
values, and that the reduction factor is 1/100 = 0.01. Thus the selectivity of 
the predicate {{t.a = 10}} is 0.01.

h4. Selectivity Without Stats

All this is good. But, what happens if statistics are not available for table 
{{t}}? How are we to know the selectivity of the predicate?

It could be that {{t.a}} contains nothing but the value 10, so there is no 
reduction at all. It could be that {{t.a}} contains no values of 10, so the 
reduction is total, no rows are returned. The classic solution is to assume 
that the user put the predicate in the query for the purpose of subsetting the 
data. The classic value, shown in the [Ramakrishnan and Gehrke 
book|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html]
 is to assume a 90% reduction, or a selectivity of 0.1. Indeed this value is 
seen in 
[Impala|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/analysis/Expr.java]:

{noformat}
  // To be used where we cannot come up with a
  // better estimate (selectivity_ is -1).
  public static double DEFAULT_SELECTIVITY = 0.1;
{noformat}

As it turns out, however, the actual implementation is a bit more complex, as 
hinted at by the above comment. Impala relies on stats. Given stats, 
specifically the NDV, we compute selectivity as:

{noformat}
selectivity = 1 / ndv
{noformat}

What happens if there is no available NDV? In that case, we skip the selectivity 
calculation and leave it at a special default value of -1.0, which seems to 
indicate "unknown". See 
[{{BinaryPredicate.analyzeImpl()}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java].

Later, when we use the selectivity to calculate reduction factors, we simply 
skip any node with a selectivity of -1. You can see that in 
[{{PlanNode.computeCombinedSelectivity()}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/PlanNode.java].

The result is that Impala is a bit more strict than classic DB optimizers. If 
stats are present, they are used. If stats are not present, Impala assumes that 
predicates have no effect.

h4. Proposal: Add A-priori Selectivity Values

But, we said earlier that users include a predicate because they expect it to 
do something. So, we are actually discarding the (albeit vague) information 
that the user provided.

This is why many optimizers go ahead and assume a default 0.1 reduction factor 
for equality predicates even if no stats are available. The first proposal of 
this ticket is to use that default reduction factor even if no stats are 
present. This says that some reduction will occur, but, to be conservative, we 
assume not a huge reduction.

h4. Proposal: Add Selectivity for All Predicate Operators

As present, Impala computes reduction factors only for equality nodes. (See 
IMPALA-7560.) The book suggests rule-of-thumb estimates for other operators:

* {{!=}} - 0.1
* {{<}}, {{<=}}, {{>}}, {{>=}} - 0.3
* {{BETWEEN}} - 0.25

Over in the Drill project, DRILL-5254 attempted to work out better estimates 
based on math and probability. However, the conclusion there was that, without 
NDV and histograms, there is more information in the user's intent than in the 
math. That is, if the user writes {{WHERE t.a \!= 10}}, there is a conditional 
probability that the user believes that this is a highly restrictive predicate, 
especially on big data. So, the reduction factor (which is a probability) is 
the same for {{=}} and {{!=}} in the absence of information. The same reasoning 
probably led to the rule-of-thumb values in the 
[R|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html]
 book.

So, the second proposal is that Impala use the classic numbers for other 
operators when no stats are available.

h4. Proposal: Use Stats-Based Selectivity Estimates When Available

If stats are available, then we can "run the numbers" and get better estimates:

* {{p(a = x)}} = 1 / NDV
* {{p(a != x)}} = {{1 - p(a = x)}} = 1 - 1 / NDV

So, the third proposal is to use the above 

[jira] [Created] (IMPALA-7601) Define a-priori selectivity and NDV values

2018-09-20 Thread Paul Rogers (JIRA)
Paul Rogers created IMPALA-7601:
---

 Summary: Define a-priori selectivity and NDV values
 Key: IMPALA-7601
 URL: https://issues.apache.org/jira/browse/IMPALA-7601
 Project: IMPALA
  Issue Type: Improvement
  Components: Frontend
Affects Versions: Impala 2.12.0
Reporter: Paul Rogers
Assignee: Paul Rogers


Impala makes extensive use of table stats during query planning. For example, 
the NDV (number of distinct values) is used to compute _selectivity_, the 
degree of reduction (also called the _reduction factor_) provided by a 
predicate. For example:

{noformat}
SELECT * FROM t WHERE t.a = 10
{noformat}

If we know that {{t.a}} has an NDV=100, then we can predict (given a uniform 
distribution of values), that the above query will pick out one of these 100 
values, and that the reduction factor is 1/100 = 0.01. Thus the selectivity of 
the predicate {{t.a = 10}} is 0.01.
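The arithmetic above can be sketched as follows (an illustration with hypothetical helper names, not Impala's actual code):

```python
# Selectivity of an equality predicate under the uniform-distribution
# assumption: each of the NDV distinct values is equally likely.
def equality_selectivity(ndv):
    if ndv <= 0:
        return -1.0  # "unknown", mirroring Impala's sentinel value
    return 1.0 / ndv

# t.a has NDV = 100, so "t.a = 10" keeps roughly 1% of the rows.
sel = equality_selectivity(100)          # 0.01
rows_in = 1_000_000
rows_out = int(rows_in * sel)            # estimated output cardinality
```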

h4. Selectivity Without Stats

All this is good. But, what happens if statistics are not available for table 
{{t}}? How are we to know the selectivity of the predicate?

It could be that {{t.a}} contains nothing but the value 10, so there is no 
reduction at all. It could be that {{t.a}} contains no values of 10, so the 
reduction is total: no rows are returned. The classic solution is to assume 
that the user put the predicate in the query for the purpose of subsetting the 
data. The classic value, shown in the [Ramakrishnan and Gehrke 
book|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html]
 is to assume a 90% reduction, or a selectivity of 0.1. Indeed this value is 
seen in 
[Impala|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/analysis/Expr.java]:

{noformat}
  // To be used where we cannot come up with a
  // better estimate (selectivity_ is -1).
  public static double DEFAULT_SELECTIVITY = 0.1;
{noformat}

As it turns out, however, the actual implementation is a bit more complex, as 
hinted at by the above comment. Impala relies on stats. Given stats, 
specifically the NDV, we compute selectivity as:

{noformat}
selectivity = 1 / ndv
{noformat}

What happens if there is no available NDV? In that case, we skip the selectivity 
calculation and leave it at a special default value of -1.0, which seems to 
indicate "unknown". See 
[{{BinaryPredicate.analyzeImpl()}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java].

Later, when we use the selectivity to calculate reduction factors, we simply 
skip any node with a selectivity of -1. You can see that in 
[{{PlanNode.computeCombinedSelectivity()}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/PlanNode.java].
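That skip-the-unknowns behavior can be sketched like this (an illustration of the described logic, not the actual {{PlanNode}} code):

```python
def combined_selectivity(selectivities):
    # Multiply the known selectivities together; -1 means "unknown"
    # and is skipped, i.e. the predicate is assumed to have no effect.
    result = 1.0
    for s in selectivities:
        if s == -1:
            continue
        result *= s
    return result

# One predicate with stats (sel = 0.01) and one without (sel = -1):
# only the known predicate reduces the estimate.
combined = combined_selectivity([0.01, -1])
```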

The result is that Impala is a bit more strict than classic DB optimizers. If 
stats are present, they are used. If stats are not present, Impala assumes that 
predicates have no effect.

h4. Proposal: Add A-priori Selectivity Values

But, we said earlier that users include a predicate because they expect it to 
do something. So, we are actually discarding the (albeit vague) information 
that the user provided.

This is why many optimizers go ahead and assume a default 0.1 reduction factor 
for equality predicates even if no stats are available. The first proposal of 
this ticket is to use that default reduction factor even if no stats are 
present. This says that some reduction will occur, but, to be conservative, we 
assume not a huge reduction.

h4. Proposal: Add Selectivity for All Predicate Operators

At present, Impala computes reduction factors only for equality nodes. (See 
IMPALA-7560.) The book suggests rule-of-thumb estimates for other operators:

* {{!=}} - 0.1
* {{<}}, {{<=}}, {{>}}, {{>=}} - 0.3
* {{BETWEEN}} - 0.25
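
A sketch of the proposal using the book's rule-of-thumb values (hypothetical table and helper; this is not current Impala behavior):

```python
# A-priori selectivities in the absence of stats, following the
# Ramakrishnan & Gehrke rules of thumb (proposed defaults).
DEFAULT_SELECTIVITIES = {
    '=': 0.1,
    '!=': 0.1,
    '<': 0.3, '<=': 0.3, '>': 0.3, '>=': 0.3,
    'BETWEEN': 0.25,
}

def apriori_selectivity(op):
    # Fall back to the generic 0.1 default for anything unlisted.
    return DEFAULT_SELECTIVITIES.get(op, 0.1)
```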

Over in the Drill project, DRILL-5254 attempted to work out better estimates 
based on math and probability. However, the conclusion there was that, without 
NDV and histograms, there is more information in the user's intent than in the 
math. That is, if the user writes {{WHERE t.a \!= 10}}, there is a conditional 
probability that the user believes that this is a highly restrictive predicate, 
especially on big data. So, the reduction factor (which is a probability) is 
the same for {{=}} and {{!=}} in the absence of information. The same reasoning 
probably led to the rule-of-thumb values in the [Ramakrishnan and 
Gehrke|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html]
 book.

So, the second proposal is that Impala use the classic numbers for other 
operators when no stats are available.

h4. Proposal: Use Stats-Based Selectivity Estimates When Available

If stats are available, then we can "run the numbers" and get better estimates:

* {{p(a = x)}} = 1 / NDV
* {{p(a != x)}} = {{1 - p(a = x)}} = 1 - 1 / NDV

So, the third proposal is to use the above stats-based estimates whenever the 
NDV is available.
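
Combining the stats-based formulas with the a-priori fallback gives a sketch like the following (hypothetical helper; assumes a non-positive NDV means stats are unavailable):

```python
def selectivity(op, ndv):
    # With stats, "run the numbers"; without, fall back to the
    # proposed a-priori defaults.
    if ndv > 0:
        if op == '=':
            return 1.0 / ndv
        if op == '!=':
            return 1.0 - 1.0 / ndv
    return 0.1  # a-priori default for = and !=

# NDV = 100: "=" keeps 1% of rows, "!=" keeps the remaining 99%.
```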

[jira] [Commented] (IMPALA-5654) Disallow managed Kudu table to explicitly set Kudu tbl name in CREATE TABLE

2018-09-20 Thread Dan Burkert (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-5654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16622528#comment-16622528
 ] 

Dan Burkert commented on IMPALA-5654:
-

[~boristyukin], [~sergey.ben...@gmail.com] the issue with the old behavior was 
that it could be used to circumvent mandatory access control in Impala. 
Consider a user with permissions on table {{a}} but not on table {{b}}: they 
could alter table {{a}} to point to table {{b}}, and thus gain access.

> Disallow managed Kudu table to explicitly set Kudu tbl name in CREATE TABLE
> ---
>
> Key: IMPALA-5654
> URL: https://issues.apache.org/jira/browse/IMPALA-5654
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Affects Versions: Impala 2.8.0
>Reporter: Matthew Jacobs
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: kudu
> Fix For: Impala 2.12.0
>
>
> There's no reason to allow this behavior. Managed tables create Kudu tables 
> with the name (in Kudu) "impala::db_name.table_name". Renaming (in Impala) a 
> managed Kudu table results in renaming the underlying Kudu table, e.g. rename 
> table_name to new_table name results in changing the Kudu table to 
> "impala::db_name.new_table_name". But allowing a new table to specify the 
> kudu table name is inconsistent with the renaming behavior and just 
> introduces opportunities for confusion.
> {code}
>   private void analyzeManagedKuduTableParams(Analyzer analyzer) throws 
> AnalysisException {
> // If no Kudu table name is specified in tblproperties, generate one 
> using the
> // current database as a prefix to avoid conflicts in Kudu.
> // TODO: Disallow setting this manually for managed tables
> if (!getTblProperties().containsKey(KuduTable.KEY_TABLE_NAME)) {
>   getTblProperties().put(KuduTable.KEY_TABLE_NAME,
>   KuduUtil.getDefaultCreateKuduTableName(getDb(), getTbl()));
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-1760) Add decommissioning support / graceful shutdown / quiesce

2018-09-20 Thread Tim Armstrong (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-1760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16622481#comment-16622481
 ] 

Tim Armstrong commented on IMPALA-1760:
---

I created a subtask for one failure. The other failure was on S3 in 
test_shutdown_executor, where it appears that, because of the S3 synthetic 
block size, the files of lineitem were carved into 32MB chunks and some 
executors didn't actually get any midpoints of Parquet row groups, so they had 
no work to do and finished early.

> Add decommissioning support / graceful shutdown / quiesce
> -
>
> Key: IMPALA-1760
> URL: https://issues.apache.org/jira/browse/IMPALA-1760
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Distributed Exec
>Affects Versions: Impala 2.1.1
>Reporter: Henry Robinson
>Assignee: Tim Armstrong
>Priority: Critical
>  Labels: resource-management, scalability, scheduler, usability
>
> In larger clusters, node maintenance is a frequent occurrence. There's no way 
> currently to stop an Impala node without failing running queries, without 
> draining queries across the whole cluster first. We should fix that.






[jira] [Assigned] (IMPALA-4308) Make the minidumps archived in our Jenkins jobs usable

2018-09-20 Thread Thomas Tauber-Marshall (JIRA)


 [ 
https://issues.apache.org/jira/browse/IMPALA-4308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Tauber-Marshall reassigned IMPALA-4308:
--

Assignee: Thomas Tauber-Marshall

> Make the minidumps archived in our Jenkins jobs usable
> --
>
> Key: IMPALA-4308
> URL: https://issues.apache.org/jira/browse/IMPALA-4308
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Infrastructure
>Affects Versions: Impala 2.8.0
>Reporter: Taras Bobrovytsky
>Assignee: Thomas Tauber-Marshall
>Priority: Major
>  Labels: breakpad, test-infra
>
> The minidumps that are archived in our Jenkins jobs are unusable because we 
> do not save the symbols that are required to extract stack traces. As part of 
> the log archiving process, we should:
> # Extract the necessary symbols and save them into the $IMPALA_HOME/logs 
> directory.
> # Automatically collect the backtraces from the minidumps and save them into 
> $IMPALA_HOME/logs directory in a text file






[jira] [Resolved] (IMPALA-7488) TestShellCommandLine::test_cancellation hangs occasionally

2018-09-20 Thread Thomas Tauber-Marshall (JIRA)


 [ 
https://issues.apache.org/jira/browse/IMPALA-7488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Tauber-Marshall resolved IMPALA-7488.

   Resolution: Fixed
Fix Version/s: Impala 3.1.0

> TestShellCommandLine::test_cancellation hangs occasionally
> --
>
> Key: IMPALA-7488
> URL: https://issues.apache.org/jira/browse/IMPALA-7488
> Project: IMPALA
>  Issue Type: Bug
>  Components: Infrastructure
>Affects Versions: Impala 3.1.0
>Reporter: Tim Armstrong
>Assignee: Thomas Tauber-Marshall
>Priority: Critical
>  Labels: broken-build
> Fix For: Impala 3.1.0
>
> Attachments: psauxf.txt
>
>
> We've seen a couple of hung builds with no queries running on the cluster. I 
> got "ps auxf" output and it looks like an impala-shell process is hanging 
> around.
> I'm guessing the IMPALA-7407 fix somehow relates to this.





[jira] [Created] (IMPALA-7600) Mem limit exceeded in test_kudu_scan_mem_usage

2018-09-20 Thread Thomas Tauber-Marshall (JIRA)
Thomas Tauber-Marshall created IMPALA-7600:
--

 Summary: Mem limit exceeded in test_kudu_scan_mem_usage
 Key: IMPALA-7600
 URL: https://issues.apache.org/jira/browse/IMPALA-7600
 Project: IMPALA
  Issue Type: Bug
  Components: Backend
Affects Versions: Impala 3.1.0
Reporter: Thomas Tauber-Marshall
Assignee: Bikramjeet Vig


Seen in an exhaustive release build:
{noformat}
00:05:35  TestScanMemLimit.test_kudu_scan_mem_usage[exec_option: {'batch_size': 
0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 5000, 'disable_codegen': 
False, 'abort_on_error': 1, 'debug_action': None, 
'exec_single_node_rows_threshold': 0} | table_format: avro/snap/block] 
00:05:35 [gw6] linux2 -- Python 2.7.5 
/data/jenkins/workspace/impala-asf-master-exhaustive-release/repos/Impala/bin/../infra/python/env/bin/python
00:05:35 query_test/test_mem_usage_scaling.py:358: in test_kudu_scan_mem_usage
00:05:35 self.run_test_case('QueryTest/kudu-scan-mem-usage', vector)
00:05:35 common/impala_test_suite.py:408: in run_test_case
00:05:35 result = self.__execute_query(target_impalad_client, query, 
user=user)
00:05:35 common/impala_test_suite.py:623: in __execute_query
00:05:35 return impalad_client.execute(query, user=user)
00:05:35 common/impala_connection.py:160: in execute
00:05:35 return self.__beeswax_client.execute(sql_stmt, user=user)
00:05:35 beeswax/impala_beeswax.py:176: in execute
00:05:35 handle = self.__execute_query(query_string.strip(), user=user)
00:05:35 beeswax/impala_beeswax.py:350: in __execute_query
00:05:35 self.wait_for_finished(handle)
00:05:35 beeswax/impala_beeswax.py:371: in wait_for_finished
00:05:35 raise ImpalaBeeswaxException("Query aborted:" + error_log, None)
00:05:35 E   ImpalaBeeswaxException: ImpalaBeeswaxException:
00:05:35 EQuery aborted:Memory limit exceeded: Error occurred on backend 
impala-ec2-centos74-m5-4xlarge-ondemand-0e2c.vpc.cloudera.com:22000 by fragment 
b34270820f59a0c9:a507139e0001
00:05:35 E   Memory left in process limit: 10.12 GB
00:05:35 E   Memory left in query limit: -16.92 KB
00:05:35 E   Query(b34270820f59a0c9:a507139e): memory limit exceeded. 
Limit=4.00 MB Reservation=0 ReservationLimit=0 OtherMemory=4.02 MB Total=4.02 
MB Peak=4.02 MB
00:05:35 E Fragment b34270820f59a0c9:a507139e: Reservation=0 
OtherMemory=40.10 KB Total=40.10 KB Peak=340.00 KB
00:05:35 E   EXCHANGE_NODE (id=2): Reservation=32.00 KB OtherMemory=0 
Total=32.00 KB Peak=32.00 KB
00:05:35 E KrpcDeferredRpcs: Total=0 Peak=0
00:05:35 E   PLAN_ROOT_SINK: Total=0 Peak=0
00:05:35 E   CodeGen: Total=103.00 B Peak=332.00 KB
00:05:35 E Fragment b34270820f59a0c9:a507139e0001: Reservation=0 
OtherMemory=3.98 MB Total=3.98 MB Peak=3.98 MB
00:05:35 E   SORT_NODE (id=1): Total=342.00 KB Peak=342.00 KB
00:05:35 E   KUDU_SCAN_NODE (id=0): Total=3.63 MB Peak=3.63 MB
00:05:35 E Queued Batches: Total=3.30 MB Peak=3.63 MB
00:05:35 E   KrpcDataStreamSender (dst_id=2): Total=1.16 KB Peak=1.16 KB
00:05:35 E   CodeGen: Total=3.66 KB Peak=1.14 MB
00:05:35 E   
00:05:35 E   Memory limit exceeded: Error occurred on backend 
impala-ec2-centos74-m5-4xlarge-ondemand-0e2c.vpc.cloudera.com:22000 by fragment 
b34270820f59a0c9:a507139e0001
00:05:35 E   Memory left in process limit: 10.12 GB
00:05:35 E   Memory left in query limit: -16.92 KB
00:05:35 E   Query(b34270820f59a0c9:a507139e): memory limit exceeded. 
Limit=4.00 MB Reservation=0 ReservationLimit=0 OtherMemory=4.02 MB Total=4.02 
MB Peak=4.02 MB
00:05:35 E Fragment b34270820f59a0c9:a507139e: Reservation=0 
OtherMemory=40.10 KB Total=40.10 KB Peak=340.00 KB
00:05:35 E   EXCHANGE_NODE (id=2): Reservation=32.00 KB OtherMemory=0 
Total=32.00 KB Peak=32.00 KB
00:05:35 E KrpcDeferredRpcs: Total=0 Peak=0
00:05:35 E   PLAN_ROOT_SINK: Total=0 Peak=0
00:05:35 E   CodeGen: Total=103.00 B Peak=332.00 KB
00:05:35 E Fragment b34270820f59a0c9:a507139e0001: Reservation=0 
OtherMemory=3.98 MB Total=3.98 MB Peak=3.98 MB
00:05:35 E   SORT_NODE (id=1): Total=342.00 KB Peak=342.00 KB
00:05:35 E   KUDU_SCAN_NODE (id=0): Total=3.63 MB Peak=3.63 MB
00:05:35 E Queued Batches: Total=3.30 MB Peak=3.63 MB
00:05:35 E   KrpcDataStreamSender (dst_id=2): Total=1.16 KB Peak=1.16 KB
00:05:35 E   CodeGen: Total=3.66 KB Peak=1.14 MB (1 of 2 similar)
00:05:35 - Captured stderr call 
-
00:05:35 -- executing against localhost:21000
00:05:35 use functional_avro_snap;
00:05:35 
00:05:35 -- 2018-09-19 22:07:44,471 INFO MainThread: Started query 
ca487b9fdcc14d67:58776467
00:05:35 SET batch_size=0;
00:05:35 SET num_nodes=0;
00:05:35 SET disable_codegen_rows_threshold=5000;
00:05:35 SET abort_on_error=1;
00:05:35 SET 

[jira] [Resolved] (IMPALA-7579) TestObservability.test_query_profile_contains_all_events fails for S3 tests

2018-09-20 Thread Andrew Sherman (JIRA)


 [ 
https://issues.apache.org/jira/browse/IMPALA-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Sherman resolved IMPALA-7579.

   Resolution: Fixed
Fix Version/s: Impala 3.1.0

Test now passes in S3

> TestObservability.test_query_profile_contains_all_events fails for S3 tests
> ---
>
> Key: IMPALA-7579
> URL: https://issues.apache.org/jira/browse/IMPALA-7579
> Project: IMPALA
>  Issue Type: Bug
>  Components: Infrastructure
>Affects Versions: Impala 3.1.0
>Reporter: Vuk Ercegovac
>Assignee: Andrew Sherman
>Priority: Blocker
> Fix For: Impala 3.1.0
>
>
> For S3 tests, the test introduced in [https://gerrit.cloudera.org/#/c/11387/] 
> fails with:
> {noformat}
> query_test/test_observability.py:225: in 
> test_query_profile_contains_all_events
> self.hdfs_client.delete_file_dir(path)
> util/hdfs_util.py:90: in delete_file_dir
> if not self.exists(path):
> util/hdfs_util.py:138: in exists
> self.get_file_dir_status(path)
> util/hdfs_util.py:102: in get_file_dir_status
> return super(PyWebHdfsClientWithChmod, self).get_file_dir_status(path)
> /data/jenkins/workspace/impala-asf-master-core-s3/repos/Impala/infra/python/env/lib/python2.7/site-packages/pywebhdfs/webhdfs.py:335:
>  in get_file_dir_status
> response = requests.get(uri, allow_redirects=True)
> /data/jenkins/workspace/impala-asf-master-core-s3/repos/Impala/infra/python/env/lib/python2.7/site-packages/requests/api.py:69:
>  in get
> return request('get', url, params=params, **kwargs)
> /data/jenkins/workspace/impala-asf-master-core-s3/repos/Impala/infra/python/env/lib/python2.7/site-packages/requests/api.py:50:
>  in request
> response = session.request(method=method, url=url, **kwargs)
> /data/jenkins/workspace/impala-asf-master-core-s3/repos/Impala/infra/python/env/lib/python2.7/site-packages/requests/sessions.py:465:
>  in request
> resp = self.send(prep, **send_kwargs)
> /data/jenkins/workspace/impala-asf-master-core-s3/repos/Impala/infra/python/env/lib/python2.7/site-packages/requests/sessions.py:573:
>  in send
> r = adapter.send(request, **kwargs)
> /data/jenkins/workspace/impala-asf-master-core-s3/repos/Impala/infra/python/env/lib/python2.7/site-packages/requests/adapters.py:415:
>  in send
> raise ConnectionError(err, request=request)
> E   ConnectionError: ('Connection aborted.', error(111, 'Connection 
> refused')){noformat}
> The dir delete might want to be guarded by an "if exists". The failure cases 
> may differ between hdfs and s3, which is probably what this test ran into.







[jira] [Resolved] (IMPALA-4044) Archive impalad, statestored, catalogd along with minidump for Jenkins jobs

2018-09-20 Thread Lars Volker (JIRA)


 [ 
https://issues.apache.org/jira/browse/IMPALA-4044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Volker resolved IMPALA-4044.
-
   Resolution: Duplicate
Fix Version/s: Not Applicable

> Archive impalad, statestored, catalogd along with minidump for Jenkins jobs
> ---
>
> Key: IMPALA-4044
> URL: https://issues.apache.org/jira/browse/IMPALA-4044
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Infrastructure
>Affects Versions: Impala 2.6.0
>Reporter: Sailesh Mukil
>Priority: Major
>  Labels: breakpad, test-infra
> Fix For: Not Applicable
>
>
> I've noticed that some Jenkins jobs do not archive the impala binaries in 
> case of a crash where only a minidump is generated and not a core dump.
> This makes it hard to resolve the minidump without the symbols.
> Example job:
> http://sandbox.jenkins.cloudera.com/job/impala-private-build-and-test/4076/
> (Marked as keep forever now, should unmark once done)






[jira] [Updated] (IMPALA-4308) Make the minidumps archived in our Jenkins jobs usable

2018-09-20 Thread Lars Volker (JIRA)


 [ 
https://issues.apache.org/jira/browse/IMPALA-4308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Volker updated IMPALA-4308:

Issue Type: Improvement  (was: Bug)

> Make the minidumps archived in our Jenkins jobs usable
> --
>
> Key: IMPALA-4308
> URL: https://issues.apache.org/jira/browse/IMPALA-4308
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Infrastructure
>Affects Versions: Impala 2.8.0
>Reporter: Taras Bobrovytsky
>Priority: Major
>  Labels: breakpad, test-infra
>
> The minidumps that are archived in our Jenkins jobs are unusable because we 
> do not save the symbols that are required to extract stack traces. As part of 
> the log archiving process, we should:
> # Extract the necessary symbols and save them into the $IMPALA_HOME/logs 
> directory.
> # Automatically collect the backtraces from the minidumps and save them into 
> $IMPALA_HOME/logs directory in a text file



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Updated] (IMPALA-4308) Make the minidumps archived in our Jenkins jobs usable

2018-09-20 Thread Lars Volker (JIRA)


 [ 
https://issues.apache.org/jira/browse/IMPALA-4308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Volker updated IMPALA-4308:

Labels: breakpad test-infra  (was: )

> Make the minidumps archived in our Jenkins jobs usable
> --
>
> Key: IMPALA-4308
> URL: https://issues.apache.org/jira/browse/IMPALA-4308
> Project: IMPALA
>  Issue Type: Bug
>  Components: Infrastructure
>Affects Versions: Impala 2.8.0
>Reporter: Taras Bobrovytsky
>Priority: Major
>  Labels: breakpad, test-infra
>
> The minidumps that are archived in our Jenkins jobs are unusable because we 
> do not save the symbols that are required to extract stack traces. As part of 
> the log archiving process, we should:
> # Extract the necessary symbols and save them into the $IMPALA_HOME/logs 
> directory.
> # Automatically collect the backtraces from the minidumps and save them into 
> $IMPALA_HOME/logs directory in a text file






[jira] [Updated] (IMPALA-4044) Archive impalad, statestored, catalogd along with minidump for Jenkins jobs

2018-09-20 Thread Lars Volker (JIRA)


 [ 
https://issues.apache.org/jira/browse/IMPALA-4044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Volker updated IMPALA-4044:

Labels: breakpad test-infra  (was: test-infra)

> Archive impalad, statestored, catalogd along with minidump for Jenkins jobs
> ---
>
> Key: IMPALA-4044
> URL: https://issues.apache.org/jira/browse/IMPALA-4044
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Infrastructure
>Affects Versions: Impala 2.6.0
>Reporter: Sailesh Mukil
>Priority: Major
>  Labels: breakpad, test-infra
>
> I've noticed that some Jenkins jobs do not archive the impala binaries in 
> case of a crash where only a minidump is generated and not a core dump.
> This makes it hard to resolve the minidump without the symbols.
> Example job:
> http://sandbox.jenkins.cloudera.com/job/impala-private-build-and-test/4076/
> (Marked as keep forever now, should unmark once done)






[jira] [Commented] (IMPALA-7310) Compute Stats not computing NULLs as a distinct value causing wrong estimates

2018-09-20 Thread Tim Armstrong (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16622239#comment-16622239
 ] 

Tim Armstrong commented on IMPALA-7310:
---

WRT throwing away data, the philosophy in the Impala planner is generally not 
to try too hard to infer statistics if "proper" statistics are missing. E.g. 
we don't try to estimate row count from file size, so the planner can make 
"obviously dumb" decisions with join ordering in the absence of stats.

I think the motivation was to avoid investing too much effort (implementation 
and maintenance) to handle cases where best practices weren't followed, and 
also as a way to discourage bad practices.

I don't know if this is the right decision, but there's some sense to it.

> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -
>
> Key: IMPALA-7310
> URL: https://issues.apache.org/jira/browse/IMPALA-7310
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, 
> Impala 2.11.0, Impala 3.0, Impala 2.12.0
>Reporter: Zsombor Fedor
>Assignee: Paul Rogers
>Priority: Major
>
> As seen in other DBMSs
> {code:java}
> NDV(col){code}
> not counting NULL as a distinct value. The same also applies to
> {code:java}
> COUNT(DISTINCT col){code}
> This is working as intended, but when computing column statistics it can 
> cause some anomalies (i.g. bad join order) as compute stats uses NDV() to 
> determine columns NDVs.
>  
> For example when aggregating more columns, the estimated cardinality is 
> [counted as the product of the columns' number of distinct 
> values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669]
>  If there is a column full of NULLs the whole product will be 0.
>  
> There are two possible fix for this.
> Either we should count NULLs as a distinct value when Computing Stats in the 
> query:
> {code:java}
> SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1 
> as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
> instead of
> {code:java}
> SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
>  
>  
> Or we should change the planner 
> [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169]
>  to take care of this bug.
>  
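
The zeroed-product anomaly described in the report can be sketched as follows (hypothetical NDV numbers, not taken from the Impala code):

```python
# Estimated grouping cardinality as the product of per-column NDVs,
# the behavior the report describes.
def grouping_cardinality(ndvs):
    result = 1
    for ndv in ndvs:
        result *= ndv
    return result

# A column full of NULLs reports NDV = 0 and zeroes the whole product:
grouping_cardinality([1000, 50, 0])   # -> 0
# Counting NULL as one distinct value avoids the collapse:
grouping_cardinality([1000, 50, 1])   # -> 50000
```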


