[jira] [Updated] (HIVE-19889) Wrong results due to PPD of non deterministic functions with CBO

2018-06-27 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-19889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated HIVE-19889:

Fix Version/s: 3.1.0  (was: 4.0.0)

> Wrong results due to PPD of non deterministic functions with CBO
> 
>
> Key: HIVE-19889
> URL: https://issues.apache.org/jira/browse/HIVE-19889
> Project: Hive
>  Issue Type: Bug
>Reporter: Janaki Lahorani
>Assignee: Janaki Lahorani
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: HIVE-19889.1.patch, HIVE-19889.2.patch
>
>
> The following query can give wrong results when CBO is on:
> {code}
> select * from (
>   select part1, randum123
>   from (SELECT *, cast(rand() as double) AS randum123 FROM testA
>         where part1='CA' and part2 = 'ABC') a
>   where randum123 <= 0.5) s
> where s.randum123 > 0.25 limit 20;
> {code}
> The plan of the query is as follows:
> {code}
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Map Operator Tree:
>           TableScan
>             alias: testa
>             Statistics: Num rows: 2 Data size: 4580 Basic stats: COMPLETE Column stats: NONE
>             Filter Operator
>               predicate: ((rand() <= 0.5D) and (rand() > 0.25D)) (type: boolean)
>               Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column stats: NONE
>               Select Operator
>                 expressions: 'CA' (type: string), rand() (type: double)
>                 outputColumnNames: _col0, _col1
>                 Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column stats: NONE
>                 Limit
>                   Number of rows: 20
>                   Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column stats: NONE
>                   File Output Operator
>                     compressed: false
>                     Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column stats: NONE
>                     table:
>                         input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                         output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>                         serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
>     Fetch Operator
>       limit: 20
>       Processor Tree:
>         ListSink
> {code}
> The relevant part in the plan is the filter:
> {code}
> Filter Operator
>   predicate: ((rand() <= 0.5D) and (rand() > 0.25D)) (type: boolean)
> {code}
> The predicates randum123 <= 0.5 and s.randum123 > 0.25 were both pushed down, 
> and randum123 was resolved to rand() in each of them.  This is wrong because 
> rand() is a non-deterministic UDF and is now invoked twice per row, once for 
> each predicate.  Each call can produce a value that satisfies its own 
> predicate independently of the other, whereas the query intends a single 
> rand() value per row to fall in the range 0.25 < randum123 <= 0.5.
> A sample result:
> {code}
> CA  0.9191984370369802
> CA  0.397933021566812
> {code}
> where the first row does not satisfy the condition (0.9191984370369802 > 0.5).
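
To make the effect described above concrete, here is a small simulation sketch. It is not Hive code. It assumes that every rand() occurrence in the plan (two in the Filter Operator, one in the Select Operator) is evaluated independently, and it uses Python's random module as a stand-in for the UDF. The function names, row count, and seed are illustrative choices only.

```python
import random

N = 100_000  # number of simulated rows (arbitrary choice)


def intended_row(rng):
    """Intended semantics: rand() is evaluated once and both bounds test it."""
    r = rng.random()
    return (0.25 < r <= 0.5), r


def pushed_down_row(rng):
    """Mimics the reported plan: each rand() occurrence is a separate call."""
    # Filter Operator: (rand() <= 0.5D) and (rand() > 0.25D)
    keep = (rng.random() <= 0.5) and (rng.random() > 0.25)
    # Select Operator: rand() appears again in the projected expressions
    projected = rng.random()
    return keep, projected


def simulate(row_fn, n=N, seed=19889):
    """Return the projected values of all rows that pass the filter."""
    rng = random.Random(seed)
    rows = (row_fn(rng) for _ in range(n))
    return [value for keep, value in rows if keep]


intended = simulate(intended_row)
pushed = simulate(pushed_down_row)
out_of_range = [v for v in pushed if not (0.25 < v <= 0.5)]

print(f"intended plan kept {len(intended) / N:.3f} of rows")
print(f"pushed-down plan kept {len(pushed) / N:.3f} of rows")
print(f"out-of-range rows returned by pushed-down plan: {len(out_of_range)}")
```

Under these assumptions the intended filter keeps roughly a quarter of the rows and every returned value lies in (0.25, 0.5]. The pushed-down variant keeps roughly 0.5 * 0.75 = 0.375 of the rows, and the values it returns are unrelated to the filter, which matches the out-of-range row in the sample result.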



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-19889) Wrong results due to PPD of non deterministic functions with CBO

2018-06-21 Thread Naveen Gangam (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-19889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naveen Gangam updated HIVE-19889:
-
  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

Fix has been committed to master. Thank you for your contribution [~janulatha]

> Wrong results due to PPD of non deterministic functions with CBO
> 
>
> Key: HIVE-19889
> URL: https://issues.apache.org/jira/browse/HIVE-19889
> Project: Hive
>  Issue Type: Bug
>Reporter: Janaki Lahorani
>Assignee: Janaki Lahorani
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-19889.1.patch, HIVE-19889.2.patch
>
>





[jira] [Updated] (HIVE-19889) Wrong results due to PPD of non deterministic functions with CBO

2018-06-18 Thread Janaki Lahorani (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-19889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Janaki Lahorani updated HIVE-19889:
---
Description: 
The following query can give wrong results when CBO is on:
{code}
select * from (
select part1,randum123
from (SELECT *, cast(rand() as double) AS randum123 FROM testA where part1='CA' 
and part2 = 'ABC') a
where randum123 <= 0.5) s where s.randum123 > 0.25 limit 20;

The plan of the query is as follows:
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: testa
            Statistics: Num rows: 2 Data size: 4580 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((rand() <= 0.5D) and (rand() > 0.25D)) (type: boolean)
              Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: 'CA' (type: string), rand() (type: double)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column stats: NONE
                Limit
                  Number of rows: 20
                  Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: 20
      Processor Tree:
        ListSink
{code}

The relevant part in the plan is the filter:

{code}
Filter Operator
  predicate: ((rand() <= 0.5D) and (rand() > 0.25D)) (type: boolean)
{code}

The predicates randum123 <= 0.5 and s.randum123 > 0.25 were pushed down.  And 
randum123 was resolved to rand().  This is bad because it will result in 
invocation of rand() two times and rand() UDF is non-deterministic.  Both the 
rand calls can generate values that can satisfy the predicates independently, 
but not together, whereas the original intention of the query is to give 
results when rand falls between 0.25 and 0.5.

A sample result:

{code}
CA  0.9191984370369802
CA  0.397933021566812
{code}

where the condition was not satisfied.
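
The thread does not show what the attached patches change. Purely as an illustration of the general idea behind avoiding this class of bug, the toy sketch below refuses to inline an aliased expression into a pushed-down predicate when that expression is non-deterministic. Everything in it (the Expr class, push_predicate, the string-based substitution) is hypothetical and is not Hive's optimizer API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Expr:
    """Hypothetical expression node: SQL text plus a determinism flag."""
    sql: str
    deterministic: bool


def push_predicate(alias: str, definition: Expr, predicate: str) -> Tuple[str, bool]:
    """Try to push `predicate`, which refers to `alias`, below the projection.

    Returns (predicate_text, pushed). When the aliased expression is
    non-deterministic the predicate is left untouched, because every inlined
    copy of the expression would be evaluated separately.
    """
    if not definition.deterministic:
        return predicate, False  # keep the filter above the projection
    # Naive textual substitution, good enough for this toy example only.
    return predicate.replace(alias, definition.sql), True


rand_col = Expr("rand()", deterministic=False)
state_col = Expr("upper(part1)", deterministic=True)

print(push_predicate("randum123", rand_col, "randum123 <= 0.5"))
print(push_predicate("state", state_col, "state = 'CA'"))
```

With such a guard both predicates on randum123 would stay above the projection and test the single value computed there, while predicates over deterministic expressions could still be pushed down.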

  was:
The following query can give wrong results when CBO is on:
{code}
select * from (
select part1,randum123
from (SELECT *, cast(rand() as double) AS randum123 FROM testA where part1='CA' 
and part2 = 'ABC') a
where randum123 <= 0.5) s where s.randum123 > 0.25 limit 20;

The plan of the query is as follows:
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: testa
            Statistics: Num rows: 2 Data size: 4580 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((rand() <= 0.5D) and (rand() > 0.25D)) (type: boolean)
              Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: 'CA' (type: string), rand() (type: double)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column stats: NONE
                Limit
                  Number of rows: 20
                  Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: 20
      Processor Tree:
        ListSink
{code}

The relevant part in the plan is the filter:

{code}
Filter Operator
  predicate: ((rand() <= 0.5D) and (rand() > 0.25D)) (type: boolean)
{code}

The predicates s.randum123 > 0.25 and s.randum123 > 0.25 were pushed down.  And 
randum123 was resolved to rand().  This is bad because it will result in 
invocation of rand() two times and rand() UDF is non-deterministic.  Both the 
rand calls can generate values that can satisfy the predicates independently, 
but not together, whereas the original intention of the query is to give 
results when rand falls between 0.25 and 0.5.

A sample result:

{code}
CA  

[jira] [Updated] (HIVE-19889) Wrong results due to PPD of non deterministic functions with CBO

2018-06-15 Thread Janaki Lahorani (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-19889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Janaki Lahorani updated HIVE-19889:
---
Attachment: HIVE-19889.2.patch

> Wrong results due to PPD of non deterministic functions with CBO
> 
>
> Key: HIVE-19889
> URL: https://issues.apache.org/jira/browse/HIVE-19889
> Project: Hive
>  Issue Type: Bug
>Reporter: Janaki Lahorani
>Assignee: Janaki Lahorani
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-19889.1.patch, HIVE-19889.2.patch
>
>





[jira] [Updated] (HIVE-19889) Wrong results due to PPD of non deterministic functions with CBO

2018-06-14 Thread Zoltan Haindrich (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-19889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Haindrich updated HIVE-19889:

Description: 
The following query can give wrong results when CBO is on:
{code}
select * from (
select part1,randum123
from (SELECT *, cast(rand() as double) AS randum123 FROM testA where part1='CA' 
and part2 = 'ABC') a
where randum123 <= 0.5) s where s.randum123 > 0.25 limit 20;

The plan of the query is as follows:
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: testa
            Statistics: Num rows: 2 Data size: 4580 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((rand() <= 0.5D) and (rand() > 0.25D)) (type: boolean)
              Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: 'CA' (type: string), rand() (type: double)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column stats: NONE
                Limit
                  Number of rows: 20
                  Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: 20
      Processor Tree:
        ListSink
{code}

The relevant part in the plan is the filter:

{code}
Filter Operator
  predicate: ((rand() <= 0.5D) and (rand() > 0.25D)) (type: boolean)
{code}

The predicates s.randum123 > 0.25 and s.randum123 > 0.25 were pushed down.  And 
randum123 was resolved to rand().  This is bad because it will result in 
invocation of rand() two times and rand() UDF is non-deterministic.  Both the 
rand calls can generate values that can satisfy the predicates independently, 
but not together, whereas the original intention of the query is to give 
results when rand falls between 0.25 and 0.5.

A sample result:

{code}
CA  0.9191984370369802
CA  0.397933021566812
{code}

where the condition was not satisfied.

  was:
The following query can give wrong results when CBO is on:
select * from (
select part1,randum123
from (SELECT *, cast(rand() as double) AS randum123 FROM testA where part1='CA' 
and part2 = 'ABC') a
where randum123 <= 0.5) s where s.randum123 > 0.25 limit 20;

The plan of the query is as follows:
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: testa
            Statistics: Num rows: 2 Data size: 4580 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((rand() <= 0.5D) and (rand() > 0.25D)) (type: boolean)
              Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: 'CA' (type: string), rand() (type: double)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column stats: NONE
                Limit
                  Number of rows: 20
                  Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: 20
      Processor Tree:
        ListSink

The relevant part in the plan is the filter:
Filter Operator
  predicate: ((rand() <= 0.5D) and (rand() > 0.25D)) (type: boolean)

The predicates s.randum123 > 0.25 and s.randum123 > 0.25 were pushed down.  And 
randum123 was resolved to rand().  This is bad because it will result in 
invocation of rand() two times and rand() UDF is non-deterministic.  Both the 
rand calls can generate values that can satisfy the predicates independently, 
but not together, whereas the original intention of the query is to give 
results when rand falls between 0.25 and 0.5.

A sample result:
CA  0.9191984370369802
CA  

[jira] [Updated] (HIVE-19889) Wrong results due to PPD of non deterministic functions with CBO

2018-06-13 Thread Janaki Lahorani (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-19889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Janaki Lahorani updated HIVE-19889:
---
Attachment: HIVE-19889.1.patch

> Wrong results due to PPD of non deterministic functions with CBO
> 
>
> Key: HIVE-19889
> URL: https://issues.apache.org/jira/browse/HIVE-19889
> Project: Hive
>  Issue Type: Bug
>Reporter: Janaki Lahorani
>Assignee: Janaki Lahorani
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-19889.1.patch
>
>





[jira] [Updated] (HIVE-19889) Wrong results due to PPD of non deterministic functions with CBO

2018-06-13 Thread Janaki Lahorani (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-19889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Janaki Lahorani updated HIVE-19889:
---
Fix Version/s: 4.0.0
   Status: Patch Available  (was: Open)

> Wrong results due to PPD of non deterministic functions with CBO
> 
>
> Key: HIVE-19889
> URL: https://issues.apache.org/jira/browse/HIVE-19889
> Project: Hive
>  Issue Type: Bug
>Reporter: Janaki Lahorani
>Assignee: Janaki Lahorani
>Priority: Major
> Fix For: 4.0.0
>
>


