[jira] [Updated] (HIVE-12678) BETWEEN relational operator sometimes returns incorrect results against PARQUET tables

2015-12-16 Thread Nicholas Brenwald (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-12678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Brenwald updated HIVE-12678:
-
Affects Version/s: 1.2.1

> BETWEEN relational operator sometimes returns incorrect results against 
> PARQUET tables
> --
>
> Key: HIVE-12678
> URL: https://issues.apache.org/jira/browse/HIVE-12678
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 1.1.0, 1.2.1
>Reporter: Nicholas Brenwald
>
> When querying a parquet table, the BETWEEN relational operator returns 
> incorrect results when hive.optimize.index.filter and 
> hive.optimize.ppd.storage are enabled
> Create a parquet table:
> {code}
> create table t(c string) stored as parquet;
> {code}
> Insert some strings representing dates
> {code}
> insert into t select '2015-12-09' from default.dual limit 1;
> insert into t select '2015-12-10' from default.dual limit 1;
> insert into t select '2015-12-11' from default.dual limit 1;
> {code}
> h3. Example 1
> This query correctly returns 3:
> {code}
> set hive.optimize.index.filter=true;
> set hive.optimize.ppd.storage=true;
> select count(*) from t where c >= '2015-12-09' and c <= '2015-12-11';
> +--+--+
> | _c0  |
> +--+--+
> | 3|
> +--+--+
> {code}
> This query incorrectly returns 1:
> {code}
> set hive.optimize.index.filter=true;
> set hive.optimize.ppd.storage=true;
> select count(*) from t where c between '2015-12-09' and '2015-12-11';
> +--+--+
> | _c0  |
> +--+--+
> | 1|
> +--+--+
> {code}
> Disabling hive.optimize.findex.filter resolves the problem. This query now 
> correctly returns 3:
> {code}
> set hive.optimize.index.filter=false;
> set hive.optimize.ppd.storage=true;
> select count(*) from t where c between '2015-12-09' and '2015-12-11';
> +--+--+
> | _c0  |
> +--+--+
> | 3|
> +--+--+
> {code}
> Disabling hive.optimize.ppd.storage resolves the problem. This query now 
> correctly returns 3:
> {code}
> set hive.optimize.index.filter=true;
> set hive.optimize.ppd.storage=false;
> select count(*) from t where c between '2015-12-09' and '2015-12-11';
> +--+--+
> | _c0  |
> +--+--+
> | 3|
> +--+--+
> {code}
> h3. Example 2
> This query correctly returns 1:
> {code}
> set hive.optimize.index.filter=true;
> set hive.optimize.ppd.storage=true;
> select count(*) from t where c >=  '2015-12-10' and c <= '2015-12-10';
> +--+--+
> | _c0  |
> +--+--+
> | 1|
> +--+--+
> {code}
> This query incorrectly returns 0:
> {code}
> set hive.optimize.index.filter=true;
> set hive.optimize.ppd.storage=true;
> select count(*) from t where c between '2015-12-10' and '2015-12-10';
> +--+--+
> | _c0  |
> +--+--+
> | 0|
> +--+--+
> {code}
> Disabling hive.optimize.findex.filter resolves the problem. This query now 
> correctly returns 1:
> {code}
> set hive.optimize.index.filter=false;
> set hive.optimize.ppd.storage=true;
> select count(*) from t where c >= '2015-12-10' and c <= '2015-12-10';
> +--+--+
> | _c0  |
> +--+--+
> | 1|
> +--+--+
> {code}
> Disabling hive.optimize.ppd.storage resolves the problem. This query now 
> correctly returns 1:
> {code}
> set hive.optimize.index.filter=true;
> set hive.optimize.ppd.storage=false;
> select count(*) from t where c >= '2015-12-10' and c <= '2015-12-10';
> +--+--+
> | _c0  |
> +--+--+
> | 1|
> +--+--+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-11603) IndexOutOfBoundsException thrown when accessing a union all subquery and filtering on a column which does not exist in all underlying tables

2015-10-30 Thread Nicholas Brenwald (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983285#comment-14983285
 ] 

Nicholas Brenwald commented on HIVE-11603:
--

Hi [~jpullokkaran] no I haven't had time to work on this, if you could take a 
look that would be great.

> IndexOutOfBoundsException thrown when accessing a union all subquery and 
> filtering on a column which does not exist in all underlying tables
> 
>
> Key: HIVE-11603
> URL: https://issues.apache.org/jira/browse/HIVE-11603
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.14.0, 1.3.0, 1.2.1
> Environment: Hadoop 2.6
>Reporter: Nicholas Brenwald
>Priority: Minor
> Fix For: 2.0.0
>
>
> Create two empty tables t1 and t2
> {code}
> CREATE TABLE t1(c1 STRING);
> CREATE TABLE t2(c1 STRING, c2 INT);
> {code}
> Create a view on these two tables
> {code}
> CREATE VIEW v1 AS 
> SELECT c1, c2 
> FROM (
> SELECT c1, CAST(NULL AS INT) AS c2 FROM t1
> UNION ALL
> SELECT c1, c2 FROM t2
> ) x;
> {code}
> Then run
> {code}
> SELECT COUNT(*) from v1 
> WHERE c2 = 0;
> {code}
> We expect to get a result of zero, but instead the query fails with stack 
> trace:
> {code}
> Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>   at java.util.ArrayList.rangeCheck(ArrayList.java:635)
>   at java.util.ArrayList.get(ArrayList.java:411)
>   at 
> org.apache.hadoop.hive.ql.exec.UnionOperator.initializeOp(UnionOperator.java:86)
>   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:362)
>   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:481)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:438)
>   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
>   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:481)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:438)
>   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.initializeMapOperator(MapOperator.java:442)
>   at 
> org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:119)
>   ... 22 more
> {code}
> Workarounds include disabling ppd,
> {code}
> set hive.optimize.ppd=false;
> {code}
> Or changing the view so that column c2 is null cast to double:
> {code}
> CREATE VIEW v1_workaround AS 
> SELECT c1, c2 
> FROM (
> SELECT c1, CAST(NULL AS DOUBLE) AS c2 FROM t1
> UNION ALL
> SELECT c1, c2 FROM t2
> ) x;
> {code}
> The problem seems to occur in branch-1.1, branch-1.2, branch-1 but seems to 
> be resolved in master (2.0.0)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-11603) IndexOutOfBoundsException thrown when accessing a union all subquery and filtering on a column which does not exist in all underlying tables

2015-08-19 Thread Nicholas Brenwald (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14702941#comment-14702941
 ] 

Nicholas Brenwald commented on HIVE-11603:
--

Query plan from branch-1:
{code}
:~/$ hive
Logging initialized using configuration in 
jar:file:~/branch-1/hive/packaging/target/apache-hive-1.3.0-SNAPSHOT-bin/apache-hive-1.3.0-SNAPSHOT-bin/lib/hive-common-1.3.0-SNAPSHOT.jar!/hive-log4j.properties
hive EXPLAIN SELECT COUNT(*) from v1 WHERE c2 = 0;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
Map Reduce
  Map Operator Tree:
  TableScan
alias: t1
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column 
stats: COMPLETE
Filter Operator
  predicate: false (type: boolean)
  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column 
stats: COMPLETE
  Union
Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL 
Column stats: PARTIAL
Select Operator
  Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL 
Column stats: PARTIAL
  Group By Operator
aggregations: count()
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: PARTIAL
Reduce Output Operator
  sort order: 
  Statistics: Num rows: 1 Data size: 8 Basic stats: 
COMPLETE Column stats: PARTIAL
  value expressions: _col0 (type: bigint)
  TableScan
alias: t2
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column 
stats: NONE
Filter Operator
  predicate: (c2 = 0) (type: boolean)
  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column 
stats: NONE
  Select Operator
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL 
Column stats: NONE
Union
  Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL 
Column stats: PARTIAL
  Select Operator
Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL 
Column stats: PARTIAL
Group By Operator
  aggregations: count()
  mode: hash
  outputColumnNames: _col0
  Statistics: Num rows: 1 Data size: 8 Basic stats: 
COMPLETE Column stats: PARTIAL
  Reduce Output Operator
sort order: 
Statistics: Num rows: 1 Data size: 8 Basic stats: 
COMPLETE Column stats: PARTIAL
value expressions: _col0 (type: bigint)
  Reduce Operator Tree:
Group By Operator
  aggregations: count(VALUE._col0)
  mode: mergepartial
  outputColumnNames: _col0
  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column 
stats: PARTIAL
  File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column 
stats: PARTIAL
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
Fetch Operator
  limit: -1
  Processor Tree:
ListSink
{code}

Query plan from branch master 
{code}
:~/$ hive
Logging initialized using configuration in 
jar:file:~/master/hive/packaging/target/apache-hive-2.0.0-SNAPSHOT-bin/apache-hive-2.0.0-SNAPSHOT-bin/lib/hive-common-2.0.0-SNAPSHOT.jar!/hive-log4j2.xml
hive EXPLAIN SELECT COUNT(*) from v1 WHERE c2 = 0;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
Map Reduce
  Map Operator Tree:
  TableScan
alias: t1
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column 
stats: NONE
Filter Operator
  predicate: false (type: boolean)
  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column 
stats: NONE
  Select Operator
expressions: c1 (type: string), null (type: int)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL 
Column stats: NONE
Union
  Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL 
Column stats: NONE
  Select Operator
Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL 
Column stats: NONE

[jira] [Commented] (HIVE-11410) Join with subquery containing a group by incorrectly returns no results

2015-08-03 Thread Nicholas Brenwald (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14652089#comment-14652089
 ] 

Nicholas Brenwald commented on HIVE-11410:
--

[~mmccline] I have done some further testing today compiling from source 
various branches. 

The issue only seems to be present in release-1.1.0 (which is part of the 
Cloudera distribution we use). The issue cannot be reproduced in branch-1.1 or 
branch-1.2 (even when using our environment variables/hive-site.xml etc). As 
such I think this can be marked as resolved. 
Thanks for looking into this and sorry for the false alarm.

 Join with subquery containing a group by incorrectly returns no results
 ---

 Key: HIVE-11410
 URL: https://issues.apache.org/jira/browse/HIVE-11410
 Project: Hive
  Issue Type: Bug
  Components: Hive
Affects Versions: 1.1.0
Reporter: Nicholas Brenwald
Assignee: Matt McCline
Priority: Minor
 Attachments: hive-site.xml


 Start by creating a table *t* with columns *c1* and *c2* and populate with 1 
 row of data. For example create table *t* from an existing table which 
 contains at least 1 row of data by running:
 {code}
 create table t as select 'abc' as c1, 0 as c2 from Y limit 1; 
 {code}
 Table *t* looks like the following:
 ||c1||c2||
 |abc|0|
 Running the following query then returns zero results.
 {code}
 SELECT 
   t1.c1
 FROM 
   t t1
 JOIN
 (SELECT 
t2.c1,
MAX(t2.c2) AS c2
  FROM 
t t2 
  GROUP BY 
t2.c1
 ) t3
 ON t1.c2=t3.c2
 {code}
 However, we expected to see the following:
 ||c1||
 |abc|
 The problem seems to relate to the fact that in the subquery, we group by 
 column *c1*, but this is not subsequently used in the join condition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-11410) Join with subquery containing a group by incorrectly returns no results

2015-07-31 Thread Nicholas Brenwald (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14649184#comment-14649184
 ] 

Nicholas Brenwald commented on HIVE-11410:
--

Hi, 
Thanks for taking a look at this so quickly.
I confirm we are using branch-1.1 (distributed as part of CDH 5.4.4). For 
example, hive cli jar is named hive-cli-1.1.0-cdh5.4.4.jar. When we run 'hive' 
on the command line, we see the following printed message showing the 
hive-common-1.1.0 is being used.
{code}
Logging initialized using configuration in 
jar:file:/cloudera/parcel-repo/CDH-5.4.4-1.cdh5.4.4.p0.4/jars/hive-common-1.1.0-cdh5.4.4.jar!/hive-log4j.properties
{code}

And the explain plan we see is as follows:
{code}
hive EXPLAIN
 SELECT 
   t1.c1
 FROM 
   t t1
 JOIN
 (SELECT 
t2.c1,
MAX(t2.c2) AS c2
  FROM 
t t2 
  GROUP BY 
t2.c1
 ) t3
 ON t1.c2=t3.c2;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-5 depends on stages: Stage-1
  Stage-4 depends on stages: Stage-5
  Stage-0 depends on stages: Stage-4

STAGE PLANS:
  Stage: Stage-1
Map Reduce
  Map Operator Tree:
  TableScan
alias: t2
Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column 
stats: NONE
Select Operator
  expressions: c1 (type: string), c2 (type: int)
  outputColumnNames: c1, c2
  Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column 
stats: NONE
  Group By Operator
aggregations: max(c2)
keys: c1 (type: string)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE 
Column stats: NONE
Reduce Output Operator
  key expressions: _col0 (type: string)
  sort order: +
  Map-reduce partition columns: _col0 (type: string)
  Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE 
Column stats: NONE
  value expressions: _col1 (type: int)
  Reduce Operator Tree:
Group By Operator
  aggregations: max(VALUE._col0)
  keys: KEY._col0 (type: string)
  mode: mergepartial
  outputColumnNames: _col0, _col1
  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: 
NONE
  Filter Operator
predicate: _col1 is not null (type: boolean)
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column 
stats: NONE
File Output Operator
  compressed: false
  table:
  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
  output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
  serde: 
org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-5
Map Reduce Local Work
  Alias - Map Local Tables:
t1 
  Fetch Operator
limit: -1
  Alias - Map Local Operator Tree:
t1 
  TableScan
alias: t1
filterExpr: c2 is not null (type: boolean)
Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column 
stats: NONE
Filter Operator
  predicate: c2 is not null (type: boolean)
  Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column 
stats: NONE
  HashTable Sink Operator
keys:
  0 c2 (type: int)
  1 _col1 (type: int)

  Stage: Stage-4
Map Reduce
  Map Operator Tree:
  TableScan
Map Join Operator
  condition map:
   Inner Join 0 to 1
  keys:
0 c2 (type: int)
1 _col1 (type: int)
  outputColumnNames: _col0
  Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column 
stats: NONE
  Select Operator
expressions: _col0 (type: string)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE 
Column stats: NONE
File Output Operator
  compressed: true
  Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE 
Column stats: NONE
  table:
  input format: org.apache.hadoop.mapred.TextInputFormat
  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Local Work:
Map Reduce Local Work

  Stage: Stage-0
Fetch Operator
  limit: -1
  Processor Tree:
ListSink
{code}

 Join with subquery containing a group by incorrectly returns no results
 

[jira] [Updated] (HIVE-11410) Join with subquery containing a group by incorrectly returns no results

2015-07-30 Thread Nicholas Brenwald (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-11410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Brenwald updated HIVE-11410:
-
Attachment: hive-site.xml

 Join with subquery containing a group by incorrectly returns no results
 ---

 Key: HIVE-11410
 URL: https://issues.apache.org/jira/browse/HIVE-11410
 Project: Hive
  Issue Type: Bug
  Components: Hive
Affects Versions: 1.1.0
Reporter: Nicholas Brenwald
Priority: Minor
 Attachments: hive-site.xml


 Start by creating a table *t* with columns *c1* and *c2* and populate with 1 
 row of data. For example create table *t* from an existing table which 
 contains at least 1 row of data by running:
 {code}
 create table t as select 'abc' as c1, 0 as c2 from Y limit 1; 
 {code}
 Table *t* looks like the following:
 ||c1||c2||
 |abc|0|
 Running the following query then returns zero results.
 {code}
 SELECT 
   t1.c1
 FROM 
   t t1
 JOIN
 (SELECT 
t2.c1,
MAX(t2.c2) AS c2
  FROM 
t t2 
  GROUP BY 
t2.c1
 ) t3
 ON t1.c2=t3.c2
 {code}
 However, we expected to see the following:
 ||c1||
 |abc|
 The problem seems to relate to the fact that in the subquery, we group by 
 column *c1*, but this is not subsequently used in the join condition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)