[jira] [Created] (HIVE-14086) org.apache.hadoop.hive.metastore.api.Table does not return columns from Avro schema file

2016-06-23 Thread Lars Volker (JIRA)
Lars Volker created HIVE-14086:
--

 Summary: org.apache.hadoop.hive.metastore.api.Table does not 
return columns from Avro schema file
 Key: HIVE-14086
 URL: https://issues.apache.org/jira/browse/HIVE-14086
 Project: Hive
  Issue Type: Bug
  Components: API
Reporter: Lars Volker


Consider this table, using an external Avro schema file:

{noformat}
CREATE TABLE avro_table
  PARTITIONED BY (str_part STRING)
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES (
'avro.schema.url'='hdfs://localhost:20500/tmp/avro.json'
  );
{noformat}

This will populate the "COLUMNS_V2" metastore table with the correct column 
information (as per HIVE-6308). The columns of this table can then be queried 
via the Hive API, for example by calling {{.getSd().getCols()}} on a 
{{org.apache.hadoop.hive.metastore.api.Table}} object.

Changes to the avro.schema.url file - either changing where it points to or 
changing its contents - will be reflected in the output of {{describe formatted 
avro_table}} *but not* in the result of the {{.getSd().getCols()}} API call. 
Instead, it appears that Hive only reads the Avro schema file internally and does 
not expose the information therein via its API.

Is there a way to obtain the effective Table information via Hive? Would it 
make sense to fix table retrieval so calls to {{get_table}} return the correct 
set of columns?
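For context, the "effective" column list described above is derived from the Avro 
schema file itself. A minimal sketch (Python, hypothetical helper, not Hive's 
actual AvroSerDe code) of how columns can be read from such a schema:

```python
import json

def avro_columns(schema_json: str):
    """Extract (name, type) pairs from an Avro record schema, i.e. the
    'effective' columns that avro.schema.url defines for the table."""
    schema = json.loads(schema_json)
    return [(f["name"], f["type"]) for f in schema.get("fields", [])]

schema = '''{"type": "record", "name": "avro_table",
             "fields": [{"name": "id", "type": "long"},
                        {"name": "name", "type": "string"}]}'''
print(avro_columns(schema))
```

When the file behind avro.schema.url changes, this derived list changes with it, 
while the COLUMNS_V2 snapshot returned by {{.getSd().getCols()}} does not.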




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-14085) Allow type widening primitive conversion on hive/parquet tables

2016-06-23 Thread Vihang Karajgaonkar (JIRA)
Vihang Karajgaonkar created HIVE-14085:
--

 Summary: Allow type widening primitive conversion on hive/parquet 
tables
 Key: HIVE-14085
 URL: https://issues.apache.org/jira/browse/HIVE-14085
 Project: Hive
  Issue Type: Improvement
  Components: File Formats
Affects Versions: 2.1.0
Reporter: Vihang Karajgaonkar
Assignee: Vihang Karajgaonkar


An upstream JIRA brought this usability improvement to Hive, adding support for 
automatic type widening on Parquet tables. See 
https://issues.apache.org/jira/browse/HIVE-12080
This improvement is very useful for users whose table schemas evolve. For 
example, a Hive table column declared as "bigint" can read Parquet files with 
"int32" and "int64" types.
The patch only supports the widening conversions int->bigint and float->double. 
We should support more types to allow users to read their changed Parquet 
schemas.
Here's a list of widening conversions we should support:
{code}
tinyint  -> smallint, int, bigint, float, double
smallint -> int, bigint, float, double
int      -> bigint, float, double
bigint   -> float, double
float    -> double
double   -> (none)
{code}
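For illustration, the proposed matrix can be encoded as a simple lookup table; a 
minimal sketch (Python, names hypothetical, not Hive's actual implementation):

```python
# Widening conversions from the matrix above: a Parquet file column of
# the key type may be read into a Hive column of any of the value types.
WIDENING = {
    "tinyint":  {"smallint", "int", "bigint", "float", "double"},
    "smallint": {"int", "bigint", "float", "double"},
    "int":      {"bigint", "float", "double"},
    "bigint":   {"float", "double"},
    "float":    {"double"},
    "double":   set(),
}

def is_widening(file_type: str, table_type: str) -> bool:
    """True if a file column of file_type can be read as table_type."""
    return table_type == file_type or table_type in WIDENING.get(file_type, set())
```

Narrowing conversions (e.g. bigint -> int) are deliberately absent, since they 
can lose data.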





[jira] [Created] (HIVE-14084) Branch-1: HIVE-13985 backport to branch-1 introduced regression

2016-06-23 Thread Prasanth Jayachandran (JIRA)
Prasanth Jayachandran created HIVE-14084:


 Summary: Branch-1: HIVE-13985 backport to branch-1 introduced 
regression
 Key: HIVE-14084
 URL: https://issues.apache.org/jira/browse/HIVE-14084
 Project: Hive
  Issue Type: Bug
  Components: ORC
Affects Versions: 1.3.0
Reporter: Prasanth Jayachandran
Assignee: Prasanth Jayachandran


The HIVE-13985 backport to branch-1 caused a regression by reverting some of the 
changes from HIVE-11928 (protobuf message size exceeding 64MB when reading 
footer and metadata).





Re: Review Request 48233: HIVE-13884: Disallow queries fetching more than a configured number of partitions in PartitionPruner

2016-06-23 Thread Sergio Pena

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48233/
---

(Updated June 23, 2016, 9:36 p.m.)


Review request for hive, Mohit Sabharwal and Naveen Gangam.


Changes
---

Attaching new patch with changes on ObjectStore due to HIVE-14055


Bugs: HIVE-13884
https://issues.apache.org/jira/browse/HIVE-13884


Repository: hive-git


Description
---

The patch verifies the number of partitions a table has before fetching any from 
the metastore. It checks that limit from 'hive.limit.query.max.table.partition'.

A limitation added here is that the variable must be set in hive-site.xml in 
order to take effect; it cannot be set through beeline, because 
HiveMetaStore.java does not read variables set through beeline. I think it is 
better to keep it this way, to avoid users changing the value on the fly and 
crashing the metastore.

Another change is that EXPLAIN commands won't be executed either. EXPLAIN 
commands need to fetch partitions in order to create the operator tree, so if we 
allowed EXPLAIN to do that, we could hit the same OOM situations for tables with 
many partitions.
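As a rough sketch of the check described above (Python, names hypothetical; not 
the actual HiveMetaStore.java code), the guard runs before any partition objects 
are materialized:

```python
def check_partition_limit(num_partitions: int, max_partitions: int) -> None:
    """Reject the request before any partitions are fetched from the
    metastore. A non-positive limit means 'unlimited', mirroring the
    usual convention for hive.limit.query.max.table.partition."""
    if max_partitions > 0 and num_partitions > max_partitions:
        raise RuntimeError(
            "Query would fetch %d partitions, exceeding the configured "
            "limit of %d" % (num_partitions, max_partitions))
```

The point of checking the count first is that counting partitions is cheap, 
while materializing them is what causes the OOM.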


Diffs (updated)
-

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 
1d1306ff6395a0504085dda98e96c3951519f299 
  metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 
c0827ea9d47e569d9697649a7e16d196de3de14d 
  metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java 
b809269d5b1775fcd57af62b254476627ab062cd 
  metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 
f67efcdf301b0e5e71ef1a4b7315b4184598d5b7 
  metastore/src/java/org/apache/hadoop/hive/metastore/RawStore.java 
a6d3f5385b33b8a4e31ee20ca5cb8f58c97c8702 
  metastore/src/java/org/apache/hadoop/hive/metastore/hbase/HBaseStore.java 
2f837bb12d4ced1e81fbd86a8104a16b9e3174a8 
  
metastore/src/test/org/apache/hadoop/hive/metastore/DummyRawStoreControlledCommit.java
 3152e77c3c7152ac4dbe7e779ce35f28044fe3c9 
  
metastore/src/test/org/apache/hadoop/hive/metastore/DummyRawStoreForJdoConnection.java
 86a243609b23e2ca9bb8849f0da863a95e477d5c 

Diff: https://reviews.apache.org/r/48233/diff/


Testing
---

Waiting for HiveQA.


Thanks,

Sergio Pena



[jira] [Created] (HIVE-14083) ALTER INDEX in Tez causes NullPointerException

2016-06-23 Thread Jesus Camacho Rodriguez (JIRA)
Jesus Camacho Rodriguez created HIVE-14083:
--

 Summary: ALTER INDEX in Tez causes NullPointerException
 Key: HIVE-14083
 URL: https://issues.apache.org/jira/browse/HIVE-14083
 Project: Hive
  Issue Type: Bug
Affects Versions: 2.2.0
Reporter: Jesus Camacho Rodriguez


ALTER INDEX causes a NullPointerException when run under the Tez execution 
engine. The query runs without issue when submitted in MR execution mode.

To reproduce:

1. CREATE INDEX sample_08_index ON TABLE sample_08 (code) AS 'COMPACT' WITH 
DEFERRED REBUILD; 

2. ALTER INDEX sample_08_index ON sample_08 REBUILD; 

*Stacktrace from Hive 1.2.1*
{code:java}
ERROR : Vertex failed, vertexName=Map 1, 
vertexId=vertex_1460577396252_0005_1_00, diagnostics=[Task failed, 
taskId=task_1460577396252_0005_1_00_00, diagnostics=[TaskAttempt 0 failed, 
info=[Error: Failure while running task:java.lang.RuntimeException: 
java.lang.RuntimeException: java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171)
at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
at 
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.lang.NullPointerException
at 
org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:196)
at 
org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.<init>(TezGroupedSplitsInputFormat.java:135)
at 
org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:101)
at 
org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:149)
at 
org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:80)
at 
org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:650)
at 
org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:621)
at 
org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:145)
at 
org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:109)
at 
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:390)
at 
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:128)
at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:147)
... 14 more
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:269)
at 
org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:233)
at 
org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:193)
... 25 more
{code}





[jira] [Created] (HIVE-14082) Multi-Insert Query Fails with GROUP BY, DISTINCT, and WHERE clauses

2016-06-23 Thread Sahil Takiar (JIRA)
Sahil Takiar created HIVE-14082:
---

 Summary: Multi-Insert Query Fails with GROUP BY, DISTINCT, and 
WHERE clauses
 Key: HIVE-14082
 URL: https://issues.apache.org/jira/browse/HIVE-14082
 Project: Hive
  Issue Type: Bug
Affects Versions: 2.1.0, 1.1.0
Reporter: Sahil Takiar


The following multi-insert query fails in Hive. I've listed the query required 
to reproduce the failure, as well as a few similar queries that work properly.

Setup Queries:

{code}
DROP SCHEMA IF EXISTS multi_table_insert_bug CASCADE;
CREATE SCHEMA multi_table_insert_bug;
USE multi_table_insert_bug;

DROP TABLE IF EXISTS multi_table_insert_source;
DROP TABLE IF EXISTS multi_table_insert_test;

CREATE TABLE multi_table_insert_source (
  date_column DATE,
  column_1 STRING,
  column_2 STRING,
  column_3 STRING,
  column_4 STRING
);

CREATE TABLE multi_table_insert_test (
  column_1 STRING,
  column_2 STRING,
  line_count INT,
  distinct_count_by_1_column INT,
  distinct_count_by_2_columns INT
)
PARTITIONED BY (partition_column INT);

INSERT OVERWRITE TABLE multi_table_insert_source VALUES
  ('2016-01-22', 'value_1_1', 'value_1_2', 'value_1_3', 'value_1_4'),
  ('2016-01-22', 'value_2_1', 'value_2_2', 'value_2_3', 'value_2_4'),
  ('2016-01-22', 'value_3_1', 'value_3_2', 'value_3_3', 'value_3_4'),
  ('2016-01-22', 'value_4_1', 'value_4_2', 'value_4_3', 'value_4_4'),
  ('2016-01-22', 'value_5_1', 'value_5_2', 'value_5_3', 'value_5_4');
{code}


The following queries run successfully:

*Query 1:*

{code}
FROM multi_table_insert_source
  INSERT OVERWRITE TABLE multi_table_insert_test PARTITION (partition_column = 
365)
  SELECT
column_1,
column_2,
COUNT(*) AS line_count,
COUNT(DISTINCT column_3) AS distinct_count_by_1_column,
COUNT(DISTINCT date_column, column_3) AS distinct_count_by_2_columns
  WHERE date_column >= DATE_SUB(FROM_UNIXTIME(UNIX_TIMESTAMP()), 365)
  GROUP BY
column_1,
column_2;
{code}

*Query 2:*

{code}
FROM multi_table_insert_source
  INSERT OVERWRITE TABLE multi_table_insert_test PARTITION (partition_column = 
365)
  SELECT
column_1,
column_2,
COUNT(*) AS line_count,
COUNT(DISTINCT column_3) AS distinct_count_by_1_column,
COUNT(DISTINCT date_column, column_3) AS distinct_count_by_2_columns
--  WHERE date_column >= DATE_SUB(FROM_UNIXTIME(UNIX_TIMESTAMP()), 365)
  GROUP BY
column_1,
column_2
  INSERT OVERWRITE TABLE multi_table_insert_test PARTITION (partition_column = 
1096)
  SELECT
column_1,
column_2,
COUNT(*) AS line_count,
COUNT(DISTINCT column_3) AS distinct_count_by_1_column,
COUNT(DISTINCT date_column, column_3) AS distinct_count_by_2_columns
--  WHERE date_column >= DATE_SUB(FROM_UNIXTIME(UNIX_TIMESTAMP()), 1096)
  GROUP BY
column_1,
column_2;
{code}

The following query fails with an {{IndexOutOfBoundsException Index: 3, Size: 
3}}. The only difference between this query and the previous one is that the 
WHERE clauses commented out above are now included.

*Query 3:*

{code}
FROM multi_table_insert_source
  INSERT OVERWRITE TABLE multi_table_insert_test PARTITION (partition_column = 
365)
  SELECT
column_1,
column_2,
COUNT(*) AS line_count,
COUNT(DISTINCT column_3) AS distinct_count_by_1_column,
COUNT(DISTINCT date_column, column_3) AS distinct_count_by_2_columns
  WHERE date_column >= DATE_SUB(FROM_UNIXTIME(UNIX_TIMESTAMP()), 365)
  GROUP BY
column_1,
column_2
  INSERT OVERWRITE TABLE multi_table_insert_test PARTITION (partition_column = 
1096)
  SELECT
column_1,
column_2,
COUNT(*) AS line_count,
COUNT(DISTINCT column_3) AS distinct_count_by_1_column,
COUNT(DISTINCT date_column, column_3) AS distinct_count_by_2_columns
  WHERE date_column >= DATE_SUB(FROM_UNIXTIME(UNIX_TIMESTAMP()), 1096)
  GROUP BY
column_1,
column_2;
{code}





Re: Review Request 48500: HIVE-13982

2016-06-23 Thread Ashutosh Chauhan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48500/#review139251
---




ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveRelCollationPropagator.java
 (line 117)


I may be wrong, but doesn't this mean we may trigger dispatch() on all 
nodes underneath, which implies this is an O(n^2) algorithm?
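The quadratic blow-up can usually be avoided by memoizing the traversal; a 
minimal language-agnostic sketch (Python, all names hypothetical, not the actual 
HiveRelCollationPropagator code) of dispatching each node exactly once:

```python
class Node:
    """Toy stand-in for a RelNode with child operators."""
    def __init__(self, *children):
        self.children = list(children)

def dispatch_all(root, visit):
    """Walk the operator DAG once, skipping nodes already dispatched,
    so the traversal stays O(n) instead of re-dispatching subtrees."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        visit(node)
        stack.extend(node.children)
```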



ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveRelCollationPropagator.java
 (line 140)


We may allow some (order-preserving) udfs here. Wanna leave a TODO ?



ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveRelCollationPropagator.java
 (lines 196 - 197)


Recursive calls?



ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/PlanModifierForASTConv.java
 (lines 83 - 88)


Shall we add a flag, with default value true, to enable the propagator? Apart 
from debugging, such a flag will also be useful to quantify the perf gains.



ql/src/test/queries/clientpositive/reduce_deduplicate_extended2.q (line 3)


We need to explore this further: since auto convert join is on by default, RS 
dedup is effectively off by default whenever a join is in the pipeline.



ql/src/test/results/clientpositive/annotate_stats_join.q.out (lines 383 - 385)


nice.. we got rid of redundant col in RS.



ql/src/test/results/clientpositive/correlationoptimizer13.q.out (line 39)


This can now be removed :)



ql/src/test/results/clientpositive/filter_cond_pushdown.q.out (lines 309 - 311)


This can be further optimized. We can drop constants from sort & 
partitioning columns.



ql/src/test/results/clientpositive/ptfgroupbyjoin.q.out (lines 82 - 85)


Redundant sel op? Case for IdentityProjectRemover?



ql/src/test/results/clientpositive/reduce_deduplicate_extended2.q.out (line 18)


A better plan here would be a single stage, with the GBY executing with mode = 
complete. Since we always generate a map-side GBY at plan generation time, this 
implies collapsing this GBY after RS dedup.



ql/src/test/results/clientpositive/reduce_deduplicate_extended2.q.out (line 123)


Here also we could have a single stage, but depending on key cardinality that 
may be suboptimal, since the number of reducers for OBY is 1.

At some point we need to re-order optimizations so that RS dedup runs after the 
StatsAnnotation rule, so that we can make such decisions properly.



ql/src/test/results/clientpositive/reduce_deduplicate_extended2.q.out (line 239)


Here as well.. last stage is not useful.



ql/src/test/results/clientpositive/subquery_in.q.out 


nice!



ql/src/test/results/clientpositive/subquery_in_having.q.out (lines 286 - 289)


case for IdentityProjectRemover?



ql/src/test/results/clientpositive/subquery_unqualcolumnrefs.q.out 


nice!



ql/src/test/results/clientpositive/tez/subquery_in.q.out 


nice!



ql/src/test/results/clientpositive/union25.q.out (line 110)


duplicated columns in sort/partition columns.



ql/src/test/results/clientpositive/unionDistinct_1.q.out (line 10023)


duplicated key columns.



ql/src/test/results/clientpositive/vector_groupby_reduce.q.out 


Awesome!



ql/src/test/results/clientpositive/vectorization_13.q.out (line 101)


Awesome!



ql/src/test/results/clientpositive/vectorization_short_regress.q.out (line 2315)


Awesome!


- Ashutosh Chauhan


On June 22, 2016, 8:27 p.m., Jesús Camacho Rodríguez wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48500/
> ---
> 
> (Updated June 22, 2016, 8:27 p.m.)
> 
> 
> Review request for hive and Ashutosh Chauhan.
> 
> 
> Bugs: HIVE-13982
>