[jira] [Assigned] (HIVE-25495) Upgrade to JLine3

2021-09-01 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor reassigned HIVE-25495:
-


> Upgrade to JLine3
> -
>
> Key: HIVE-25495
> URL: https://issues.apache.org/jira/browse/HIVE-25495
> Project: Hive
>  Issue Type: Improvement
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
>
> JLine 2 was discontinued a long while ago.  Hadoop uses JLine 3, so Hive 
> should match.





[jira] [Work logged] (HIVE-25453) Add LLAP IO support for Iceberg ORC tables

2021-09-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25453?focusedWorklogId=645277&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-645277
 ]

ASF GitHub Bot logged work on HIVE-25453:
-

Author: ASF GitHub Bot
Created on: 01/Sep/21 13:59
Start Date: 01/Sep/21 13:59
Worklog Time Spent: 10m 
  Work Description: szlta commented on pull request #2586:
URL: https://github.com/apache/hive/pull/2586#issuecomment-910312951


   Force pushed for master rebase + addressed review comments.




Issue Time Tracking
---

Worklog Id: (was: 645277)
Time Spent: 2h 50m  (was: 2h 40m)

> Add LLAP IO support for Iceberg ORC tables
> --
>
> Key: HIVE-25453
> URL: https://issues.apache.org/jira/browse/HIVE-25453
> Project: Hive
>  Issue Type: New Feature
>Reporter: Ádám Szita
>Assignee: Ádám Szita
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (HIVE-25453) Add LLAP IO support for Iceberg ORC tables

2021-09-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25453?focusedWorklogId=645258&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-645258
 ]

ASF GitHub Bot logged work on HIVE-25453:
-

Author: ASF GitHub Bot
Created on: 01/Sep/21 13:08
Start Date: 01/Sep/21 13:08
Worklog Time Spent: 10m 
  Work Description: szlta commented on a change in pull request #2586:
URL: https://github.com/apache/hive/pull/2586#discussion_r700194971



##
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java
##
@@ -2693,6 +2696,77 @@ public static TypeDescription getDesiredRowTypeDescr(Configuration conf,
     return result;
   }
 
+  /**
+   * Based on the file schema and the low-level file includes provided in the SchemaEvolution instance, this method
+   * calculates which top-level columns should be included, i.e. if any of the nested columns inside complex types is
+   * required, then its relevant top-level parent column will be considered required (and thus the full subtree).
+   * Hive and LLAP currently only support column pruning on the first level, thus we need to calculate this ourselves.

Review comment:
   ACID is a special case, and I think it does work there. For ACID tables, what Hive 
pushes down as a projection is already a list of column indices within the row 
struct. So within this row struct, for the first-level columns, it should work.
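
To make the described calculation concrete, here is a minimal hedged sketch (the method shape is illustrative, not the actual patch; it relies only on ORC's TypeDescription id ranges):
{code:java}
import java.util.List;

import org.apache.orc.TypeDescription;

public class TopLevelIncludes {
  // Illustrative only: a top-level column is required as soon as any column
  // id in its subtree is required; ids [getId(), getMaximumId()] span the
  // whole subtree of a column in ORC's TypeDescription.
  public static boolean[] topLevelIncludes(TypeDescription fileSchema, boolean[] fileIncluded) {
    List<TypeDescription> topLevelCols = fileSchema.getChildren();
    boolean[] result = new boolean[topLevelCols.size()];
    for (int i = 0; i < topLevelCols.size(); i++) {
      TypeDescription col = topLevelCols.get(i);
      for (int id = col.getId(); id <= col.getMaximumId() && !result[i]; id++) {
        result[i] |= fileIncluded[id];
      }
    }
    return result;
  }
}
{code}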






Issue Time Tracking
---

Worklog Id: (was: 645258)
Time Spent: 2h 40m  (was: 2.5h)

> Add LLAP IO support for Iceberg ORC tables
> --
>
> Key: HIVE-25453
> URL: https://issues.apache.org/jira/browse/HIVE-25453
> Project: Hive
>  Issue Type: New Feature
>Reporter: Ádám Szita
>Assignee: Ádám Szita
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (HIVE-25453) Add LLAP IO support for Iceberg ORC tables

2021-09-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25453?focusedWorklogId=645256&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-645256
 ]

ASF GitHub Bot logged work on HIVE-25453:
-

Author: ASF GitHub Bot
Created on: 01/Sep/21 13:03
Start Date: 01/Sep/21 13:03
Worklog Time Spent: 10m 
  Work Description: szlta commented on a change in pull request #2586:
URL: https://github.com/apache/hive/pull/2586#discussion_r700191437



##
File path: 
llap-server/src/java/org/apache/hadoop/hive/llap/io/api/impl/LlapRecordReader.java
##
@@ -158,8 +167,11 @@ private LlapRecordReader(MapWork mapWork, JobConf job, FileSplit split,
     rbCtx = ctx != null ? ctx : LlapInputFormat.createFakeVrbCtx(mapWork);
 
 isAcidScan = AcidUtils.isFullAcidScan(jobConf);
-TypeDescription schema = OrcInputFormat.getDesiredRowTypeDescr(
-job, isAcidScan, Integer.MAX_VALUE);
+
+String icebergOrcSchema = job.get(ColumnProjectionUtils.ICEBERG_ORC_SCHEMA_STRING);

Review comment:
   Yeah, it's unfortunate. This is the equivalent of the non-LLAP but vectorized 
ORC case at 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcInputFormat.java#L80
   
   Hive sets IOConstants.SCHEMA_EVOLUTION_COLUMNS and 
SCHEMA_EVOLUTION_COLUMNS_TYPES during compile and pushes them down to 
execution. These are simple string representations of Hive types, so they are not ORC 
specific, and they relate to the LOGICAL schema.
   Iceberg - in order to support broader schema evolution - produces the FILE 
schema based on file info and logical type info, and the result is an ORC 
TypeDescription instance. (See VectorizedReadUtils.handleIcebergProjection.) I 
have found no easy way to transform this object back into Hive types; the 
conversion only exists in the other direction: 
OrcInputFormat.typeDescriptionsFromHiveTypeProperty()
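
As a hedged illustration of the push-down described here (the property name comes from the diff above; the schema literal is made up), ORC's TypeDescription string form allows a simple round-trip through the job configuration:
{code:java}
import org.apache.hadoop.hive.serde2.ColumnProjectionUtils;
import org.apache.hadoop.mapred.JobConf;
import org.apache.orc.TypeDescription;

public class IcebergOrcSchemaRoundTrip {
  public static void main(String[] args) {
    JobConf job = new JobConf();
    // Compile side (sketch): Iceberg derives the FILE schema as an ORC
    // TypeDescription; its string form is parseable, so it can ride in the conf.
    TypeDescription fileSchema = TypeDescription.fromString("struct<id:bigint,data:string>");
    job.set(ColumnProjectionUtils.ICEBERG_ORC_SCHEMA_STRING, fileSchema.toString());

    // Execution side (as in the diff above): read it back for the LLAP reader.
    TypeDescription restored =
        TypeDescription.fromString(job.get(ColumnProjectionUtils.ICEBERG_ORC_SCHEMA_STRING));
    System.out.println(restored.equals(fileSchema)); // true
  }
}
{code}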






Issue Time Tracking
---

Worklog Id: (was: 645256)
Time Spent: 2.5h  (was: 2h 20m)

> Add LLAP IO support for Iceberg ORC tables
> --
>
> Key: HIVE-25453
> URL: https://issues.apache.org/jira/browse/HIVE-25453
> Project: Hive
>  Issue Type: New Feature
>Reporter: Ádám Szita
>Assignee: Ádám Szita
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>






[jira] [Work logged] (HIVE-25453) Add LLAP IO support for Iceberg ORC tables

2021-09-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25453?focusedWorklogId=645250&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-645250
 ]

ASF GitHub Bot logged work on HIVE-25453:
-

Author: ASF GitHub Bot
Created on: 01/Sep/21 12:53
Start Date: 01/Sep/21 12:53
Worklog Time Spent: 10m 
  Work Description: szlta commented on a change in pull request #2586:
URL: https://github.com/apache/hive/pull/2586#discussion_r700182811



##
File path: 
llap-server/src/java/org/apache/hadoop/hive/llap/io/api/impl/LlapIoImpl.java
##
@@ -417,15 +419,21 @@ public OrcTail getOrcTailFromCache(Path path, Configuration jobConf, CacheTag ta
   }
 
   @Override
-  public RecordReader llapVectorizedOrcReaderForPath(Object fileKey, Path path, CacheTag tag, List tableIncludedCols,
-      JobConf conf, long offset, long length) throws IOException {
+  public RecordReader llapVectorizedOrcReaderForPath(Object fileKey, Path path,
+      CacheTag tag, List tableIncludedCols, JobConf conf, long offset, long length, Reporter reporter)
+      throws IOException {
 
-OrcTail tail = getOrcTailFromCache(path, conf, tag, fileKey);
+OrcTail tail = null;
+if (tag != null) {

Review comment:
   - getOrcTailFromCache should always be called with a non-null CacheTag instance.
   - to produce a record reader in llapVectorizedOrcReaderForPath, we either know the tag or not, depending on the source of invocation:
     - in case the tag is known: we can do the metadata lookup and attach the result to the OrcSplit
     - in case it's unknown: we create the split without the tail info, and let OrcEncodedDataReader find out the metadata later
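
As a hedged illustration of that rule, a helper assumed to sit inside LlapIoImpl (so getOrcTailFromCache is in scope); this is not the actual patch code:
{code:java}
// Returns the tail when the tag is known, null otherwise.
private OrcTail tailForSplit(Path path, JobConf conf, CacheTag tag, Object fileKey)
    throws IOException {
  if (tag == null) {
    // Tag unknown: create the split without tail info and let
    // OrcEncodedDataReader find out the metadata later.
    return null;
  }
  // Tag known: do the metadata lookup now and attach the result to the
  // OrcSplit. getOrcTailFromCache must always get a non-null CacheTag.
  return getOrcTailFromCache(path, conf, tag, fileKey);
}
{code}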
   






Issue Time Tracking
---

Worklog Id: (was: 645250)
Time Spent: 2h 20m  (was: 2h 10m)

> Add LLAP IO support for Iceberg ORC tables
> --
>
> Key: HIVE-25453
> URL: https://issues.apache.org/jira/browse/HIVE-25453
> Project: Hive
>  Issue Type: New Feature
>Reporter: Ádám Szita
>Assignee: Ádám Szita
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (HIVE-25453) Add LLAP IO support for Iceberg ORC tables

2021-09-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25453?focusedWorklogId=645247&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-645247
 ]

ASF GitHub Bot logged work on HIVE-25453:
-

Author: ASF GitHub Bot
Created on: 01/Sep/21 12:46
Start Date: 01/Sep/21 12:46
Worklog Time Spent: 10m 
  Work Description: szlta commented on a change in pull request #2586:
URL: https://github.com/apache/hive/pull/2586#discussion_r700177549



##
File path: itests/src/test/resources/testconfiguration.properties
##
@@ -1251,4 +1251,11 @@ erasurecoding.only.query.files=\
 # tests that requires external database connection
 externalDB.llap.query.files=\
   dataconnector.q,\
-  dataconnector_mysql.q
\ No newline at end of file
+  dataconnector_mysql.q
+
+iceberg.llap.query.files=\

Review comment:
   The reason I introduced it like that was to execute 
vectorized_iceberg_read.q on both LLAP and non-LLAP configs, without copying it 
over into another directory. Also, we only have very few Iceberg q files yet, 
and just one other LLAP-related q file.






Issue Time Tracking
---

Worklog Id: (was: 645247)
Time Spent: 2h 10m  (was: 2h)

> Add LLAP IO support for Iceberg ORC tables
> --
>
> Key: HIVE-25453
> URL: https://issues.apache.org/jira/browse/HIVE-25453
> Project: Hive
>  Issue Type: New Feature
>Reporter: Ádám Szita
>Assignee: Ádám Szita
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>






[jira] [Commented] (HIVE-20828) Upgrade to Spark 2.4.0

2021-09-01 Thread GuangMing Lu (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-20828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408106#comment-17408106
 ] 

GuangMing Lu commented on HIVE-20828:
-

Hi [~stakiar], how is Hive on Spark evolving?

> Upgrade to Spark 2.4.0
> --
>
> Key: HIVE-20828
> URL: https://issues.apache.org/jira/browse/HIVE-20828
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Sahil Takiar
>Priority: Major
> Attachments: HIVE-20828.1.patch, HIVE-20828.2.patch
>
>
> The Spark community is in the process of releasing Spark 2.4.0. We should do 
> some testing with the release candidates and then upgrade once the release is 
> finalized.





[jira] [Work logged] (HIVE-23633) Metastore some JDO query objects do not close properly

2021-09-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23633?focusedWorklogId=645214&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-645214
 ]

ASF GitHub Bot logged work on HIVE-23633:
-

Author: ASF GitHub Bot
Created on: 01/Sep/21 10:52
Start Date: 01/Sep/21 10:52
Worklog Time Spent: 10m 
  Work Description: dengzhhu653 commented on pull request #2344:
URL: https://github.com/apache/hive/pull/2344#issuecomment-910165905


   > Thanks for catching this!
   > 
   > We should close all the queries, even if there is an exception.
   > Since Java 8, closing queries in `try-with-resources` would be best when 
possible; falling back to `finally` when that is not possible is the second best.
   
   Thank you for the review!
   Introduced a new QueryWrapper to avoid the superfluous exception handling 
otherwise needed when using `try-with-resources` with the `Query` class.
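
A hedged sketch of the wrapper idea (the class name matches the comment above, but the body is illustrative, not the committed code):
{code:java}
import javax.jdo.Query;

// Adapts Query to AutoCloseable so try-with-resources needs no extra
// exception handling: close() declares no checked exception.
public class QueryWrapper implements AutoCloseable {
  private final Query query;

  public QueryWrapper(Query query) {
    this.query = query;
  }

  public Query getQuery() {
    return query;
  }

  @Override
  public void close() {
    // closeAll() releases every result set tied to this query.
    query.closeAll();
  }
}
{code}
A caller can then write `try (QueryWrapper wrapper = new QueryWrapper(query)) { ... }` and the query's resources are released on every exit path.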




Issue Time Tracking
---

Worklog Id: (was: 645214)
Time Spent: 9h 20m  (was: 9h 10m)

> Metastore some JDO query objects do not close properly
> --
>
> Key: HIVE-23633
> URL: https://issues.apache.org/jira/browse/HIVE-23633
> Project: Hive
>  Issue Type: Bug
>Reporter: Zhihua Deng
>Assignee: Zhihua Deng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-23633.01.patch
>
>  Time Spent: 9h 20m
>  Remaining Estimate: 0h
>
> After [HIVE-10895|https://issues.apache.org/jira/browse/HIVE-10895] was patched,  
> the metastore has still seen a memory leak on DB resources: many 
> StatementImpls left unclosed.





[jira] [Work logged] (HIVE-25480) Fix Time Travel with CBO

2021-09-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25480?focusedWorklogId=645158&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-645158
 ]

ASF GitHub Bot logged work on HIVE-25480:
-

Author: ASF GitHub Bot
Created on: 01/Sep/21 09:44
Start Date: 01/Sep/21 09:44
Worklog Time Spent: 10m 
  Work Description: pvary merged pull request #2602:
URL: https://github.com/apache/hive/pull/2602


   




Issue Time Tracking
---

Worklog Id: (was: 645158)
Time Spent: 40m  (was: 0.5h)

> Fix Time Travel with CBO
> 
>
> Key: HIVE-25480
> URL: https://issues.apache.org/jira/browse/HIVE-25480
> Project: Hive
>  Issue Type: Bug
>Reporter: Peter Vary
>Assignee: Peter Vary
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> When CBO is enabled, the Time Travel features are not working





[jira] [Work logged] (HIVE-24762) StringValueBoundaryScanner ignores boundary which leads to incorrect results

2021-09-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24762?focusedWorklogId=645137&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-645137
 ]

ASF GitHub Bot logged work on HIVE-24762:
-

Author: ASF GitHub Bot
Created on: 01/Sep/21 09:41
Start Date: 01/Sep/21 09:41
Worklog Time Spent: 10m 
  Work Description: abstractdog opened a new pull request #1965:
URL: https://github.com/apache/hive/pull/1965


   ### What changes were proposed in this pull request?
   Make StringValueBoundaryScanner.isDistanceGreater take amt into account.
   
   
   ### Why are the changes needed?
   Described in jira.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Added string based range window to ptf.q.
   




Issue Time Tracking
---

Worklog Id: (was: 645137)
Time Spent: 1h 40m  (was: 1.5h)

>  StringValueBoundaryScanner ignores boundary which leads to incorrect results
> -
>
> Key: HIVE-24762
> URL: https://issues.apache.org/jira/browse/HIVE-24762
> Project: Hive
>  Issue Type: Bug
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/ValueBoundaryScanner.java#L901
> {code}
>   public boolean isDistanceGreater(Object v1, Object v2, int amt) {
> ...
> return s1 != null && s2 != null && s1.compareTo(s2) > 0;
> {code}
> Like other boundary scanners, StringValueBoundaryScanner should take amt into 
> account; otherwise it'll produce the same range regardless of the given 
> window size. This typically affects queries where the range is defined on a 
> string column:
> {code}
> select p_mfgr, p_name, p_retailprice,
> count(*) over(partition by p_mfgr order by p_name range between 1 preceding 
> and current row) as cs1,
> count(*) over(partition by p_mfgr order by p_name range between 3 preceding 
> and current row) as cs2
> from vector_ptf_part_simple_orc;
> {code} 
> with "> 0", cs1 and cs2 will be calculated on the same window, so cs1 == cs2; 
> but they should actually differ. This is the correct result (see "almond 
> antique olive coral navajo"):
> {code}
> +-----------------+---------------------------------------------+------+------+
> | p_mfgr          | p_name                                      | cs1  | cs2  |
> +-----------------+---------------------------------------------+------+------+
> | Manufacturer#1  | almond antique burnished rose metallic      | 2    | 2    |
> | Manufacturer#1  | almond antique burnished rose metallic      | 2    | 2    |
> | Manufacturer#1  | almond antique chartreuse lavender yellow   | 6    | 6    |
> | Manufacturer#1  | almond antique chartreuse lavender yellow   | 6    | 6    |
> | Manufacturer#1  | almond antique chartreuse lavender yellow   | 6    | 6    |
> | Manufacturer#1  | almond antique chartreuse lavender yellow   | 6    | 6    |
> | Manufacturer#1  | almond antique salmon chartreuse burlywood  | 1    | 1    |
> | Manufacturer#1  | almond aquamarine burnished black steel     | 1    | 8    |
> | Manufacturer#1  | almond aquamarine pink moccasin thistle     | 4    | 4    |
> | Manufacturer#1  | almond aquamarine pink moccasin thistle     | 4    | 4    |
> | Manufacturer#1  | almond aquamarine pink moccasin thistle     | 4    | 4    |
> | Manufacturer#1  | almond aquamarine pink moccasin thistle     | 4    | 4    |
> | Manufacturer#2  | almond antique violet chocolate turquoise   | 1    | 1    |
> | Manufacturer#2  | almond antique violet turquoise frosted     | 3    | 3    |
> | Manufacturer#2  | almond antique violet turquoise frosted     | 3    | 3    |
> | Manufacturer#2  | almond antique violet turquoise frosted     | 3    | 3    |
> | Manufacturer#2  | almond aquamarine midnight light salmon     | 1    | 5    |
> | Manufacturer#2  | almond aquamarine rose maroon antique       | 2    | 2    |
> | Manufacturer#2  | almond aquamarine rose maroon antique       | 2    | 2    |
> | Manufacturer#2  | almond aquamarine sandy cyan gainsboro      | 3    | 3    |
> | Manufacturer#3  | almond antique chartreuse khaki white       | 1    | 1    |
> | Manufacturer#3  | almond antique forest lavender goldenrod    | 4    | 5    |

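For contrast, a hedged sketch of the pattern the other (numeric) boundary scanners follow, where amt actually bounds the range; the value unwrapping here is illustrative, not Hive's exact code:
{code:java}
// Mirrors the shape of a numeric scanner's isDistanceGreater: the window
// size amt participates in the comparison, so different amt values can
// produce different ranges (unlike the string version quoted above).
public boolean isDistanceGreater(Object v1, Object v2, int amt) {
  if (v1 != null && v2 != null) {
    long l1 = ((org.apache.hadoop.io.LongWritable) v1).get(); // illustrative unwrapping
    long l2 = ((org.apache.hadoop.io.LongWritable) v2).get();
    return (l1 - l2) > amt;
  }
  return v1 != null || v2 != null; // exactly one null counts as "greater"
}
{code}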
[jira] [Work logged] (HIVE-23633) Metastore some JDO query objects do not close properly

2021-09-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23633?focusedWorklogId=645104&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-645104
 ]

ASF GitHub Bot logged work on HIVE-23633:
-

Author: ASF GitHub Bot
Created on: 01/Sep/21 09:37
Start Date: 01/Sep/21 09:37
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #2344:
URL: https://github.com/apache/hive/pull/2344#discussion_r699837099



##
File path: 
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java
##
@@ -8142,9 +8146,11 @@ private void dropPartitionAllColumnGrantsNoTxn(
   query.declareParameters("java.lang.String t1");
   mSecurityDCList = (List) query.execute(dcName);
 }
+try (Query q = query) {
 pm.retrieveAll(mSecurityDCList);

Review comment:
   NIT: formatting

##
File path: 
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java
##
@@ -8142,9 +8146,11 @@ private void dropPartitionAllColumnGrantsNoTxn(
   query.declareParameters("java.lang.String t1");
   mSecurityDCList = (List) query.execute(dcName);
 }
+try (Query q = query) {

Review comment:
   We might have an error during execution. Should we close the query there 
too? 

##
File path: 
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java
##
@@ -1454,12 +1455,14 @@ public ColumnStatistics getTableStats(final String catName, final String dbName,
   }
 };
 List list = Batchable.runBatched(batchSize, colNames, b);
+final ColumnStatistics result;
 if (list.isEmpty()) {
-  return null;
+  result = null;
+} else {
+  ColumnStatisticsDesc csd = new ColumnStatisticsDesc(true, dbName, tableName);
+  csd.setCatName(catName);
+  result = makeColumnStats(list, csd, 0, engine);
 }

Review comment:
   What happens if there is an exception? Should we close the query there 
too? 
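
A generic hedged pattern for this concern (illustrative, not the ObjectStore code): when try-with-resources does not fit, `finally` still guarantees the close on the exception path:
{code:java}
import javax.jdo.PersistenceManager;
import javax.jdo.Query;

public class CloseOnAllPaths {
  // closeAll() runs on both the success and the exception path.
  public static Object executeAndClose(PersistenceManager pm, String jdoql, Object param) {
    Query query = pm.newQuery(jdoql);
    try {
      return query.execute(param);
    } finally {
      query.closeAll();
    }
  }
}
{code}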






Issue Time Tracking
---

Worklog Id: (was: 645104)
Time Spent: 9h 10m  (was: 9h)

> Metastore some JDO query objects do not close properly
> --
>
> Key: HIVE-23633
> URL: https://issues.apache.org/jira/browse/HIVE-23633
> Project: Hive
>  Issue Type: Bug
>Reporter: Zhihua Deng
>Assignee: Zhihua Deng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-23633.01.patch
>
>  Time Spent: 9h 10m
>  Remaining Estimate: 0h
>
> After [HIVE-10895|https://issues.apache.org/jira/browse/HIVE-10895] was patched,  
> the metastore has still seen a memory leak on DB resources: many 
> StatementImpls left unclosed.





[jira] [Work logged] (HIVE-23633) Metastore some JDO query objects do not close properly

2021-09-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23633?focusedWorklogId=645094&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-645094
 ]

ASF GitHub Bot logged work on HIVE-23633:
-

Author: ASF GitHub Bot
Created on: 01/Sep/21 09:36
Start Date: 01/Sep/21 09:36
Worklog Time Spent: 10m 
  Work Description: dengzhhu653 commented on pull request #2344:
URL: https://github.com/apache/hive/pull/2344#issuecomment-909790783


   Hi @pvary, @nrg4878, could you please take a look if you have a few seconds?
   Thanks!
   Zhihua Deng




Issue Time Tracking
---

Worklog Id: (was: 645094)
Time Spent: 9h  (was: 8h 50m)

> Metastore some JDO query objects do not close properly
> --
>
> Key: HIVE-23633
> URL: https://issues.apache.org/jira/browse/HIVE-23633
> Project: Hive
>  Issue Type: Bug
>Reporter: Zhihua Deng
>Assignee: Zhihua Deng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-23633.01.patch
>
>  Time Spent: 9h
>  Remaining Estimate: 0h
>
> After [HIVE-10895|https://issues.apache.org/jira/browse/HIVE-10895] was patched,  
> the metastore has still seen a memory leak on DB resources: many 
> StatementImpls left unclosed.





[jira] [Work logged] (HIVE-25303) CTAS hive.create.as.external.legacy tries to place data files in managed WH path

2021-09-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25303?focusedWorklogId=645019&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-645019
 ]

ASF GitHub Bot logged work on HIVE-25303:
-

Author: ASF GitHub Bot
Created on: 01/Sep/21 09:27
Start Date: 01/Sep/21 09:27
Worklog Time Spent: 10m 
  Work Description: pvary commented on pull request #2442:
URL: https://github.com/apache/hive/pull/2442#issuecomment-909027286


   CC: @marton-bod 




Issue Time Tracking
---

Worklog Id: (was: 645019)
Time Spent: 2.5h  (was: 2h 20m)

> CTAS hive.create.as.external.legacy tries to place data files in managed WH 
> path
> 
>
> Key: HIVE-25303
> URL: https://issues.apache.org/jira/browse/HIVE-25303
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2, Standalone Metastore
>Reporter: Sai Hemanth Gantasala
>Assignee: Sai Hemanth Gantasala
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Under legacy table creation mode (hive.create.as.external.legacy=true), when 
> a database has been created in a specific LOCATION, in a session where that 
> database is used, tables created using the following command:
> {code:java}
> CREATE TABLE  AS SELECT {code}
> should inherit the HDFS path from the database's location. Instead, Hive is 
> trying to write the table data into 
> /warehouse/tablespace/managed/hive//
> +Design+: 
> In a CTAS query, the data is first written to the target directory (which 
> happens in HS2) and then the table is created (this happens in HMS). So here 
> two decisions are being made: i) the target directory location, and ii) how the table 
> should be created (table type, SD, etc.).
> When HS2 needs the target location to be set, it makes a create-table dry-run 
> call to HMS (where table translation happens); decisions i) and ii) 
> are made within HMS, which returns the table object. Then HS2 will use 
> this location set by HMS for placing the data.
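
A hedged sketch of that two-phase flow (all helper names here are hypothetical stand-ins for the HS2/HMS interaction, not the actual API):
{code:java}
// 1) HS2 asks HMS for a dry run of table creation; HMS applies table
//    translation and decides the table type and the target location.
Table translated = translateTableDryRun(tableDesc);          // hypothetical call
Path target = new Path(translated.getSd().getLocation());
// 2) HS2 writes the CTAS query results to the HMS-chosen location ...
writeQueryResults(target);                                   // hypothetical call
// 3) ... and only then creates the table in HMS with the same metadata.
createTable(translated);                                     // hypothetical call
{code}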





[jira] [Work logged] (HIVE-23633) Metastore some JDO query objects do not close properly

2021-09-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23633?focusedWorklogId=645012&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-645012
 ]

ASF GitHub Bot logged work on HIVE-23633:
-

Author: ASF GitHub Bot
Created on: 01/Sep/21 09:26
Start Date: 01/Sep/21 09:26
Worklog Time Spent: 10m 
  Work Description: pvary commented on pull request #2344:
URL: https://github.com/apache/hive/pull/2344#issuecomment-909880973


   Thanks for catching this! 
   
   We should close all the queries, even if there is an exception.
   Since Java 8, closing queries in `try-with-resources` would be best when 
possible; falling back to `finally` when that is not possible is the second best.
   




Issue Time Tracking
---

Worklog Id: (was: 645012)
Time Spent: 8h 50m  (was: 8h 40m)

> Metastore some JDO query objects do not close properly
> --
>
> Key: HIVE-23633
> URL: https://issues.apache.org/jira/browse/HIVE-23633
> Project: Hive
>  Issue Type: Bug
>Reporter: Zhihua Deng
>Assignee: Zhihua Deng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-23633.01.patch
>
>  Time Spent: 8h 50m
>  Remaining Estimate: 0h
>
> After [HIVE-10895|https://issues.apache.org/jira/browse/HIVE-10895] was patched,  
> the metastore has still seen a memory leak on DB resources: many 
> StatementImpls left unclosed.





[jira] [Work logged] (HIVE-24590) Operation Logging still leaks the log4j Appenders

2021-09-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24590?focusedWorklogId=645008&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-645008
 ]

ASF GitHub Bot logged work on HIVE-24590:
-

Author: ASF GitHub Bot
Created on: 01/Sep/21 09:25
Start Date: 01/Sep/21 09:25
Worklog Time Spent: 10m 
  Work Description: zabetak commented on pull request #2432:
URL: https://github.com/apache/hive/pull/2432#issuecomment-909110409


   Hey @prasanthj, can you have a final look at this so we can get it in? It 
seems that more and more people are bumping into the problem.




Issue Time Tracking
---

Worklog Id: (was: 645008)
Time Spent: 3h 10m  (was: 3h)

> Operation Logging still leaks the log4j Appenders
> -
>
> Key: HIVE-24590
> URL: https://issues.apache.org/jira/browse/HIVE-24590
> Project: Hive
>  Issue Type: Bug
>  Components: Logging
>Reporter: Eugene Chung
>Assignee: Stamatis Zampetakis
>Priority: Major
>  Labels: pull-request-available
> Attachments: Screen Shot 2021-01-06 at 18.42.05.png, Screen Shot 
> 2021-01-06 at 18.42.24.png, Screen Shot 2021-01-06 at 18.42.55.png, Screen 
> Shot 2021-01-06 at 21.38.32.png, Screen Shot 2021-01-06 at 21.47.28.png, 
> Screen Shot 2021-01-08 at 21.01.40.png, add_debug_log_and_trace.patch
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> I'm using Hive 3.1.2 with options below.
>  * hive.server2.logging.operation.enabled=true
>  * hive.server2.logging.operation.level=VERBOSE
>  * hive.async.log.enabled=false
> I already know about the ticket, https://issues.apache.org/jira/browse/HIVE-17128, 
> but HS2 still leaks the log4j RandomAccessFileManager.
> !Screen Shot 2021-01-06 at 18.42.05.png|width=756,height=197!
> I checked the operation log file which is not closed/deleted properly.
> !Screen Shot 2021-01-06 at 18.42.24.png|width=603,height=272!
> Then there's the log,
> {code:java}
> client.TezClient: Shutting down Tez Session, sessionName= {code}
> !Screen Shot 2021-01-06 at 18.42.55.png|width=1372,height=26!





[jira] [Work logged] (HIVE-25482) Add option to enable connectionLeak detection for Hikari datasource

2021-09-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25482?focusedWorklogId=645000&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-645000
 ]

ASF GitHub Bot logged work on HIVE-25482:
-

Author: ASF GitHub Bot
Created on: 01/Sep/21 09:25
Start Date: 01/Sep/21 09:25
Worklog Time Spent: 10m 
  Work Description: rbalamohan merged pull request #2610:
URL: https://github.com/apache/hive/pull/2610


   




Issue Time Tracking
---

Worklog Id: (was: 645000)
Time Spent: 2h 50m  (was: 2h 40m)

> Add option to enable connectionLeak detection for Hikari datasource
> ---
>
> Key: HIVE-25482
> URL: https://issues.apache.org/jira/browse/HIVE-25482
> Project: Hive
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Assignee: Aleksandr Pashkovskii
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> There are corner cases where we have observed connection leaks to the DB.
>  
> It will be good to add an option to provide a connection-leak timeout parameter 
> in HikariCPDataSourceProvider.
> [https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/datasource/HikariCPDataSourceProvider.java#L69]
> e.g. the following should make Hikari warn about a connection leak when a 
> connection is not returned to the pool for 1 hour.
> {noformat}
> config.setLeakDetectionThreshold(3600*1000); {noformat}
>  
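
A hedged sketch of wiring this up directly against HikariConfig (the JDBC URL is an assumption; the proposed change would presumably expose the threshold through HikariCPDataSourceProvider configuration instead):
{code:java}
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class LeakDetectionExample {
  public static HikariDataSource pool() {
    HikariConfig config = new HikariConfig();
    config.setJdbcUrl("jdbc:mysql://localhost:3306/metastore"); // assumed URL
    // Log a warning (with the borrower's stack trace) when a connection is
    // not returned to the pool within one hour.
    config.setLeakDetectionThreshold(3600 * 1000L);
    return new HikariDataSource(config);
  }
}
{code}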





[jira] [Work logged] (HIVE-25482) Add option to enable connectionLeak detection for Hikari datasource

2021-09-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25482?focusedWorklogId=644957&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-644957
 ]

ASF GitHub Bot logged work on HIVE-25482:
-

Author: ASF GitHub Bot
Created on: 01/Sep/21 09:21
Start Date: 01/Sep/21 09:21
Worklog Time Spent: 10m 
  Work Description: avpash43 commented on pull request #2610:
URL: https://github.com/apache/hive/pull/2610#issuecomment-909177024


   @rbalamohan, all tests have finished successfully. Can we merge?




Issue Time Tracking
---

Worklog Id: (was: 644957)
Time Spent: 2h 40m  (was: 2.5h)

> Add option to enable connectionLeak detection for Hikari datasource
> ---
>
> Key: HIVE-25482
> URL: https://issues.apache.org/jira/browse/HIVE-25482
> Project: Hive
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Assignee: Aleksandr Pashkovskii
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> There are corner cases where we have observed connection leaks to the DB.
>  
> It will be good to add an option to provide a connection-leak timeout parameter 
> in HikariCPDataSourceProvider.
> [https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/datasource/HikariCPDataSourceProvider.java#L69]
> e.g. the following should make Hikari warn about a connection leak when a 
> connection is not returned to the pool for 1 hour.
> {noformat}
> config.setLeakDetectionThreshold(3600*1000); {noformat}
>  





[jira] [Updated] (HIVE-25494) Hive query fails with IndexOutOfBoundsException when a struct type column's field is missing in parquet file schema but present in table schema

2021-09-01 Thread Ganesha Shreedhara (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ganesha Shreedhara updated HIVE-25494:
--
Description: 
When a struct type column's field is missing in the parquet file schema but present 
in the table schema and columns are accessed by name, the requestedSchema 
sent from Hive to the Parquet storage layer has a type even for the missing field, since 
we always add the type as a primitive type if a field is missing in the file schema (Ref: 
[code|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java#L130]).
 On the Parquet side, this missing field gets pruned, and since this field belongs 
to a struct type, it ends up creating a GroupColumnIO without any children. This 
causes the query to fail with an IndexOutOfBoundsException; the stack trace is given below.

 
{code:java}
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value 
at 0 in block -1 in file test-struct.parquet
 at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
 at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
 at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:98)
 at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:60)
 at 
org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
 at 
org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:695)
 at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:333)
 at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:459)
 ... 15 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
 at java.util.ArrayList.rangeCheck(ArrayList.java:657)
 at java.util.ArrayList.get(ArrayList.java:433)
 at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
 at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
 at org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:102)
 at org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:97)
 at 
org.apache.parquet.io.RecordReaderImplementation.(RecordReaderImplementation.java:277)
 at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:135)
 at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:101)
 at 
org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
 at 
org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:101)
 at 
org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140)
 at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214)
 {code}
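
To make the pruning interaction concrete, a hedged sketch using parquet-java's schema parser (the field names other than extracol are made up for illustration):
{code:java}
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class MissingStructFieldExample {
  public static void main(String[] args) {
    // File schema: the struct column exists but has no field "extracol".
    MessageType fileSchema = MessageTypeParser.parseMessageType(
        "message hive_schema {"
        + " optional group parent { optional binary existing (UTF8); }"
        + " optional binary toplevel (UTF8); }");
    // Requested schema built by Hive: the missing field is carried as a
    // primitive inside the struct (see the DataWritableReadSupport link above).
    MessageType requested = MessageTypeParser.parseMessageType(
        "message hive_schema {"
        + " optional group parent { optional binary extracol (UTF8); }"
        + " optional binary toplevel (UTF8); }");
    // After Parquet prunes "extracol" (absent from the file), the "parent"
    // group keeps no children, which is what triggers the
    // IndexOutOfBoundsException in the stack trace above.
    System.out.println(fileSchema);
    System.out.println(requested);
  }
}
{code}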
 

Steps to reproduce:

 
{code:java}
CREATE TABLE parquet_struct_test(
`parent` struct COMMENT '',
`toplevel` string COMMENT '')
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
 
-- Use the attached test-struct.parquet data file to load data to this table

LOAD DATA LOCAL INPATH 'test-struct.parquet' INTO TABLE parquet_struct_test;

hive> select parent.extracol, toplevel from parquet_struct_test;
OK
Failed with exception 
java.io.IOException:org.apache.parquet.io.ParquetDecodingException: Can not 
read value at 0 in block -1 in file 
hdfs://${host}/user/hive/warehouse/parquet_struct_test/test-struct.parquet 
{code}
 

Expected Result: {{NULL toplevel}}

 

Same query works fine in the following scenarios:

1) Accessing parquet file columns by index instead of names
{code:java}
hive> set parquet.column.index.access=true;
hive>  select parent.extracol, toplevel from parquet_struct_test;
OK
NULL toplevel{code}
 

2) When VectorizedParquetRecordReader is used
{code:java}
hive> set hive.fetch.task.conversion=none;
hive> select parent.extracol, toplevel from parquet_struct_test;
Query ID = hadoop_20210831154424_19aa6f7f-ab72-4c1e-ae37-4f985e72fce9
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id 
application_1630412697229_0031)
--
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--
Map 1 .. container     SUCCEEDED      1          1        0        0       0

[jira] [Updated] (HIVE-25494) Hive query fails with IndexOutOfBoundsException when a struct type column's field is missing in parquet file schema but present in table schema

2021-09-01 Thread Ganesha Shreedhara (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ganesha Shreedhara updated HIVE-25494:
--
Description: 
When a struct type column's field is missing in the parquet file schema but present 
in the table schema and columns are accessed by name, the requestedSchema 
sent from Hive to the Parquet storage layer has a type even for the missing field, since 
we always add the type as a primitive type if a field is missing in the file schema (Ref: 
[code|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java#L130]).
 On the Parquet side, this missing field gets pruned, and since this field belongs 
to a struct type, it ends up creating a GroupColumnIO without any children. This 
causes the query to fail with an IndexOutOfBoundsException; the stack trace is given below.

 
{code:java}
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value 
at 0 in block -1 in file test-struct.parquet
 at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
 at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
 at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:98)
 at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:60)
 at 
org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
 at 
org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:695)
 at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:333)
 at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:459)
 ... 15 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
 at java.util.ArrayList.rangeCheck(ArrayList.java:657)
 at java.util.ArrayList.get(ArrayList.java:433)
 at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
 at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
 at org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:102)
 at org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:97)
 at 
org.apache.parquet.io.RecordReaderImplementation.(RecordReaderImplementation.java:277)
 at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:135)
 at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:101)
 at 
org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
 at 
org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:101)
 at 
org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140)
 at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214)
 {code}
 

Steps to reproduce:

 
{code:java}
CREATE TABLE parquet_struct_test(
`parent` struct COMMENT '',
`toplevel` string COMMENT '')
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
 
-- Use the attached test-struct.parquet data file to load data to this table

LOAD DATA LOCAL INPATH 'test-struct.parquet' INTO TABLE parquet_struct_test;

hive> select parent.extracol, toplevel from parquet_struct_test;
OK
Failed with exception 
java.io.IOException:org.apache.parquet.io.ParquetDecodingException: Can not 
read value at 0 in block -1 in file 
hdfs://${host}/user/hive/warehouse/parquet_struct_test/test-struct.parquet 
{code}
 

Expected Result: {{NULL toplevel}}

 

Same query works fine in the following scenarios:

1) Accessing parquet file columns by index instead of names
{code:java}
hive> set parquet.column.index.access=true;
hive>  select parent.extracol, toplevel from parquet_struct_test;
OK
NULL toplevel{code}
 

2) When VectorizedParquetRecordReader is used
{code:java}
hive> set hive.fetch.task.conversion=none;
hive> select parent.extracol, toplevel from parquet_struct_test;
Query ID = hadoop_20210831154424_19aa6f7f-ab72-4c1e-ae37-4f985e72fce9
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id 
application_1630412697229_0031)
--
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--
Map 1 .. container     SUCCEEDED      1          1        0        0       0

[jira] [Updated] (HIVE-25494) Hive query fails with IndexOutOfBoundsException when a struct type column's field is missing in parquet file schema but present in table schema

2021-09-01 Thread Ganesha Shreedhara (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ganesha Shreedhara updated HIVE-25494:
--
Component/s: Parquet

> Hive query fails with IndexOutOfBoundsException when a struct type column's 
> field is missing in parquet file schema but present in table schema
> ---
>
> Key: HIVE-25494
> URL: https://issues.apache.org/jira/browse/HIVE-25494
> Project: Hive
>  Issue Type: Bug
>  Components: Parquet
>Affects Versions: 3.1.2
>Reporter: Ganesha Shreedhara
>Priority: Major
> Attachments: test-struct.parquet
>
>
> When a struct type column's field is missing in the parquet file schema but 
> present in the table schema and columns are accessed by name, the 
> requestedSchema sent from Hive to the Parquet storage layer has a type even 
> for the missing field, since we always add the type as a primitive type if a field is 
> missing in the file schema (Ref: 
> [code|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java#L130]).
>  On the Parquet side, this missing field gets pruned, and since this field 
> belongs to a struct type, it ends up creating a GroupColumnIO without any 
> children. This causes the query to fail with an IndexOutOfBoundsException; the stack 
> trace is given below.
>  
> {code:java}
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value 
> at 0 in block -1 in file test-struct.parquet
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
>  at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
>  at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:98)
>  at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:60)
>  at 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
>  at 
> org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:695)
>  at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:333)
>  at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:459)
>  ... 15 more
> Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>  at java.util.ArrayList.rangeCheck(ArrayList.java:657)
>  at java.util.ArrayList.get(ArrayList.java:433)
>  at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
>  at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
>  at 
> org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:102)
>  at org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:97)
>  at 
> org.apache.parquet.io.RecordReaderImplementation.(RecordReaderImplementation.java:277)
>  at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:135)
>  at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:101)
>  at 
> org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
>  at 
> org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:101)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214)
>  {code}
>  
> Steps to reproduce:
>  
> {code:java}
> CREATE TABLE parquet_struct_test(
> `parent` struct COMMENT '',
> `toplevel` string COMMENT '')
> ROW FORMAT SERDE
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
>  
> -- Use the attached test-struct.parquet data file to load data to this table
> LOAD DATA LOCAL INPATH 'test-struct.parquet' INTO TABLE parquet_struct_test;
> hive> select parent.extracol, toplevel from parquet_struct_test;
> OK
> Failed with exception 
> java.io.IOException:org.apache.parquet.io.ParquetDecodingException: Can not 
> read value at 0 in block -1 in file 
> hdfs://${host}/user/hive/warehouse/parquet_struct_test/test-struct.parquet 
> {code}
>  
> Same query works fine in the following scenarios:
> 1) Accessing parquet file columns by index instead of names
> {code:java}
> hive> set parquet.column.index.access=true;
> hive>  select parent.extracol, toplevel from parquet_struct_test;
> OK
> NULL toplevel{code}
>  
> 2) When VectorizedParquetRecordReader is used
> {code:java}
> hive> set 

[jira] [Updated] (HIVE-25494) Hive query fails with IndexOutOfBoundsException when a struct type column's field is missing in parquet file schema but present in table schema

2021-09-01 Thread Ganesha Shreedhara (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ganesha Shreedhara updated HIVE-25494:
--
Affects Version/s: 3.1.2

> Hive query fails with IndexOutOfBoundsException when a struct type column's 
> field is missing in parquet file schema but present in table schema
> ---
>
> Key: HIVE-25494
> URL: https://issues.apache.org/jira/browse/HIVE-25494
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.2
>Reporter: Ganesha Shreedhara
>Priority: Major
> Attachments: test-struct.parquet
>
>
> When a struct type column's field is missing in the parquet file schema but 
> present in the table schema and columns are accessed by name, the 
> requestedSchema sent from Hive to the Parquet storage layer has a type even 
> for the missing field, since we always add the type as a primitive type if a field is 
> missing in the file schema (Ref: 
> [code|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java#L130]).
>  On the Parquet side, this missing field gets pruned, and since this field 
> belongs to a struct type, it ends up creating a GroupColumnIO without any 
> children. This causes the query to fail with an IndexOutOfBoundsException; the stack 
> trace is given below.
>  
> {code:java}
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value 
> at 0 in block -1 in file test-struct.parquet
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
>  at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
>  at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:98)
>  at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:60)
>  at 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
>  at 
> org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:695)
>  at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:333)
>  at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:459)
>  ... 15 more
> Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>  at java.util.ArrayList.rangeCheck(ArrayList.java:657)
>  at java.util.ArrayList.get(ArrayList.java:433)
>  at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
>  at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
>  at 
> org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:102)
>  at org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:97)
>  at 
> org.apache.parquet.io.RecordReaderImplementation.(RecordReaderImplementation.java:277)
>  at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:135)
>  at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:101)
>  at 
> org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
>  at 
> org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:101)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214)
>  {code}
>  
> Steps to reproduce:
>  
> {code:java}
> CREATE TABLE parquet_struct_test(
> `parent` struct COMMENT '',
> `toplevel` string COMMENT '')
> ROW FORMAT SERDE
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
>  
> -- Use the attached test-struct.parquet data file to load data to this table
> LOAD DATA LOCAL INPATH 'test-struct.parquet' INTO TABLE parquet_struct_test;
> hive> select parent.extracol, toplevel from parquet_struct_test;
> OK
> Failed with exception 
> java.io.IOException:org.apache.parquet.io.ParquetDecodingException: Can not 
> read value at 0 in block -1 in file 
> hdfs://${host}/user/hive/warehouse/parquet_struct_test/test-struct.parquet 
> {code}
>  
> Same query works fine in the following scenarios:
> 1) Accessing parquet file columns by index instead of names
> {code:java}
> hive> set parquet.column.index.access=true;
> hive>  select parent.extracol, toplevel from parquet_struct_test;
> OK
> NULL toplevel{code}
>  
> 2) When VectorizedParquetRecordReader is used
> {code:java}
> hive> set hive.fetch.task.conversion=none;

[jira] [Updated] (HIVE-25494) Hive query fails with IndexOutOfBoundsException when a struct type column's field is missing in parquet file schema but present in table schema

2021-09-01 Thread Ganesha Shreedhara (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ganesha Shreedhara updated HIVE-25494:
--
Labels: schema-evolution  (was: )

> Hive query fails with IndexOutOfBoundsException when a struct type column's 
> field is missing in parquet file schema but present in table schema
> ---
>
> Key: HIVE-25494
> URL: https://issues.apache.org/jira/browse/HIVE-25494
> Project: Hive
>  Issue Type: Bug
>  Components: Parquet
>Affects Versions: 3.1.2
>Reporter: Ganesha Shreedhara
>Priority: Major
>  Labels: schema-evolution
> Attachments: test-struct.parquet
>
>
> When a struct type column's field is missing in the parquet file schema but 
> present in the table schema and columns are accessed by name, the 
> requestedSchema sent from Hive to the Parquet storage layer has a type even 
> for the missing field, since we always add the type as a primitive type if a field is 
> missing in the file schema (Ref: 
> [code|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java#L130]).
>  On the Parquet side, this missing field gets pruned, and since this field 
> belongs to a struct type, it ends up creating a GroupColumnIO without any 
> children. This causes the query to fail with an IndexOutOfBoundsException; the stack 
> trace is given below.
>  
> {code:java}
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value 
> at 0 in block -1 in file test-struct.parquet
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
>  at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
>  at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:98)
>  at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:60)
>  at 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
>  at 
> org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:695)
>  at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:333)
>  at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:459)
>  ... 15 more
> Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>  at java.util.ArrayList.rangeCheck(ArrayList.java:657)
>  at java.util.ArrayList.get(ArrayList.java:433)
>  at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
>  at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
>  at 
> org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:102)
>  at org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:97)
>  at 
> org.apache.parquet.io.RecordReaderImplementation.(RecordReaderImplementation.java:277)
>  at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:135)
>  at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:101)
>  at 
> org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
>  at 
> org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:101)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214)
>  {code}
>  
> Steps to reproduce:
>  
> {code:java}
> CREATE TABLE parquet_struct_test(
> `parent` struct COMMENT '',
> `toplevel` string COMMENT '')
> ROW FORMAT SERDE
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
>  
> -- Use the attached test-struct.parquet data file to load data to this table
> LOAD DATA LOCAL INPATH 'test-struct.parquet' INTO TABLE parquet_struct_test;
> hive> select parent.extracol, toplevel from parquet_struct_test;
> OK
> Failed with exception 
> java.io.IOException:org.apache.parquet.io.ParquetDecodingException: Can not 
> read value at 0 in block -1 in file 
> hdfs://${host}/user/hive/warehouse/parquet_struct_test/test-struct.parquet 
> {code}
>  
> Same query works fine in the following scenarios:
> 1) Accessing parquet file columns by index instead of names
> {code:java}
> hive> set parquet.column.index.access=true;
> hive>  select parent.extracol, toplevel from parquet_struct_test;
> OK
> NULL toplevel{code}
>  
> 2) When 

[jira] [Commented] (HIVE-25494) Hive query fails with IndexOutOfBoundsException when a struct type column's field is missing in parquet file schema but present in table schema

2021-09-01 Thread Ganesha Shreedhara (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-25494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17407866#comment-17407866
 ] 

Ganesha Shreedhara commented on HIVE-25494:
---

I verified that this issue doesn't exist when the requestedSchema only has the fields that are present in the file schema. But I noticed that VectorizedParquetRecordReader takes all the fields and creates a VectorizedDummyColumnReader when a field is not present in the file schema. When columns are accessed by names, should we include in the requestedSchema only the fields that are present in the file schema, and return null for the rest of the selected columns that are missing from the file schema?
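One way to sketch that suggestion (illustrative only; the class name, method names, and placement are assumptions, not the eventual fix): intersect the requested schema with the file schema before it reaches Parquet, and drop any group that would be left without children, so the reader can substitute NULL for the pruned columns.

{code:java}
import java.util.ArrayList;
import java.util.List;
import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Type;
import org.apache.parquet.schema.Types;

public final class RequestedSchemaPruner {

  /** Keeps only the requested fields that also exist in the file schema. */
  public static MessageType pruneToFileSchema(MessageType requested, MessageType file) {
    List<Type> kept = pruneFields(requested.getFields(), file);
    return Types.buildMessage().addFields(kept.toArray(new Type[0])).named(requested.getName());
  }

  private static List<Type> pruneFields(List<Type> requestedFields, GroupType fileGroup) {
    List<Type> kept = new ArrayList<>();
    for (Type field : requestedFields) {
      if (!fileGroup.containsField(field.getName())) {
        continue; // missing in the file: leave it out so Hive can return NULL for it
      }
      Type fileField = fileGroup.getType(field.getName());
      if (field.isPrimitive() || fileField.isPrimitive()) {
        kept.add(field);
      } else {
        List<Type> children = pruneFields(field.asGroupType().getFields(), fileField.asGroupType());
        if (!children.isEmpty()) {
          kept.add(field.asGroupType().withNewFields(children)); // never emit an empty group
        }
      }
    }
    return kept;
  }
}
{code}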

 

> Hive query fails with IndexOutOfBoundsException when a struct type column's 
> field is missing in parquet file schema but present in table schema
> ---
>
> Key: HIVE-25494
> URL: https://issues.apache.org/jira/browse/HIVE-25494
> Project: Hive
>  Issue Type: Bug
>Reporter: Ganesha Shreedhara
>Priority: Major
> Attachments: test-struct.parquet
>
>