[jira] [Assigned] (HIVE-26188) Query level cache and HMS local cache doesn't work locally and with Explain statements.

2022-04-28 Thread Soumyakanti Das (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Soumyakanti Das reassigned HIVE-26188:
--


> Query level cache and HMS local cache doesn't work locally and with Explain 
> statements.
> ---
>
> Key: HIVE-26188
> URL: https://issues.apache.org/jira/browse/HIVE-26188
> Project: Hive
>  Issue Type: Bug
>Reporter: Soumyakanti Das
>Assignee: Soumyakanti Das
>Priority: Major
>
> {{ExplainSemanticAnalyzer}} should override the {{startAnalysis()}} method that 
> creates the query level cache. This is important because after 
> https://issues.apache.org/jira/browse/HIVE-25918, the HMS local cache only 
> works if the query level cache is also initialized.
> Also, the {{data/conf/llap/hive-site.xml}} properties for the HMS cache are 
> incorrect and should be fixed to enable the cache during qtests.
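
A minimal sketch of the proposed change, for illustration only (the base class and the cache-creation step are stand-ins; only the class and method names come from this ticket):

{code:java}
// Illustrative stand-ins: the real analyzer extends Hive's semantic-analyzer base class,
// and the query-level cache is created by Hive internals rather than this stub.
class BaseAnalyzerSketch {
  protected void startAnalysis() {
    // regular analyzers create the query-level cache here
  }
}

class ExplainAnalyzerSketch extends BaseAnalyzerSketch {
  @Override
  protected void startAnalysis() {
    // EXPLAIN must run the same initialization, otherwise the HMS local cache
    // (which depends on the query-level cache since HIVE-25918) stays disabled.
    super.startAnalysis();
  }
}
{code}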



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HIVE-26187) Set operations and time travel is not working

2022-04-28 Thread Jira


[ 
https://issues.apache.org/jira/browse/HIVE-26187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529590#comment-17529590
 ] 

Zoltán Borók-Nagy commented on HIVE-26187:
--

[~pvary] could you please take a look and assign it to someone?

> Set operations and time travel is not working
> -
>
> Key: HIVE-26187
> URL: https://issues.apache.org/jira/browse/HIVE-26187
> Project: Hive
>  Issue Type: Bug
>Reporter: Zoltán Borók-Nagy
>Priority: Major
>  Labels: iceberg
>
> Set operations don't work well with time travel queries.
> Repro:
> {noformat}
> select * from t FOR SYSTEM_VERSION AS OF <snapshot_id_1>
> MINUS
> select * from t FOR SYSTEM_VERSION AS OF <snapshot_id_2>;
> {noformat}
> Returns 0 results because both selects use the same snapshot id, instead of 
> snapshot_id_1 and snapshot_id_2.
> There are probably issues with other queries as well, when the same table is 
> used multiple times with different snapshot ids.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HIVE-21456) Hive Metastore Thrift over HTTP

2022-04-28 Thread Sourabh Goyal (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-21456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529579#comment-17529579
 ] 

Sourabh Goyal commented on HIVE-21456:
--

Merged upstream. Commit link: 
https://github.com/apache/hive/commit/b7da71856b1bb51af68a5ba6890b65f9843f3606

> Hive Metastore Thrift over HTTP
> ---
>
> Key: HIVE-21456
> URL: https://issues.apache.org/jira/browse/HIVE-21456
> Project: Hive
>  Issue Type: New Feature
>  Components: Metastore, Standalone Metastore
>Reporter: Amit Khanna
>Assignee: Sourabh Goyal
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-21456.2.patch, HIVE-21456.3.patch, 
> HIVE-21456.4.patch, HIVE-21456.patch
>
>  Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> Hive Metastore currently doesn't support HTTP transport, which makes it 
> impossible to access it via Knox. Adding support for Thrift over HTTP 
> transport will allow clients to access it via Knox.
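
A rough sketch of what a metastore client call over HTTP transport could look like (the endpoint URL is hypothetical; {{THttpClient}} and {{ThriftHiveMetastore.Client}} are the standard Thrift/Hive classes, but the actual feature is wired up through Metastore configuration rather than hand-built like this):

{code:java}
import org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.THttpClient;
import org.apache.thrift.transport.TTransport;

public class HmsHttpClientSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical Knox-proxied endpoint; with the classic TCP transport this would be a TSocket.
    TTransport transport = new THttpClient("https://knox-gateway:8443/gateway/default/hms");
    transport.open();
    ThriftHiveMetastore.Client client = new ThriftHiveMetastore.Client(new TBinaryProtocol(transport));
    System.out.println(client.get_all_databases());
    transport.close();
  }
}
{code}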



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HIVE-21456) Hive Metastore Thrift over HTTP

2022-04-28 Thread Sourabh Goyal (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-21456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Goyal updated HIVE-21456:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> Hive Metastore Thrift over HTTP
> ---
>
> Key: HIVE-21456
> URL: https://issues.apache.org/jira/browse/HIVE-21456
> Project: Hive
>  Issue Type: New Feature
>  Components: Metastore, Standalone Metastore
>Reporter: Amit Khanna
>Assignee: Sourabh Goyal
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-21456.2.patch, HIVE-21456.3.patch, 
> HIVE-21456.4.patch, HIVE-21456.patch
>
>  Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> Hive Metastore currently doesn't have support for HTTP transport because of 
> which it is not possible to access it via Knox. Adding support for Thrift 
> over HTTP transport will allow the clients to access via Knox



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26183) Create delete writer for the UPDATE statements

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26183?focusedWorklogId=763612=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763612
 ]

ASF GitHub Bot logged work on HIVE-26183:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 15:36
Start Date: 28/Apr/22 15:36
Worklog Time Spent: 10m 
  Work Description: pvary commented on code in PR #3251:
URL: https://github.com/apache/hive/pull/3251#discussion_r861037824


##
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergBufferedDeleteWriter.java:
##
@@ -0,0 +1,166 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.io.IOException;
+import java.util.Collection;
+import java.util.Map;
+import java.util.TreeSet;
+import java.util.concurrent.ConcurrentLinkedQueue;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Writable;
+import org.apache.iceberg.DeleteFile;
+import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.MetadataColumns;
+import org.apache.iceberg.PartitionKey;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.data.GenericRecord;
+import org.apache.iceberg.data.InternalRecordWrapper;
+import org.apache.iceberg.data.Record;
+import org.apache.iceberg.deletes.PositionDelete;
+import org.apache.iceberg.io.ClusteredPositionDeleteWriter;
+import org.apache.iceberg.io.DeleteWriteResult;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.FileWriterFactory;
+import org.apache.iceberg.io.OutputFileFactory;
+import org.apache.iceberg.io.PartitioningWriter;
+import org.apache.iceberg.mr.mapred.Container;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import 
org.apache.iceberg.relocated.com.google.common.util.concurrent.ThreadFactoryBuilder;
+import org.apache.iceberg.util.Tasks;
+import org.roaringbitmap.longlong.Roaring64Bitmap;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class HiveIcebergBufferedDeleteWriter implements HiveIcebergWriter {

Review Comment:
   @marton-bod: Could you please check the comments?



##
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergBufferedDeleteWriter.java:
##
@@ -0,0 +1,166 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.io.IOException;
+import java.util.Collection;
+import java.util.Map;
+import java.util.TreeSet;
+import java.util.concurrent.ConcurrentLinkedQueue;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Writable;
+import org.apache.iceberg.DeleteFile;
+import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.MetadataColumns;
+import org.apache.iceberg.PartitionKey;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.data.GenericRecord;
+import org.apache.iceberg.data.InternalRecordWrapper;
+import org.apache.iceberg.data.Record;
+import org.apache.iceberg.deletes.PositionDelete;
+import org.apache.iceberg.io.ClusteredPositionDeleteWriter;
+import 

[jira] [Commented] (HIVE-22670) ArrayIndexOutOfBoundsException when vectorized reader is used for reading a parquet file

2022-04-28 Thread Pierre Gramme (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-22670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529483#comment-17529483
 ] 

Pierre Gramme commented on HIVE-22670:
--

I don't know how to attach the Parquet file here, so here's a shell script to 
create it:

 
{noformat}
cat | base64 -d > test-arrow-int-na.parquet << 'EOF'
UEFSMRUEFQAVAkwVABUEEgAAABUAFQ4VEiwVBBUEFQYVBhw2BAcYAgQAACZiHBUCGTUE
AAYZGAF4FQIWBBZUFlomJiYIHDYEABksFQQVBBUCABUAFQQVAgAAABUEFRAVFEwVBBUEEgAACBwB
AgAAABUAFRIVFiwVBBUEFQYVBhwYBAIYBAEWACgEAgAAABgEAQkgAgAA
AAQBAQMCJuICHBUCGTUEAAYZGAF5FQIWBBaYARagASbyASbCARwYBAIYBAEWACgEAgAA
ABgEAQAZLBUEFQQVAgAVABUEFQIVAhk8NQAYBnNjaGVtYRUEABUCJQIYAXgAFQIlAhgB
eQAWBBkcGSwmYhwVAhk1BAAGGRgBeBUCFgQWVBZaJiYmCBw2BAAZLBUEFQQVAgAVABUEFQIm
4gIcFQIZNQQABhkYAXkVAhYEFpgBFqABJvIBJsIBHBgEAgAAABgEAQAAABYAKAQCGAQB
ABksFQQVBBUCABUAFQQVAgAAABbsARYEJggW+gEUAAAZHBgMQVJST1c6c2NoZW1hGOwBLy8vLy82
Z0FBQUFRQUFBQUFBQUtBQXdBQmdBRkFBZ0FDZ0FBQUFBQkJBQU1BQUFBQ0FBSUFBQUFCQUFJQUFB
QUJBQUFBQUlBQUFCRUFBQUFCQUFBQU5ULy8vOEFBQUVDRUFBQUFCUUFBQUFFQUFBQUFBQUFBQUVB
QUFCNUFBQUF4UC8vL3dBQUFBRWdBQUFBRUFBVUFBZ0FCZ0FIQUF3QUFBQVFBQkFBQUFBQUFBRUNF
QUFBQUJ3QUFBQUVBQUFBQUFBQUFBRUFBQUI0QUFBQUNBQU1BQWdBQndBSUFBQUFBQUFBQVNBQUFB
QT0AGB9wYXJxdWV0LWNwcC1hcnJvdyB2ZXJzaW9uIDcuMC4wGSwcAAAc2wEAAFBBUjE=
EOF{noformat}
 

 

> ArrayIndexOutOfBoundsException when vectorized reader is used for reading a 
> parquet file
> 
>
> Key: HIVE-22670
> URL: https://issues.apache.org/jira/browse/HIVE-22670
> Project: Hive
>  Issue Type: Bug
>  Components: Parquet, Vectorization
>Affects Versions: 2.3.6, 3.1.2
>Reporter: Ganesha Shreedhara
>Assignee: Ganesha Shreedhara
>Priority: Major
> Attachments: HIVE-22670.1.patch, HIVE-22670.2.patch
>
>
> ArrayIndexOutOfBoundsException is getting thrown while decoding dictionaryIds 
> of a row group in a Parquet file with vectorization enabled. 
> *Exception stack trace:*
> {code:java}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
>  at 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:122)
>  at 
> org.apache.hadoop.hive.ql.io.parquet.vector.ParquetDataColumnReaderFactory$DefaultParquetDataColumnReader.readString(ParquetDataColumnReaderFactory.java:95)
>  at 
> org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedPrimitiveColumnReader.decodeDictionaryIds(VectorizedPrimitiveColumnReader.java:467)
>  at 
> org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedPrimitiveColumnReader.readBatch(VectorizedPrimitiveColumnReader.java:68)
>  at 
> org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:410)
>  at 
> org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.next(VectorizedParquetRecordReader.java:353)
>  at 
> org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.next(VectorizedParquetRecordReader.java:92)
>  at 
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:365)
>  ... 24 more{code}
>  
> This issue seems to be caused by re-using the same dictionary column vector 
> while reading consecutive row groups. This looks like a corner-case 
> bug which occurs for a certain distribution of dictionary/plain encoded data 
> while we read/populate the underlying bit-packed dictionary data into a 
> column-vector based data structure. 
> A similar issue was reported in Spark (Ref: 
> https://issues.apache.org/jira/browse/SPARK-16334)
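
A toy illustration of the failure mode described above (plain Java, not the Hive/Parquet reader code): dictionary ids decoded for one row group are reused while the dictionary has already been swapped for the next, smaller one.

{code:java}
public class StaleDictionarySketch {
  public static void main(String[] args) {
    String[] rowGroup1Dictionary = {"a", "b", "c"};
    String[] rowGroup2Dictionary = {"a"};          // the next row group has a smaller dictionary
    int[] staleDictionaryIds = {0, 2, 1};          // ids decoded against rowGroup1Dictionary

    // Reading row group 2 but indexing with the stale ids kept from row group 1:
    for (int id : staleDictionaryIds) {
      System.out.println(rowGroup2Dictionary[id]); // throws ArrayIndexOutOfBoundsException here
    }
  }
}
{code}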



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HIVE-22670) ArrayIndexOutOfBoundsException when vectorized reader is used for reading a parquet file

2022-04-28 Thread Pierre Gramme (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-22670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529476#comment-17529476
 ] 

Pierre Gramme commented on HIVE-22670:
--

Hi

I encountered the same problem as [~ganeshas]. I was able to narrow it down to 
a minimal reproducible example, see the attachment.

My Parquet file is generated with Apache Arrow 7.0.0, using the R API (but I 
don't think it is relevant):

{{  arrow::write_parquet(tibble::tibble(x=NA_integer_, y=1:2), 
"test-arrow-int-na.parquet")}}

So the table has 2 variables, x and y, both integers:

 
||x||y||
|NULL|1|
|NULL|2|

 

 
{noformat}
create external table test_parquet_na (x integer, y integer) stored as parquet 
location 'hdfs:///path/to/test_parquet_na/';

-- The following works as expected:
set hive.vectorized.execution.enabled=false;
select * from test_parquet_na;
select * from test_parquet_na order by y;

-- This also works:
set hive.vectorized.execution.enabled=true;
select * from test_parquet_na;

-- But this crashes:
set hive.vectorized.execution.enabled=true; 
select * from test_parquet_na order by y;
-- => ERROR: same as OP,
-- Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
--         at 
org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary.decodeToInt(PlainValuesDictionary.java:251)
--         at 
org.apache.hadoop.hive.ql.io.parquet.vector.ParquetDataColumnReaderFactory$DefaultParquetDataColumnReader.readInteger(ParquetDataColumnReaderFactory.java:182)
  
-- ...{noformat}
Note: I did my tests on an HDP cluster with Apache Hive (version 3.1.0.3.1.5.0-152). 
I can't easily test on a more recent version, sorry.

> ArrayIndexOutOfBoundsException when vectorized reader is used for reading a 
> parquet file
> 
>
> Key: HIVE-22670
> URL: https://issues.apache.org/jira/browse/HIVE-22670
> Project: Hive
>  Issue Type: Bug
>  Components: Parquet, Vectorization
>Affects Versions: 2.3.6, 3.1.2
>Reporter: Ganesha Shreedhara
>Assignee: Ganesha Shreedhara
>Priority: Major
> Attachments: HIVE-22670.1.patch, HIVE-22670.2.patch
>
>
> ArrayIndexOutOfBoundsException is getting thrown while decoding dictionaryIds 
> of a row group in a Parquet file with vectorization enabled. 
> *Exception stack trace:*
> {code:java}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
>  at 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:122)
>  at 
> org.apache.hadoop.hive.ql.io.parquet.vector.ParquetDataColumnReaderFactory$DefaultParquetDataColumnReader.readString(ParquetDataColumnReaderFactory.java:95)
>  at 
> org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedPrimitiveColumnReader.decodeDictionaryIds(VectorizedPrimitiveColumnReader.java:467)
>  at 
> org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedPrimitiveColumnReader.readBatch(VectorizedPrimitiveColumnReader.java:68)
>  at 
> org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:410)
>  at 
> org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.next(VectorizedParquetRecordReader.java:353)
>  at 
> org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.next(VectorizedParquetRecordReader.java:92)
>  at 
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:365)
>  ... 24 more{code}
>  
> This issue seems to be caused by re-using the same dictionary column vector 
> while reading consecutive row groups. This looks like a corner-case 
> bug which occurs for a certain distribution of dictionary/plain encoded data 
> while we read/populate the underlying bit-packed dictionary data into a 
> column-vector based data structure. 
> A similar issue was reported in Spark (Ref: 
> https://issues.apache.org/jira/browse/SPARK-16334)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HIVE-26158) TRANSLATED_TO_EXTERNAL partition tables cannot query partition data after rename table

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-26158:
--
Labels: pull-request-available  (was: )

> TRANSLATED_TO_EXTERNAL partition tables cannot query partition data after 
> rename table
> --
>
> Key: HIVE-26158
> URL: https://issues.apache.org/jira/browse/HIVE-26158
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0, 4.0.0-alpha-1, 4.0.0-alpha-2
>Reporter: tanghui
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> After the patch is applied, the partition table location and HDFS data 
> directory are displayed correctly, but the partition locations of the table in 
> the SDS table in the Hive metastore still point to the location of the old 
> table, resulting in no data when querying the partitions.
>  
> in beeline:
> 
> set hive.create.as.external.legacy=true;
> CREATE TABLE part_test(
> c1 string
> ,c2 string
> )PARTITIONED BY (dat string);
> insert into part_test values ("11","th","20220101");
> insert into part_test values ("22","th","20220102");
> alter table part_test rename to part_test11;
> --this result is null.
> select * from part_test11 where dat="20220101";
> ||part_test.c1||part_test.c2||part_test.dat||
> | | | |
> -
> SDS in the Hive metastore:
> select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND 
> TBLS.TBL_ID=SDS.CD_ID;
> ---
> |*LOCATION*|
> |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11|
> |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101|
> |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102|
> ---
>  
> We need to modify the partition locations of the table in SDS to ensure that 
> the query results are correct.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26158) TRANSLATED_TO_EXTERNAL partition tables cannot query partition data after rename table

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26158?focusedWorklogId=763543=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763543
 ]

ASF GitHub Bot logged work on HIVE-26158:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 14:27
Start Date: 28/Apr/22 14:27
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk opened a new pull request, #3255:
URL: https://github.com/apache/hive/pull/3255

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   




Issue Time Tracking
---

Worklog Id: (was: 763543)
Remaining Estimate: 0h
Time Spent: 10m

> TRANSLATED_TO_EXTERNAL partition tables cannot query partition data after 
> rename table
> --
>
> Key: HIVE-26158
> URL: https://issues.apache.org/jira/browse/HIVE-26158
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0, 4.0.0-alpha-1, 4.0.0-alpha-2
>Reporter: tanghui
>Assignee: Zoltan Haindrich
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> After the patch is applied, the partition table location and HDFS data 
> directory are displayed correctly, but the partition locations of the table in 
> the SDS table in the Hive metastore still point to the location of the old 
> table, resulting in no data when querying the partitions.
>  
> in beeline:
> 
> set hive.create.as.external.legacy=true;
> CREATE TABLE part_test(
> c1 string
> ,c2 string
> )PARTITIONED BY (dat string);
> insert into part_test values ("11","th","20220101");
> insert into part_test values ("22","th","20220102");
> alter table part_test rename to part_test11;
> --this result is null.
> select * from part_test11 where dat="20220101";
> ||part_test.c1||part_test.c2||part_test.dat||
> | | | |
> -
> SDS in the Hive metastore:
> select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND 
> TBLS.TBL_ID=SDS.CD_ID;
> ---
> |*LOCATION*|
> |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11|
> |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101|
> |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102|
> ---
>  
> We need to modify the partition locations of the table in SDS to ensure that 
> the query results are correct.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26149) Non blocking DROP DATABASE implementation

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26149?focusedWorklogId=763484=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763484
 ]

ASF GitHub Bot logged work on HIVE-26149:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 12:47
Start Date: 28/Apr/22 12:47
Worklog Time Spent: 10m 
  Work Description: deniskuzZ commented on code in PR #3220:
URL: https://github.com/apache/hive/pull/3220#discussion_r860846614


##
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java:
##
@@ -1869,8 +1880,7 @@ void dropDatabase(String catName, String dbName, boolean 
deleteData, boolean ign
* @throws MetaException something went wrong, usually either in the RDBMS 
or storage.
* @throws TException general thrift error.
*/
-  default void dropDatabase(String catName, String dbName, boolean deleteData,
-boolean ignoreUnknownDb)
+  default void dropDatabase(String catName, String dbName, boolean deleteData, 
boolean ignoreUnknownDb)

Review Comment:
   marked as deprecated





Issue Time Tracking
---

Worklog Id: (was: 763484)
Time Spent: 2.5h  (was: 2h 20m)

> Non blocking DROP DATABASE implementation
> -
>
> Key: HIVE-26149
> URL: https://issues.apache.org/jira/browse/HIVE-26149
> Project: Hive
>  Issue Type: Task
>Reporter: Denys Kuzmenko
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26149) Non blocking DROP DATABASE implementation

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26149?focusedWorklogId=763483=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763483
 ]

ASF GitHub Bot logged work on HIVE-26149:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 12:46
Start Date: 28/Apr/22 12:46
Worklog Time Spent: 10m 
  Work Description: deniskuzZ commented on code in PR #3220:
URL: https://github.com/apache/hive/pull/3220#discussion_r860846223


##
ql/src/java/org/apache/hadoop/hive/ql/ddl/database/drop/DropDatabaseAnalyzer.java:
##
@@ -49,28 +52,36 @@ public void analyzeInternal(ASTNode root) throws 
SemanticException {
 String databaseName = unescapeIdentifier(root.getChild(0).getText());
 boolean ifExists = root.getFirstChildWithType(HiveParser.TOK_IFEXISTS) != 
null;
 boolean cascade = root.getFirstChildWithType(HiveParser.TOK_CASCADE) != 
null;
+boolean isSoftDelete = HiveConf.getBoolVar(conf, 
HiveConf.ConfVars.HIVE_ACID_LOCKLESS_READS_ENABLED);
 
 Database database = getDatabase(databaseName, !ifExists);
 if (database == null) {
   return;
 }
-
 // if cascade=true, then we need to authorize the drop table action as 
well, and add the tables to the outputs
+boolean allTablesWithSuffix = false;
 if (cascade) {
   try {
-for (Table table : db.getAllTableObjects(databaseName)) {
-  // We want no lock here, as the database lock will cover the tables,
-  // and putting a lock will actually cause us to deadlock on 
ourselves.
-  outputs.add(new WriteEntity(table, 
WriteEntity.WriteType.DDL_NO_LOCK));
+List<Table> tables = db.getAllTableObjects(databaseName);
+allTablesWithSuffix = tables.stream().allMatch(
+table -> AcidUtils.isTableSoftDeleteEnabled(table, conf));
+for (Table table : tables) {
+  // Optimization used to limit number of requested locks. Check if 
table lock is needed or we could get away with single DB level lock,
+  boolean isTableLockNeeded = isSoftDelete && !allTablesWithSuffix;
+  outputs.add(new WriteEntity(table, isTableLockNeeded ?
+AcidUtils.isTableSoftDeleteEnabled(table, conf) ?
+WriteEntity.WriteType.DDL_EXCL_WRITE : 
WriteEntity.WriteType.DDL_EXCLUSIVE :
+WriteEntity.WriteType.DDL_NO_LOCK));

Review Comment:
   refactored
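
For reference, the nested ternary above could be flattened along these lines (an illustrative rewrite using the identifiers visible in the diff, not necessarily the exact refactor that was pushed):

```java
// Illustrative only; mirrors the logic of the nested ternary above.
WriteEntity.WriteType lockType;
if (!isTableLockNeeded) {
  lockType = WriteEntity.WriteType.DDL_NO_LOCK;
} else if (AcidUtils.isTableSoftDeleteEnabled(table, conf)) {
  lockType = WriteEntity.WriteType.DDL_EXCL_WRITE;
} else {
  lockType = WriteEntity.WriteType.DDL_EXCLUSIVE;
}
outputs.add(new WriteEntity(table, lockType));
```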





Issue Time Tracking
---

Worklog Id: (was: 763483)
Time Spent: 2h 20m  (was: 2h 10m)

> Non blocking DROP DATABASE implementation
> -
>
> Key: HIVE-26149
> URL: https://issues.apache.org/jira/browse/HIVE-26149
> Project: Hive
>  Issue Type: Task
>Reporter: Denys Kuzmenko
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26183) Create delete writer for the UPDATE statements

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26183?focusedWorklogId=763479=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763479
 ]

ASF GitHub Bot logged work on HIVE-26183:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 12:37
Start Date: 28/Apr/22 12:37
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on code in PR #3251:
URL: https://github.com/apache/hive/pull/3251#discussion_r860837013


##
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergBufferedDeleteWriter.java:
##
@@ -0,0 +1,166 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.io.IOException;
+import java.util.Collection;
+import java.util.Map;
+import java.util.TreeSet;
+import java.util.concurrent.ConcurrentLinkedQueue;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Writable;
+import org.apache.iceberg.DeleteFile;
+import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.MetadataColumns;
+import org.apache.iceberg.PartitionKey;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.data.GenericRecord;
+import org.apache.iceberg.data.InternalRecordWrapper;
+import org.apache.iceberg.data.Record;
+import org.apache.iceberg.deletes.PositionDelete;
+import org.apache.iceberg.io.ClusteredPositionDeleteWriter;
+import org.apache.iceberg.io.DeleteWriteResult;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.FileWriterFactory;
+import org.apache.iceberg.io.OutputFileFactory;
+import org.apache.iceberg.io.PartitioningWriter;
+import org.apache.iceberg.mr.mapred.Container;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import 
org.apache.iceberg.relocated.com.google.common.util.concurrent.ThreadFactoryBuilder;
+import org.apache.iceberg.util.Tasks;
+import org.roaringbitmap.longlong.Roaring64Bitmap;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class HiveIcebergBufferedDeleteWriter implements HiveIcebergWriter {
+  private static final Logger LOG = 
LoggerFactory.getLogger(HiveIcebergBufferedDeleteWriter.class);
+
+  private static final String DELETE_FILE_THREAD_POOL_SIZE = 
"iceberg.delete.file.thread.pool.size";
+  private static final int DELETE_FILE_THREAD_POOL_SIZE_DEFAULT = 10;
+
+  // Storing deleted data in a map Partition -> FileName -> BitMap
+  private final Map<PartitionKey, Map<String, Roaring64Bitmap>> buffer = Maps.newHashMap();
+  private final Map<Integer, PartitionSpec> specs;
+  private final Map<PartitionKey, PartitionSpec> keyToSpec = Maps.newHashMap();
+  private final FileFormat format;
+  private final FileWriterFactory<Record> writerFactory;
+  private final OutputFileFactory fileFactory;
+  private final FileIO io;
+  private final long targetFileSize;
+  private final Configuration configuration;
+  private final Record record;
+  private final InternalRecordWrapper wrapper;
+  private FilesForCommit filesForCommit;
+
+  HiveIcebergBufferedDeleteWriter(Schema schema, Map<Integer, PartitionSpec> specs, FileFormat format,
+      FileWriterFactory<Record> writerFactory, OutputFileFactory fileFactory, FileIO io, long targetFileSize,
+      Configuration configuration) {
+this.specs = specs;
+this.format = format;
+this.writerFactory = writerFactory;
+this.fileFactory = fileFactory;
+this.io = io;
+this.targetFileSize = targetFileSize;
+this.configuration = configuration;
+this.wrapper = new InternalRecordWrapper(schema.asStruct());
+this.record = GenericRecord.create(schema);
+  }
+
+  @Override
+  public void write(Writable row) throws IOException {
+Record rec = ((Container<Record>) row).get();
+IcebergAcidUtil.getOriginalFromUpdatedRecord(rec, record);
+String filePath = (String) rec.getField(MetadataColumns.FILE_PATH.name());
+int specId = IcebergAcidUtil.parseSpecId(rec);
+
+Map<String, Roaring64Bitmap> deleteMap =
+buffer.computeIfAbsent(partition(record, specId), key -> {
+  keyToSpec.put(key, specs.get(specId));
+  return Maps.newHashMap();
+});
+Roaring64Bitmap 

[jira] [Work logged] (HIVE-26183) Create delete writer for the UPDATE statements

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26183?focusedWorklogId=763481=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763481
 ]

ASF GitHub Bot logged work on HIVE-26183:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 12:37
Start Date: 28/Apr/22 12:37
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on code in PR #3251:
URL: https://github.com/apache/hive/pull/3251#discussion_r860837013


##
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergBufferedDeleteWriter.java:
##
@@ -0,0 +1,166 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.io.IOException;
+import java.util.Collection;
+import java.util.Map;
+import java.util.TreeSet;
+import java.util.concurrent.ConcurrentLinkedQueue;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Writable;
+import org.apache.iceberg.DeleteFile;
+import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.MetadataColumns;
+import org.apache.iceberg.PartitionKey;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.data.GenericRecord;
+import org.apache.iceberg.data.InternalRecordWrapper;
+import org.apache.iceberg.data.Record;
+import org.apache.iceberg.deletes.PositionDelete;
+import org.apache.iceberg.io.ClusteredPositionDeleteWriter;
+import org.apache.iceberg.io.DeleteWriteResult;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.FileWriterFactory;
+import org.apache.iceberg.io.OutputFileFactory;
+import org.apache.iceberg.io.PartitioningWriter;
+import org.apache.iceberg.mr.mapred.Container;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import 
org.apache.iceberg.relocated.com.google.common.util.concurrent.ThreadFactoryBuilder;
+import org.apache.iceberg.util.Tasks;
+import org.roaringbitmap.longlong.Roaring64Bitmap;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class HiveIcebergBufferedDeleteWriter implements HiveIcebergWriter {
+  private static final Logger LOG = 
LoggerFactory.getLogger(HiveIcebergBufferedDeleteWriter.class);
+
+  private static final String DELETE_FILE_THREAD_POOL_SIZE = 
"iceberg.delete.file.thread.pool.size";
+  private static final int DELETE_FILE_THREAD_POOL_SIZE_DEFAULT = 10;
+
+  // Storing deleted data in a map Partition -> FileName -> BitMap
+  private final Map<PartitionKey, Map<String, Roaring64Bitmap>> buffer = Maps.newHashMap();
+  private final Map<Integer, PartitionSpec> specs;
+  private final Map<PartitionKey, PartitionSpec> keyToSpec = Maps.newHashMap();
+  private final FileFormat format;
+  private final FileWriterFactory<Record> writerFactory;
+  private final OutputFileFactory fileFactory;
+  private final FileIO io;
+  private final long targetFileSize;
+  private final Configuration configuration;
+  private final Record record;
+  private final InternalRecordWrapper wrapper;
+  private FilesForCommit filesForCommit;
+
+  HiveIcebergBufferedDeleteWriter(Schema schema, Map<Integer, PartitionSpec> specs, FileFormat format,
+      FileWriterFactory<Record> writerFactory, OutputFileFactory fileFactory, FileIO io, long targetFileSize,
+      Configuration configuration) {
+this.specs = specs;
+this.format = format;
+this.writerFactory = writerFactory;
+this.fileFactory = fileFactory;
+this.io = io;
+this.targetFileSize = targetFileSize;
+this.configuration = configuration;
+this.wrapper = new InternalRecordWrapper(schema.asStruct());
+this.record = GenericRecord.create(schema);
+  }
+
+  @Override
+  public void write(Writable row) throws IOException {
+Record rec = ((Container<Record>) row).get();
+IcebergAcidUtil.getOriginalFromUpdatedRecord(rec, record);
+String filePath = (String) rec.getField(MetadataColumns.FILE_PATH.name());
+int specId = IcebergAcidUtil.parseSpecId(rec);
+
+Map<String, Roaring64Bitmap> deleteMap =
+buffer.computeIfAbsent(partition(record, specId), key -> {
+  keyToSpec.put(key, specs.get(specId));
+  return Maps.newHashMap();
+});
+Roaring64Bitmap 

[jira] [Work logged] (HIVE-26183) Create delete writer for the UPDATE statements

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26183?focusedWorklogId=763478=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763478
 ]

ASF GitHub Bot logged work on HIVE-26183:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 12:36
Start Date: 28/Apr/22 12:36
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on code in PR #3251:
URL: https://github.com/apache/hive/pull/3251#discussion_r860836342


##
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergBufferedDeleteWriter.java:
##
@@ -0,0 +1,166 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.io.IOException;
+import java.util.Collection;
+import java.util.Map;
+import java.util.TreeSet;
+import java.util.concurrent.ConcurrentLinkedQueue;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Writable;
+import org.apache.iceberg.DeleteFile;
+import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.MetadataColumns;
+import org.apache.iceberg.PartitionKey;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.data.GenericRecord;
+import org.apache.iceberg.data.InternalRecordWrapper;
+import org.apache.iceberg.data.Record;
+import org.apache.iceberg.deletes.PositionDelete;
+import org.apache.iceberg.io.ClusteredPositionDeleteWriter;
+import org.apache.iceberg.io.DeleteWriteResult;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.FileWriterFactory;
+import org.apache.iceberg.io.OutputFileFactory;
+import org.apache.iceberg.io.PartitioningWriter;
+import org.apache.iceberg.mr.mapred.Container;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import 
org.apache.iceberg.relocated.com.google.common.util.concurrent.ThreadFactoryBuilder;
+import org.apache.iceberg.util.Tasks;
+import org.roaringbitmap.longlong.Roaring64Bitmap;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class HiveIcebergBufferedDeleteWriter implements HiveIcebergWriter {

Review Comment:
   Can we add some javadoc detailing how its behaviour differs from the vanilla 
DeleteWriter?





Issue Time Tracking
---

Worklog Id: (was: 763478)
Time Spent: 20m  (was: 10m)

> Create delete writer for the UPDATE statements
> -
>
> Key: HIVE-26183
> URL: https://issues.apache.org/jira/browse/HIVE-26183
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Peter Vary
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> During the investigation of updates on partitioned tables we hit the 
> following issue:
> - Iceberg inserts need to be sorted by the new partition keys
> - Iceberg deletes need to be sorted by the old partition keys and 
> filenames
> These requirements can contradict each other. OTOH a Hive update creates a single query and 
> writes out the insert/delete records for every row. This would mean plenty of 
> open writers.
> We might want to create something like a 
> https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/io/SortedPosDeleteWriter.java,
>  but we do not want to keep the whole rows in memory.
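
One way to avoid keeping whole rows in memory is to buffer only (data file, row position) pairs per file, roughly as in this sketch (class and method names are illustrative, not the actual handler code):

{code:java}
import java.util.HashMap;
import java.util.Map;
import org.roaringbitmap.longlong.Roaring64Bitmap;

// Instead of buffering deleted rows, remember only the file path and row position,
// packed into a bitmap per data file, and flush them as position deletes on close.
class BufferedPositionDeletesSketch {
  private final Map<String, Roaring64Bitmap> positionsByFile = new HashMap<>();

  void markDeleted(String filePath, long position) {
    positionsByFile.computeIfAbsent(filePath, f -> new Roaring64Bitmap()).addLong(position);
  }

  Map<String, Roaring64Bitmap> buffered() {
    return positionsByFile;
  }
}
{code}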



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HIVE-26179) In tez reuse container mode, asyncInitOperations are not clear.

2022-04-28 Thread zhengchenyu (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529404#comment-17529404
 ] 

zhengchenyu commented on HIVE-26179:


[~zabetak] I use our internal version based on hive-1.2.1 and don't have a 
4.0.0-alpha-1 environment. But reading the master source code, I found the same 
logical problem: asyncInitOperations needs to be cleared when the operator is closed, 
otherwise it leads to inconsistency in Tez container-reuse mode.

> In tez reuse container mode, asyncInitOperations are not clear.
> ---
>
> Key: HIVE-26179
> URL: https://issues.apache.org/jira/browse/HIVE-26179
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, Tez
>Affects Versions: 1.2.1
> Environment: engine: Tez (Note: tez.am.container.reuse.enabled is 
> true)
>  
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In our cluster, we found an error like this.
> {code:java}
> Vertex failed, vertexName=Map 1, vertexId=vertex_1650608671415_321290_1_11, 
> diagnostics=[Task failed, taskId=task_1650608671415_321290_1_11_000422, 
> diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( 
> failure ) : 
> attempt_1650608671415_321290_1_11_000422_0:java.lang.RuntimeException: 
> java.lang.RuntimeException: Hive Runtime Error while closing operators
>     at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
>     at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:135)
>     at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
>     at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>     at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>     at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
>     at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
>     at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>     at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
>     at 
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
>     at 
> com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.RuntimeException: Hive Runtime Error while closing 
> operators
>     at 
> org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.close(MapRecordProcessor.java:349)
>     at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:161)
>     ... 16 more
> Caused by: java.lang.NullPointerException
>     at 
> org.apache.hadoop.hive.ql.exec.MapJoinOperator.closeOp(MapJoinOperator.java:488)
>     at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:684)
>     at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:698)
>     at 
> org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.close(MapRecordProcessor.java:338)
>     ... 17 more
> {code}
> When Tez container reuse is enabled and MapJoinOperator is used, an NPE is thrown 
> if different attempts of the same task execute in the same container.
> While debugging, I found that the second task attempt uses the first attempt's 
> asyncInitOperations. asyncInitOperations is not cleared when the operator is closed, so the 
> second task attempt may use the first attempt's mapJoinTables whose 
> HybridHashTableContainer.HashPartition is already closed, which throws the NPE.
> We must clear asyncInitOperations when the operator is closed.
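
A minimal sketch of the proposed fix (field and method names are illustrative, mirroring the description rather than the real Operator code):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Future;

// Per-operator async-init bookkeeping must be dropped on close; otherwise the next
// task attempt running in the same reused Tez container sees stale, already-closed state.
class AsyncInitOperatorSketch {
  private final List<Future<?>> asyncInitOperations = new ArrayList<>();

  void registerAsyncInit(Future<?> asyncWork) {
    asyncInitOperations.add(asyncWork);
  }

  void closeOp() {
    // The fix suggested in this ticket: clear the list when the operator is closed.
    asyncInitOperations.clear();
  }
}
{code}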



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HIVE-26184) COLLECT_SET with GROUP BY is very slow when some keys are highly skewed

2022-04-28 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529389#comment-17529389
 ] 

okumin commented on HIVE-26184:
---

[~kgyrtkirk] 

Thanks. I wanted to say that something goes wrong when a highly skewed key (100,000 
rows of `----` in the example) coexists with 
non-skewed keys (5,000,000 unique UUIDs). I also put a comment in the PR; 
please feel free to ask me if that doesn't make sense.

> COLLECT_SET with GROUP BY is very slow when some keys are highly skewed
> ---
>
> Key: HIVE-26184
> URL: https://issues.apache.org/jira/browse/HIVE-26184
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 2.3.8, 3.1.3
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> I observed some reducers spend 98% of their CPU time invoking 
> `java.util.HashMap#clear`.
> Looking at the details, I found COLLECT_SET reuses a LinkedHashSet and its 
> `clear` can be quite heavy when a relation has a small number of highly 
> skewed keys.
>  
> To reproduce the issue, first, we will create rows with a skewed key.
> {code:java}
> INSERT INTO test_collect_set
> SELECT '----' AS key, CAST(UUID() AS VARCHAR) 
> AS value
> FROM table_with_many_rows
> LIMIT 10;{code}
> Then, we will create many non-skewed rows.
> {code:java}
> INSERT INTO test_collect_set
> SELECT UUID() AS key, UUID() AS value
> FROM table_with_many_rows
> LIMIT 500;{code}
> We can observe the issue when we aggregate values by `key`.
> {code:java}
> SELECT key, COLLECT_SET(value) FROM group_by_skew GROUP BY key{code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26184) COLLECT_SET with GROUP BY is very slow when some keys are highly skewed

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26184?focusedWorklogId=763448=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763448
 ]

ASF GitHub Bot logged work on HIVE-26184:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 11:44
Start Date: 28/Apr/22 11:44
Worklog Time Spent: 10m 
  Work Description: okumin commented on code in PR #3253:
URL: https://github.com/apache/hive/pull/3253#discussion_r860788021


##
ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMkCollectionEvaluator.java:
##
@@ -95,11 +95,27 @@ public MkArrayAggregationBuffer() {
 throw new RuntimeException("Buffer type unknown");
   }
 }
+
+private void reset() {
+  if (bufferType == BufferType.LIST) {
+container.clear();
+  } else if (bufferType == BufferType.SET) {
+// Don't reuse a container because HashSet#clear can be very slow. The 
operation takes O(N)

Review Comment:
   @kgyrtkirk Thanks for your quick review!
   
   I meant the skew of GROUP BY keys here, not of the elements in the HashSet. Let me 
illustrate that with the following query, which maps articles to their comments. 
If a certain article accidentally gets very popular, it has many more comments 
than the others. That is the kind of situation I mean by `skew`.
   
   ```
   SELECT article_id, COLLECT_SET(comment) FROM comments GROUP BY article_id
   ```
   
   The capacity of the internal hash table of 
`MkArrayAggregationBuffer#container` will eventually grow large enough to 
retain all comments tied to the most skewed article seen so far. Also, the internal 
hash table will never get smaller, because resizing happens only when new 
entries are added (precisely speaking, this depends on the JDK implementation).
   By the nature of hash tables, the duration of `HashSet#clear` depends on 
the capacity of the internal hash table: it is an operation that fills every cell 
with nulls.
   
   Because of these two points, GroupByOperator suddenly slows down once it 
processes a skewed key. For example, assuming the first `article_id=1` has 
1,000,000 comments, `GenericUDAFMkCollectionEvaluator#reset` has to fill a very 
big hash table with nulls every time, even if all following 
articles (`article_id=2`, `article_id=3`, ...) have 0 comments.
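
A small runnable sketch of the effect described above (assuming standard OpenJDK HashSet/HashMap behaviour, where clear() walks the whole backing array):

```java
import java.util.HashSet;
import java.util.Set;

public class ClearCostSketch {
  public static void main(String[] args) {
    // Simulate one highly skewed key: the reused set grows to ~1M entries once...
    Set<String> reused = new HashSet<>();
    for (int i = 0; i < 1_000_000; i++) {
      reused.add("value-" + i);
    }
    // ...so every later clear() still walks the large backing table,
    // even if the following keys only contribute a handful of elements.
    long start = System.nanoTime();
    reused.clear();
    System.out.println("clear() on a grown set: " + (System.nanoTime() - start) + " ns");

    // Allocating a fresh set per key (the proposed fix) never pays for the old capacity.
    Set<String> fresh = new HashSet<>();
    start = System.nanoTime();
    fresh.clear();
    System.out.println("clear() on a fresh set: " + (System.nanoTime() - start) + " ns");
  }
}
```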





Issue Time Tracking
---

Worklog Id: (was: 763448)
Time Spent: 40m  (was: 0.5h)

> COLLECT_SET with GROUP BY is very slow when some keys are highly skewed
> ---
>
> Key: HIVE-26184
> URL: https://issues.apache.org/jira/browse/HIVE-26184
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 2.3.8, 3.1.3
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> I observed some reducers spend 98% of their CPU time invoking 
> `java.util.HashMap#clear`.
> Looking at the details, I found COLLECT_SET reuses a LinkedHashSet and its 
> `clear` can be quite heavy when a relation has a small number of highly 
> skewed keys.
>  
> To reproduce the issue, first, we will create rows with a skewed key.
> {code:java}
> INSERT INTO test_collect_set
> SELECT '----' AS key, CAST(UUID() AS VARCHAR) 
> AS value
> FROM table_with_many_rows
> LIMIT 10;{code}
> Then, we will create many non-skewed rows.
> {code:java}
> INSERT INTO test_collect_set
> SELECT UUID() AS key, UUID() AS value
> FROM table_with_many_rows
> LIMIT 500;{code}
> We can observe the issue when we aggregate values by `key`.
> {code:java}
> SELECT key, COLLECT_SET(value) FROM group_by_skew GROUP BY key{code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HIVE-26184) COLLECT_SET with GROUP BY is very slow when some keys are highly skewed

2022-04-28 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin updated HIVE-26184:
--
Description: 
I observed some reducers spend 98% of their CPU time invoking 
`java.util.HashMap#clear`.

Looking at the details, I found COLLECT_SET reuses a LinkedHashSet and its `clear` 
can be quite heavy when a relation has a small number of highly skewed keys.

 

To reproduce the issue, first, we will create rows with a skewed key.
{code:java}
INSERT INTO test_collect_set
SELECT '----' AS key, CAST(UUID() AS VARCHAR) 
AS value
FROM table_with_many_rows
LIMIT 10;{code}
Then, we will create many non-skewed rows.
{code:java}
INSERT INTO test_collect_set
SELECT UUID() AS key, UUID() AS value
FROM table_with_many_rows
LIMIT 500;{code}
We can observe the issue when we aggregate values by `key`.
{code:java}
SELECT key, COLLECT_SET(value) FROM group_by_skew GROUP BY key{code}

  was:
I observed some reducers spend 98% of CPU time in invoking 
`java.util.HashMap#clear`.

Looking the detail, I found COLLECT_SET reuses a LinkedHashSet and its `clear` 
can be quite heavy when a relation has a small number of highly skewed keys.

 

To reproduce the issue, first, we will create rows with a skewed key.
{code:java}
INSERT INTO test_collect_set
SELECT '----' AS key, CAST(UUID() AS VARCHAR) 
AS value
FROM table_with_many_rows
LIMIT 10;{code}
Then, we will create many non-skewed rows.
{code:java}
INSERT INTO test_collect_set
SELECT UUID() AS key, UUID() AS value
FROM sample_datasets.nasdaq
LIMIT 500;{code}
We can observe the issue when we aggregate values by `key`.
{code:java}
SELECT key, COLLECT_SET(value) FROM group_by_skew GROUP BY key{code}


> COLLECT_SET with GROUP BY is very slow when some keys are highly skewed
> ---
>
> Key: HIVE-26184
> URL: https://issues.apache.org/jira/browse/HIVE-26184
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 2.3.8, 3.1.3
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I observed some reducers spend 98% of their CPU time invoking 
> `java.util.HashMap#clear`.
> Looking at the details, I found COLLECT_SET reuses a LinkedHashSet and its 
> `clear` can be quite heavy when a relation has a small number of highly 
> skewed keys.
>  
> To reproduce the issue, first, we will create rows with a skewed key.
> {code:java}
> INSERT INTO test_collect_set
> SELECT '----' AS key, CAST(UUID() AS VARCHAR) 
> AS value
> FROM table_with_many_rows
> LIMIT 10;{code}
> Then, we will create many non-skewed rows.
> {code:java}
> INSERT INTO test_collect_set
> SELECT UUID() AS key, UUID() AS value
> FROM table_with_many_rows
> LIMIT 500;{code}
> We can observe the issue when we aggregate values by `key`.
> {code:java}
> SELECT key, COLLECT_SET(value) FROM group_by_skew GROUP BY key{code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26184) COLLECT_SET with GROUP BY is very slow when some keys are highly skewed

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26184?focusedWorklogId=763444=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763444
 ]

ASF GitHub Bot logged work on HIVE-26184:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 11:42
Start Date: 28/Apr/22 11:42
Worklog Time Spent: 10m 
  Work Description: okumin commented on code in PR #3253:
URL: https://github.com/apache/hive/pull/3253#discussion_r860788021


##
ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMkCollectionEvaluator.java:
##
@@ -95,11 +95,27 @@ public MkArrayAggregationBuffer() {
 throw new RuntimeException("Buffer type unknown");
   }
 }
+
+private void reset() {
+  if (bufferType == BufferType.LIST) {
+container.clear();
+  } else if (bufferType == BufferType.SET) {
+// Don't reuse a container because HashSet#clear can be very slow. The 
operation takes O(N)

Review Comment:
   @kgyrtkirk Thanks for your quick review!
   
   I meant the skew of GROUP BY keys here, not of the elements in the HashSet. Let me 
illustrate that with the following query, which maps articles to their comments. 
If a certain article accidentally gets very popular, it has many more comments 
than the others. That is the kind of situation I mean by `skew`.
   
   ```
   SELECT article_id, COLLECT_SET(comment) FROM comments GROUP BY article_id
   ```
   
   The capacity of the internal hash table of 
`MkArrayAggregationBuffer#container` will eventually grow large enough to 
retain all comments tied to the most skewed article seen so far. Also, the internal 
hash table will never get smaller, because resizing happens only when new 
entries are added (precisely speaking, this depends on the JDK implementation).
   By the nature of hash tables, the duration of `HashSet#clear` depends on 
the capacity of the internal hash table: it is an operation that fills every cell 
with nulls.
   
   Because of these two points, GroupByOperator suddenly slows down once it 
processes a skewed key. For example, assuming the first `article_id=1` has 
1,000,000 comments, `GenericUDAFMkCollectionEvaluator#reset` has to fill a very 
big hash table with nulls every time, even if all following articles have 0 
comments.





Issue Time Tracking
---

Worklog Id: (was: 763444)
Time Spent: 0.5h  (was: 20m)

> COLLECT_SET with GROUP BY is very slow when some keys are highly skewed
> ---
>
> Key: HIVE-26184
> URL: https://issues.apache.org/jira/browse/HIVE-26184
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 2.3.8, 3.1.3
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I observed some reducers spend 98% of their CPU time invoking 
> `java.util.HashMap#clear`.
> Looking at the details, I found COLLECT_SET reuses a LinkedHashSet and its 
> `clear` can be quite heavy when a relation has a small number of highly 
> skewed keys.
>  
> To reproduce the issue, first, we will create rows with a skewed key.
> {code:java}
> INSERT INTO test_collect_set
> SELECT '----' AS key, CAST(UUID() AS VARCHAR) 
> AS value
> FROM table_with_many_rows
> LIMIT 10;{code}
> Then, we will create many non-skewed rows.
> {code:java}
> INSERT INTO test_collect_set
> SELECT UUID() AS key, UUID() AS value
> FROM sample_datasets.nasdaq
> LIMIT 500;{code}
> We can observe the issue when we aggregate values by `key`.
> {code:java}
> SELECT key, COLLECT_SET(value) FROM group_by_skew GROUP BY key{code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HIVE-26184) COLLECT_SET with GROUP BY is very slow when some keys are highly skewed

2022-04-28 Thread Zoltan Haindrich (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529368#comment-17529368
 ] 

Zoltan Haindrich commented on HIVE-26184:
-

because the value will be the same - I think collecting any number of them into 
a SET will not overload the bucket for that key - unless the hashCode of that UUID 
value is always the same constant...but in that case we should fix that, 
because it would slow down all the other operations as well, including `contains`.
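
A tiny, self-contained illustration of that last point (not Hive code; the class is made up): a constant `hashCode()` pushes every element into the same bucket, so every HashSet operation, not just `clear`, has to wade through all colliding entries.
{code:java}
import java.util.HashSet;
import java.util.Set;

class BadKey {
  final String value;
  BadKey(String value) { this.value = value; }
  @Override public int hashCode() { return 42; } // constant hash: everything collides
  @Override public boolean equals(Object o) {
    return o instanceof BadKey && ((BadKey) o).value.equals(value);
  }
}

public class ConstantHashDemo {
  public static void main(String[] args) {
    Set<BadKey> set = new HashSet<>();
    for (int i = 0; i < 10_000; i++) {
      set.add(new BadKey("v" + i)); // each add scans all previously colliding entries
    }
    System.out.println(set.contains(new BadKey("v0"))); // contains pays the same price
  }
}
{code}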

> COLLECT_SET with GROUP BY is very slow when some keys are highly skewed
> ---
>
> Key: HIVE-26184
> URL: https://issues.apache.org/jira/browse/HIVE-26184
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 2.3.8, 3.1.3
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I observed some reducers spend 98% of CPU time in invoking 
> `java.util.HashMap#clear`.
> Looking the detail, I found COLLECT_SET reuses a LinkedHashSet and its 
> `clear` can be quite heavy when a relation has a small number of highly 
> skewed keys.
>  
> To reproduce the issue, first, we will create rows with a skewed key.
> {code:java}
> INSERT INTO test_collect_set
> SELECT '----' AS key, CAST(UUID() AS VARCHAR) 
> AS value
> FROM table_with_many_rows
> LIMIT 10;{code}
> Then, we will create many non-skewed rows.
> {code:java}
> INSERT INTO test_collect_set
> SELECT UUID() AS key, UUID() AS value
> FROM sample_datasets.nasdaq
> LIMIT 500;{code}
> We can observe the issue when we aggregate values by `key`.
> {code:java}
> SELECT key, COLLECT_SET(value) FROM group_by_skew GROUP BY key{code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26184) COLLECT_SET with GROUP BY is very slow when some keys are highly skewed

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26184?focusedWorklogId=763422=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763422
 ]

ASF GitHub Bot logged work on HIVE-26184:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 10:52
Start Date: 28/Apr/22 10:52
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on code in PR #3253:
URL: https://github.com/apache/hive/pull/3253#discussion_r860749534


##
ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMkCollectionEvaluator.java:
##
@@ -95,11 +95,27 @@ public MkArrayAggregationBuffer() {
 throw new RuntimeException("Buffer type unknown");
   }
 }
+
+private void reset() {
+  if (bufferType == BufferType.LIST) {
+container.clear();
+  } else if (bufferType == BufferType.SET) {
+// Don't reuse a container because HashSet#clear can be very slow. The 
operation takes O(N)

Review Comment:
   why did the entries get skewed in the first place? are we missing, or do we have an 
incorrect implementation of, some `hashCode()` method?
   
   could you please add a testcase which reproduces the issue?
   maybe you could write a test against the UDF itself..





Issue Time Tracking
---

Worklog Id: (was: 763422)
Time Spent: 20m  (was: 10m)

> COLLECT_SET with GROUP BY is very slow when some keys are highly skewed
> ---
>
> Key: HIVE-26184
> URL: https://issues.apache.org/jira/browse/HIVE-26184
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 2.3.8, 3.1.3
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I observed some reducers spend 98% of CPU time in invoking 
> `java.util.HashMap#clear`.
> Looking the detail, I found COLLECT_SET reuses a LinkedHashSet and its 
> `clear` can be quite heavy when a relation has a small number of highly 
> skewed keys.
>  
> To reproduce the issue, first, we will create rows with a skewed key.
> {code:java}
> INSERT INTO test_collect_set
> SELECT '----' AS key, CAST(UUID() AS VARCHAR) 
> AS value
> FROM table_with_many_rows
> LIMIT 10;{code}
> Then, we will create many non-skewed rows.
> {code:java}
> INSERT INTO test_collect_set
> SELECT UUID() AS key, UUID() AS value
> FROM sample_datasets.nasdaq
> LIMIT 500;{code}
> We can observe the issue when we aggregate values by `key`.
> {code:java}
> SELECT key, COLLECT_SET(value) FROM group_by_skew GROUP BY key{code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26006) TopNKey and PTF with more than one column is failing with IOBE

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26006?focusedWorklogId=763419=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763419
 ]

ASF GitHub Bot logged work on HIVE-26006:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 10:40
Start Date: 28/Apr/22 10:40
Worklog Time Spent: 10m 
  Work Description: zabetak commented on code in PR #3082:
URL: https://github.com/apache/hive/pull/3082#discussion_r860699101


##
ql/src/java/org/apache/hadoop/hive/ql/optimizer/topnkey/TopNKeyPushdownProcessor.java:
##
@@ -244,13 +223,35 @@ private void pushdownThroughLeftOuterJoin(TopNKeyOperator 
topNKey) throws Semant
 reduceSinkDesc.getColumnExprMap(),
 reduceSinkDesc.getOrder(),
 reduceSinkDesc.getNullOrder());
+
+pushDownThrough(commonKeyPrefix, topNKey, join, reduceSinkOperator);
+  }
+
+  private  void pushDownThrough(
+  CommonKeyPrefix commonKeyPrefix, TopNKeyOperator topNKey, 
Operator operator)
+  throws SemanticException {
+
+pushDownThrough(commonKeyPrefix, topNKey, operator, operator);
+  }
+
+  private  void pushDownThrough(
+  CommonKeyPrefix commonKeyPrefix, TopNKeyOperator topNKey,
+  Operator join, Operator reduceSinkOperator)
+  throws SemanticException {
+
+final TopNKeyDesc topNKeyDesc = topNKey.getConf();
 if (commonKeyPrefix.isEmpty() || commonKeyPrefix.size() == 
topNKeyDesc.getPartitionKeyColumns().size()) {
   return;
 }
 
+final TopNKeyDesc newTopNKeyDesc = topNKeyDesc.combine(commonKeyPrefix);
+if (newTopNKeyDesc.getKeyColumns().size() > 0 &&
+newTopNKeyDesc.getKeyColumns().size() <= 
newTopNKeyDesc.getPartitionKeyColumns().size()) {

Review Comment:
   Do we need to create the new `TopNKeyDesc` to do this check? Don't we already 
have all the info?
   
   Can you add more comments on why we need to bail out?
   
   Do we have test coverage for this case? In other words, does an existing test 
enter this new if statement?



##
ql/src/java/org/apache/hadoop/hive/ql/plan/TopNKeyDesc.java:
##
@@ -252,7 +252,8 @@ public TopNKeyDescExplainVectorization 
getTopNKeyVectorization() {
   public TopNKeyDesc combine(CommonKeyPrefix commonKeyPrefix) {
 return new TopNKeyDesc(topN, commonKeyPrefix.getMappedOrder(),
 commonKeyPrefix.getMappedNullOrder(), 
commonKeyPrefix.getMappedColumns(),
-commonKeyPrefix.getMappedColumns().subList(0, 
partitionKeyColumns.size()),
+commonKeyPrefix.getMappedColumns()
+.subList(0, Math.min(partitionKeyColumns.size(), 
commonKeyPrefix.getMappedColumns().size())),

Review Comment:
   This is the main part of the fix, right? The rest is mostly refactoring to 
take advantage of the new bail-out condition?
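
   For context, a minimal standalone illustration of what the `Math.min` guard prevents 
(the values below are made up; the real lists come from the PTF and the common key prefix): 
`subList` throws `IndexOutOfBoundsException` whenever `toIndex` exceeds the list size, 
which is exactly the `toIndex = 2` failure quoted in the issue description.

   ```java
   import java.util.Arrays;
   import java.util.List;

   public class SubListGuardDemo {
     public static void main(String[] args) {
       List<String> mappedColumns = Arrays.asList("_col0"); // only 1 mapped common key column
       int partitionKeyCount = 2;                           // but the PTF partitions on 2 columns

       // Unguarded: mappedColumns.subList(0, partitionKeyCount)
       //   -> java.lang.IndexOutOfBoundsException: toIndex = 2

       // Guarded, as in the change under review:
       List<String> prefix =
           mappedColumns.subList(0, Math.min(partitionKeyCount, mappedColumns.size()));
       System.out.println(prefix); // [_col0]
     }
   }
   ```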



##
ql/src/test/queries/clientpositive/ptf_tnk.q:
##
@@ -0,0 +1,22 @@
+CREATE EXTERNAL TABLE t1(

Review Comment:
   Would it be possible to also load some data and verify that the results of 
the query are correct?



##
ql/src/java/org/apache/hadoop/hive/ql/optimizer/topnkey/TopNKeyPushdownProcessor.java:
##
@@ -244,13 +223,35 @@ private void pushdownThroughLeftOuterJoin(TopNKeyOperator 
topNKey) throws Semant
 reduceSinkDesc.getColumnExprMap(),
 reduceSinkDesc.getOrder(),
 reduceSinkDesc.getNullOrder());
+
+pushDownThrough(commonKeyPrefix, topNKey, join, reduceSinkOperator);
+  }
+
+  private  void pushDownThrough(
+  CommonKeyPrefix commonKeyPrefix, TopNKeyOperator topNKey, 
Operator operator)
+  throws SemanticException {
+
+pushDownThrough(commonKeyPrefix, topNKey, operator, operator);
+  }
+
+  private  void pushDownThrough(
+  CommonKeyPrefix commonKeyPrefix, TopNKeyOperator topNKey,
+  Operator join, Operator reduceSinkOperator)

Review Comment:
   Are the operators here strictly a join and reduce sink? From the code I get 
the impression that there are more options. Should we pick more descriptive 
names?





Issue Time Tracking
---

Worklog Id: (was: 763419)
Time Spent: 20m  (was: 10m)

> TopNKey and PTF with more than one column is failing with IOBE
> --
>
> Key: HIVE-26006
> URL: https://issues.apache.org/jira/browse/HIVE-26006
> Project: Hive
>  Issue Type: Bug
>Reporter: Naresh P R
>Assignee: Krisztian Kasa
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> {code:java}
> java.lang.IndexOutOfBoundsException: toIndex = 2
> at java.util.ArrayList.subListRangeCheck(ArrayList.java:1014)
> at java.util.ArrayList.subList(ArrayList.java:1006)
> at 

[jira] [Resolved] (HIVE-26176) Create a new connection pool for compaction (CompactionTxnHandler)

2022-04-28 Thread Antal Sinkovits (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antal Sinkovits resolved HIVE-26176.

Resolution: Fixed

Pushed to master. Thanks for the review [~dkuzmenko] and [~pvary]

> Create a new connection pool for compaction (CompactionTxnHandler)
> --
>
> Key: HIVE-26176
> URL: https://issues.apache.org/jira/browse/HIVE-26176
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Antal Sinkovits
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26155) Create a new connection pool for compaction

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26155?focusedWorklogId=763414=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763414
 ]

ASF GitHub Bot logged work on HIVE-26155:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 10:33
Start Date: 28/Apr/22 10:33
Worklog Time Spent: 10m 
  Work Description: asinkovits merged PR #3223:
URL: https://github.com/apache/hive/pull/3223




Issue Time Tracking
---

Worklog Id: (was: 763414)
Time Spent: 1h 10m  (was: 1h)

> Create a new connection pool for compaction
> ---
>
> Key: HIVE-26155
> URL: https://issues.apache.org/jira/browse/HIVE-26155
> Project: Hive
>  Issue Type: Improvement
>  Components: Standalone Metastore
>Reporter: Antal Sinkovits
>Assignee: Antal Sinkovits
>Priority: Major
>  Labels: compaction, pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Currently the TxnHandler uses 2 connection pools to communicate with the HMS: 
> the default one and one for mutexing. If compaction is configured incorrectly 
> (e.g. too many Initiators are running on the same db) then compaction can use 
> up all the connections in the default connection pool and all user queries 
> can get stuck.
> We should have a separate connection pool (configurable size) just for 
> compaction-related activities.
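
For readers not familiar with the pooling setup, a minimal sketch of a dedicated, separately sized pool, using HikariCP only as an example pooling library. The JDBC URL, credentials, and pool size below are purely illustrative and are not Hive's actual configuration keys or defaults.
{code:java}
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class CompactionPoolSketch {
  public static void main(String[] args) {
    // A pool reserved for compaction-related metastore DB traffic, sized
    // independently of the default pool so Initiator/Worker/Cleaner activity
    // cannot exhaust the connections that user queries depend on.
    HikariConfig config = new HikariConfig();
    config.setJdbcUrl("jdbc:postgresql://localhost:5432/metastore"); // illustrative
    config.setUsername("hive");                                      // illustrative
    config.setPassword("secret");                                    // illustrative
    config.setMaximumPoolSize(5);             // the "configurable size" from the proposal
    config.setPoolName("compaction-pool");

    try (HikariDataSource compactionPool = new HikariDataSource(config)) {
      // hand compactionPool to the compaction-specific TxnHandler instead of the shared pool
    }
  }
}
{code}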



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26107) Worker shouldn't inject duplicate entries in `ready for cleaning` state into the compaction queue

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26107?focusedWorklogId=763397=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763397
 ]

ASF GitHub Bot logged work on HIVE-26107:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 09:55
Start Date: 28/Apr/22 09:55
Worklog Time Spent: 10m 
  Work Description: deniskuzZ commented on code in PR #3172:
URL: https://github.com/apache/hive/pull/3172#discussion_r860674098


##
ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Worker.java:
##
@@ -197,6 +198,9 @@ static void gatherStats(CompactionInfo ci, HiveConf conf, 
String userName, Strin
   statusUpdaterConf.set(TezConfiguration.TEZ_QUEUE_NAME, 
compactionQueueName);
 }
 SessionState sessionState = 
DriverUtils.setUpSessionState(statusUpdaterConf, userName, true);
+Map<String, String> hiveVariables = sessionState.getHiveVariables();
+hiveVariables.put(Constants.INSIDE_COMPACTION_TRANSACTION_FLAG, 
"true");
+sessionState.setHiveVariables(hiveVariables);

Review Comment:
   could we create sessionState.setHiveVariable() method in SessionState?
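
   For illustration, a possible shape for that convenience method - hypothetical and 
sketched outside of Hive; `SessionStateSketch` and the variable key are made-up names, 
only the `getHiveVariables`/`setHiveVariables` accessors are taken from the diff above:

   ```java
   import java.util.HashMap;
   import java.util.Map;

   class SessionStateSketch {
     private Map<String, String> hiveVariables = new HashMap<>();

     Map<String, String> getHiveVariables() { return hiveVariables; }
     void setHiveVariables(Map<String, String> vars) { this.hiveVariables = vars; }

     // The proposed helper: callers set a single variable in one call instead of
     // fetching the whole map, mutating it, and setting it back.
     void setHiveVariable(String name, String value) {
       Map<String, String> vars = getHiveVariables();
       vars.put(name, value);
       setHiveVariables(vars);
     }
   }

   public class SetHiveVariableDemo {
     public static void main(String[] args) {
       SessionStateSketch sessionState = new SessionStateSketch();
       sessionState.setHiveVariable("inside.compaction.txn", "true"); // illustrative key
       System.out.println(sessionState.getHiveVariables());
     }
   }
   ```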





Issue Time Tracking
---

Worklog Id: (was: 763397)
Time Spent: 2.5h  (was: 2h 20m)

> Worker shouldn't inject duplicate entries in `ready for cleaning` state into 
> the compaction queue
> -
>
> Key: HIVE-26107
> URL: https://issues.apache.org/jira/browse/HIVE-26107
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Végh
>Assignee: László Végh
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> How to reproduce:
> 1) create an acid table and load some data ;
> 2) manually trigger the compaction for the table several times;
> 4) inspect compaction_queue: There are multiple entries in 'ready for 
> cleaning' state for the same table.
>  
> Expected behavior: All compaction request after the first one should be 
> rejected until the table is changed again.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26107) Worker shouldn't inject duplicate entries in `ready for cleaning` state into the compaction queue

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26107?focusedWorklogId=763393=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763393
 ]

ASF GitHub Bot logged work on HIVE-26107:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 09:49
Start Date: 28/Apr/22 09:49
Worklog Time Spent: 10m 
  Work Description: deniskuzZ commented on code in PR #3172:
URL: https://github.com/apache/hive/pull/3172#discussion_r860697805


##
ql/src/java/org/apache/hadoop/hive/ql/DriverTxnHandler.java:
##
@@ -303,8 +303,15 @@ void setWriteIdForAcidFileSinks() throws 
SemanticException, LockException {
 
   private void allocateWriteIdForAcidAnalyzeTable() throws LockException {
 if (driverContext.getPlan().getAcidAnalyzeTable() != null) {
+  //Inside a compaction transaction, only stats gathering is running which 
is not requiring a new write id,
+  //and for duplicate compaction detection it is necessary to not 
increment it.
+  boolean isWithinCompactionTxn = 
Boolean.parseBoolean(SessionState.get().getHiveVariables().get(Constants.INSIDE_COMPACTION_TRANSACTION_FLAG));
   Table table = driverContext.getPlan().getAcidAnalyzeTable().getTable();
-  driverContext.getTxnManager().getTableWriteId(table.getDbName(), 
table.getTableName());
+  if(isWithinCompactionTxn) {

Review Comment:
   could be replaced with  if (driverContext.getCompactionWriteIds() != null) 





Issue Time Tracking
---

Worklog Id: (was: 763393)
Time Spent: 2h 20m  (was: 2h 10m)

> Worker shouldn't inject duplicate entries in `ready for cleaning` state into 
> the compaction queue
> -
>
> Key: HIVE-26107
> URL: https://issues.apache.org/jira/browse/HIVE-26107
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Végh
>Assignee: László Végh
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> How to reproduce:
> 1) create an acid table and load some data ;
> 2) manually trigger the compaction for the table several times;
> 4) inspect compaction_queue: There are multiple entries in 'ready for 
> cleaning' state for the same table.
>  
> Expected behavior: All compaction request after the first one should be 
> rejected until the table is changed again.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26107) Worker shouldn't inject duplicate entries in `ready for cleaning` state into the compaction queue

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26107?focusedWorklogId=763391=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763391
 ]

ASF GitHub Bot logged work on HIVE-26107:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 09:47
Start Date: 28/Apr/22 09:47
Worklog Time Spent: 10m 
  Work Description: deniskuzZ commented on code in PR #3172:
URL: https://github.com/apache/hive/pull/3172#discussion_r860693661


##
ql/src/java/org/apache/hadoop/hive/ql/DriverTxnHandler.java:
##
@@ -303,8 +303,15 @@ void setWriteIdForAcidFileSinks() throws 
SemanticException, LockException {
 
   private void allocateWriteIdForAcidAnalyzeTable() throws LockException {
 if (driverContext.getPlan().getAcidAnalyzeTable() != null) {
+  //Inside a compaction transaction, only stats gathering is running which 
is not requiring a new write id,
+  //and for duplicate compaction detection it is necessary to not 
increment it.
+  boolean isWithinCompactionTxn = 
Boolean.parseBoolean(SessionState.get().getHiveVariables().get(Constants.INSIDE_COMPACTION_TRANSACTION_FLAG));
   Table table = driverContext.getPlan().getAcidAnalyzeTable().getTable();
-  driverContext.getTxnManager().getTableWriteId(table.getDbName(), 
table.getTableName());
+  if(isWithinCompactionTxn) {
+
driverContext.getTxnManager().allocateMaxTableWriteId(table.getDbName(), 
table.getTableName());

Review Comment:
   instead of this could we supply compaction HWM here to avoid db call? they 
should be already present in DriverContext
   
   driverContext.setCompactionWriteIds(compactionWriteIds);
   





Issue Time Tracking
---

Worklog Id: (was: 763391)
Time Spent: 2h 10m  (was: 2h)

> Worker shouldn't inject duplicate entries in `ready for cleaning` state into 
> the compaction queue
> -
>
> Key: HIVE-26107
> URL: https://issues.apache.org/jira/browse/HIVE-26107
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Végh
>Assignee: László Végh
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> How to reproduce:
> 1) create an acid table and load some data ;
> 2) manually trigger the compaction for the table several times;
> 4) inspect compaction_queue: There are multiple entries in 'ready for 
> cleaning' state for the same table.
>  
> Expected behavior: All compaction request after the first one should be 
> rejected until the table is changed again.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26107) Worker shouldn't inject duplicate entries in `ready for cleaning` state into the compaction queue

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26107?focusedWorklogId=763390=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763390
 ]

ASF GitHub Bot logged work on HIVE-26107:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 09:44
Start Date: 28/Apr/22 09:44
Worklog Time Spent: 10m 
  Work Description: deniskuzZ commented on code in PR #3172:
URL: https://github.com/apache/hive/pull/3172#discussion_r860693661


##
ql/src/java/org/apache/hadoop/hive/ql/DriverTxnHandler.java:
##
@@ -303,8 +303,15 @@ void setWriteIdForAcidFileSinks() throws 
SemanticException, LockException {
 
   private void allocateWriteIdForAcidAnalyzeTable() throws LockException {
 if (driverContext.getPlan().getAcidAnalyzeTable() != null) {
+  //Inside a compaction transaction, only stats gathering is running which 
is not requiring a new write id,
+  //and for duplicate compaction detection it is necessary to not 
increment it.
+  boolean isWithinCompactionTxn = 
Boolean.parseBoolean(SessionState.get().getHiveVariables().get(Constants.INSIDE_COMPACTION_TRANSACTION_FLAG));
   Table table = driverContext.getPlan().getAcidAnalyzeTable().getTable();
-  driverContext.getTxnManager().getTableWriteId(table.getDbName(), 
table.getTableName());
+  if(isWithinCompactionTxn) {
+
driverContext.getTxnManager().allocateMaxTableWriteId(table.getDbName(), 
table.getTableName());

Review Comment:
   instead of this could we supply compaction HWM here to avoid db call?





Issue Time Tracking
---

Worklog Id: (was: 763390)
Time Spent: 2h  (was: 1h 50m)

> Worker shouldn't inject duplicate entries in `ready for cleaning` state into 
> the compaction queue
> -
>
> Key: HIVE-26107
> URL: https://issues.apache.org/jira/browse/HIVE-26107
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Végh
>Assignee: László Végh
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> How to reproduce:
> 1) create an acid table and load some data ;
> 2) manually trigger the compaction for the table several times;
> 4) inspect compaction_queue: There are multiple entries in 'ready for 
> cleaning' state for the same table.
>  
> Expected behavior: All compaction request after the first one should be 
> rejected until the table is changed again.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26107) Worker shouldn't inject duplicate entries in `ready for cleaning` state into the compaction queue

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26107?focusedWorklogId=763384=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763384
 ]

ASF GitHub Bot logged work on HIVE-26107:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 09:24
Start Date: 28/Apr/22 09:24
Worklog Time Spent: 10m 
  Work Description: veghlaci05 commented on code in PR #3172:
URL: https://github.com/apache/hive/pull/3172#discussion_r860675617


##
ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Worker.java:
##
@@ -197,6 +198,9 @@ static void gatherStats(CompactionInfo ci, HiveConf conf, 
String userName, Strin
   statusUpdaterConf.set(TezConfiguration.TEZ_QUEUE_NAME, 
compactionQueueName);
 }
 SessionState sessionState = 
DriverUtils.setUpSessionState(statusUpdaterConf, userName, true);
+Map<String, String> hiveVariables = sessionState.getHiveVariables();
+hiveVariables.put(Constants.INSIDE_COMPACTION_TRANSACTION_FLAG, 
"true");
+sessionState.setHiveVariables(hiveVariables);

Review Comment:
   Sure, I'll create it.





Issue Time Tracking
---

Worklog Id: (was: 763384)
Time Spent: 1h 50m  (was: 1h 40m)

> Worker shouldn't inject duplicate entries in `ready for cleaning` state into 
> the compaction queue
> -
>
> Key: HIVE-26107
> URL: https://issues.apache.org/jira/browse/HIVE-26107
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Végh
>Assignee: László Végh
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> How to reproduce:
> 1) create an acid table and load some data ;
> 2) manually trigger the compaction for the table several times;
> 4) inspect compaction_queue: There are multiple entries in 'ready for 
> cleaning' state for the same table.
>  
> Expected behavior: All compaction request after the first one should be 
> rejected until the table is changed again.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26107) Worker shouldn't inject duplicate entries in `ready for cleaning` state into the compaction queue

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26107?focusedWorklogId=763383=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763383
 ]

ASF GitHub Bot logged work on HIVE-26107:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 09:22
Start Date: 28/Apr/22 09:22
Worklog Time Spent: 10m 
  Work Description: deniskuzZ commented on code in PR #3172:
URL: https://github.com/apache/hive/pull/3172#discussion_r860674098


##
ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Worker.java:
##
@@ -197,6 +198,9 @@ static void gatherStats(CompactionInfo ci, HiveConf conf, 
String userName, Strin
   statusUpdaterConf.set(TezConfiguration.TEZ_QUEUE_NAME, 
compactionQueueName);
 }
 SessionState sessionState = 
DriverUtils.setUpSessionState(statusUpdaterConf, userName, true);
+Map<String, String> hiveVariables = sessionState.getHiveVariables();
+hiveVariables.put(Constants.INSIDE_COMPACTION_TRANSACTION_FLAG, 
"true");
+sessionState.setHiveVariables(hiveVariables);

Review Comment:
   could we create sessionState.setHiveVariable() method in SessionState?





Issue Time Tracking
---

Worklog Id: (was: 763383)
Time Spent: 1h 40m  (was: 1.5h)

> Worker shouldn't inject duplicate entries in `ready for cleaning` state into 
> the compaction queue
> -
>
> Key: HIVE-26107
> URL: https://issues.apache.org/jira/browse/HIVE-26107
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Végh
>Assignee: László Végh
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> How to reproduce:
> 1) create an acid table and load some data ;
> 2) manually trigger the compaction for the table several times;
> 4) inspect compaction_queue: There are multiple entries in 'ready for 
> cleaning' state for the same table.
>  
> Expected behavior: All compaction request after the first one should be 
> rejected until the table is changed again.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26149) Non blocking DROP DATABASE implementation

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26149?focusedWorklogId=763382=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763382
 ]

ASF GitHub Bot logged work on HIVE-26149:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 09:20
Start Date: 28/Apr/22 09:20
Worklog Time Spent: 10m 
  Work Description: pvary commented on code in PR #3220:
URL: https://github.com/apache/hive/pull/3220#discussion_r860672337


##
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java:
##
@@ -1869,8 +1880,7 @@ void dropDatabase(String catName, String dbName, boolean 
deleteData, boolean ign
* @throws MetaException something went wrong, usually either in the RDBMS 
or storage.
* @throws TException general thrift error.
*/
-  default void dropDatabase(String catName, String dbName, boolean deleteData,
-boolean ignoreUnknownDb)
+  default void dropDatabase(String catName, String dbName, boolean deleteData, 
boolean ignoreUnknownDb)

Review Comment:
   Nit: I think this could be deprecated as well
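
   For illustration, the nit amounts to something like this (a sketch only; the interface 
name is shortened and the preferred replacement overload is assumed, not taken from the PR):

   ```java
   interface MetaStoreClientSketch {

     /** @deprecated prefer the newer dropDatabase variant introduced by this change. */
     @Deprecated
     default void dropDatabase(String catName, String dbName, boolean deleteData, boolean ignoreUnknownDb) {
       // delegate to the preferred overload here
     }
   }
   ```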





Issue Time Tracking
---

Worklog Id: (was: 763382)
Time Spent: 2h 10m  (was: 2h)

> Non blocking DROP DATABASE implementation
> -
>
> Key: HIVE-26149
> URL: https://issues.apache.org/jira/browse/HIVE-26149
> Project: Hive
>  Issue Type: Task
>Reporter: Denys Kuzmenko
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HIVE-26179) In tez reuse container mode, asyncInitOperations are not clear.

2022-04-28 Thread Stamatis Zampetakis (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529334#comment-17529334
 ] 

Stamatis Zampetakis commented on HIVE-26179:


In which version did you reproduce the problem? The stack trace in the summary 
does not seem to correspond to current master or 4.0.0-alpha-1 release.

Were you able to reproduce the problem also with 4.0.0-alpha-1?

Is there a minimal sequence of steps that can be used to reproduce the problem?





> In tez reuse container mode, asyncInitOperations are not clear.
> ---
>
> Key: HIVE-26179
> URL: https://issues.apache.org/jira/browse/HIVE-26179
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, Tez
>Affects Versions: 1.2.1
> Environment: engine: Tez (Note: tez.am.container.reuse.enabled is 
> true)
>  
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In our cluster, we found error like this.
> {code:java}
> Vertex failed, vertexName=Map 1, vertexId=vertex_1650608671415_321290_1_11, 
> diagnostics=[Task failed, taskId=task_1650608671415_321290_1_11_000422, 
> diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( 
> failure ) : 
> attempt_1650608671415_321290_1_11_000422_0:java.lang.RuntimeException: 
> java.lang.RuntimeException: Hive Runtime Error while closing operators
>     at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
>     at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:135)
>     at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
>     at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>     at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>     at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
>     at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
>     at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>     at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
>     at 
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
>     at 
> com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.RuntimeException: Hive Runtime Error while closing 
> operators
>     at 
> org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.close(MapRecordProcessor.java:349)
>     at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:161)
>     ... 16 more
> Caused by: java.lang.NullPointerException
>     at 
> org.apache.hadoop.hive.ql.exec.MapJoinOperator.closeOp(MapJoinOperator.java:488)
>     at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:684)
>     at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:698)
>     at 
> org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.close(MapRecordProcessor.java:338)
>     ... 17 more
> {code}
> When tez container reuse is enabled and a MapJoinOperator is used, an NPE is thrown 
> if different attempts of the same task execute in the same container.
> While debugging, I found that the second task attempt uses the first attempt's 
> asyncInitOperations. asyncInitOperations is not cleared when the operator is closed, 
> so the second attempt may use the first attempt's mapJoinTables whose 
> HybridHashTableContainer.HashPartition is already closed, which throws the NPE.
> We must clear asyncInitOperations when the operator is closed.
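
A minimal sketch of the proposed fix (field and method names follow the description above; simplified and not the actual patch): clear the async-init bookkeeping in the operator's close path so a reused container cannot hand a later task attempt the previous attempt's objects.
{code:java}
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Future;

// Sketch only: the relevant part of the operator close path.
abstract class OperatorSketch {
  protected final Set<Future<?>> asyncInitOperations = new HashSet<>();

  protected abstract void closeOp(boolean abort) throws Exception;

  public void close(boolean abort) throws Exception {
    try {
      closeOp(abort);
    } finally {
      // Forget async-init state from this task attempt so a reused Tez container
      // cannot hand the next attempt already-closed hash tables via stale futures.
      asyncInitOperations.clear();
    }
  }
}
{code}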



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26149) Non blocking DROP DATABASE implementation

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26149?focusedWorklogId=763381=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763381
 ]

ASF GitHub Bot logged work on HIVE-26149:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 09:16
Start Date: 28/Apr/22 09:16
Worklog Time Spent: 10m 
  Work Description: pvary commented on code in PR #3220:
URL: https://github.com/apache/hive/pull/3220#discussion_r860668405


##
ql/src/java/org/apache/hadoop/hive/ql/ddl/database/drop/DropDatabaseAnalyzer.java:
##
@@ -49,28 +52,36 @@ public void analyzeInternal(ASTNode root) throws 
SemanticException {
 String databaseName = unescapeIdentifier(root.getChild(0).getText());
 boolean ifExists = root.getFirstChildWithType(HiveParser.TOK_IFEXISTS) != 
null;
 boolean cascade = root.getFirstChildWithType(HiveParser.TOK_CASCADE) != 
null;
+boolean isSoftDelete = HiveConf.getBoolVar(conf, 
HiveConf.ConfVars.HIVE_ACID_LOCKLESS_READS_ENABLED);
 
 Database database = getDatabase(databaseName, !ifExists);
 if (database == null) {
   return;
 }
-
 // if cascade=true, then we need to authorize the drop table action as 
well, and add the tables to the outputs
+boolean allTablesWithSuffix = false;
 if (cascade) {
   try {
-for (Table table : db.getAllTableObjects(databaseName)) {
-  // We want no lock here, as the database lock will cover the tables,
-  // and putting a lock will actually cause us to deadlock on 
ourselves.
-  outputs.add(new WriteEntity(table, 
WriteEntity.WriteType.DDL_NO_LOCK));
+List tables = db.getAllTableObjects(databaseName);
+allTablesWithSuffix = tables.stream().allMatch(
+table -> AcidUtils.isTableSoftDeleteEnabled(table, conf));
+for (Table table : tables) {
+  // Optimization used to limit number of requested locks. Check if 
table lock is needed or we could get away with single DB level lock,
+  boolean isTableLockNeeded = isSoftDelete && !allTablesWithSuffix;
+  outputs.add(new WriteEntity(table, isTableLockNeeded ?
+AcidUtils.isTableSoftDeleteEnabled(table, conf) ?
+WriteEntity.WriteType.DDL_EXCL_WRITE : 
WriteEntity.WriteType.DDL_EXCLUSIVE :
+WriteEntity.WriteType.DDL_NO_LOCK));

Review Comment:
   Would this be better:
   ```
   WriteEntity.WriteType lockType = WriteEntity.WriteType.DDL_NO_LOCK;
   if (isTableLockNeeded) {
     lockType = AcidUtils.isTableSoftDeleteEnabled(table, conf)
         ? WriteEntity.WriteType.DDL_EXCL_WRITE
         : WriteEntity.WriteType.DDL_EXCLUSIVE;
   }
   outputs.add(new WriteEntity(table, lockType));
   ```
   
   I think having too many `:` and `?` is really hard to read.





Issue Time Tracking
---

Worklog Id: (was: 763381)
Time Spent: 2h  (was: 1h 50m)

> Non blocking DROP DATABASE implementation
> -
>
> Key: HIVE-26149
> URL: https://issues.apache.org/jira/browse/HIVE-26149
> Project: Hive
>  Issue Type: Task
>Reporter: Denys Kuzmenko
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26184) COLLECT_SET with GROUP BY is very slow when some keys are highly skewed

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26184?focusedWorklogId=763368=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763368
 ]

ASF GitHub Bot logged work on HIVE-26184:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 08:39
Start Date: 28/Apr/22 08:39
Worklog Time Spent: 10m 
  Work Description: okumin opened a new pull request, #3253:
URL: https://github.com/apache/hive/pull/3253

   ### What changes were proposed in this pull request?
   This would reduce the time complexity of `COLLECT_SET` from `O({maximum 
length} * {num rows})` into `O({maximum length} + {num rows})`.
   
   https://issues.apache.org/jira/browse/HIVE-26184
   
   ### Why are the changes needed?
   I'm observing some reducers take much time due to this issue.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   I have run the reproduction case in HIVE-26184 with this patch and confirmed 
the reduce vertex finished more than 30x faster.




Issue Time Tracking
---

Worklog Id: (was: 763368)
Remaining Estimate: 0h
Time Spent: 10m

> COLLECT_SET with GROUP BY is very slow when some keys are highly skewed
> ---
>
> Key: HIVE-26184
> URL: https://issues.apache.org/jira/browse/HIVE-26184
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 2.3.8, 3.1.3
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I observed some reducers spend 98% of CPU time in invoking 
> `java.util.HashMap#clear`.
> Looking the detail, I found COLLECT_SET reuses a LinkedHashSet and its 
> `clear` can be quite heavy when a relation has a small number of highly 
> skewed keys.
>  
> To reproduce the issue, first, we will create rows with a skewed key.
> {code:java}
> INSERT INTO test_collect_set
> SELECT '----' AS key, CAST(UUID() AS VARCHAR) 
> AS value
> FROM table_with_many_rows
> LIMIT 10;{code}
> Then, we will create many non-skewed rows.
> {code:java}
> INSERT INTO test_collect_set
> SELECT UUID() AS key, UUID() AS value
> FROM sample_datasets.nasdaq
> LIMIT 500;{code}
> We can observe the issue when we aggregate values by `key`.
> {code:java}
> SELECT key, COLLECT_SET(value) FROM group_by_skew GROUP BY key{code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HIVE-26184) COLLECT_SET with GROUP BY is very slow when some keys are highly skewed

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-26184:
--
Labels: pull-request-available  (was: )

> COLLECT_SET with GROUP BY is very slow when some keys are highly skewed
> ---
>
> Key: HIVE-26184
> URL: https://issues.apache.org/jira/browse/HIVE-26184
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 2.3.8, 3.1.3
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I observed some reducers spend 98% of CPU time in invoking 
> `java.util.HashMap#clear`.
> Looking the detail, I found COLLECT_SET reuses a LinkedHashSet and its 
> `clear` can be quite heavy when a relation has a small number of highly 
> skewed keys.
>  
> To reproduce the issue, first, we will create rows with a skewed key.
> {code:java}
> INSERT INTO test_collect_set
> SELECT '----' AS key, CAST(UUID() AS VARCHAR) 
> AS value
> FROM table_with_many_rows
> LIMIT 10;{code}
> Then, we will create many non-skewed rows.
> {code:java}
> INSERT INTO test_collect_set
> SELECT UUID() AS key, UUID() AS value
> FROM sample_datasets.nasdaq
> LIMIT 500;{code}
> We can observe the issue when we aggregate values by `key`.
> {code:java}
> SELECT key, COLLECT_SET(value) FROM group_by_skew GROUP BY key{code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-25758) OOM due to recursive application of CBO rules

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25758?focusedWorklogId=763340=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763340
 ]

ASF GitHub Bot logged work on HIVE-25758:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 08:01
Start Date: 28/Apr/22 08:01
Worklog Time Spent: 10m 
  Work Description: pvary merged PR #3252:
URL: https://github.com/apache/hive/pull/3252




Issue Time Tracking
---

Worklog Id: (was: 763340)
Time Spent: 4h 10m  (was: 4h)

> OOM due to recursive application of CBO rules
> -
>
> Key: HIVE-25758
> URL: https://issues.apache.org/jira/browse/HIVE-25758
> Project: Hive
>  Issue Type: Bug
>  Components: CBO, Query Planning
>Affects Versions: 4.0.0
>Reporter: Alessandro Solimando
>Assignee: Alessandro Solimando
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0-alpha-2
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
>  
> Reproducing query is as follows:
> {code:java}
> create table test1 (act_nbr string);
> create table test2 (month int);
> create table test3 (mth int, con_usd double);
> EXPLAIN
>SELECT c.month,
>   d.con_usd
>FROM
>  (SELECT 
> cast(regexp_replace(substr(add_months(from_unixtime(unix_timestamp(), 
> '-MM-dd'), -1), 1, 7), '-', '') AS int) AS month
>   FROM test1
>   UNION ALL
>   SELECT month
>   FROM test2
>   WHERE month = 202110) c
>JOIN test3 d ON c.month = d.mth; {code}
>  
> Different plans are generated during the first CBO steps, last being:
> {noformat}
> 2021-12-01T08:28:08,598 DEBUG [a18191bb-3a2b-4193-9abf-4e37dd1996bb main] 
> parse.CalcitePlanner: Plan after decorre
> lation:
> HiveProject(month=[$0], con_usd=[$2])
>   HiveJoin(condition=[=($0, $1)], joinType=[inner], algorithm=[none], 
> cost=[not available])
>     HiveProject(month=[$0])
>       HiveUnion(all=[true])
>         
> HiveProject(month=[CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-d
> d':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 7), 
> _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-
> 16LE", _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER])
>           HiveTableScan(table=[[default, test1]], table:alias=[test1])
>         HiveProject(month=[$0])
>           HiveFilter(condition=[=($0, CAST(202110):INTEGER)])
>             HiveTableScan(table=[[default, test2]], table:alias=[test2])
>     HiveTableScan(table=[[default, test3]], table:alias=[d]){noformat}
>  
> Then, the HEP planner will keep expanding the filter expression with 
> redundant expressions, such as the following, where the identical CAST 
> expression is present multiple times:
>  
> {noformat}
> rel#118:HiveFilter.HIVE.[].any(input=HepRelVertex#39,condition=IN(CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP, 
> _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> 202110)){noformat}
>  
> The problem seems to come from a bad interaction of at least 
> _HiveFilterProjectTransposeRule_ and 
> {_}HiveJoinPushTransitivePredicatesRule{_}, possibly more.
> Most probably the UNION part can be removed and the reproducer simplified 
> even further.
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (HIVE-26184) COLLECT_SET with GROUP BY is very slow when some keys are highly skewed

2022-04-28 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin reassigned HIVE-26184:
-


> COLLECT_SET with GROUP BY is very slow when some keys are highly skewed
> ---
>
> Key: HIVE-26184
> URL: https://issues.apache.org/jira/browse/HIVE-26184
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 3.1.3, 2.3.8
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>
> I observed some reducers spend 98% of CPU time in invoking 
> `java.util.HashMap#clear`.
> Looking the detail, I found COLLECT_SET reuses a LinkedHashSet and its 
> `clear` can be quite heavy when a relation has a small number of highly 
> skewed keys.
>  
> To reproduce the issue, first, we will create rows with a skewed key.
> {code:java}
> INSERT INTO test_collect_set
> SELECT '----' AS key, CAST(UUID() AS VARCHAR) 
> AS value
> FROM table_with_many_rows
> LIMIT 10;{code}
> Then, we will create many non-skewed rows.
> {code:java}
> INSERT INTO test_collect_set
> SELECT UUID() AS key, UUID() AS value
> FROM sample_datasets.nasdaq
> LIMIT 500;{code}
> We can observe the issue when we aggregate values by `key`.
> {code:java}
> SELECT key, COLLECT_SET(value) FROM group_by_skew GROUP BY key{code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-25758) OOM due to recursive application of CBO rules

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25758?focusedWorklogId=763335=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763335
 ]

ASF GitHub Bot logged work on HIVE-25758:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 07:40
Start Date: 28/Apr/22 07:40
Worklog Time Spent: 10m 
  Work Description: asolimando opened a new pull request, #3252:
URL: https://github.com/apache/hive/pull/3252

   Fixing broken javadoc
   
   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   




Issue Time Tracking
---

Worklog Id: (was: 763335)
Time Spent: 4h  (was: 3h 50m)

> OOM due to recursive application of CBO rules
> -
>
> Key: HIVE-25758
> URL: https://issues.apache.org/jira/browse/HIVE-25758
> Project: Hive
>  Issue Type: Bug
>  Components: CBO, Query Planning
>Affects Versions: 4.0.0
>Reporter: Alessandro Solimando
>Assignee: Alessandro Solimando
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0-alpha-2
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
>  
> Reproducing query is as follows:
> {code:java}
> create table test1 (act_nbr string);
> create table test2 (month int);
> create table test3 (mth int, con_usd double);
> EXPLAIN
>SELECT c.month,
>   d.con_usd
>FROM
>  (SELECT 
> cast(regexp_replace(substr(add_months(from_unixtime(unix_timestamp(), 
> '-MM-dd'), -1), 1, 7), '-', '') AS int) AS month
>   FROM test1
>   UNION ALL
>   SELECT month
>   FROM test2
>   WHERE month = 202110) c
>JOIN test3 d ON c.month = d.mth; {code}
>  
> Different plans are generated during the first CBO steps, last being:
> {noformat}
> 2021-12-01T08:28:08,598 DEBUG [a18191bb-3a2b-4193-9abf-4e37dd1996bb main] 
> parse.CalcitePlanner: Plan after decorre
> lation:
> HiveProject(month=[$0], con_usd=[$2])
>   HiveJoin(condition=[=($0, $1)], joinType=[inner], algorithm=[none], 
> cost=[not available])
>     HiveProject(month=[$0])
>       HiveUnion(all=[true])
>         
> HiveProject(month=[CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-d
> d':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 7), 
> _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-
> 16LE", _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER])
>           HiveTableScan(table=[[default, test1]], table:alias=[test1])
>         HiveProject(month=[$0])
>           HiveFilter(condition=[=($0, CAST(202110):INTEGER)])
>             HiveTableScan(table=[[default, test2]], table:alias=[test2])
>     HiveTableScan(table=[[default, test3]], table:alias=[d]){noformat}
>  
> Then, the HEP planner will keep expanding the filter expression with 
> redundant expressions, such as the following, where the identical CAST 
> expression is present multiple times:
>  
> {noformat}
> rel#118:HiveFilter.HIVE.[].any(input=HepRelVertex#39,condition=IN(CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP, 
> _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> 202110)){noformat}
>  
> The problem seems to come from a bad interaction of at least 
> _HiveFilterProjectTransposeRule_ and 
> {_}HiveJoinPushTransitivePredicatesRule{_}, possibly more.
> Most probably the UNION part can be removed and the reproducer simplified 
> even further.
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26071) JWT authentication for Thrift over HTTP in HiveMetaStore

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26071?focusedWorklogId=763315=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763315
 ]

ASF GitHub Bot logged work on HIVE-26071:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 06:35
Start Date: 28/Apr/22 06:35
Worklog Time Spent: 10m 
  Work Description: dengzhhu653 commented on code in PR #3233:
URL: https://github.com/apache/hive/pull/3233#discussion_r860532560


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HmsThriftHttpServlet.java:
##
@@ -39,75 +48,119 @@ public class HmsThriftHttpServlet extends TServlet {
   .getLogger(HmsThriftHttpServlet.class);
 
   private static final String X_USER = MetaStoreUtils.USER_NAME_HTTP_HEADER;
-
   private final boolean isSecurityEnabled;
+  private final boolean jwtAuthEnabled;
+  public static final String AUTHORIZATION = "Authorization";
+  private JWTValidator jwtValidator;
+  private Configuration conf;
 
   public HmsThriftHttpServlet(TProcessor processor,
-  TProtocolFactory inProtocolFactory, TProtocolFactory outProtocolFactory) 
{
-super(processor, inProtocolFactory, outProtocolFactory);
-// This should ideally be reveiving an instance of the Configuration which 
is used for the check
+  TProtocolFactory protocolFactory, Configuration conf) {
+super(processor, protocolFactory);
+this.conf = conf;
 isSecurityEnabled = UserGroupInformation.isSecurityEnabled();
+if (MetastoreConf.getVar(conf,
+ConfVars.THRIFT_METASTORE_AUTHENTICATION).equalsIgnoreCase("jwt")) {
+  jwtAuthEnabled = true;
+} else {
+  jwtAuthEnabled = false;
+  jwtValidator = null;
+}
   }
 
-  public HmsThriftHttpServlet(TProcessor processor,
-  TProtocolFactory protocolFactory) {
-super(processor, protocolFactory);
-isSecurityEnabled = UserGroupInformation.isSecurityEnabled();
+  public void init() throws ServletException {
+super.init();
+if (jwtAuthEnabled) {
+  try {
+jwtValidator = new JWTValidator(this.conf);
+  } catch (Exception e) {
+throw new ServletException("Failed to initialize HmsThriftHttpServlet."
++ " Error: " + e);
+  }
+}
   }
 
   @Override
   protected void doPost(HttpServletRequest request,
   HttpServletResponse response) throws ServletException, IOException {
-
-Enumeration<String> headerNames = request.getHeaderNames();
 if (LOG.isDebugEnabled()) {
-  LOG.debug("Logging headers in request");
+  LOG.debug(" Logging headers in doPost request");
+  Enumeration<String> headerNames = request.getHeaderNames();
   while (headerNames.hasMoreElements()) {
 String headerName = headerNames.nextElement();
 LOG.debug("Header: [{}], Value: [{}]", headerName,
 request.getHeader(headerName));
   }
 }
-String userFromHeader = request.getHeader(X_USER);
-if (userFromHeader == null || userFromHeader.isEmpty()) {
-  LOG.error("No user header: {} found", X_USER);
-  response.sendError(HttpServletResponse.SC_FORBIDDEN,
-  "Header: " + X_USER + " missing in the request");
-  return;
-}
-
-// TODO: These should ideally be in some kind of a Cache with Weak 
referencse.
-// If HMS were to set up some kind of a session, this would go into the 
session by having
-// this filter work with a custom Processor / or set the username into the 
session
-// as is done for HS2.
-// In case of HMS, it looks like each request is independent, and there is 
no session
-// information, so the UGI needs to be set up in the Connection layer 
itself.
-UserGroupInformation clientUgi;
-// Temporary, and useless for now. Here only to allow this to work on an 
otherwise kerberized
-// server.
-if (isSecurityEnabled) {
-  LOG.info("Creating proxy user for: {}", userFromHeader);
-  clientUgi = UserGroupInformation.createProxyUser(userFromHeader, 
UserGroupInformation.getLoginUser());
-} else {
-  LOG.info("Creating remote user for: {}", userFromHeader);
-  clientUgi = UserGroupInformation.createRemoteUser(userFromHeader);
+try {
+  String userFromHeader = extractUserName(request, response);
+  UserGroupInformation clientUgi;
+  // Temporary, and useless for now. Here only to allow this to work on an 
otherwise kerberized
+  // server.
+  if (isSecurityEnabled) {
+LOG.info("Creating proxy user for: {}", userFromHeader);
+clientUgi = UserGroupInformation.createProxyUser(userFromHeader, 
UserGroupInformation.getLoginUser());
+  } else {
+LOG.info("Creating remote user for: {}", userFromHeader);
+clientUgi = UserGroupInformation.createRemoteUser(userFromHeader);
+  }
+  PrivilegedExceptionAction action = new 
PrivilegedExceptionAction() {
+@Override
+

[jira] [Work logged] (HIVE-26071) JWT authentication for Thrift over HTTP in HiveMetaStore

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26071?focusedWorklogId=763312=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763312
 ]

ASF GitHub Bot logged work on HIVE-26071:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 06:27
Start Date: 28/Apr/22 06:27
Worklog Time Spent: 10m 
  Work Description: dengzhhu653 commented on PR #3233:
URL: https://github.com/apache/hive/pull/3233#issuecomment-796249

   > @dengzhhu653 @yongzhi @harishjp Could you please review this PR please? 
Thank you
   
   Thank you for informing me. Some minor comments, others look good to me!




Issue Time Tracking
---

Worklog Id: (was: 763312)
Time Spent: 1h 40m  (was: 1.5h)

> JWT authentication for Thrift over HTTP in HiveMetaStore
> 
>
> Key: HIVE-26071
> URL: https://issues.apache.org/jira/browse/HIVE-26071
> Project: Hive
>  Issue Type: New Feature
>  Components: Standalone Metastore
>Reporter: Sourabh Goyal
>Assignee: Sourabh Goyal
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> HIVE-25575 recently added a support for JWT authentication in HS2. This Jira 
> aims to add the same feature in HMS



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26071) JWT authentication for Thrift over HTTP in HiveMetaStore

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26071?focusedWorklogId=763311=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763311
 ]

ASF GitHub Bot logged work on HIVE-26071:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 06:25
Start Date: 28/Apr/22 06:25
Worklog Time Spent: 10m 
  Work Description: dengzhhu653 commented on code in PR #3233:
URL: https://github.com/apache/hive/pull/3233#discussion_r860526502


##
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java:
##
@@ -605,27 +608,44 @@ private THttpClient createHttpClient(URI store, boolean 
useSSL) throws MetaExcep
 String path = MetaStoreUtils.getHttpPath(MetastoreConf.getVar(conf, 
ConfVars.THRIFT_HTTP_PATH));
String httpUrl = (useSSL ? "https://" : "http://") + store.getHost() + ":" 
+ store.getPort() + path;
 
-String user = MetastoreConf.getVar(conf, 
ConfVars.METASTORE_CLIENT_PLAIN_USERNAME);
-if (user == null || user.equals("")) {
-  try {
-LOG.debug("No username passed in config " + 
ConfVars.METASTORE_CLIENT_PLAIN_USERNAME.getHiveName() +
-". Trying to get the current user from UGI" );
-user = UserGroupInformation.getCurrentUser().getShortUserName();
-  } catch (IOException e) {
-throw new MetaException("Failed to get client username from UGI");
+HttpClientBuilder httpClientBuilder = HttpClientBuilder.create();
+String authType = MetastoreConf.getAsString(conf, 
ConfVars.METASTORE_CLIENT_AUTH_MODE).toLowerCase(

Review Comment:
   nit: the var `METASTORE_CLIENT_AUTH_MODE` is restricted to a small set of known 
modes, so maybe we do not need to localize (pass a Locale) here?
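
   For background on the localization concern, a tiny standalone illustration (the mode 
string below is chosen only because it contains the letter 'I'; whether it matches one of 
Hive's actual values is beside the point): default-locale lowercasing of an ASCII config 
value can change its meaning under e.g. the Turkish locale, while `Locale.ROOT` or 
`equalsIgnoreCase` behaves the same everywhere.

   ```java
   import java.util.Locale;

   public class LocaleLowerCaseDemo {
     public static void main(String[] args) {
       String configured = "PLAIN"; // illustrative value containing the letter 'I'

       // Turkish locale lowercases 'I' to dotless 'ı', so the comparison silently fails.
       System.out.println(configured.toLowerCase(new Locale("tr")).equals("plain")); // false

       // Locale-independent alternatives behave identically on every JVM locale.
       System.out.println(configured.toLowerCase(Locale.ROOT).equals("plain"));      // true
       System.out.println(configured.equalsIgnoreCase("plain"));                     // true
     }
   }
   ```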





Issue Time Tracking
---

Worklog Id: (was: 763311)
Time Spent: 1.5h  (was: 1h 20m)

> JWT authentication for Thrift over HTTP in HiveMetaStore
> 
>
> Key: HIVE-26071
> URL: https://issues.apache.org/jira/browse/HIVE-26071
> Project: Hive
>  Issue Type: New Feature
>  Components: Standalone Metastore
>Reporter: Sourabh Goyal
>Assignee: Sourabh Goyal
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> HIVE-25575 recently added a support for JWT authentication in HS2. This Jira 
> aims to add the same feature in HMS



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26071) JWT authentication for Thrift over HTTP in HiveMetaStore

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26071?focusedWorklogId=763306=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763306
 ]

ASF GitHub Bot logged work on HIVE-26071:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 06:07
Start Date: 28/Apr/22 06:07
Worklog Time Spent: 10m 
  Work Description: dengzhhu653 commented on code in PR #3233:
URL: https://github.com/apache/hive/pull/3233#discussion_r860516284


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HmsThriftHttpServlet.java:
##
@@ -39,75 +48,119 @@ public class HmsThriftHttpServlet extends TServlet {
   .getLogger(HmsThriftHttpServlet.class);
 
   private static final String X_USER = MetaStoreUtils.USER_NAME_HTTP_HEADER;
-
   private final boolean isSecurityEnabled;
+  private final boolean jwtAuthEnabled;
+  public static final String AUTHORIZATION = "Authorization";
+  private JWTValidator jwtValidator;
+  private Configuration conf;
 
   public HmsThriftHttpServlet(TProcessor processor,
-  TProtocolFactory inProtocolFactory, TProtocolFactory outProtocolFactory) 
{
-super(processor, inProtocolFactory, outProtocolFactory);
-// This should ideally be receiving an instance of the Configuration which is used for the check
+  TProtocolFactory protocolFactory, Configuration conf) {
+super(processor, protocolFactory);
+this.conf = conf;
 isSecurityEnabled = UserGroupInformation.isSecurityEnabled();
+if (MetastoreConf.getVar(conf,
+ConfVars.THRIFT_METASTORE_AUTHENTICATION).equalsIgnoreCase("jwt")) {
+  jwtAuthEnabled = true;
+} else {
+  jwtAuthEnabled = false;
+  jwtValidator = null;
+}
   }
 
-  public HmsThriftHttpServlet(TProcessor processor,
-  TProtocolFactory protocolFactory) {
-super(processor, protocolFactory);
-isSecurityEnabled = UserGroupInformation.isSecurityEnabled();
+  public void init() throws ServletException {
+super.init();
+if (jwtAuthEnabled) {
+  try {
+jwtValidator = new JWTValidator(this.conf);
+  } catch (Exception e) {
+throw new ServletException("Failed to initialize HmsThriftHttpServlet."
++ " Error: " + e);
+  }
+}
   }
 
   @Override
   protected void doPost(HttpServletRequest request,
   HttpServletResponse response) throws ServletException, IOException {
-
-Enumeration<String> headerNames = request.getHeaderNames();
 if (LOG.isDebugEnabled()) {
-  LOG.debug("Logging headers in request");
+  LOG.debug(" Logging headers in doPost request");
+  Enumeration<String> headerNames = request.getHeaderNames();
   while (headerNames.hasMoreElements()) {
 String headerName = headerNames.nextElement();
 LOG.debug("Header: [{}], Value: [{}]", headerName,
 request.getHeader(headerName));
   }
 }
-String userFromHeader = request.getHeader(X_USER);
-if (userFromHeader == null || userFromHeader.isEmpty()) {
-  LOG.error("No user header: {} found", X_USER);
-  response.sendError(HttpServletResponse.SC_FORBIDDEN,
-  "Header: " + X_USER + " missing in the request");
-  return;
-}
-
-// TODO: These should ideally be in some kind of a Cache with Weak references.
-// If HMS were to set up some kind of a session, this would go into the 
session by having
-// this filter work with a custom Processor / or set the username into the 
session
-// as is done for HS2.
-// In case of HMS, it looks like each request is independent, and there is 
no session
-// information, so the UGI needs to be set up in the Connection layer 
itself.
-UserGroupInformation clientUgi;
-// Temporary, and useless for now. Here only to allow this to work on an 
otherwise kerberized
-// server.
-if (isSecurityEnabled) {
-  LOG.info("Creating proxy user for: {}", userFromHeader);
-  clientUgi = UserGroupInformation.createProxyUser(userFromHeader, 
UserGroupInformation.getLoginUser());
-} else {
-  LOG.info("Creating remote user for: {}", userFromHeader);
-  clientUgi = UserGroupInformation.createRemoteUser(userFromHeader);
+try {
+  String userFromHeader = extractUserName(request, response);
+  UserGroupInformation clientUgi;
+  // Temporary, and useless for now. Here only to allow this to work on an 
otherwise kerberized
+  // server.
+  if (isSecurityEnabled) {
+LOG.info("Creating proxy user for: {}", userFromHeader);
+clientUgi = UserGroupInformation.createProxyUser(userFromHeader, 
UserGroupInformation.getLoginUser());
+  } else {
+LOG.info("Creating remote user for: {}", userFromHeader);
+clientUgi = UserGroupInformation.createRemoteUser(userFromHeader);
+  }
+  PrivilegedExceptionAction action = new 
PrivilegedExceptionAction() {
+@Override
+
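
For the JWT branch, the caller has to present the token over HTTP; a rough sketch of how a Thrift client could attach it, assuming the servlet accepts a standard bearer token in the Authorization header named above (the helper and the interceptor approach are illustrative, not part of the patch):
{noformat}
import org.apache.http.HttpRequestInterceptor;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.thrift.transport.THttpClient;
import org.apache.thrift.transport.TTransportException;

// Hypothetical helper: build an HTTP Thrift transport that sends
// "Authorization: Bearer <token>" on every request.
static THttpClient jwtHttpTransport(String httpUrl, String jwtToken) throws TTransportException {
  HttpClientBuilder builder = HttpClientBuilder.create();
  builder.addInterceptorFirst((HttpRequestInterceptor) (request, context) ->
      request.addHeader("Authorization", "Bearer " + jwtToken));
  return new THttpClient(httpUrl, builder.build());
}
{noformat}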

[jira] [Work logged] (HIVE-26071) JWT authentication for Thrift over HTTP in HiveMetaStore

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26071?focusedWorklogId=763305&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763305
 ]

ASF GitHub Bot logged work on HIVE-26071:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 06:06
Start Date: 28/Apr/22 06:06
Worklog Time Spent: 10m 
  Work Description: dengzhhu653 commented on code in PR #3233:
URL: https://github.com/apache/hive/pull/3233#discussion_r860516284


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HmsThriftHttpServlet.java:
##
@@ -39,75 +48,119 @@ public class HmsThriftHttpServlet extends TServlet {
   .getLogger(HmsThriftHttpServlet.class);
 
   private static final String X_USER = MetaStoreUtils.USER_NAME_HTTP_HEADER;
-
   private final boolean isSecurityEnabled;
+  private final boolean jwtAuthEnabled;
+  public static final String AUTHORIZATION = "Authorization";
+  private JWTValidator jwtValidator;
+  private Configuration conf;
 
   public HmsThriftHttpServlet(TProcessor processor,
-  TProtocolFactory inProtocolFactory, TProtocolFactory outProtocolFactory) 
{
-super(processor, inProtocolFactory, outProtocolFactory);
-// This should ideally be receiving an instance of the Configuration which is used for the check
+  TProtocolFactory protocolFactory, Configuration conf) {
+super(processor, protocolFactory);
+this.conf = conf;
 isSecurityEnabled = UserGroupInformation.isSecurityEnabled();
+if (MetastoreConf.getVar(conf,
+ConfVars.THRIFT_METASTORE_AUTHENTICATION).equalsIgnoreCase("jwt")) {
+  jwtAuthEnabled = true;
+} else {
+  jwtAuthEnabled = false;
+  jwtValidator = null;
+}
   }
 
-  public HmsThriftHttpServlet(TProcessor processor,
-  TProtocolFactory protocolFactory) {
-super(processor, protocolFactory);
-isSecurityEnabled = UserGroupInformation.isSecurityEnabled();
+  public void init() throws ServletException {
+super.init();
+if (jwtAuthEnabled) {
+  try {
+jwtValidator = new JWTValidator(this.conf);
+  } catch (Exception e) {
+throw new ServletException("Failed to initialize HmsThriftHttpServlet."
++ " Error: " + e);
+  }
+}
   }
 
   @Override
   protected void doPost(HttpServletRequest request,
   HttpServletResponse response) throws ServletException, IOException {
-
-Enumeration<String> headerNames = request.getHeaderNames();
 if (LOG.isDebugEnabled()) {
-  LOG.debug("Logging headers in request");
+  LOG.debug(" Logging headers in doPost request");
+  Enumeration<String> headerNames = request.getHeaderNames();
   while (headerNames.hasMoreElements()) {
 String headerName = headerNames.nextElement();
 LOG.debug("Header: [{}], Value: [{}]", headerName,
 request.getHeader(headerName));
   }
 }
-String userFromHeader = request.getHeader(X_USER);
-if (userFromHeader == null || userFromHeader.isEmpty()) {
-  LOG.error("No user header: {} found", X_USER);
-  response.sendError(HttpServletResponse.SC_FORBIDDEN,
-  "Header: " + X_USER + " missing in the request");
-  return;
-}
-
-// TODO: These should ideally be in some kind of a Cache with Weak references.
-// If HMS were to set up some kind of a session, this would go into the 
session by having
-// this filter work with a custom Processor / or set the username into the 
session
-// as is done for HS2.
-// In case of HMS, it looks like each request is independent, and there is 
no session
-// information, so the UGI needs to be set up in the Connection layer 
itself.
-UserGroupInformation clientUgi;
-// Temporary, and useless for now. Here only to allow this to work on an 
otherwise kerberized
-// server.
-if (isSecurityEnabled) {
-  LOG.info("Creating proxy user for: {}", userFromHeader);
-  clientUgi = UserGroupInformation.createProxyUser(userFromHeader, 
UserGroupInformation.getLoginUser());
-} else {
-  LOG.info("Creating remote user for: {}", userFromHeader);
-  clientUgi = UserGroupInformation.createRemoteUser(userFromHeader);
+try {
+  String userFromHeader = extractUserName(request, response);
+  UserGroupInformation clientUgi;
+  // Temporary, and useless for now. Here only to allow this to work on an 
otherwise kerberized
+  // server.
+  if (isSecurityEnabled) {
+LOG.info("Creating proxy user for: {}", userFromHeader);
+clientUgi = UserGroupInformation.createProxyUser(userFromHeader, 
UserGroupInformation.getLoginUser());
+  } else {
+LOG.info("Creating remote user for: {}", userFromHeader);
+clientUgi = UserGroupInformation.createRemoteUser(userFromHeader);
+  }
+  PrivilegedExceptionAction action = new 
PrivilegedExceptionAction() {
+@Override
+

[jira] [Work logged] (HIVE-26071) JWT authentication for Thrift over HTTP in HiveMetaStore

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26071?focusedWorklogId=763304&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763304
 ]

ASF GitHub Bot logged work on HIVE-26071:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 06:02
Start Date: 28/Apr/22 06:02
Worklog Time Spent: 10m 
  Work Description: dengzhhu653 commented on code in PR #3233:
URL: https://github.com/apache/hive/pull/3233#discussion_r860514305


##
standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/TestRemoteHiveMetastoreWithHttpJwt.java:
##
@@ -0,0 +1,269 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.metastore;
+
+import com.github.tomakehurst.wiremock.junit.WireMockRule;
+import com.nimbusds.jose.JWSAlgorithm;
+import com.nimbusds.jose.JWSHeader;
+import com.nimbusds.jose.JWSSigner;
+import com.nimbusds.jose.crypto.RSASSASigner;
+import com.nimbusds.jose.jwk.RSAKey;
+import com.nimbusds.jwt.JWTClaimsSet;
+import com.nimbusds.jwt.SignedJWT;
+
+import java.io.File;
+import java.lang.reflect.Field;
+import java.lang.reflect.Modifier;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.metastore.annotation.MetastoreUnitTest;
+import org.apache.hadoop.hive.metastore.api.Database;
+import org.apache.hadoop.hive.metastore.conf.MetastoreConf;
+import org.apache.hadoop.hive.metastore.conf.MetastoreConf.ConfVars;
+import org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge;
+import org.apache.thrift.transport.TTransportException;
+import org.junit.AfterClass;
+import org.junit.Before;
+import org.junit.BeforeClass;
+import org.junit.ClassRule;
+import java.nio.charset.StandardCharsets;
+import java.util.Date;
+import java.util.UUID;
+import java.util.concurrent.TimeUnit;
+import org.junit.Ignore;
+import org.junit.Test;
+import org.junit.experimental.categories.Category;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import static com.github.tomakehurst.wiremock.client.WireMock.get;
+import static com.github.tomakehurst.wiremock.client.WireMock.ok;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertFalse;
+import static org.junit.Assert.assertTrue;
+
+@Category(MetastoreUnitTest.class)
+public class TestRemoteHiveMetastoreWithHttpJwt {
+  private static final Map<String, String> DEFAULTS = new HashMap<>(System.getenv());
+  private static Map<String, String> envMap;
+
+  private static String baseDir = System.getProperty("basedir");
+  private static final File jwtAuthorizedKeyFile =
+  new File(baseDir,"src/test/resources/auth/jwt/jwt-authorized-key.json");
+  private static final File jwtUnauthorizedKeyFile =
+  new 
File(baseDir,"src/test/resources/auth/jwt/jwt-unauthorized-key.json");
+  private static final File jwtVerificationJWKSFile =
+  new 
File(baseDir,"src/test/resources/auth/jwt/jwt-verification-jwks.json");
+
+  private static final String USER_1 = "HMS_TEST_USER_1";
+  private static final String TEST_DB_NAME_PREFIX = "HMS_JWT_AUTH_DB";
+  private static final Logger LOG = 
LoggerFactory.getLogger(TestRemoteHiveMetastoreWithHttpJwt.class);
+  //private static MiniHS2 miniHS2;
+
+  private static final int MOCK_JWKS_SERVER_PORT = 8089;
+  @ClassRule
+  public static final WireMockRule MOCK_JWKS_SERVER = new 
WireMockRule(MOCK_JWKS_SERVER_PORT);
+
+  /**
+   * This is a hack to make environment variables modifiable.
+   * Ref: 
https://stackoverflow.com/questions/318239/how-do-i-set-environment-variables-from-java.
+   */
+  @BeforeClass
+  public static void makeEnvModifiable() throws Exception {
+envMap = new HashMap<>();
+Class envClass = Class.forName("java.lang.ProcessEnvironment");
+Field theEnvironmentField = envClass.getDeclaredField("theEnvironment");
+Field theUnmodifiableEnvironmentField = 
envClass.getDeclaredField("theUnmodifiableEnvironment");
+removeStaticFinalAndSetValue(theEnvironmentField, envMap);
+
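
A condensed sketch of the token-minting step a test like this relies on, using only the Nimbus classes imported above; the claim set, token lifetime, and key-id handling are assumptions for illustration (fragment assumed to sit inside a test method declared with throws Exception):
{noformat}
// Hypothetical: mint a short-lived JWT for USER_1 signed with the "authorized" RSA key,
// so the server can verify it against the JWKS served by MOCK_JWKS_SERVER.
RSAKey rsaKey = RSAKey.parse(new String(
    Files.readAllBytes(jwtAuthorizedKeyFile.toPath()), StandardCharsets.UTF_8));
JWSSigner signer = new RSASSASigner(rsaKey);
JWTClaimsSet claims = new JWTClaimsSet.Builder()
    .subject(USER_1)
    .jwtID(UUID.randomUUID().toString())
    .issueTime(new Date())
    .expirationTime(new Date(System.currentTimeMillis() + TimeUnit.MINUTES.toMillis(5)))
    .build();
SignedJWT signedJWT = new SignedJWT(
    new JWSHeader.Builder(JWSAlgorithm.RS256).keyID(rsaKey.getKeyID()).build(), claims);
signedJWT.sign(signer);
String jwt = signedJWT.serialize();  // later sent as the Authorization bearer token
{noformat}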

[jira] [Work logged] (HIVE-26071) JWT authentication for Thrift over HTTP in HiveMetaStore

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26071?focusedWorklogId=763303&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763303
 ]

ASF GitHub Bot logged work on HIVE-26071:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 06:01
Start Date: 28/Apr/22 06:01
Worklog Time Spent: 10m 
  Work Description: dengzhhu653 commented on code in PR #3233:
URL: https://github.com/apache/hive/pull/3233#discussion_r860513871


##
standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/TestRemoteHiveMetastoreWithHttpJwt.java:
##
@@ -0,0 +1,269 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.metastore;
+
+import com.github.tomakehurst.wiremock.junit.WireMockRule;
+import com.nimbusds.jose.JWSAlgorithm;
+import com.nimbusds.jose.JWSHeader;
+import com.nimbusds.jose.JWSSigner;
+import com.nimbusds.jose.crypto.RSASSASigner;
+import com.nimbusds.jose.jwk.RSAKey;
+import com.nimbusds.jwt.JWTClaimsSet;
+import com.nimbusds.jwt.SignedJWT;
+
+import java.io.File;
+import java.lang.reflect.Field;
+import java.lang.reflect.Modifier;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.metastore.annotation.MetastoreUnitTest;
+import org.apache.hadoop.hive.metastore.api.Database;
+import org.apache.hadoop.hive.metastore.conf.MetastoreConf;
+import org.apache.hadoop.hive.metastore.conf.MetastoreConf.ConfVars;
+import org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge;
+import org.apache.thrift.transport.TTransportException;
+import org.junit.AfterClass;
+import org.junit.Before;
+import org.junit.BeforeClass;
+import org.junit.ClassRule;
+import java.nio.charset.StandardCharsets;
+import java.util.Date;
+import java.util.UUID;
+import java.util.concurrent.TimeUnit;
+import org.junit.Ignore;
+import org.junit.Test;
+import org.junit.experimental.categories.Category;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import static com.github.tomakehurst.wiremock.client.WireMock.get;
+import static com.github.tomakehurst.wiremock.client.WireMock.ok;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertFalse;
+import static org.junit.Assert.assertTrue;
+
+@Category(MetastoreUnitTest.class)
+public class TestRemoteHiveMetastoreWithHttpJwt {
+  private static final Map<String, String> DEFAULTS = new HashMap<>(System.getenv());
+  private static Map<String, String> envMap;
+
+  private static String baseDir = System.getProperty("basedir");
+  private static final File jwtAuthorizedKeyFile =
+  new File(baseDir,"src/test/resources/auth/jwt/jwt-authorized-key.json");
+  private static final File jwtUnauthorizedKeyFile =
+  new 
File(baseDir,"src/test/resources/auth/jwt/jwt-unauthorized-key.json");
+  private static final File jwtVerificationJWKSFile =
+  new 
File(baseDir,"src/test/resources/auth/jwt/jwt-verification-jwks.json");
+
+  private static final String USER_1 = "HMS_TEST_USER_1";
+  private static final String TEST_DB_NAME_PREFIX = "HMS_JWT_AUTH_DB";
+  private static final Logger LOG = 
LoggerFactory.getLogger(TestRemoteHiveMetastoreWithHttpJwt.class);
+  //private static MiniHS2 miniHS2;
+
+  private static final int MOCK_JWKS_SERVER_PORT = 8089;
+  @ClassRule
+  public static final WireMockRule MOCK_JWKS_SERVER = new 
WireMockRule(MOCK_JWKS_SERVER_PORT);
+
+  /**
+   * This is a hack to make environment variables modifiable.
+   * Ref: 
https://stackoverflow.com/questions/318239/how-do-i-set-environment-variables-from-java.
+   */
+  @BeforeClass
+  public static void makeEnvModifiable() throws Exception {
+envMap = new HashMap<>();
+Class envClass = Class.forName("java.lang.ProcessEnvironment");
+Field theEnvironmentField = envClass.getDeclaredField("theEnvironment");
+Field theUnmodifiableEnvironmentField = 
envClass.getDeclaredField("theUnmodifiableEnvironment");
+removeStaticFinalAndSetValue(theEnvironmentField, envMap);
+