[jira] [Work logged] (HIVE-23880) Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge

2020-08-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23880?focusedWorklogId=470581=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470581
 ]

ASF GitHub Bot logged work on HIVE-23880:
-

Author: ASF GitHub Bot
Created on: 14/Aug/20 05:37
Start Date: 14/Aug/20 05:37
Worklog Time Spent: 10m 
  Work Description: mustafaiman commented on pull request #1280:
URL: https://github.com/apache/hive/pull/1280#issuecomment-673894751


   @abstractdog  I missed that call. I think that covers it.
   Good work.
   +1



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 470581)
Time Spent: 8h  (was: 7h 50m)

> Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge
> ---
>
> Key: HIVE-23880
> URL: https://issues.apache.org/jira/browse/HIVE-23880
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
> Attachments: lipwig-output3605036885489193068.svg
>
>  Time Spent: 8h
>  Remaining Estimate: 0h
>
> Merging bloom filters in semijoin reduction can become the main bottleneck 
> when there is a large number of source mapper tasks (~1000, Map 1 in the 
> example below) and a large number of expected entries (50M) in the bloom filters.
> For example in TPCDS Q93:
> {code}
> select /*+ semi(store_returns, sr_item_sk, store_sales, 7000)*/ ss_customer_sk
>       ,sum(act_sales) sumsales
>   from (select ss_item_sk
>               ,ss_ticket_number
>               ,ss_customer_sk
>               ,case when sr_return_quantity is not null then (ss_quantity-sr_return_quantity)*ss_sales_price
>                     else (ss_quantity*ss_sales_price) end act_sales
>         from store_sales left outer join store_returns on (sr_item_sk = ss_item_sk
>                                                            and sr_ticket_number = ss_ticket_number)
>             ,reason
>         where sr_reason_sk = r_reason_sk
>           and r_reason_desc = 'reason 66') t
>   group by ss_customer_sk
>   order by sumsales, ss_customer_sk
> limit 100;
> {code}
> At 10TB-30TB scale, there is a chance that 1-2 mins out of a 3-4 min query 
> runtime are spent merging bloom filters (Reducer 2), as in:  
> [^lipwig-output3605036885489193068.svg] 
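> {code}
> -------------------------------------------------------------------------------
> VERTICES    MODE     STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> -------------------------------------------------------------------------------
> Map 3 ..    llap  SUCCEEDED      1          1        0        0       0       0
> Map 1 ..    llap  SUCCEEDED   1263       1263        0        0       0       0
> Reducer 2   llap    RUNNING      1          0        1        0       0       0
> Map 4       llap    RUNNING   6154          0      207     5947       0       0
> Reducer 5   llap     INITED     43          0        0       43       0       0
> Reducer 6   llap     INITED      1          0        0        1       0       0
> -------------------------------------------------------------------------------
> VERTICES: 02/06  [>>--] 16%   ELAPSED TIME: 149.98 s
> -------------------------------------------------------------------------------
> {code}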
> For example, 70M entries in a bloom filter lead to 436 465 696 bits, so 
> merging 1263 bloom filters means running ~ 1263 * 436 465 696 bitwise OR 
> operations, which is a very hot codepath, but one that can be parallelized.
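
The figures quoted above work out to 1263 * 436 465 696 = 551 256 174 048 bit-level ORs, or roughly 1/64 of that many operations when the bits are stored in 64-bit words. The following is a minimal sketch of how such a merge could be split across threads. It is not the code from the pull request: the class name, method name, and thread-pool setup are assumptions made for illustration, and each bloom filter is reduced to a plain `long[]` bit set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative only: a hypothetical stand-in, not Hive's VectorUDAFBloomFilterMerge. */
class ParallelBloomMergeSketch {

  /** ORs every source bit set into target, splitting the word range across threads. */
  static void mergeParallel(long[] target, List<long[]> sources, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      int chunk = (target.length + threads - 1) / threads; // ceiling division
      List<Future<?>> pending = new ArrayList<>();
      for (int t = 0; t < threads; t++) {
        final int from = Math.min(target.length, t * chunk);
        final int to = Math.min(target.length, from + chunk);
        // Each task owns a disjoint [from, to) slice of target, so no locking is needed.
        pending.add(pool.submit(() -> {
          for (long[] source : sources) {
            for (int i = from; i < to; i++) {
              target[i] |= source[i];
            }
          }
        }));
      }
      for (Future<?> f : pending) {
        try {
          f.get(); // wait for the slice and surface any failure
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException(e);
        } catch (ExecutionException e) {
          throw new IllegalStateException(e.getCause());
        }
      }
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    long[] merged = new long[4];
    List<long[]> sources = Arrays.asList(
        new long[] {1L, 0L, 0L, 8L},
        new long[] {4L, 2L, 0L, 1L});
    mergeParallel(merged, sources, 2);
    System.out.println(Arrays.toString(merged)); // [5, 2, 0, 9]
  }
}
```

Partitioning by word range rather than by source filter keeps every thread writing to its own slice of the target, which is why the sketch needs no synchronization beyond waiting for the tasks to complete.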



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23880) Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge

2020-08-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23880?focusedWorklogId=470580=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470580
 ]

ASF GitHub Bot logged work on HIVE-23880:
-

Author: ASF GitHub Bot
Created on: 14/Aug/20 05:37
Start Date: 14/Aug/20 05:37
Worklog Time Spent: 10m 
  Work Description: abstractdog commented on a change in pull request #1280:
URL: https://github.com/apache/hive/pull/1280#discussion_r470419947



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorGroupByOperator.java
##
@@ -1126,6 +1137,7 @@ protected void initializeOp(Configuration hconf) throws HiveException {
 VectorAggregateExpression vecAggrExpr = null;
 try {
   vecAggrExpr = ctor.newInstance(vecAggrDesc);
+  vecAggrExpr.withConf(hconf);

Review comment:
   Sadly, I have to agree that conf is abused in the (Hive) codebase :) Still, I 
don't really like the instanceof check here just for a single expression. 
Moreover, I wanted to find a general way to provide configuration to 
expressions, as this patch showed that they might need it in the future. On 
the other hand, explicitly calling a specific constructor for each type 
could serve as documentation, in one place, of "how to instantiate" these 
expressions. I'm going to refactor this logic into a separate method in 
VectorGroupByOperator and let this patch go!
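
The trade-off described in this comment can be sketched as follows. Everything here is hypothetical: the class names, the `withConf` default, and the configuration key are placeholders for illustration and are not taken from the Hive codebase or from the pull request.

```java
import java.util.Collections;
import java.util.Map;
import java.util.function.Supplier;

/** Illustrative only: hypothetical types, not Hive's VectorAggregateExpression hierarchy. */
class ExpressionConfigSketch {

  /** Hypothetical base class with a general configuration hook. */
  abstract static class AggrExpr {
    /** Default: the expression ignores configuration. */
    AggrExpr withConf(Map<String, String> conf) {
      return this;
    }

    abstract String describe();
  }

  static class SumExpr extends AggrExpr {
    @Override
    String describe() {
      return "sum";
    }
  }

  /** Only this expression actually reads a setting (the key is made up). */
  static class BloomMergeExpr extends AggrExpr {
    private int threads = 1;

    @Override
    AggrExpr withConf(Map<String, String> conf) {
      threads = Integer.parseInt(conf.getOrDefault("sketch.bloom.merge.threads", "1"));
      return this;
    }

    @Override
    String describe() {
      return "bloom-merge(threads=" + threads + ")";
    }
  }

  /** A single method that documents how expressions are instantiated. */
  static AggrExpr instantiate(Supplier<AggrExpr> ctor, Map<String, String> conf) {
    return ctor.get().withConf(conf);
  }

  public static void main(String[] args) {
    Map<String, String> conf = Collections.singletonMap("sketch.bloom.merge.threads", "4");
    System.out.println(instantiate(SumExpr::new, conf).describe());        // sum
    System.out.println(instantiate(BloomMergeExpr::new, conf).describe()); // bloom-merge(threads=4)
  }
}
```

With a general hook, the calling code needs no `instanceof` check. The cost is that every expression receives the whole configuration whether it uses it or not, which is the concern raised in the review.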







Issue Time Tracking
---

Worklog Id: (was: 470580)
Time Spent: 7h 50m  (was: 7h 40m)

> Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge
> ---
>
> Key: HIVE-23880
> URL: https://issues.apache.org/jira/browse/HIVE-23880
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
> Attachments: lipwig-output3605036885489193068.svg
>
>  Time Spent: 7h 50m
>  Remaining Estimate: 0h
>

[jira] [Work logged] (HIVE-23880) Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge

2020-08-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23880?focusedWorklogId=470577=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470577
 ]

ASF GitHub Bot logged work on HIVE-23880:
-

Author: ASF GitHub Bot
Created on: 14/Aug/20 05:29
Start Date: 14/Aug/20 05:29
Worklog Time Spent: 10m 
  Work Description: abstractdog commented on pull request #1280:
URL: https://github.com/apache/hive/pull/1280#issuecomment-673892379


   > @abstractdog
   > I am almost ok with this patch. However I still don't understand how this 
   > integrates with `ProcessingModeHashAggregate`. Since there are multiple 
   > VectorAggregationBufferRows in hash mode, I think we should `finish` each of 
   > them as we process them. Otherwise, we pass to the next operator in the 
   > pipeline without completing the bloom filter. Also, since hash mode dynamically 
   > allocates and frees VectorAggregationBufferRows, these `finish`es should happen 
   > as we deallocate each of them, rather than only at the end of the operator.
   
   Good point. I created this patch with a focus on finishing buffers 
correctly, and I think I've already taken care of this. Please take a look:
   
https://github.com/apache/hive/pull/1280/commits/0ada66534a937b8f4492d14f508903fa98402aed#diff-07c28d3f5c72db581b9cd4fa424a0ecbR675
   
   As you can see, I'm calling finish before every call to writeSingleRow. 
I'm assuming that writeSingleRow is the point where a buffer should be finished 
for writing. In ProcessingModeHashAggregate, the above part is enclosed in an 
iteration over the buffers in the flush method. Are you aware of any other 
places where I should finish a buffer?
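
The ordering described in this reply (finish each buffer immediately before it is written, inside the loop that flushes the buffers) can be sketched as below. The `Buffer` type and the method bodies are hypothetical stand-ins that only borrow the names `finish`, `writeSingleRow`, and `flush` from the discussion; they are not the code from the linked commit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: hypothetical types, not Hive's VectorGroupByOperator. */
class FinishBeforeWriteSketch {

  /** Hypothetical aggregation buffer whose pending work must complete before output. */
  static class Buffer {
    final String key;
    boolean finished;

    Buffer(String key) {
      this.key = key;
    }

    void finish() {
      finished = true; // e.g. wait for background merge work to complete
    }
  }

  final List<String> written = new ArrayList<>();

  /** Refuses to emit a buffer that has not been finished. */
  void writeSingleRow(Buffer buffer) {
    if (!buffer.finished) {
      throw new IllegalStateException("buffer " + buffer.key + " written before finish");
    }
    written.add(buffer.key);
  }

  /** Hash-mode style flush: each buffer is finished right before it is written. */
  void flush(List<Buffer> buffers) {
    for (Buffer buffer : buffers) {
      buffer.finish();
      writeSingleRow(buffer);
    }
    buffers.clear(); // buffers are released once flushed
  }

  public static void main(String[] args) {
    FinishBeforeWriteSketch op = new FinishBeforeWriteSketch();
    List<Buffer> buffers = new ArrayList<>(Arrays.asList(new Buffer("k1"), new Buffer("k2")));
    op.flush(buffers);
    System.out.println(op.written); // [k1, k2]
  }
}
```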





Issue Time Tracking
---

Worklog Id: (was: 470577)
Time Spent: 7h 40m  (was: 7.5h)

> Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge
> ---
>
> Key: HIVE-23880
> URL: https://issues.apache.org/jira/browse/HIVE-23880
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
> Attachments: lipwig-output3605036885489193068.svg
>
>  Time Spent: 7h 40m
>  Remaining Estimate: 0h
>

[jira] [Work logged] (HIVE-24032) Remove hadoop shims dependency and use FileSystem Api directly from standalone metastore

2020-08-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24032?focusedWorklogId=470547=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470547
 ]

ASF GitHub Bot logged work on HIVE-24032:
-

Author: ASF GitHub Bot
Created on: 14/Aug/20 03:10
Start Date: 14/Aug/20 03:10
Worklog Time Spent: 10m 
  Work Description: aasha commented on a change in pull request #1396:
URL: https://github.com/apache/hive/pull/1396#discussion_r470385260



##
File path: 
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/EncryptionFileUtils.java
##
@@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.metastore.utils;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hdfs.client.HdfsAdmin;
+import org.apache.hadoop.hdfs.protocol.EncryptionZone;
+
+import java.io.IOException;
+import java.net.URI;
+
+public class EncryptionFileUtils {
+
+  public static boolean isPathEncrypted(Path path, Configuration conf) throws IOException {
+    Path fullPath;
+    if (path.isAbsolute()) {
+      fullPath = path;
+    } else {
+      fullPath = path.getFileSystem(conf).makeQualified(path);
+    }
+    if(!"hdfs".equalsIgnoreCase(path.toUri().getScheme())) {
+      return false;
+    }
+    return (EncryptionFileUtils.getEncryptionZoneForPath(fullPath, conf) != null);
+  }
+
+  public static EncryptionZone getEncryptionZoneForPath(Path path, Configuration conf) throws IOException {
+    URI uri = path.getFileSystem(conf).getUri();
+    if ("hdfs".equals(uri.getScheme())) {
+      HdfsAdmin hdfsAdmin = new HdfsAdmin(uri, conf);
+      if (path.getFileSystem(conf).exists(path)) {
+        return hdfsAdmin.getEncryptionZoneForPath(path);
+      } else if (!path.getParent().equals(path)) {

Review comment:
   This is an exit condition. When path is the root, path.getParent() 
should be equal to path, which ends the recursion. These utils are taken from 
the hadoop shims code in Hadoop23Shims.java. 
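
The walk-up-until-something-exists pattern discussed here can be sketched with the JDK's own `java.nio.file.Path` instead of Hadoop's `Path`, so that the example runs without a cluster. This is an assumption-laden illustration, not the method under review: in `java.nio.file` the root's `getParent()` returns `null`, so the exit condition in this sketch is a null check, and the class and method names are made up.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Illustrative only: uses java.nio.file, not org.apache.hadoop.fs.Path. */
class NearestExistingAncestorSketch {

  /** Returns the closest existing path at or above p, or null if there is none. */
  static Path nearestExisting(Path p) {
    if (Files.exists(p)) {
      return p;
    }
    Path parent = p.getParent();
    if (parent == null) {
      return null; // exit condition: nothing left above p to try
    }
    return nearestExisting(parent);
  }

  public static void main(String[] args) {
    Path tmp = Paths.get(System.getProperty("java.io.tmpdir"));
    Path missing = tmp.resolve("sketch-missing-" + System.nanoTime()).resolve("child");
    // Prints the temp directory, the nearest ancestor that exists.
    System.out.println(nearestExisting(missing));
  }
}
```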







Issue Time Tracking
---

Worklog Id: (was: 470547)
Time Spent: 1h 10m  (was: 1h)

> Remove hadoop shims dependency and use FileSystem Api directly from 
> standalone metastore
> 
>
> Key: HIVE-24032
> URL: https://issues.apache.org/jira/browse/HIVE-24032
> Project: Hive
>  Issue Type: Task
>Reporter: Aasha Medhi
>Assignee: Aasha Medhi
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-24032.01.patch
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (HIVE-24032) Remove hadoop shims dependency and use FileSystem Api directly from standalone metastore

2020-08-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24032?focusedWorklogId=470536=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470536
 ]

ASF GitHub Bot logged work on HIVE-24032:
-

Author: ASF GitHub Bot
Created on: 14/Aug/20 02:26
Start Date: 14/Aug/20 02:26
Worklog Time Spent: 10m 
  Work Description: aasha commented on a change in pull request #1396:
URL: https://github.com/apache/hive/pull/1396#discussion_r470374769



##
File path: 
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/EncryptionFileUtils.java
##
@@ -0,0 +1,65 @@
+package org.apache.hadoop.hive.metastore.utils;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hdfs.client.HdfsAdmin;
+import org.apache.hadoop.hdfs.protocol.EncryptionZone;
+
+import java.io.IOException;
+import java.net.URI;
+
+public class EncryptionFileUtils {
+
+  public static boolean isPathEncrypted(Path path, Configuration conf) throws IOException {
+    Path fullPath;
+    if (path.isAbsolute()) {
+      fullPath = path;
+    } else {
+      fullPath = path.getFileSystem(conf).makeQualified(path);
+    }
+    if(!"hdfs".equalsIgnoreCase(path.toUri().getScheme())) {

Review comment:
   If the scheme itself is not hdfs, we needn't make a call to 
getEncryptionZoneForPath. 







Issue Time Tracking
---

Worklog Id: (was: 470536)
Time Spent: 1h  (was: 50m)

> Remove hadoop shims dependency and use FileSystem Api directly from 
> standalone metastore
> 
>
> Key: HIVE-24032
> URL: https://issues.apache.org/jira/browse/HIVE-24032
> Project: Hive
>  Issue Type: Task
>Reporter: Aasha Medhi
>Assignee: Aasha Medhi
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-24032.01.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>






[jira] [Work logged] (HIVE-24032) Remove hadoop shims dependency and use FileSystem Api directly from standalone metastore

2020-08-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24032?focusedWorklogId=470535=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470535
 ]

ASF GitHub Bot logged work on HIVE-24032:
-

Author: ASF GitHub Bot
Created on: 14/Aug/20 02:24
Start Date: 14/Aug/20 02:24
Worklog Time Spent: 10m 
  Work Description: aasha commented on a change in pull request #1396:
URL: https://github.com/apache/hive/pull/1396#discussion_r470374769



##
File path: 
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/EncryptionFileUtils.java
##
@@ -0,0 +1,65 @@
+package org.apache.hadoop.hive.metastore.utils;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hdfs.client.HdfsAdmin;
+import org.apache.hadoop.hdfs.protocol.EncryptionZone;
+
+import java.io.IOException;
+import java.net.URI;
+
+public class EncryptionFileUtils {
+
+  public static boolean isPathEncrypted(Path path, Configuration conf) throws IOException {
+    Path fullPath;
+    if (path.isAbsolute()) {
+      fullPath = path;
+    } else {
+      fullPath = path.getFileSystem(conf).makeQualified(path);
+    }
+    if(!"hdfs".equalsIgnoreCase(path.toUri().getScheme())) {

Review comment:
   If the scheme itself is not hdfs, we needn't make a call to 
getEncryptionZoneForPath. This will save a file system call and will be faster.
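
The short-circuit described in this comment can be illustrated with plain `java.net.URI`. This is a sketch under stated assumptions: `expensiveZoneLookup` is a made-up stand-in for the remote encryption-zone query, and the counter exists only to show that non-hdfs paths never reach it.

```java
import java.net.URI;

/** Illustrative only: a hypothetical stand-in, not EncryptionFileUtils. */
class SchemeShortCircuitSketch {

  static int lookups = 0; // counts simulated remote calls

  /** Made-up placeholder for the encryption-zone query against the file system. */
  static boolean expensiveZoneLookup(URI uri) {
    lookups++;
    return uri.getPath().startsWith("/secure");
  }

  static boolean isPathEncrypted(URI uri) {
    // A null scheme (relative URI) also fails this check and returns early.
    if (!"hdfs".equalsIgnoreCase(uri.getScheme())) {
      return false;
    }
    return expensiveZoneLookup(uri);
  }

  public static void main(String[] args) {
    System.out.println(isPathEncrypted(URI.create("file:///tmp/t")));         // false
    System.out.println(isPathEncrypted(URI.create("hdfs://nn:8020/secure/t"))); // true
    System.out.println(lookups);                                               // 1
  }
}
```

Placing the literal `"hdfs"` on the left of `equalsIgnoreCase` keeps the check null-safe, so no separate null test on the scheme is needed.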

##
File path: 
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/EncryptionFileUtils.java
##
@@ -0,0 +1,65 @@
+package org.apache.hadoop.hive.metastore.utils;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hdfs.client.HdfsAdmin;
+import org.apache.hadoop.hdfs.protocol.EncryptionZone;
+
+import java.io.IOException;
+import java.net.URI;
+
+public class EncryptionFileUtils {
+
+  public static boolean isPathEncrypted(Path path, Configuration conf) throws IOException {
+    Path fullPath;
+    if (path.isAbsolute()) {
+      fullPath = path;
+    } else {
+      fullPath = path.getFileSystem(conf).makeQualified(path);
+    }
+    if(!"hdfs".equalsIgnoreCase(path.toUri().getScheme())) {
+      return false;
+    }
+    return (EncryptionFileUtils.getEncryptionZoneForPath(fullPath, conf) != null);

Review comment:
   It's a static method and hence is called with the class name.







Issue Time Tracking
---

Worklog Id: (was: 470535)
Time Spent: 50m  (was: 40m)

> Remove hadoop shims dependency and use FileSystem Api directly from 
> standalone metastore
> 
>
> Key: HIVE-24032
> URL: https://issues.apache.org/jira/browse/HIVE-24032
> Project: Hive
>  Issue Type: Task
>Reporter: Aasha Medhi
>

[jira] [Work logged] (HIVE-24032) Remove hadoop shims dependency and use FileSystem Api directly from standalone metastore

2020-08-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24032?focusedWorklogId=470534=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470534
 ]

ASF GitHub Bot logged work on HIVE-24032:
-

Author: ASF GitHub Bot
Created on: 14/Aug/20 02:20
Start Date: 14/Aug/20 02:20
Worklog Time Spent: 10m 
  Work Description: aasha commented on a change in pull request #1396:
URL: https://github.com/apache/hive/pull/1396#discussion_r470373805



##
File path: 
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/EncryptionFileUtils.java
##
@@ -0,0 +1,65 @@
+package org.apache.hadoop.hive.metastore.utils;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hdfs.client.HdfsAdmin;
+import org.apache.hadoop.hdfs.protocol.EncryptionZone;
+
+import java.io.IOException;
+import java.net.URI;
+
+public class EncryptionFileUtils {
+
+  public static boolean isPathEncrypted(Path path, Configuration conf) throws IOException {
+    Path fullPath;
+    if (path.isAbsolute()) {
+      fullPath = path;
+    } else {
+      fullPath = path.getFileSystem(conf).makeQualified(path);
+    }
+    if(!"hdfs".equalsIgnoreCase(path.toUri().getScheme())) {
+      return false;
+    }
+    return (EncryptionFileUtils.getEncryptionZoneForPath(fullPath, conf) != null);
+  }
+
+  public static EncryptionZone getEncryptionZoneForPath(Path path, Configuration conf) throws IOException {
+    URI uri = path.getFileSystem(conf).getUri();
+    if ("hdfs".equals(uri.getScheme())) {
+      HdfsAdmin hdfsAdmin = new HdfsAdmin(uri, conf);
+      if (path.getFileSystem(conf).exists(path)) {
+        return hdfsAdmin.getEncryptionZoneForPath(path);
+      } else if (!path.getParent().equals(path)) {
+        return getEncryptionZoneForPath(path.getParent(), conf);
+      } else {
+        return null;
+      }
+    }
+    return null;
+  }
+
+  public static void createEncryptionZone(Path path, String keyName, Configuration conf) throws IOException {

Review comment:
   It is better to keep it as part of the same utility class. Tests can also 
use it from the utility class. It's not test code; it's a utility method.







Issue Time Tracking
---

Worklog Id: (was: 470534)
Time Spent: 40m  (was: 0.5h)

> Remove hadoop shims dependency and use FileSystem Api directly from 
> standalone metastore
> 
>
> Key: HIVE-24032
> URL: https://issues.apache.org/jira/browse/HIVE-24032
> Project: Hive
>  Issue Type: Task
>Reporter: Aasha Medhi
>Assignee: Aasha Medhi
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-24032.01.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (HIVE-24032) Remove hadoop shims dependency and use FileSystem Api directly from standalone metastore

2020-08-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24032?focusedWorklogId=470533=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470533
 ]

ASF GitHub Bot logged work on HIVE-24032:
-

Author: ASF GitHub Bot
Created on: 14/Aug/20 02:19
Start Date: 14/Aug/20 02:19
Worklog Time Spent: 10m 
  Work Description: aasha commented on a change in pull request #1396:
URL: https://github.com/apache/hive/pull/1396#discussion_r470373619



##
File path: 
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/EncryptionFileUtils.java
##
@@ -0,0 +1,65 @@
+package org.apache.hadoop.hive.metastore.utils;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hdfs.client.HdfsAdmin;
+import org.apache.hadoop.hdfs.protocol.EncryptionZone;
+
+import java.io.IOException;
+import java.net.URI;
+
+public class EncryptionFileUtils {
+
+  public static boolean isPathEncrypted(Path path, Configuration conf) throws IOException {
+    Path fullPath;
+    if (path.isAbsolute()) {
+      fullPath = path;
+    } else {
+      fullPath = path.getFileSystem(conf).makeQualified(path);
+    }
+    if(!"hdfs".equalsIgnoreCase(path.toUri().getScheme())) {
+      return false;
+    }
+    return (EncryptionFileUtils.getEncryptionZoneForPath(fullPath, conf) != null);

Review comment:
   It's a static method







Issue Time Tracking
---

Worklog Id: (was: 470533)
Time Spent: 0.5h  (was: 20m)

> Remove hadoop shims dependency and use FileSystem Api directly from 
> standalone metastore
> 
>
> Key: HIVE-24032
> URL: https://issues.apache.org/jira/browse/HIVE-24032
> Project: Hive
>  Issue Type: Task
>Reporter: Aasha Medhi
>Assignee: Aasha Medhi
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-24032.01.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>






[jira] [Work logged] (HIVE-23880) Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge

2020-08-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23880?focusedWorklogId=470493=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470493
 ]

ASF GitHub Bot logged work on HIVE-23880:
-

Author: ASF GitHub Bot
Created on: 14/Aug/20 00:02
Start Date: 14/Aug/20 00:02
Worklog Time Spent: 10m 
  Work Description: mustafaiman commented on pull request #1280:
URL: https://github.com/apache/hive/pull/1280#issuecomment-673767435


   @abstractdog 
   I am almost ok with this patch. However I still don't understand how this 
integrates with `ProcessingModeHashAggregate`. Since there are multiple 
VectorAggregationBufferRows in hash mode, I think we should `finish` each of 
them as we process them. Otherwise, we pass to the next operator in the 
pipeline without completing the bloom filter. Also, since hash mode dynamically 
allocates and frees VectorAggregationBufferRows these `finish`es should happen 
as we deallocate each of them, rather than only at the end of the operator.





Issue Time Tracking
---

Worklog Id: (was: 470493)
Time Spent: 7.5h  (was: 7h 20m)

> Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge
> ---
>
> Key: HIVE-23880
> URL: https://issues.apache.org/jira/browse/HIVE-23880
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
> Attachments: lipwig-output3605036885489193068.svg
>
>  Time Spent: 7.5h
>  Remaining Estimate: 0h
>
> Merging bloom filters in semijoin reduction can become the main bottleneck in 
> case of large number of source mapper tasks (~1000, Map 1 in below example) 
> and a large amount of expected entries (50M) in bloom filters.
> For example in TPCDS Q93:
> {code}
> select /*+ semi(store_returns, sr_item_sk, store_sales, 7000)*/ 
> ss_customer_sk
> ,sum(act_sales) sumsales
>   from (select ss_item_sk
>   ,ss_ticket_number
>   ,ss_customer_sk
>   ,case when sr_return_quantity is not null then 
> (ss_quantity-sr_return_quantity)*ss_sales_price
> else 
> (ss_quantity*ss_sales_price) end act_sales
> from store_sales left outer join store_returns on (sr_item_sk = 
> ss_item_sk
>and 
> sr_ticket_number = ss_ticket_number)
> ,reason
> where sr_reason_sk = r_reason_sk
>   and r_reason_desc = 'reason 66') t
>   group by ss_customer_sk
>   order by sumsales, ss_customer_sk
> limit 100;
> {code}
> At 10TB-30TB scale, 1-2 minutes of a 3-4 minute query runtime can be spent 
> merging bloom filters (Reducer 2), as in:  
> [^lipwig-output3605036885489193068.svg] 
> {code}
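> ------------------------------------------------------------------------------------------
> VERTICES       MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> ------------------------------------------------------------------------------------------
> Map 3 ..       llap     SUCCEEDED      1          1        0        0       0       0
> Map 1 ..       llap     SUCCEEDED   1263       1263        0        0       0       0
> Reducer 2      llap       RUNNING      1          0        1        0       0       0
> Map 4          llap       RUNNING   6154          0      207     5947       0       0
> Reducer 5      llap        INITED     43          0        0       43       0       0
> Reducer 6      llap        INITED      1          0        0        1       0       0
> ------------------------------------------------------------------------------------------
> VERTICES: 02/06  [>>--] 16%   ELAPSED TIME: 149.98 s
> ------------------------------------------------------------------------------------------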
> {code}
> For example, 70M entries in a bloom filter lead to 436 465 696 bits, so 
> merging 1263 bloom filters means running ~1263 * 436 465 696 bitwise OR 
> operations, which is a very hot codepath but can be parallelized.
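
The parallelization idea in the description above can be sketched as follows. 
With the quoted figures, the serial merge amounts to roughly 5.5 * 10^11 
bit-level ORs. Because each position of the target depends only on the same 
position in every source, disjoint ranges can be merged by separate threads 
without locking. This is a minimal sketch under stated assumptions, not the 
code from pull request #1280: the class and method names are hypothetical, and 
the bit sets are assumed to be equal-length long[] arrays.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration; not the VectorUDAFBloomFilterMerge implementation.
class ParallelBloomMergeSketch {

  // ORs every source bit set into target, giving each thread a disjoint word range.
  static void mergeParallel(long[] target, List<long[]> sources, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      int chunk = (target.length + threads - 1) / threads; // ceiling division
      List<Future<?>> futures = new ArrayList<>();
      for (int t = 0; t < threads; t++) {
        final int from = Math.min(target.length, t * chunk);
        final int to = Math.min(target.length, from + chunk);
        futures.add(pool.submit(() -> {
          // Ranges never overlap, so no synchronization is needed on target.
          for (long[] src : sources) {
            for (int i = from; i < to; i++) {
              target[i] |= src[i];
            }
          }
        }));
      }
      for (Future<?> f : futures) {
        try {
          f.get(); // wait for completion and surface worker failures
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException("Interrupted while merging", e);
        } catch (ExecutionException e) {
          throw new IllegalStateException("Merge task failed", e.getCause());
        }
      }
    } finally {
      pool.shutdown();
    }
  }
}
```

Splitting by range rather than by source filter keeps the writes of different 
threads apart, so the result is the same as a serial merge.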



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23880) Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge

2020-08-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23880?focusedWorklogId=470489=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470489
 ]

ASF GitHub Bot logged work on HIVE-23880:
-

Author: ASF GitHub Bot
Created on: 13/Aug/20 23:56
Start Date: 13/Aug/20 23:56
Worklog Time Spent: 10m 
  Work Description: mustafaiman commented on a change in pull request #1280:
URL: https://github.com/apache/hive/pull/1280#discussion_r470310851



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorGroupByOperator.java
##
@@ -1126,6 +1137,7 @@ protected void initializeOp(Configuration hconf) throws 
HiveException {
 VectorAggregateExpression vecAggrExpr = null;
 try {
   vecAggrExpr = ctor.newInstance(vecAggrDesc);
+  vecAggrExpr.withConf(hconf);

Review comment:
   I think making `VectorUDAFBloomFilterMerge` construction a special case 
and supplying the single int to that constructor is much cleaner. While trying 
to avoid that specialization, you are injecting the conf object into all the 
other classes.
   
   I specifically despise passing the conf object around in Hive as it is abused 
so much in every part of the codebase. I'd prefer the other way but I won't 
insist on it. It is not a big deal for this patch.
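
The alternative suggested in the review comment above can be illustrated with 
a small sketch. All class names and the configuration key below are 
hypothetical stand-ins, not Hive classes. The point is only that the factory 
reads the one setting and hands a plain int to the one aggregate that needs 
it, so the other aggregates never see the configuration object.

```java
import java.util.Map;

// Hypothetical stand-ins; these are not Hive classes.
class AggregateSketch {
}

class BloomMergeAggregateSketch extends AggregateSketch {
  final int mergeThreads;

  BloomMergeAggregateSketch(int mergeThreads) {
    this.mergeThreads = mergeThreads;
  }
}

class AggregateFactorySketch {
  // Hypothetical configuration key, used only for this illustration.
  static final String THREADS_KEY = "example.bloom.merge.threads";

  // Special-cases the one aggregate that needs a setting; others get no config.
  static AggregateSketch create(String kind, Map<String, String> conf) {
    if ("bloom_merge".equals(kind)) {
      int threads = Integer.parseInt(conf.getOrDefault(THREADS_KEY, "1"));
      return new BloomMergeAggregateSketch(threads);
    }
    return new AggregateSketch();
  }
}
```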
   







Issue Time Tracking
---

Worklog Id: (was: 470489)
Time Spent: 7h 20m  (was: 7h 10m)

> Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge
> ---
>
> Key: HIVE-23880
> URL: https://issues.apache.org/jira/browse/HIVE-23880
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
> Attachments: lipwig-output3605036885489193068.svg
>
>  Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> Merging bloom filters in semijoin reduction can become the main bottleneck 
> with a large number of source mapper tasks (~1000, Map 1 in the example below) 
> and a large number of expected entries (50M) in the bloom filters.
> For example in TPCDS Q93:
> {code}
> select /*+ semi(store_returns, sr_item_sk, store_sales, 7000)*/ 
> ss_customer_sk
> ,sum(act_sales) sumsales
>   from (select ss_item_sk
>   ,ss_ticket_number
>   ,ss_customer_sk
>   ,case when sr_return_quantity is not null then 
> (ss_quantity-sr_return_quantity)*ss_sales_price
> else 
> (ss_quantity*ss_sales_price) end act_sales
> from store_sales left outer join store_returns on (sr_item_sk = 
> ss_item_sk
>and 
> sr_ticket_number = ss_ticket_number)
> ,reason
> where sr_reason_sk = r_reason_sk
>   and r_reason_desc = 'reason 66') t
>   group by ss_customer_sk
>   order by sumsales, ss_customer_sk
> limit 100;
> {code}
> At 10TB-30TB scale, 1-2 minutes of a 3-4 minute query runtime can be spent 
> merging bloom filters (Reducer 2), as in:  
> [^lipwig-output3605036885489193068.svg] 
> {code}
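> ------------------------------------------------------------------------------------------
> VERTICES       MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> ------------------------------------------------------------------------------------------
> Map 3 ..       llap     SUCCEEDED      1          1        0        0       0       0
> Map 1 ..       llap     SUCCEEDED   1263       1263        0        0       0       0
> Reducer 2      llap       RUNNING      1          0        1        0       0       0
> Map 4          llap       RUNNING   6154          0      207     5947       0       0
> Reducer 5      llap        INITED     43          0        0       43       0       0
> Reducer 6      llap        INITED      1          0        0        1       0       0
> ------------------------------------------------------------------------------------------
> VERTICES: 02/06  [>>--] 16%   ELAPSED TIME: 149.98 s
> ------------------------------------------------------------------------------------------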
> {code}
> For example, 70M entries in bloom filter leads to a 

[jira] [Updated] (HIVE-24039) Update jquery version to mitigate CVE-2020-11023

2020-08-13 Thread Rajkumar Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajkumar Singh updated HIVE-24039:
--
Summary: Update jquery version to mitigate CVE-2020-11023  (was: update 
jquery version to mitigate CVE-2020-11023)

> Update jquery version to mitigate CVE-2020-11023
> 
>
> Key: HIVE-24039
> URL: https://issues.apache.org/jira/browse/HIVE-24039
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Reporter: Rajkumar Singh
>Assignee: Rajkumar Singh
>Priority: Major
>
> There is a known vulnerability in the jQuery version used by Hive. The plan 
> for this Jira is to upgrade jQuery to version 3.5.0, where it has been fixed. 
> More details about the vulnerability can be found here:
> https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-11023





[jira] [Work logged] (HIVE-24032) Remove hadoop shims dependency and use FileSystem Api directly from standalone metastore

2020-08-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24032?focusedWorklogId=470429=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470429
 ]

ASF GitHub Bot logged work on HIVE-24032:
-

Author: ASF GitHub Bot
Created on: 13/Aug/20 21:00
Start Date: 13/Aug/20 21:00
Worklog Time Spent: 10m 
  Work Description: pkumarsinha commented on a change in pull request #1396:
URL: https://github.com/apache/hive/pull/1396#discussion_r470230243



##
File path: 
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/EncryptionFileUtils.java
##
@@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.metastore.utils;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hdfs.client.HdfsAdmin;
+import org.apache.hadoop.hdfs.protocol.EncryptionZone;
+
+import java.io.IOException;
+import java.net.URI;
+
+public class EncryptionFileUtils {
+

Review comment:
   Add a private constructor to avoid any accidental object creation
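
   The suggestion above is the usual pattern for static-only utility classes. 
A minimal sketch follows, with a hypothetical class name and a placeholder 
method that is not the EncryptionFileUtils logic under review:

```java
// Hypothetical illustration of a non-instantiable utility class.
final class UtilityClassSketch {

  // Private constructor: callers cannot create instances by accident.
  private UtilityClassSketch() {
    throw new AssertionError("no instances");
  }

  // Placeholder static helper; null-safe because the literal is the receiver.
  static boolean isHdfsScheme(String scheme) {
    return "hdfs".equalsIgnoreCase(scheme);
  }
}
```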

##
File path: 
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/EncryptionFileUtils.java
##
@@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.metastore.utils;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hdfs.client.HdfsAdmin;
+import org.apache.hadoop.hdfs.protocol.EncryptionZone;
+
+import java.io.IOException;
+import java.net.URI;
+
+public class EncryptionFileUtils {

Review comment:
   nit: Can we rename it to EncryptionZoneUtils

##
File path: 
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/EncryptionFileUtils.java
##
@@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.metastore.utils;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hdfs.client.HdfsAdmin;
+import org.apache.hadoop.hdfs.protocol.EncryptionZone;
+
+import java.io.IOException;
+import java.net.URI;
+
+public class EncryptionFileUtils {
+
+  public static boolean isPathEncrypted(Path path, Configuration conf) throws 
IOException {
+Path fullPath;
+if (path.isAbsolute()) {
+  fullPath = path;
+} else {
+  fullPath = path.getFileSystem(conf).makeQualified(path);
+}
+if(!"hdfs".equalsIgnoreCase(path.toUri().getScheme())) {
+  return 

[jira] [Assigned] (HIVE-24039) update jquery version to mitigate CVE-2020-11023

2020-08-13 Thread Rajkumar Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajkumar Singh reassigned HIVE-24039:
-


> update jquery version to mitigate CVE-2020-11023
> 
>
> Key: HIVE-24039
> URL: https://issues.apache.org/jira/browse/HIVE-24039
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Reporter: Rajkumar Singh
>Assignee: Rajkumar Singh
>Priority: Major
>
> There is a known vulnerability in the jQuery version used by Hive. The plan 
> for this Jira is to upgrade jQuery to version 3.5.0, where it has been fixed. 
> More details about the vulnerability can be found here:
> https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-11023





[jira] [Updated] (HIVE-23972) Add external client ID to LLAP external client

2020-08-13 Thread Jason Dere (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Dere updated HIVE-23972:
--
Fix Version/s: 4.0.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Fix merged by [~prasanthj]

> Add external client ID to LLAP external client
> --
>
> Key: HIVE-23972
> URL: https://issues.apache.org/jira/browse/HIVE-23972
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Reporter: Jason Dere
>Assignee: Jason Dere
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> There currently is not a good way to tell which currently running LLAP tasks 
> are from external LLAP clients, and also no good way to know which 
> application is submitting these external LLAP requests.
> One possible solution for this is to add an option for the external LLAP 
> client to pass in an external client ID, which can get logged by HiveServer2 
> during the getSplits request, as well as displayed from the LLAP 
> executorsStatus.
> cc [~ShubhamChaurasia]





[jira] [Updated] (HIVE-23972) Add external client ID to LLAP external client

2020-08-13 Thread Jason Dere (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Dere updated HIVE-23972:
--
Status: Patch Available  (was: Open)

> Add external client ID to LLAP external client
> --
>
> Key: HIVE-23972
> URL: https://issues.apache.org/jira/browse/HIVE-23972
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Reporter: Jason Dere
>Assignee: Jason Dere
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> There currently is not a good way to tell which currently running LLAP tasks 
> are from external LLAP clients, and also no good way to know which 
> application is submitting these external LLAP requests.
> One possible solution for this is to add an option for the external LLAP 
> client to pass in an external client ID, which can get logged by HiveServer2 
> during the getSplits request, as well as displayed from the LLAP 
> executorsStatus.
> cc [~ShubhamChaurasia]





[jira] [Work logged] (HIVE-23972) Add external client ID to LLAP external client

2020-08-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23972?focusedWorklogId=470407=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470407
 ]

ASF GitHub Bot logged work on HIVE-23972:
-

Author: ASF GitHub Bot
Created on: 13/Aug/20 20:31
Start Date: 13/Aug/20 20:31
Worklog Time Spent: 10m 
  Work Description: prasanthj merged pull request #1350:
URL: https://github.com/apache/hive/pull/1350


   





Issue Time Tracking
---

Worklog Id: (was: 470407)
Time Spent: 40m  (was: 0.5h)

> Add external client ID to LLAP external client
> --
>
> Key: HIVE-23972
> URL: https://issues.apache.org/jira/browse/HIVE-23972
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Reporter: Jason Dere
>Assignee: Jason Dere
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> There currently is not a good way to tell which currently running LLAP tasks 
> are from external LLAP clients, and also no good way to know which 
> application is submitting these external LLAP requests.
> One possible solution for this is to add an option for the external LLAP 
> client to pass in an external client ID, which can get logged by HiveServer2 
> during the getSplits request, as well as displayed from the LLAP 
> executorsStatus.
> cc [~ShubhamChaurasia]





[jira] [Comment Edited] (HIVE-23965) Improve plan regression tests using TPCDS30TB metastore dump and custom configs

2020-08-13 Thread Jesus Camacho Rodriguez (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176406#comment-17176406
 ] 

Jesus Camacho Rodriguez edited comment on HIVE-23965 at 8/13/20, 3:57 PM:
--

+1 on removing the old driver, since the new one fixes issues with the existing 
one. I do not think having the old one around adds much value and updating all 
those q files will be a pain.

[~zabetak], [~kgyrtkirk], if this PR is ready to be merged, I think the removal 
can be done in a follow-up.


was (Author: jcamachorodriguez):
+1 on removing old driver, since it fixes issues with the existing one. I do 
not think having the old one around adds much value and updating all those q 
files will be a pain.

[~zabetak], [~kgyrtkirk], if this PR is ready to be merged, I think the removal 
can be done in a follow-up.

> Improve plan regression tests using TPCDS30TB metastore dump and custom 
> configs
> ---
>
> Key: HIVE-23965
> URL: https://issues.apache.org/jira/browse/HIVE-23965
> Project: Hive
>  Issue Type: Improvement
>Reporter: Stamatis Zampetakis
>Assignee: Stamatis Zampetakis
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The existing regression tests (HIVE-12586) based on TPC-DS have certain 
> shortcomings:
> The table statistics do not reflect cardinalities from a specific TPC-DS 
> scale factor (SF). Some tables are from a 30TB dataset, others from a 200GB 
> dataset, and others from a 3GB dataset. This mix leads to plans that may 
> never appear when using an actual TPC-DS dataset. 
> The existing statistics do not contain information about partitions, which 
> can have a big impact on the resulting plans.
> The existing regression tests rely more or less on the default 
> configuration (hive-site.xml). In real-life scenarios, though, some of the 
> configurations differ and may impact the choices of the optimizer.
> This issue aims to address the above shortcomings by using a curated 
> TPCDS30TB metastore dump along with some custom Hive configurations. 





[jira] [Work logged] (HIVE-24015) Disable query-based compaction on MR execution engine

2020-08-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24015?focusedWorklogId=470303=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470303
 ]

ASF GitHub Bot logged work on HIVE-24015:
-

Author: ASF GitHub Bot
Created on: 13/Aug/20 15:57
Start Date: 13/Aug/20 15:57
Worklog Time Spent: 10m 
  Work Description: klcopp merged pull request #1375:
URL: https://github.com/apache/hive/pull/1375


   





Issue Time Tracking
---

Worklog Id: (was: 470303)
Time Spent: 20m  (was: 10m)

> Disable query-based compaction on MR execution engine
> -
>
> Key: HIVE-24015
> URL: https://issues.apache.org/jira/browse/HIVE-24015
> Project: Hive
>  Issue Type: Task
>Reporter: Karen Coppage
>Assignee: Karen Coppage
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Major compaction can be run when the execution engine is MR. This can cause 
> data loss a la HIVE-23703 (the fix for data loss when the execution engine is 
> MR was reverted by HIVE-23763).
> Currently minor compaction can only be run when the execution engine is Tez, 
> otherwise it falls back to MR (non-query-based) compaction. We should extend 
> this functionality to major compaction as well.





[jira] [Work logged] (HIVE-24020) Automatic Compaction not working in existing partitions for Streaming Ingest with Dynamic Partition

2020-08-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24020?focusedWorklogId=470224=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470224
 ]

ASF GitHub Bot logged work on HIVE-24020:
-

Author: ASF GitHub Bot
Created on: 13/Aug/20 13:18
Start Date: 13/Aug/20 13:18
Worklog Time Spent: 10m 
  Work Description: vpnvishv commented on pull request #1382:
URL: https://github.com/apache/hive/pull/1382#issuecomment-673473084


   @pvary @laszlopinter86 @klcopp Can you please review.





Issue Time Tracking
---

Worklog Id: (was: 470224)
Time Spent: 20m  (was: 10m)

> Automatic Compaction not working in existing partitions for Streaming Ingest 
> with Dynamic Partition
> ---
>
> Key: HIVE-24020
> URL: https://issues.apache.org/jira/browse/HIVE-24020
> Project: Hive
>  Issue Type: Bug
>  Components: Streaming, Transactions
>Affects Versions: 4.0.0, 3.1.2
>Reporter: Vipin Vishvkarma
>Assignee: Vipin Vishvkarma
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This issue happens when we try to do streaming ingest with dynamic partitioning 
> on already existing partitions. Looking at the code, we have the following 
> check in AbstractRecordWriter:
>  
> {code:java}
> PartitionInfo partitionInfo = conn.createPartitionIfNotExists(partitionValues);
> // collect the newly added partitions. connection.commitTransaction() will
> // report the dynamically added partitions to TxnHandler
> if (!partitionInfo.isExists()) {
>   addedPartitions.add(partitionInfo.getName());
> } else {
>   if (LOG.isDebugEnabled()) {
>     LOG.debug("Partition {} already exists for table {}",
>         partitionInfo.getName(), fullyQualifiedTableName);
>   }
> }
> {code}
> The *addedPartitions* collection above is passed to *addDynamicPartitions* 
> during TransactionBatch commit. So for already existing partitions, 
> *addedPartitions* will be empty and *addDynamicPartitions* will not move 
> entries from TXN_COMPONENTS to COMPLETED_TXN_COMPONENTS. As a result, the 
> Initiator is not able to trigger automatic compaction.
> Another observed issue is that we are not clearing *addedPartitions* on 
> writer close, which results in information flowing across transactions.
>  
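
The two problems described above (existing partitions never being reported, 
and state leaking across transactions) can be illustrated with a small 
stand-alone sketch. This shows one possible direction only and is not 
necessarily what pull request #1382 does. The class and method names are 
hypothetical and do not exist in Hive.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration; not Hive's AbstractRecordWriter.
class PartitionTrackerSketch {
  private final Set<String> touchedPartitions = new HashSet<>();

  // Records every partition written to, whether newly created or pre-existing.
  void onWrite(String partitionName) {
    touchedPartitions.add(partitionName);
  }

  // Returns the partitions to report at commit and resets the tracked state.
  Set<String> drainForCommit() {
    Set<String> reported = new HashSet<>(touchedPartitions);
    touchedPartitions.clear();
    return reported;
  }

  // Clears state on close so nothing carries over into the next transaction.
  void close() {
    touchedPartitions.clear();
  }
}
```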





[jira] [Comment Edited] (HIVE-12679) Allow users to be able to specify an implementation of IMetaStoreClient via HiveConf

2020-08-13 Thread Noritaka Sekiyama (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-12679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177001#comment-17177001
 ] 

Noritaka Sekiyama edited comment on HIVE-12679 at 8/13/20, 1:14 PM:


I talked with Austin, and I submitted a new pull request based on the patch 
which has already been uploaded to this issue. 
https://github.com/apache/hive/pull/1402

Hive committers - can you review the patch again?


was (Author: moomindani):
I talked with Austin, and I submitted a new pull-request based on the patch 
which has been already uploaded to this issue.

Hive committers - can you review the patch again?

> Allow users to be able to specify an implementation of IMetaStoreClient via 
> HiveConf
> 
>
> Key: HIVE-12679
> URL: https://issues.apache.org/jira/browse/HIVE-12679
> Project: Hive
>  Issue Type: Improvement
>  Components: Configuration, Metastore, Query Planning
>Reporter: Austin Lee
>Priority: Minor
>  Labels: metastore, pull-request-available
> Attachments: HIVE-12679.1.patch, HIVE-12679.2.patch, 
> HIVE-12679.branch-1.2.patch, HIVE-12679.branch-2.3.patch, HIVE-12679.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hi,
> I would like to propose a change that would make it possible for users to 
> choose an implementation of IMetaStoreClient via HiveConf, i.e. 
> hive-site.xml.  Currently, in Hive the choice is hard coded to be 
> SessionHiveMetaStoreClient in org.apache.hadoop.hive.ql.metadata.Hive.  There 
> is no other direct reference to SessionHiveMetaStoreClient other than the 
> hard coded class name in Hive.java and the QL component operates only on the 
> IMetaStoreClient interface so the change would be minimal and it would be 
> quite similar to how an implementation of RawStore is specified and loaded in 
> hive-metastore.  One use case this change would serve would be one where a 
> user wishes to use an implementation of this interface without the dependency 
> on the Thrift server.
>   
> Thank you,
> Austin
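
The proposal quoted above amounts to loading an implementation class by name 
from configuration and using it only through its interface. A minimal sketch 
of that pattern follows. The interface, the classes, and the configuration key 
are hypothetical stand-ins, java.util.Properties stands in for HiveConf, and 
none of this is taken from the attached patches.

```java
import java.util.Properties;

// Hypothetical stand-in for the client interface.
interface MetaClientSketch {
  String name();
}

// Hypothetical default implementation, used when nothing is configured.
class DefaultMetaClientSketch implements MetaClientSketch {
  public String name() {
    return "default";
  }
}

// Hypothetical alternative implementation selected through configuration.
class CustomMetaClientSketch implements MetaClientSketch {
  public String name() {
    return "custom";
  }
}

class ClientFactorySketch {
  // Hypothetical configuration key; not an actual HiveConf property.
  static final String IMPL_KEY = "example.metastore.client.impl";

  // Instantiates the configured class, falling back to the default one.
  static MetaClientSketch create(Properties conf) {
    String className =
        conf.getProperty(IMPL_KEY, DefaultMetaClientSketch.class.getName());
    try {
      return Class.forName(className)
          .asSubclass(MetaClientSketch.class)
          .getDeclaredConstructor()
          .newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalArgumentException("Cannot instantiate " + className, e);
    }
  }
}
```

Because callers depend only on the interface, swapping the implementation 
requires a configuration change and no code change.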





[jira] [Commented] (HIVE-12679) Allow users to be able to specify an implementation of IMetaStoreClient via HiveConf

2020-08-13 Thread Noritaka Sekiyama (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-12679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177001#comment-17177001
 ] 

Noritaka Sekiyama commented on HIVE-12679:
--

I talked with Austin, and I submitted a new pull request based on the patch 
which has already been uploaded to this issue.

Hive committers - can you review the patch again?

> Allow users to be able to specify an implementation of IMetaStoreClient via 
> HiveConf
> 
>
> Key: HIVE-12679
> URL: https://issues.apache.org/jira/browse/HIVE-12679
> Project: Hive
>  Issue Type: Improvement
>  Components: Configuration, Metastore, Query Planning
>Reporter: Austin Lee
>Priority: Minor
>  Labels: metastore, pull-request-available
> Attachments: HIVE-12679.1.patch, HIVE-12679.2.patch, 
> HIVE-12679.branch-1.2.patch, HIVE-12679.branch-2.3.patch, HIVE-12679.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hi,
> I would like to propose a change that would make it possible for users to 
> choose an implementation of IMetaStoreClient via HiveConf, i.e. 
> hive-site.xml.  Currently, in Hive the choice is hard coded to be 
> SessionHiveMetaStoreClient in org.apache.hadoop.hive.ql.metadata.Hive.  There 
> is no other direct reference to SessionHiveMetaStoreClient other than the 
> hard coded class name in Hive.java and the QL component operates only on the 
> IMetaStoreClient interface so the change would be minimal and it would be 
> quite similar to how an implementation of RawStore is specified and loaded in 
> hive-metastore.  One use case this change would serve would be one where a 
> user wishes to use an implementation of this interface without the dependency 
> on the Thrift server.
>   
> Thank you,
> Austin





[jira] [Updated] (HIVE-12679) Allow users to be able to specify an implementation of IMetaStoreClient via HiveConf

2020-08-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-12679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-12679:
--
Labels: metastore pull-request-available  (was: metastore)

> Allow users to be able to specify an implementation of IMetaStoreClient via 
> HiveConf
> 
>
> Key: HIVE-12679
> URL: https://issues.apache.org/jira/browse/HIVE-12679
> Project: Hive
>  Issue Type: Improvement
>  Components: Configuration, Metastore, Query Planning
>Reporter: Austin Lee
>Priority: Minor
>  Labels: metastore, pull-request-available
> Attachments: HIVE-12679.1.patch, HIVE-12679.2.patch, 
> HIVE-12679.branch-1.2.patch, HIVE-12679.branch-2.3.patch, HIVE-12679.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hi,
> I would like to propose a change that would make it possible for users to 
> choose an implementation of IMetaStoreClient via HiveConf, i.e. 
> hive-site.xml.  Currently, in Hive the choice is hard coded to be 
> SessionHiveMetaStoreClient in org.apache.hadoop.hive.ql.metadata.Hive.  There 
> is no other direct reference to SessionHiveMetaStoreClient other than the 
> hard coded class name in Hive.java and the QL component operates only on the 
> IMetaStoreClient interface so the change would be minimal and it would be 
> quite similar to how an implementation of RawStore is specified and loaded in 
> hive-metastore.  One use case this change would serve would be one where a 
> user wishes to use an implementation of this interface without the dependency 
> on the Thrift server.
>   
> Thank you,
> Austin





[jira] [Updated] (HIVE-24001) Don't cache MapWork in tez/ObjectCache during query-based compaction

2020-08-13 Thread Karen Coppage (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karen Coppage updated HIVE-24001:
-
Fix Version/s: 4.0.0

> Don't cache MapWork in tez/ObjectCache during query-based compaction
> 
>
> Key: HIVE-24001
> URL: https://issues.apache.org/jira/browse/HIVE-24001
> Project: Hive
>  Issue Type: Bug
>Reporter: Karen Coppage
>Assignee: Karen Coppage
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Query-based major compaction can fail intermittently with the following issue:
> {code:java}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: One writer is 
> supposed to handle only one bucket. We saw these 2 different buckets: 1 and 6
>   at 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder.evaluate(GenericUDFValidateAcidSortOrder.java:77)
> {code}
> This is consistently preceded in the application log with:
> {code:java}
>  [INFO] [TezChild] |tez.ObjectCache|: Found 
> hive_20200804185133_f04cca69-fa30-4f1b-a5fe-80fc2d749f48_Map 1__MAP_PLAN__ in 
> cache with value: org.apache.hadoop.hive.ql.plan.MapWork@74652101
> {code}
> Alternatively, when MapRecordProcessor doesn't find mapWork in 
> tez/ObjectCache (but instead caches mapWork), major compaction succeeds.
> The failure happens because, if MapWork is reused, 
> GenericUDFValidateAcidSortOrder (which is called during compaction) is also 
> reused on splits belonging to two different buckets, which produces an error.
> Solution is to avoid storing MapWork in the ObjectCache during query-based 
> compaction.





[jira] [Resolved] (HIVE-24024) Improve logging around CompactionTxnHandler

2020-08-13 Thread Karen Coppage (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karen Coppage resolved HIVE-24024.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Submitted to master. Thanks [~lpinter] for the review!

> Improve logging around CompactionTxnHandler
> ---
>
> Key: HIVE-24024
> URL: https://issues.apache.org/jira/browse/HIVE-24024
> Project: Hive
>  Issue Type: Improvement
>Reporter: Karen Coppage
>Assignee: Karen Coppage
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> CompactionTxnHandler often doesn't log the preparedStatement parameters, 
> which is really painful when compaction isn't working the way it should. Also 
> expand logging around compaction Cleaner, Initiator, Worker. And some 
> formatting cleanup.





[jira] [Work logged] (HIVE-24024) Improve logging around CompactionTxnHandler

2020-08-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24024?focusedWorklogId=470194=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470194
 ]

ASF GitHub Bot logged work on HIVE-24024:
-

Author: ASF GitHub Bot
Created on: 13/Aug/20 11:59
Start Date: 13/Aug/20 11:59
Worklog Time Spent: 10m 
  Work Description: klcopp merged pull request #1389:
URL: https://github.com/apache/hive/pull/1389


   





Issue Time Tracking
---

Worklog Id: (was: 470194)
Time Spent: 20m  (was: 10m)

> Improve logging around CompactionTxnHandler
> ---
>
> Key: HIVE-24024
> URL: https://issues.apache.org/jira/browse/HIVE-24024
> Project: Hive
> Issue Type: Improvement
> Reporter: Karen Coppage
> Assignee: Karen Coppage
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> CompactionTxnHandler often doesn't log the preparedStatement parameters, 
> which is really painful when compaction isn't working the way it should. This 
> change also expands logging around the compaction Cleaner, Initiator, and 
> Worker, and includes some formatting cleanup.





[jira] [Updated] (HIVE-23993) Handle irrecoverable errors

2020-08-13 Thread Anishek Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anishek Agarwal updated HIVE-23993:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to master. Thanks for the patch, [~aasha], and for the review, [~pkumarsinha].

> Handle irrecoverable errors
> ---
>
> Key: HIVE-23993
> URL: https://issues.apache.org/jira/browse/HIVE-23993
> Project: Hive
> Issue Type: Task
> Reporter: Aasha Medhi
> Assignee: Aasha Medhi
> Priority: Major
> Labels: pull-request-available
> Attachments: HIVE-23993.01.patch, HIVE-23993.02.patch, 
> HIVE-23993.03.patch, HIVE-23993.04.patch, HIVE-23993.05.patch, 
> HIVE-23993.06.patch, HIVE-23993.07.patch, HIVE-23993.08.patch, Retry Logic 
> for Replication.pdf
>
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>






[jira] [Updated] (HIVE-24037) Parallelize hash table constructions in map joins

2020-08-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-24037:
--
Labels: pull-request-available  (was: )

> Parallelize hash table constructions in map joins
> -
>
> Key: HIVE-24037
> URL: https://issues.apache.org/jira/browse/HIVE-24037
> Project: Hive
> Issue Type: Improvement
> Reporter: Ramesh Kumar Thangarajan
> Assignee: Ramesh Kumar Thangarajan
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Parallelize hash table constructions in map joins





[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins

2020-08-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=470163&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470163
 ]

ASF GitHub Bot logged work on HIVE-24037:
-

Author: ASF GitHub Bot
Created on: 13/Aug/20 09:46
Start Date: 13/Aug/20 09:46
Worklog Time Spent: 10m 
  Work Description: ramesh0201 opened a new pull request #1401:
URL: https://github.com/apache/hive/pull/1401


   
   
   
   





Issue Time Tracking
---

Worklog Id: (was: 470163)
Remaining Estimate: 0h
Time Spent: 10m

> Parallelize hash table constructions in map joins
> -
>
> Key: HIVE-24037
> URL: https://issues.apache.org/jira/browse/HIVE-24037
> Project: Hive
> Issue Type: Improvement
> Reporter: Ramesh Kumar Thangarajan
> Assignee: Ramesh Kumar Thangarajan
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Parallelize hash table constructions in map joins





[jira] [Resolved] (HIVE-23981) Use task counter enum to get the approximate counter value

2020-08-13 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera resolved HIVE-23981.

Resolution: Fixed

> Use task counter enum to get the approximate counter value
> --
>
> Key: HIVE-23981
> URL: https://issues.apache.org/jira/browse/HIVE-23981
> Project: Hive
> Issue Type: Bug
> Reporter: mahesh kumar behera
> Assignee: mahesh kumar behera
> Priority: Major
> Labels: pull-request-available
>
> The value for APPROXIMATE_INPUT_RECORDS should be obtained using the enum 
> name instead of a static string. Once a Tez release with this information is 
> available, we should change it to 
> org.apache.tez.common.counters.TaskCounter.APPROXIMATE_INPUT_RECORDS.
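
As a rough illustration of the difference between the two approaches, the sketch below derives the counter name from an enum constant instead of repeating it as a string literal. The nested TaskCounter enum is a local, hypothetical stand-in declared only so the example compiles on its own; it is not the real org.apache.tez.common.counters.TaskCounter, and the class and method names are made up.

```java
public class CounterNameSketch {
    // Hypothetical stand-in for the Tez counter enum, declared locally
    // so this example does not depend on any Tez release.
    enum TaskCounter { APPROXIMATE_INPUT_RECORDS }

    // Static-string approach: compiles even if the counter is renamed
    // or removed, so a mismatch only shows up at runtime.
    static final String COUNTER_BY_LITERAL = "APPROXIMATE_INPUT_RECORDS";

    // Enum approach: the name comes from the constant itself, so a
    // rename or removal becomes a compile error instead.
    public static String counterByEnum() {
        return TaskCounter.APPROXIMATE_INPUT_RECORDS.name();
    }

    public static void main(String[] args) {
        System.out.println(COUNTER_BY_LITERAL.equals(counterByEnum()));
    }
}
```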





[jira] [Assigned] (HIVE-24037) Parallelize hash table constructions in map joins

2020-08-13 Thread Ramesh Kumar Thangarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramesh Kumar Thangarajan reassigned HIVE-24037:
---


> Parallelize hash table constructions in map joins
> -
>
> Key: HIVE-24037
> URL: https://issues.apache.org/jira/browse/HIVE-24037
> Project: Hive
> Issue Type: Improvement
> Reporter: Ramesh Kumar Thangarajan
> Assignee: Ramesh Kumar Thangarajan
> Priority: Major
>
> Parallelize hash table constructions in map joins





[jira] [Updated] (HIVE-23938) LLAP: JDK11 - some GC log file rotation related jvm arguments cannot be used anymore

2020-08-13 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HIVE-23938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated HIVE-23938:

Attachment: gc_2020-07-29-12.jdk8.log

> LLAP: JDK11 - some GC log file rotation related jvm arguments cannot be used 
> anymore
> 
>
> Key: HIVE-23938
> URL: https://issues.apache.org/jira/browse/HIVE-23938
> Project: Hive
> Issue Type: Improvement
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Attachments: gc_2020-07-27-13.log, gc_2020-07-29-12.jdk8.log
>
>
> https://github.com/apache/hive/blob/master/llap-server/bin/runLlapDaemon.sh#L55
> {code}
> JAVA_OPTS_BASE="-server -Djava.net.preferIPv4Stack=true -XX:+UseNUMA 
> -XX:+PrintGCDetails -verbose:gc -XX:+UseGCLogFileRotation 
> -XX:NumberOfGCLogFiles=4 -XX:GCLogFileSize=100M -XX:+PrintGCDateStamps"
> {code}
> on JDK11 I got something like:
> {code}
> + exec /usr/lib/jvm/jre-11-openjdk/bin/java -Dproc_llapdaemon -Xms32000m 
> -Xmx64000m -Dhttp.maxConnections=17 -XX:+UseG1GC -XX:+ResizeTLAB -XX:+UseNUMA 
> -XX:+AggressiveOpts -XX:MetaspaceSize=1024m 
> -XX:InitiatingHeapOccupancyPercent=80 -XX:MaxGCPauseMillis=200 
> -XX:+PreserveFramePointer -XX:AllocatePrefetchStyle=2 
> -Dhttp.maxConnections=10 -Dasync.profiler.home=/grid/0/async-profiler -server 
> -Djava.net.preferIPv4Stack=true -XX:+UseNUMA -XX:+PrintGCDetails -verbose:gc 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=4 -XX:GCLogFileSize=100M 
> -XX:+PrintGCDateStamps 
> -Xloggc:/grid/2/yarn/container-logs/application_1595375468459_0113/container_e26_1595375468459_0113_01_09/gc_2020-07-27-12.log
>  
> ... 
> org.apache.hadoop.hive.llap.daemon.impl.LlapDaemon
> OpenJDK 64-Bit Server VM warning: Option AggressiveOpts was deprecated in 
> version 11.0 and will likely be removed in a future release.
> Unrecognized VM option 'UseGCLogFileRotation'
> Error: Could not create the Java Virtual Machine.
> Error: A fatal exception has occurred. Program will exit.
> {code}
> These are not valid in JDK11:
> {code}
> -XX:+UseGCLogFileRotation
> -XX:NumberOfGCLogFiles
> -XX:GCLogFileSize
> -XX:+PrintGCTimeStamps
> -XX:+PrintGCDateStamps
> {code}
> Instead something like:
> {code}
> -Xlog:gc*,safepoint:gc.log:time,uptime:filecount=4,filesize=100M
> {code}
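
One way to act on this is to pick the GC logging arguments from the Java version before building the command line. The sketch below is only an illustration: GcLogFlags and forFeatureVersion are hypothetical names, not part of Hive or runLlapDaemon.sh. The cutoff at version 9 is an assumption based on when unified JVM logging (-Xlog) was introduced; the issue itself only reports the failure on JDK11. The flag strings are the ones quoted in the issue.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: chooses GC logging JVM arguments by Java feature version.
public class GcLogFlags {
    public static List<String> forFeatureVersion(int featureVersion, String logFile) {
        if (featureVersion >= 9) {
            // Unified logging: rotation is configured via filecount/filesize.
            return Arrays.asList(
                "-Xlog:gc*,safepoint:" + logFile + ":time,uptime:filecount=4,filesize=100M");
        }
        // Legacy flags, as in JAVA_OPTS_BASE, for Java 8 and earlier.
        return Arrays.asList(
            "-XX:+PrintGCDetails", "-verbose:gc", "-XX:+PrintGCDateStamps",
            "-XX:+UseGCLogFileRotation", "-XX:NumberOfGCLogFiles=4",
            "-XX:GCLogFileSize=100M", "-Xloggc:" + logFile);
    }

    public static void main(String[] args) {
        System.out.println("JDK 8:  " + String.join(" ", forFeatureVersion(8, "gc.log")));
        System.out.println("JDK 11: " + String.join(" ", forFeatureVersion(11, "gc.log")));
    }
}
```

In a launcher script the same branch would be taken on the detected Java version; the Java form is used here only to keep the logic testable.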





[jira] [Commented] (HIVE-20593) Load Data for partitioned ACID tables fails with bucketId out of range: -1

2020-08-13 Thread Bernard (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-20593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176849#comment-17176849
 ] 

Bernard commented on HIVE-20593:


Hi,

Is there a workaround for this one without updating Hive?
We've tried recreating the table but we're still getting this error.

Thanks,
Bernard

> Load Data for partitioned ACID tables fails with bucketId out of range: -1
> --
>
> Key: HIVE-20593
> URL: https://issues.apache.org/jira/browse/HIVE-20593
> Project: Hive
> Issue Type: Bug
> Components: Transactions
> Affects Versions: 3.1.0
> Reporter: Deepak Jaiswal
> Assignee: Deepak Jaiswal
> Priority: Major
> Fix For: 4.0.0, 3.2.0, 3.1.2
>
> Attachments: HIVE-20593.1.patch, HIVE-20593.2.patch, 
> HIVE-20593.3.patch
>
>
> Load data for ACID tables is failing to load ORC files when it is converted 
> to an IAS job.
>  
> The tempTblObj is inherited from the target table. However, the only table 
> property which needs to be inherited is the bucketing version. Properties such 
> as transactional should be ignored.
>  





[jira] [Updated] (HIVE-18284) NPE when inserting data with 'distribute by' clause with dynpart sort optimization

2020-08-13 Thread Syed Shameerur Rahman (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-18284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Syed Shameerur Rahman updated HIVE-18284:
-
Environment: (was: EMR)

> NPE when inserting data with 'distribute by' clause with dynpart sort 
> optimization
> --
>
> Key: HIVE-18284
> URL: https://issues.apache.org/jira/browse/HIVE-18284
> Project: Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 2.3.1, 2.3.2
> Reporter: Aki Tanaka
> Assignee: Syed Shameerur Rahman
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> A Null Pointer Exception occurs when inserting data with 'distribute by' 
> clause. The following snippet query reproduces this issue:
> *(non-vectorized , non-llap mode)*
> {code:java}
> create table table1 (col1 string, datekey int);
> insert into table1 values ('ROW1', 1), ('ROW2', 2), ('ROW3', 1);
> create table table2 (col1 string) partitioned by (datekey int);
> set hive.vectorized.execution.enabled=false;
> set hive.optimize.sort.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> insert into table table2
> PARTITION(datekey)
> select col1,
> datekey
> from table1
> distribute by datekey ;
> {code}
> I could run the insert query without the error if I remove Distribute By  or 
> use Cluster By clause.
> It seems that the issue happens because Distribute By does not guarantee 
> clustering or sorting properties on the distributed keys.
> FileSinkOperator removes the previous fsp, which might be re-used when we use 
> Distribute By.
> https://github.com/apache/hive/blob/branch-2.3/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L972
> The following stack trace is logged.
> {code:java}
> Vertex failed, vertexName=Reducer 2, vertexId=vertex_1513111717879_0056_1_01, 
> diagnostics=[Task failed, taskId=task_1513111717879_0056_1_01_00, 
> diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( 
> failure ) : 
> attempt_1513111717879_0056_1_01_00_0:java.lang.RuntimeException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while 
> processing row (tag=0) {"key":{},"value":{"_col0":"ROW3","_col1":1}}
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime 
> Error while processing row (tag=0) 
> {"key":{},"value":{"_col0":"ROW3","_col1":1}}
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:365)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:250)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:317)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185)
>   ... 14 more
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:762)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
>   at 
> org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:356)
>   ... 17 more
> {code}
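
The interaction described in the issue can be shown with a toy model. This is not FileSinkOperator code: SingleWriterSinkSketch and its methods are hypothetical, and the key order 1, 2, 1 simply mirrors the datekey values of ROW1, ROW2 and ROW3 in the reproduction (whether a reducer actually sees them in that order is an assumption). The model keeps only the writer for the current partition key and drops it when the key changes. That is safe only if equal keys arrive together, which Cluster By provides and Distribute By does not.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of a sink that keeps one open writer and drops the previous one
// whenever the partition key changes. Hypothetical; not Hive's implementation.
public class SingleWriterSinkSketch {
    // Counts how often a writer has to be opened for the given key sequence.
    public static int countWriterOpens(int[] partitionKeys) {
        int opens = 0;
        boolean hasCurrent = false;
        int current = 0;
        for (int key : partitionKeys) {
            if (!hasCurrent || key != current) {
                opens++;            // previous writer is dropped here
                current = key;
                hasCurrent = true;
            }
        }
        return opens;
    }

    public static int countDistinct(int[] keys) {
        Set<Integer> seen = new HashSet<>();
        for (int k : keys) {
            seen.add(k);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        int[] distributed = {1, 2, 1}; // keys interleaved, as Distribute By allows
        int[] clustered = {1, 1, 2};   // equal keys adjacent, as with Cluster By
        System.out.println(countWriterOpens(distributed) + " opens for "
            + countDistinct(distributed) + " keys");
        System.out.println(countWriterOpens(clustered) + " opens for "
            + countDistinct(clustered) + " keys");
    }
}
```

When the number of opens exceeds the number of distinct keys, some key came back after its writer was dropped, which corresponds to the situation the description blames for the NullPointerException.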




[jira] [Updated] (HIVE-18284) NPE when inserting data with 'distribute by' clause with dynpart sort optimization

2020-08-13 Thread Syed Shameerur Rahman (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-18284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Syed Shameerur Rahman updated HIVE-18284:
-
Description: 
A Null Pointer Exception occurs when inserting data with 'distribute by' 
clause. The following snippet query reproduces this issue:
*(non-vectorized , non-llap mode)*

{code:java}
create table table1 (col1 string, datekey int);
insert into table1 values ('ROW1', 1), ('ROW2', 2), ('ROW3', 1);
create table table2 (col1 string) partitioned by (datekey int);

set hive.vectorized.execution.enabled=false;
set hive.optimize.sort.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table table2
PARTITION(datekey)
select col1,
datekey
from table1
distribute by datekey ;
{code}

I could run the insert query without the error if I remove Distribute By  or 
use Cluster By clause.
It seems that the issue happens because Distribute By does not guarantee 
clustering or sorting properties on the distributed keys.

FileSinkOperator removes the previous fsp, which might be re-used when we use 
Distribute By.
https://github.com/apache/hive/blob/branch-2.3/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L972

The following stack trace is logged.

{code:java}
Vertex failed, vertexName=Reducer 2, vertexId=vertex_1513111717879_0056_1_01, 
diagnostics=[Task failed, taskId=task_1513111717879_0056_1_01_00, 
diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( 
failure ) : 
attempt_1513111717879_0056_1_01_00_0:java.lang.RuntimeException: 
org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while 
processing row (tag=0) {"key":{},"value":{"_col0":"ROW3","_col1":1}}
at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211)
at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
at 
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
at 
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
at 
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at 
org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
at 
org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error 
while processing row (tag=0) {"key":{},"value":{"_col0":"ROW3","_col1":1}}
at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:365)
at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:250)
at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:317)
at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185)
... 14 more
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:762)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
at 
org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:356)
... 17 more
{code}



  was:
A Null Pointer Exception occurs when inserting data with 'distribute by' 
clause. The following snippet query reproduces this issue:


{code:java}
create table table1 (col1 string, datekey int);
insert into table1 values ('ROW1', 1), ('ROW2', 2), ('ROW3', 1);
create table table2 (col1 string) partitioned by (datekey int);

set hive.vectorized.execution.enabled=false;
set hive.optimize.sort.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table table2
PARTITION(datekey)
select col1,
datekey
from table1
distribute by datekey ;
{code}

I could run the insert query without the error if I remove Distribute By  or 
use Cluster By clause.
It seems that the issue happens because Distribute By does not guarantee 
clustering or sorting properties on the 

[jira] [Commented] (HIVE-18284) NPE when inserting data with 'distribute by' clause with dynpart sort optimization

2020-08-13 Thread Syed Shameerur Rahman (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-18284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176832#comment-17176832
 ] 

Syed Shameerur Rahman commented on HIVE-18284:
--

*PR:* https://github.com/apache/hive/pull/1400
[~jcamachorodriguez] can you please review?

Thanks!

> NPE when inserting data with 'distribute by' clause with dynpart sort 
> optimization
> --
>
> Key: HIVE-18284
> URL: https://issues.apache.org/jira/browse/HIVE-18284
> Project: Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 2.3.1, 2.3.2
> Environment: EMR
> Reporter: Aki Tanaka
> Assignee: Syed Shameerur Rahman
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> A Null Pointer Exception occurs when inserting data with 'distribute by' 
> clause. The following snippet query reproduces this issue:
> {code:java}
> create table table1 (col1 string, datekey int);
> insert into table1 values ('ROW1', 1), ('ROW2', 2), ('ROW3', 1);
> create table table2 (col1 string) partitioned by (datekey int);
> set hive.exec.dynamic.partition.mode=nonstrict;
> insert into table table2
> PARTITION(datekey)
> select col1,
> datekey
> from table1
> distribute by datekey ;
> {code}
> I could run the insert query without the error if I remove Distribute By  or 
> use Cluster By clause.
> It seems that the issue happens because Distribute By does not guarantee 
> clustering or sorting properties on the distributed keys.
> FileSinkOperator removes the previous fsp, which might be re-used when we use 
> Distribute By.
> https://github.com/apache/hive/blob/branch-2.3/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L972
> The following stack trace is logged.
> {code:java}
> Vertex failed, vertexName=Reducer 2, vertexId=vertex_1513111717879_0056_1_01, 
> diagnostics=[Task failed, taskId=task_1513111717879_0056_1_01_00, 
> diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( 
> failure ) : 
> attempt_1513111717879_0056_1_01_00_0:java.lang.RuntimeException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while 
> processing row (tag=0) {"key":{},"value":{"_col0":"ROW3","_col1":1}}
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime 
> Error while processing row (tag=0) 
> {"key":{},"value":{"_col0":"ROW3","_col1":1}}
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:365)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:250)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:317)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185)
>   ... 14 more
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:762)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
>   at 
> org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:356)
>   ... 17 more
> {code}




[jira] [Updated] (HIVE-18284) NPE when inserting data with 'distribute by' clause with dynpart sort optimization

2020-08-13 Thread Syed Shameerur Rahman (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-18284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Syed Shameerur Rahman updated HIVE-18284:
-
Description: 
A Null Pointer Exception occurs when inserting data with 'distribute by' 
clause. The following snippet query reproduces this issue:


{code:java}
create table table1 (col1 string, datekey int);
insert into table1 values ('ROW1', 1), ('ROW2', 2), ('ROW3', 1);
create table table2 (col1 string) partitioned by (datekey int);

set hive.vectorized.execution.enabled=false;
set hive.optimize.sort.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table table2
PARTITION(datekey)
select col1,
datekey
from table1
distribute by datekey ;
{code}

I could run the insert query without the error if I remove Distribute By  or 
use Cluster By clause.
It seems that the issue happens because Distribute By does not guarantee 
clustering or sorting properties on the distributed keys.

FileSinkOperator removes the previous fsp, which might be re-used when we use 
Distribute By.
https://github.com/apache/hive/blob/branch-2.3/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L972

The following stack trace is logged.

{code:java}
Vertex failed, vertexName=Reducer 2, vertexId=vertex_1513111717879_0056_1_01, 
diagnostics=[Task failed, taskId=task_1513111717879_0056_1_01_00, 
diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( 
failure ) : 
attempt_1513111717879_0056_1_01_00_0:java.lang.RuntimeException: 
org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while 
processing row (tag=0) {"key":{},"value":{"_col0":"ROW3","_col1":1}}
at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211)
at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
at 
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
at 
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
at 
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at 
org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
at 
org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error 
while processing row (tag=0) {"key":{},"value":{"_col0":"ROW3","_col1":1}}
at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:365)
at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:250)
at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:317)
at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185)
... 14 more
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:762)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
at 
org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:356)
... 17 more
{code}



  was:
A Null Pointer Exception occurs when inserting data with 'distribute by' 
clause. The following snippet query reproduces this issue:


{code:java}
create table table1 (col1 string, datekey int);
insert into table1 values ('ROW1', 1), ('ROW2', 2), ('ROW3', 1);
create table table2 (col1 string) partitioned by (datekey int);

set hive.exec.dynamic.partition.mode=nonstrict;
insert into table table2
PARTITION(datekey)
select col1,
datekey
from table1
distribute by datekey ;
{code}

I could run the insert query without the error if I remove Distribute By  or 
use Cluster By clause.
It seems that the issue happens because Distribute By does not guarantee 
clustering or sorting properties on the distributed keys.

FileSinkOperator removes the previous fsp. FileSinkOperator will remove the 
previous fsp which might be 

[jira] [Updated] (HIVE-18284) NPE when inserting data with 'distribute by' clause with dynpart sort optimization

2020-08-13 Thread Syed Shameerur Rahman (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-18284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Syed Shameerur Rahman updated HIVE-18284:
-
Summary: NPE when inserting data with 'distribute by' clause with dynpart 
sort optimization  (was: NPE when inserting data with 'distribute by' clause)

> NPE when inserting data with 'distribute by' clause with dynpart sort 
> optimization
> --
>
> Key: HIVE-18284
> URL: https://issues.apache.org/jira/browse/HIVE-18284
> Project: Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 2.3.1, 2.3.2
> Environment: EMR
> Reporter: Aki Tanaka
> Assignee: Syed Shameerur Rahman
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> A Null Pointer Exception occurs when inserting data with 'distribute by' 
> clause. The following snippet query reproduces this issue:
> {code:java}
> create table table1 (col1 string, datekey int);
> insert into table1 values ('ROW1', 1), ('ROW2', 2), ('ROW3', 1);
> create table table2 (col1 string) partitioned by (datekey int);
> set hive.exec.dynamic.partition.mode=nonstrict;
> insert into table table2
> PARTITION(datekey)
> select col1,
> datekey
> from table1
> distribute by datekey ;
> {code}
> I could run the insert query without the error if I remove Distribute By  or 
> use Cluster By clause.
> It seems that the issue happens because Distribute By does not guarantee 
> clustering or sorting properties on the distributed keys.
> FileSinkOperator removes the previous fsp, which might be re-used when we use 
> Distribute By.
> https://github.com/apache/hive/blob/branch-2.3/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L972
> The following stack trace is logged.
> {code:java}
> Vertex failed, vertexName=Reducer 2, vertexId=vertex_1513111717879_0056_1_01, 
> diagnostics=[Task failed, taskId=task_1513111717879_0056_1_01_00, 
> diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( 
> failure ) : 
> attempt_1513111717879_0056_1_01_00_0:java.lang.RuntimeException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while 
> processing row (tag=0) {"key":{},"value":{"_col0":"ROW3","_col1":1}}
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime 
> Error while processing row (tag=0) 
> {"key":{},"value":{"_col0":"ROW3","_col1":1}}
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:365)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:250)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:317)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185)
>   ... 14 more
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:762)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
>   at 
> org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:356)
>   ... 17 more
> {code}




[jira] [Assigned] (HIVE-18284) NPE when inserting data with 'distribute by' clause

2020-08-13 Thread Syed Shameerur Rahman (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-18284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Syed Shameerur Rahman reassigned HIVE-18284:


Assignee: Syed Shameerur Rahman  (was: Lynch Lee)

> NPE when inserting data with 'distribute by' clause
> ---
>
> Key: HIVE-18284
> URL: https://issues.apache.org/jira/browse/HIVE-18284
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 2.3.1, 2.3.2
> Environment: EMR
>Reporter: Aki Tanaka
>Assignee: Syed Shameerur Rahman
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A Null Pointer Exception occurs when inserting data with 'distribute by' 
> clause. The following snippet query reproduces this issue:
> {code:java}
> create table table1 (col1 string, datekey int);
> insert into table1 values ('ROW1', 1), ('ROW2', 2), ('ROW3', 1);
> create table table2 (col1 string) partitioned by (datekey int);
> set hive.exec.dynamic.partition.mode=nonstrict;
> insert into table table2
> PARTITION(datekey)
> select col1,
> datekey
> from table1
> distribute by datekey ;
> {code}
> I could run the insert query without the error if I remove Distribute By  or 
> use Cluster By clause.
> It seems that the issue happens because Distribute By does not guarantee 
> clustering or sorting properties on the distributed keys.
> FileSinkOperator removes the previous fsp, which might be re-used when we 
> use Distribute By.
> https://github.com/apache/hive/blob/branch-2.3/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L972
> The following stack trace is logged.
> {code:java}
> Vertex failed, vertexName=Reducer 2, vertexId=vertex_1513111717879_0056_1_01, 
> diagnostics=[Task failed, taskId=task_1513111717879_0056_1_01_00, 
> diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( 
> failure ) : 
> attempt_1513111717879_0056_1_01_00_0:java.lang.RuntimeException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while 
> processing row (tag=0) {"key":{},"value":{"_col0":"ROW3","_col1":1}}
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime 
> Error while processing row (tag=0) 
> {"key":{},"value":{"_col0":"ROW3","_col1":1}}
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:365)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:250)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:317)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185)
>   ... 14 more
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:762)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
>   at 
> org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:356)
>   ... 17 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-18284) NPE when inserting data with 'distribute by' clause

2020-08-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-18284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-18284:
--
Labels: pull-request-available  (was: )

> NPE when inserting data with 'distribute by' clause
> ---
>
> Key: HIVE-18284
> URL: https://issues.apache.org/jira/browse/HIVE-18284
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 2.3.1, 2.3.2
> Environment: EMR
>Reporter: Aki Tanaka
>Assignee: Lynch Lee
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A Null Pointer Exception occurs when inserting data with 'distribute by' 
> clause. The following snippet query reproduces this issue:
> {code:java}
> create table table1 (col1 string, datekey int);
> insert into table1 values ('ROW1', 1), ('ROW2', 2), ('ROW3', 1);
> create table table2 (col1 string) partitioned by (datekey int);
> set hive.exec.dynamic.partition.mode=nonstrict;
> insert into table table2
> PARTITION(datekey)
> select col1,
> datekey
> from table1
> distribute by datekey ;
> {code}
> I could run the insert query without the error if I remove Distribute By  or 
> use Cluster By clause.
> It seems that the issue happens because Distribute By does not guarantee 
> clustering or sorting properties on the distributed keys.
> FileSinkOperator removes the previous fsp, which might be re-used when we 
> use Distribute By.
> https://github.com/apache/hive/blob/branch-2.3/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L972
> The following stack trace is logged.
> {code:java}
> Vertex failed, vertexName=Reducer 2, vertexId=vertex_1513111717879_0056_1_01, 
> diagnostics=[Task failed, taskId=task_1513111717879_0056_1_01_00, 
> diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( 
> failure ) : 
> attempt_1513111717879_0056_1_01_00_0:java.lang.RuntimeException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while 
> processing row (tag=0) {"key":{},"value":{"_col0":"ROW3","_col1":1}}
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime 
> Error while processing row (tag=0) 
> {"key":{},"value":{"_col0":"ROW3","_col1":1}}
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:365)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:250)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:317)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185)
>   ... 14 more
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:762)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
>   at 
> org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:356)
>   ... 17 more
> {code}





[jira] [Work logged] (HIVE-18284) NPE when inserting data with 'distribute by' clause

2020-08-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-18284?focusedWorklogId=470133=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470133
 ]

ASF GitHub Bot logged work on HIVE-18284:
-

Author: ASF GitHub Bot
Created on: 13/Aug/20 08:18
Start Date: 13/Aug/20 08:18
Worklog Time Spent: 10m 
  Work Description: shameersss1 opened a new pull request #1400:
URL: https://github.com/apache/hive/pull/1400


   …
   
   
   
   ### What changes were proposed in this pull request?
   
   When hive.optimize.sort.dynamic.partition=true, we expect the keys to be 
sorted on the reducer side so that reducers can keep only one record writer 
open at any time, thereby reducing the memory pressure on the reducers 
(HIVE-6455). But in case of non-vectorized, non-LLAP execution the keys are 
not sorted and the query fails with an NPE.
   Refer: 
https://issues.apache.org/jira/browse/HIVE-18284?focusedCommentId=17173124=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17173124
   
   Caused by: https://issues.apache.org/jira/browse/HIVE-13260
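
   To illustrate why the one-open-writer strategy depends on sorted keys, here is a minimal sketch. It is a simplified, hypothetical model and not the actual FileSinkOperator code: the `Holder` class, the `cache` map and the `firstStaleRow` method are all names invented for illustration. It assumes, as the issue description suggests, that the previous writer is closed when the partition key changes while its holder stays cached and may be looked up again later.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SingleWriterSketch {
  /** Hypothetical stand-in for a per-partition writer holder. */
  static final class Holder {
    StringBuilder writer = new StringBuilder(); // set to null once "closed"
  }

  /** Returns the index of the first row that hits a closed writer, or -1 if none does. */
  static int firstStaleRow(List<Integer> partitionKeys) {
    Map<Integer, Holder> cache = new HashMap<>();
    Holder prev = null;
    for (int row = 0; row < partitionKeys.size(); row++) {
      int key = partitionKeys.get(row);
      Holder current = cache.get(key);
      if (current == null) {
        current = new Holder();
        cache.put(key, current);
      }
      if (prev != null && prev != current) {
        // Key changed: close the previous writer, but its holder stays in the cache.
        prev.writer = null;
      }
      if (current.writer == null) {
        // A real implementation would dereference the writer here and throw an NPE.
        return row;
      }
      current.writer.append(key);
      prev = current;
    }
    return -1;
  }

  public static void main(String[] args) {
    System.out.println(firstStaleRow(List.of(1, 1, 2))); // sorted keys
    System.out.println(firstStaleRow(List.of(1, 2, 1))); // unsorted keys, as in the repro
  }
}
```

   With sorted keys each writer is finished before the next one opens, so no closed writer is revisited. With the key order 1, 2, 1 from the reproduction query, the third row reaches a holder whose writer was already closed.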
   
   ### Why are the changes needed?
   
   Changes are required in ReduceSinkDeduplication to properly merge the child 
reduce sink operator and the parent reduce sink operator.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Added a qtest





Issue Time Tracking
---

Worklog Id: (was: 470133)
Remaining Estimate: 0h
Time Spent: 10m

> NPE when inserting data with 'distribute by' clause
> ---
>
> Key: HIVE-18284
> URL: https://issues.apache.org/jira/browse/HIVE-18284
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 2.3.1, 2.3.2
> Environment: EMR
>Reporter: Aki Tanaka
>Assignee: Lynch Lee
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A Null Pointer Exception occurs when inserting data with 'distribute by' 
> clause. The following snippet query reproduces this issue:
> {code:java}
> create table table1 (col1 string, datekey int);
> insert into table1 values ('ROW1', 1), ('ROW2', 2), ('ROW3', 1);
> create table table2 (col1 string) partitioned by (datekey int);
> set hive.exec.dynamic.partition.mode=nonstrict;
> insert into table table2
> PARTITION(datekey)
> select col1,
> datekey
> from table1
> distribute by datekey ;
> {code}
> I could run the insert query without the error if I remove Distribute By  or 
> use Cluster By clause.
> It seems that the issue happens because Distribute By does not guarantee 
> clustering or sorting properties on the distributed keys.
> FileSinkOperator removes the previous fsp, which might be re-used when we 
> use Distribute By.
> https://github.com/apache/hive/blob/branch-2.3/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L972
> The following stack trace is logged.
> {code:java}
> Vertex failed, vertexName=Reducer 2, vertexId=vertex_1513111717879_0056_1_01, 
> diagnostics=[Task failed, taskId=task_1513111717879_0056_1_01_00, 
> diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( 
> failure ) : 
> attempt_1513111717879_0056_1_01_00_0:java.lang.RuntimeException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while 
> processing row (tag=0) {"key":{},"value":{"_col0":"ROW3","_col1":1}}
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
>   at 

[jira] [Work logged] (HIVE-23880) Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge

2020-08-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23880?focusedWorklogId=470126=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470126
 ]

ASF GitHub Bot logged work on HIVE-23880:
-

Author: ASF GitHub Bot
Created on: 13/Aug/20 07:55
Start Date: 13/Aug/20 07:55
Worklog Time Spent: 10m 
  Work Description: abstractdog commented on a change in pull request #1280:
URL: https://github.com/apache/hive/pull/1280#discussion_r469765404



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/aggregates/VectorUDAFBloomFilterMerge.java
##
@@ -77,6 +75,211 @@ public void reset() {
   // Do not change the initial bytes which contain 
NumHashFunctions/NumBits!
   Arrays.fill(bfBytes, BloomKFilter.START_OF_SERIALIZED_LONGS, 
bfBytes.length, (byte) 0);
 }
+
+public boolean mergeBloomFilterBytesFromInputColumn(BytesColumnVector 
inputColumn,
+int batchSize, boolean selectedInUse, int[] selected, Configuration 
conf) {
+  // already set in previous iterations, no need to call initExecutor again
+  if (numThreads == 0) {
+return false;
+  }
+  if (executor == null) {
+initExecutor(conf, batchSize);
+if (!isParallel) {
+  return false;
+}
+  }
+
+  // split every bloom filter (represented by a part of a byte[]) across 
workers
+  for (int j = 0; j < batchSize; j++) {
+if (!selectedInUse && inputColumn.noNulls) {
+  splitVectorAcrossWorkers(workers, inputColumn.vector[j], 
inputColumn.start[j],
+  inputColumn.length[j]);
+} else if (!selectedInUse) {
+  if (!inputColumn.isNull[j]) {
+splitVectorAcrossWorkers(workers, inputColumn.vector[j], 
inputColumn.start[j],
+inputColumn.length[j]);
+  }
+} else if (inputColumn.noNulls) {
+  int i = selected[j];
+  splitVectorAcrossWorkers(workers, inputColumn.vector[i], 
inputColumn.start[i],
+  inputColumn.length[i]);
+} else {
+  int i = selected[j];
+  if (!inputColumn.isNull[i]) {
+splitVectorAcrossWorkers(workers, inputColumn.vector[i], 
inputColumn.start[i],
+inputColumn.length[i]);
+  }
+}
+  }
+
+  return true;
+}
+
+private void initExecutor(Configuration conf, int batchSize) {
+  numThreads = 
conf.getInt(HiveConf.ConfVars.TEZ_BLOOM_FILTER_MERGE_THREADS.varname,
+  HiveConf.ConfVars.TEZ_BLOOM_FILTER_MERGE_THREADS.defaultIntVal);
+  LOG.info("Number of threads used for bloom filter merge: {}", 
numThreads);
+
+  if (numThreads < 0) {
+throw new RuntimeException(
+"invalid number of threads for bloom filter merge: " + numThreads);
+  }
+  if (numThreads == 0) { // disable parallel feature
+return; // this will leave isParallel=false
+  }
+  isParallel = true;
+  executor = Executors.newFixedThreadPool(numThreads);
+
+  workers = new BloomFilterMergeWorker[numThreads];
+  for (int f = 0; f < numThreads; f++) {
+workers[f] = new BloomFilterMergeWorker(bfBytes, 0, bfBytes.length);
+  }
+
+  for (int f = 0; f < numThreads; f++) {
+executor.submit(workers[f]);
+  }
+}
+
+public int getNumberOfWaitingMergeTasks(){
+  int size = 0;
+  for (BloomFilterMergeWorker w : workers){
+size += w.queue.size();
+  }
+  return size;
+}
+
+public int getNumberOfMergingWorkers() {
+  int working = 0;
+  for (BloomFilterMergeWorker w : workers) {
+if (w.isMerging.get()) {
+  working += 1;
+}
+  }
+  return working;
+}
+
+private static void splitVectorAcrossWorkers(BloomFilterMergeWorker[] 
workers, byte[] bytes,
+int start, int length) {
+  if (bytes == null || length == 0) {
+return;
+  }
+  /*
+   * This will split a byte[] across workers as below:
+   * let's say there are 10 workers for 7813 bytes, in this case
+   * length: 7813, elementPerBatch: 781
+   * bytes assigned to workers: inclusive lower bound, exclusive upper 
bound
+   * 1. worker: 5 -> 786
+   * 2. worker: 786 -> 1567
+   * 3. worker: 1567 -> 2348
+   * 4. worker: 2348 -> 3129
+   * 5. worker: 3129 -> 3910
+   * 6. worker: 3910 -> 4691
+   * 7. worker: 4691 -> 5472
+   * 8. worker: 5472 -> 6253
+   * 9. worker: 6253 -> 7034
+   * 10. worker: 7034 -> 7813 (last element per batch is: 779)
+   *
+   * This way, a particular worker will be given with the same part
+   * of all bloom filters along with the shared base bloom filter,
+   * so the bitwise OR function will not be a subject of threading/sync 
issues.
+   */
+  int elementPerBatch =
+  (int) Math.ceil((double) (length - 

[jira] [Work logged] (HIVE-23880) Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge

2020-08-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23880?focusedWorklogId=470123=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470123
 ]

ASF GitHub Bot logged work on HIVE-23880:
-

Author: ASF GitHub Bot
Created on: 13/Aug/20 07:50
Start Date: 13/Aug/20 07:50
Worklog Time Spent: 10m 
  Work Description: abstractdog commented on a change in pull request #1280:
URL: https://github.com/apache/hive/pull/1280#discussion_r469762209



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/aggregates/VectorUDAFBloomFilterMerge.java
##
@@ -77,6 +75,211 @@ public void reset() {
   // Do not change the initial bytes which contain 
NumHashFunctions/NumBits!
   Arrays.fill(bfBytes, BloomKFilter.START_OF_SERIALIZED_LONGS, 
bfBytes.length, (byte) 0);
 }
+
+public boolean mergeBloomFilterBytesFromInputColumn(BytesColumnVector 
inputColumn,
+int batchSize, boolean selectedInUse, int[] selected, Configuration 
conf) {
+  // already set in previous iterations, no need to call initExecutor again
+  if (numThreads == 0) {
+return false;
+  }
+  if (executor == null) {
+initExecutor(conf, batchSize);
+if (!isParallel) {
+  return false;
+}
+  }
+
+  // split every bloom filter (represented by a part of a byte[]) across 
workers
+  for (int j = 0; j < batchSize; j++) {
+if (!selectedInUse && inputColumn.noNulls) {
+  splitVectorAcrossWorkers(workers, inputColumn.vector[j], 
inputColumn.start[j],
+  inputColumn.length[j]);
+} else if (!selectedInUse) {
+  if (!inputColumn.isNull[j]) {
+splitVectorAcrossWorkers(workers, inputColumn.vector[j], 
inputColumn.start[j],
+inputColumn.length[j]);
+  }
+} else if (inputColumn.noNulls) {
+  int i = selected[j];
+  splitVectorAcrossWorkers(workers, inputColumn.vector[i], 
inputColumn.start[i],
+  inputColumn.length[i]);
+} else {
+  int i = selected[j];
+  if (!inputColumn.isNull[i]) {
+splitVectorAcrossWorkers(workers, inputColumn.vector[i], 
inputColumn.start[i],
+inputColumn.length[i]);
+  }
+}
+  }
+
+  return true;
+}
+
+private void initExecutor(Configuration conf, int batchSize) {
+  numThreads = 
conf.getInt(HiveConf.ConfVars.TEZ_BLOOM_FILTER_MERGE_THREADS.varname,
+  HiveConf.ConfVars.TEZ_BLOOM_FILTER_MERGE_THREADS.defaultIntVal);
+  LOG.info("Number of threads used for bloom filter merge: {}", 
numThreads);
+
+  if (numThreads < 0) {
+throw new RuntimeException(
+"invalid number of threads for bloom filter merge: " + numThreads);
+  }
+  if (numThreads == 0) { // disable parallel feature
+return; // this will leave isParallel=false
+  }
+  isParallel = true;
+  executor = Executors.newFixedThreadPool(numThreads);
+
+  workers = new BloomFilterMergeWorker[numThreads];
+  for (int f = 0; f < numThreads; f++) {
+workers[f] = new BloomFilterMergeWorker(bfBytes, 0, bfBytes.length);
+  }
+
+  for (int f = 0; f < numThreads; f++) {
+executor.submit(workers[f]);
+  }
+}
+
+public int getNumberOfWaitingMergeTasks(){
+  int size = 0;
+  for (BloomFilterMergeWorker w : workers){
+size += w.queue.size();
+  }
+  return size;
+}
+
+public int getNumberOfMergingWorkers() {
+  int working = 0;
+  for (BloomFilterMergeWorker w : workers) {
+if (w.isMerging.get()) {
+  working += 1;
+}
+  }
+  return working;
+}
+
+private static void splitVectorAcrossWorkers(BloomFilterMergeWorker[] 
workers, byte[] bytes,
+int start, int length) {
+  if (bytes == null || length == 0) {
+return;
+  }
+  /*
+   * This will split a byte[] across workers as below:
+   * let's say there are 10 workers for 7813 bytes, in this case
+   * length: 7813, elementPerBatch: 781
+   * bytes assigned to workers: inclusive lower bound, exclusive upper 
bound
+   * 1. worker: 5 -> 786
+   * 2. worker: 786 -> 1567
+   * 3. worker: 1567 -> 2348
+   * 4. worker: 2348 -> 3129
+   * 5. worker: 3129 -> 3910
+   * 6. worker: 3910 -> 4691
+   * 7. worker: 4691 -> 5472
+   * 8. worker: 5472 -> 6253
+   * 9. worker: 6253 -> 7034
+   * 10. worker: 7034 -> 7813 (last element per batch is: 779)
+   *
+   * This way, a particular worker will be given with the same part
+   * of all bloom filters along with the shared base bloom filter,
+   * so the bitwise OR function will not be a subject of threading/sync 
issues.
+   */
+  int elementPerBatch =
+  (int) Math.ceil((double) (length - 

[jira] [Work logged] (HIVE-23880) Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge

2020-08-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23880?focusedWorklogId=470122=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470122
 ]

ASF GitHub Bot logged work on HIVE-23880:
-

Author: ASF GitHub Bot
Created on: 13/Aug/20 07:49
Start Date: 13/Aug/20 07:49
Worklog Time Spent: 10m 
  Work Description: abstractdog commented on a change in pull request #1280:
URL: https://github.com/apache/hive/pull/1280#discussion_r469761824



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorGroupByOperator.java
##
@@ -252,6 +258,13 @@ protected VectorAggregationBufferRow 
allocateAggregationBuffer() throws HiveExce
   return bufferSet;
 }
 
+protected void finishAggregators(boolean aborted) {

Review comment:
   I'll take care of this in next patch







Issue Time Tracking
---

Worklog Id: (was: 470122)
Time Spent: 6h 50m  (was: 6h 40m)

> Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge
> ---
>
> Key: HIVE-23880
> URL: https://issues.apache.org/jira/browse/HIVE-23880
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
> Attachments: lipwig-output3605036885489193068.svg
>
>  Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> Merging bloom filters in semijoin reduction can become the main bottleneck in 
> case of a large number of source mapper tasks (~1000, Map 1 in the example 
> below) and a large number of expected entries (50M) in bloom filters.
> For example in TPCDS Q93:
> {code}
> select /*+ semi(store_returns, sr_item_sk, store_sales, 7000)*/ ss_customer_sk
>       ,sum(act_sales) sumsales
>   from (select ss_item_sk
>               ,ss_ticket_number
>               ,ss_customer_sk
>               ,case when sr_return_quantity is not null then (ss_quantity-sr_return_quantity)*ss_sales_price
>                     else (ss_quantity*ss_sales_price) end act_sales
>           from store_sales left outer join store_returns on (sr_item_sk = ss_item_sk
>                                                          and sr_ticket_number = ss_ticket_number)
>               ,reason
>          where sr_reason_sk = r_reason_sk
>            and r_reason_desc = 'reason 66') t
>  group by ss_customer_sk
>  order by sumsales, ss_customer_sk
>  limit 100;
> {code}
> At 10TB-30TB scale, 1-2 minutes of a 3-4 minute query runtime may be spent 
> merging bloom filters (Reducer 2), as in: 
> [^lipwig-output3605036885489193068.svg] 
> {code}
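> --------------------------------------------------------------------------------------------
>         VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> --------------------------------------------------------------------------------------------
> Map 3 ..........      llap     SUCCEEDED      1          1        0        0       0       0
> Map 1 ..........      llap     SUCCEEDED   1263       1263        0        0       0       0
> Reducer 2             llap       RUNNING      1          0        1        0       0       0
> Map 4                 llap       RUNNING   6154          0      207     5947       0       0
> Reducer 5             llap        INITED     43          0        0       43       0       0
> Reducer 6             llap        INITED      1          0        0        1       0       0
> --------------------------------------------------------------------------------------------
> VERTICES: 02/06  [>>--] 16%   ELAPSED TIME: 149.98 s
> --------------------------------------------------------------------------------------------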
> {code}
> For example, 70M entries in a bloom filter lead to 436 465 696 bits, so 
> merging 1263 bloom filters means running ~ 1263 * 436 465 696 bitwise OR 
> operations, which is a very hot codepath but can be parallelized.
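
The parallelization idea can be sketched as follows. This is a simplified illustration and not the code from the pull request: the class and method names are hypothetical, and instead of long-lived workers with queues it submits one task per byte range. The header offset of 5 bytes and the 10-worker, 7813-byte example are assumptions taken from the code comment in the patch under review. Each task ORs the same disjoint slice of every input into the shared base array, so no two threads write the same byte.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelOrMergeSketch {
  /** Computes [lower, upper) byte ranges, skipping the first headerBytes bytes. */
  static int[][] sliceBounds(int length, int headerBytes, int workers) {
    int perWorker = (int) Math.ceil((double) (length - headerBytes) / workers);
    int[][] bounds = new int[workers][2];
    for (int w = 0; w < workers; w++) {
      int lower = Math.min(length, headerBytes + w * perWorker);
      bounds[w][0] = lower;
      bounds[w][1] = Math.min(length, lower + perWorker);
    }
    return bounds;
  }

  /** ORs every input into base; all inputs must have the same length as base. */
  static void merge(byte[] base, List<byte[]> inputs, int headerBytes, int workers) {
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    try {
      List<Future<?>> pending = new ArrayList<>();
      for (int[] b : sliceBounds(base.length, headerBytes, workers)) {
        final int lower = b[0];
        final int upper = b[1];
        // Each task owns the range [lower, upper) of base exclusively.
        pending.add(pool.submit(() -> {
          for (byte[] in : inputs) {
            for (int i = lower; i < upper; i++) {
              base[i] |= in[i];
            }
          }
        }));
      }
      for (Future<?> f : pending) {
        f.get(); // wait for completion; also makes the writes visible to the caller
      }
    } catch (Exception e) {
      throw new IllegalStateException("bloom filter merge failed", e);
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    int[][] bounds = sliceBounds(7813, 5, 10);
    System.out.println(bounds[0][0] + " -> " + bounds[0][1]);
    System.out.println(bounds[9][0] + " -> " + bounds[9][1]);
  }
}
```

Because the slices are disjoint, the bitwise OR needs no locking; the only synchronization point is waiting for the submitted tasks to finish.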





[jira] [Commented] (HIVE-23927) Cast to Timestamp generates different output for Integer & Float values

2020-08-13 Thread Renukaprasad C (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176800#comment-17176800
 ] 

Renukaprasad C commented on HIVE-23927:
---

Thanks [~jcamachorodriguez] & [~pgaref].
We will follow the same approach as the other integer datatype conversions, as 
suggested by [~pgaref] ("Maybe we should make this configurable as well, as we 
do in longToTimestamp method"), in 
*PrimitiveObjectInspectorUtils.getTimestamp(Object, PrimitiveObjectInspector, 
boolean).*


> Cast to Timestamp generates different output for Integer & Float values 
> 
>
> Key: HIVE-23927
> URL: https://issues.apache.org/jira/browse/HIVE-23927
> Project: Hive
>  Issue Type: Bug
>Reporter: Renukaprasad C
>Priority: Major
>
> A Double input value is treated as seconds and converted to milliseconds 
> internally, whereas an Integer value is treated as milliseconds, so the two 
> casts produce different output.
> org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils.getTimestamp(Object,
>  PrimitiveObjectInspector, boolean) handles integral and decimal values 
> differently, which causes the issue.
> 0: jdbc:hive2://localhost:1> select cast(1.204135216E9 as timestamp) 
> Double2TimeStamp, cast(1204135216 as timestamp) Int2TimeStamp from abc 
> tablesample(1 rows);
> OK
> INFO  : Compiling 
> command(queryId=renu_20200724140642_70132390-ee12-4214-a2ca-a7e10556fc14): 
> select cast(1.204135216E9 as timestamp) Double2TimeStamp, cast(1204135216 as 
> timestamp) Int2TimeStamp from abc tablesample(1 rows)
> INFO  : Concurrency mode is disabled, not creating a lock manager
> INFO  : Semantic Analysis Completed (retrial = false)
> INFO  : Returning Hive schema: 
> Schema(fieldSchemas:[FieldSchema(name:double2timestamp, type:timestamp, 
> comment:null), FieldSchema(name:int2timestamp, type:timestamp, 
> comment:null)], properties:null)
> INFO  : Completed compiling 
> command(queryId=renu_20200724140642_70132390-ee12-4214-a2ca-a7e10556fc14); 
> Time taken: 0.175 seconds
> INFO  : Concurrency mode is disabled, not creating a lock manager
> INFO  : Executing 
> command(queryId=renu_20200724140642_70132390-ee12-4214-a2ca-a7e10556fc14): 
> select cast(1.204135216E9 as timestamp) Double2TimeStamp, cast(1204135216 as 
> timestamp) Int2TimeStamp from abc tablesample(1 rows)
> INFO  : Completed executing 
> command(queryId=renu_20200724140642_70132390-ee12-4214-a2ca-a7e10556fc14); 
> Time taken: 0.001 seconds
> INFO  : OK
> INFO  : Concurrency mode is disabled, not creating a lock manager
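> +------------------------+--------------------------+
> |    double2timestamp    |      int2timestamp       |
> +------------------------+--------------------------+
> | 2008-02-27 18:00:16.0  | 1970-01-14 22:28:55.216  |
> +------------------------+--------------------------+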
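
The two results above are consistent with the same number being read once as epoch seconds and once as epoch milliseconds. The sketch below reproduces that arithmetic with plain java.time; it does not call any Hive code, and the class and method names are hypothetical.

```java
import java.time.Instant;

public class EpochUnitSketch {
  /** Interprets the value as seconds since the epoch. */
  static Instant asSeconds(long value) {
    return Instant.ofEpochSecond(value);
  }

  /** Interprets the value as milliseconds since the epoch. */
  static Instant asMillis(long value) {
    return Instant.ofEpochMilli(value);
  }

  public static void main(String[] args) {
    long value = 1204135216L;
    System.out.println(asSeconds(value)); // 2008-02-27T18:00:16Z
    System.out.println(asMillis(value));  // 1970-01-14T22:28:55.216Z
  }
}
```

Both instants are printed in UTC and match the two columns of the query result, which suggests the difference comes only from the unit assumed for the input.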


