[jira] [Updated] (HIVE-24041) Extend semijoin conversion rules

2020-08-25 Thread Jesus Camacho Rodriguez (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jesus Camacho Rodriguez updated HIVE-24041:
---
Fix Version/s: 4.0.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Pushed to master.

> Extend semijoin conversion rules
> 
>
> Key: HIVE-24041
> URL: https://issues.apache.org/jira/browse/HIVE-24041
> Project: Hive
>  Issue Type: Improvement
>  Components: CBO
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> This patch fixes a couple of limitations that can be seen in 
> {{cbo_query95.q}}, in particular:
> - It adds a rule to trigger semijoin conversion when there is an
> aggregate on top of the join that prunes all columns from the left side,
> and the aggregate operator is on the left input of the join.
> - It extends existing semijoin conversion rules to prune the unused columns
> from the join's left input, which leads to additional conversion opportunities.
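
An illustrative aside: the following is a minimal, self-contained Java sketch of
plain semijoin semantics, the shape the rules above aim to produce. It uses Java
collections rather than Hive/Calcite classes, and it deliberately ignores the
row-multiplicity conditions that the actual rules must check; once every column
coming from one join input is pruned above the join, that input only needs to
act as an existence filter, which is what a semijoin computes.

{code:java}
// Illustrative only (plain Java, not Hive/Calcite code): a semijoin keeps
// probe-side rows that have at least one key match on the filter side and
// emits no columns from the filter side.
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class SemiJoinSketch {
  record Row(int key, String payload) {}

  static List<Row> semiJoin(List<Row> probeSide, List<Row> filterSide) {
    Set<Integer> filterKeys =
        filterSide.stream().map(Row::key).collect(Collectors.toSet());
    // Existence check only: each probe row appears at most once in the output.
    return probeSide.stream()
        .filter(r -> filterKeys.contains(r.key()))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Row> probe = List.of(new Row(1, "a"), new Row(2, "b"), new Row(3, "c"));
    List<Row> filter = List.of(new Row(1, "x"), new Row(1, "y"), new Row(3, "z"));
    System.out.println(semiJoin(probe, filter)); // keeps the rows with keys 1 and 3
  }
}
{code}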



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24041) Extend semijoin conversion rules

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24041?focusedWorklogId=474674&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474674
 ]

ASF GitHub Bot logged work on HIVE-24041:
-

Author: ASF GitHub Bot
Created on: 26/Aug/20 05:47
Start Date: 26/Aug/20 05:47
Worklog Time Spent: 10m 
  Work Description: jcamachor merged pull request #1405:
URL: https://github.com/apache/hive/pull/1405


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474674)
Time Spent: 1h  (was: 50m)

> Extend semijoin conversion rules
> 
>
> Key: HIVE-24041
> URL: https://issues.apache.org/jira/browse/HIVE-24041
> Project: Hive
>  Issue Type: Improvement
>  Components: CBO
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> This patch fixes a couple of limitations that can be seen in 
> {{cbo_query95.q}}, in particular:
> - It adds a rule to trigger semijoin conversion when there is an
> aggregate on top of the join that prunes all columns from the left side,
> and the aggregate operator is on the left input of the join.
> - It extends existing semijoin conversion rules to prune the unused columns
> from the join's left input, which leads to additional conversion opportunities.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-22782) Consolidate metastore call to fetch constraints

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-22782?focusedWorklogId=474669&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474669
 ]

ASF GitHub Bot logged work on HIVE-22782:
-

Author: ASF GitHub Bot
Created on: 26/Aug/20 05:31
Start Date: 26/Aug/20 05:31
Worklog Time Spent: 10m 
  Work Description: ashish-kumar-sharma commented on a change in pull 
request #1419:
URL: https://github.com/apache/hive/pull/1419#discussion_r477043530



##
File path: 
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java
##
@@ -11286,6 +11287,20 @@ private String getPrimaryKeyConstraintName(String 
catName, String dbName, String
 return notNullConstraints;
   }
 
+  @Override
+  public SQLAllTableConstraints getAllTableConstraints(String catName, String 
db_name, String tbl_name)
+  throws MetaException {
+debugLog("Get all table constraints for the table - " + catName + "."+ 
db_name+"."+tbl_name + " in class ObjectStore.java");
+SQLAllTableConstraints sqlAllTableConstraints = new 
SQLAllTableConstraints();

Review comment:
   https://issues.apache.org/jira/browse/HIVE-24062





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474669)
Time Spent: 50m  (was: 40m)

> Consolidate metastore call to fetch constraints
> ---
>
> Key: HIVE-22782
> URL: https://issues.apache.org/jira/browse/HIVE-22782
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Planning
>Reporter: Vineet Garg
>Assignee: Ashish Sharma
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Currently, separate calls are made to the metastore to fetch constraints such
> as PK, FK, not null, etc. Since the planner always retrieves these constraints,
> we should retrieve all of them in one call.
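
As a rough sketch of the consolidation idea (hypothetical placeholder types; the
actual patch adds a getAllTableConstraints call to the metastore client and
ObjectStore, as the diff quoted earlier in this thread shows), the planner-side
lookup goes from one RPC per constraint kind to a single RPC:

{code:java}
// Hedged sketch only: the types below are placeholders, not the real
// IMetaStoreClient / SQLAllTableConstraints API.
import java.util.List;

class ConstraintFetchSketch {
  // Placeholder container mirroring the "all constraints in one object" idea.
  static class AllTableConstraints {
    List<String> primaryKeys;
    List<String> foreignKeys;
    List<String> uniqueConstraints;
    List<String> notNullConstraints;
    List<String> defaultConstraints;
    List<String> checkConstraints;
  }

  // Hypothetical client: one consolidated call instead of six separate ones.
  interface MetastoreClientSketch {
    AllTableConstraints getAllTableConstraints(String cat, String db, String tbl);
  }

  static AllTableConstraints fetchForPlanner(
      MetastoreClientSketch client, String cat, String db, String tbl) {
    // Single round trip; previously the planner issued getPrimaryKeys,
    // getForeignKeys, getNotNullConstraints, ... as individual calls.
    return client.getAllTableConstraints(cat, db, tbl);
  }
}
{code}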



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-22782) Consolidate metastore call to fetch constraints

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-22782?focusedWorklogId=474668&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474668
 ]

ASF GitHub Bot logged work on HIVE-22782:
-

Author: ASF GitHub Bot
Created on: 26/Aug/20 05:28
Start Date: 26/Aug/20 05:28
Worklog Time Spent: 10m 
  Work Description: adesh-rao commented on a change in pull request #1419:
URL: https://github.com/apache/hive/pull/1419#discussion_r477042339



##
File path: 
standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/client/TestGetAllTableConstraints.java
##
@@ -0,0 +1,145 @@
+package org.apache.hadoop.hive.metastore.client;
+
+import org.apache.hadoop.hive.metastore.IMetaStoreClient;
+import org.apache.hadoop.hive.metastore.MetaStoreTestUtils;
+import org.apache.hadoop.hive.metastore.annotation.MetastoreCheckinTest;
+import org.apache.hadoop.hive.metastore.api.AllTableConstraintsRequest;
+import org.apache.hadoop.hive.metastore.api.Catalog;
+import org.apache.hadoop.hive.metastore.api.Database;
+import org.apache.hadoop.hive.metastore.api.NoSuchObjectException;
+import org.apache.hadoop.hive.metastore.api.PrimaryKeysRequest;
+import org.apache.hadoop.hive.metastore.api.SQLAllTableConstraints;
+import org.apache.hadoop.hive.metastore.api.Table;
+import org.apache.hadoop.hive.metastore.client.builder.CatalogBuilder;
+import org.apache.hadoop.hive.metastore.client.builder.DatabaseBuilder;
+import org.apache.hadoop.hive.metastore.client.builder.TableBuilder;
+import org.apache.hadoop.hive.metastore.minihms.AbstractMetaStoreService;
+import org.apache.thrift.TException;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+import org.junit.experimental.categories.Category;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+
+import static org.apache.hadoop.hive.metastore.Warehouse.DEFAULT_DATABASE_NAME;
+
+@RunWith(Parameterized.class)
+@Category(MetastoreCheckinTest.class)
+public class TestGetAllTableConstraints extends MetaStoreClientTest {
+  private static final String OTHER_DATABASE = 
"test_constraints_other_database";
+  private static final String OTHER_CATALOG = "test_constraints_other_catalog";
+  private static final String DATABASE_IN_OTHER_CATALOG = 
"test_constraints_database_in_other_catalog";
+  private final AbstractMetaStoreService metaStore;
+  private IMetaStoreClient client;
+  private Table[] testTables = new Table[3];
+  private Database inOtherCatalog;
+
+  public TestGetAllTableConstraints(String name, AbstractMetaStoreService 
metaStore) throws Exception {
+this.metaStore = metaStore;
+  }
+  @Before
+  public void setUp() throws Exception {
+// Get new client
+client = metaStore.getClient();
+
+// Clean up the database
+client.dropDatabase(OTHER_DATABASE, true, true, true);
+// Drop every table in the default database
+for(String tableName : client.getAllTables(DEFAULT_DATABASE_NAME)) {
+  client.dropTable(DEFAULT_DATABASE_NAME, tableName, true, true, true);
+}
+
+client.dropDatabase(OTHER_CATALOG, DATABASE_IN_OTHER_CATALOG, true, true, 
true);
+try {
+  client.dropCatalog(OTHER_CATALOG);
+} catch (NoSuchObjectException e) {
+  // NOP
+}
+
+// Clean up trash
+metaStore.cleanWarehouseDirs();
+
+new DatabaseBuilder().setName(OTHER_DATABASE).create(client, 
metaStore.getConf());
+
+Catalog cat = new CatalogBuilder()
+.setName(OTHER_CATALOG)
+.setLocation(MetaStoreTestUtils.getTestWarehouseDir(OTHER_CATALOG))
+.build();
+client.createCatalog(cat);
+
+// For this one don't specify a location to make sure it gets put in the 
catalog directory
+inOtherCatalog = new DatabaseBuilder()
+.setName(DATABASE_IN_OTHER_CATALOG)
+.setCatalogName(OTHER_CATALOG)
+.create(client, metaStore.getConf());
+
+testTables[0] =
+new TableBuilder()
+.setTableName("test_table_1")
+.addCol("col1", "int")
+.addCol("col2", "varchar(32)")
+.create(client, metaStore.getConf());
+
+testTables[1] =
+new TableBuilder()
+.setDbName(OTHER_DATABASE)
+.setTableName("test_table_2")
+.addCol("col1", "int")
+.addCol("col2", "varchar(32)")
+.create(client, metaStore.getConf());
+
+testTables[2] =
+new TableBuilder()
+.inDb(inOtherCatalog)
+.setTableName("test_table_3")
+.addCol("col1", "int")
+.addCol("col2", "varchar(32)")
+.create(client, metaStore.getConf());
+
+// Reload tables from the MetaStore
+for(int i=0; i < testTables.length; i++) {
+  testTables[i] = client.getTable(testTables[i].getCatName(), 
testTables[i].getDbName(),
+  

[jira] [Work logged] (HIVE-24035) Add Jenkinsfile for branch-2.3

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24035?focusedWorklogId=474663&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474663
 ]

ASF GitHub Bot logged work on HIVE-24035:
-

Author: ASF GitHub Bot
Created on: 26/Aug/20 05:12
Start Date: 26/Aug/20 05:12
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on pull request #1398:
URL: https://github.com/apache/hive/pull/1398#issuecomment-680647781


   sorry, I missed your previous message
   
   Does that exception happen all the time? I've also seen it a few
times, but it seemed intermittent.
   So far I have had no time to investigate it further.
   
   I'll restart the tests to see if it passes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474663)
Time Spent: 0.5h  (was: 20m)

> Add Jenkinsfile for branch-2.3
> --
>
> Key: HIVE-24035
> URL: https://issues.apache.org/jira/browse/HIVE-24035
> Project: Hive
>  Issue Type: Test
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> To enable precommit tests for github PR, we need to have a Jenkinsfile in the 
> repo. This is already done for master and branch-2. This adds the same for 
> branch-2.3



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-24067) TestReplicationScenariosExclusiveReplica - Wrong FS error during DB drop

2020-08-25 Thread Pravin Sinha (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pravin Sinha updated HIVE-24067:

Attachment: HIVE-24067.02.patch

> TestReplicationScenariosExclusiveReplica - Wrong FS error during DB drop
> 
>
> Key: HIVE-24067
> URL: https://issues.apache.org/jira/browse/HIVE-24067
> Project: Hive
>  Issue Type: Task
>Reporter: Pravin Sinha
>Assignee: Pravin Sinha
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-24067.01.patch, HIVE-24067.02.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In TestReplicationScenariosExclusiveReplica, the drop database operation for
> the primary db leads to a "Wrong FS" error, as the ReplChangeManager is
> associated with the replica FS.
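
For context, this is the generic Hadoop "Wrong FS" failure shape, shown here as a
small sketch with made-up cluster URIs (illustrative only, not the actual test
setup): a Path rooted in one cluster handed to a FileSystem instance bound to
another cluster fails the path check.

{code:java}
// Illustrative only: generic Hadoop "Wrong FS" error with hypothetical URIs.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WrongFsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // FileSystem bound to the replica cluster.
    FileSystem replicaFs = FileSystem.get(new URI("hdfs://replica-nn:8020"), conf);
    // Path rooted in the primary cluster.
    Path primaryDbPath = new Path("hdfs://primary-nn:8020/warehouse/db1.db");
    // Throws IllegalArgumentException ("Wrong FS ..., expected ...") because the
    // path's file system does not match the instance it is handed to.
    replicaFs.exists(primaryDbPath);
  }
}
{code}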



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-24074) Incorrect handling of timestamp in Parquet/Avro when written in certain time zones in versions before Hive 3.x

2020-08-25 Thread Jesus Camacho Rodriguez (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jesus Camacho Rodriguez updated HIVE-24074:
---
Description: The timezone conversion for Parquet and Avro uses new 
{{java.time.\*}} classes, which can lead to incorrect values returned for 
certain dates in certain timezones if timestamp was computed and converted 
based on {{java.sql.\*}} classes. For instance, the offset used for Singapore 
timezone in 1900-01-01T00:00:00.000 is UTC+8, while the correct offset for that 
date should be UTC+6:55:25. Some additional information can be found here: 
https://stackoverflow.com/a/52152315  (was: The timezone conversion for Parquet 
and Avro uses new {{java.time.*}} classes, which can lead to incorrect values 
returned for certain dates in certain timezones if timestamp was computed and 
converted based on {{java.sql.*}} classes. For instance, the offset used for 
Singapore timezone in 1900-01-01T00:00:00.000 is UTC+8, while the correct 
offset for that date should be UTC+6:55:25. Some additional information can be 
found here: https://stackoverflow.com/a/52152315)

> Incorrect handling of timestamp in Parquet/Avro when written in certain time 
> zones in versions before Hive 3.x
> --
>
> Key: HIVE-24074
> URL: https://issues.apache.org/jira/browse/HIVE-24074
> Project: Hive
>  Issue Type: Bug
>  Components: Avro, Parquet
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>Priority: Major
>
> The timezone conversion for Parquet and Avro uses new {{java.time.\*}} 
> classes, which can lead to incorrect values returned for certain dates in 
> certain timezones if timestamp was computed and converted based on 
> {{java.sql.\*}} classes. For instance, the offset used for Singapore timezone 
> in 1900-01-01T00:00:00.000 is UTC+8, while the correct offset for that date 
> should be UTC+6:55:25. Some additional information can be found here: 
> https://stackoverflow.com/a/52152315
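
The offset divergence for old dates can be reproduced with the JDK alone; a small
sketch (JDK behavior only, not Hive code):

{code:java}
// java.time uses the full IANA history: Asia/Singapore resolves to +06:55:25
// for 1900-01-01, whereas the legacy java.sql-based conversion path described
// in this issue effectively uses +08:00 for the same local timestamp.
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.zone.ZoneRules;

public class SingaporeOffsetSketch {
  public static void main(String[] args) {
    ZoneRules rules = ZoneId.of("Asia/Singapore").getRules();
    LocalDateTime ldt = LocalDateTime.of(1900, 1, 1, 0, 0, 0);
    System.out.println(rules.getOffset(ldt)); // prints +06:55:25
  }
}
{code}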



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-24074) Incorrect handling of timestamp in Parquet/Avro when written in certain time zones in versions before Hive 3.x

2020-08-25 Thread Jesus Camacho Rodriguez (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jesus Camacho Rodriguez reassigned HIVE-24074:
--


> Incorrect handling of timestamp in Parquet/Avro when written in certain time 
> zones in versions before Hive 3.x
> --
>
> Key: HIVE-24074
> URL: https://issues.apache.org/jira/browse/HIVE-24074
> Project: Hive
>  Issue Type: Bug
>  Components: Avro, Parquet
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>Priority: Major
>
> The timezone conversion for Parquet and Avro uses new {{java.time.*}} 
> classes, which can lead to incorrect values returned for certain dates in 
> certain timezones if timestamp was computed and converted based on 
> {{java.sql.*}} classes. For instance, the offset used for Singapore timezone 
> in 1900-01-01T00:00:00.000 is UTC+8, while the correct offset for that date 
> should be UTC+6:55:25. Some additional information can be found here: 
> https://stackoverflow.com/a/52152315



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24061) Improve llap task scheduling for better cache hit rate

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24061?focusedWorklogId=474585&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474585
 ]

ASF GitHub Bot logged work on HIVE-24061:
-

Author: ASF GitHub Bot
Created on: 26/Aug/20 00:46
Start Date: 26/Aug/20 00:46
Worklog Time Spent: 10m 
  Work Description: rbalamohan commented on pull request #1431:
URL: https://github.com/apache/hive/pull/1431#issuecomment-680377420


   Thanks @prasanthj . Made a minor fix: "isClusterCapacityFull" has to
be reset in trySchedulingPendingTasks as well. This is needed to ensure that
a scheduling opportunity is given during task deallocations, node additions, etc.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474585)
Time Spent: 40m  (was: 0.5h)

> Improve llap task scheduling for better cache hit rate 
> ---
>
> Key: HIVE-24061
> URL: https://issues.apache.org/jira/browse/HIVE-24061
> Project: Hive
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
>Priority: Major
>  Labels: perfomance, pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> TaskInfo is initialized with the request time and locality delay. When lots
> of vertices are at the same level, the "taskInfo" details are available
> upfront. By the time it gets to scheduling, "requestTime + localityDelay"
> is no longer higher than the current time. Due to this, the locality delay is
> effectively skipped and the scheduler ends up choosing a random node, which
> misses cache hits and reads data from remote storage.
> E.g. this pattern was observed in Q75 of TPC-DS.
> Related lines of interest in scheduler: 
> [https://github.com/apache/hive/blob/master/llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java
>  
> |https://github.com/apache/hive/blob/master/llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java]
> {code:java}
>boolean shouldDelayForLocality = 
> request.shouldDelayForLocality(schedulerAttemptTime);
> ..
> ..
> boolean shouldDelayForLocality(long schedulerAttemptTime) {
>   return localityDelayTimeout > schedulerAttemptTime;
> }
> {code}
>  
> Ideally, "localityDelayTimeout" should be adjusted based on it's first 
> scheduling opportunity.
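
A minimal sketch of that adjustment (the names and structure below are
assumptions for illustration, not the actual LlapTaskSchedulerService patch):
anchor the locality-delay window to the first scheduling attempt instead of the
request registration time.

{code:java}
// Hedged sketch: the delay window starts at the first scheduling opportunity,
// so the task still gets a chance to wait for its preferred (cache-local) hosts.
class LocalityDelaySketch {
  private final long localityDelayMs;
  private long localityDelayTimeout = -1; // unset until the first attempt

  LocalityDelaySketch(long localityDelayMs) {
    this.localityDelayMs = localityDelayMs;
  }

  boolean shouldDelayForLocality(long schedulerAttemptTime) {
    if (localityDelayTimeout < 0) {
      // First scheduling opportunity: start the delay window now rather than
      // at request time, which may already be far in the past.
      localityDelayTimeout = schedulerAttemptTime + localityDelayMs;
    }
    return localityDelayTimeout > schedulerAttemptTime;
  }
}
{code}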



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23700) HiveConf static initialization fails when JAR URI is opaque

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23700?focusedWorklogId=474581&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474581
 ]

ASF GitHub Bot logged work on HIVE-23700:
-

Author: ASF GitHub Bot
Created on: 26/Aug/20 00:40
Start Date: 26/Aug/20 00:40
Worklog Time Spent: 10m 
  Work Description: github-actions[bot] closed pull request #1139:
URL: https://github.com/apache/hive/pull/1139


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474581)
Remaining Estimate: 118h 20m  (was: 118.5h)
Time Spent: 1h 40m  (was: 1.5h)

> HiveConf static initialization fails when JAR URI is opaque
> ---
>
> Key: HIVE-23700
> URL: https://issues.apache.org/jira/browse/HIVE-23700
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 2.3.7
>Reporter: Francisco Guerrero
>Assignee: Francisco Guerrero
>Priority: Minor
>  Labels: pull-request-available
> Attachments: HIVE-23700.1.patch
>
>   Original Estimate: 120h
>  Time Spent: 1h 40m
>  Remaining Estimate: 118h 20m
>
> HiveConf static initialization fails when the jar URI is opaque, for example 
> when it's embedded as a fat jar in a spring boot application. Then 
> initialization of the HiveConf static block fails and the HiveConf class does 
> not get classloaded. The opaque URI in my case looks like this 
> _jar:file:/usr/local/server/some-service-jar.jar!/BOOT-INF/lib/hive-common-2.3.7.jar!/_
> HiveConf#findConfigFile should be able to handle `IllegalArgumentException` 
> when the jar `URI` provided to `File` throws the exception.
> To surface this issue three conditions need to be met.
> 1. hive-site.xml should not be on the classpath
> 2. hive-site.xml should not be on "HIVE_CONF_DIR"
> 3. hive-site.xml should not be on "HIVE_HOME"
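
A JDK-only reproduction of the underlying failure (not Hive code; the fix is for
HiveConf#findConfigFile to tolerate this exception instead of letting the static
initializer die):

{code:java}
// Illustration: an opaque jar: URI (e.g. a fat-jar entry) cannot be converted
// to a java.io.File; the constructor throws IllegalArgumentException.
import java.io.File;
import java.net.URI;

public class OpaqueUriSketch {
  public static void main(String[] args) {
    URI opaque = URI.create(
        "jar:file:/usr/local/server/some-service-jar.jar!/BOOT-INF/lib/hive-common-2.3.7.jar!/");
    try {
      System.out.println(new File(opaque)); // throws IllegalArgumentException
    } catch (IllegalArgumentException e) {
      // A defensive findConfigFile can catch this and skip the jar-based lookup
      // instead of failing HiveConf's static initialization.
      System.out.println("Opaque URI, skipping jar lookup: " + e.getMessage());
    }
  }
}
{code}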



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23700) HiveConf static initialization fails when JAR URI is opaque

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23700?focusedWorklogId=474580&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474580
 ]

ASF GitHub Bot logged work on HIVE-23700:
-

Author: ASF GitHub Bot
Created on: 26/Aug/20 00:40
Start Date: 26/Aug/20 00:40
Worklog Time Spent: 10m 
  Work Description: github-actions[bot] closed pull request #1138:
URL: https://github.com/apache/hive/pull/1138


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474580)
Remaining Estimate: 118.5h  (was: 118h 40m)
Time Spent: 1.5h  (was: 1h 20m)

> HiveConf static initialization fails when JAR URI is opaque
> ---
>
> Key: HIVE-23700
> URL: https://issues.apache.org/jira/browse/HIVE-23700
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 2.3.7
>Reporter: Francisco Guerrero
>Assignee: Francisco Guerrero
>Priority: Minor
>  Labels: pull-request-available
> Attachments: HIVE-23700.1.patch
>
>   Original Estimate: 120h
>  Time Spent: 1.5h
>  Remaining Estimate: 118.5h
>
> HiveConf static initialization fails when the jar URI is opaque, for example 
> when it's embedded as a fat jar in a spring boot application. Then 
> initialization of the HiveConf static block fails and the HiveConf class does 
> not get classloaded. The opaque URI in my case looks like this 
> _jar:file:/usr/local/server/some-service-jar.jar!/BOOT-INF/lib/hive-common-2.3.7.jar!/_
> HiveConf#findConfigFile should be able to handle `IllegalArgumentException` 
> when the jar `URI` provided to `File` throws the exception.
> To surface this issue three conditions need to be met.
> 1. hive-site.xml should not be on the classpath
> 2. hive-site.xml should not be on "HIVE_CONF_DIR"
> 3. hive-site.xml should not be on "HIVE_HOME"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24061) Improve llap task scheduling for better cache hit rate

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24061?focusedWorklogId=474542&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474542
 ]

ASF GitHub Bot logged work on HIVE-24061:
-

Author: ASF GitHub Bot
Created on: 25/Aug/20 22:22
Start Date: 25/Aug/20 22:22
Worklog Time Spent: 10m 
  Work Description: rbalamohan commented on pull request #1431:
URL: https://github.com/apache/hive/pull/1431#issuecomment-680298633


   Thanks @prasanthj . Updated the patch with review comments addressed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474542)
Time Spent: 0.5h  (was: 20m)

> Improve llap task scheduling for better cache hit rate 
> ---
>
> Key: HIVE-24061
> URL: https://issues.apache.org/jira/browse/HIVE-24061
> Project: Hive
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
>Priority: Major
>  Labels: perfomance, pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> TaskInfo is initialized with the request time and locality delay. When lots
> of vertices are at the same level, the "taskInfo" details are available
> upfront. By the time it gets to scheduling, "requestTime + localityDelay"
> is no longer higher than the current time. Due to this, the locality delay is
> effectively skipped and the scheduler ends up choosing a random node, which
> misses cache hits and reads data from remote storage.
> E.g. this pattern was observed in Q75 of TPC-DS.
> Related lines of interest in scheduler: 
> [https://github.com/apache/hive/blob/master/llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java
>  
> |https://github.com/apache/hive/blob/master/llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java]
> {code:java}
>boolean shouldDelayForLocality = 
> request.shouldDelayForLocality(schedulerAttemptTime);
> ..
> ..
> boolean shouldDelayForLocality(long schedulerAttemptTime) {
>   return localityDelayTimeout > schedulerAttemptTime;
> }
> {code}
>  
> Ideally, "localityDelayTimeout" should be adjusted based on it's first 
> scheduling opportunity.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24035) Add Jenkinsfile for branch-2.3

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24035?focusedWorklogId=474538&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474538
 ]

ASF GitHub Bot logged work on HIVE-24035:
-

Author: ASF GitHub Bot
Created on: 25/Aug/20 21:58
Start Date: 25/Aug/20 21:58
Worklog Time Spent: 10m 
  Work Description: sunchao commented on pull request #1398:
URL: https://github.com/apache/hive/pull/1398#issuecomment-680290206


   kindly ping @kgyrtkirk - wonder if you have any idea on the jenkins issue. 
Thanks.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474538)
Time Spent: 20m  (was: 10m)

> Add Jenkinsfile for branch-2.3
> --
>
> Key: HIVE-24035
> URL: https://issues.apache.org/jira/browse/HIVE-24035
> Project: Hive
>  Issue Type: Test
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> To enable precommit tests for github PR, we need to have a Jenkinsfile in the 
> repo. This is already done for master and branch-2. This adds the same for 
> branch-2.3



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-24073) Execution exception in sort-merge semijoin

2020-08-25 Thread Jesus Camacho Rodriguez (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jesus Camacho Rodriguez reassigned HIVE-24073:
--


> Execution exception in sort-merge semijoin
> --
>
> Key: HIVE-24073
> URL: https://issues.apache.org/jira/browse/HIVE-24073
> Project: Hive
>  Issue Type: Bug
>  Components: Operators
>Reporter: Jesus Camacho Rodriguez
>Assignee: mahesh kumar behera
>Priority: Major
>
> Working on HIVE-24001, we trigger an additional SJ conversion that leads to 
> this exception at execution time:
> {code}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Attempting to overwrite 
> nextKeyWritables[1]
>   at 
> org.apache.hadoop.hive.ql.exec.GroupByOperator.closeOp(GroupByOperator.java:1063)
>   at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:685)
>   at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:707)
>   at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:707)
>   at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:707)
>   at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:707)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.close(MapRecordProcessor.java:462)
>   ... 16 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Attempting to overwrite 
> nextKeyWritables[1]
>   at 
> org.apache.hadoop.hive.ql.exec.GroupByOperator.flush(GroupByOperator.java:1037)
>   at 
> org.apache.hadoop.hive.ql.exec.GroupByOperator.closeOp(GroupByOperator.java:1060)
>   ... 22 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Attempting to 
> overwrite nextKeyWritables[1]
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.processKey(CommonMergeJoinOperator.java:564)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.process(CommonMergeJoinOperator.java:243)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:887)
>   at 
> org.apache.hadoop.hive.ql.exec.TezDummyStoreOperator.process(TezDummyStoreOperator.java:49)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:887)
>   at 
> org.apache.hadoop.hive.ql.exec.GroupByOperator.forward(GroupByOperator.java:1003)
>   at 
> org.apache.hadoop.hive.ql.exec.GroupByOperator.flush(GroupByOperator.java:1020)
>   ... 23 more
> {code}
> To reproduce, just set {{hive.auto.convert.sortmerge.join}} to {{true}} in 
> the last query in {{auto_sortmerge_join_10.q}} after HIVE-24041 has been 
> merged.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24041) Extend semijoin conversion rules

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24041?focusedWorklogId=474534&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474534
 ]

ASF GitHub Bot logged work on HIVE-24041:
-

Author: ASF GitHub Bot
Created on: 25/Aug/20 21:44
Start Date: 25/Aug/20 21:44
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on pull request #1405:
URL: https://github.com/apache/hive/pull/1405#issuecomment-680285026


   @kasakrisz , I have addressed the comment and all tests passed. I'll link 
the JIRA for semijoin SMB to this issue.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474534)
Time Spent: 50m  (was: 40m)

> Extend semijoin conversion rules
> 
>
> Key: HIVE-24041
> URL: https://issues.apache.org/jira/browse/HIVE-24041
> Project: Hive
>  Issue Type: Improvement
>  Components: CBO
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> This patch fixes a couple of limitations that can be seen in 
> {{cbo_query95.q}}, in particular:
> - It adds a rule to trigger semijoin conversion when there is an
> aggregate on top of the join that prunes all columns from the left side,
> and the aggregate operator is on the left input of the join.
> - It extends existing semijoin conversion rules to prune the unused columns
> from the join's left input, which leads to additional conversion opportunities.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24061) Improve llap task scheduling for better cache hit rate

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24061?focusedWorklogId=474425&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474425
 ]

ASF GitHub Bot logged work on HIVE-24061:
-

Author: ASF GitHub Bot
Created on: 25/Aug/20 17:45
Start Date: 25/Aug/20 17:45
Worklog Time Spent: 10m 
  Work Description: prasanthj commented on a change in pull request #1431:
URL: https://github.com/apache/hive/pull/1431#discussion_r476626802



##
File path: 
llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java
##
@@ -1487,6 +1492,14 @@ private SelectHostResult selectHost(TaskInfo request) {
 return SELECT_HOST_RESULT_DELAYED_RESOURCES;
   }
 
+  // When all nodes are busy, reset locality delay
+  if (activeNodesWithFreeSlots.isEmpty()) {
+isCapacityFull.set(true);
+if (request.localityDelayTimeout > 0 && 
isRequestedHostPresent(request)) {
+  request.resetLocalityDelayInfo();
+}
+  }

Review comment:
   Can isCapacityFull be set to false in the else condition here, instead of in
trySchedulingPendingTasks? Just a minor readability improvement.

##
File path: 
llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java
##
@@ -1817,8 +1830,10 @@ protected void schedulePendingTasks() throws 
InterruptedException {
 Iterator taskIter = taskListAtPriority.iterator();
 boolean scheduledAllAtPriority = true;
 while (taskIter.hasNext()) {
-  // TODO Optimization: Add a check to see if there's any capacity 
available. No point in
-  // walking through all active nodes, if they don't have potential 
capacity.
+  // Early exit where there are no slots available
+  if (isCapacityFull.get()) {

Review comment:
   Can this be the outer if condition? Run the task iterator only if cluster
capacity is available.

##
File path: 
llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java
##
@@ -1390,7 +1395,7 @@ private SelectHostResult selectHost(TaskInfo request) {
   boolean shouldDelayForLocality = 
request.shouldDelayForLocality(schedulerAttemptTime);
   LOG.debug("ShouldDelayForLocality={} for task={} on hosts={}", 
shouldDelayForLocality,
   request.task, requestedHostsDebugStr);
-  if (requestedHosts != null && requestedHosts.length > 0) {
+  if (!isRequestedHostPresent(request)) {

Review comment:
   the condition seems to have flipped here. is this expected?

##
File path: 
llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java
##
@@ -251,6 +251,7 @@ public void setError(Void v, Throwable t) {
 
   private final Lock scheduleLock = new ReentrantLock();
   private final Condition scheduleCondition = scheduleLock.newCondition();
+  private final AtomicBoolean isCapacityFull = new AtomicBoolean(false);

Review comment:
   nit: to differentiate node vs cluster, rename this to 
isClusterCapacityFull? 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474425)
Time Spent: 20m  (was: 10m)

> Improve llap task scheduling for better cache hit rate 
> ---
>
> Key: HIVE-24061
> URL: https://issues.apache.org/jira/browse/HIVE-24061
> Project: Hive
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
>Priority: Major
>  Labels: perfomance, pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> TaskInfo is initialized with the request time and locality delay. When lots
> of vertices are at the same level, the "taskInfo" details are available
> upfront. By the time it gets to scheduling, "requestTime + localityDelay"
> is no longer higher than the current time. Due to this, the locality delay is
> effectively skipped and the scheduler ends up choosing a random node, which
> misses cache hits and reads data from remote storage.
> E.g. this pattern was observed in Q75 of TPC-DS.
> Related lines of interest in scheduler: 
> [https://github.com/apache/hive/blob/master/llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java
>  
> |https://github.com/apache/hive/blob/master/llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java]
> {code:java}
>boolean shouldDelayForLocality = 
> 

[jira] [Work logged] (HIVE-24068) Add re-execution plugin for handling DAG submission and unmanaged AM failures

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24068?focusedWorklogId=474399&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474399
 ]

ASF GitHub Bot logged work on HIVE-24068:
-

Author: ASF GitHub Bot
Created on: 25/Aug/20 17:05
Start Date: 25/Aug/20 17:05
Worklog Time Spent: 10m 
  Work Description: prasanthj commented on pull request #1428:
URL: https://github.com/apache/hive/pull/1428#issuecomment-680152596


   @kgyrtkirk thanks for the review! Addressed the review comment. Also handled
the maxExecutions within the plugin, which was missing before. One minor change
added is to update the AM loss plugin to handle unmanaged AM failures as well.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474399)
Time Spent: 0.5h  (was: 20m)

> Add re-execution plugin for handling DAG submission and unmanaged AM failures
> -
>
> Key: HIVE-24068
> URL: https://issues.apache.org/jira/browse/HIVE-24068
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Prasanth Jayachandran
>Assignee: Prasanth Jayachandran
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> DAG submission failures can also happen in environments where the AM container
> died, causing DNS issues. DAG submissions are safe to retry, as the DAG hasn't
> started execution yet. There are retries at the getSession and submitDAG levels
> individually, but some submitDAG failures have to retry getSession as well,
> since the AM could be unreachable; this can be handled in a re-execution plugin.
> There is already an AM-loss retry execution plugin, but it only handles managed
> AMs. It can be extended to handle unmanaged AMs as well.
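
A rough sketch of the retry shape described above, using hypothetical placeholder
interfaces (not the actual Hive re-execution plugin or Tez client API): on a
submission failure the session is re-acquired before the DAG is resubmitted,
bounded by a maximum number of attempts.

{code:java}
// Hedged sketch with placeholder interfaces standing in for the Tez session
// and DAG submission; retrying is safe because a failed submission means the
// DAG never started executing.
class DagSubmitRetrySketch {
  interface Session { void submitDag(String dag) throws Exception; }
  interface SessionFactory { Session getSession() throws Exception; }

  static void submitWithRetry(SessionFactory factory, String dag, int maxAttempts)
      throws Exception {
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        // Re-acquire the session on every attempt, since the AM behind the
        // previous session may be dead or unreachable.
        factory.getSession().submitDag(dag);
        return;
      } catch (Exception e) {
        last = e;
      }
    }
    throw last != null ? last : new IllegalStateException("maxAttempts must be > 0");
  }
}
{code}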



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-24068) Add re-execution plugin for handling DAG submission and unmanaged AM failures

2020-08-25 Thread Prasanth Jayachandran (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth Jayachandran updated HIVE-24068:
-
Description: 
DAG submission failures can also happen in environments where the AM container
died, causing DNS issues. DAG submissions are safe to retry, as the DAG hasn't
started execution yet. There are retries at the getSession and submitDAG levels
individually, but some submitDAG failures have to retry getSession as well,
since the AM could be unreachable; this can be handled in a re-execution plugin.

There is already an AM-loss retry execution plugin, but it only handles managed
AMs. It can be extended to handle unmanaged AMs as well.

  was:DAG submission failure can also happen in environments where AM container 
died causing DNS issues. DAG submissions are safe to retry as the DAG hasn't 
started execution yet. There are retries at getSession and submitDAG level 
individually but some submitDAG failure has to retry getSession as well as AM 
could be unreachable, this can be handled in re-execution plugin.


> Add re-execution plugin for handling DAG submission and unmanaged AM failures
> -
>
> Key: HIVE-24068
> URL: https://issues.apache.org/jira/browse/HIVE-24068
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Prasanth Jayachandran
>Assignee: Prasanth Jayachandran
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> DAG submission failures can also happen in environments where the AM container
> died, causing DNS issues. DAG submissions are safe to retry, as the DAG hasn't
> started execution yet. There are retries at the getSession and submitDAG levels
> individually, but some submitDAG failures have to retry getSession as well,
> since the AM could be unreachable; this can be handled in a re-execution plugin.
> There is already an AM-loss retry execution plugin, but it only handles managed
> AMs. It can be extended to handle unmanaged AMs as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-24068) Add re-execution plugin for handling DAG submission and unmanaged AM failures

2020-08-25 Thread Prasanth Jayachandran (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth Jayachandran updated HIVE-24068:
-
Summary: Add re-execution plugin for handling DAG submission and unmanaged 
AM failures  (was: Add re-execution plugin for handling DAG submission failures)

> Add re-execution plugin for handling DAG submission and unmanaged AM failures
> -
>
> Key: HIVE-24068
> URL: https://issues.apache.org/jira/browse/HIVE-24068
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Prasanth Jayachandran
>Assignee: Prasanth Jayachandran
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> DAG submission failures can also happen in environments where the AM container
> died, causing DNS issues. DAG submissions are safe to retry, as the DAG hasn't
> started execution yet. There are retries at the getSession and submitDAG levels
> individually, but some submitDAG failures have to retry getSession as well,
> since the AM could be unreachable; this can be handled in a re-execution plugin.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-20600) Metastore connection leak

2020-08-25 Thread Amine ZITOUNI (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-20600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17184177#comment-17184177
 ] 

Amine ZITOUNI commented on HIVE-20600:
--

Hi,

I reproduced the issue with HiveServer2 & HiveServer2 Interactive (LLAP).

The Hive stack version is 3.1.0 (used via Hortonworks HDP 3.1.4.0-315).

 

I run 8 queries per minute. The number of connections increases each time a 
query is launched.

Netstat shows thousands of connections from HS2 to the Hive metastore in
CLOSE_WAIT status.

 

After some time, HS2 becomes unresponsive and has to be restarted.

HiveServer2 Heap Size is 16GB

Metastore Heap Size is 16GB

> Metastore connection leak
> -
>
> Key: HIVE-20600
> URL: https://issues.apache.org/jira/browse/HIVE-20600
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 2.3.3
>Reporter: Damon Cortesi
>Priority: Major
> Attachments: HIVE-20600.patch, consume_threads.py
>
>
> Within the execute method of HiveServer2, there appears to be a connection 
> leak. With a fairly straightforward series of INSERT statements, the connection 
> count in the logs continues to increase over time. Under certain loads, this 
> can also consume all underlying threads of the Hive metastore and result in 
> HS2 becoming unresponsive to new connections.
> The log below is the result of some python code executing a single insert 
> statement, and then looping through a series of 10 more insert statements. We 
> can see there's one dangling connection left open after each execution 
> leaving us with 12 open connections (11 from the execute statements + 1 from 
> HS2 startup).
> {code}
> 2018-09-19T17:14:32,108 INFO [main([])]: hive.metastore 
> (HiveMetaStoreClient.java:open(481)) - Opened a connection to metastore, 
> current connections: 1
>  2018-09-19T17:14:48,175 INFO [29049f74-73c4-4f48-9cf7-b4bfe524a85b 
> HiveServer2-Handler-Pool: Thread-31([])]: hive.metastore 
> (HiveMetaStoreClient.java:open(481)) - Opened a connection to metastore, 
> current connections: 2
>  2018-09-19T17:15:05,543 INFO [HiveServer2-Background-Pool: Thread-36([])]: 
> hive.metastore (HiveMetaStoreClient.java:close(564)) - Closed a connection to 
> metastore, current connections: 1
>  2018-09-19T17:15:05,548 INFO [HiveServer2-Background-Pool: Thread-36([])]: 
> hive.metastore (HiveMetaStoreClient.java:open(481)) - Opened a connection to 
> metastore, current connections: 2
>  2018-09-19T17:15:05,932 INFO [HiveServer2-Background-Pool: Thread-36([])]: 
> hive.metastore (HiveMetaStoreClient.java:close(564)) - Closed a connection to 
> metastore, current connections: 1
>  2018-09-19T17:15:05,935 INFO [HiveServer2-Background-Pool: Thread-36([])]: 
> hive.metastore (HiveMetaStoreClient.java:open(481)) - Opened a connection to 
> metastore, current connections: 2
>  2018-09-19T17:15:06,123 INFO [HiveServer2-Background-Pool: Thread-36([])]: 
> hive.metastore (HiveMetaStoreClient.java:close(564)) - Closed a connection to 
> metastore, current connections: 1
>  2018-09-19T17:15:06,126 INFO [HiveServer2-Background-Pool: Thread-36([])]: 
> hive.metastore (HiveMetaStoreClient.java:open(481)) - Opened a connection to 
> metastore, current connections: 2
> ...
>  2018-09-19T17:15:20,626 INFO [29049f74-73c4-4f48-9cf7-b4bfe524a85b 
> HiveServer2-Handler-Pool: Thread-31([])]: hive.metastore 
> (HiveMetaStoreClient.java:open(481)) - Opened a connection to metastore, 
> current connections: 12
>  2018-09-19T17:15:21,153 INFO [HiveServer2-Background-Pool: Thread-162([])]: 
> hive.metastore (HiveMetaStoreClient.java:close(564)) - Closed a connection to 
> metastore, current connections: 11
>  2018-09-19T17:15:21,155 INFO [HiveServer2-Background-Pool: Thread-162([])]: 
> hive.metastore (HiveMetaStoreClient.java:open(481)) - Opened a connection to 
> metastore, current connections: 12
>  2018-09-19T17:15:21,306 INFO [HiveServer2-Background-Pool: Thread-162([])]: 
> hive.metastore (HiveMetaStoreClient.java:close(564)) - Closed a connection to 
> metastore, current connections: 11
>  2018-09-19T17:15:21,308 INFO [HiveServer2-Background-Pool: Thread-162([])]: 
> hive.metastore (HiveMetaStoreClient.java:open(481)) - Opened a connection to 
> metastore, current connections: 12
>  2018-09-19T17:15:21,385 INFO [HiveServer2-Background-Pool: Thread-162([])]: 
> hive.metastore (HiveMetaStoreClient.java:close(564)) - Closed a connection to 
> metastore, current connections: 11
>  2018-09-19T17:15:21,387 INFO [HiveServer2-Background-Pool: Thread-162([])]: 
> hive.metastore (HiveMetaStoreClient.java:open(481)) - Opened a connection to 
> metastore, current connections: 12
>  2018-09-19T17:15:21,541 INFO [HiveServer2-Handler-Pool: Thread-31([])]: 
> hive.metastore (HiveMetaStoreClient.java:open(481)) - Opened a connection to 
> metastore, current 

[jira] [Commented] (HIVE-23725) ValidTxnManager snapshot outdating causing partial reads in merge insert

2020-08-25 Thread Zoltan Haindrich (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17184162#comment-17184162
 ] 

Zoltan Haindrich commented on HIVE-23725:
-

* I think using hooks is better, because they could give you more context about
the actual system's state... if some information is not yet accessible through
the hooks, please try to extend the hooks with it, so that you can be notified, etc.
 * Making the plugin optional could have been useful in that case... when a
plugin doesn't work as expected, you could disable it - but if it's burned
in, you can't disable it...
 * I don't know why you would need to go over, say,
HIVE_QUERY_MAX_REEXECUTION_COUNT - I think if we try something 3 times, it
will likely not succeed on further attempts either...

 

> ValidTxnManager snapshot outdating causing partial reads in merge insert
> 
>
> Key: HIVE-23725
> URL: https://issues.apache.org/jira/browse/HIVE-23725
> Project: Hive
>  Issue Type: Bug
>Reporter: Peter Varga
>Assignee: Peter Varga
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> When the ValidTxnManager invalidates the snapshot during a merge insert and
> starts to read committed transactions that were not committed when the query
> compilation happened, it can cause partial-read problems if the committed
> transaction created a new partition in the source or target table.
> The solution should be to not only fix the snapshot but also recompile the
> query and acquire the locks again.
> You could construct an example like this:
> 1. Open and compile transaction 1, which merge inserts data from a partitioned
> source table that has a few partitions.
> 2. Open, run and commit transaction 2, which inserts data into an old and a new
> partition of the source table.
> 3. Open, run and commit transaction 3, which inserts data into the target table
> of the merge statement; that will retrigger a snapshot generation in
> transaction 1.
> 4. Run transaction 1: the snapshot will be regenerated, and it will read
> partial data from transaction 2, breaking the ACID properties.
> A different setup, switching the transaction order:
> 1. Compile transaction 1, which inserts data into an old and a new partition of
> the source table.
> 2. Compile transaction 2, which inserts data into the target table.
> 3. Compile transaction 3, which merge inserts data from the source table into
> the target table.
> 4. Run and commit transaction 1.
> 5. Run and commit transaction 2.
> 6. Run transaction 3: since it contains 1 and 2 in its snapshot, the
> isValidTxnListState check will be triggered and we do a partial read of
> transaction 1 for the same reasons.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-23725) ValidTxnManager snapshot outdating causing partial reads in merge insert

2020-08-25 Thread Peter Varga (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17184154#comment-17184154
 ] 

Peter Varga commented on HIVE-23725:


[~kgyrtkirk] I will have to work on an upgraded version of this, because
releasing the locks when re-executing causes problems when the degree of
concurrent update queries on the same partition is high. I will try to address
your comments there and invite you to review.

A few questions, so I can go in the right direction:

 * I have seen that the other plugin implementations use hooks, but my problem
was that this exception is thrown between the compilation and execution phases.
Can I use the failure hook there? Would the failure hook be called?
 * The whole re-execution count misery was introduced to make it possible for
the new plugin to have an individual config for the re-execution count that can
be higher than HIVE_QUERY_MAX_REEXECUTION_COUNT, while still keeping that config
for every other plugin. If you have an idea how to solve this properly, I am
happy to implement it, because I do not like the current solution either.

> ValidTxnManager snapshot outdating causing partial reads in merge insert
> 
>
> Key: HIVE-23725
> URL: https://issues.apache.org/jira/browse/HIVE-23725
> Project: Hive
>  Issue Type: Bug
>Reporter: Peter Varga
>Assignee: Peter Varga
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> When the ValidTxnManager invalidates the snapshot during a merge insert and
> starts to read committed transactions that were not committed when the query
> compilation happened, it can cause partial-read problems if the committed
> transaction created a new partition in the source or target table.
> The solution should be to not only fix the snapshot but also recompile the
> query and acquire the locks again.
> You could construct an example like this:
> 1. Open and compile transaction 1, which merge inserts data from a partitioned
> source table that has a few partitions.
> 2. Open, run and commit transaction 2, which inserts data into an old and a new
> partition of the source table.
> 3. Open, run and commit transaction 3, which inserts data into the target table
> of the merge statement; that will retrigger a snapshot generation in
> transaction 1.
> 4. Run transaction 1: the snapshot will be regenerated, and it will read
> partial data from transaction 2, breaking the ACID properties.
> A different setup, switching the transaction order:
> 1. Compile transaction 1, which inserts data into an old and a new partition of
> the source table.
> 2. Compile transaction 2, which inserts data into the target table.
> 3. Compile transaction 3, which merge inserts data from the source table into
> the target table.
> 4. Run and commit transaction 1.
> 5. Run and commit transaction 2.
> 6. Run transaction 3: since it contains 1 and 2 in its snapshot, the
> isValidTxnListState check will be triggered and we do a partial read of
> transaction 1 for the same reasons.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23649) Fix FindBug issues in hive-service-rpc

2020-08-25 Thread Zoltan Haindrich (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Haindrich updated HIVE-23649:

Fix Version/s: 4.0.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

merged to master. Thank you [~mustafaiman]!

> Fix FindBug issues in hive-service-rpc
> --
>
> Key: HIVE-23649
> URL: https://issues.apache.org/jira/browse/HIVE-23649
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Panagiotis Garefalakis
>Assignee: Mustafa Iman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: spotbugsXml.xml
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23649) Fix FindBug issues in hive-service-rpc

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23649?focusedWorklogId=474376&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474376
 ]

ASF GitHub Bot logged work on HIVE-23649:
-

Author: ASF GitHub Bot
Created on: 25/Aug/20 15:45
Start Date: 25/Aug/20 15:45
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk merged pull request #1426:
URL: https://github.com/apache/hive/pull/1426


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474376)
Time Spent: 20m  (was: 10m)

> Fix FindBug issues in hive-service-rpc
> --
>
> Key: HIVE-23649
> URL: https://issues.apache.org/jira/browse/HIVE-23649
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Panagiotis Garefalakis
>Assignee: Mustafa Iman
>Priority: Major
>  Labels: pull-request-available
> Attachments: spotbugsXml.xml
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-24072) HiveAggregateJoinTransposeRule may try to create an invalid transformation

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-24072:
--
Labels: pull-request-available  (was: )

> HiveAggregateJoinTransposeRule may try to create an invalid transformation
> --
>
> Key: HIVE-24072
> URL: https://issues.apache.org/jira/browse/HIVE-24072
> Project: Hive
>  Issue Type: Bug
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {code}
> java.lang.AssertionError: 
> Cannot add expression of different type to set:
> set type is RecordType(INTEGER NOT NULL o_orderkey, DECIMAL(10, 0) 
> o_totalprice, DATE o_orderdate, INTEGER NOT NULL c_custkey, VARCHAR(25) 
> CHARACTER SET "UTF-16LE" c_name, DOUBLE $f5) NOT NULL
> expression type is RecordType(INTEGER NOT NULL o_orderkey, INTEGER NOT NULL 
> o_custkey, DECIMAL(10, 0) o_totalprice, DATE o_orderdate, INTEGER NOT NULL 
> c_custkey, DOUBLE $f1) NOT NULL
> set is rel#567:HiveAggregate.HIVE.[].any(input=HepRelVertex#490,group={2, 4, 
> 5, 6, 7},agg#0=sum($1))
> expression is HiveProject(o_orderkey=[$2], o_custkey=[$3], o_totalprice=[$4], 
> o_orderdate=[$5], c_custkey=[$6], $f1=[$1])
>   HiveJoin(condition=[=($2, $0)], joinType=[inner], algorithm=[none], 
> cost=[{2284.5 rows, 0.0 cpu, 0.0 io}])
> HiveAggregate(group=[{0}], agg#0=[sum($1)])
>   HiveProject(l_orderkey=[$0], l_quantity=[$4])
> HiveTableScan(table=[[tpch_0_001, lineitem]], table:alias=[l])
> HiveJoin(condition=[=($0, $6)], joinType=[inner], algorithm=[none], 
> cost=[{1.9115E15 rows, 0.0 cpu, 0.0 io}])
>   HiveJoin(condition=[=($4, $1)], joinType=[inner], algorithm=[none], 
> cost=[{1650.0 rows, 0.0 cpu, 0.0 io}])
> HiveProject(o_orderkey=[$0], o_custkey=[$1], o_totalprice=[$3], 
> o_orderdate=[$4])
>   HiveTableScan(table=[[tpch_0_001, orders]], table:alias=[orders])
> HiveProject(c_custkey=[$0], c_name=[$1])
>   HiveTableScan(table=[[tpch_0_001, customer]], 
> table:alias=[customer])
>   HiveProject($f0=[$0])
> HiveFilter(condition=[>($1, 3E2)])
>   HiveAggregate(group=[{0}], agg#0=[sum($4)])
> HiveTableScan(table=[[tpch_0_001, lineitem]], 
> table:alias=[lineitem])
>   at 
> org.apache.calcite.plan.RelOptUtil.verifyTypeEquivalence(RelOptUtil.java:383)
>   at 
> org.apache.calcite.plan.hep.HepRuleCall.transformTo(HepRuleCall.java:57)
>   at 
> org.apache.calcite.plan.RelOptRuleCall.transformTo(RelOptRuleCall.java:236)
>   at 
> org.apache.hadoop.hive.ql.optimizer.calcite.rules.HiveAggregateJoinTransposeRule.onMatch(HiveAggregateJoinTransposeRule.java:300)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24072) HiveAggregateJoinTransposeRule may try to create an invalid transformation

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24072?focusedWorklogId=474374=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474374
 ]

ASF GitHub Bot logged work on HIVE-24072:
-

Author: ASF GitHub Bot
Created on: 25/Aug/20 15:42
Start Date: 25/Aug/20 15:42
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk opened a new pull request #1432:
URL: https://github.com/apache/hive/pull/1432


   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474374)
Remaining Estimate: 0h
Time Spent: 10m

> HiveAggregateJoinTransposeRule may try to create an invalid transformation
> --
>
> Key: HIVE-24072
> URL: https://issues.apache.org/jira/browse/HIVE-24072
> Project: Hive
>  Issue Type: Bug
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {code}
> java.lang.AssertionError: 
> Cannot add expression of different type to set:
> set type is RecordType(INTEGER NOT NULL o_orderkey, DECIMAL(10, 0) 
> o_totalprice, DATE o_orderdate, INTEGER NOT NULL c_custkey, VARCHAR(25) 
> CHARACTER SET "UTF-16LE" c_name, DOUBLE $f5) NOT NULL
> expression type is RecordType(INTEGER NOT NULL o_orderkey, INTEGER NOT NULL 
> o_custkey, DECIMAL(10, 0) o_totalprice, DATE o_orderdate, INTEGER NOT NULL 
> c_custkey, DOUBLE $f1) NOT NULL
> set is rel#567:HiveAggregate.HIVE.[].any(input=HepRelVertex#490,group={2, 4, 
> 5, 6, 7},agg#0=sum($1))
> expression is HiveProject(o_orderkey=[$2], o_custkey=[$3], o_totalprice=[$4], 
> o_orderdate=[$5], c_custkey=[$6], $f1=[$1])
>   HiveJoin(condition=[=($2, $0)], joinType=[inner], algorithm=[none], 
> cost=[{2284.5 rows, 0.0 cpu, 0.0 io}])
> HiveAggregate(group=[{0}], agg#0=[sum($1)])
>   HiveProject(l_orderkey=[$0], l_quantity=[$4])
> HiveTableScan(table=[[tpch_0_001, lineitem]], table:alias=[l])
> HiveJoin(condition=[=($0, $6)], joinType=[inner], algorithm=[none], 
> cost=[{1.9115E15 rows, 0.0 cpu, 0.0 io}])
>   HiveJoin(condition=[=($4, $1)], joinType=[inner], algorithm=[none], 
> cost=[{1650.0 rows, 0.0 cpu, 0.0 io}])
> HiveProject(o_orderkey=[$0], o_custkey=[$1], o_totalprice=[$3], 
> o_orderdate=[$4])
>   HiveTableScan(table=[[tpch_0_001, orders]], table:alias=[orders])
> HiveProject(c_custkey=[$0], c_name=[$1])
>   HiveTableScan(table=[[tpch_0_001, customer]], 
> table:alias=[customer])
>   HiveProject($f0=[$0])
> HiveFilter(condition=[>($1, 3E2)])
>   HiveAggregate(group=[{0}], agg#0=[sum($4)])
> HiveTableScan(table=[[tpch_0_001, lineitem]], 
> table:alias=[lineitem])
>   at 
> org.apache.calcite.plan.RelOptUtil.verifyTypeEquivalence(RelOptUtil.java:383)
>   at 
> org.apache.calcite.plan.hep.HepRuleCall.transformTo(HepRuleCall.java:57)
>   at 
> org.apache.calcite.plan.RelOptRuleCall.transformTo(RelOptRuleCall.java:236)
>   at 
> org.apache.hadoop.hive.ql.optimizer.calcite.rules.HiveAggregateJoinTransposeRule.onMatch(HiveAggregateJoinTransposeRule.java:300)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-24072) HiveAggregateJoinTransposeRule may try to create an invalid transformation

2020-08-25 Thread Zoltan Haindrich (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Haindrich reassigned HIVE-24072:
---


> HiveAggregateJoinTransposeRule may try to create an invalid transformation
> --
>
> Key: HIVE-24072
> URL: https://issues.apache.org/jira/browse/HIVE-24072
> Project: Hive
>  Issue Type: Bug
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>
> {code}
> java.lang.AssertionError: 
> Cannot add expression of different type to set:
> set type is RecordType(INTEGER NOT NULL o_orderkey, DECIMAL(10, 0) 
> o_totalprice, DATE o_orderdate, INTEGER NOT NULL c_custkey, VARCHAR(25) 
> CHARACTER SET "UTF-16LE" c_name, DOUBLE $f5) NOT NULL
> expression type is RecordType(INTEGER NOT NULL o_orderkey, INTEGER NOT NULL 
> o_custkey, DECIMAL(10, 0) o_totalprice, DATE o_orderdate, INTEGER NOT NULL 
> c_custkey, DOUBLE $f1) NOT NULL
> set is rel#567:HiveAggregate.HIVE.[].any(input=HepRelVertex#490,group={2, 4, 
> 5, 6, 7},agg#0=sum($1))
> expression is HiveProject(o_orderkey=[$2], o_custkey=[$3], o_totalprice=[$4], 
> o_orderdate=[$5], c_custkey=[$6], $f1=[$1])
>   HiveJoin(condition=[=($2, $0)], joinType=[inner], algorithm=[none], 
> cost=[{2284.5 rows, 0.0 cpu, 0.0 io}])
> HiveAggregate(group=[{0}], agg#0=[sum($1)])
>   HiveProject(l_orderkey=[$0], l_quantity=[$4])
> HiveTableScan(table=[[tpch_0_001, lineitem]], table:alias=[l])
> HiveJoin(condition=[=($0, $6)], joinType=[inner], algorithm=[none], 
> cost=[{1.9115E15 rows, 0.0 cpu, 0.0 io}])
>   HiveJoin(condition=[=($4, $1)], joinType=[inner], algorithm=[none], 
> cost=[{1650.0 rows, 0.0 cpu, 0.0 io}])
> HiveProject(o_orderkey=[$0], o_custkey=[$1], o_totalprice=[$3], 
> o_orderdate=[$4])
>   HiveTableScan(table=[[tpch_0_001, orders]], table:alias=[orders])
> HiveProject(c_custkey=[$0], c_name=[$1])
>   HiveTableScan(table=[[tpch_0_001, customer]], 
> table:alias=[customer])
>   HiveProject($f0=[$0])
> HiveFilter(condition=[>($1, 3E2)])
>   HiveAggregate(group=[{0}], agg#0=[sum($4)])
> HiveTableScan(table=[[tpch_0_001, lineitem]], 
> table:alias=[lineitem])
>   at 
> org.apache.calcite.plan.RelOptUtil.verifyTypeEquivalence(RelOptUtil.java:383)
>   at 
> org.apache.calcite.plan.hep.HepRuleCall.transformTo(HepRuleCall.java:57)
>   at 
> org.apache.calcite.plan.RelOptRuleCall.transformTo(RelOptRuleCall.java:236)
>   at 
> org.apache.hadoop.hive.ql.optimizer.calcite.rules.HiveAggregateJoinTransposeRule.onMatch(HiveAggregateJoinTransposeRule.java:300)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23302) Create HiveJdbcDatabaseAccessor for JDBC storage handler

2020-08-25 Thread Jesus Camacho Rodriguez (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jesus Camacho Rodriguez updated HIVE-23302:
---
Fix Version/s: 4.0.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Pushed to master.

> Create HiveJdbcDatabaseAccessor for JDBC storage handler
> 
>
> Key: HIVE-23302
> URL: https://issues.apache.org/jira/browse/HIVE-23302
> Project: Hive
>  Issue Type: Bug
>  Components: StorageHandler
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The {{JdbcDatabaseAccessor}} associated with the storage handler makes some 
> SQL calls to the RDBMS through the JDBC connection. There is a 
> {{GenericJdbcDatabaseAccessor}} with a generic implementation that the 
> storage handler uses if there is no specific implementation for a certain 
> RDBMS.
> Currently, Hive uses the {{GenericJdbcDatabaseAccessor}}. Afaik the only 
> generic query that will not work is splitting the query based on offset and 
> limit, since the syntax for that query is different than the one accepted by 
> Hive. We should create a {{HiveJdbcDatabaseAccessor}} to override that query 
> and possibly fix any other existing incompatibilities.
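
As a rough illustration of the offset/limit incompatibility only (the class and
method names below are invented for this sketch and are not Hive's accessor API,
and the exact paging syntax each engine accepts should be verified), a
Hive-specific accessor would mainly override how the paging clause is appended:

{code:java}
// Hypothetical sketch of a paging-clause hook; not real Hive classes.
interface PagingDialect {
  String addLimitAndOffset(String sql, int limit, int offset);
}

// Generic ANSI-style form emitted by a generic accessor.
class AnsiPaging implements PagingDialect {
  public String addLimitAndOffset(String sql, int limit, int offset) {
    return sql + " LIMIT " + limit + " OFFSET " + offset;
  }
}

// Assumed Hive-style "LIMIT <offset>,<rows>" form; check the target Hive
// version before relying on this syntax.
class HiveStylePaging implements PagingDialect {
  public String addLimitAndOffset(String sql, int limit, int offset) {
    return offset > 0 ? sql + " LIMIT " + offset + "," + limit
                      : sql + " LIMIT " + limit;
  }
}
{code}

A {{HiveJdbcDatabaseAccessor}} could then plug the Hive-style variant into the
split generation path while keeping the generic behaviour for everything else.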



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-24032) Remove hadoop shims dependency and use FileSystem Api directly from standalone metastore

2020-08-25 Thread Peter Vary (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184120#comment-17184120
 ] 

Peter Vary commented on HIVE-24032:
---

[~aasha]: Thanks for the explanation!
Peter

> Remove hadoop shims dependency and use FileSystem Api directly from 
> standalone metastore
> 
>
> Key: HIVE-24032
> URL: https://issues.apache.org/jira/browse/HIVE-24032
> Project: Hive
>  Issue Type: Task
>Reporter: Aasha Medhi
>Assignee: Aasha Medhi
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-24032.01.patch, HIVE-24032.02.patch, 
> HIVE-24032.03.patch
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Remove hadoop shims dependency from standalone metastore. 
> Rename hive.repl.data.copy.lazy hive conf to 
> hive.repl.run.data.copy.tasks.on.target



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23302) Create HiveJdbcDatabaseAccessor for JDBC storage handler

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23302?focusedWorklogId=474357=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474357
 ]

ASF GitHub Bot logged work on HIVE-23302:
-

Author: ASF GitHub Bot
Created on: 25/Aug/20 15:08
Start Date: 25/Aug/20 15:08
Worklog Time Spent: 10m 
  Work Description: jcamachor merged pull request #1427:
URL: https://github.com/apache/hive/pull/1427


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474357)
Time Spent: 20m  (was: 10m)

> Create HiveJdbcDatabaseAccessor for JDBC storage handler
> 
>
> Key: HIVE-23302
> URL: https://issues.apache.org/jira/browse/HIVE-23302
> Project: Hive
>  Issue Type: Bug
>  Components: StorageHandler
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The {{JdbcDatabaseAccessor}} associated with the storage handler makes some 
> SQL calls to the RDBMS through the JDBC connection. There is a 
> {{GenericJdbcDatabaseAccessor}} with a generic implementation that the 
> storage handler uses if there is no specific implementation for a certain 
> RDBMS.
> Currently, Hive uses the {{GenericJdbcDatabaseAccessor}}. Afaik the only 
> generic query that will not work is splitting the query based on offset and 
> limit, since the syntax for that query is different than the one accepted by 
> Hive. We should create a {{HiveJdbcDatabaseAccessor}} to override that query 
> and possibly fix any other existing incompatibilities.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-23725) ValidTxnManager snapshot outdating causing partial reads in merge insert

2020-08-25 Thread Jesus Camacho Rodriguez (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184112#comment-17184112
 ] 

Jesus Camacho Rodriguez commented on HIVE-23725:


[~pvargacl], could you create a follow-up to address [~kgyrtkirk]'s comments? It 
also seems that the final version of the patch that landed is not the same one 
that I reviewed.

> ValidTxnManager snapshot outdating causing partial reads in merge insert
> 
>
> Key: HIVE-23725
> URL: https://issues.apache.org/jira/browse/HIVE-23725
> Project: Hive
>  Issue Type: Bug
>Reporter: Peter Varga
>Assignee: Peter Varga
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> When the ValidTxnManager invalidates the snapshot during a merge insert and 
> starts to read committed transactions that were not committed when the query 
> compilation happened, it can cause partial read problems if the committed 
> transaction created a new partition in the source or target table.
> The solution should be not only to fix the snapshot but also to recompile the 
> query and acquire the locks again.
> You could construct an example like this:
> 1. Open and compile transaction 1 that merge inserts data from a partitioned 
> source table that has a few partitions.
> 2. Open, run and commit transaction 2 that inserts data to an old and a new 
> partition of the source table.
> 3. Open, run and commit transaction 3 that inserts data to the target table 
> of the merge statement, which will retrigger a snapshot generation in 
> transaction 1.
> 4. Run transaction 1: the snapshot will be regenerated, and it will read 
> partial data from transaction 2, breaking the ACID properties.
> A different setup, switching the transaction order:
> 1. Compile transaction 1 that inserts data to an old and a new partition of 
> the source table.
> 2. Compile transaction 2 that inserts data to the target table.
> 3. Compile transaction 3 that merge inserts data from the source table to the 
> target table.
> 4. Run and commit transaction 1.
> 5. Run and commit transaction 2.
> 6. Run transaction 3: since it contains 1 and 2 in its snapshot, the 
> isValidTxnListState will be triggered and we do a partial read of 
> transaction 1 for the same reasons.
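
For illustration only, here is a sketch of the guard the description argues for;
the names (QueryContext, recompileAndReacquireLocks) are hypothetical and not
Hive's actual Driver API:

{code:java}
// Hypothetical sketch: if the transaction snapshot became outdated between
// compilation and execution, recompile and re-acquire locks instead of only
// refreshing the snapshot.
class SnapshotGuard {
  interface QueryContext {
    String compiledTxnListState();
    String currentTxnListState();
    void recompileAndReacquireLocks();
  }

  static void ensureConsistentSnapshot(QueryContext ctx) {
    if (!ctx.currentTxnListState().equals(ctx.compiledTxnListState())) {
      // New transactions (possibly adding partitions) committed after compile
      // time; re-planning avoids the partial reads described above.
      ctx.recompileAndReacquireLocks();
    }
  }
}
{code}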



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-24032) Remove hadoop shims dependency and use FileSystem Api directly from standalone metastore

2020-08-25 Thread Aasha Medhi (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184110#comment-17184110
 ] 

Aasha Medhi commented on HIVE-24032:


The shims dependency is removed from the standalone-metastore module. [~vihangk1] 
had done some work to make the standalone-metastore independent, and Impala also 
uses this. We had to add shims dependencies for one of the replication-related 
changes. We have reverted that now with this patch. Please note this is only 
for the standalone-metastore module.

Also, shims were introduced to support multiple Hadoop versions, but as of now we 
just support one. Even with Ozone the FS APIs can be used directly. So that was 
another reason for not introducing shims in standalone-metastore. 


> Remove hadoop shims dependency and use FileSystem Api directly from 
> standalone metastore
> 
>
> Key: HIVE-24032
> URL: https://issues.apache.org/jira/browse/HIVE-24032
> Project: Hive
>  Issue Type: Task
>Reporter: Aasha Medhi
>Assignee: Aasha Medhi
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-24032.01.patch, HIVE-24032.02.patch, 
> HIVE-24032.03.patch
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Remove hadoop shims dependency from standalone metastore. 
> Rename hive.repl.data.copy.lazy hive conf to 
> hive.repl.run.data.copy.tasks.on.target



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24041) Extend semijoin conversion rules

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24041?focusedWorklogId=474347=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474347
 ]

ASF GitHub Bot logged work on HIVE-24041:
-

Author: ASF GitHub Bot
Created on: 25/Aug/20 14:26
Start Date: 25/Aug/20 14:26
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on a change in pull request #1405:
URL: https://github.com/apache/hive/pull/1405#discussion_r476491640



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveSemiJoinRule.java
##
@@ -33,194 +37,263 @@
 import org.apache.calcite.rex.RexBuilder;
 import org.apache.calcite.rex.RexNode;
 import org.apache.calcite.tools.RelBuilder;
+import org.apache.calcite.tools.RelBuilder.GroupKey;
 import org.apache.calcite.tools.RelBuilderFactory;
 import org.apache.calcite.util.ImmutableBitSet;
+import org.apache.calcite.util.ImmutableIntList;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveCalciteUtil;
 import org.apache.hadoop.hive.ql.optimizer.calcite.HiveRelFactories;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
 import com.google.common.collect.ImmutableList;
-import com.google.common.collect.Lists;
 
 import java.util.ArrayList;
 import java.util.List;
 
 /**
- * Planner rule that creates a {@code SemiJoinRule} from a
- * {@link org.apache.calcite.rel.core.Join} on top of a
- * {@link org.apache.calcite.rel.logical.LogicalAggregate}.
- *
- * TODO Remove this rule and use Calcite's SemiJoinRule. Not possible currently
- * since Calcite doesnt use RelBuilder for this rule and we want to generate 
HiveSemiJoin rel here.
+ * Class that gathers SemiJoin conversion rules.
  */
-public abstract class HiveSemiJoinRule extends RelOptRule {
+public class HiveSemiJoinRule {
 
-  protected static final Logger LOG = 
LoggerFactory.getLogger(HiveSemiJoinRule.class);
+  public static final HiveProjectJoinToSemiJoinRule INSTANCE_PROJECT =
+  new HiveProjectJoinToSemiJoinRule(HiveRelFactories.HIVE_BUILDER);
 
-  public static final HiveProjectToSemiJoinRule INSTANCE_PROJECT =
-  new HiveProjectToSemiJoinRule(HiveRelFactories.HIVE_BUILDER);
+  public static final HiveAggregateJoinToSemiJoinRule INSTANCE_AGGREGATE =
+  new HiveAggregateJoinToSemiJoinRule(HiveRelFactories.HIVE_BUILDER);
 
-  public static final HiveProjectToSemiJoinRuleSwapInputs 
INSTANCE_PROJECT_SWAPPED =
-  new HiveProjectToSemiJoinRuleSwapInputs(HiveRelFactories.HIVE_BUILDER);
+  public static final HiveProjectJoinToSemiJoinRuleSwapInputs 
INSTANCE_PROJECT_SWAPPED =
+  new 
HiveProjectJoinToSemiJoinRuleSwapInputs(HiveRelFactories.HIVE_BUILDER);
 
-  public static final HiveAggregateToSemiJoinRule INSTANCE_AGGREGATE =
-  new HiveAggregateToSemiJoinRule(HiveRelFactories.HIVE_BUILDER);
+  public static final HiveAggregateJoinToSemiJoinRuleSwapInputs 
INSTANCE_AGGREGATE_SWAPPED =
+  new 
HiveAggregateJoinToSemiJoinRuleSwapInputs(HiveRelFactories.HIVE_BUILDER);

Review comment:
   done.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474347)
Time Spent: 40m  (was: 0.5h)

> Extend semijoin conversion rules
> 
>
> Key: HIVE-24041
> URL: https://issues.apache.org/jira/browse/HIVE-24041
> Project: Hive
>  Issue Type: Improvement
>  Components: CBO
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> This patch fixes a couple of limitations that can be seen in 
> {{cbo_query95.q}}, in particular:
> - It adds a rule to trigger semijoin conversion when there is an 
> aggregate on top of the join that prunes all columns from the left side, and the 
> aggregate operator is on the left input of the join.
> - It extends existing semijoin conversion rules to prune the unused columns 
> from its left input, which leads to additional conversion opportunities.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24041) Extend semijoin conversion rules

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24041?focusedWorklogId=474345=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474345
 ]

ASF GitHub Bot logged work on HIVE-24041:
-

Author: ASF GitHub Bot
Created on: 25/Aug/20 14:23
Start Date: 25/Aug/20 14:23
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on a change in pull request #1405:
URL: https://github.com/apache/hive/pull/1405#discussion_r476489342



##
File path: ql/src/test/queries/clientpositive/auto_sortmerge_join_10.q
##
@@ -48,6 +48,8 @@ select count(*) from
   (select a.key as key, a.value as value from tbl2_n4 a where key < 6) subq2
   on subq1.key = subq2.key;
 
+set hive.auto.convert.sortmerge.join=false;

Review comment:
   Yes, this will be tackled in a follow-up by @maheshk114; there is a 
problem with the execution of the semijoin. I will reference the JIRA.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474345)
Time Spent: 0.5h  (was: 20m)

> Extend semijoin conversion rules
> 
>
> Key: HIVE-24041
> URL: https://issues.apache.org/jira/browse/HIVE-24041
> Project: Hive
>  Issue Type: Improvement
>  Components: CBO
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This patch fixes a couple of limitations that can be seen in 
> {{cbo_query95.q}}, in particular:
> - It adds a rule to trigger semijoin conversion when there is an 
> aggregate on top of the join that prunes all columns from the left side, and the 
> aggregate operator is on the left input of the join.
> - It extends existing semijoin conversion rules to prune the unused columns 
> from its left input, which leads to additional conversion opportunities.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-24071) Continue cleaning the NotificationEvents till we have data greater than TTL

2020-08-25 Thread Ramesh Kumar Thangarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramesh Kumar Thangarajan reassigned HIVE-24071:
---


> Continue cleaning the NotificationEvents till we have data greater than TTL
> ---
>
> Key: HIVE-24071
> URL: https://issues.apache.org/jira/browse/HIVE-24071
> Project: Hive
>  Issue Type: Bug
>  Components: repl
>Affects Versions: 4.0.0
>Reporter: Ramesh Kumar Thangarajan
>Assignee: Ramesh Kumar Thangarajan
>Priority: Major
> Fix For: 4.0.0
>
>
> Continue cleaning the NotificationEvents till we have data greater than TTL.
> Currently we only clean the notification events once every 2 hours and also 
> strict 1 every time. We should continue deleting until we clear up all 
> the notification events greater than TTL.
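
A minimal sketch of that loop, assuming a hypothetical EventStore interface in
place of the real metastore accessors (this is not Hive code, it only shows the
"delete in bounded batches until nothing older than the TTL remains" shape):

{code:java}
import java.util.List;

// EventStore is a stand-in for the metastore's notification event access.
interface EventStore {
  List<Long> findEventIdsOlderThan(long cutoffEpochSec, int maxBatch);
  void deleteEvents(List<Long> eventIds);
}

public class TtlEventCleaner {
  public static int cleanAll(EventStore store, long ttlSeconds, int batchSize) {
    long cutoff = System.currentTimeMillis() / 1000L - ttlSeconds;
    int deleted = 0;
    while (true) {
      List<Long> batch = store.findEventIdsOlderThan(cutoff, batchSize);
      if (batch.isEmpty()) {
        break;                      // nothing older than the TTL remains
      }
      store.deleteEvents(batch);    // bounded batches keep memory and txn size small
      deleted += batch.size();
    }
    return deleted;
  }
}
{code}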



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23829) Compute Stats Incorrect for Binary Columns

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23829?focusedWorklogId=474332=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474332
 ]

ASF GitHub Bot logged work on HIVE-23829:
-

Author: ASF GitHub Bot
Created on: 25/Aug/20 14:03
Start Date: 25/Aug/20 14:03
Worklog Time Spent: 10m 
  Work Description: belugabehr edited a comment on pull request #1313:
URL: https://github.com/apache/hive/pull/1313#issuecomment-680044199


   @HunterL Really great stuff.  Need one test with 
`hive.serialization.decode.binary.as.base64` set to `true`.
   
   Edit: The default is `true`, so presumably some tests have this flag enabled.  
Are there any examples of this being exercised? (i.e., doing base-64 conversion 
on a data set?)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474332)
Time Spent: 1h 20m  (was: 1h 10m)

> Compute Stats Incorrect for Binary Columns
> --
>
> Key: HIVE-23829
> URL: https://issues.apache.org/jira/browse/HIVE-23829
> Project: Hive
>  Issue Type: Bug
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> I came across an issue when working on [HIVE-22674].
> The SerDe used for processing binary data tries to auto-detect if the data is 
> in Base-64.  It uses 
> {{org.apache.commons.codec.binary.Base64#isArrayByteBase64}} which has two 
> issues:
> # It's slow since it will check if the array is compatible,... and then 
> process the data (examines the array twice)
> # More importantly, this method _Tests a given byte array to see if it 
> contains only valid characters within the Base64 alphabet. Currently the 
> method treats whitespace as valid._
> https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/Base64.html#isArrayByteBase64-byte:A-
> The 
> [qtest|https://github.com/apache/hive/blob/f98e136bdd5642e3de10d2fd1a4c14d1d6762113/ql/src/test/queries/clientpositive/compute_stats_binary.q]
>  for this feature uses full sentences (which includes spaces) 
> [here|https://github.com/apache/hive/blob/f98e136bdd5642e3de10d2fd1a4c14d1d6762113/data/files/binary.txt]
>  and therefore it thinks this data is Base-64 and returns an incorrect 
> estimation for size.
> This should really not auto-detect Base64 data and instead it should be 
> enabled with a table property.
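
The whitespace problem is easy to reproduce with commons-codec on the classpath;
this small stand-alone check (not part of the patch) shows a plain sentence
passing the Base-64 detection:

{code:java}
import java.nio.charset.StandardCharsets;
import org.apache.commons.codec.binary.Base64;

public class Base64AutoDetectDemo {
  public static void main(String[] args) {
    // Every letter is in the Base64 alphabet and whitespace counts as valid,
    // so an ordinary sentence is "detected" as Base64-encoded data.
    byte[] sentence = "the quick brown fox".getBytes(StandardCharsets.UTF_8);
    System.out.println(Base64.isArrayByteBase64(sentence)); // prints true
  }
}
{code}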



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23829) Compute Stats Incorrect for Binary Columns

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23829?focusedWorklogId=474329=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474329
 ]

ASF GitHub Bot logged work on HIVE-23829:
-

Author: ASF GitHub Bot
Created on: 25/Aug/20 14:02
Start Date: 25/Aug/20 14:02
Worklog Time Spent: 10m 
  Work Description: belugabehr commented on pull request #1313:
URL: https://github.com/apache/hive/pull/1313#issuecomment-680044199


   @HunterL Really great stuff.  Need one test with 
`hive.serialization.decode.binary.as.base64` set to `true`.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474329)
Time Spent: 1h 10m  (was: 1h)

> Compute Stats Incorrect for Binary Columns
> --
>
> Key: HIVE-23829
> URL: https://issues.apache.org/jira/browse/HIVE-23829
> Project: Hive
>  Issue Type: Bug
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> I came across an issue when working on [HIVE-22674].
> The SerDe used for processing binary data tries to auto-detect if the data is 
> in Base-64.  It uses 
> {{org.apache.commons.codec.binary.Base64#isArrayByteBase64}} which has two 
> issues:
> # It's slow since it will check if the array is compatible,... and then 
> process the data (examines the array twice)
> # More importantly, this method _Tests a given byte array to see if it 
> contains only valid characters within the Base64 alphabet. Currently the 
> method treats whitespace as valid._
> https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/Base64.html#isArrayByteBase64-byte:A-
> The 
> [qtest|https://github.com/apache/hive/blob/f98e136bdd5642e3de10d2fd1a4c14d1d6762113/ql/src/test/queries/clientpositive/compute_stats_binary.q]
>  for this feature uses full sentences (which includes spaces) 
> [here|https://github.com/apache/hive/blob/f98e136bdd5642e3de10d2fd1a4c14d1d6762113/data/files/binary.txt]
>  and therefore it thinks this data is Base-64 and returns an incorrect 
> estimation for size.
> This should really not auto-detect Base64 data and instead it should be 
> enabled with a table property.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-24032) Remove hadoop shims dependency and use FileSystem Api directly from standalone metastore

2020-08-25 Thread Peter Vary (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183992#comment-17183992
 ] 

Peter Vary commented on HIVE-24032:
---

[~anishek], [~aasha]: I left a comment on the pull request 2 days ago which 
might be missed in the noise:
{quote}
Why is this a good thing?
AFAIK we introduced the Shims to be able to abstract out Hadoop dependencies 
and to be able to work with different versions of Hadoop altogether. Removing 
shims will again fix us to a specific Hadoop version.
{quote}

I see that this was committed, but I still would like to understand the 
reasoning behind this change. Do we plan to remove the whole Shims module?

Thanks,
Peter


> Remove hadoop shims dependency and use FileSystem Api directly from 
> standalone metastore
> 
>
> Key: HIVE-24032
> URL: https://issues.apache.org/jira/browse/HIVE-24032
> Project: Hive
>  Issue Type: Task
>Reporter: Aasha Medhi
>Assignee: Aasha Medhi
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-24032.01.patch, HIVE-24032.02.patch, 
> HIVE-24032.03.patch
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Remove hadoop shims dependency from standalone metastore. 
> Rename hive.repl.data.copy.lazy hive conf to 
> hive.repl.run.data.copy.tasks.on.target



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24067) TestReplicationScenariosExclusiveReplica - Wrong FS error during DB drop

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24067?focusedWorklogId=474268=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474268
 ]

ASF GitHub Bot logged work on HIVE-24067:
-

Author: ASF GitHub Bot
Created on: 25/Aug/20 12:28
Start Date: 25/Aug/20 12:28
Worklog Time Spent: 10m 
  Work Description: aasha commented on a change in pull request #1425:
URL: https://github.com/apache/hive/pull/1425#discussion_r476408842



##
File path: 
itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/BaseReplicationAcrossInstances.java
##
@@ -114,10 +115,9 @@ public static void classLevelTearDown() throws IOException 
{
 replica.close();
   }
 
-  private static void setReplicaExternalBase(FileSystem fs, Map confMap) throws IOException {
+  private static void setFullyQualifiedReplicaExternalTableBase(FileSystem fs) 
throws IOException {

Review comment:
   Add a test to see that the db doesn't exist after the drop and that the drop succeeded





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474268)
Time Spent: 20m  (was: 10m)

> TestReplicationScenariosExclusiveReplica - Wrong FS error during DB drop
> 
>
> Key: HIVE-24067
> URL: https://issues.apache.org/jira/browse/HIVE-24067
> Project: Hive
>  Issue Type: Task
>Reporter: Pravin Sinha
>Assignee: Pravin Sinha
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-24067.01.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In TestReplicationScenariosExclusiveReplica, during the drop database operation 
> for the primary db, it leads to a wrong FS error as the ReplChangeManager is 
> associated with the replica FS.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-24059) Llap external client - Initial changes for running in cloud environment

2020-08-25 Thread Shubham Chaurasia (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183964#comment-17183964
 ] 

Shubham Chaurasia commented on HIVE-24059:
--

Fixed tests, all green now - 
http://ci.hive.apache.org/blue/organizations/jenkins/hive-precommit/detail/PR-1418/3/pipeline

> Llap external client - Initial changes for running in cloud environment
> ---
>
> Key: HIVE-24059
> URL: https://issues.apache.org/jira/browse/HIVE-24059
> Project: Hive
>  Issue Type: Sub-task
>  Components: llap
>Reporter: Shubham Chaurasia
>Assignee: Shubham Chaurasia
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Please see problem description in 
> https://issues.apache.org/jira/browse/HIVE-24058
> Initial changes include - 
> 1. Moving LLAP discovery logic from client side to server (HS2 / get_splits) 
> side.
> 2. Opening additional RPC port in LLAP Daemon.
> 3. JWT Based authentication on this port.
> cc [~prasanth_j] [~jdere] [~anishek] [~thejas]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-23123) Disable export/import of views and materialized views

2020-08-25 Thread Anishek Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183958#comment-17183958
 ] 

Anishek Agarwal commented on HIVE-23123:


I agree that in the context of import/export statements, we might not require 
supporting views/MVs. However, the code paths for replication and export/import 
are common in a lot of places since they do similar things during bootstrap. 
For replication, including views + MVs would be required and would work well, 
since we would copy the state of the MV based on the valid Txn List. Currently 
replication performs view replication as it is (assuming the view definition is 
only restricted to data in the same db) and disables MVs. We will try to prevent 
MV export/import and only enable it via replication. 

> Disable export/import of views and materialized views
> -
>
> Key: HIVE-23123
> URL: https://issues.apache.org/jira/browse/HIVE-23123
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Miklos Gergely
>Assignee: Miklos Gergely
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-23123.01.patch, HIVE-23123.02.patch, 
> HIVE-23123.03.patch
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> According to 
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ImportExport]
>  import and export can be done by using the
> {code:java}
> export table ...
> import table ... 
> {code}
> commands. The document doesn't mention views or materialized views at all, 
> and in fact we don't support commands like
> {code:java}
> export view ...
> import view ...
> export materialized view ...
> import materialized view ... 
> {code}
> they cannot be parsed at all. The word table is often used, though, in a 
> broader sense, in which it means all table-like entities, including views and 
> materialized views. For example, the various Table classes may represent any 
> of these as well.
> If I try to export a view with the export table ... command, it goes fine. A 
> _metadata file will be created, but no data directory, which is what we'd 
> expect. If I try to import it back, an exception is thrown due to the lack of 
> the data dir:
> {code:java}
> java.lang.AssertionError: null==getPath() for exim_view
>  at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:3088)
>  at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:419)
>  at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213)
>  at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105)
>  at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:364)
>  at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:335)
>  at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:246)
>  at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:109)
>  at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:722)
>  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:491)
>  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:485) 
> {code}
> Still the view gets imported successfully, as data movement wasn't even 
> necessary.
> If we try to export a materialized view which is transactional, then this 
> exception occurs:
> {code:java}
> org.apache.hadoop.hive.ql.parse.SemanticException: 
> org.apache.hadoop.hive.ql.metadata.InvalidTableException: Table not found 
> exim_materialized_view_da21d41a_9fe4_4446_9c72_d251496abf9d
>  at 
> org.apache.hadoop.hive.ql.parse.AcidExportSemanticAnalyzer.analyzeAcidExport(AcidExportSemanticAnalyzer.java:163)
>  at 
> org.apache.hadoop.hive.ql.parse.AcidExportSemanticAnalyzer.analyze(AcidExportSemanticAnalyzer.java:71)
>  at 
> org.apache.hadoop.hive.ql.parse.RewriteSemanticAnalyzer.analyzeInternal(RewriteSemanticAnalyzer.java:72)
>  at 
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:289)
>  at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:220)
>  at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:104)
>  at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:183)
>  at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:601)
>  at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:547)
>  at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:541) 
> {code}
> So the export process can not handle it, as the temporary table is not 
> getting created.
>  
> The import command handling has a lot of code dedicated to importing views 
> and materialized views, which suggests that we support the importing (and 
> thus also implicitly that we support the exporting) of views and 
> materialized views.
>  
> So the conclusion is that we have to decide if we support exporting/importing 
> of views and 

[jira] [Assigned] (HIVE-24061) Improve llap task scheduling for better cache hit rate

2020-08-25 Thread Rajesh Balamohan (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan reassigned HIVE-24061:
---

Assignee: Rajesh Balamohan

> Improve llap task scheduling for better cache hit rate 
> ---
>
> Key: HIVE-24061
> URL: https://issues.apache.org/jira/browse/HIVE-24061
> Project: Hive
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
>Priority: Major
>  Labels: perfomance, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> TaskInfo is initialized with the "requestTime and locality delay". When lots 
> of vertices are in the same level, "taskInfo" details would be available 
> upfront. By the time it gets to scheduling, "requestTime + localityDelay" 
> won't be higher than the current time. Due to this, it misses the scheduling 
> delay details and ends up choosing a random node. This ends up missing cache 
> hits and reading data from remote storage.
> E.g. this pattern was observed in Q75 of tpcds.
> Related lines of interest in scheduler: 
> [https://github.com/apache/hive/blob/master/llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java
>  
> |https://github.com/apache/hive/blob/master/llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java]
> {code:java}
>boolean shouldDelayForLocality = 
> request.shouldDelayForLocality(schedulerAttemptTime);
> ..
> ..
> boolean shouldDelayForLocality(long schedulerAttemptTime) {
>   return localityDelayTimeout > schedulerAttemptTime;
> }
> {code}
>  
> Ideally, "localityDelayTimeout" should be adjusted based on its first 
> scheduling opportunity.
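
A rough sketch of that adjustment (the field and method names here are
illustrative, not the actual LlapTaskSchedulerService members): start the
locality window at the first scheduling attempt instead of at request time.

{code:java}
// Illustrative only; mirrors the shouldDelayForLocality() shape quoted above.
class TaskLocalityDelay {
  private final long localityDelayMs;
  private long localityDelayTimeout = -1;   // unset until first scheduling attempt

  TaskLocalityDelay(long localityDelayMs) {
    this.localityDelayMs = localityDelayMs;
  }

  boolean shouldDelayForLocality(long schedulerAttemptTimeMs) {
    if (localityDelayTimeout < 0) {
      // First scheduling opportunity: anchor the locality window here rather
      // than at the (possibly much earlier) request time.
      localityDelayTimeout = schedulerAttemptTimeMs + localityDelayMs;
    }
    return localityDelayTimeout > schedulerAttemptTimeMs;
  }
}
{code}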



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HIVE-24015) Disable query-based compaction on MR execution engine

2020-08-25 Thread Karen Coppage (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karen Coppage resolved HIVE-24015.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Committed to master Aug 13th.

> Disable query-based compaction on MR execution engine
> -
>
> Key: HIVE-24015
> URL: https://issues.apache.org/jira/browse/HIVE-24015
> Project: Hive
>  Issue Type: Task
>Reporter: Karen Coppage
>Assignee: Karen Coppage
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Major compaction can be run when the execution engine is MR. This can cause 
> data loss a la HIVE-23703 (the fix for data loss when the execution engine is 
> MR was reverted by HIVE-23763).
> Currently minor compaction can only be run when the execution engine is Tez, 
> otherwise it falls back to MR (non-query-based) compaction. We should extend 
> this functionality to major compaction as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24061) Improve llap task scheduling for better cache hit rate

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24061?focusedWorklogId=474236=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474236
 ]

ASF GitHub Bot logged work on HIVE-24061:
-

Author: ASF GitHub Bot
Created on: 25/Aug/20 10:52
Start Date: 25/Aug/20 10:52
Worklog Time Spent: 10m 
  Work Description: rbalamohan opened a new pull request #1431:
URL: https://github.com/apache/hive/pull/1431


   https://issues.apache.org/jira/browse/HIVE-24061
   
   Changes:
   1. Adjust the locality delay when the task is getting scheduled.
   2. Reset the locality delay when all nodes in the cluster are busy and would 
not be able to schedule tasks.
   3. Optimize schedulePendingTasks to exit early when all nodes are busy. This 
also helps reduce lock contention.
   
   The patch was tested on a medium-scale cluster and showed good improvements 
in query runtimes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474236)
Remaining Estimate: 0h
Time Spent: 10m

> Improve llap task scheduling for better cache hit rate 
> ---
>
> Key: HIVE-24061
> URL: https://issues.apache.org/jira/browse/HIVE-24061
> Project: Hive
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Priority: Major
>  Labels: perfomance
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> TaskInfo is initialized with the "requestTime and locality delay". When lots 
> of vertices are in the same level, "taskInfo" details would be available 
> upfront. By the time it gets to scheduling, "requestTime + localityDelay" 
> won't be higher than the current time. Due to this, it misses the scheduling 
> delay details and ends up choosing a random node. This ends up missing cache 
> hits and reading data from remote storage.
> E.g. this pattern was observed in Q75 of tpcds.
> Related lines of interest in scheduler: 
> [https://github.com/apache/hive/blob/master/llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java
>  
> |https://github.com/apache/hive/blob/master/llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java]
> {code:java}
>boolean shouldDelayForLocality = 
> request.shouldDelayForLocality(schedulerAttemptTime);
> ..
> ..
> boolean shouldDelayForLocality(long schedulerAttemptTime) {
>   return localityDelayTimeout > schedulerAttemptTime;
> }
> {code}
>  
> Ideally, "localityDelayTimeout" should be adjusted based on its first 
> scheduling opportunity.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-24061) Improve llap task scheduling for better cache hit rate

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-24061:
--
Labels: perfomance pull-request-available  (was: perfomance)

> Improve llap task scheduling for better cache hit rate 
> ---
>
> Key: HIVE-24061
> URL: https://issues.apache.org/jira/browse/HIVE-24061
> Project: Hive
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Priority: Major
>  Labels: perfomance, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> TaskInfo is initialized with the "requestTime and locality delay". When lots 
> of vertices are in the same level, "taskInfo" details would be available 
> upfront. By the time it gets to scheduling, "requestTime + localityDelay" 
> won't be higher than the current time. Due to this, it misses the scheduling 
> delay details and ends up choosing a random node. This ends up missing cache 
> hits and reading data from remote storage.
> E.g. this pattern was observed in Q75 of tpcds.
> Related lines of interest in scheduler: 
> [https://github.com/apache/hive/blob/master/llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java
>  
> |https://github.com/apache/hive/blob/master/llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java]
> {code:java}
>boolean shouldDelayForLocality = 
> request.shouldDelayForLocality(schedulerAttemptTime);
> ..
> ..
> boolean shouldDelayForLocality(long schedulerAttemptTime) {
>   return localityDelayTimeout > schedulerAttemptTime;
> }
> {code}
>  
> Ideally, "localityDelayTimeout" should be adjusted based on its first 
> scheduling opportunity.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-24070) ObjectStore.cleanWriteNotificationEvents OutOfMemory on large number of pending events

2020-08-25 Thread Ramesh Kumar Thangarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183905#comment-17183905
 ] 

Ramesh Kumar Thangarajan commented on HIVE-24070:
-

[~anishek] Do we already have a Jira for that? Otherwise I can create one.

> ObjectStore.cleanWriteNotificationEvents OutOfMemory on large number of 
> pending events
> --
>
> Key: HIVE-24070
> URL: https://issues.apache.org/jira/browse/HIVE-24070
> Project: Hive
>  Issue Type: Bug
>  Components: repl
>Affects Versions: 4.0.0
>Reporter: Ramesh Kumar Thangarajan
>Assignee: Ramesh Kumar Thangarajan
>Priority: Major
> Fix For: 4.0.0
>
>
> If there are a large number of events that haven't been cleaned up for some 
> reason, then ObjectStore.cleanWriteNotificationEvents() can run out of memory 
> while it loads all the events to be deleted.
>  It should fetch events in batches.
> Similar to https://issues.apache.org/jira/browse/HIVE-19430
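
One possible shape for the batched variant, sketched with plain JDO and a
placeholder candidate class name (the real ObjectStore query and model class may
differ, and transaction handling is left to the caller):

{code:java}
import java.util.List;
import javax.jdo.PersistenceManager;
import javax.jdo.Query;

// Sketch only: "MTxnWriteNotificationLog" is a placeholder for the JDO model
// class the ObjectStore would query; the point is setRange-based batching so
// the candidate events are never all loaded into memory at once.
public class BatchedWriteEventCleaner {
  public static long clean(PersistenceManager pm, long cutoffEpochSec, int batchSize) {
    long totalDeleted = 0;
    while (true) {
      Query query = pm.newQuery(
          "SELECT FROM MTxnWriteNotificationLog WHERE eventTime <= :cutoff");
      query.setRange(0, batchSize);                  // bound each fetch
      List<?> batch = (List<?>) query.execute(cutoffEpochSec);
      if (batch.isEmpty()) {
        query.closeAll();
        break;                                       // nothing left to clean
      }
      pm.deletePersistentAll(batch);                 // delete only this batch
      totalDeleted += batch.size();
      query.closeAll();
    }
    return totalDeleted;
  }
}
{code}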



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-24070) ObjectStore.cleanWriteNotificationEvents OutOfMemory on large number of pending events

2020-08-25 Thread Ramesh Kumar Thangarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183900#comment-17183900
 ] 

Ramesh Kumar Thangarajan commented on HIVE-24070:
-

The issue you mentioned is still present and needs to be addressed, though it 
might not cause the service to stop. The OOM currently causes the service to 
stop. We might have to create a separate Jira to address that issue.

> ObjectStore.cleanWriteNotificationEvents OutOfMemory on large number of 
> pending events
> --
>
> Key: HIVE-24070
> URL: https://issues.apache.org/jira/browse/HIVE-24070
> Project: Hive
>  Issue Type: Bug
>  Components: repl
>Affects Versions: 4.0.0
>Reporter: Ramesh Kumar Thangarajan
>Assignee: Ramesh Kumar Thangarajan
>Priority: Major
> Fix For: 4.0.0
>
>
> If there are a large number of events that haven't been cleaned up for some 
> reason, then ObjectStore.cleanWriteNotificationEvents() can run out of memory 
> while it loads all the events to be deleted.
>  It should fetch events in batches.
> Similar to https://issues.apache.org/jira/browse/HIVE-19430



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-24070) ObjectStore.cleanWriteNotificationEvents OutOfMemory on large number of pending events

2020-08-25 Thread Ramesh Kumar Thangarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183898#comment-17183898
 ] 

Ramesh Kumar Thangarajan commented on HIVE-24070:
-

Jira HIVE-19430 only solves the OOM issue with cleanNotificationEvents(). We 
still reach OOM through cleanWriteNotificationEvents() at 
[https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L10429]

> ObjectStore.cleanWriteNotificationEvents OutOfMemory on large number of 
> pending events
> --
>
> Key: HIVE-24070
> URL: https://issues.apache.org/jira/browse/HIVE-24070
> Project: Hive
>  Issue Type: Bug
>  Components: repl
>Affects Versions: 4.0.0
>Reporter: Ramesh Kumar Thangarajan
>Assignee: Ramesh Kumar Thangarajan
>Priority: Major
> Fix For: 4.0.0
>
>
> If there are a large number of events that haven't been cleaned up for some 
> reason, then ObjectStore.cleanWriteNotificationEvents() can run out of memory 
> while it loads all the events to be deleted.
>  It should fetch events in batches.
> Similar to https://issues.apache.org/jira/browse/HIVE-19430



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24065) Bloom filters can be cached after deserialization in VectorInBloomFilterColDynamicValue

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24065?focusedWorklogId=474213=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474213
 ]

ASF GitHub Bot logged work on HIVE-24065:
-

Author: ASF GitHub Bot
Created on: 25/Aug/20 08:56
Start Date: 25/Aug/20 08:56
Worklog Time Spent: 10m 
  Work Description: abstractdog commented on a change in pull request #1423:
URL: https://github.com/apache/hive/pull/1423#discussion_r476289475



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/VectorInBloomFilterColDynamicValue.java
##
@@ -100,26 +103,39 @@ public void init(Configuration conf) {
 default:
   throw new IllegalStateException("Unsupported type " + colVectorType);
 }
+
+String queryId = HiveConf.getVar(conf, HiveConf.ConfVars.HIVEQUERYID);
+runtimeCache = ObjectCacheFactory.getCache(conf, queryId, false, true);
   }
 
-  private void initValue()  {
-InputStream in = null;
+  private void initValue() {
 try {
-  Object val = bloomFilterDynamicValue.getValue();
-  if (val != null) {
-BinaryObjectInspector boi = (BinaryObjectInspector) 
bloomFilterDynamicValue.getObjectInspector();
-byte[] bytes = boi.getPrimitiveJavaObject(val);
-in = new NonSyncByteArrayInputStream(bytes);
-bloomFilter = BloomKFilter.deserialize(in);
-  } else {
-bloomFilter = null;
-  }
-  initialized = true;
-} catch (Exception err) {
-  throw new RuntimeException(err);
-} finally {
-  IOUtils.closeStream(in);

Review comment:
   added it back and force pushed
   
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474213)
Time Spent: 1h  (was: 50m)

> Bloom filters can be cached after deserialization in 
> VectorInBloomFilterColDynamicValue
> ---
>
> Key: HIVE-24065
> URL: https://issues.apache.org/jira/browse/HIVE-24065
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2020-08-05-10-05-25-080.png
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Same bloom filter is loaded multiple times across tasks. It would be good to 
> check if we can optimise this, to avoid deserializing.
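
A stand-alone sketch of the caching idea (in the patch itself the per-query
ObjectCache obtained from ObjectCacheFactory, visible in the review diffs quoted
earlier, plays this role): deserialize once per key and let every task in the
same process reuse the result.

{code:java}
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Generic memoizing cache; the loader would wrap BloomKFilter.deserialize(...).
public class PerQueryValueCache {
  private final ConcurrentMap<String, Object> cache = new ConcurrentHashMap<>();

  @SuppressWarnings("unchecked")
  public <T> T retrieve(String key, Callable<T> loader) {
    return (T) cache.computeIfAbsent(key, k -> {
      try {
        return loader.call();   // runs at most once per key in this process
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    });
  }
}
{code}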



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24065) Bloom filters can be cached after deserialization in VectorInBloomFilterColDynamicValue

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24065?focusedWorklogId=474201=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474201
 ]

ASF GitHub Bot logged work on HIVE-24065:
-

Author: ASF GitHub Bot
Created on: 25/Aug/20 08:26
Start Date: 25/Aug/20 08:26
Worklog Time Spent: 10m 
  Work Description: abstractdog commented on a change in pull request #1423:
URL: https://github.com/apache/hive/pull/1423#discussion_r476270406



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/VectorInBloomFilterColDynamicValue.java
##
@@ -100,26 +103,39 @@ public void init(Configuration conf) {
 default:
   throw new IllegalStateException("Unsupported type " + colVectorType);
 }
+
+String queryId = HiveConf.getVar(conf, HiveConf.ConfVars.HIVEQUERYID);
+runtimeCache = ObjectCacheFactory.getCache(conf, queryId, false, true);
   }
 
-  private void initValue()  {
-InputStream in = null;
+  private void initValue() {
 try {
-  Object val = bloomFilterDynamicValue.getValue();
-  if (val != null) {
-BinaryObjectInspector boi = (BinaryObjectInspector) 
bloomFilterDynamicValue.getObjectInspector();
-byte[] bytes = boi.getPrimitiveJavaObject(val);
-in = new NonSyncByteArrayInputStream(bytes);
-bloomFilter = BloomKFilter.deserialize(in);
-  } else {
-bloomFilter = null;
-  }
-  initialized = true;
-} catch (Exception err) {
-  throw new RuntimeException(err);
-} finally {
-  IOUtils.closeStream(in);

Review comment:
   good catch, I think it's needed, or at least I removed it accidentally





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474201)
Time Spent: 50m  (was: 40m)

> Bloom filters can be cached after deserialization in 
> VectorInBloomFilterColDynamicValue
> ---
>
> Key: HIVE-24065
> URL: https://issues.apache.org/jira/browse/HIVE-24065
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2020-08-05-10-05-25-080.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Same bloom filter is loaded multiple times across tasks. It would be good to 
> check if we can optimise this, to avoid deserializing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-24070) ObjectStore.cleanWriteNotificationEvents OutOfMemory on large number of pending events

2020-08-25 Thread Anishek Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183861#comment-17183861
 ] 

Anishek Agarwal commented on HIVE-24070:


Can you give some more details on this please? I thought the config introduced 
as part of HIVE-19430 allows us to control how many events we want to delete. 
However, it looks like there should be an inner loop, since if the number of 
events generated during sleepTime is more than the number deleted due to 
EVENT_CLEAN_MAX_EVENTS, then the db will never get cleaned.

cc [~aasha] / [~pkumarsinha]

> ObjectStore.cleanWriteNotificationEvents OutOfMemory on large number of 
> pending events
> --
>
> Key: HIVE-24070
> URL: https://issues.apache.org/jira/browse/HIVE-24070
> Project: Hive
>  Issue Type: Bug
>  Components: repl
>Affects Versions: 4.0.0
>Reporter: Ramesh Kumar Thangarajan
>Assignee: Ramesh Kumar Thangarajan
>Priority: Major
> Fix For: 4.0.0
>
>
> If there are large number of events that haven't been cleaned up for some 
> reason, then ObjectStore.cleanWriteNotificationEvents() can run out of memory 
> while it loads all the events to be deleted.
>  It should fetch events in batches.
> Similar to https://issues.apache.org/jira/browse/HIVE-19430



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24065) Bloom filters can be cached after deserialization in VectorInBloomFilterColDynamicValue

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24065?focusedWorklogId=474196=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474196
 ]

ASF GitHub Bot logged work on HIVE-24065:
-

Author: ASF GitHub Bot
Created on: 25/Aug/20 08:19
Start Date: 25/Aug/20 08:19
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1423:
URL: https://github.com/apache/hive/pull/1423#discussion_r476265302



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/VectorInBloomFilterColDynamicValue.java
##
@@ -100,26 +103,39 @@ public void init(Configuration conf) {
       default:
         throw new IllegalStateException("Unsupported type " + colVectorType);
     }
+
+    String queryId = HiveConf.getVar(conf, HiveConf.ConfVars.HIVEQUERYID);
+    runtimeCache = ObjectCacheFactory.getCache(conf, queryId, false, true);
   }
 
-  private void initValue()  {
-    InputStream in = null;
+  private void initValue() {
     try {
-      Object val = bloomFilterDynamicValue.getValue();
-      if (val != null) {
-        BinaryObjectInspector boi = (BinaryObjectInspector) bloomFilterDynamicValue.getObjectInspector();
-        byte[] bytes = boi.getPrimitiveJavaObject(val);
-        in = new NonSyncByteArrayInputStream(bytes);
-        bloomFilter = BloomKFilter.deserialize(in);
-      } else {
-        bloomFilter = null;
-      }
-      initialized = true;
-    } catch (Exception err) {
-      throw new RuntimeException(err);
-    } finally {
-      IOUtils.closeStream(in);

Review comment:
   No, for a ByteArrayInputStream it's not needed.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474196)
Time Spent: 40m  (was: 0.5h)

> Bloom filters can be cached after deserialization in 
> VectorInBloomFilterColDynamicValue
> ---
>
> Key: HIVE-24065
> URL: https://issues.apache.org/jira/browse/HIVE-24065
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2020-08-05-10-05-25-080.png
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Same bloom filter is loaded multiple times across tasks. It would be good to 
> check if we can optimise this, to avoid deserializing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24065) Bloom filters can be cached after deserialization in VectorInBloomFilterColDynamicValue

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24065?focusedWorklogId=474192=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474192
 ]

ASF GitHub Bot logged work on HIVE-24065:
-

Author: ASF GitHub Bot
Created on: 25/Aug/20 08:17
Start Date: 25/Aug/20 08:17
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1423:
URL: https://github.com/apache/hive/pull/1423#discussion_r476264312



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/VectorInBloomFilterColDynamicValue.java
##
@@ -100,26 +103,39 @@ public void init(Configuration conf) {
       default:
         throw new IllegalStateException("Unsupported type " + colVectorType);
     }
+
+    String queryId = HiveConf.getVar(conf, HiveConf.ConfVars.HIVEQUERYID);
+    runtimeCache = ObjectCacheFactory.getCache(conf, queryId, false, true);
   }
 
-  private void initValue()  {
-    InputStream in = null;
+  private void initValue() {
     try {
-      Object val = bloomFilterDynamicValue.getValue();
-      if (val != null) {
-        BinaryObjectInspector boi = (BinaryObjectInspector) bloomFilterDynamicValue.getObjectInspector();
-        byte[] bytes = boi.getPrimitiveJavaObject(val);
-        in = new NonSyncByteArrayInputStream(bytes);
-        bloomFilter = BloomKFilter.deserialize(in);
-      } else {
-        bloomFilter = null;
-      }
-      initialized = true;
-    } catch (Exception err) {
-      throw new RuntimeException(err);
-    } finally {
-      IOUtils.closeStream(in);

Review comment:
   I don't see this close in the new implementation...isn't that needed?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474192)
Time Spent: 0.5h  (was: 20m)

> Bloom filters can be cached after deserialization in 
> VectorInBloomFilterColDynamicValue
> ---
>
> Key: HIVE-24065
> URL: https://issues.apache.org/jira/browse/HIVE-24065
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2020-08-05-10-05-25-080.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Same bloom filter is loaded multiple times across tasks. It would be good to 
> check if we can optimise this, to avoid deserializing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HIVE-24023) Hive parquet reader can't read files with length=0

2020-08-25 Thread Karen Coppage (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183852#comment-17183852
 ] 

Karen Coppage edited comment on HIVE-24023 at 8/25/20, 8:14 AM:


Committed to master. Thanks for the review [~kuczoram]!


was (Author: klcopp):
Thanks for the review [~kuczoram]!

> Hive parquet reader can't read files with length=0
> --
>
> Key: HIVE-24023
> URL: https://issues.apache.org/jira/browse/HIVE-24023
> Project: Hive
>  Issue Type: Bug
>Reporter: Karen Coppage
>Assignee: Karen Coppage
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Impala truncates insert-only parquet tables by creating a base directory 
> containing a completely empty file.
> Hive throws an exception upon reading when it looks for metadata:
> {code:java}
> Error: java.io.IOException: java.lang.RuntimeException:  is not a 
> Parquet file (too small length: 0) (state=,code=0){code}
> We can introduce a check for an empty file before Hive tries to read the 
> metadata.
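A minimal form of such a check, sketched here with the plain FileSystem/FileStatus API
(the exact place to hook it into the Parquet reader path may differ from the patch):

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sketch: detect a zero-length file so the Parquet footer read can be skipped. */
public final class EmptyFileCheckSketch {

  private EmptyFileCheckSketch() {
  }

  /** Returns true when the file exists but is empty, i.e. it cannot contain a Parquet footer. */
  public static boolean isZeroLength(Configuration conf, Path file) throws IOException {
    FileStatus status = file.getFileSystem(conf).getFileStatus(file);
    return status.getLen() == 0;
  }
}
{code}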



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-22782) Consolidate metastore call to fetch constraints

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-22782?focusedWorklogId=474189=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474189
 ]

ASF GitHub Bot logged work on HIVE-22782:
-

Author: ASF GitHub Bot
Created on: 25/Aug/20 08:14
Start Date: 25/Aug/20 08:14
Worklog Time Spent: 10m 
  Work Description: ashish-kumar-sharma edited a comment on pull request 
#1419:
URL: https://github.com/apache/hive/pull/1419#issuecomment-679876474


   @vineetgarg02 Can you please review this PR?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474189)
Time Spent: 0.5h  (was: 20m)

> Consolidate metastore call to fetch constraints
> ---
>
> Key: HIVE-22782
> URL: https://issues.apache.org/jira/browse/HIVE-22782
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Planning
>Reporter: Vineet Garg
>Assignee: Ashish Sharma
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently separate calls are made to the metastore to fetch constraints like 
> PK, FK, NOT NULL, etc. Since the planner always retrieves these constraints, 
> we should retrieve all of them in one call.
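As an illustration of the direction described above, the per-constraint results could be
bundled behind a single fetch. The holder below is a hypothetical sketch (the class name
and the idea of filling it from one round trip are assumptions, not the actual metastore
API); only the SQLPrimaryKey/SQLForeignKey/SQLUniqueConstraint/SQLNotNullConstraint
element types are existing metastore classes:

{code:java}
import java.util.List;

import org.apache.hadoop.hive.metastore.api.SQLForeignKey;
import org.apache.hadoop.hive.metastore.api.SQLNotNullConstraint;
import org.apache.hadoop.hive.metastore.api.SQLPrimaryKey;
import org.apache.hadoop.hive.metastore.api.SQLUniqueConstraint;

/** Hypothetical holder for everything the planner needs, filled by one metastore call. */
public class AllTableConstraintsSketch {
  private final List<SQLPrimaryKey> primaryKeys;
  private final List<SQLForeignKey> foreignKeys;
  private final List<SQLUniqueConstraint> uniqueConstraints;
  private final List<SQLNotNullConstraint> notNullConstraints;

  public AllTableConstraintsSketch(List<SQLPrimaryKey> pk, List<SQLForeignKey> fk,
      List<SQLUniqueConstraint> uk, List<SQLNotNullConstraint> nn) {
    this.primaryKeys = pk;
    this.foreignKeys = fk;
    this.uniqueConstraints = uk;
    this.notNullConstraints = nn;
  }

  public List<SQLPrimaryKey> getPrimaryKeys() { return primaryKeys; }
  public List<SQLForeignKey> getForeignKeys() { return foreignKeys; }
  public List<SQLUniqueConstraint> getUniqueConstraints() { return uniqueConstraints; }
  public List<SQLNotNullConstraint> getNotNullConstraints() { return notNullConstraints; }
}
{code}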



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-22782) Consolidate metastore call to fetch constraints

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-22782?focusedWorklogId=474188=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474188
 ]

ASF GitHub Bot logged work on HIVE-22782:
-

Author: ASF GitHub Bot
Created on: 25/Aug/20 08:14
Start Date: 25/Aug/20 08:14
Worklog Time Spent: 10m 
  Work Description: ashish-kumar-sharma commented on pull request #1419:
URL: https://github.com/apache/hive/pull/1419#issuecomment-679876474


   @vineetgarg02 Can please review this PR?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474188)
Time Spent: 20m  (was: 10m)

> Consolidate metastore call to fetch constraints
> ---
>
> Key: HIVE-22782
> URL: https://issues.apache.org/jira/browse/HIVE-22782
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Planning
>Reporter: Vineet Garg
>Assignee: Ashish Sharma
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently separate calls are made to the metastore to fetch constraints like 
> PK, FK, NOT NULL, etc. Since the planner always retrieves these constraints, 
> we should retrieve all of them in one call.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HIVE-24023) Hive parquet reader can't read files with length=0

2020-08-25 Thread Karen Coppage (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karen Coppage resolved HIVE-24023.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Thanks for the review [~kuczoram]!

> Hive parquet reader can't read files with length=0
> --
>
> Key: HIVE-24023
> URL: https://issues.apache.org/jira/browse/HIVE-24023
> Project: Hive
>  Issue Type: Bug
>Reporter: Karen Coppage
>Assignee: Karen Coppage
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Impala truncates insert-only parquet tables by creating a base directory 
> containing a completely empty file.
> Hive throws an exception upon reading when it looks for metadata:
> {code:java}
> Error: java.io.IOException: java.lang.RuntimeException:  is not a 
> Parquet file (too small length: 0) (state=,code=0){code}
> We can introduce a check for an empty file before Hive tries to read the 
> metadata.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24023) Hive parquet reader can't read files with length=0

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24023?focusedWorklogId=474186=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474186
 ]

ASF GitHub Bot logged work on HIVE-24023:
-

Author: ASF GitHub Bot
Created on: 25/Aug/20 08:04
Start Date: 25/Aug/20 08:04
Worklog Time Spent: 10m 
  Work Description: klcopp merged pull request #1388:
URL: https://github.com/apache/hive/pull/1388


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474186)
Time Spent: 20m  (was: 10m)

> Hive parquet reader can't read files with length=0
> --
>
> Key: HIVE-24023
> URL: https://issues.apache.org/jira/browse/HIVE-24023
> Project: Hive
>  Issue Type: Bug
>Reporter: Karen Coppage
>Assignee: Karen Coppage
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Impala truncates insert-only parquet tables by creating a base directory 
> containing a completely empty file.
> Hive throws an exception upon reading when it looks for metadata:
> {code:java}
> Error: java.io.IOException: java.lang.RuntimeException:  is not a 
> Parquet file (too small length: 0) (state=,code=0){code}
> We can introduce a check for an empty file before Hive tries to read the 
> metadata.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-24023) Hive parquet reader can't read files with length=0

2020-08-25 Thread Marta Kuczora (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183843#comment-17183843
 ] 

Marta Kuczora commented on HIVE-24023:
--

+1

Thanks a lot [~klcopp] for the patch.

> Hive parquet reader can't read files with length=0
> --
>
> Key: HIVE-24023
> URL: https://issues.apache.org/jira/browse/HIVE-24023
> Project: Hive
>  Issue Type: Bug
>Reporter: Karen Coppage
>Assignee: Karen Coppage
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Impala truncates insert-only parquet tables by creating a base directory 
> containing a completely empty file.
> Hive throws an exception upon reading when it looks for metadata:
> {code:java}
> Error: java.io.IOException: java.lang.RuntimeException:  is not a 
> Parquet file (too small length: 0) (state=,code=0){code}
> We can introduce a check for an empty file before Hive tries to read the 
> metadata.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-24070) ObjectStore.cleanWriteNotificationEvents OutOfMemory on large number of pending events

2020-08-25 Thread Ramesh Kumar Thangarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramesh Kumar Thangarajan updated HIVE-24070:

Description: 
If there are large number of events that haven't been cleaned up for some 
reason, then ObjectStore.cleanWriteNotificationEvents() can run out of memory 
while it loads all the events to be deleted.
 It should fetch events in batches.

Similar to https://issues.apache.org/jira/browse/HIVE-19430

  was:
If there are large number of events that haven't been cleaned up for some 
reason, then ObjectStore.cleanWriteNotificationEvents() can run out of memory 
while it loads all the events to be deleted.
It should fetch events in batches.


> ObjectStore.cleanWriteNotificationEvents OutOfMemory on large number of 
> pending events
> --
>
> Key: HIVE-24070
> URL: https://issues.apache.org/jira/browse/HIVE-24070
> Project: Hive
>  Issue Type: Bug
>  Components: repl
>Affects Versions: 4.0.0
>Reporter: Ramesh Kumar Thangarajan
>Assignee: Ramesh Kumar Thangarajan
>Priority: Major
> Fix For: 4.0.0
>
>
> If there are large number of events that haven't been cleaned up for some 
> reason, then ObjectStore.cleanWriteNotificationEvents() can run out of memory 
> while it loads all the events to be deleted.
>  It should fetch events in batches.
> Similar to https://issues.apache.org/jira/browse/HIVE-19430



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-24070) ObjectStore.cleanWriteNotificationEvents OutOfMemory on large number of pending events

2020-08-25 Thread Ramesh Kumar Thangarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramesh Kumar Thangarajan reassigned HIVE-24070:
---


> ObjectStore.cleanWriteNotificationEvents OutOfMemory on large number of 
> pending events
> --
>
> Key: HIVE-24070
> URL: https://issues.apache.org/jira/browse/HIVE-24070
> Project: Hive
>  Issue Type: Bug
>  Components: repl
>Affects Versions: 4.0.0
>Reporter: Ramesh Kumar Thangarajan
>Assignee: Ramesh Kumar Thangarajan
>Priority: Major
> Fix For: 4.0.0
>
>
> If there are large number of events that haven't been cleaned up for some 
> reason, then ObjectStore.cleanWriteNotificationEvents() can run out of memory 
> while it loads all the events to be deleted.
> It should fetch events in batches.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24068) Add re-execution plugin for handling DAG submission failures

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24068?focusedWorklogId=474185=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474185
 ]

ASF GitHub Bot logged work on HIVE-24068:
-

Author: ASF GitHub Bot
Created on: 25/Aug/20 07:58
Start Date: 25/Aug/20 07:58
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1428:
URL: https://github.com/apache/hive/pull/1428#discussion_r476242856



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/reexec/ReExecutionDagSubmitPlugin.java
##
@@ -0,0 +1,86 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.reexec;
+
+import org.apache.hadoop.hive.ql.Driver;
+import org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext;
+import org.apache.hadoop.hive.ql.hooks.HookContext;
+import org.apache.hadoop.hive.ql.hooks.HookContext.HookType;
+import org.apache.hadoop.hive.ql.plan.mapper.PlanMapper;
+import org.apache.hadoop.hive.ql.processors.CommandProcessorException;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Re-Executes a query when DAG submission fails after get session returned successfully.
+ */
+public class ReExecutionDagSubmitPlugin implements IReExecutionPlugin {
+
+  private static final Logger LOG = LoggerFactory.getLogger(ReExecutionDagSubmitPlugin.class);
+  class LocalHook implements ExecuteWithHookContext {
+
+    @Override
+    public void run(HookContext hookContext) throws Exception {
+      if (hookContext.getHookType() == HookType.ON_FAILURE_HOOK) {
+        Throwable exception = hookContext.getException();
+        if (exception != null) {
+          if (exception.getMessage() != null) {
+            // there could be race condition where getSession could return a healthy AM but by the time DAG is submitted
+            // the AM could become unhealthy/unreachable (possible DNS or network issues) which can fail tez DAG
+            // submission. Since the DAG hasn't started execution yet this failure can be safely restarted.

Review comment:
   I think this explanation could be moved into the class documentation





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474185)
Time Spent: 20m  (was: 10m)

> Add re-execution plugin for handling DAG submission failures
> 
>
> Key: HIVE-24068
> URL: https://issues.apache.org/jira/browse/HIVE-24068
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Prasanth Jayachandran
>Assignee: Prasanth Jayachandran
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> DAG submission failure can also happen in environments where the AM container 
> died, causing DNS issues. DAG submissions are safe to retry as the DAG hasn't 
> started execution yet. There are retries at the getSession and submitDAG 
> levels individually, but some submitDAG failures have to retry getSession as 
> well since the AM could be unreachable; this can be handled in a re-execution 
> plugin.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23938) LLAP: JDK11 - some GC log file rotation related jvm arguments cannot be used anymore

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-23938:
--
Labels: pull-request-available  (was: )

> LLAP: JDK11 - some GC log file rotation related jvm arguments cannot be used 
> anymore
> 
>
> Key: HIVE-23938
> URL: https://issues.apache.org/jira/browse/HIVE-23938
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
> Attachments: gc_2020-07-27-13.log, gc_2020-07-29-12.jdk8.log
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> https://github.com/apache/hive/blob/master/llap-server/bin/runLlapDaemon.sh#L55
> {code}
> JAVA_OPTS_BASE="-server -Djava.net.preferIPv4Stack=true -XX:+UseNUMA 
> -XX:+PrintGCDetails -verbose:gc -XX:+UseGCLogFileRotation 
> -XX:NumberOfGCLogFiles=4 -XX:GCLogFileSize=100M -XX:+PrintGCDateStamps"
> {code}
> on JDK11 I got something like:
> {code}
> + exec /usr/lib/jvm/jre-11-openjdk/bin/java -Dproc_llapdaemon -Xms32000m 
> -Xmx64000m -Dhttp.maxConnections=17 -XX:+UseG1GC -XX:+ResizeTLAB -XX:+UseNUMA 
> -XX:+AggressiveOpts -XX:MetaspaceSize=1024m 
> -XX:InitiatingHeapOccupancyPercent=80 -XX:MaxGCPauseMillis=200 
> -XX:+PreserveFramePointer -XX:AllocatePrefetchStyle=2 
> -Dhttp.maxConnections=10 -Dasync.profiler.home=/grid/0/async-profiler -server 
> -Djava.net.preferIPv4Stack=true -XX:+UseNUMA -XX:+PrintGCDetails -verbose:gc 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=4 -XX:GCLogFileSize=100M 
> -XX:+PrintGCDateStamps 
> -Xloggc:/grid/2/yarn/container-logs/application_1595375468459_0113/container_e26_1595375468459_0113_01_09/gc_2020-07-27-12.log
>  
> ... 
> org.apache.hadoop.hive.llap.daemon.impl.LlapDaemon
> OpenJDK 64-Bit Server VM warning: Option AggressiveOpts was deprecated in 
> version 11.0 and will likely be removed in a future release.
> Unrecognized VM option 'UseGCLogFileRotation'
> Error: Could not create the Java Virtual Machine.
> Error: A fatal exception has occurred. Program will exit.
> {code}
> These are not valid in JDK11:
> {code}
> -XX:+UseGCLogFileRotation
> -XX:NumberOfGCLogFiles
> -XX:GCLogFileSize
> -XX:+PrintGCTimeStamps
> -XX:+PrintGCDateStamps
> {code}
> Instead something like:
> {code}
> -Xlog:gc*,safepoint:gc.log:time,uptime:filecount=4,filesize=100M
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-23725) ValidTxnManager snapshot outdating causing partial reads in merge insert

2020-08-25 Thread Zoltan Haindrich (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183835#comment-17183835
 ] 

Zoltan Haindrich commented on HIVE-23725:
-

* This patch added an arbitrary MAX_EXECUTION of 10?
* It's enabled by default; it shouldn't be... it should be pluggable, so 
that you don't mess up other plugins the way this patch has done.
* It uses CommandProcessorException instead of tapping into the hooks? I see no 
benefit in that.
* And it changed ALL existing plugins to check for max executions?

Why didn't you guys ping me?


> ValidTxnManager snapshot outdating causing partial reads in merge insert
> 
>
> Key: HIVE-23725
> URL: https://issues.apache.org/jira/browse/HIVE-23725
> Project: Hive
>  Issue Type: Bug
>Reporter: Peter Varga
>Assignee: Peter Varga
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> When the ValidTxnManager invalidates the snapshot during a merge insert and 
> starts to read committed transactions that were not committed when the query 
> compilation happened, it can cause partial read problems if the committed 
> transaction created a new partition in the source or target table.
> The solution should be to not only fix the snapshot but also recompile the 
> query and acquire the locks again.
> You could construct an example like this:
> 1. Open and compile transaction 1 that merge inserts data from a partitioned 
> source table that has a few partitions.
> 2. Open, run and commit transaction 2 that inserts data to an old and a new 
> partition of the source table.
> 3. Open, run and commit transaction 3 that inserts data to the target table 
> of the merge statement, which will retrigger a snapshot generation in 
> transaction 1.
> 4. Run transaction 1: the snapshot will be regenerated, and it will read 
> partial data from transaction 2, breaking the ACID properties.
> A different setup, switching the transaction order:
> 1. Compile transaction 1 that inserts data to an old and a new partition of 
> the source table.
> 2. Compile transaction 2 that inserts data to the target table.
> 3. Compile transaction 3 that merge inserts data from the source table to the 
> target table.
> 4. Run and commit transaction 1.
> 5. Run and commit transaction 2.
> 6. Run transaction 3: since it contains 1 and 2 in its snapshot, 
> isValidTxnListState will be triggered and we do a partial read of 
> transaction 1 for the same reasons.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23938) LLAP: JDK11 - some GC log file rotation related jvm arguments cannot be used anymore

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23938?focusedWorklogId=474182=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474182
 ]

ASF GitHub Bot logged work on HIVE-23938:
-

Author: ASF GitHub Bot
Created on: 25/Aug/20 07:54
Start Date: 25/Aug/20 07:54
Worklog Time Spent: 10m 
  Work Description: abstractdog opened a new pull request #1430:
URL: https://github.com/apache/hive/pull/1430


   ### What changes were proposed in this pull request?
   Changed JVM opts to JDK11 compatible in runLlapDaemon.sh
   
   
   ### Why are the changes needed?
   Old options are not compatible with JDK11, and JVM fails to start.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Only an expert-user-facing change, as GC logs will have a slightly different 
format after this change.
   
   
   ### How was this patch tested?
   Tested on cluster.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474182)
Remaining Estimate: 0h
Time Spent: 10m

> LLAP: JDK11 - some GC log file rotation related jvm arguments cannot be used 
> anymore
> 
>
> Key: HIVE-23938
> URL: https://issues.apache.org/jira/browse/HIVE-23938
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
> Attachments: gc_2020-07-27-13.log, gc_2020-07-29-12.jdk8.log
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> https://github.com/apache/hive/blob/master/llap-server/bin/runLlapDaemon.sh#L55
> {code}
> JAVA_OPTS_BASE="-server -Djava.net.preferIPv4Stack=true -XX:+UseNUMA 
> -XX:+PrintGCDetails -verbose:gc -XX:+UseGCLogFileRotation 
> -XX:NumberOfGCLogFiles=4 -XX:GCLogFileSize=100M -XX:+PrintGCDateStamps"
> {code}
> on JDK11 I got something like:
> {code}
> + exec /usr/lib/jvm/jre-11-openjdk/bin/java -Dproc_llapdaemon -Xms32000m 
> -Xmx64000m -Dhttp.maxConnections=17 -XX:+UseG1GC -XX:+ResizeTLAB -XX:+UseNUMA 
> -XX:+AggressiveOpts -XX:MetaspaceSize=1024m 
> -XX:InitiatingHeapOccupancyPercent=80 -XX:MaxGCPauseMillis=200 
> -XX:+PreserveFramePointer -XX:AllocatePrefetchStyle=2 
> -Dhttp.maxConnections=10 -Dasync.profiler.home=/grid/0/async-profiler -server 
> -Djava.net.preferIPv4Stack=true -XX:+UseNUMA -XX:+PrintGCDetails -verbose:gc 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=4 -XX:GCLogFileSize=100M 
> -XX:+PrintGCDateStamps 
> -Xloggc:/grid/2/yarn/container-logs/application_1595375468459_0113/container_e26_1595375468459_0113_01_09/gc_2020-07-27-12.log
>  
> ... 
> org.apache.hadoop.hive.llap.daemon.impl.LlapDaemon
> OpenJDK 64-Bit Server VM warning: Option AggressiveOpts was deprecated in 
> version 11.0 and will likely be removed in a future release.
> Unrecognized VM option 'UseGCLogFileRotation'
> Error: Could not create the Java Virtual Machine.
> Error: A fatal exception has occurred. Program will exit.
> {code}
> These are not valid in JDK11:
> {code}
> -XX:+UseGCLogFileRotation
> -XX:NumberOfGCLogFiles
> -XX:GCLogFileSize
> -XX:+PrintGCTimeStamps
> -XX:+PrintGCDateStamps
> {code}
> Instead something like:
> {code}
> -Xlog:gc*,safepoint:gc.log:time,uptime:filecount=4,filesize=100M
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24065) Bloom filters can be cached after deserialization in VectorInBloomFilterColDynamicValue

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24065?focusedWorklogId=474174=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474174
 ]

ASF GitHub Bot logged work on HIVE-24065:
-

Author: ASF GitHub Bot
Created on: 25/Aug/20 07:44
Start Date: 25/Aug/20 07:44
Worklog Time Spent: 10m 
  Work Description: abstractdog commented on pull request #1423:
URL: https://github.com/apache/hive/pull/1423#issuecomment-679861897


   @rbalamohan: could you please take a look? It is a simple patch, tested on a cluster.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474174)
Time Spent: 20m  (was: 10m)

> Bloom filters can be cached after deserialization in 
> VectorInBloomFilterColDynamicValue
> ---
>
> Key: HIVE-24065
> URL: https://issues.apache.org/jira/browse/HIVE-24065
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2020-08-05-10-05-25-080.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Same bloom filter is loaded multiple times across tasks. It would be good to 
> check if we can optimise this, to avoid deserializing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-23408) Hive on Tez : Kafka storage handler broken in secure environment

2020-08-25 Thread Jira


[ 
https://issues.apache.org/jira/browse/HIVE-23408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183759#comment-17183759
 ] 

László Bodor commented on HIVE-23408:
-

[~Rajkumar Singh]: could you please take a look at the patch on the PR? I would 
really appreciate a review so we can merge and include this patch in the next 
downstream release.

> Hive on Tez :  Kafka storage handler broken in secure environment
> -
>
> Key: HIVE-23408
> URL: https://issues.apache.org/jira/browse/HIVE-23408
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 4.0.0
>Reporter: Rajkumar Singh
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> hive.server2.authentication.kerberos.principal is set in the form of 
> hive/_HOST@REALM.
> A Tez task can start on a random NM host and resolves _HOST to the FQDN of 
> the host where it is running; this leads to an authentication issue.
> For LLAP there is a fallback to the LLAP daemon keytab/principal. Kafka 1.1 
> onwards supports delegation tokens, and we should take advantage of them for 
> Hive on Tez.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24064) Disable Materialized View Replication

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24064?focusedWorklogId=474152=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474152
 ]

ASF GitHub Bot logged work on HIVE-24064:
-

Author: ASF GitHub Bot
Created on: 25/Aug/20 06:11
Start Date: 25/Aug/20 06:11
Worklog Time Spent: 10m 
  Work Description: aasha commented on a change in pull request #1422:
URL: https://github.com/apache/hive/pull/1422#discussion_r476198220



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
##
@@ -542,6 +549,24 @@ private Long incrementalDump(Path dumpRoot, DumpMetaData 
dmd, Path cmRoot, Hive
   if (ev.getEventId() <= resumeFrom) {
 continue;
   }
+
+  //disable materialized-view replication if configured
+  String tblName = null;
+  tblName = ev.getTableName();
+  if(tblName != null) {
+try {
+  Table table = null;
+  HiveWrapper.Tuple TableTuple = new HiveWrapper(hiveDb, 
dbName).table(tblName, conf);

Review comment:
   Rename the variable; it should start with a lowercase letter.

##
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##
@@ -540,6 +540,8 @@ private static void populateLlapDaemonVarsSet(Set 
llapDaemonVarsSetLocal
 REPL_DUMP_METADATA_ONLY("hive.repl.dump.metadata.only", false,
 "Indicates whether replication dump only metadata information or data 
+ metadata. \n"
   + "This config makes hive.repl.include.external.tables config 
ineffective."),
+REPL_MATERIALIZED_VIEWS_ENABLED("hive.repl.materialized.views.enabled", 
false,

Review comment:
   Rename to REPL_INCLUDE_MATERIALIZED_VIEWS to match the other config naming.

##
File path: 
itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenarios.java
##
@@ -136,7 +136,7 @@
   protected static final Logger LOG = 
LoggerFactory.getLogger(TestReplicationScenarios.class);
   private ArrayList lastResults;
 
-  private final boolean VERIFY_SETUP_STEPS = false;
+  private boolean VERIFY_SETUP_STEPS = false;

Review comment:
   can be removed. the value is overwritten inside the tests so not a 
constant

##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
##
@@ -542,6 +549,24 @@ private Long incrementalDump(Path dumpRoot, DumpMetaData 
dmd, Path cmRoot, Hive
   if (ev.getEventId() <= resumeFrom) {
 continue;
   }
+
+  //disable materialized-view replication if configured
+  String tblName = null;
+  tblName = ev.getTableName();
+  if(tblName != null) {
+try {
+  Table table = null;
+  HiveWrapper.Tuple TableTuple = new HiveWrapper(hiveDb, 
dbName).table(tblName, conf);
+  table = TableTuple != null ? TableTuple.object : null;

Review comment:
   can be initialised in the same statement

##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
##
@@ -542,6 +549,24 @@ private Long incrementalDump(Path dumpRoot, DumpMetaData 
dmd, Path cmRoot, Hive
   if (ev.getEventId() <= resumeFrom) {
 continue;
   }
+
+  //disable materialized-view replication if configured
+  String tblName = null;
+  tblName = ev.getTableName();

Review comment:
   can be initialised in the same statement

##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
##
@@ -542,6 +549,24 @@ private Long incrementalDump(Path dumpRoot, DumpMetaData 
dmd, Path cmRoot, Hive
   if (ev.getEventId() <= resumeFrom) {
 continue;
   }
+
+  //disable materialized-view replication if configured
+  String tblName = null;
+  tblName = ev.getTableName();
+  if(tblName != null) {
+try {
+  Table table = null;
+  HiveWrapper.Tuple TableTuple = new HiveWrapper(hiveDb, 
dbName).table(tblName, conf);
+  table = TableTuple != null ? TableTuple.object : null;
+  if (TableTuple != null && 
TableType.MATERIALIZED_VIEW.equals(TableTuple.object.getTableType()) && 
!isMaterializedViewsReplEnabled()) {

Review comment:
   The isMaterializedViewsReplEnabled check can be done at the top; no further 
processing is needed if it is enabled.
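
   A sketch of that reordering, reusing the names that appear in the diff above (a 
helper meant to live in the same ReplDumpTask context; the helper name and the broad 
catch are illustrative, not the final patch):

{code:java}
// Short-circuit on the config before doing any metastore lookup.
private boolean isMaterializedViewEventToSkip(String tblName, Hive hiveDb, String dbName,
    HiveConf conf) {
  if (isMaterializedViewsReplEnabled() || tblName == null) {
    return false; // MV replication is enabled, or the event is not table-scoped
  }
  try {
    HiveWrapper.Tuple<Table> tableTuple = new HiveWrapper(hiveDb, dbName).table(tblName, conf);
    Table table = tableTuple != null ? tableTuple.object : null;
    return table != null && TableType.MATERIALIZED_VIEW.equals(table.getTableType());
  } catch (Exception e) {
    // The table may have been dropped since the event was written; do not skip in that case.
    return false;
  }
}
{code}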

##
File path: 
itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenarios.java
##
@@ -2518,11 +2518,16 @@ public void testRenamePartitionedTableAcrossDatabases() 
throws IOException {
 
   @Test
   public void testViewsReplication() throws IOException {
+boolean verify_setup_tmp = VERIFY_SETUP_STEPS;

Review comment:
   You can add a new test just for MVs:
   
   Create an MV in the same DB and in a different DB.
   Test that after bootstrap the replica doesn't have the MVs.
   
   Create an MV in the same DB and in a different DB.
   Test that after incremental replication the replica doesn't have the MVs.





[jira] [Work logged] (HIVE-23954) count(*) with count(distinct) gives wrong results with hive.optimize.countdistinct=true

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23954?focusedWorklogId=474151=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474151
 ]

ASF GitHub Bot logged work on HIVE-23954:
-

Author: ASF GitHub Bot
Created on: 25/Aug/20 06:03
Start Date: 25/Aug/20 06:03
Worklog Time Spent: 10m 
  Work Description: EugeneChung edited a comment on pull request #1414:
URL: https://github.com/apache/hive/pull/1414#issuecomment-679795129


   It seems the init-metastore error is not related to my patch.
   
   ```
   [2020-08-20T11:29:42.268Z] Status: Downloaded newer image for postgres:latest
   
   [2020-08-20T11:31:48.820Z] 
3a1dc3a0b3a75eaf727731e23f9967bbc6007831481878432cad7c8354e0c922
   
   [2020-08-20T11:31:48.821Z] waiting for postgres to be available...
   
   [2020-08-20T11:31:48.821Z] psql: FATAL:  the database system is starting up
   
   [2020-08-20T11:31:48.821Z] ok
   
   [2020-08-20T11:31:48.821Z] NOTICE:  database 
"ms_hive_precommit_pr_1414_1_wqrcz_rrtk5_5s4x9" does not exist, skipping
   
   [2020-08-20T11:31:48.821Z] DROP DATABASE
   
   [2020-08-20T11:31:48.821Z] ERROR:  role "hive" does not exist
   ```
   
   
http://ci.hive.apache.org/blue/organizations/jenkins/hive-precommit/detail/PR-1414/1/tests
 shows that all tests passed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474151)
Time Spent: 0.5h  (was: 20m)

> count(*) with count(distinct) gives wrong results with 
> hive.optimize.countdistinct=true
> ---
>
> Key: HIVE-23954
> URL: https://issues.apache.org/jira/browse/HIVE-23954
> Project: Hive
>  Issue Type: Bug
>  Components: Logical Optimizer
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Eugene Chung
>Assignee: Eugene Chung
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23954.01.patch, HIVE-23954.01.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> {code:java}
> select count(*), count(distinct mid) from db1.table1 where partitioned_column 
> = '...'{code}
>  
> is not working properly when hive.optimize.countdistinct is true. By default, 
> it's true for all 3.x versions.
> In the two plans below, the aggregations part in the Output of Group By 
> Operator of Map 1 are different.
>  
> - hive.optimize.countdistinct=false
> {code:java}
> ++
> |  Explain   |
> ++
> | Plan optimized by CBO. |
> ||
> | Vertex dependency in root stage|
> | Reducer 2 <- Map 1 (SIMPLE_EDGE)   |
> ||
> | Stage-0|
> |   Fetch Operator   |
> | limit:-1   |
> | Stage-1|
> |   Reducer 2|
> |   File Output Operator [FS_7]  |
> | Group By Operator [GBY_5] (rows=1 width=24) |
> |   
> Output:["_col0","_col1"],aggregations:["count(VALUE._col0)","count(DISTINCT 
> KEY._col0:0._col0)"] |
> | <-Map 1 [SIMPLE_EDGE]  |
> |   SHUFFLE [RS_4]   |
> | Group By Operator [GBY_3] (rows=343640771 width=4160) |
> |   
> Output:["_col0","_col1","_col2"],aggregations:["count()","count(DISTINCT 
> mid)"],keys:mid |
> |   Select Operator [SEL_2] (rows=343640771 width=4160) |
> | Output:["mid"] |
> | TableScan [TS_0] (rows=343640771 width=4160) |
> |   db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
> ||
> ++{code}
>  
> - hive.optimize.countdistinct=true
> {code:java}
> ++
> |  Explain   |
> ++
> | Plan optimized by CBO. |
> ||
> | Vertex dependency in root stage|
> | Reducer 2 <- Map 1 

[jira] [Work logged] (HIVE-24041) Extend semijoin conversion rules

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24041?focusedWorklogId=474150=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474150
 ]

ASF GitHub Bot logged work on HIVE-24041:
-

Author: ASF GitHub Bot
Created on: 25/Aug/20 06:02
Start Date: 25/Aug/20 06:02
Worklog Time Spent: 10m 
  Work Description: kasakrisz commented on a change in pull request #1405:
URL: https://github.com/apache/hive/pull/1405#discussion_r476169201



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveSemiJoinRule.java
##
@@ -33,194 +37,263 @@
 import org.apache.calcite.rex.RexBuilder;
 import org.apache.calcite.rex.RexNode;
 import org.apache.calcite.tools.RelBuilder;
+import org.apache.calcite.tools.RelBuilder.GroupKey;
 import org.apache.calcite.tools.RelBuilderFactory;
 import org.apache.calcite.util.ImmutableBitSet;
+import org.apache.calcite.util.ImmutableIntList;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveCalciteUtil;
 import org.apache.hadoop.hive.ql.optimizer.calcite.HiveRelFactories;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
 import com.google.common.collect.ImmutableList;
-import com.google.common.collect.Lists;
 
 import java.util.ArrayList;
 import java.util.List;
 
 /**
- * Planner rule that creates a {@code SemiJoinRule} from a
- * {@link org.apache.calcite.rel.core.Join} on top of a
- * {@link org.apache.calcite.rel.logical.LogicalAggregate}.
- *
- * TODO Remove this rule and use Calcite's SemiJoinRule. Not possible currently
- * since Calcite doesnt use RelBuilder for this rule and we want to generate 
HiveSemiJoin rel here.
+ * Class that gathers SemiJoin conversion rules.
  */
-public abstract class HiveSemiJoinRule extends RelOptRule {
+public class HiveSemiJoinRule {
 
-  protected static final Logger LOG = 
LoggerFactory.getLogger(HiveSemiJoinRule.class);
+  public static final HiveProjectJoinToSemiJoinRule INSTANCE_PROJECT =
+  new HiveProjectJoinToSemiJoinRule(HiveRelFactories.HIVE_BUILDER);
 
-  public static final HiveProjectToSemiJoinRule INSTANCE_PROJECT =
-  new HiveProjectToSemiJoinRule(HiveRelFactories.HIVE_BUILDER);
+  public static final HiveAggregateJoinToSemiJoinRule INSTANCE_AGGREGATE =
+  new HiveAggregateJoinToSemiJoinRule(HiveRelFactories.HIVE_BUILDER);
 
-  public static final HiveProjectToSemiJoinRuleSwapInputs 
INSTANCE_PROJECT_SWAPPED =
-  new HiveProjectToSemiJoinRuleSwapInputs(HiveRelFactories.HIVE_BUILDER);
+  public static final HiveProjectJoinToSemiJoinRuleSwapInputs 
INSTANCE_PROJECT_SWAPPED =
+  new 
HiveProjectJoinToSemiJoinRuleSwapInputs(HiveRelFactories.HIVE_BUILDER);
 
-  public static final HiveAggregateToSemiJoinRule INSTANCE_AGGREGATE =
-  new HiveAggregateToSemiJoinRule(HiveRelFactories.HIVE_BUILDER);
+  public static final HiveAggregateJoinToSemiJoinRuleSwapInputs 
INSTANCE_AGGREGATE_SWAPPED =
+  new 
HiveAggregateJoinToSemiJoinRuleSwapInputs(HiveRelFactories.HIVE_BUILDER);

Review comment:
   nit.: Is the parameter value always `HiveRelFactories.HIVE_BUILDER` ? It 
can be moved to the base class as a constant.

##
File path: ql/src/test/queries/clientpositive/auto_sortmerge_join_10.q
##
@@ -48,6 +48,8 @@ select count(*) from
   (select a.key as key, a.value as value from tbl2_n4 a where key < 6) subq2
   on subq1.key = subq2.key;
 
+set hive.auto.convert.sortmerge.join=false;

Review comment:
   Based on the test filename, this test intends to exercise automatic conversion 
to a sort-merge join, but that is no longer tested once the feature is turned off.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474150)
Time Spent: 20m  (was: 10m)

> Extend semijoin conversion rules
> 
>
> Key: HIVE-24041
> URL: https://issues.apache.org/jira/browse/HIVE-24041
> Project: Hive
>  Issue Type: Improvement
>  Components: CBO
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This patch fixes a couple of limitations that can be seen in 
> {{cbo_query95.q}}, in particular:
> - It adds a rule to trigger semijoin conversion when the there is an 
> aggregate on top of the join that prunes all columns from left side, and the 
> aggregate operator is on the left input of the join.
> - It extends existing semijoin conversion rules to prune the unused 

[jira] [Work logged] (HIVE-23954) count(*) with count(distinct) gives wrong results with hive.optimize.countdistinct=true

2020-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23954?focusedWorklogId=474149=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474149
 ]

ASF GitHub Bot logged work on HIVE-23954:
-

Author: ASF GitHub Bot
Created on: 25/Aug/20 06:01
Start Date: 25/Aug/20 06:01
Worklog Time Spent: 10m 
  Work Description: EugeneChung commented on pull request #1414:
URL: https://github.com/apache/hive/pull/1414#issuecomment-679795129


   It seems the error is not related to my patch.
   
   ```
   [2020-08-20T11:29:42.268Z] Status: Downloaded newer image for postgres:latest
   
   [2020-08-20T11:31:48.820Z] 
3a1dc3a0b3a75eaf727731e23f9967bbc6007831481878432cad7c8354e0c922
   
   [2020-08-20T11:31:48.821Z] waiting for postgres to be available...
   
   [2020-08-20T11:31:48.821Z] psql: FATAL:  the database system is starting up
   
   [2020-08-20T11:31:48.821Z] ok
   
   [2020-08-20T11:31:48.821Z] NOTICE:  database 
"ms_hive_precommit_pr_1414_1_wqrcz_rrtk5_5s4x9" does not exist, skipping
   
   [2020-08-20T11:31:48.821Z] DROP DATABASE
   
   [2020-08-20T11:31:48.821Z] ERROR:  role "hive" does not exist
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 474149)
Time Spent: 20m  (was: 10m)

> count(*) with count(distinct) gives wrong results with 
> hive.optimize.countdistinct=true
> ---
>
> Key: HIVE-23954
> URL: https://issues.apache.org/jira/browse/HIVE-23954
> Project: Hive
>  Issue Type: Bug
>  Components: Logical Optimizer
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Eugene Chung
>Assignee: Eugene Chung
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23954.01.patch, HIVE-23954.01.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> {code:java}
> select count(*), count(distinct mid) from db1.table1 where partitioned_column 
> = '...'{code}
>  
> is not working properly when hive.optimize.countdistinct is true. By default, 
> it's true for all 3.x versions.
> In the two plans below, the aggregations part in the Output of Group By 
> Operator of Map 1 are different.
>  
> - hive.optimize.countdistinct=false
> {code:java}
> ++
> |  Explain   |
> ++
> | Plan optimized by CBO. |
> ||
> | Vertex dependency in root stage|
> | Reducer 2 <- Map 1 (SIMPLE_EDGE)   |
> ||
> | Stage-0|
> |   Fetch Operator   |
> | limit:-1   |
> | Stage-1|
> |   Reducer 2|
> |   File Output Operator [FS_7]  |
> | Group By Operator [GBY_5] (rows=1 width=24) |
> |   
> Output:["_col0","_col1"],aggregations:["count(VALUE._col0)","count(DISTINCT 
> KEY._col0:0._col0)"] |
> | <-Map 1 [SIMPLE_EDGE]  |
> |   SHUFFLE [RS_4]   |
> | Group By Operator [GBY_3] (rows=343640771 width=4160) |
> |   
> Output:["_col0","_col1","_col2"],aggregations:["count()","count(DISTINCT 
> mid)"],keys:mid |
> |   Select Operator [SEL_2] (rows=343640771 width=4160) |
> | Output:["mid"] |
> | TableScan [TS_0] (rows=343640771 width=4160) |
> |   db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
> ||
> ++{code}
>  
> - hive.optimize.countdistinct=true
> {code:java}
> ++
> |  Explain   |
> ++
> | Plan optimized by CBO. |
> ||
> | Vertex dependency in root stage|
> | Reducer 2 <- Map 1 (SIMPLE_EDGE)   |
> ||
> | Stage-0|
> |   Fetch