[jira] [Resolved] (HIVE-26394) Query based compaction fails for table with more than 6 columns

2022-07-28 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera resolved HIVE-26394.

Resolution: Fixed

> Query based compaction fails for table with more than 6 columns
> ---
>
> Key: HIVE-26394
> URL: https://issues.apache.org/jira/browse/HIVE-26394
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, HiveServer2
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Query-based compaction creates a temporary external table whose location 
> points to the location of the table being compacted, so this external table 
> contains files in ACID format. When a query runs on this table, the table 
> type is determined by reading the files present at the table location. 
> Because the location contains ACID-format files, the table is wrongly 
> treated as an ACID table. This causes a failure while generating the SARG 
> columns, since the column count does not match the schema.
>  
> {code:java}
> Error doing query based minor compaction
> org.apache.hadoop.hive.ql.metadata.HiveException: Failed to run INSERT into 
> table delta_cara_pn_tmp_compactor_clean_1656061070392_result select 
> `operation`, `originalTransaction`, `bucket`, `rowId`, `currentTransaction`, 
> `row` from delta_clean_1656061070392 where `originalTransaction` not in 
> (749,750,766,768,779,783,796,799,818,1145,1149,1150,1158,1159,1160,1165,1166,1169,1173,1175,1176,1871,9631)
>   at 
> org.apache.hadoop.hive.ql.DriverUtils.runOnDriver(DriverUtils.java:73)
>   at 
> org.apache.hadoop.hive.ql.txn.compactor.QueryCompactor.runCompactionQueries(QueryCompactor.java:138)
>   at 
> org.apache.hadoop.hive.ql.txn.compactor.MinorQueryCompactor.runCompaction(MinorQueryCompactor.java:70)
>   at 
> org.apache.hadoop.hive.ql.txn.compactor.Worker.findNextCompactionAndExecute(Worker.java:498)
>   at 
> org.apache.hadoop.hive.ql.txn.compactor.Worker.lambda$run$0(Worker.java:120)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750)
> Caused by: (responseCode = 2, errorMessage = FAILED: Execution Error, return 
> code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, 
> vertexName=Map 1, vertexId=vertex_1656061159324__1_00, diagnostics=[Task 
> failed, taskId=task_1656061159324__1_00_00, diagnostics=[TaskAttempt 
> 0 failed, info=[Error: Error while running task ( failure ) : 
> attempt_1656061159324__1_00_00_0:java.lang.RuntimeException: 
> java.lang.RuntimeException: java.io.IOException: 
> java.lang.ArrayIndexOutOfBoundsException: 6
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:348)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:277)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:75)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:62)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:62)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:38)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at 
> org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:118)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750)
> Caused by: java.lang.RuntimeException: java.io.IOException: 
> java.lang.ArrayIndexOutOfBoundsException: 6
>   at 
> org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206)
>   at 
> 

[jira] [Assigned] (HIVE-26394) Query based compaction fails for table with more than 6 columns

2022-07-14 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-26394:
--


> Query based compaction fails for table with more than 6 columns
> ---
>
> Key: HIVE-26394
> URL: https://issues.apache.org/jira/browse/HIVE-26394
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, HiveServer2
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> Query-based compaction creates a temporary external table whose location 
> points to the location of the table being compacted, so this external table 
> contains files in ACID format. When a query runs on this table, the table 
> type is determined by reading the files present at the table location. 
> Because the location contains ACID-format files, the table is wrongly 
> treated as an ACID table. This causes a failure while generating the SARG 
> columns, since the column count does not match the schema.
>  
> {code:java}
> Error doing query based minor compaction
> org.apache.hadoop.hive.ql.metadata.HiveException: Failed to run INSERT into 
> table delta_cara_pn_tmp_compactor_clean_1656061070392_result select 
> `operation`, `originalTransaction`, `bucket`, `rowId`, `currentTransaction`, 
> `row` from delta_clean_1656061070392 where `originalTransaction` not in 
> (749,750,766,768,779,783,796,799,818,1145,1149,1150,1158,1159,1160,1165,1166,1169,1173,1175,1176,1871,9631)
>   at 
> org.apache.hadoop.hive.ql.DriverUtils.runOnDriver(DriverUtils.java:73)
>   at 
> org.apache.hadoop.hive.ql.txn.compactor.QueryCompactor.runCompactionQueries(QueryCompactor.java:138)
>   at 
> org.apache.hadoop.hive.ql.txn.compactor.MinorQueryCompactor.runCompaction(MinorQueryCompactor.java:70)
>   at 
> org.apache.hadoop.hive.ql.txn.compactor.Worker.findNextCompactionAndExecute(Worker.java:498)
>   at 
> org.apache.hadoop.hive.ql.txn.compactor.Worker.lambda$run$0(Worker.java:120)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750)
> Caused by: (responseCode = 2, errorMessage = FAILED: Execution Error, return 
> code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, 
> vertexName=Map 1, vertexId=vertex_1656061159324__1_00, diagnostics=[Task 
> failed, taskId=task_1656061159324__1_00_00, diagnostics=[TaskAttempt 
> 0 failed, info=[Error: Error while running task ( failure ) : 
> attempt_1656061159324__1_00_00_0:java.lang.RuntimeException: 
> java.lang.RuntimeException: java.io.IOException: 
> java.lang.ArrayIndexOutOfBoundsException: 6
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:348)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:277)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:75)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:62)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:62)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:38)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at 
> org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:118)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750)
> Caused by: java.lang.RuntimeException: java.io.IOException: 
> java.lang.ArrayIndexOutOfBoundsException: 6
>   at 
> org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206)
>   at 
> org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.<init>(TezGroupedSplitsInputFormat.java:145)
>   at 
> 

[jira] [Resolved] (HIVE-26382) Stats generation fails during CTAS for external partitioned table.

2022-07-11 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera resolved HIVE-26382.

Resolution: Fixed

> Stats generation fails during CTAS for external partitioned table.
> --
>
> Key: HIVE-26382
> URL: https://issues.apache.org/jira/browse/HIVE-26382
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, HiveServer2
>Affects Versions: 4.0.0-alpha-1
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> As part of HIVE-25990, a manifest file is generated to list the files to be 
> moved; the move task then moves the files by referring to this manifest. In 
> the partitioned-table flow, however, the move is not done. This prevents 
> dynamic partition creation because the target path is empty, and since the 
> stats task needs the partition information, the stats task fails.
>  
> {code:java}
> class="metastore.RetryingHMSHandler" level="ERROR" 
> thread="pool-10-thread-144"] MetaException(message:Unable to update Column 
> stats for  ext_par due to: The IN list is empty!)
>  
> org.apache.hadoop.hive.metastore.DirectSqlUpdateStat.updatePartitionColumnStatistics(DirectSqlUpdateStat.java:634)
>  
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.updatePartitionColumnStatisticsBatch(MetaStoreDirectSql.java:2803)
>  
> org.apache.hadoop.hive.metastore.ObjectStore.updatePartitionColumnStatisticsInBatch(ObjectStore.java:10001)
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43
>  java.lang.reflect.Method.invoke(Method.java:498)
>  org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:97)
> com.sun.proxy.$Proxy33.updatePartitionColumnStatisticsInBatch(Unknown Source)
>  
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.updatePartitionColStatsForOneBatch(HiveMetaStore.java:7124)
>  
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.updatePartitionColStatsInBatch(HiveMetaStore.java:7109)
>  {code}
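The "The IN list is empty!" failure above is the classic empty-IN-clause pitfall: when no partitions were created, the direct-SQL batch update builds a predicate over zero partition ids. A minimal sketch of the pattern (illustrative Python, not the metastore code; the function name is hypothetical):

```python
# Illustrative sketch of the empty-IN-clause pitfall; not metastore code.
def build_in_clause(column: str, values: list) -> str:
    """Build "col IN (...)". An empty list would yield invalid SQL
    ("col IN ()"), so fail fast with a clear error instead."""
    if not values:
        raise ValueError("The IN list is empty!")
    placeholders = ", ".join("?" for _ in values)
    return f"{column} IN ({placeholders})"

print(build_in_clause("PART_ID", [101, 102, 103]))  # PART_ID IN (?, ?, ?)
# With no partitions created (the CTAS bug above), values is empty and
# build_in_clause("PART_ID", []) raises ValueError("The IN list is empty!").
```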



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HIVE-26382) Stats generation fails during CTAS for external partitioned table.

2022-07-11 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-26382:
--


> Stats generation fails during CTAS for external partitioned table.
> --
>
> Key: HIVE-26382
> URL: https://issues.apache.org/jira/browse/HIVE-26382
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, HiveServer2
>Affects Versions: 4.0.0-alpha-1
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> As part of HIVE-25990, a manifest file is generated to list the files to be 
> moved; the move task then moves the files by referring to this manifest. In 
> the partitioned-table flow, however, the move is not done. This prevents 
> dynamic partition creation because the target path is empty, and since the 
> stats task needs the partition information, the stats task fails.
>  
> {code:java}
> class="metastore.RetryingHMSHandler" level="ERROR" 
> thread="pool-10-thread-144"] MetaException(message:Unable to update Column 
> stats for  ext_par due to: The IN list is empty!)
>  
> org.apache.hadoop.hive.metastore.DirectSqlUpdateStat.updatePartitionColumnStatistics(DirectSqlUpdateStat.java:634)
>  
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.updatePartitionColumnStatisticsBatch(MetaStoreDirectSql.java:2803)
>  
> org.apache.hadoop.hive.metastore.ObjectStore.updatePartitionColumnStatisticsInBatch(ObjectStore.java:10001)
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43
>  java.lang.reflect.Method.invoke(Method.java:498)
>  org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:97)
> com.sun.proxy.$Proxy33.updatePartitionColumnStatisticsInBatch(Unknown Source)
>  
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.updatePartitionColStatsForOneBatch(HiveMetaStore.java:7124)
>  
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.updatePartitionColStatsInBatch(HiveMetaStore.java:7109)
>  {code}





[jira] [Assigned] (HIVE-26222) Native GeoSpatial Support in Hive

2022-05-11 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-26222:
--


> Native GeoSpatial Support in Hive
> -
>
> Key: HIVE-26222
> URL: https://issues.apache.org/jira/browse/HIVE-26222
> Project: Hive
>  Issue Type: Task
>  Components: Hive, HiveServer2
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> This is an epic Jira to support GeoSpatial datatypes natively in Hive. This 
> will cater to applications that query large volumes of spatial data. The 
> support will be added in a phased manner. To start with, we are planning to 
> make use of the framework developed by ESRI 
> ([https://github.com/Esri/spatial-framework-for-hadoop]). That project is 
> not very active and has no release published to Maven Central, so it is not 
> easy to pull in the jars directly as a pom dependency. Also, the UDFs are 
> based on an older version of Hive. We have therefore decided to make a copy 
> of the repo and maintain it inside Hive, which will make it easier to do 
> any improvements and manage dependencies. As of now, data loading is done 
> only on a binary data type; we need to enhance this to make it more user 
> friendly. In the next phase, a native Geometry/Geography datatype will be 
> supported, so users can directly create a geometry type and operate on it. 
> Beyond that, we can start adding support for different indices (quad tree, 
> R-tree), ORC/Parquet/Iceberg support, etc.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HIVE-25540) Enable batch update of column stats only for MySql and Postgres

2022-04-05 Thread mahesh kumar behera (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-25540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17517302#comment-17517302
 ] 

mahesh kumar behera commented on HIVE-25540:


[~zabetak] 

The batch update has been tested at scale only with the MySQL and Postgres backends. 

> Enable batch update of column stats only for MySql and Postgres 
> 
>
> Key: HIVE-25540
> URL: https://issues.apache.org/jira/browse/HIVE-25540
> Project: Hive
>  Issue Type: Sub-task
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0-alpha-1
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The batch update of partition column stats using direct SQL is tested only 
> with MySQL and Postgres.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (HIVE-26105) Show columns shows extra values if column comments contains specific Chinese character

2022-04-05 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera resolved HIVE-26105.

Resolution: Fixed

> Show columns shows extra values if column comments contains specific Chinese 
> character 
> ---
>
> Key: HIVE-26105
> URL: https://issues.apache.org/jira/browse/HIVE-26105
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, HiveServer2
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The issue occurs because the character code for one of the Chinese 
> characters contains the byte value of '\r' (CR). Because of this, the 
> Hadoop line reader (used by the fetch task in Hive) treats everything after 
> that character as a new value, and an extra junk value is displayed. The 
> problematic character is 名 (0x540D): its low byte is 0x0D, i.e. 13, which 
> the line reader interprets as CR ('\r'), so an extra value with junk 
> appears in the output. SHOW COLUMNS does not need the comments, so while 
> writing to the file, only the column names should be included.
> [https://github.com/apache/hadoop/blob/0fbd96a2449ec49f840d93e1c7d290c5218ef4ea/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L238]
>  
> {code:java}
> create table tbl_test  (fld0 string COMMENT  '期 ' , fld string COMMENT 
> '期末日期', fld1 string COMMENT '班次名称', fld2  string COMMENT '排班人数');
> show columns from tbl_test;
> ++
> | field  |
> ++
> | fld    |
> | fld0   |
> | fld1   |
> | �      |
> | fld2   |
> ++
> 5 rows selected (171.809 seconds)
>  {code}
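The collision described above is easy to verify: the code point of 名 is U+540D, and its low byte 0x0D is exactly the carriage-return byte a byte-oriented line reader splits on. A quick check in Python:

```python
# Verify that 名 (U+540D) ends in the carriage-return byte 0x0D.
ch = "名"
assert ord(ch) == 0x540D
low_byte = ord(ch) & 0xFF
print(hex(low_byte), low_byte == ord("\r"))  # 0xd True
# Any reader that scans raw bytes of the two-byte code unit for 0x0D
# will mis-split here, producing the junk extra row shown above.
```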





[jira] [Resolved] (HIVE-26098) Duplicate path/Jar in hive.aux.jars.path or hive.reloadable.aux.jars.path causing IllegalArgumentException

2022-04-01 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera resolved HIVE-26098.

Resolution: Fixed

> Duplicate path/Jar in hive.aux.jars.path or hive.reloadable.aux.jars.path 
> causing IllegalArgumentException
> --
>
> Key: HIVE-26098
> URL: https://issues.apache.org/jira/browse/HIVE-26098
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, HiveServer2
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
>  hive.aux.jars.path and hive.reloadable.aux.jars.path are used to provide 
> auxiliary jars needed during query processing. These jars are copied to the 
> Tez temp path so that Tez jobs have access to them while processing the 
> job. There is a duplicate check to avoid copying the same jar multiple 
> times, but it assumes the jar is on the local file system. In reality the 
> jar path can point to any file system, so the duplicate check fails when 
> the source path is not a local path.
> {code:java}
> ERROR : Failed to execute tez graph.
> java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://localhost:53877/tmp/test_jar/identity_udf.jar, expected: file:///
>     at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:781) 
> ~[hadoop-common-3.1.0.jar:?]
>     at 
> org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:86)
>  ~[hadoop-common-3.1.0.jar:?]
>     at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:636)
>  ~[hadoop-common-3.1.0.jar:?]
>     at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:930)
>  ~[hadoop-common-3.1.0.jar:?]
>     at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:631)
>  ~[hadoop-common-3.1.0.jar:?]
>     at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:454)
>  ~[hadoop-common-3.1.0.jar:?]
>     at 
> org.apache.hadoop.hive.ql.exec.tez.DagUtils.checkPreExisting(DagUtils.java:1392)
>  ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at 
> org.apache.hadoop.hive.ql.exec.tez.DagUtils.localizeResource(DagUtils.java:1411)
>  ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at 
> org.apache.hadoop.hive.ql.exec.tez.DagUtils.addTempResources(DagUtils.java:1295)
>  ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at 
> org.apache.hadoop.hive.ql.exec.tez.DagUtils.localizeTempFilesFromConf(DagUtils.java:1177)
>  ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at 
> org.apache.hadoop.hive.ql.exec.tez.TezSessionState.ensureLocalResources(TezSessionState.java:636)
>  ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at 
> org.apache.hadoop.hive.ql.exec.tez.TezSessionState.openInternal(TezSessionState.java:283)
>  ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at 
> org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolSession.openInternal(TezSessionPoolSession.java:124)
>  ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at 
> org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:241)
>  ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at 
> org.apache.hadoop.hive.ql.exec.tez.TezTask.ensureSessionHasResources(TezTask.java:448)
>  ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:215) 
> [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:212) 
> [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) 
> [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:361) 
> [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:334) 
> [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:245) 
> [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:106) 
> [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:348) 
> [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:204) 
> [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at org.apache.hadoop.hive.ql.Driver.run(Driver.java:153) 
> [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at org.apache.hadoop.hive.ql.Driver.run(Driver.java:148) 
> [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at 
> 
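The fix direction implied by the description is to resolve each jar path against the file system its URI actually names, instead of always using the local one. A hedged sketch of such a scheme check (plain Python with `urllib`, purely illustrative; not the actual `DagUtils` code):

```python
# Sketch of a scheme-aware path check; not the actual DagUtils code.
from urllib.parse import urlparse

def fs_scheme(path: str, default: str = "file") -> str:
    """Return the filesystem scheme of a jar path ('file' if none)."""
    scheme = urlparse(path).scheme
    return scheme or default

print(fs_scheme("hdfs://localhost:53877/tmp/test_jar/identity_udf.jar"))  # hdfs
print(fs_scheme("/opt/hive/aux/identity_udf.jar"))                        # file

# A duplicate check should dispatch on the scheme (and authority) of each
# path rather than handing an hdfs:// URI to the local RawLocalFileSystem,
# which is what triggers the "Wrong FS ... expected: file:///" error above.
```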

[jira] [Assigned] (HIVE-26105) Show columns shows extra values if column comments contains specific Chinese character

2022-03-31 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-26105:
--


> Show columns shows extra values if column comments contains specific Chinese 
> character 
> ---
>
> Key: HIVE-26105
> URL: https://issues.apache.org/jira/browse/HIVE-26105
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, HiveServer2
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> The issue occurs because the character code for one of the Chinese 
> characters contains the byte value of '\r' (CR). Because of this, the 
> Hadoop line reader (used by the fetch task in Hive) treats everything after 
> that character as a new value, and an extra junk value is displayed. The 
> problematic character is 名 (0x540D): its low byte is 0x0D, i.e. 13, which 
> the line reader interprets as CR ('\r'), so an extra value with junk 
> appears in the output. SHOW COLUMNS does not need the comments, so while 
> writing to the file, only the column names should be included.
> [https://github.com/apache/hadoop/blob/0fbd96a2449ec49f840d93e1c7d290c5218ef4ea/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L238]
>  
> {code:java}
> create table tbl_test  (fld0 string COMMENT  '期 ' , fld string COMMENT 
> '期末日期', fld1 string COMMENT '班次名称', fld2  string COMMENT '排班人数');
> show columns from tbl_test;
> ++
> | field  |
> ++
> | fld    |
> | fld0   |
> | fld1   |
> | �      |
> | fld2   |
> ++
> 5 rows selected (171.809 seconds)
>  {code}





[jira] [Commented] (HIVE-24649) Optimise Hive::addWriteNotificationLog for large data inserts

2022-03-31 Thread mahesh kumar behera (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17515236#comment-17515236
 ] 

mahesh kumar behera commented on HIVE-24649:


[~rajesh.balamohan] 

I think this is taken care of by HIVE-25205. Can you please confirm?

> Optimise Hive::addWriteNotificationLog for large data inserts
> -
>
> Key: HIVE-24649
> URL: https://issues.apache.org/jira/browse/HIVE-24649
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Reporter: Rajesh Balamohan
>Priority: Major
>  Labels: performance
>
> When loading a dynamic partition with a large dataset, a lot of time is 
> spent in "Hive::loadDynamicPartitions --> addWriteNotificationLog".
> Even though the inserts are for the same table, it ends up loading the 
> table and partition details for every partition and writing to the 
> notification log.
> Also, the "Partition" details may already be present in the 
> {{PartitionDetails}} object in {{Hive::loadDynamicPartitions}}; this is 
> unnecessarily recomputed again in 
> {{HiveMetaStore::add_write_notification_log}}.
>  
> Lines of interest:
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L3028
> https://github.com/apache/hive/blob/89073a94354f0cc14ec4ae0a43e05aae29276b4d/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java#L8500
>  
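The redundant per-partition loading described above is the standard load-once/cache pattern. A toy sketch of caching the table lookup across partitions (illustrative Python; the names are hypothetical, not Hive's actual API):

```python
# Toy sketch of caching a table lookup across partitions; the function
# names are hypothetical, not Hive's actual API.
from functools import lru_cache

CALLS = {"load_table": 0}

@lru_cache(maxsize=None)
def load_table(db: str, table: str) -> tuple:
    CALLS["load_table"] += 1          # stand-in for a metastore RPC
    return (db, table)

def add_write_notifications(db: str, table: str, partitions: list) -> None:
    for part in partitions:
        tbl = load_table(db, table)   # cached: one lookup for all partitions
        # ... append a notification-log entry for (tbl, part) ...

add_write_notifications("default", "t1", [f"p={i}" for i in range(1000)])
print(CALLS["load_table"])  # 1, not 1000
```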





[jira] [Assigned] (HIVE-26098) Duplicate path/Jar in hive.aux.jars.path or hive.reloadable.aux.jars.path causing IllegalArgumentException

2022-03-31 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-26098:
--


> Duplicate path/Jar in hive.aux.jars.path or hive.reloadable.aux.jars.path 
> causing IllegalArgumentException
> --
>
> Key: HIVE-26098
> URL: https://issues.apache.org/jira/browse/HIVE-26098
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, HiveServer2
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
>  hive.aux.jars.path and hive.reloadable.aux.jars.path are used to provide 
> auxiliary jars needed during query processing. These jars are copied to the 
> Tez temp path so that Tez jobs have access to them while processing the 
> job. There is a duplicate check to avoid copying the same jar multiple 
> times, but it assumes the jar is on the local file system. In reality the 
> jar path can point to any file system, so the duplicate check fails when 
> the source path is not a local path.
> {code:java}
> ERROR : Failed to execute tez graph.
> java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://localhost:53877/tmp/test_jar/identity_udf.jar, expected: file:///
>     at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:781) 
> ~[hadoop-common-3.1.0.jar:?]
>     at 
> org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:86)
>  ~[hadoop-common-3.1.0.jar:?]
>     at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:636)
>  ~[hadoop-common-3.1.0.jar:?]
>     at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:930)
>  ~[hadoop-common-3.1.0.jar:?]
>     at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:631)
>  ~[hadoop-common-3.1.0.jar:?]
>     at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:454)
>  ~[hadoop-common-3.1.0.jar:?]
>     at 
> org.apache.hadoop.hive.ql.exec.tez.DagUtils.checkPreExisting(DagUtils.java:1392)
>  ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at 
> org.apache.hadoop.hive.ql.exec.tez.DagUtils.localizeResource(DagUtils.java:1411)
>  ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at 
> org.apache.hadoop.hive.ql.exec.tez.DagUtils.addTempResources(DagUtils.java:1295)
>  ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at 
> org.apache.hadoop.hive.ql.exec.tez.DagUtils.localizeTempFilesFromConf(DagUtils.java:1177)
>  ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at 
> org.apache.hadoop.hive.ql.exec.tez.TezSessionState.ensureLocalResources(TezSessionState.java:636)
>  ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at 
> org.apache.hadoop.hive.ql.exec.tez.TezSessionState.openInternal(TezSessionState.java:283)
>  ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at 
> org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolSession.openInternal(TezSessionPoolSession.java:124)
>  ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at 
> org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:241)
>  ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at 
> org.apache.hadoop.hive.ql.exec.tez.TezTask.ensureSessionHasResources(TezTask.java:448)
>  ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:215) 
> [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:212) 
> [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) 
> [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:361) 
> [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:334) 
> [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:245) 
> [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:106) 
> [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:348) 
> [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:204) 
> [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at org.apache.hadoop.hive.ql.Driver.run(Driver.java:153) 
> [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at org.apache.hadoop.hive.ql.Driver.run(Driver.java:148) 
> [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:185) 
> [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
>     at 
> 

[jira] [Assigned] (HIVE-26017) Insert with partition value containing colon and space is creating partition having wrong value

2022-03-09 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-26017:
--


> Insert with partition value containing colon and space is creating partition 
> having wrong value
> ---
>
> Key: HIVE-26017
> URL: https://issues.apache.org/jira/browse/HIVE-26017
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> The path used for generating the dynamic partition value is obtained from the 
> URI. This causes the serialised (percent-encoded) value to be used for 
> partition name generation, so wrong partition names are generated. The decoded 
> path value should be used, not the URI.
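The difference is reproducible with plain `java.net.URI` — a minimal sketch, illustrative only (the class and method names are not Hive's): the serialized URI percent-encodes the space in the partition value, while the decoded path keeps it intact.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Illustrative only: shows why a URI's serialized form must not be used as a
// partition name. The warehouse path below is a hypothetical example.
public class PartitionPathDemo {
    // Returns the decoded filesystem path component of a location URI.
    public static String decodedPath(String scheme, String host, String rawPath) {
        try {
            return new URI(scheme, host, rawPath, null).getPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Returns the percent-encoded serialized URI, the source of the bug.
    public static String serialized(String scheme, String host, String rawPath) {
        try {
            return new URI(scheme, host, rawPath, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String p = "/warehouse/tbl/dt=2022-03-09 10:00:00";
        System.out.println(decodedPath("hdfs", "nn", p)); // space preserved
        System.out.println(serialized("hdfs", "nn", p));  // space becomes %20
    }
}
```

Deriving the partition name from `serialized(...)` would embed `%20` in the directory name; `decodedPath(...)` yields the value the user actually inserted.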



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (HIVE-25864) Hive query optimisation creates wrong plan for predicate pushdown with windowing function

2022-01-20 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera resolved HIVE-25864.

Resolution: Fixed

> Hive query optimisation creates wrong plan for predicate pushdown with 
> windowing function 
> --
>
> Key: HIVE-25864
> URL: https://issues.apache.org/jira/browse/HIVE-25864
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> In the case of a query with a windowing function, deterministic predicates are 
> pushed down below the window function. Before being pushed down, the predicate 
> is converted to refer to the project operator's values. But the same conversion 
> is done again while creating the project, resulting in a wrong plan.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (HIVE-25638) Select returns deleted records in Hive ACID table

2022-01-19 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera resolved HIVE-25638.

Resolution: Fixed

> Select returns deleted records in Hive ACID table
> -
>
> Key: HIVE-25638
> URL: https://issues.apache.org/jira/browse/HIVE-25638
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Hive stores the stripe stats in the ORC files. During select, these stats are 
> used to create the SARG. The SARG is used to reduce the records read from the 
> delete-delta files. Currently, when the number of stripes is more than one, 
> the generated SARG is incorrect because it uses the first stripe's index for 
> both the min and max key interval. The max key interval should be obtained 
> from the last stripe's index. This causes some valid delete records to be 
> skipped, and the deleted rows are returned to the user. The last stripe is 
> needed instead of the first because the keys are ordered in the file.
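The fix can be sketched with simplified stand-in types (not Hive's actual ORC reader structures): take the minimum from the first stripe and the maximum from the last stripe, since row keys are ordered across stripes.

```java
import java.util.Arrays;
import java.util.List;

// Simplified sketch of the key-interval fix described above. The Stripe
// class here is a hypothetical stand-in for ORC stripe statistics.
public class KeyIntervalDemo {
    static final class Stripe {
        final long minRowId, maxRowId;
        Stripe(long min, long max) { this.minRowId = min; this.maxRowId = max; }
    }

    // Keys are ordered within the file, so the file-level minimum lives in
    // the first stripe and the file-level maximum in the LAST stripe.
    static long[] keyInterval(List<Stripe> stripes) {
        long min = stripes.get(0).minRowId;
        long max = stripes.get(stripes.size() - 1).maxRowId;
        return new long[] { min, max };
    }

    public static void main(String[] args) {
        List<Stripe> stripes = Arrays.asList(
                new Stripe(0, 999), new Stripe(1000, 1999), new Stripe(2000, 2500));
        long[] iv = keyInterval(stripes);
        // The buggy version took the max from the first stripe (999), so
        // deletes for rows 1000..2500 were filtered out of the SARG.
        System.out.println(iv[0] + ".." + iv[1]); // 0..2500
    }
}
```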



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (HIVE-25877) Load table from concurrent thread causes FileNotFoundException

2022-01-19 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera resolved HIVE-25877.

Resolution: Fixed

> Load table from concurrent thread causes FileNotFoundException
> --
>
> Key: HIVE-25877
> URL: https://issues.apache.org/jira/browse/HIVE-25877
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> As part of the direct insert optimisation (the same issue exists for MM tables 
> even without it), the files from Tez jobs are moved to the table directory for 
> ACID tables, and duplicate removal is then done. Each session scans through the 
> table directory and cleans up only the files belonging to that session, but the 
> iterator is created over all the files. So a FileNotFoundException is thrown 
> when multiple sessions act on the same table and the first session cleans up 
> data that is being read by the second session.
> This is fixed as part of HIVE-24679
> {code:java}
> Caused by: java.io.FileNotFoundException: File 
> hdfs://mbehera-1.mbehera.root.hwx.site:8020/warehouse/tablespace/managed/hive/tbl4/_tmp.delta_981_981_
>  does not exist.
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1275)
>  ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1249)
>  ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1194)
>  ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1190)
>  ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>  ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:1208)
>  ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:2144) 
> ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.fs.FileSystem$5.handleFileStat(FileSystem.java:2332) 
> ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
>         at org.apache.hadoop.fs.FileSystem$5.hasNext(FileSystem.java:2309) 
> ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.hive.ql.exec.Utilities.getDirectInsertDirectoryCandidatesRecursive(Utilities.java:4447)
>  ~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT]
>         at 
> org.apache.hadoop.hive.ql.exec.Utilities.getDirectInsertDirectoryCandidates(Utilities.java:4413)
>  ~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT]
>         at 
> org.apache.hadoop.hive.ql.metadata.Hive.getValidPartitionsInPath(Hive.java:2816)
>  ~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT] {code}
>  
> The below path is fixed by HIVE-24682
> {code:java}
> Caused by: java.io.FileNotFoundException: File 
> hdfs://mbehera-1.mbehera.root.hwx.site:8020/warehouse/tablespace/managed/hive/tbl4/.hive-staging_hive_2022-01-19_05-18-38_933_1683918321120508074-54
>  does not exist.
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1275)
>  ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1249)
>  ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1194)
>  ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1190)
>  ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>  ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:1208)
>  ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:2144) 
> ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.fs.FileSystem$5.handleFileStat(FileSystem.java:2332) 
> ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
>         at org.apache.hadoop.fs.FileSystem$5.hasNext(FileSystem.java:2309) 
> ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
>         at 
> 

[jira] [Updated] (HIVE-25877) Load table from concurrent thread causes FileNotFoundException

2022-01-19 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-25877:
---
Description: 
As part of the direct insert optimisation (the same issue exists for MM tables 
even without it), the files from Tez jobs are moved to the table directory for 
ACID tables, and duplicate removal is then done. Each session scans through the 
table directory and cleans up only the files belonging to that session, but the 
iterator is created over all the files. So a FileNotFoundException is thrown 
when multiple sessions act on the same table and the first session cleans up 
data that is being read by the second session.

This is fixed as part of HIVE-24679
{code:java}
Caused by: java.io.FileNotFoundException: File 
hdfs://mbehera-1.mbehera.root.hwx.site:8020/warehouse/tablespace/managed/hive/tbl4/_tmp.delta_981_981_
 does not exist.
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1275)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1249)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1194)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1190)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
 ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:1208)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:2144) 
~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.fs.FileSystem$5.handleFileStat(FileSystem.java:2332) 
~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
        at org.apache.hadoop.fs.FileSystem$5.hasNext(FileSystem.java:2309) 
~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hive.ql.exec.Utilities.getDirectInsertDirectoryCandidatesRecursive(Utilities.java:4447)
 ~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT]
        at 
org.apache.hadoop.hive.ql.exec.Utilities.getDirectInsertDirectoryCandidates(Utilities.java:4413)
 ~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT]
        at 
org.apache.hadoop.hive.ql.metadata.Hive.getValidPartitionsInPath(Hive.java:2816)
 ~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT] {code}
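A generic sketch of tolerating this race during listing (illustrative only; Hive's actual fix lives in the `Utilities` methods shown in the trace): treat a file that disappears between the directory listing and the per-file stat as skippable instead of failing the whole operation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

public class TolerantListing {
    // Counts entries in a directory, ignoring files that vanish between the
    // listing and the per-file stat (e.g. deleted by a concurrent session).
    static int countEntries(Path dir) {
        int n = 0;
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                try {
                    Files.readAttributes(p, BasicFileAttributes.class);
                    n++;
                } catch (NoSuchFileException deletedConcurrently) {
                    // Skip: the file was removed by another session mid-scan.
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("tolerant");
        Files.createFile(dir.resolve("a"));
        Files.createFile(dir.resolve("b"));
        System.out.println(countEntries(dir)); // 2
    }
}
```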
 

The below path is fixed by HIVE-24682
{code:java}
Caused by: java.io.FileNotFoundException: File 
hdfs://mbehera-1.mbehera.root.hwx.site:8020/warehouse/tablespace/managed/hive/tbl4/.hive-staging_hive_2022-01-19_05-18-38_933_1683918321120508074-54
 does not exist.
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1275)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1249)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1194)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1190)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
 ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:1208)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:2144) 
~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.fs.FileSystem$5.handleFileStat(FileSystem.java:2332) 
~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
        at org.apache.hadoop.fs.FileSystem$5.hasNext(FileSystem.java:2309) 
~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hive.ql.exec.Utilities.getDirectInsertDirectoryCandidatesRecursive(Utilities.java:4447)
 ~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT]
        at 
org.apache.hadoop.hive.ql.exec.Utilities.getDirectInsertDirectoryCandidates(Utilities.java:4413)
 ~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT]
        at 
org.apache.hadoop.hive.ql.exec.Utilities.getFullDPSpecs(Utilities.java:2971) 
~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT] {code}

  was:
As part of the direct insert optimisation (same issue is there for MM table 
also, without direct insert 

[jira] [Updated] (HIVE-25877) Load table from concurrent thread causes FileNotFoundException

2022-01-19 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-25877:
---
Description: 
As part of the direct insert optimisation (the same issue exists for MM tables 
even without it), the files from Tez jobs are moved to the table directory for 
ACID tables, and duplicate removal is then done. Each session scans through the 
table directory and cleans up only the files belonging to that session, but the 
iterator is created over all the files. So a FileNotFoundException is thrown 
when multiple sessions act on the same table and the first session cleans up 
data that is being read by the second session.
{code:java}
Caused by: java.io.FileNotFoundException: File 
hdfs://mbehera-1.mbehera.root.hwx.site:8020/warehouse/tablespace/managed/hive/tbl4/_tmp.delta_981_981_
 does not exist.
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1275)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1249)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1194)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1190)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
 ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:1208)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:2144) 
~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.fs.FileSystem$5.handleFileStat(FileSystem.java:2332) 
~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
        at org.apache.hadoop.fs.FileSystem$5.hasNext(FileSystem.java:2309) 
~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hive.ql.exec.Utilities.getDirectInsertDirectoryCandidatesRecursive(Utilities.java:4447)
 ~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT]
        at 
org.apache.hadoop.hive.ql.exec.Utilities.getDirectInsertDirectoryCandidates(Utilities.java:4413)
 ~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT]
        at 
org.apache.hadoop.hive.ql.metadata.Hive.getValidPartitionsInPath(Hive.java:2816)
 ~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT] {code}
 

The below path is fixed by HIVE-24682
{code:java}
Caused by: java.io.FileNotFoundException: File 
hdfs://mbehera-1.mbehera.root.hwx.site:8020/warehouse/tablespace/managed/hive/tbl4/.hive-staging_hive_2022-01-19_05-18-38_933_1683918321120508074-54
 does not exist.
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1275)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1249)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1194)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1190)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
 ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:1208)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:2144) 
~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.fs.FileSystem$5.handleFileStat(FileSystem.java:2332) 
~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
        at org.apache.hadoop.fs.FileSystem$5.hasNext(FileSystem.java:2309) 
~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hive.ql.exec.Utilities.getDirectInsertDirectoryCandidatesRecursive(Utilities.java:4447)
 ~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT]
        at 
org.apache.hadoop.hive.ql.exec.Utilities.getDirectInsertDirectoryCandidates(Utilities.java:4413)
 ~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT]
        at 
org.apache.hadoop.hive.ql.exec.Utilities.getFullDPSpecs(Utilities.java:2971) 
~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT] {code}

  was:
As part of the direct insert optimisation (same issue is there for MM table 
also, without direct insert optimisation), the files from Tez jobs are 

[jira] [Assigned] (HIVE-25877) Load table from concurrent thread causes FileNotFoundException

2022-01-19 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-25877:
--


> Load table from concurrent thread causes FileNotFoundException
> --
>
> Key: HIVE-25877
> URL: https://issues.apache.org/jira/browse/HIVE-25877
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> As part of the direct insert optimisation (the same issue exists for MM tables 
> even without it), the files from Tez jobs are moved to the table directory for 
> ACID tables, and duplicate removal is then done. Each session scans through the 
> table directory and cleans up only the files belonging to that session, but the 
> iterator is created over all the files. So a FileNotFoundException is thrown 
> when multiple sessions act on the same table and the first session cleans up 
> data that is being read by the second session.
> {code:java}
> Caused by: java.io.FileNotFoundException: File 
> hdfs://mbehera-1.mbehera.root.hwx.site:8020/warehouse/tablespace/managed/hive/tbl4/_tmp.delta_981_981_
>  does not exist.
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1275)
>  ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1249)
>  ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1194)
>  ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1190)
>  ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>  ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:1208)
>  ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:2144) 
> ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.fs.FileSystem$5.handleFileStat(FileSystem.java:2332) 
> ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
>         at org.apache.hadoop.fs.FileSystem$5.hasNext(FileSystem.java:2309) 
> ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.hive.ql.exec.Utilities.getDirectInsertDirectoryCandidatesRecursive(Utilities.java:4447)
>  ~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT]
>         at 
> org.apache.hadoop.hive.ql.exec.Utilities.getDirectInsertDirectoryCandidates(Utilities.java:4413)
>  ~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT]
>         at 
> org.apache.hadoop.hive.ql.metadata.Hive.getValidPartitionsInPath(Hive.java:2816)
>  ~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT] {code}
>  
> {code:java}
> Caused by: java.io.FileNotFoundException: File 
> hdfs://mbehera-1.mbehera.root.hwx.site:8020/warehouse/tablespace/managed/hive/tbl4/.hive-staging_hive_2022-01-19_05-18-38_933_1683918321120508074-54
>  does not exist.
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1275)
>  ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1249)
>  ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1194)
>  ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1190)
>  ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>  ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:1208)
>  ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:2144) 
> ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.fs.FileSystem$5.handleFileStat(FileSystem.java:2332) 
> ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
>         at org.apache.hadoop.fs.FileSystem$5.hasNext(FileSystem.java:2309) 
> ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
>         at 
> org.apache.hadoop.hive.ql.exec.Utilities.getDirectInsertDirectoryCandidatesRecursive(Utilities.java:4447)
>  

[jira] [Resolved] (HIVE-25868) AcidHouseKeeperService fails to purgeCompactionHistory if the entries in COMPLETED_COMPACTIONS tables

2022-01-16 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera resolved HIVE-25868.

Resolution: Duplicate

> AcidHouseKeeperService fails to purgeCompactionHistory if the entries in 
> COMPLETED_COMPACTIONS tables 
> --
>
> Key: HIVE-25868
> URL: https://issues.apache.org/jira/browse/HIVE-25868
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, Metastore, Standalone Metastore
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> To purge the entries, a prepared statement is created. If the number of 
> entries in the prepared statement exceeds the limit of the backend DB (for 
> Postgres it is around 32k parameters), the operation fails.
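A minimal sketch of the usual workaround (the table and column names follow the metastore schema, but the helper itself is hypothetical): chunk the id list so each prepared statement stays under the backend's bind-parameter limit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds one DELETE ... IN (?,?,...) statement per
// batch so no single prepared statement exceeds the backend's parameter cap
// (PostgreSQL limits a statement to roughly 32k bind parameters).
public class BatchedDelete {
    static List<String> deleteStatements(int idCount, int maxParams) {
        List<String> stmts = new ArrayList<>();
        for (int start = 0; start < idCount; start += maxParams) {
            int n = Math.min(maxParams, idCount - start);
            StringBuilder sb = new StringBuilder(
                    "DELETE FROM COMPLETED_COMPACTIONS WHERE CC_ID IN (");
            for (int i = 0; i < n; i++) {
                sb.append(i == 0 ? "?" : ",?");
            }
            stmts.add(sb.append(")").toString());
        }
        return stmts;
    }

    public static void main(String[] args) {
        // 70,000 ids with a 32,000-parameter cap -> 3 statements.
        System.out.println(deleteStatements(70_000, 32_000).size()); // 3
    }
}
```

Each statement would then be prepared and executed with its own slice of the id list, keeping every execution within the backend's limit.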



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HIVE-25868) AcidHouseKeeperService fails to purgeCompactionHistory if the entries in COMPLETED_COMPACTIONS tables

2022-01-16 Thread mahesh kumar behera (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-25868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17476961#comment-17476961
 ] 

mahesh kumar behera commented on HIVE-25868:


[~sankarh] Yes, it's a duplicate. Will close the issue. Thanks for pointing out.

> AcidHouseKeeperService fails to purgeCompactionHistory if the entries in 
> COMPLETED_COMPACTIONS tables 
> --
>
> Key: HIVE-25868
> URL: https://issues.apache.org/jira/browse/HIVE-25868
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, Metastore, Standalone Metastore
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> To purge the entries, a prepared statement is created. If the number of 
> entries in the prepared statement exceeds the limit of the backend DB (for 
> Postgres it is around 32k parameters), the operation fails.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HIVE-25868) AcidHouseKeeperService fails to purgeCompactionHistory if the entries in COMPLETED_COMPACTIONS tables

2022-01-16 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-25868:
--


> AcidHouseKeeperService fails to purgeCompactionHistory if the entries in 
> COMPLETED_COMPACTIONS tables 
> --
>
> Key: HIVE-25868
> URL: https://issues.apache.org/jira/browse/HIVE-25868
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, Metastore, Standalone Metastore
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> To purge the entries, a prepared statement is created. If the number of 
> entries in the prepared statement exceeds the limit of the backend DB (for 
> Postgres it is around 32k parameters), the operation fails.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HIVE-25864) Hive query optimisation creates wrong plan for predicate pushdown with windowing function

2022-01-12 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-25864:
--


> Hive query optimisation creates wrong plan for predicate pushdown with 
> windowing function 
> --
>
> Key: HIVE-25864
> URL: https://issues.apache.org/jira/browse/HIVE-25864
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> In the case of a query with a windowing function, deterministic predicates are 
> pushed down below the window function. Before being pushed down, the predicate 
> is converted to refer to the project operator's values. But the same conversion 
> is done again while creating the project, resulting in a wrong plan.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HIVE-25808) Analyse table does not fail for non existing partitions

2021-12-14 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-25808:
--

Assignee: mahesh kumar behera

> Analyse table does not fail for non existing partitions
> ---
>
> Key: HIVE-25808
> URL: https://issues.apache.org/jira/browse/HIVE-25808
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> If the full partition spec is given in the analyze command, the query fails 
> when the partition does not exist. But if not all the partition column values 
> are given, it does not fail.
> analyze table tbl partition *(fld1 = 2, fld2 = 3)* COMPUTE STATISTICS FOR 
> COLUMNS – this fails with a SemanticException if the partition corresponding 
> to fld1 = 2, fld2 = 3 does not exist. But analyze table tbl partition *(fld1 = 
> 2)* COMPUTE STATISTICS FOR COLUMNS does not fail and computes stats for the 
> whole table.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HIVE-25540) Enable batch update of column stats only for MySql and Postgres

2021-12-14 Thread mahesh kumar behera (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-25540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459241#comment-17459241
 ] 

mahesh kumar behera commented on HIVE-25540:


[~zabetak] 

The batch update uses direct SQL to optimise the number of backend database 
calls. Some of the SQL used is not supported by Oracle, so we need to add a 
check to go via DataNucleus if the backend DB is Oracle. Currently we have 
tested only on MySQL and Postgres. The batch update feature is not yet shipped.

> Enable batch update of column stats only for MySql and Postgres 
> 
>
> Key: HIVE-25540
> URL: https://issues.apache.org/jira/browse/HIVE-25540
> Project: Hive
>  Issue Type: Sub-task
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The batch update of partition column stats using direct SQL has been tested 
> only for MySQL and Postgres.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (HIVE-25778) Hive DB creation is failing when MANAGEDLOCATION is specified with existing location

2021-12-14 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera resolved HIVE-25778.

Resolution: Won't Fix

Closing this Jira, as supporting this scenario may lead to data loss or 
corruption if the user is not very careful.

> Hive DB creation is failing when MANAGEDLOCATION is specified with existing 
> location
> 
>
> Key: HIVE-25778
> URL: https://issues.apache.org/jira/browse/HIVE-25778
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2, Metastore
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> As part of HIVE-23387, a check was added to restrict users from creating a 
> database with a managed table location if the location already exists. This 
> was not the case earlier. As the check causes a backward-compatibility issue, 
> it needs to be removed.
>  
> {code:java}
> if (madeManagedDir) {
>   LOG.info("Created database path in managed directory " + dbMgdPath);
> } else {
>   throw new MetaException(
>   "Unable to create database managed directory " + dbMgdPath + ", failed 
> to create database " + db.getName());
> }  {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HIVE-25778) Hive DB creation is failing when MANAGEDLOCATION is specified with existing location

2021-12-06 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-25778:
--


> Hive DB creation is failing when MANAGEDLOCATION is specified with existing 
> location
> 
>
> Key: HIVE-25778
> URL: https://issues.apache.org/jira/browse/HIVE-25778
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2, Metastore
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> As part of HIVE-23387, a check was added to restrict users from creating a 
> database with a managed table location if the location is already present. 
> This was not the case earlier. As this causes a backward compatibility issue, 
> the check needs to be removed.
>  
> {code:java}
> if (madeManagedDir) {
>   LOG.info("Created database path in managed directory " + dbMgdPath);
> } else {
>   throw new MetaException(
>   "Unable to create database managed directory " + dbMgdPath + ", failed 
> to create database " + db.getName());
> }  {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HIVE-25638) Select returns deleted records in Hive ACID table

2021-10-29 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-25638:
---
Description: Hive stores the stripe stats in the ORC files. During select, 
these stats are used to create the SARG. The SARG is used to reduce the records 
read from the delete-delta files. Currently, when the number of stripes is more 
than one, the generated SARG is incorrect because it uses the first stripe 
index for both the min and the max key interval. The max key interval should be 
obtained from the last stripe index. This causes some valid deleted records to 
be skipped, and those records are returned to the user. We need the last stripe 
here instead of the first one because the keys are ordered in the file.  (was: 
Hive stores the stripe stats in the ORC files. During select, these stats are 
used to create the SARG. The SARG is used to reduce the records read from the 
delete-delta files. Currently, in case where the number of stripes are more 
than 1, the SARG generated is not proper as it uses the first stripe index for 
both min and max key interval. The max key interval should be obtained from 
last stripe index. This cause some valid deleted records to be skipped. And 
those records are return to the user.)

> Select returns deleted records in Hive ACID table
> -
>
> Key: HIVE-25638
> URL: https://issues.apache.org/jira/browse/HIVE-25638
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hive stores the stripe stats in the ORC files. During select, these stats are 
> used to create the SARG. The SARG is used to reduce the records read from the 
> delete-delta files. Currently, when the number of stripes is more than one, 
> the generated SARG is incorrect because it uses the first stripe index for 
> both the min and the max key interval. The max key interval should be 
> obtained from the last stripe index. This causes some valid deleted records 
> to be skipped, and those records are returned to the user. We need the last 
> stripe here instead of the first one because the keys are ordered in the file.
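The fix described above can be sketched as follows. This is a minimal illustration, not the actual Hive SARG code; the per-stripe key ranges are hypothetical stand-ins for the ORC stripe index entries. Because the ROW__ID keys are ordered across stripes, the interval's min comes from the first stripe and its max from the last stripe:

```java
import java.util.List;

// Minimal sketch of deriving the key interval used for SARG generation.
// Each long[] is a hypothetical {minKey, maxKey} pair for one stripe.
public class KeyIntervalSketch {
    // Keys are ordered across stripes, so the overall min comes from the
    // first stripe and the overall max from the LAST stripe. (The bug was
    // using the first stripe for both ends of the interval.)
    public static long[] keyInterval(List<long[]> stripeKeyRanges) {
        long min = stripeKeyRanges.get(0)[0];                          // first stripe's min
        long max = stripeKeyRanges.get(stripeKeyRanges.size() - 1)[1]; // last stripe's max
        return new long[] { min, max };
    }

    public static void main(String[] args) {
        // Two stripes: keys 1..100 and 101..250.
        List<long[]> stripes = List.of(new long[]{1, 100}, new long[]{101, 250});
        long[] interval = keyInterval(stripes);
        System.out.println(interval[0] + ".." + interval[1]);
    }
}
```

With the buggy behaviour the interval would have been 1..100, so deletes with keys above 100 would be filtered out of the delete-delta read and the deleted rows would reappear.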



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-25638) Select returns deleted records in Hive ACID table

2021-10-24 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-25638:
---
Description: Hive stores the stripe stats in the ORC files. During select, 
these stats are used to create the SARG. The SARG is used to reduce the records 
read from the delete-delta files. Currently, when the number of stripes is more 
than one, the generated SARG is incorrect because it uses the first stripe 
index for both the min and the max key interval. The max key interval should be 
obtained from the last stripe index. This causes some valid deleted records to 
be skipped, and those records are returned to the user.  (was: Hive stores the 
stripe stats in the ORC files. During select, these stats are used to create 
the SARG. The SARG is used to reduce the records read from the delete-delta 
files. Currently, in case where the number of stripes are more than 1, the SARG 
generated is not proper as it uses the first stripe index for both min and max 
key interval. The max key interval should be obtained from last stripe index.)

> Select returns deleted records in Hive ACID table
> -
>
> Key: HIVE-25638
> URL: https://issues.apache.org/jira/browse/HIVE-25638
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> Hive stores the stripe stats in the ORC files. During select, these stats are 
> used to create the SARG. The SARG is used to reduce the records read from the 
> delete-delta files. Currently, when the number of stripes is more than one, 
> the generated SARG is incorrect because it uses the first stripe index for 
> both the min and the max key interval. The max key interval should be 
> obtained from the last stripe index. This causes some valid deleted records 
> to be skipped, and those records are returned to the user.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-25638) Select returns deleted records in Hive ACID table

2021-10-24 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-25638:
---
Summary: Select returns deleted records in Hive ACID table  (was: Select 
returns the deleted records in Hive ACID table)

> Select returns deleted records in Hive ACID table
> -
>
> Key: HIVE-25638
> URL: https://issues.apache.org/jira/browse/HIVE-25638
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> Hive stores the stripe stats in the ORC files. During select, these stats are 
> used to create the SARG. The SARG is used to reduce the records read from the 
> delete-delta files. Currently, when the number of stripes is more than one, 
> the generated SARG is incorrect because it uses the first stripe index for 
> both the min and the max key interval. The max key interval should be 
> obtained from the last stripe index.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-25638) Select returns the deleted records in Hive ACID table

2021-10-24 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-25638:
--


> Select returns the deleted records in Hive ACID table
> -
>
> Key: HIVE-25638
> URL: https://issues.apache.org/jira/browse/HIVE-25638
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> Hive stores the stripe stats in the ORC files. During select, these stats are 
> used to create the SARG. The SARG is used to reduce the records read from the 
> delete-delta files. Currently, when the number of stripes is more than one, 
> the generated SARG is incorrect because it uses the first stripe index for 
> both the min and the max key interval. The max key interval should be 
> obtained from the last stripe index.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-25540) Enable batch updation of column stats only for MySql and Postgres

2021-09-20 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-25540:
--


> Enable batch updation of column stats only for MySql and Postgres 
> --
>
> Key: HIVE-25540
> URL: https://issues.apache.org/jira/browse/HIVE-25540
> Project: Hive
>  Issue Type: Sub-task
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> The batch update of partition column stats using direct SQL is tested only 
> for MySQL and Postgres.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HIVE-25527) LLAP Scheduler task exits with fatal error if the executor node is down

2021-09-20 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera resolved HIVE-25527.

Resolution: Fixed

> LLAP Scheduler task exits with fatal error if the executor node is down
> ---
>
> Key: HIVE-25527
> URL: https://issues.apache.org/jira/browse/HIVE-25527
> Project: Hive
>  Issue Type: Sub-task
>  Components: HiveServer2
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> If the executor host has gone down, activeInstances will be updated with 
> null, so we need to check for empty/null values before accessing it.
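The defensive check described above can be sketched like this. The names are illustrative only; the real LLAP scheduler keeps richer state than a plain collection:

```java
import java.util.Collection;
import java.util.Collections;

// Sketch of guarding access to activeInstances after an executor host has
// gone down. The field and method names are hypothetical, not the actual
// LLAP task-scheduler API.
public class ActiveInstancesGuard {
    // Return a safe, possibly empty view instead of letting callers hit a
    // null collection once the executor node disappeared from the registry.
    public static Collection<String> safeInstances(Collection<String> activeInstances) {
        return (activeInstances == null || activeInstances.isEmpty())
                ? Collections.emptyList()
                : activeInstances;
    }

    public static void main(String[] args) {
        System.out.println(safeInstances(null).size());                           // node gone
        System.out.println(safeInstances(Collections.singleton("host1")).size()); // node alive
    }
}
```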



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HIVE-25417) Null bit vector is not handled while getting the stats for Postgres backend

2021-09-20 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera resolved HIVE-25417.

Resolution: Fixed

> Null bit vector is not handled while getting the stats for Postgres backend
> ---
>
> Key: HIVE-25417
> URL: https://issues.apache.org/jira/browse/HIVE-25417
> Project: Hive
>  Issue Type: Sub-task
>  Components: HiveServer2
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> While adding stats with a null bit vector, a special string "HL" is stored, 
> as Postgres does not support null values for byte columns. But while getting 
> the stats, the conversion back to null is not done. This causes a failure 
> during deserialisation of the bit vector field if the existing stats are 
> used for a merge.
>  
> {code:java}
> The input stream is not a HyperLogLog stream.  7276-1 instead of 727676 or 7077
>   at org.apache.hadoop.hive.common.ndv.hll.HyperLogLogUtils.checkMagicString(HyperLogLogUtils.java:349)
>   at org.apache.hadoop.hive.common.ndv.hll.HyperLogLogUtils.deserializeHLL(HyperLogLogUtils.java:139)
>   at org.apache.hadoop.hive.common.ndv.hll.HyperLogLogUtils.deserializeHLL(HyperLogLogUtils.java:213)
>   at org.apache.hadoop.hive.common.ndv.hll.HyperLogLogUtils.deserializeHLL(HyperLogLogUtils.java:227)
>   at org.apache.hadoop.hive.common.ndv.NumDistinctValueEstimatorFactory.getNumDistinctValueEstimator(NumDistinctValueEstimatorFactory.java:53)
>   at org.apache.hadoop.hive.metastore.columnstats.cache.LongColumnStatsDataInspector.updateNdvEstimator(LongColumnStatsDataInspector.java:124)
>   at org.apache.hadoop.hive.metastore.columnstats.cache.LongColumnStatsDataInspector.getNdvEstimator(LongColumnStatsDataInspector.java:107)
>   at org.apache.hadoop.hive.metastore.columnstats.merge.LongColumnStatsMerger.merge(LongColumnStatsMerger.java:36)
>   at org.apache.hadoop.hive.metastore.utils.MetaStoreUtils.mergeColStats(MetaStoreUtils.java:1174)
>   at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.updateTableColumnStatsWithMerge(HiveMetaStore.java:8934)
>   at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.set_aggr_stats_for(HiveMetaStore.java:8800)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:160)
>   at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:121)
>   at com.sun.proxy.$Proxy35.set_aggr_stats_for(Unknown Source)
>   at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$set_aggr_stats_for.getResult(ThriftHiveMetastore.java:20489)
>   at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$set_aggr_stats_for.getResult(ThriftHiveMetastore.java:20473)
>   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>   at org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge.java:643)
>   at org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge.java:638)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
>   at org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor.process(HadoopThriftAuthBridge.java:638)
>   at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
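The missing read-side conversion can be sketched as a simple round trip. The helper names below are hypothetical, not the actual metastore code; only the "HL" placeholder string is from the issue description:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch of the null-bit-vector round trip for a Postgres-backed metastore.
// "HL" is the placeholder written because Postgres rejects null byte values;
// the helper names are illustrative, not Hive's actual API.
public class BitVectorNullSketch {
    private static final byte[] NULL_MARKER = "HL".getBytes(StandardCharsets.UTF_8);

    // Write side: substitute the marker for a null bit vector.
    public static byte[] toStored(byte[] bitVector) {
        return bitVector == null ? NULL_MARKER : bitVector;
    }

    // Read side (the conversion the bug omitted): map the marker back to
    // null so the HyperLogLog deserializer never sees the fake "HL" bytes.
    public static byte[] fromStored(byte[] stored) {
        return Arrays.equals(stored, NULL_MARKER) ? null : stored;
    }

    public static void main(String[] args) {
        System.out.println(fromStored(toStored(null)) == null);
        byte[] real = new byte[]{1, 2, 3};
        System.out.println(Arrays.equals(fromStored(toStored(real)), real));
    }
}
```

Without `fromStored`, the literal "HL" bytes reach `HyperLogLogUtils.deserializeHLL`, which fails the magic-string check seen in the stack trace above.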



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-25527) LLAP Scheduler task exits with fatal error if the executor node is down.

2021-09-15 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-25527:
--


> LLAP Scheduler task exits with fatal error if the executor node is down.
> 
>
> Key: HIVE-25527
> URL: https://issues.apache.org/jira/browse/HIVE-25527
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> If the executor host has gone down, activeInstances will be updated with 
> null, so we need to check for empty/null values before accessing it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-25414) Optimise Hive::addWriteNotificationLog: Reduce FS call per notification

2021-09-15 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-25414:
--

Assignee: mahesh kumar behera

> Optimise Hive::addWriteNotificationLog: Reduce FS call per notification
> ---
>
> Key: HIVE-25414
> URL: https://issues.apache.org/jira/browse/HIVE-25414
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Rajesh Balamohan
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> AddWriteNotification is slow due to FS interactions (i.e., fetching the set 
> of inserted file information). This can be avoided, as a FileStatus can be 
> passed instead of a Path from the parent methods.
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L3572]
>  
> [https://github.com/apache/hive/blob/96b39cd5190f0cfadb677e3f3b7ead1d663921b2/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L3620]
>  
>  
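The optimisation amounts to reusing the metadata the caller already holds instead of re-stat-ing each path. A self-contained sketch with simplified stand-in types (not Hive's or Hadoop's actual classes or signatures):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the HIVE-25414 idea: pass the already-fetched metadata
// (a FileStatus-like object) down to the notification code instead of a
// bare path that forces another filesystem round trip per file.
public class WriteNotificationSketch {
    // Simplified stand-in for org.apache.hadoop.fs.FileStatus.
    static class FileStat {
        final String path;
        final long length;
        FileStat(String path, long length) { this.path = path; this.length = length; }
    }

    static int extraFsCalls = 0;

    // Old shape: only a path is available, so metadata must be re-fetched.
    static long sizeFromPath(String path) {
        extraFsCalls++;   // one extra FS call per file per notification
        return 0L;        // pretend stat() result
    }

    // New shape: the caller already has the FileStat, so no FS call is needed.
    static long sizeFromStatus(FileStat status) {
        return status.length;
    }

    public static void main(String[] args) {
        List<FileStat> files = new ArrayList<>();
        files.add(new FileStat("/warehouse/t/p=1/f0", 10));
        files.add(new FileStat("/warehouse/t/p=1/f1", 20));
        long total = 0;
        for (FileStat f : files) {
            total += sizeFromStatus(f);   // zero additional FS calls
        }
        System.out.println(total + " bytes, " + extraFsCalls + " extra FS calls");
    }
}
```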



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HIVE-25438) Update partition column stats fails with invalid syntax error for MySql

2021-09-15 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera resolved HIVE-25438.

Resolution: Fixed

> Update partition column stats fails with invalid syntax error for MySql
> ---
>
> Key: HIVE-25438
> URL: https://issues.apache.org/jira/browse/HIVE-25438
> Project: Hive
>  Issue Type: Sub-task
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Double quotes around identifiers are not supported by MySQL if ANSI_QUOTES 
> is not set.
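The failure mode can be illustrated by quoting identifiers per backend. This is a sketch only; Hive's actual direct-SQL generation differs. MySQL treats double-quoted tokens as string literals unless the ANSI_QUOTES SQL mode is enabled, so MySQL needs backticks:

```java
// Sketch of choosing an identifier quote character per backend: MySQL needs
// backticks unless ANSI_QUOTES is enabled, while Postgres/ANSI SQL use
// double quotes. Illustrative only, not the actual metastore code.
public class IdentifierQuoting {
    enum Db { MYSQL, POSTGRES }

    static String quote(String ident, Db db, boolean ansiQuotes) {
        char q = (db == Db.MYSQL && !ansiQuotes) ? '`' : '"';
        return q + ident + q;
    }

    public static void main(String[] args) {
        System.out.println(quote("PART_COL_STATS", Db.MYSQL, false));
        System.out.println(quote("PART_COL_STATS", Db.POSTGRES, false));
    }
}
```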



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HIVE-25414) Optimise Hive::addWriteNotificationLog: Reduce FS call per notification

2021-09-15 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera resolved HIVE-25414.

Resolution: Fixed

> Optimise Hive::addWriteNotificationLog: Reduce FS call per notification
> ---
>
> Key: HIVE-25414
> URL: https://issues.apache.org/jira/browse/HIVE-25414
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Rajesh Balamohan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> AddWriteNotification is slow due to FS interactions (i.e., fetching the set 
> of inserted file information). This can be avoided, as a FileStatus can be 
> passed instead of a Path from the parent methods.
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L3572]
>  
> [https://github.com/apache/hive/blob/96b39cd5190f0cfadb677e3f3b7ead1d663921b2/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L3620]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-25438) Update partition column stats fails with invalid syntax error for MySql

2021-08-08 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-25438:
--


> Update partition column stats fails with invalid syntax error for MySql
> ---
>
> Key: HIVE-25438
> URL: https://issues.apache.org/jira/browse/HIVE-25438
> Project: Hive
>  Issue Type: Sub-task
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> Double quotes around identifiers are not supported by MySQL if ANSI_QUOTES 
> is not set.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HIVE-25431) Enable CBO for null safe equality operator.

2021-08-05 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera resolved HIVE-25431.

Resolution: Fixed

> Enable CBO for null safe equality operator.
> ---
>
> Key: HIVE-25431
> URL: https://issues.apache.org/jira/browse/HIVE-25431
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, HiveServer2
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> The CBO is disabled for the null safe equality (<=>) operator. This causes 
> sub-optimal join execution for some queries. As null safe equality is 
> supported by joins, the CBO can be enabled for it. There will be issues with 
> join reordering, as Hive does not support join reordering for the null safe 
> equality operator, but with CBO enabled the join plan will be better.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-25432) Support Join reordering for null safe equality operator.

2021-08-05 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-25432:
---
Parent: (was: HIVE-25431)
Issue Type: Bug  (was: Sub-task)

> Support Join reordering for null safe equality operator.
> 
>
> Key: HIVE-25432
> URL: https://issues.apache.org/jira/browse/HIVE-25432
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, HiveServer2
>Reporter: mahesh kumar behera
>Priority: Major
>
> Support Join reordering for null safe equality operator.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-25431) Enable CBO for null safe equality operator.

2021-08-05 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-25431:
--


> Enable CBO for null safe equality operator.
> ---
>
> Key: HIVE-25431
> URL: https://issues.apache.org/jira/browse/HIVE-25431
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, HiveServer2
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> The CBO is disabled for the null safe equality (<=>) operator. This causes 
> sub-optimal join execution for some queries. As null safe equality is 
> supported by joins, the CBO can be enabled for it. There will be issues with 
> join reordering, as Hive does not support join reordering for the null safe 
> equality operator, but with CBO enabled the join plan will be better.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HIVE-25373) Modify buildColumnStatsDesc to send configured number of stats for updation

2021-08-05 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera resolved HIVE-25373.

Resolution: Fixed

> Modify buildColumnStatsDesc to send configured number of stats for updation
> ---
>
> Key: HIVE-25373
> URL: https://issues.apache.org/jira/browse/HIVE-25373
> Project: Hive
>  Issue Type: Sub-task
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The number of stats sent for update should be controlled to avoid a thrift 
> error in case the size exceeds the limit.
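Bounding the request size comes down to splitting the stats list into fixed-size chunks before sending each one. A sketch with illustrative names (the real config key and thrift request types differ):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of splitting a large column-stats list into bounded batches so a
// single thrift request never exceeds the size limit. Names are
// illustrative, not the actual buildColumnStatsDesc code.
public class StatsBatcher {
    public static <T> List<List<T>> batches(List<T> stats, int batchSize) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < stats.size(); i += batchSize) {
            out.add(stats.subList(i, Math.min(i + batchSize, stats.size())));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> stats = List.of(1, 2, 3, 4, 5);
        List<List<Integer>> chunks = batches(stats, 2);
        System.out.println(chunks.size());   // number of requests sent
        System.out.println(chunks.get(2));   // the final, partial batch
    }
}
```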



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HIVE-25342) Optimize set_aggr_stats_for for mergeColStats path.

2021-08-05 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera resolved HIVE-25342.

Resolution: Fixed

> Optimize set_aggr_stats_for for mergeColStats path. 
> 
>
> Key: HIVE-25342
> URL: https://issues.apache.org/jira/browse/HIVE-25342
> Project: Hive
>  Issue Type: Sub-task
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The direct SQL optimisation used for the normal path can also be used for 
> the mergeColStats path. The stats to be updated can be accumulated in a temp 
> list, and that list can be used to update the stats in a batch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HIVE-25205) Reduce overhead of adding write notification log during batch loading of partition..

2021-08-05 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera resolved HIVE-25205.

Resolution: Fixed

> Reduce overhead of adding write notification log during batch loading of 
> partition..
> 
>
> Key: HIVE-25205
> URL: https://issues.apache.org/jira/browse/HIVE-25205
> Project: Hive
>  Issue Type: Sub-task
>  Components: Hive, HiveServer2
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: performance
>
> During batch loading of partitions, the write notification logs are added 
> for each partition individually. This causes delays in execution, as a call 
> to HMS is made per partition. This can be optimised by adding a new API in 
> HMS that accepts a batch of partitions, so the whole batch can be added 
> together to the backend database. Once we have a batch of notification logs 
> at the HMS side, the code can be optimised to add the logs using a single 
> call to the backend RDBMS. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-25417) Null bit vector is not handled while getting the stats for Postgres backend

2021-08-02 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-25417:
--


> Null bit vector is not handled while getting the stats for Postgres backend
> ---
>
> Key: HIVE-25417
> URL: https://issues.apache.org/jira/browse/HIVE-25417
> Project: Hive
>  Issue Type: Sub-task
>  Components: HiveServer2
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> While adding stats with a null bit vector, a special string "HL" is stored, 
> as Postgres does not support null values for byte columns. But while getting 
> the stats, the conversion back to null is not done. This causes a failure 
> during deserialisation of the bit vector field if the existing stats are 
> used for a merge.
>  
> {code:java}
> The input stream is not a HyperLogLog stream.  7276-1 instead of 727676 or 7077
>   at org.apache.hadoop.hive.common.ndv.hll.HyperLogLogUtils.checkMagicString(HyperLogLogUtils.java:349)
>   at org.apache.hadoop.hive.common.ndv.hll.HyperLogLogUtils.deserializeHLL(HyperLogLogUtils.java:139)
>   at org.apache.hadoop.hive.common.ndv.hll.HyperLogLogUtils.deserializeHLL(HyperLogLogUtils.java:213)
>   at org.apache.hadoop.hive.common.ndv.hll.HyperLogLogUtils.deserializeHLL(HyperLogLogUtils.java:227)
>   at org.apache.hadoop.hive.common.ndv.NumDistinctValueEstimatorFactory.getNumDistinctValueEstimator(NumDistinctValueEstimatorFactory.java:53)
>   at org.apache.hadoop.hive.metastore.columnstats.cache.LongColumnStatsDataInspector.updateNdvEstimator(LongColumnStatsDataInspector.java:124)
>   at org.apache.hadoop.hive.metastore.columnstats.cache.LongColumnStatsDataInspector.getNdvEstimator(LongColumnStatsDataInspector.java:107)
>   at org.apache.hadoop.hive.metastore.columnstats.merge.LongColumnStatsMerger.merge(LongColumnStatsMerger.java:36)
>   at org.apache.hadoop.hive.metastore.utils.MetaStoreUtils.mergeColStats(MetaStoreUtils.java:1174)
>   at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.updateTableColumnStatsWithMerge(HiveMetaStore.java:8934)
>   at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.set_aggr_stats_for(HiveMetaStore.java:8800)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:160)
>   at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:121)
>   at com.sun.proxy.$Proxy35.set_aggr_stats_for(Unknown Source)
>   at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$set_aggr_stats_for.getResult(ThriftHiveMetastore.java:20489)
>   at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$set_aggr_stats_for.getResult(ThriftHiveMetastore.java:20473)
>   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>   at org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge.java:643)
>   at org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge.java:638)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
>   at org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor.process(HadoopThriftAuthBridge.java:638)
>   at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-25373) Modify buildColumnStatsDesc to send configured number of stats for updation

2021-07-22 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-25373:
--


> Modify buildColumnStatsDesc to send configured number of stats for updation
> ---
>
> Key: HIVE-25373
> URL: https://issues.apache.org/jira/browse/HIVE-25373
> Project: Hive
>  Issue Type: Sub-task
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> The number of stats sent for update should be controlled to avoid a thrift 
> error in case the size exceeds the limit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-25342) Optimize set_aggr_stats_for for mergeColStats path.

2021-07-18 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-25342:
--


> Optimize set_aggr_stats_for for mergeColStats path. 
> 
>
> Key: HIVE-25342
> URL: https://issues.apache.org/jira/browse/HIVE-25342
> Project: Hive
>  Issue Type: Sub-task
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> The direct SQL optimisation used for the normal path can also be used for 
> the mergeColStats path. The stats to be updated can be accumulated in a temp 
> list, and that list can be used to update the stats in a batch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HIVE-25225) Update column stat throws NPE if direct sql is disabled

2021-07-04 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera resolved HIVE-25225.

Resolution: Fixed

> Update column stat throws NPE if direct sql is disabled
> ---
>
> Key: HIVE-25225
> URL: https://issues.apache.org/jira/browse/HIVE-25225
> Project: Hive
>  Issue Type: Sub-task
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> In case direct SQL is disabled, the MetaStoreDirectSql object is not 
> initialised, and that causes an NPE. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-25251) Reduce overhead of adding partitions during batch loading of partitions.

2021-06-15 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-25251:
--


> Reduce overhead of adding partitions during batch loading of partitions.
> 
>
> Key: HIVE-25251
> URL: https://issues.apache.org/jira/browse/HIVE-25251
> Project: Hive
>  Issue Type: Sub-task
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> The add-partitions call made to HMS does a serial execution of DataNucleus 
> calls to add the partitions to the backend DB. This can be further optimised 
> by batching those SQL statements.





[jira] [Resolved] (HIVE-25204) Reduce overhead of adding notification log for update partition column statistics

2021-06-15 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera resolved HIVE-25204.

Resolution: Fixed

> Reduce overhead of adding notification log for update partition column 
> statistics
> -
>
> Key: HIVE-25204
> URL: https://issues.apache.org/jira/browse/HIVE-25204
> Project: Hive
>  Issue Type: Sub-task
>  Components: Hive, HiveServer2
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: perfomance, pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The notification logs for partition column statistics can be optimised by 
> adding them in a batch. In the current implementation they are added one by 
> one, causing multiple SQL executions in the backend RDBMS. These SQL 
> executions can be batched to reduce the execution time.





[jira] [Assigned] (HIVE-25225) Update column stat throws NPE if direct sql is disabled

2021-06-09 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-25225:
--


> Update column stat throws NPE if direct sql is disabled
> ---
>
> Key: HIVE-25225
> URL: https://issues.apache.org/jira/browse/HIVE-25225
> Project: Hive
>  Issue Type: Sub-task
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> If direct SQL is disabled, the MetaStoreDirectSql object is not 
> initialised, and that causes the NPE.





[jira] [Resolved] (HIVE-24073) Execution exception in sort-merge semijoin

2021-06-07 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera resolved HIVE-24073.

Resolution: Fixed

> Execution exception in sort-merge semijoin
> --
>
> Key: HIVE-24073
> URL: https://issues.apache.org/jira/browse/HIVE-24073
> Project: Hive
>  Issue Type: Bug
>  Components: Operators
>Reporter: Jesus Camacho Rodriguez
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Working on HIVE-24041, we trigger an additional SJ conversion that leads to 
> this exception at execution time:
> {code}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Attempting to overwrite 
> nextKeyWritables[1]
>   at 
> org.apache.hadoop.hive.ql.exec.GroupByOperator.closeOp(GroupByOperator.java:1063)
>   at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:685)
>   at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:707)
>   at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:707)
>   at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:707)
>   at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:707)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.close(MapRecordProcessor.java:462)
>   ... 16 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Attempting to overwrite 
> nextKeyWritables[1]
>   at 
> org.apache.hadoop.hive.ql.exec.GroupByOperator.flush(GroupByOperator.java:1037)
>   at 
> org.apache.hadoop.hive.ql.exec.GroupByOperator.closeOp(GroupByOperator.java:1060)
>   ... 22 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Attempting to 
> overwrite nextKeyWritables[1]
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.processKey(CommonMergeJoinOperator.java:564)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.process(CommonMergeJoinOperator.java:243)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:887)
>   at 
> org.apache.hadoop.hive.ql.exec.TezDummyStoreOperator.process(TezDummyStoreOperator.java:49)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:887)
>   at 
> org.apache.hadoop.hive.ql.exec.GroupByOperator.forward(GroupByOperator.java:1003)
>   at 
> org.apache.hadoop.hive.ql.exec.GroupByOperator.flush(GroupByOperator.java:1020)
>   ... 23 more
> {code}
> To reproduce, just set {{hive.auto.convert.sortmerge.join}} to {{true}} in 
> the last query in {{auto_sortmerge_join_10.q}} after HIVE-24041 has been 
> merged.





[jira] [Updated] (HIVE-25205) Reduce overhead of partition column stat updation during batch loading of partitions.

2021-06-07 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-25205:
---
Summary: Reduce overhead of partition column stat updation during batch 
loading of partitions.  (was: Reduce overhead of adding write notification log 
during batch loading of partition.)

> Reduce overhead of partition column stat updation during batch loading of 
> partitions.
> -
>
> Key: HIVE-25205
> URL: https://issues.apache.org/jira/browse/HIVE-25205
> Project: Hive
>  Issue Type: Sub-task
>  Components: Hive, HiveServer2
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: performance
>
> During batch loading of partitions, a write notification log is added for 
> each partition. This delays execution, as a call to HMS is made for each 
> partition. This can be optimised by adding a new API in HMS that accepts a 
> batch of partitions, so the batch can be added together to the backend 
> database. Once HMS has a batch of notification logs, the code can be 
> optimised to add them using a single call to the backend RDBMS.





[jira] [Updated] (HIVE-24663) Reduce overhead of partition column stat updation during batch loading of partitions.

2021-06-07 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-24663:
---
Summary: Reduce overhead of partition column stat updation during batch 
loading of partitions.  (was: Reduce overhead of partition column stats 
updation.)

> Reduce overhead of partition column stat updation during batch loading of 
> partitions.
> -
>
> Key: HIVE-24663
> URL: https://issues.apache.org/jira/browse/HIVE-24663
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Rajesh Balamohan
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: performance, pull-request-available
>  Time Spent: 8h 50m
>  Remaining Estimate: 0h
>
> When a large number of partitions (>20K) is processed, ColStatsProcessor 
> runs into DB issues. 
> {{ db.setPartitionColumnStatistics(request);}} gets stuck for hours, and in 
> some cases Postgres stops processing. 
> It would be good to introduce small batches for stats gathering in 
> ColStatsProcessor instead of a bulk update.
> Ref: 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/ColStatsProcessor.java#L181
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/ColStatsProcessor.java#L199





[jira] [Resolved] (HIVE-24663) Reduce overhead of partition column stat updation during batch loading of partitions.

2021-06-07 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera resolved HIVE-24663.

Resolution: Fixed

> Reduce overhead of partition column stat updation during batch loading of 
> partitions.
> -
>
> Key: HIVE-24663
> URL: https://issues.apache.org/jira/browse/HIVE-24663
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Rajesh Balamohan
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: performance, pull-request-available
>  Time Spent: 8h 50m
>  Remaining Estimate: 0h
>
> When a large number of partitions (>20K) is processed, ColStatsProcessor 
> runs into DB issues. 
> {{ db.setPartitionColumnStatistics(request);}} gets stuck for hours, and in 
> some cases Postgres stops processing. 
> It would be good to introduce small batches for stats gathering in 
> ColStatsProcessor instead of a bulk update.
> Ref: 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/ColStatsProcessor.java#L181
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/ColStatsProcessor.java#L199





[jira] [Reopened] (HIVE-25205) Reduce overhead of adding write notification log during batch loading of partition..

2021-06-07 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reopened HIVE-25205:


> Reduce overhead of adding write notification log during batch loading of 
> partition..
> 
>
> Key: HIVE-25205
> URL: https://issues.apache.org/jira/browse/HIVE-25205
> Project: Hive
>  Issue Type: Sub-task
>  Components: Hive, HiveServer2
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: performance
>
> During batch loading of partitions, a write notification log is added for 
> each partition. This delays execution, as a call to HMS is made for each 
> partition. This can be optimised by adding a new API in HMS that accepts a 
> batch of partitions, so the batch can be added together to the backend 
> database. Once HMS has a batch of notification logs, the code can be 
> optimised to add them using a single call to the backend RDBMS.





[jira] [Updated] (HIVE-25205) Reduce overhead of adding write notification log during batch loading of partition..

2021-06-07 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-25205:
---
Summary: Reduce overhead of adding write notification log during batch 
loading of partition..  (was: Reduce overhead of partition column stat updation 
during batch loading of partitions.)

> Reduce overhead of adding write notification log during batch loading of 
> partition..
> 
>
> Key: HIVE-25205
> URL: https://issues.apache.org/jira/browse/HIVE-25205
> Project: Hive
>  Issue Type: Sub-task
>  Components: Hive, HiveServer2
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: performance
>
> During batch loading of partitions, a write notification log is added for 
> each partition. This delays execution, as a call to HMS is made for each 
> partition. This can be optimised by adding a new API in HMS that accepts a 
> batch of partitions, so the batch can be added together to the backend 
> database. Once HMS has a batch of notification logs, the code can be 
> optimised to add them using a single call to the backend RDBMS.





[jira] [Resolved] (HIVE-25205) Reduce overhead of partition column stat updation during batch loading of partitions.

2021-06-07 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera resolved HIVE-25205.

Resolution: Fixed

> Reduce overhead of partition column stat updation during batch loading of 
> partitions.
> -
>
> Key: HIVE-25205
> URL: https://issues.apache.org/jira/browse/HIVE-25205
> Project: Hive
>  Issue Type: Sub-task
>  Components: Hive, HiveServer2
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: performance
>
> During batch loading of partitions, a write notification log is added for 
> each partition. This delays execution, as a call to HMS is made for each 
> partition. This can be optimised by adding a new API in HMS that accepts a 
> batch of partitions, so the batch can be added together to the backend 
> database. Once HMS has a batch of notification logs, the code can be 
> optimised to add them using a single call to the backend RDBMS.





[jira] [Resolved] (HIVE-24284) NPE when parsing druid logs using Hive

2021-06-06 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera resolved HIVE-24284.

Resolution: Fixed

> NPE when parsing druid logs using Hive
> --
>
> Key: HIVE-24284
> URL: https://issues.apache.org/jira/browse/HIVE-24284
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The current syslog parser always expects a valid proc id, but as per RFC 3164 
> and RFC 5424 the proc id can be omitted. So Hive should handle it by using 
> the NILVALUE/empty string in case the proc id is null.
>  
> {code:java}
> Caused by: java.lang.NullPointerException: null
> at java.lang.String.(String.java:566)
> at 
> org.apache.hadoop.hive.ql.log.syslog.SyslogParser.createEvent(SyslogParser.java:361)
> at 
> org.apache.hadoop.hive.ql.log.syslog.SyslogParser.readEvent(SyslogParser.java:326)
> at 
> org.apache.hadoop.hive.ql.log.syslog.SyslogSerDe.deserialize(SyslogSerDe.java:95)
>  {code}
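The null-safe handling described above can be sketched as follows. This is a hypothetical helper, not Hive's actual SyslogParser code: per RFC 5424 the PROCID field may be the NILVALUE ("-") or absent, so the parser should substitute an empty string rather than pass a null byte array to `new String(...)` (which is the source of the NPE in the stack trace).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: convert a possibly-missing proc id to a safe string.
public class ProcIdGuard {
    public static final String NILVALUE = "-"; // RFC 5424 NILVALUE

    // Returns "" when the proc id is absent or the NILVALUE, avoiding the
    // NullPointerException thrown by String's byte[] constructor on null.
    public static String safeProcId(byte[] procId) {
        if (procId == null) {
            return "";
        }
        String s = new String(procId, StandardCharsets.UTF_8);
        return NILVALUE.equals(s) ? "" : s;
    }

    public static void main(String[] args) {
        System.out.println("[" + safeProcId(null) + "]");               // [] instead of NPE
        System.out.println("[" + safeProcId("-".getBytes(StandardCharsets.UTF_8)) + "]");    // []
        System.out.println("[" + safeProcId("1234".getBytes(StandardCharsets.UTF_8)) + "]"); // [1234]
    }
}
```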





[jira] [Updated] (HIVE-25205) Reduce overhead of adding write notification log during batch loading of partition.

2021-06-06 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-25205:
---
Labels: performance  (was: )

> Reduce overhead of adding write notification log during batch loading of 
> partition.
> ---
>
> Key: HIVE-25205
> URL: https://issues.apache.org/jira/browse/HIVE-25205
> Project: Hive
>  Issue Type: Sub-task
>  Components: Hive, HiveServer2
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: performance
>
> During batch loading of partitions, a write notification log is added for 
> each partition. This delays execution, as a call to HMS is made for each 
> partition. This can be optimised by adding a new API in HMS that accepts a 
> batch of partitions, so the batch can be added together to the backend 
> database. Once HMS has a batch of notification logs, the code can be 
> optimised to add them using a single call to the backend RDBMS.





[jira] [Updated] (HIVE-25204) Reduce overhead of adding notification log for update partition column statistics

2021-06-06 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-25204:
---
Labels: perfomance  (was: )

> Reduce overhead of adding notification log for update partition column 
> statistics
> -
>
> Key: HIVE-25204
> URL: https://issues.apache.org/jira/browse/HIVE-25204
> Project: Hive
>  Issue Type: Sub-task
>  Components: Hive, HiveServer2
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: perfomance
>
> The notification logs for partition column statistics can be optimised by 
> adding them in a batch. In the current implementation they are added one by 
> one, causing multiple SQL executions in the backend RDBMS. These SQL 
> executions can be batched to reduce the execution time.





[jira] [Assigned] (HIVE-25205) Reduce overhead of adding write notification log during batch loading of partition.

2021-06-06 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-25205:
--


> Reduce overhead of adding write notification log during batch loading of 
> partition.
> ---
>
> Key: HIVE-25205
> URL: https://issues.apache.org/jira/browse/HIVE-25205
> Project: Hive
>  Issue Type: Sub-task
>  Components: Hive, HiveServer2
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> During batch loading of partitions, a write notification log is added for 
> each partition. This delays execution, as a call to HMS is made for each 
> partition. This can be optimised by adding a new API in HMS that accepts a 
> batch of partitions, so the batch can be added together to the backend 
> database. Once HMS has a batch of notification logs, the code can be 
> optimised to add them using a single call to the backend RDBMS.





[jira] [Assigned] (HIVE-25204) Reduce overhead of adding notification log for update partition column statistics

2021-06-06 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-25204:
--


> Reduce overhead of adding notification log for update partition column 
> statistics
> -
>
> Key: HIVE-25204
> URL: https://issues.apache.org/jira/browse/HIVE-25204
> Project: Hive
>  Issue Type: Sub-task
>  Components: Hive, HiveServer2
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> The notification logs for partition column statistics can be optimised by 
> adding them in a batch. In the current implementation they are added one by 
> one, causing multiple SQL executions in the backend RDBMS. These SQL 
> executions can be batched to reduce the execution time.





[jira] [Updated] (HIVE-25181) Analyse and optimise execution time for batch loading of partitions.

2021-05-31 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-25181:
---
Labels: performance  (was: )

> Analyse and optimise execution time for batch loading of partitions.
> 
>
> Key: HIVE-25181
> URL: https://issues.apache.org/jira/browse/HIVE-25181
> Project: Hive
>  Issue Type: Task
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: performance
>
> When partitions are loaded in batches of more than 10k, the execution time 
> exceeds hours. This may be an issue for ETL-type workloads. This task is to 
> track the issues and fix them.





[jira] [Updated] (HIVE-24663) Reduce overhead of partition column stats updation.

2021-05-31 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-24663:
---
Parent: HIVE-25181
Issue Type: Sub-task  (was: Improvement)

> Reduce overhead of partition column stats updation.
> ---
>
> Key: HIVE-24663
> URL: https://issues.apache.org/jira/browse/HIVE-24663
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Rajesh Balamohan
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: performance, pull-request-available
>  Time Spent: 7h 50m
>  Remaining Estimate: 0h
>
> When a large number of partitions (>20K) is processed, ColStatsProcessor 
> runs into DB issues. 
> {{ db.setPartitionColumnStatistics(request);}} gets stuck for hours, and in 
> some cases Postgres stops processing. 
> It would be good to introduce small batches for stats gathering in 
> ColStatsProcessor instead of a bulk update.
> Ref: 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/ColStatsProcessor.java#L181
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/ColStatsProcessor.java#L199





[jira] [Updated] (HIVE-24663) Reduce overhead of partition column stats updation.

2021-05-31 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-24663:
---
Summary: Reduce overhead of partition column stats updation.  (was: Batch 
process in ColStatsProcessor for partitions.)

> Reduce overhead of partition column stats updation.
> ---
>
> Key: HIVE-24663
> URL: https://issues.apache.org/jira/browse/HIVE-24663
> Project: Hive
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: performance, pull-request-available
>  Time Spent: 7h 50m
>  Remaining Estimate: 0h
>
> When a large number of partitions (>20K) is processed, ColStatsProcessor 
> runs into DB issues. 
> {{ db.setPartitionColumnStatistics(request);}} gets stuck for hours, and in 
> some cases Postgres stops processing. 
> It would be good to introduce small batches for stats gathering in 
> ColStatsProcessor instead of a bulk update.
> Ref: 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/ColStatsProcessor.java#L181
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/ColStatsProcessor.java#L199





[jira] [Assigned] (HIVE-25181) Analyse and optimise execution time for batch loading of partitions.

2021-05-31 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-25181:
--


> Analyse and optimise execution time for batch loading of partitions.
> 
>
> Key: HIVE-25181
> URL: https://issues.apache.org/jira/browse/HIVE-25181
> Project: Hive
>  Issue Type: Task
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> When partitions are loaded in batches of more than 10k, the execution time 
> exceeds hours. This may be an issue for ETL-type workloads. This task is to 
> track the issues and fix them.





[jira] [Resolved] (HIVE-24883) Support ARRAY/STRUCT types in equality SMB and Common merge join

2021-05-23 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera resolved HIVE-24883.

Resolution: Fixed

> Support ARRAY/STRUCT  types in equality SMB and Common merge join
> -
>
> Key: HIVE-24883
> URL: https://issues.apache.org/jira/browse/HIVE-24883
> Project: Hive
>  Issue Type: Sub-task
>  Components: Hive
>Affects Versions: 4.0.0
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Hive fails to execute joins on array type columns as the comparison functions 
> are not able to handle array and struct type columns.   





[jira] [Updated] (HIVE-24883) Support ARRAY/STRUCT types in equality SMB and Common merge join

2021-05-23 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-24883:
---
Parent: HIVE-20962
Issue Type: Sub-task  (was: Bug)

> Support ARRAY/STRUCT  types in equality SMB and Common merge join
> -
>
> Key: HIVE-24883
> URL: https://issues.apache.org/jira/browse/HIVE-24883
> Project: Hive
>  Issue Type: Sub-task
>  Components: Hive
>Affects Versions: 4.0.0
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Hive fails to execute joins on array type columns as the comparison functions 
> are not able to handle array and struct type columns.   





[jira] [Updated] (HIVE-2508) Join on union type fails

2021-05-23 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-2508:
--
Parent: HIVE-20962
Issue Type: Sub-task  (was: Bug)

> Join on union type fails
> 
>
> Key: HIVE-2508
> URL: https://issues.apache.org/jira/browse/HIVE-2508
> Project: Hive
>  Issue Type: Sub-task
>  Components: Query Processor
>Reporter: Ashutosh Chauhan
>Priority: Major
>  Labels: uniontype
>
> {code}
> hive> CREATE TABLE DEST1(key UNIONTYPE, value BIGINT) STORED 
> AS TEXTFILE;
> OK
> Time taken: 0.076 seconds
> hive> CREATE TABLE DEST2(key UNIONTYPE, value BIGINT) STORED 
> AS TEXTFILE;
> OK
> Time taken: 0.034 seconds
> hive> SELECT * FROM DEST1 JOIN DEST2 on (DEST1.key = DEST2.key);
> {code}





[jira] [Updated] (HIVE-25042) Add support for map data type in Common merge join and SMB Join

2021-05-23 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-25042:
---
Parent: HIVE-20962
Issue Type: Sub-task  (was: Bug)

> Add support for map data type in Common merge join and SMB Join
> ---
>
> Key: HIVE-25042
> URL: https://issues.apache.org/jira/browse/HIVE-25042
> Project: Hive
>  Issue Type: Sub-task
>  Components: Hive, HiveServer2
>Reporter: mahesh kumar behera
>Priority: Major
>
> Merge join results depend on the underlying sorter used by the mapper task, 
> as we need to judge the direction after each key comparison. So the 
> comparison done during the join has to match the way the records are sorted 
> by the mapper. As per the sorter used by the mapper task (PipelinedSorter), 
> hash maps with the same key-value pairs in a different order are not equal, 
> so the merge join also behaves that way. But map join treats them as equal. 
> We have to modify the PipelinedSorter code to handle the map datatype, and 
> then support has to be added in the join code for map types.
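The order-sensitivity mismatch described above can be illustrated with plain JDK maps. This is a hypothetical demonstration, not Hive's serialization code: a comparison that reflects insertion order (as a sorter over serialized records effectively does) distinguishes {a=1, b=2} from {b=2, a=1}, while `Map.equals` (the behaviour map join relies on) treats them as equal; sorting entries by key before comparison gives a canonical order.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo of order-sensitive vs. order-insensitive map comparison.
public class MapOrderDemo {
    // Reflects insertion order, like a byte-wise comparison of records
    // serialized in the order entries were written.
    static String insertionOrderKey(LinkedHashMap<String, Integer> m) {
        return m.toString();
    }

    // Canonical form: entries sorted by key, so entry order no longer matters.
    static String canonicalKey(Map<String, Integer> m) {
        return new TreeMap<>(m).toString();
    }

    public static void main(String[] args) {
        LinkedHashMap<String, Integer> m1 = new LinkedHashMap<>();
        m1.put("a", 1);
        m1.put("b", 2);
        LinkedHashMap<String, Integer> m2 = new LinkedHashMap<>();
        m2.put("b", 2);
        m2.put("a", 1);

        System.out.println(m1.equals(m2));                                       // true
        System.out.println(insertionOrderKey(m1).equals(insertionOrderKey(m2))); // false
        System.out.println(canonicalKey(m1).equals(canonicalKey(m2)));           // true
    }
}
```

The false result in the middle line is the source of the inconsistency: two logically equal maps compare unequal when their serialized forms are compared byte by byte.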





[jira] [Updated] (HIVE-25042) Add support for map data type in Common merge join and SMB Join

2021-05-23 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-25042:
---
Parent: (was: HIVE-24883)
Issue Type: Bug  (was: Sub-task)

> Add support for map data type in Common merge join and SMB Join
> ---
>
> Key: HIVE-25042
> URL: https://issues.apache.org/jira/browse/HIVE-25042
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, HiveServer2
>Reporter: mahesh kumar behera
>Priority: Major
>
> Merge join results depend on the underlying sorter used by the mapper task, 
> as we need to judge the direction after each key comparison. So the 
> comparison done during the join has to match the way the records are sorted 
> by the mapper. As per the sorter used by the mapper task (PipelinedSorter), 
> hash maps with the same key-value pairs in a different order are not equal, 
> so the merge join also behaves that way. But map join treats them as equal. 
> We have to modify the PipelinedSorter code to handle the map datatype, and 
> then support has to be added in the join code for map types.





[jira] [Updated] (HIVE-24995) Add support for complex type operator in Join with non equality condition

2021-05-23 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-24995:
---
Parent: HIVE-20962
Issue Type: Sub-task  (was: Bug)

> Add support for complex type operator in Join with non equality condition 
> --
>
> Key: HIVE-24995
> URL: https://issues.apache.org/jira/browse/HIVE-24995
> Project: Hive
>  Issue Type: Sub-task
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> This subtask is specifically to support non-equality comparisons, such as 
> greater than and less than, as join conditions.





[jira] [Updated] (HIVE-24995) Add support for complex type operator in Join with non equality condition

2021-05-23 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-24995:
---
Parent: (was: HIVE-24883)
Issue Type: Bug  (was: Sub-task)

> Add support for complex type operator in Join with non equality condition 
> --
>
> Key: HIVE-24995
> URL: https://issues.apache.org/jira/browse/HIVE-24995
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> This subtask is specifically to support non-equality comparisons, such as 
> greater than and less than, as join conditions.





[jira] [Updated] (HIVE-24883) Support ARRAY/STRUCT types in equality SMB and Common merge join

2021-05-23 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-24883:
---
Description: Hive fails to execute joins on array type columns as the 
comparison functions are not able to handle array and struct type columns.     
(was: Hive fails to execute joins on array type columns as the comparison 
functions are not able to handle array type columns.   )

> Support ARRAY/STRUCT  types in equality SMB and Common merge join
> -
>
> Key: HIVE-24883
> URL: https://issues.apache.org/jira/browse/HIVE-24883
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 4.0.0
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Hive fails to execute joins on array type columns as the comparison functions 
> are not able to handle array and struct type columns.   





[jira] [Updated] (HIVE-24883) Support ARRAY/STRUCT types in equality SMB and Common merge join

2021-05-23 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-24883:
---
Summary: Support ARRAY/STRUCT  types in equality SMB and Common merge join  
(was: Add support for complex types columns in Hive Joins)

> Support ARRAY/STRUCT  types in equality SMB and Common merge join
> -
>
> Key: HIVE-24883
> URL: https://issues.apache.org/jira/browse/HIVE-24883
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 4.0.0
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Hive fails to execute joins on array type columns as the comparison functions 
> are not able to handle array type columns.   





[jira] [Assigned] (HIVE-25142) Rehashing in map join fast hash table causing corruption for large keys

2021-05-19 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-25142:
--


> Rehashing in map join fast hash table  causing corruption for large keys
> 
>
> Key: HIVE-25142
> URL: https://issues.apache.org/jira/browse/HIVE-25142
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> In map join, the hash table is built from the keys. To support rehashing, 
> the keys are stored in a write buffer, and the hash table holds each key's 
> offset along with its hash code. When rehashing is done, the offset is 
> extracted from the hash table and the hash code is generated again from the 
> key. For keys larger than 255 bytes, the key length is also stored along with 
> the key. In the fast hash table implementation the key is not extracted 
> correctly in this case: a code bug causes the wrong key bytes to be read, 
> which produces a wrong hash code and corrupts the hash table.
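The large-key layout described above can be sketched as follows. This is a simplified illustration, not Hive's actual VectorMapJoinFast* code; the one-byte-length convention and the zero marker for large keys are assumptions made for the example. The point is that the rehash path must honor the large-key marker when re-reading a key from its offset, or the recomputed hash code is wrong.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Simplified sketch of a write buffer storing length-prefixed keys.
// Keys up to 255 bytes store their length in one byte; longer keys store a
// zero marker followed by the real length as an int. readKey shows the
// correct extraction that rehashing must perform.
public class KeyBufferSketch {
    static final int SMALL_KEY_MAX = 255;

    // Append a key to the buffer and return its offset.
    static int writeKey(ByteBuffer buf, byte[] key) {
        int offset = buf.position();
        if (key.length <= SMALL_KEY_MAX) {
            buf.put((byte) key.length);
        } else {
            buf.put((byte) 0);      // marker: real length follows as an int
            buf.putInt(key.length);
        }
        buf.put(key);
        return offset;
    }

    // Re-read a key from its offset, honoring the large-key marker.
    static byte[] readKey(ByteBuffer buf, int offset) {
        int len = buf.get(offset) & 0xFF;
        int dataStart = offset + 1;
        if (len == 0) {             // large key: skip marker, read int length
            len = buf.getInt(dataStart);
            dataStart += 4;
        }
        byte[] key = new byte[len];
        for (int i = 0; i < len; i++) {
            key[i] = buf.get(dataStart + i);
        }
        return key;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(1 << 20);
        byte[] large = new byte[300];
        Arrays.fill(large, (byte) 7);
        int off = writeKey(buf, large);
        // During rehash the key is re-read from its offset; it must round-trip,
        // otherwise the regenerated hash code places it in the wrong slot.
        if (!Arrays.equals(readKey(buf, off), large)) {
            throw new AssertionError("large key corrupted on re-read");
        }
        System.out.println("round-trip ok");
    }
}
```

If `readKey` ignored the zero marker and treated the marker byte itself as the length, the re-read key would be empty and every rehashed large key would land in a wrong bucket, which matches the corruption described in the issue.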





[jira] [Updated] (HIVE-24663) Batch process in ColStatsProcessor for partitions.

2021-05-13 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-24663:
---
Summary: Batch process in ColStatsProcessor for partitions.  (was: Batch 
process in ColStatsProcessor)

> Batch process in ColStatsProcessor for partitions.
> --
>
> Key: HIVE-24663
> URL: https://issues.apache.org/jira/browse/HIVE-24663
> Project: Hive
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: performance
>
> When a large number of partitions (>20K) is processed, ColStatsProcessor runs 
> into DB issues: 
> {{ db.setPartitionColumnStatistics(request);}} gets stuck for hours, and in 
> some cases Postgres stops processing. 
> It would be good to introduce small batches for stats gathering in 
> ColStatsProcessor instead of a single bulk update.
> Ref: 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/ColStatsProcessor.java#L181
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/ColStatsProcessor.java#L199





[jira] [Commented] (HIVE-24663) Batch process in ColStatsProcessor

2021-05-13 Thread mahesh kumar behera (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343755#comment-17343755
 ] 

mahesh kumar behera commented on HIVE-24663:


The original slowness is caused by the way column stats are processed at the 
HMS: the stats are updated one by one over JDO connections, which hurts 
performance because JDO does a lot of conversion. So the proper fix is to batch 
the processing into single SQL statements and execute them using direct SQL. 
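The batching idea above can be sketched as follows. This is an illustrative sketch, not Hive's actual direct-SQL code; the batch size, the column subset, and the exact statement shape are assumptions. The essence is replacing one JDO update per partition with one parameterized multi-row statement per batch.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of batching partition stats updates: split the work into fixed-size
// batches and build a single multi-row insert per batch instead of issuing
// one statement (and one JDO round trip) per partition.
public class StatsBatcher {
    // Split a list of items into batches of at most batchSize.
    static <T> List<List<T>> toBatches(List<T> items, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            batches.add(new ArrayList<>(
                items.subList(i, Math.min(i + batchSize, items.size()))));
        }
        return batches;
    }

    // Build one multi-row parameterized insert for a batch of the given size.
    // Table and column names here are illustrative only.
    static String buildInsert(int rowsInBatch) {
        StringBuilder sql = new StringBuilder(
            "INSERT INTO \"PART_COL_STATS\" "
            + "(\"PART_ID\", \"COLUMN_NAME\", \"NUM_NULLS\") VALUES ");
        for (int i = 0; i < rowsInBatch; i++) {
            sql.append(i == 0 ? "(?, ?, ?)" : ", (?, ?, ?)");
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        List<Integer> partIds = new ArrayList<>();
        for (int i = 0; i < 12; i++) {
            partIds.add(i);
        }
        // 12 partitions with batch size 5 -> 3 statements instead of 12.
        List<List<Integer>> batches = toBatches(partIds, 5);
        System.out.println(batches.size() + " batches");
        System.out.println(buildInsert(2));
    }
}
```

Each batch then executes once with all its parameters bound, so the per-statement overhead is paid per batch rather than per partition.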

> Batch process in ColStatsProcessor
> --
>
> Key: HIVE-24663
> URL: https://issues.apache.org/jira/browse/HIVE-24663
> Project: Hive
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: performance
>
> When a large number of partitions (>20K) is processed, ColStatsProcessor runs 
> into DB issues: 
> {{ db.setPartitionColumnStatistics(request);}} gets stuck for hours, and in 
> some cases Postgres stops processing. 
> It would be good to introduce small batches for stats gathering in 
> ColStatsProcessor instead of a single bulk update.
> Ref: 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/ColStatsProcessor.java#L181
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/ColStatsProcessor.java#L199





[jira] [Assigned] (HIVE-24663) Batch process in ColStatsProcessor

2021-05-13 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-24663:
--

Assignee: mahesh kumar behera

> Batch process in ColStatsProcessor
> --
>
> Key: HIVE-24663
> URL: https://issues.apache.org/jira/browse/HIVE-24663
> Project: Hive
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: performance
>
> When a large number of partitions (>20K) is processed, ColStatsProcessor runs 
> into DB issues: 
> {{ db.setPartitionColumnStatistics(request);}} gets stuck for hours, and in 
> some cases Postgres stops processing. 
> It would be good to introduce small batches for stats gathering in 
> ColStatsProcessor instead of a single bulk update.
> Ref: 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/ColStatsProcessor.java#L181
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/ColStatsProcessor.java#L199





[jira] [Assigned] (HIVE-24996) Conversion of PIG script with multiple store causing the merging of multiple sql statements

2021-04-09 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-24996:
--


> Conversion of PIG script with multiple store causing the merging of multiple 
> sql statements
> ---
>
> Key: HIVE-24996
> URL: https://issues.apache.org/jira/browse/HIVE-24996
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> The SQL writer is not reset after a SQL statement is converted. This causes 
> the next SQL statements to be merged with the previous one.





[jira] [Assigned] (HIVE-24995) Add support for complex type operator in Join with non equality condition

2021-04-09 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-24995:
--


> Add support for complex type operator in Join with non equality condition 
> --
>
> Key: HIVE-24995
> URL: https://issues.apache.org/jira/browse/HIVE-24995
> Project: Hive
>  Issue Type: Sub-task
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> This subtask specifically adds support for non-equality comparisons, such as 
> greater than and less than, as join conditions. 





[jira] [Updated] (HIVE-24989) Support vectorisation of join with key columns of complex types

2021-04-07 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-24989:
---
Description: 
Support for complex types is not present in addKey.
{code:java}
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unexpected column 
vector type LISTCaused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
Unexpected column vector type LIST at 
org.apache.hadoop.hive.ql.exec.vector.VectorColumnSetInfo.addKey(VectorColumnSetInfo.java:138)
 at 
org.apache.hadoop.hive.ql.exec.vector.wrapper.VectorHashKeyWrapperBatch.compileKeyWrapperBatch(VectorHashKeyWrapperBatch.java:913)
 at 
org.apache.hadoop.hive.ql.exec.vector.wrapper.VectorHashKeyWrapperBatch.compileKeyWrapperBatch(VectorHashKeyWrapperBatch.java:894)
 at 
org.apache.hadoop.hive.ql.exec.vector.VectorMapJoinOperator.initializeOp(VectorMapJoinOperator.java:137)
 at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:360) at 
org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:549) at 
org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:503) 
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:369) at 
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:332)
  {code}

  was:Hive fails to execute joins on array type columns as the comparison 
functions are not able to handle array type columns.   


> Support vectorisation of join with key columns of complex types
> ---
>
> Key: HIVE-24989
> URL: https://issues.apache.org/jira/browse/HIVE-24989
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 4.0.0
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Support for complex types is not present in addKey.
> {code:java}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unexpected 
> column vector type LISTCaused by: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unexpected column vector 
> type LIST at 
> org.apache.hadoop.hive.ql.exec.vector.VectorColumnSetInfo.addKey(VectorColumnSetInfo.java:138)
>  at 
> org.apache.hadoop.hive.ql.exec.vector.wrapper.VectorHashKeyWrapperBatch.compileKeyWrapperBatch(VectorHashKeyWrapperBatch.java:913)
>  at 
> org.apache.hadoop.hive.ql.exec.vector.wrapper.VectorHashKeyWrapperBatch.compileKeyWrapperBatch(VectorHashKeyWrapperBatch.java:894)
>  at 
> org.apache.hadoop.hive.ql.exec.vector.VectorMapJoinOperator.initializeOp(VectorMapJoinOperator.java:137)
>  at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:360) at 
> org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:549) at 
> org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:503) 
> at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:369) at 
> org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:332)
>   {code}





[jira] [Assigned] (HIVE-24989) Support vectorisation of join with key columns of complex types

2021-04-07 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-24989:
--


> Support vectorisation of join with key columns of complex types
> ---
>
> Key: HIVE-24989
> URL: https://issues.apache.org/jira/browse/HIVE-24989
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 4.0.0
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Hive fails to execute joins on array type columns as the comparison functions 
> are not able to handle array type columns.   





[jira] [Updated] (HIVE-24988) Add support for complex types columns for Dynamic Partition pruning Optimisation

2021-04-07 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-24988:
---
Description: DynamicPartitionPruningOptimization fails for complex types.   
 (was: Hive fails to execute joins on array type columns as the comparison 
functions are not able to handle array type columns.   )

> Add support for complex types columns for Dynamic Partition pruning 
> Optimisation
> 
>
> Key: HIVE-24988
> URL: https://issues.apache.org/jira/browse/HIVE-24988
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 4.0.0
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> DynamicPartitionPruningOptimization fails for complex types.  





[jira] [Assigned] (HIVE-24988) Add support for complex types columns for Dynamic Partition pruning Optimisation

2021-04-07 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-24988:
--


> Add support for complex types columns for Dynamic Partition pruning 
> Optimisation
> 
>
> Key: HIVE-24988
> URL: https://issues.apache.org/jira/browse/HIVE-24988
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 4.0.0
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Hive fails to execute joins on array type columns as the comparison functions 
> are not able to handle array type columns.   





[jira] [Updated] (HIVE-24883) Add support for complex types columns in Hive Joins

2021-04-07 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-24883:
---
Summary: Add support for complex types columns in Hive Joins  (was: Add 
support for array type columns in Hive Joins)

> Add support for complex types columns in Hive Joins
> ---
>
> Key: HIVE-24883
> URL: https://issues.apache.org/jira/browse/HIVE-24883
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 4.0.0
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hive fails to execute joins on array type columns as the comparison functions 
> are not able to handle array type columns.   





[jira] [Resolved] (HIVE-24977) Query compilation failing with NPE during reduce sink deduplication

2021-04-06 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera resolved HIVE-24977.

Resolution: Duplicate

> Query compilation failing with NPE during reduce sink deduplication
> ---
>
> Key: HIVE-24977
> URL: https://issues.apache.org/jira/browse/HIVE-24977
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
> Attachments: 24977-failing-query.txt
>
>
> During reduce sink deduplication, if some columns from the RS cannot be 
> backtracked to a terminal operator, null is returned. The null check is 
> present in some cases but missing in others. 
>  





[jira] [Updated] (HIVE-24977) Query compilation failing with NPE during reduce sink deduplication

2021-04-05 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-24977:
---
Attachment: 24977-failing-query.txt

> Query compilation failing with NPE during reduce sink deduplication
> ---
>
> Key: HIVE-24977
> URL: https://issues.apache.org/jira/browse/HIVE-24977
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
> Attachments: 24977-failing-query.txt
>
>
> During reduce sink deduplication, if some columns from the RS cannot be 
> backtracked to a terminal operator, null is returned. The null check is 
> present in some cases but missing in others. 
>  





[jira] [Assigned] (HIVE-24977) Query compilation failing with NPE during reduce sink deduplication

2021-04-05 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-24977:
--


> Query compilation failing with NPE during reduce sink deduplication
> ---
>
> Key: HIVE-24977
> URL: https://issues.apache.org/jira/browse/HIVE-24977
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> During reduce sink deduplication, if some columns from the RS cannot be 
> backtracked to a terminal operator, null is returned. The null check is 
> present in some cases but missing in others. 
>  





[jira] [Assigned] (HIVE-24883) Add support for array type columns in Hive Joins

2021-03-13 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-24883:
--


> Add support for array type columns in Hive Joins
> 
>
> Key: HIVE-24883
> URL: https://issues.apache.org/jira/browse/HIVE-24883
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 4.0.0
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
> Fix For: 4.0.0
>
>
> Hive fails to execute joins on array type columns as the comparison functions 
> are not able to handle array type columns.   





[jira] [Resolved] (HIVE-24503) Optimize vector row serde by avoiding type check at run time

2021-02-01 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera resolved HIVE-24503.

Resolution: Fixed

> Optimize vector row serde by avoiding type check at run time 
> -
>
> Key: HIVE-24503
> URL: https://issues.apache.org/jira/browse/HIVE-24503
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Serialization and deserialization of vectorized batches in VectorSerializeRow 
> and VectorDeserializeRow perform a type check for each column of each row. 
> This becomes very costly when there are billions of rows to read or write. It 
> can be optimized by doing the type check once at init time and creating 
> type-specific reader/writer classes, which can then be stored directly in a 
> field structure to avoid the run-time type check.
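The init-time dispatch idea can be sketched as follows. The interface and method names here are illustrative, not Hive's actual VectorSerializeRow API: the type switch runs once per column at init time, and the per-row loop only invokes the already-resolved writer.

```java
// Sketch of moving a per-row type check to init time: a writer object is
// resolved once per column from the schema, then reused for every row with
// no further type dispatch.
public class InitTimeDispatch {
    interface ColumnWriter {
        void write(StringBuilder out, Object value);
    }

    // The type check happens once here, at init time, not per row.
    static ColumnWriter writerFor(String typeName) {
        switch (typeName) {
            case "int":
                return (out, v) -> out.append((Integer) v);
            case "string":
                return (out, v) -> out.append('"').append(v).append('"');
            default:
                throw new IllegalArgumentException("unsupported: " + typeName);
        }
    }

    public static void main(String[] args) {
        // One writer per column, created once from the schema.
        ColumnWriter[] writers = { writerFor("int"), writerFor("string") };
        Object[] row = { 42, "hive" };
        StringBuilder out = new StringBuilder();
        for (int c = 0; c < writers.length; c++) {
            if (c > 0) {
                out.append(',');
            }
            writers[c].write(out, row[c]);  // no per-row type check
        }
        System.out.println(out);  // 42,"hive"
    }
}
```

With billions of rows, removing one switch per column per row from the hot loop is the whole point of the optimization.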





[jira] [Resolved] (HIVE-24589) Drop catalog failing with deadlock error for Oracle backend dbms.

2021-02-01 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera resolved HIVE-24589.

Resolution: Fixed

> Drop catalog failing with deadlock error for Oracle backend dbms.
> -
>
> Key: HIVE-24589
> URL: https://issues.apache.org/jira/browse/HIVE-24589
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When we drop a catalog, the catalog is deleted from the CTLGS table. The DBS 
> table has a foreign key reference on CTLGS for CTLG_NAME, which causes the 
> DBS table to be locked exclusively and leads to deadlocks. This can be avoided 
> by creating an index on CTLG_NAME in the DBS table.
> {code:java}
> CREATE INDEX CTLG_NAME_DBS ON DBS(CTLG_NAME); {code}
> {code:java}
>  Oracle Database maximizes the concurrency control of parent keys in relation 
> to dependent foreign keys. Locking behaviour depends on whether foreign key 
> columns are indexed. If foreign keys are not indexed, then the child table 
> will probably be locked more frequently, deadlocks will occur, and 
> concurrency will be decreased. For this reason foreign keys should almost 
> always be indexed. The only exception is when the matching unique or primary 
> key is never updated or deleted.{code}
>  





[jira] [Assigned] (HIVE-24589) Drop catalog failing with deadlock error for Oracle backend dbms.

2021-01-05 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-24589:
--


> Drop catalog failing with deadlock error for Oracle backend dbms.
> -
>
> Key: HIVE-24589
> URL: https://issues.apache.org/jira/browse/HIVE-24589
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> When we drop a catalog, the catalog is deleted from the CTLGS table. The DBS 
> table has a foreign key reference on CTLGS for CTLG_NAME, which causes the 
> DBS table to be locked exclusively and leads to deadlocks. This can be avoided 
> by creating an index on CTLG_NAME in the DBS table.
> {code:java}
> CREATE INDEX CTLG_NAME_DBS ON DBS(CTLG_NAME); {code}
> {code:java}
>  Oracle Database maximizes the concurrency control of parent keys in relation 
> to dependent foreign keys. Locking behaviour depends on whether foreign key 
> columns are indexed. If foreign keys are not indexed, then the child table 
> will probably be locked more frequently, deadlocks will occur, and 
> concurrency will be decreased. For this reason foreign keys should almost 
> always be indexed. The only exception is when the matching unique or primary 
> key is never updated or deleted.{code}
>  





[jira] [Updated] (HIVE-24580) Add support for combiner in hash mode group aggregation (Support for distinct)

2021-01-03 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-24580:
---
Labels:   (was: pull-request-available)

> Add support for combiner in hash mode group aggregation (Support for distinct)
> --
>
> Key: HIVE-24580
> URL: https://issues.apache.org/jira/browse/HIVE-24580
> Project: Hive
>  Issue Type: Sub-task
>  Components: Hive
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> For distinct, the number of aggregation functions does not match the number 
> of value columns, and this needs special handling in the combiner logic.





[jira] [Updated] (HIVE-24580) Add support for combiner in hash mode group aggregation (Support for distinct)

2021-01-03 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-24580:
---
Description: For distinct the number of  aggregation function does not 
match with the number of value column and this needs special handling in the 
combiner logic.  (was: In map side group aggregation, partial grouped 
aggregation is calculated to reduce the data written to disk by map task. In 
case of hash aggregation, where the input data is not sorted, hash table is 
used (with sorting also being performed before flushing). If the hash table 
size increases beyond configurable limit, data is flushed to disk and new hash 
table is generated. If the reduction by hash table is less than min hash 
aggregation reduction calculated during compile time, the map side aggregation 
is converted to streaming mode. So if the first few batch of records does not 
result into significant reduction, then the mode is switched to streaming mode. 
This may have impact on performance, if the subsequent batch of records have 
less number of distinct values. 

To improve performance both in Hash and Streaming mode, a combiner can be added 
to the map task after the keys are sorted. This will make sure that the 
aggregation is done if possible and reduce the data written to disk.)

> Add support for combiner in hash mode group aggregation (Support for distinct)
> --
>
> Key: HIVE-24580
> URL: https://issues.apache.org/jira/browse/HIVE-24580
> Project: Hive
>  Issue Type: Sub-task
>  Components: Hive
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>
> For distinct, the number of aggregation functions does not match the number 
> of value columns, and this needs special handling in the combiner logic.





[jira] [Updated] (HIVE-24580) Add support for combiner in hash mode group aggregation (Support for distinct)

2021-01-03 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-24580:
---
Parent: HIVE-24471
Issue Type: Sub-task  (was: Bug)

> Add support for combiner in hash mode group aggregation (Support for distinct)
> --
>
> Key: HIVE-24580
> URL: https://issues.apache.org/jira/browse/HIVE-24580
> Project: Hive
>  Issue Type: Sub-task
>  Components: Hive
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>
> In map-side group aggregation, a partial grouped aggregation is calculated to 
> reduce the data written to disk by the map task. In the hash aggregation case, 
> where the input data is not sorted, a hash table is used (with sorting also 
> being performed before flushing). If the hash table size grows beyond a 
> configurable limit, the data is flushed to disk and a new hash table is 
> created. If the reduction achieved by the hash table is less than the minimum 
> hash aggregation reduction calculated at compile time, the map-side 
> aggregation is converted to streaming mode. So if the first few batches of 
> records do not achieve a significant reduction, the mode is switched to 
> streaming, which may hurt performance if subsequent batches have fewer 
> distinct values. 
> To improve performance in both hash and streaming mode, a combiner can be 
> added to the map task after the keys are sorted. This ensures that the 
> aggregation is done where possible and reduces the data written to disk.
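The combiner step described above can be sketched for a SUM aggregate as follows. This is an illustrative sketch, not Hive's actual Tez combiner code: because the map output is already sorted by key, equal keys are adjacent, so merging them is a single linear pass before the data is spilled to disk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of a combiner over sorted map output for a SUM aggregate: adjacent
// entries with equal keys are merged, shrinking what gets written to disk
// even when the map-side hash table fell back to streaming mode.
public class SortedCombiner {
    static List<Map.Entry<String, Long>> combine(List<Map.Entry<String, Long>> sorted) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : sorted) {
            int last = out.size() - 1;
            if (last >= 0 && out.get(last).getKey().equals(e.getKey())) {
                // Same key as the previous output row: fold into it.
                out.set(last, Map.entry(e.getKey(),
                        out.get(last).getValue() + e.getValue()));
            } else {
                out.add(Map.entry(e.getKey(), e.getValue()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Sorted map output: three rows collapse to two before the spill.
        List<Map.Entry<String, Long>> sorted = List.of(
            Map.entry("a", 1L), Map.entry("a", 2L), Map.entry("b", 5L));
        System.out.println(combine(sorted));  // [a=3, b=5]
    }
}
```

The single pass relies entirely on the sort order; without sorted input, the combiner would itself need a hash table, which is exactly the structure the streaming fallback gave up on.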





[jira] [Assigned] (HIVE-24580) Add support for combiner in hash mode group aggregation (Support for distinct)

2021-01-03 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-24580:
--


> Add support for combiner in hash mode group aggregation (Support for distinct)
> --
>
> Key: HIVE-24580
> URL: https://issues.apache.org/jira/browse/HIVE-24580
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>
> In map-side group aggregation, a partial grouped aggregation is calculated to 
> reduce the data written to disk by the map task. In the hash aggregation case, 
> where the input data is not sorted, a hash table is used (with sorting also 
> being performed before flushing). If the hash table size grows beyond a 
> configurable limit, the data is flushed to disk and a new hash table is 
> created. If the reduction achieved by the hash table is less than the minimum 
> hash aggregation reduction calculated at compile time, the map-side 
> aggregation is converted to streaming mode. So if the first few batches of 
> records do not achieve a significant reduction, the mode is switched to 
> streaming, which may hurt performance if subsequent batches have fewer 
> distinct values. 
> To improve performance in both hash and streaming mode, a combiner can be 
> added to the map task after the keys are sorted. This ensures that the 
> aggregation is done where possible and reduces the data written to disk.





[jira] [Assigned] (HIVE-24515) Analyze table job can be skipped when stats populated are already accurate

2020-12-13 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-24515:
--

Assignee: mahesh kumar behera

> Analyze table job can be skipped when stats populated are already accurate
> --
>
> Key: HIVE-24515
> URL: https://issues.apache.org/jira/browse/HIVE-24515
> Project: Hive
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Assignee: mahesh kumar behera
>Priority: Major
>
> For non-partitioned tables, stats detail should be present at the table level,
> e.g.
> {noformat}
> COLUMN_STATS_ACCURATE={"BASIC_STATS":"true","COLUMN_STATS":{"d_current_day":"true"...
>  }}
>   {noformat}
> For partitioned tables, stats detail should be present at the partition level,
> {noformat}
> store_sales(ss_sold_date_sk=2451819)
> {totalSize=0, numRows=0, rawDataSize=0, 
> COLUMN_STATS_ACCURATE={"BASIC_STATS":"true","COLUMN_STATS":{"ss_addr_sk":"true"}}
>  
>  {noformat}
> When stats populated are already accurate, {{analyze table tn compute 
> statistics for columns}} should skip launching the job.
>  
> For ACID tables, stats are auto computed and it can skip computing stats 
> again when stats are accurate.
>  
>  
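The skip check described above can be sketched as follows. `COLUMN_STATS_ACCURATE` is a real Hive table/partition parameter, but the string-matching parse here is a deliberate simplification (a real implementation would parse the JSON properly); the method name is hypothetical.

```java
import java.util.Map;

// Sketch of the "skip analyze" decision: if COLUMN_STATS_ACCURATE already
// marks basic stats and every requested column as accurate, the analyze
// job can return without launching anything. The substring matching below
// stands in for real JSON parsing.
public class AnalyzeSkipCheck {
    static boolean canSkip(Map<String, String> tableParams, String... columns) {
        String accurate = tableParams.get("COLUMN_STATS_ACCURATE");
        if (accurate == null || !accurate.contains("\"BASIC_STATS\":\"true\"")) {
            return false;   // basic stats not accurate: must recompute
        }
        for (String col : columns) {
            if (!accurate.contains("\"" + col + "\":\"true\"")) {
                return false;   // at least one requested column is stale
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> params = Map.of("COLUMN_STATS_ACCURATE",
            "{\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"ss_addr_sk\":\"true\"}}");
        System.out.println(canSkip(params, "ss_addr_sk"));   // true
        System.out.println(canSkip(params, "ss_item_sk"));   // false
    }
}
```

For partitioned tables the same check would run per partition, skipping the job only if every partition in scope passes.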





[jira] [Updated] (HIVE-24503) Optimize vector row serde by avoiding type check at run time

2020-12-08 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-24503:
---
Description: Serialization/Deserialization of vectorized batch done at 
VectorSerializeRow and VectorDeserializeRow does a type checking for each 
column of each row. This becomes very costly when there are billions of rows to 
read/write. This can be optimized if the type check is done during init time 
and specific reader/writer classes are created. This classes can be used 
directly stored in filed structure to avoid run time type check.  (was: 
Serialization/Deserialization of vectorized batch done at VectorSerializeRow 
and VectorDeserializeRow does a type checking for each column of each row. This 
becomes very costly when there are billions of rows to read/write. This can be 
optimized if the type check is done during init time and specific reader/writer 
classes are created. )

> Optimize vector row serde by avoiding type check at run time 
> -
>
> Key: HIVE-24503
> URL: https://issues.apache.org/jira/browse/HIVE-24503
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> Serialization and deserialization of vectorized batches in VectorSerializeRow 
> and VectorDeserializeRow perform a type check for each column of each row. 
> This becomes very costly when there are billions of rows to read or write. It 
> can be optimized by doing the type check once at init time and creating 
> type-specific reader/writer classes, which can then be stored directly in a 
> field structure to avoid the run-time type check.




