[jira] [Created] (HIVE-23851) MSCK REPAIR Command With Partition Filtering Fails While Dropping Partitions

2020-07-15 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-23851:


 Summary: MSCK REPAIR Command With Partition Filtering Fails While 
Dropping Partitions
 Key: HIVE-23851
 URL: https://issues.apache.org/jira/browse/HIVE-23851
 Project: Hive
  Issue Type: Bug
Affects Versions: 4.0.0
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman


*Steps to reproduce:*
# Create an external partitioned table
# Run the msck command to sync all the partitions with the metastore
# Remove one of the partition paths
# Run msck repair with partition filtering
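The steps above can be sketched in HiveQL as follows (table, column, and path names are illustrative, and the partition-filtering syntax follows the proposal in HIVE-22957):

```sql
-- 1. Create an external partitioned table
CREATE EXTERNAL TABLE repro_tbl (val INT)
PARTITIONED BY (part STRING)
LOCATION '/warehouse/repro_tbl';

-- 2. Sync all partitions with the metastore
MSCK REPAIR TABLE repro_tbl;

-- 3. Remove one partition path directly on the filesystem, e.g.
--    hdfs dfs -rm -r /warehouse/repro_tbl/part=p1

-- 4. Run MSCK repair with partition filtering; this fails with
--    "Failed to deserialize the expression"
MSCK REPAIR TABLE repro_tbl DROP PARTITIONS WHERE part = 'p1';
```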

*Stack Trace:*
{code:java}
 2020-07-15T02:10:29,045 ERROR [4dad298b-28b1-4e6b-94b6-aa785b60c576 main] 
ppr.PartitionExpressionForMetastore: Failed to deserialize the expression
 java.lang.IndexOutOfBoundsException: Index: 110, Size: 0
 at java.util.ArrayList.rangeCheck(ArrayList.java:657) ~[?:1.8.0_192]
 at java.util.ArrayList.get(ArrayList.java:433) ~[?:1.8.0_192]
 at 
org.apache.hive.com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
 at 
org.apache.hive.com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:857)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:707) 
~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
 at 
org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readObject(SerializationUtilities.java:211)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
 at 
org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeObjectFromKryo(SerializationUtilities.java:806)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
 at 
org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeExpressionFromKryo(SerializationUtilities.java:775)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
 at 
org.apache.hadoop.hive.ql.optimizer.ppr.PartitionExpressionForMetastore.deserializeExpr(PartitionExpressionForMetastore.java:96)
 [hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
 at 
org.apache.hadoop.hive.ql.optimizer.ppr.PartitionExpressionForMetastore.convertExprToFilter(PartitionExpressionForMetastore.java:52)
 [hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
 at 
org.apache.hadoop.hive.metastore.PartFilterExprUtil.makeExpressionTree(PartFilterExprUtil.java:48)
 [hive-standalone-metastore-server-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
 at 
org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByExprInternal(ObjectStore.java:3593)
 [hive-standalone-metastore-server-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
 at 
org.apache.hadoop.hive.metastore.VerifyingObjectStore.getPartitionsByExpr(VerifyingObjectStore.java:80)
 [hive-standalone-metastore-server-4.0.0-SNAPSHOT-tests.jar:4.0.0-SNAPSHOT]
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_192]
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
~[?:1.8.0_192]
{code}

*Cause:*
In the case of MSCK REPAIR with partition filtering, the expression proxy 
class is expected to be set to PartitionExpressionForMetastore ( 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/ddl/misc/msck/MsckAnalyzer.java#L78
 ). While dropping partitions, however, the drop-partition filter expression is 
serialized as in ( 
https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/Msck.java#L589
 ), which is incompatible with the deserialization done in 
PartitionExpressionForMetastore ( 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ppr/PartitionExpressionForMetastore.java#L52
 ), hence the query fails with "Failed to deserialize the expression".

*Solutions*:
I can think of two approaches to this problem:
# Since PartitionExpressionForMetastore is required only during the partition 
pruning step, we can switch the expression proxy class back to 
MsckPartitionExpressionProxy once the partition pruning step is done.
# Make the serialization of the MSCK drop-partition filter expression 
compatible with PartitionExpressionForMetastore. We can do this via reflection, 
since the drop-partition serialization happens in the Msck class 
(standalone-metastore). This way we can completely remove the 
MsckPartitionExpressionProxy class, which also simplifies MSCK REPAIR with 
partition filtering (no need to set the expression proxy class config).
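A minimal, self-contained sketch of the reflection idea (the class and method names below are illustrative, not Hive's real API): the metastore-side code looks up a ql-side serializer by name at runtime, avoiding a compile-time dependency between the modules.

```java
import java.lang.reflect.Method;
import java.nio.charset.StandardCharsets;

// Stand-in for a ql-module serializer that the metastore module cannot
// reference at compile time (illustrative, not Hive's actual class).
class QlSideSerializer {
    public static byte[] serializeExpression(String expr) {
        return expr.getBytes(StandardCharsets.UTF_8);
    }
}

public class ReflectionSketch {
    // Metastore-side caller: loads the serializer by name via reflection,
    // the way Msck (standalone-metastore) could reuse the ql-side
    // serialization format without a direct dependency.
    static byte[] serializeViaReflection(String expr) {
        try {
            Class<?> c = Class.forName("QlSideSerializer");
            Method m = c.getMethod("serializeExpression", String.class);
            return (byte[]) m.invoke(null, expr);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] out = serializeViaReflection("year=2020");
        System.out.println(new String(out, StandardCharsets.UTF_8)); // year=2020
    }
}
```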

I am personally inclined towards the 2nd approach. Before moving on, I want to 
know whether this is the best approach or whether there is a better/easier way 
to solve this problem.

PS: The qtest added in HIVE-22957 mainly focused on adding missing partitions; 
a case for dropping partitions was not added.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23751) QTest: Override #mkdirs() method in ProxyFileSystem To Align After HADOOP-16582

2020-06-23 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-23751:


 Summary: QTest: Override #mkdirs() method in ProxyFileSystem To 
Align After HADOOP-16582
 Key: HIVE-23751
 URL: https://issues.apache.org/jira/browse/HIVE-23751
 Project: Hive
  Issue Type: Task
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
 Fix For: 4.0.0, 3.2.0


HADOOP-16582 has changed the way mkdirs() works:

*Before HADOOP-16582:*
All calls to mkdirs(p) were fast-tracked to FileSystem.mkdirs, which then 
re-routed them to the mkdirs(p, permission) method. For ProxyFileSystem the 
call chain looked like:

{code:java}
FileUtils.mkdir(p) ---> FileSystem.mkdirs(p) ---> ProxyFileSystem.mkdirs(p, permission)
{code}
An implementation of FileSystem only needed to implement mkdirs(p, permission).


*After HADOOP-16582:*

Since FilterFileSystem now overrides the mkdirs(p) method, the new call chain 
for ProxyFileSystem looks like:

{code:java}
FileUtils.mkdir(p) ---> FilterFileSystem.mkdirs(p)   // ProxyFileSystem.mkdirs(p, permission) is never reached
{code}

This makes all the qtests fail with the exception below:
{code:java}
Caused by: java.lang.IllegalArgumentException: Wrong FS: 
pfile:/media/ebs1/workspace/hive-3.1-qtest/group/5/label/HiveQTest/hive-1.2.0/itests/qtest/target/warehouse/dest1,
 expected: file:///
{code}
Note: We will hit this issue when we bump up the Hadoop version in Hive.

So, as per the discussion in HADOOP-16963, ProxyFileSystem needs to override 
the mkdirs(p) method in order to solve the above problem. The new flow would 
look like:


{code:java}
FileUtils.mkdir(p) ---> ProxyFileSystem.mkdirs(p) ---> ProxyFileSystem.mkdirs(p, permission)
{code}
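The dispatch problem above can be modeled with plain Java classes (the class names mirror the Hadoop/Hive roles, but this is a toy sketch, not the real FileSystem API): without its own mkdirs(p) override, the proxy inherits FilterFileSystem's version, so its path-translating mkdirs(p, permission) is bypassed.

```java
// Toy model of FileSystem (BaseFs), FilterFileSystem (FilterFs), and
// ProxyFileSystem; names and behavior are illustrative, not the Hadoop API.
class BaseFs {
    String mkdirs(String p) { return mkdirs(p, "rwx"); }          // pre-HADOOP-16582 routing
    String mkdirs(String p, String perm) { return "mkdir:" + p; } // the raw filesystem call
}

class FilterFs extends BaseFs {
    // Post-HADOOP-16582: delegates straight to the wrapped fs,
    // skipping the mkdirs(p, permission) overload.
    @Override String mkdirs(String p) { return "mkdir:" + p; }
}

class BrokenProxyFs extends FilterFs {
    // Translates pfile: paths, but is only reached via the two-arg overload.
    @Override String mkdirs(String p, String perm) {
        return super.mkdirs(p.replace("pfile:", "file:"), perm);
    }
}

class FixedProxyFs extends BrokenProxyFs {
    // The fix proposed here: override mkdirs(p) to route back through
    // the path-translating two-arg overload.
    @Override String mkdirs(String p) { return mkdirs(p, "rwx"); }
}

public class MkdirsDispatchSketch {
    static String broken() { return new BrokenProxyFs().mkdirs("pfile:/warehouse"); }
    static String fixed()  { return new FixedProxyFs().mkdirs("pfile:/warehouse"); }

    public static void main(String[] args) {
        System.out.println(broken()); // mkdir:pfile:/warehouse  -> the "Wrong FS" case
        System.out.println(fixed());  // mkdir:file:/warehouse
    }
}
```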







[jira] [Created] (HIVE-23737) LLAP: Reuse dagDelete Feature Of Tez Custom Shuffle Handler Instead Of LLAP's dagDelete

2020-06-22 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-23737:


 Summary: LLAP: Reuse dagDelete Feature Of Tez Custom Shuffle 
Handler Instead Of LLAP's dagDelete
 Key: HIVE-23737
 URL: https://issues.apache.org/jira/browse/HIVE-23737
 Project: Hive
  Issue Type: Improvement
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman


LLAP has a dagDelete feature, added as part of HIVE-9911. Now that Tez has 
added support for dagDelete in its custom shuffle handler (TEZ-3362), we could 
reuse that feature in LLAP.
There are some advantages to using Tez's dagDelete feature rather than LLAP's 
current one:

1) We can easily extend the feature to accommodate upcoming work such as 
vertex-level and failed-task-attempt shuffle data cleanup. Refer to TEZ-3363 
and TEZ-4129.

2) It will be easier to maintain the feature by separating it out from Hive's 
code path.





[jira] [Created] (HIVE-23606) LLAP: Delay In DirectByteBuffer Clean Up For EncodedReaderImpl

2020-06-03 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-23606:


 Summary: LLAP: Delay In DirectByteBuffer Clean Up For 
EncodedReaderImpl
 Key: HIVE-23606
 URL: https://issues.apache.org/jira/browse/HIVE-23606
 Project: Hive
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
 Fix For: 4.0.0


DirectByteBuffers are only cleaned up when there is a full GC or when the 
DirectByteBuffer's cleaner method is invoked manually. Since a full GC may 
take some time to kick in, the native memory usage of the LLAP daemon process 
can meanwhile shoot up, forcing the YARN pmem monitor to kill the container 
running the daemon.

HIVE-16180 tried to solve this problem, but the code structure was broken by 
the restructuring in HIVE-15665.

The IdentityHashMap (toRelease) is initialized at 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedReaderImpl.java#L409
 but is re-initialized inside the method getDataFromCacheAndDisk() 
(https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedReaderImpl.java#L633),
 which makes it local to that method; hence the original toRelease 
IdentityHashMap remains empty.
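The bug pattern can be reproduced in a few lines (names are illustrative, modeled loosely on EncodedReaderImpl): the method declares a new local with the same name as the field, so buffers tracked for release never reach the field that the cleanup path reads.

```java
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.List;

public class ShadowingSketch {
    // Field meant to collect buffers to release, like toRelease in EncodedReaderImpl.
    private IdentityHashMap<Object, Boolean> toRelease = new IdentityHashMap<>();

    void readData(List<Object> buffers) {
        // BUG: this declares a new LOCAL variable that shadows the field,
        // so the field stays empty and the cleanup path releases nothing.
        IdentityHashMap<Object, Boolean> toRelease = new IdentityHashMap<>();
        for (Object b : buffers) {
            toRelease.put(b, Boolean.TRUE);
        }
    }

    int pendingReleases() { return toRelease.size(); }

    public static void main(String[] args) {
        ShadowingSketch reader = new ShadowingSketch();
        reader.readData(Arrays.asList(new Object(), new Object()));
        System.out.println(reader.pendingReleases()); // prints 0, not 2
    }
}
```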





[jira] [Created] (HIVE-23085) LLAP: Support Multiple NVMe-SSD disk Locations While Using SSD Cache

2020-03-26 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-23085:


 Summary: LLAP: Support Multiple NVMe-SSD disk Locations While 
Using SSD Cache
 Key: HIVE-23085
 URL: https://issues.apache.org/jira/browse/HIVE-23085
 Project: Hive
  Issue Type: Improvement
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
 Fix For: 4.0.0


Currently we can configure only one SSD location for the LLAP SSD cache. This 
prevents some machines from using their disk capacity to the fullest. For 
example, *AWS* provides the *r5d.4xlarge* instance type, which comes with *2 x 
300 GB NVMe SSD* disks, but with the current design only one of the mounted 
*NVMe SSD* disks can be used for caching. Hence, this issue adds support for 
caching data across multiple SSD mount locations.





[jira] [Created] (HIVE-22957) Support For FilterExp In MSCK Command

2020-03-02 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-22957:


 Summary: Support For FilterExp In MSCK Command
 Key: HIVE-22957
 URL: https://issues.apache.org/jira/browse/HIVE-22957
 Project: Hive
  Issue Type: Improvement
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
 Fix For: 4.0.0


Currently the MSCK command supports a full repair of a table (all partitions) 
or of a subset of partitions based on a partitionSpec. The aim of this jira is 
to introduce a filter expression (=, !=, <, >, >=, <=, LIKE) in the MSCK 
command so that a larger subset of partitions can be recovered (added/dropped) 
without running a full repair, which can take a long time when the number of 
partitions is huge.

*Approach*:

The initial approach is to add a WHERE clause to the MSCK command, e.g.: MSCK 
REPAIR TABLE <table_name> ADD|DROP|SYNC PARTITIONS WHERE <filter_exp1> AND 
<filter_exp2>

*Flow:*

1) Parse the WHERE clause and generate the filter expression.

2) Fetch all partitions from the metastore that match the filter expression.

3) Fetch all partition paths from the filesystem.

4) Remove all partition paths that do not match the filter expression.

5) Based on ADD | DROP | SYNC, perform the remaining steps.
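A sketch of what the proposed syntax might look like in use (table and partition column names are illustrative):

```sql
-- Recover all matching partitions in one command instead of a full repair
MSCK REPAIR TABLE sales SYNC PARTITIONS
WHERE logdate >= '2020-01-01' AND region LIKE 'us%';
```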





[jira] [Created] (HIVE-22900) Predicate Push Down Of Like Filter While Fetching Partition Data From MetaStore

2020-02-18 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-22900:


 Summary: Predicate Push Down Of Like Filter While Fetching 
Partition Data From MetaStore
 Key: HIVE-22900
 URL: https://issues.apache.org/jira/browse/HIVE-22900
 Project: Hive
  Issue Type: New Feature
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
 Fix For: 4.0.0


Currently PPD is disabled for LIKE filters while fetching partition data from 
the metastore. The following patch covers all the test cases mentioned in 
HIVE-5134.





[jira] [Created] (HIVE-22891) Skip PartitonDesc Extraction In CombineHiveRecord For Non-LLAP Execution Mode

2020-02-14 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-22891:


 Summary: Skip PartitonDesc Extraction In CombineHiveRecord For 
Non-LLAP Execution Mode
 Key: HIVE-22891
 URL: https://issues.apache.org/jira/browse/HIVE-22891
 Project: Hive
  Issue Type: Task
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
 Fix For: 4.0.0


{code:java}
try {
  // TODO: refactor this out
  if (pathToPartInfo == null) {
    MapWork mrwork;
    if (HiveConf.getVar(conf, HiveConf.ConfVars.HIVE_EXECUTION_ENGINE).equals("tez")) {
      mrwork = (MapWork) Utilities.getMergeWork(jobConf);
      if (mrwork == null) {
        mrwork = Utilities.getMapWork(jobConf);
      }
    } else {
      mrwork = Utilities.getMapWork(jobConf);
    }
    pathToPartInfo = mrwork.getPathToPartitionInfo();
  }

  PartitionDesc part = extractSinglePartSpec(hsplit);
  inputFormat = HiveInputFormat.wrapForLlap(inputFormat, jobConf, part);
} catch (HiveException e) {
  throw new IOException(e);
}
{code}
The above piece of code in CombineHiveRecordReader.java was introduced in 
HIVE-15147. It overwrites inputFormat based on the PartitionDesc, which is not 
required in non-LLAP execution mode, since HiveInputFormat.wrapForLlap() simply 
returns the previously defined inputFormat in that case. The method call 
extractSinglePartSpec() has serious performance implications: when there are a 
large number of small files, each call to extractSinglePartSpec() takes roughly 
2-3 seconds. Hence the same query that runs in Hive 1.x / Hive 2 is much 
faster than on the latest Hive.
{code:java}
2020-02-11 07:15:04,701 INFO [main] org.apache.hadoop.hive.ql.io.orc.ReaderImpl: Reading ORC rows from 
2020-02-11 07:15:06,468 WARN [main] org.apache.hadoop.hive.ql.io.CombineHiveRecordReader: Multiple partitions found; not going to pass a part spec to LLAP IO: {{logdate=2020-02-03, hour=01, event=win}} and {{logdate=2020-02-03, hour=02, event=act}}
2020-02-11 07:15:06,468 INFO [main] org.apache.hadoop.hive.ql.io.CombineHiveRecordReader: succeeded in getting org.apache.hadoop.mapred.FileSplit
{code}





[jira] [Created] (HIVE-22433) Hive JDBC Storage Handler: Incorrect results fetched from BOOLEAN and TIMESTAMP DataType From JDBC Data Source

2019-10-29 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-22433:


 Summary: Hive JDBC Storage Handler: Incorrect results fetched from 
BOOLEAN and TIMESTAMP DataType From JDBC Data Source
 Key: HIVE-22433
 URL: https://issues.apache.org/jira/browse/HIVE-22433
 Project: Hive
  Issue Type: Bug
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
 Fix For: 4.0.0


Steps to Reproduce:
{code:java}
//Derby table:
create table testtbl(a BOOLEAN, b TIMESTAMP);

// Insert to table via mysql connector
// data in db
true 2019-11-11 12:00:00

//Hive table:
CREATE EXTERNAL TABLE `hive_table`(   
  a BOOLEAN, b TIMESTAMP
 )   
STORED BY  
  'org.apache.hive.storage.jdbc.JdbcStorageHandler'   
TBLPROPERTIES (
  'hive.sql.database.type'='DERBY',  
  'hive.sql.dbcp.password'='', 
  'hive.sql.dbcp.username'='', 
  'hive.sql.jdbc.driver'='',  
  'hive.sql.jdbc.url'='',  
  'hive.sql.table'='testtbl');

//Hive query:
select * from hive_table;

// result from select query

false 2019-11-11 20:00:00

{code}





[jira] [Created] (HIVE-22431) Hive JDBC Storage Handler: java.lang.ClassCastException on accessing TINYINT, SMALLINT Data Type From JDBC Data Source

2019-10-29 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-22431:


 Summary: Hive JDBC Storage Handler: java.lang.ClassCastException 
on accessing TINYINT, SMALLINT Data Type From JDBC Data Source
 Key: HIVE-22431
 URL: https://issues.apache.org/jira/browse/HIVE-22431
 Project: Hive
  Issue Type: Bug
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman


Steps to Reproduce:
{code:java}
//MySQL table:
create table testtbl(a TINYINT, b SMALLINT);

// Insert to table

//Hive table:
CREATE EXTERNAL TABLE `hive_table`(   
  a TINYINT, b SMALLINT
 )
ROW FORMAT SERDE   
  'org.apache.hive.storage.jdbc.JdbcSerDe' 
STORED BY  
  'org.apache.hive.storage.jdbc.JdbcStorageHandler'   
TBLPROPERTIES (
  'hive.sql.database.type'='MYSQL',  
  'hive.sql.dbcp.password'='hive', 
  'hive.sql.dbcp.username'='hive', 
  'hive.sql.jdbc.driver'='com.mysql.jdbc.Driver',  
  'hive.sql.jdbc.url'='jdbc:mysql://hadoop/test',  
  'hive.sql.table'='testtbl');

//Hive query:
select * from hive_table;


{code}
*Failed with exception 
java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.ClassCastException: java.lang.Integer cannot be cast to 
java.lang.Byte*

*Failed with exception 
java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.ClassCastException: java.lang.Integer cannot be cast to 
java.lang.Short*
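The failure mode is plain Java boxing behavior and easy to demonstrate (a sketch, not the storage handler's actual code path): a boxed Integer cannot be cast to Byte or Short; it has to be narrowed through the Number API.

```java
public class CastSketch {
    // Mimics reading a TINYINT column where the JDBC driver returns an Integer.
    static String naiveCast(Object fromJdbc) {
        try {
            Byte b = (Byte) fromJdbc; // what the serde effectively attempts
            return "ok:" + b;
        } catch (ClassCastException e) {
            return "ClassCastException";
        }
    }

    // The fix: narrow through the Number API instead of casting the box type.
    static byte narrowed(Object fromJdbc) {
        return ((Number) fromJdbc).byteValue();
    }

    public static void main(String[] args) {
        Object fromJdbc = Integer.valueOf(5);
        System.out.println(naiveCast(fromJdbc)); // ClassCastException
        System.out.println(narrowed(fromJdbc));  // 5
    }
}
```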





[jira] [Created] (HIVE-22409) Logging: Implement QueryID Based Hive Logging

2019-10-26 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-22409:


 Summary: Logging: Implement QueryID Based Hive Logging
 Key: HIVE-22409
 URL: https://issues.apache.org/jira/browse/HIVE-22409
 Project: Hive
  Issue Type: Improvement
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman


Currently all the Hive logs are written to 
${sys:hive.log.dir}/${sys:hive.log.file}, which is basically a single log 
file. Over time it becomes tedious to search the logs, since the logs of 
multiple Hive queries are interleaved in a single file.

Hence we propose queryID-based Hive logging, where the logs of different 
queries are written to separate log files based on their queryID.
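One possible shape for this, sketched as a Log4j2 routing appender keyed on a queryId thread-context value (all appender and property names here are illustrative assumptions, not Hive's shipped configuration):

```properties
# Route each log event to a per-query file named after the queryId
# stored in the Log4j2 thread context (illustrative sketch).
appender.query.type = Routing
appender.query.name = query-routing
appender.query.routes.type = Routes
appender.query.routes.pattern = $${ctx:queryId}
appender.query.routes.route.type = Route
appender.query.routes.route.file.type = File
appender.query.routes.route.file.name = query-file
appender.query.routes.route.file.fileName = ${sys:hive.log.dir}/${ctx:queryId}.log
appender.query.routes.route.file.layout.type = PatternLayout
appender.query.routes.route.file.layout.pattern = %d{ISO8601} %5p [%t] %c{2}: %m%n
```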

CC [~prasanth_j] [~gopalv] [~sseth]  





[jira] [Created] (HIVE-22392) Hive JDBC Storage Handler: Support For Writing Data to JDBC Data Source

2019-10-22 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-22392:


 Summary: Hive JDBC Storage Handler: Support For Writing Data to 
JDBC Data Source
 Key: HIVE-22392
 URL: https://issues.apache.org/jira/browse/HIVE-22392
 Project: Hive
  Issue Type: New Feature
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman


The JDBC Storage Handler supports reading from a JDBC data source in Hive, but 
writing to a JDBC data source is currently not supported. Hence, this issue 
adds support for simple INSERT queries so that data can be written back to the 
JDBC data source.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-21454) Tez default configs get overwritten by MR default configs

2019-03-15 Thread Syed Shameerur Rahman (JIRA)
Syed Shameerur Rahman created HIVE-21454:


 Summary: Tez default configs get overwritten by MR default configs
 Key: HIVE-21454
 URL: https://issues.apache.org/jira/browse/HIVE-21454
 Project: Hive
  Issue Type: Bug
Reporter: Syed Shameerur Rahman


Due to changes done in HIVE-17781, Tez default configs such as 
tez.counters.max (default value 1200) get overwritten by 
mapreduce.job.counters.max (default value 120).


