[jira] [Created] (HIVE-27142) Map Join not working as expected when joining non-native tables with native tables

2023-03-15 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-27142:


 Summary:  Map Join not working as expected when joining non-native 
tables with native tables
 Key: HIVE-27142
 URL: https://issues.apache.org/jira/browse/HIVE-27142
 Project: Hive
  Issue Type: Bug
Affects Versions: All Versions
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
 Fix For: 4.0.0


*1. Issue :*

When *_hive.auto.convert.join=true_* and the underlying query joins a large non-native Hive table with a small native Hive table, the map join happens on the wrong side, i.e. on the map task which processes the small native Hive table. This can lead to OOM when the non-native table is really large and only a few map tasks are spawned to scan the small native Hive table.

 

*2. Why is this happening ?*

This happens due to improper stats collection/computation for non-native Hive tables. Since a non-native Hive table is actually stored in a different location which Hive does not know of, and the temporary path visible to Hive while creating the table does not hold the actual data, the stats collection logic tends to underestimate the data size/row count, and hence causes the map join to happen on the wrong side.
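
The effect can be sketched in isolation (hypothetical code, not Hive's actual planner; the class, method, and sizes are made up for illustration): a map join should broadcast the smaller input, so an under-estimated size for the non-native table wrongly makes it the broadcast side.

```java
// Hypothetical illustration of broadcast-side selection based on size stats.
public class MapJoinSideDemo {
    // Broadcast whichever input the stats say is smaller.
    public static String chooseBroadcastSide(long nativeTableBytes, long nonNativeTableBytes) {
        return nonNativeTableBytes <= nativeTableBytes ? "non-native" : "native";
    }

    public static void main(String[] args) {
        long nativeBytes = 10L * 1024 * 1024;                 // 10 MB, correctly estimated
        long realNonNativeBytes = 100L * 1024 * 1024 * 1024;  // 100 GB actual size
        long estimatedNonNativeBytes = 1024;                  // bogus estimate (no real stats)

        // With correct stats the small native table is broadcast.
        System.out.println("real stats: broadcast "
            + chooseBroadcastSide(nativeBytes, realNonNativeBytes));
        // With the under-estimate the huge non-native table is broadcast, risking OOM.
        System.out.println("bad stats:  broadcast "
            + chooseBroadcastSide(nativeBytes, estimatedNonNativeBytes));
    }
}
```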

 

*3. Potential Solutions*

 3.1 Set *_hive.auto.convert.join=false._* This can have a negative impact on the query if the same query performs multiple joins, i.e. one join with non-native tables and another join where both tables are native.

 3.2 Compute stats for the non-native table by firing the ANALYZE TABLE <> command before joining native and non-native tables. The user may or may not choose to do this.

 3.3 Don't collect/estimate stats for non-native Hive tables by default (preferred solution)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-26787) Pushdown Timestamp data type to metastore via direct sql / JDO

2022-11-28 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-26787:


 Summary: Pushdown Timestamp data type to metastore via direct sql 
/ JDO
 Key: HIVE-26787
 URL: https://issues.apache.org/jira/browse/HIVE-26787
 Project: Hive
  Issue Type: Improvement
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman


Make the Timestamp data type push down to the Hive metastore during partition pruning. This is along similar lines to the jira: 
https://issues.apache.org/jira/browse/HIVE-26778





[jira] [Created] (HIVE-26778) Pushdown Date data type to metastore via direct sql / JDO

2022-11-25 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-26778:


 Summary: Pushdown Date data type to metastore via direct sql / JDO
 Key: HIVE-26778
 URL: https://issues.apache.org/jira/browse/HIVE-26778
 Project: Hive
  Issue Type: Bug
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman


The original feature to push down date data type while doing partition pruning 
via direct sql / JDO was added as part of the jira : 
https://issues.apache.org/jira/browse/HIVE-5679

Since the behavior of Hive has changed with CBO, the Date data type is no longer pushed down to the metastore when CBO is turned on: CBO adds an extra keyword 'DATE' to the original filter, and since the filter parser is not equipped to parse this extra keyword, it fails, and hence the Date data type is not pushed down to the metastore.


{code:java}
select * from test_table where date_col = '2022-01-01';
{code}

When CBO is turned on, the filter predicate generated is 
date_col=DATE'2022-01-01'
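
The parsing failure can be sketched in isolation (hypothetical code, not the actual metastore filter parser; the regex stands in for the real grammar): a parser that only understands col = 'literal' rejects the CBO-generated form unless the extra type keyword is handled.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the metastore partition-filter parser.
public class DateFilterDemo {
    // Accepts only the plain form: column = 'literal'
    private static final Pattern SIMPLE = Pattern.compile("\\w+\\s*=\\s*'[^']*'");

    public static boolean parses(String filter) {
        return SIMPLE.matcher(filter).matches();
    }

    public static void main(String[] args) {
        System.out.println(parses("date_col = '2022-01-01'"));     // accepted
        System.out.println(parses("date_col = DATE'2022-01-01'")); // rejected: extra DATE keyword
    }
}
```

One possible direction is teaching the parser the optional type keyword (or normalizing it away before pushdown) so both forms reach the metastore.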





[jira] [Created] (HIVE-26467) SessionState should be accessible inside ThreadPool

2022-08-12 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-26467:


 Summary: SessionState should be accessible inside ThreadPool
 Key: HIVE-26467
 URL: https://issues.apache.org/jira/browse/HIVE-26467
 Project: Hive
  Issue Type: Improvement
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
 Fix For: 4.0.0


Currently SessionState.get() returns null if it is called inside a ThreadPool. If any custom third-party component leverages SessionState.get() for operations like getting the session state or the session config, it will get null, since the session state is thread-local 
(https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/session/SessionState.java#L622) and plain ThreadLocal variables are not inherited by child threads / thread pools.

So one solution is to make the thread-local variable inheritable so that the SessionState gets propagated to child threads.
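
The underlying JDK behavior can be seen in a minimal standalone sketch (plain Java, not Hive code):

```java
// A plain ThreadLocal set in the parent thread is invisible to a child
// thread, while an InheritableThreadLocal is copied into the child when
// the child thread is created.
public class ThreadLocalDemo {
    static final ThreadLocal<String> plain = new ThreadLocal<>();
    static final ThreadLocal<String> inheritable = new InheritableThreadLocal<>();

    // Sets both variables, then reads them back from a new child thread.
    public static String[] readFromChild() throws InterruptedException {
        plain.set("session");
        inheritable.set("session");
        String[] seen = new String[2];
        Thread child = new Thread(() -> {
            seen[0] = plain.get();       // null: not inherited
            seen[1] = inheritable.get(); // "session": copied at thread creation
        });
        child.start();
        child.join();
        return seen;
    }

    public static void main(String[] args) throws InterruptedException {
        String[] seen = readFromChild();
        System.out.println("plain in child: " + seen[0]);
        System.out.println("inheritable in child: " + seen[1]);
    }
}
```

Note that InheritableThreadLocal copies the value only at thread-creation time, so with a thread pool the propagated value is whatever was set when the pool thread was spawned; long-lived pooled threads would still need care.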





[jira] [Created] (HIVE-25942) Upgrade commons-io to 2.8.0 due to CVE-2021-29425

2022-02-09 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-25942:


 Summary: Upgrade commons-io to 2.8.0 due to CVE-2021-29425
 Key: HIVE-25942
 URL: https://issues.apache.org/jira/browse/HIVE-25942
 Project: Hive
  Issue Type: Bug
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
 Fix For: 4.0.0


Due to [CVE-2021-29425|https://nvd.nist.gov/vuln/detail/CVE-2021-29425], all commons-io versions below 2.7 are affected.

Tez and Hadoop have upgraded commons-io to 2.8.0 in 
[TEZ-4353|https://issues.apache.org/jira/browse/TEZ-4353] and 
[HADOOP-17683|https://issues.apache.org/jira/browse/HADOOP-17683] respectively, and it would be good if Hive did the same.





[jira] [Created] (HIVE-25907) IOW Directory queries fails to write data to final path when query result cache is enabled

2022-01-27 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-25907:


 Summary: IOW Directory queries fails to write data to final path 
when query result cache is enabled
 Key: HIVE-25907
 URL: https://issues.apache.org/jira/browse/HIVE-25907
 Project: Hive
  Issue Type: Bug
  Components: Hive
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
 Fix For: 4.0.0


INSERT OVERWRITE DIRECTORY queries fail to write the data to the specified directory location when the query result cache is enabled.

*Steps to reproduce*

{code:java}
1. create a data file with the following data

1 abc 10.5
2 def 11.5


2. create table pointing to that data

create external table iowd(strct struct)
row format delimited
fields terminated by '\t'
collection items terminated by ' '
location '';


3. run the following query

set hive.query.results.cache.enabled=true;
INSERT OVERWRITE DIRECTORY "" SELECT * FROM iowd;
{code}

After execution of the above query, it is expected that the destination directory contains data from the table iowd, but due to HIVE-21386 this no longer happens.





[jira] [Created] (HIVE-25680) Authorize #get_table_meta HiveMetastore Server API to use any of the HiveMetastore Authorization model

2021-11-08 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-25680:


 Summary: Authorize #get_table_meta HiveMetastore Server API to use 
any of the HiveMetastore Authorization model
 Key: HIVE-25680
 URL: https://issues.apache.org/jira/browse/HIVE-25680
 Project: Hive
  Issue Type: Bug
Affects Versions: All Versions
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
 Fix For: 4.0.0
 Attachments: Screenshot 2021-11-08 at 2.39.30 PM.png

Apache Hue, or any other application which uses the #get_table_meta API, is not gated by any of the authorization models which HiveMetastore provides.

For more information on Storage based Authorization Model : 
https://cwiki.apache.org/confluence/display/Hive/HCatalog+Authorization

You can easily reproduce this with Apache Hive + Apache Hue

{code:xml}
<property>
  <name>hive.security.metastore.authorization.manager</name>
  <value>org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider</value>
</property>

<property>
  <name>hive.security.metastore.authenticator.manager</name>
  <value>org.apache.hadoop.hive.ql.security.HadoopDefaultMetastoreAuthenticator</value>
</property>

<property>
  <name>hive.metastore.pre.event.listeners</name>
  <value>org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener</value>
</property>
{code}


{code:java}
#!/bin/bash

set -x

hdfs dfs -mkdir /datasets

hdfs dfs -mkdir /datasets/database1

hdfs dfs -mkdir /datasets/database1/table1

echo "stefano,1992" | hdfs dfs -put - /datasets/database1/table1/file1.csv

hdfs dfs -chmod -R 700 /datasets/database1

sudo tee -a setup.hql > /dev/null <
{code}

1. Create the first user called "admin", provide a password, and access the Hive Editor
2. In the SQL section on the left, under Databases, you should see default and database1 listed. Click on database1
3. As you can see, a table called table1 is listed => this should not be possible, as our admin user has no HDFS grants on /datasets/database1
4. Run the following query from the Hive editor: SHOW TABLES; The output shows a Permission denied error => this is the expected behavior





[jira] [Created] (HIVE-25443) Arrow SerDe Cannot serialize/deserialize complex data types When there are more than 1024 values

2021-08-11 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-25443:


 Summary: Arrow SerDe Cannot serialize/deserialize complex data 
types When there are more than 1024 values
 Key: HIVE-25443
 URL: https://issues.apache.org/jira/browse/HIVE-25443
 Project: Hive
  Issue Type: Bug
  Components: Serializers/Deserializers
Affects Versions: 3.1.2, 3.1.1, 3.0.0, 3.1.0
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
 Fix For: 4.0.0


Complex data types like MAP and STRUCT cannot be serialized/deserialized using the Arrow SerDe when there are more than 1024 values. This happens because the ColumnVector is always initialized with a size of 1024.

Issue #1 : 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/arrow/ArrowColumnarBatchSerDe.java#L213

Issue #2 : 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/arrow/ArrowColumnarBatchSerDe.java#L215

Sample unit test to reproduce the case in TestArrowColumnarBatchSerDe :


{code:java}
@Test
public void testListBooleanWithMoreThan1024Values() throws SerDeException {
  String[][] schema = {
      {"boolean_list", "array<boolean>"},
  };

  Object[][] rows = new Object[1025][1];
  for (int i = 0; i < 1025; i++) {
    rows[i][0] = new BooleanWritable(true);
  }

  initAndSerializeAndDeserialize(schema, toList(rows));
}
{code}







[jira] [Created] (HIVE-24690) GlobalLimitOptimizer Fails To Identify Some Queries With LIMIT Operator

2021-01-27 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-24690:


 Summary: GlobalLimitOptimizer Fails To Identify Some Queries With 
LIMIT Operator
 Key: HIVE-24690
 URL: https://issues.apache.org/jira/browse/HIVE-24690
 Project: Hive
  Issue Type: Bug
  Components: Query Planning
Affects Versions: 3.1.0, 2.1.0, 1.1.0
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman


As per 
[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GlobalLimitOptimizer.java#L88]
 queries like
{code:java}
CREATE TABLE ... AS SELECT col1, col2 FROM tbl LIMIT ..
INSERT OVERWRITE TABLE ... SELECT col1, hash(col2), split(col1) FROM ... 
LIMIT...
{code}
fall under the category of the qualified list, but after HIVE-9444 they do not.

On investigating this issue, it was found that for the query
{code:java}
CREATE TABLE ... AS SELECT col1, col2 FROM tbl LIMIT 
{code}
the operator tree looks like *TS -> SEL -> LIM -> RS -> SEL -> LIM -> FS*

Since only one LIMIT operator is allowed as per 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GlobalLimitOptimizer.java#L196, the *GlobalLimitOptimizer* fails to identify such queries.
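
The skipped case can be sketched as follows (hypothetical code, not the optimizer itself; operator names follow the tree notation above): a check that requires exactly one LIMIT in the pipeline rejects the CTAS plan even though it is a valid global-limit candidate.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the qualification check in GlobalLimitOptimizer.
public class LimitCheckDemo {
    // Qualifies only when the plan contains exactly one LIMIT operator.
    public static boolean qualifiesStrict(List<String> operators) {
        return operators.stream().filter("LIM"::equals).count() == 1;
    }

    public static void main(String[] args) {
        List<String> simplePlan = Arrays.asList("TS", "SEL", "LIM", "FS");
        List<String> ctasPlan = Arrays.asList("TS", "SEL", "LIM", "RS", "SEL", "LIM", "FS");
        System.out.println(qualifiesStrict(simplePlan)); // true
        System.out.println(qualifiesStrict(ctasPlan));   // false: two LIMs, so it is skipped
    }
}
```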

*Steps To Reproduce*

{code:java}
set hive.limit.optimize.enable=true;
create table t1 (a int);
create table t2 as select * from t1 LIMIT 10;
{code}








[jira] [Created] (HIVE-23851) MSCK REPAIR Command With Partition Filtering Fails While Dropping Partitions

2020-07-15 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-23851:


 Summary: MSCK REPAIR Command With Partition Filtering Fails While 
Dropping Partitions
 Key: HIVE-23851
 URL: https://issues.apache.org/jira/browse/HIVE-23851
 Project: Hive
  Issue Type: Bug
Affects Versions: 4.0.0
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman


*Steps to reproduce:*
# Create external table
# Run msck command to sync all the partitions with metastore
# Remove one of the partition path
# Run msck repair with partition filtering

*Stack Trace:*
{code:java}
 2020-07-15T02:10:29,045 ERROR [4dad298b-28b1-4e6b-94b6-aa785b60c576 main] 
ppr.PartitionExpressionForMetastore: Failed to deserialize the expression
 java.lang.IndexOutOfBoundsException: Index: 110, Size: 0
 at java.util.ArrayList.rangeCheck(ArrayList.java:657) ~[?:1.8.0_192]
 at java.util.ArrayList.get(ArrayList.java:433) ~[?:1.8.0_192]
 at 
org.apache.hive.com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
 at 
org.apache.hive.com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:857)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:707) 
~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
 at 
org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readObject(SerializationUtilities.java:211)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
 at 
org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeObjectFromKryo(SerializationUtilities.java:806)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
 at 
org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeExpressionFromKryo(SerializationUtilities.java:775)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
 at 
org.apache.hadoop.hive.ql.optimizer.ppr.PartitionExpressionForMetastore.deserializeExpr(PartitionExpressionForMetastore.java:96)
 [hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
 at 
org.apache.hadoop.hive.ql.optimizer.ppr.PartitionExpressionForMetastore.convertExprToFilter(PartitionExpressionForMetastore.java:52)
 [hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
 at 
org.apache.hadoop.hive.metastore.PartFilterExprUtil.makeExpressionTree(PartFilterExprUtil.java:48)
 [hive-standalone-metastore-server-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
 at 
org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByExprInternal(ObjectStore.java:3593)
 [hive-standalone-metastore-server-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
 at 
org.apache.hadoop.hive.metastore.VerifyingObjectStore.getPartitionsByExpr(VerifyingObjectStore.java:80)
 [hive-standalone-metastore-server-4.0.0-SNAPSHOT-tests.jar:4.0.0-SNAPSHOT]
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_192]
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
~[?:1.8.0_192]
{code}

*Cause:*
In the case of msck repair with partition filtering, we expect the expression proxy class to be set to PartitionExpressionForMetastore ( 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/ddl/misc/msck/MsckAnalyzer.java#L78 ). While dropping partitions, however, we serialize the drop-partition filter expression as in ( 
https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/Msck.java#L589 ), which is incompatible with the deserialization happening in PartitionExpressionForMetastore ( 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ppr/PartitionExpressionForMetastore.java#L52 ); hence the query fails with "Failed to deserialize the expression".

*Solutions*:
I could think of two approaches to this problem:
# Since PartitionExpressionForMetastore is required only during the partition pruning step, we can switch the expression proxy class back to MsckPartitionExpressionProxy once the partition pruning step is done.
# The other solution is to make the serialization of the msck drop-partition filter expression compatible with the one in PartitionExpressionForMetastore. We can do this via reflection, since the drop-partition serialization happens in the Msck class (standalone-metastore). This way we can completely remove the need for the MsckPartitionExpressionProxy class, and it also lets MSCK REPAIR with partition filtering work with ease (no need to set the expression proxyClass config).

I am personally inclined towards the 2nd approach. Before moving on, I want to know if this is the best approach or whether there is any other better/easier approach to solve this problem.

PS: The qtest added in HIVE-22957 mainly focused on adding a missing partition; a case for dropping a partition was not added.






[jira] [Created] (HIVE-23751) QTest: Override #mkdirs() method in ProxyFileSystem To Align After HADOOP-16582

2020-06-23 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-23751:


 Summary: QTest: Override #mkdirs() method in ProxyFileSystem To 
Align After HADOOP-16582
 Key: HIVE-23751
 URL: https://issues.apache.org/jira/browse/HIVE-23751
 Project: Hive
  Issue Type: Task
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
 Fix For: 4.0.0, 3.2.0


HADOOP-16582 has changed the way mkdirs() works:

*Before HADOOP-16582:*
All calls to mkdirs(p) were fast-tracked to FileSystem.mkdirs, which then re-routed them to the mkdirs(p, permission) method. For ProxyFileSystem the call would look like

{code:java}
FileUtils.mkdir(p) ---> FileSystem.mkdirs(p) ---> ProxyFileSystem.mkdirs(p, permission)
{code}
An implementation of FileSystem only needed to implement mkdirs(p, permission).


*After HADOOP-16582:*

Since FilterFileSystem now overrides the mkdirs(p) method, the new call path for ProxyFileSystem looks like

{code:java}
FileUtils.mkdir(p) ---> FilterFileSystem.mkdirs(p) --->
{code}

This makes all the qtests fail with the below exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Wrong FS: pfile:/media/ebs1/workspace/hive-3.1-qtest/group/5/label/HiveQTest/hive-1.2.0/itests/qtest/target/warehouse/dest1, expected: file:///
{code}
Note: We will hit this issue when we bump up the Hadoop version in Hive.

So, as per the discussion in HADOOP-16963, ProxyFileSystem needs to override the mkdirs(p) method in order to solve the above problem. The new flow would then look like

{code:java}
FileUtils.mkdir(p) ---> ProxyFileSystem.mkdirs(p) ---> ProxyFileSystem.mkdirs(p, permission) --->
{code}







[jira] [Created] (HIVE-23737) LLAP: Reuse dagDelete Feature Of Tez Custom Shuffle Handler Instead Of LLAP's dagDelete

2020-06-22 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-23737:


 Summary: LLAP: Reuse dagDelete Feature Of Tez Custom Shuffle 
Handler Instead Of LLAP's dagDelete
 Key: HIVE-23737
 URL: https://issues.apache.org/jira/browse/HIVE-23737
 Project: Hive
  Issue Type: Improvement
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman


LLAP has a dagDelete feature added as part of HIVE-9911, but now that Tez has added support for dagDelete in the custom shuffle handler (TEZ-3362), we could re-use that feature in LLAP. 
There are some added advantages of using Tez's dagDelete feature rather than LLAP's current one:

1) We can easily extend this feature to accommodate upcoming features such as vertex and failed-task-attempt shuffle data clean-up. Refer to TEZ-3363 and TEZ-4129.

2) It will be easier to maintain this feature by separating it out from Hive's code path. 





[jira] [Created] (HIVE-23606) LLAP: Delay In DirectByteBuffer Clean Up For EncodedReaderImpl

2020-06-03 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-23606:


 Summary: LLAP: Delay In DirectByteBuffer Clean Up For 
EncodedReaderImpl
 Key: HIVE-23606
 URL: https://issues.apache.org/jira/browse/HIVE-23606
 Project: Hive
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
 Fix For: 4.0.0


DirectByteBuffers are only cleaned up when there is a full GC or when the cleaner method of the DirectByteBuffer is invoked manually. Since a full GC may take some time to kick in, the native memory usage of the LLAP daemon process can shoot up in the meanwhile, and this will force the YARN pmem monitor to kill the container running the daemon.

HIVE-16180 tried to solve this problem, but the code structure got messed up after HIVE-15665.

The IdentityHashMap (toRelease) is initialized in 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedReaderImpl.java#L409, but it is getting re-initialized inside the method getDataFromCacheAndDisk() ( 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedReaderImpl.java#L633 ), which makes it local to that method; hence the original toRelease IdentityHashMap remains empty.
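
The bug pattern, reduced to a standalone sketch (hypothetical code, not the EncodedReaderImpl source): re-declaring the map inside the method shadows the field, so entries never reach the field-level map that the release logic reads.

```java
import java.util.IdentityHashMap;
import java.util.Map;

public class ShadowedFieldDemo {
    final Map<Object, Object> toRelease = new IdentityHashMap<>();

    // Bug: the local declaration shadows the field, so the put is lost.
    public void buggyCollect(Object buffer) {
        Map<Object, Object> toRelease = new IdentityHashMap<>();
        toRelease.put(buffer, Boolean.TRUE);
    }

    // Fix: write into the field so the buffers can be released later.
    public void fixedCollect(Object buffer) {
        toRelease.put(buffer, Boolean.TRUE);
    }

    public static void main(String[] args) {
        ShadowedFieldDemo d = new ShadowedFieldDemo();
        d.buggyCollect(new Object());
        System.out.println("after buggy collect: " + d.toRelease.size()); // 0
        d.fixedCollect(new Object());
        System.out.println("after fixed collect: " + d.toRelease.size()); // 1
    }
}
```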





[jira] [Created] (HIVE-23085) LLAP: Support Multiple NVMe-SSD disk Locations While Using SSD Cache

2020-03-26 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-23085:


 Summary: LLAP: Support Multiple NVMe-SSD disk Locations While 
Using SSD Cache
 Key: HIVE-23085
 URL: https://issues.apache.org/jira/browse/HIVE-23085
 Project: Hive
  Issue Type: Improvement
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
 Fix For: 4.0.0


Currently we can configure only one SSD location while using the SSD cache in LLAP. This highly undermines the ability of some machines to use their disk capacity to the fullest. For example, *AWS* provides the *r5d.4xlarge* instance type, which comes with *2 x 300 GB NVMe SSD disks*; with the current design only one of the mounted *NVMe SSD* disks can be used for caching. Hence, this adds support for caching data at multiple SSD mount locations.





[jira] [Created] (HIVE-22957) Support For FilterExp In MSCK Command

2020-03-02 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-22957:


 Summary: Support For FilterExp In MSCK Command
 Key: HIVE-22957
 URL: https://issues.apache.org/jira/browse/HIVE-22957
 Project: Hive
  Issue Type: Improvement
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
 Fix For: 4.0.0


Currently the MSCK command supports a full repair of the table (all partitions) or some subset of partitions based on a partitionSpec. The aim of this jira is to introduce a filterExp (=, !=, <, >, >=, <=, LIKE) in the MSCK command so that a larger subset of partitions can be recovered (added/deleted) without firing a full repair, which might take time if the no. of partitions is huge.

*Approach*:

The initial approach is to add a where clause to the MSCK command, e.g.: MSCK REPAIR TABLE  ADD|DROP|SYNC PARTITIONS WHERE   
 AND 

*Flow:*

1) Parse the where clause and generate the filter expression

2) Fetch all the partitions from the metastore which match the filter expression

3) Fetch all the partition paths from the filesystem

4) Remove all the partition paths which do not match the filter expression

5) Based on ADD | DROP | SYNC, do the remaining steps.
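
Steps 2-5 above amount to a filtered set difference, sketched here with plain strings (hypothetical code; the partition names and filter are illustrative):

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class MsckFilterDemo {
    // Partitions present on the filesystem but missing in the metastore (ADD candidates).
    public static Set<String> toAdd(Set<String> metastore, Set<String> filesystem,
                                    Predicate<String> filter) {
        Set<String> result =
            new TreeSet<>(filesystem.stream().filter(filter).collect(Collectors.toSet()));
        result.removeAll(metastore);
        return result;
    }

    // Partitions present in the metastore but missing on the filesystem (DROP candidates).
    public static Set<String> toDrop(Set<String> metastore, Set<String> filesystem,
                                     Predicate<String> filter) {
        Set<String> result =
            new TreeSet<>(metastore.stream().filter(filter).collect(Collectors.toSet()));
        result.removeAll(filesystem);
        return result;
    }

    public static void main(String[] args) {
        Set<String> metastore = Set.of("dt=2019-12-31", "dt=2020-01-01", "dt=2020-01-02");
        Set<String> filesystem = Set.of("dt=2019-12-31", "dt=2020-01-02", "dt=2020-01-03");
        // Filter from the WHERE clause, e.g. dt >= '2020-01-01'
        Predicate<String> filter = p -> p.compareTo("dt=2020-01-01") >= 0;
        System.out.println("ADD: " + toAdd(metastore, filesystem, filter));   // [dt=2020-01-03]
        System.out.println("DROP: " + toDrop(metastore, filesystem, filter)); // [dt=2020-01-01]
    }
}
```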





[jira] [Created] (HIVE-22900) Predicate Push Down Of Like Filter While Fetching Partition Data From MetaStore

2020-02-18 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-22900:


 Summary: Predicate Push Down Of Like Filter While Fetching 
Partition Data From MetaStore
 Key: HIVE-22900
 URL: https://issues.apache.org/jira/browse/HIVE-22900
 Project: Hive
  Issue Type: New Feature
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
 Fix For: 4.0.0


Currently, PPD is disabled for the LIKE filter while fetching partition data from the metastore. The following patch covers all the test cases mentioned in HIVE-5134.





[jira] [Created] (HIVE-22891) Skip PartitonDesc Extraction In CombineHiveRecord For Non-LLAP Execution Mode

2020-02-14 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-22891:


 Summary: Skip PartitonDesc Extraction In CombineHiveRecord For 
Non-LLAP Execution Mode
 Key: HIVE-22891
 URL: https://issues.apache.org/jira/browse/HIVE-22891
 Project: Hive
  Issue Type: Task
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
 Fix For: 4.0.0


{code:java}
try {
  // TODO: refactor this out
  if (pathToPartInfo == null) {
    MapWork mrwork;
    if (HiveConf.getVar(conf, HiveConf.ConfVars.HIVE_EXECUTION_ENGINE).equals("tez")) {
      mrwork = (MapWork) Utilities.getMergeWork(jobConf);
      if (mrwork == null) {
        mrwork = Utilities.getMapWork(jobConf);
      }
    } else {
      mrwork = Utilities.getMapWork(jobConf);
    }
    pathToPartInfo = mrwork.getPathToPartitionInfo();
  }

  PartitionDesc part = extractSinglePartSpec(hsplit);
  inputFormat = HiveInputFormat.wrapForLlap(inputFormat, jobConf, part);
} catch (HiveException e) {
  throw new IOException(e);
}
{code}
The above piece of code in CombineHiveRecordReader.java was introduced in HIVE-15147. It overwrites inputFormat based on the PartitionDesc, which is not required in non-LLAP execution mode, since the method HiveInputFormat.wrapForLlap() simply returns the previously defined inputFormat in the non-LLAP case. The method call extractSinglePartSpec() has some serious performance implications: if there are a large no. of small files, each call to extractSinglePartSpec() takes approx. 2-3 seconds. Hence the same query which runs in Hive 1.x / Hive 2 is way faster than the query run on the latest Hive.
{code:java}
2020-02-11 07:15:04,701 INFO [main] org.apache.hadoop.hive.ql.io.orc.ReaderImpl: Reading ORC rows from  

2020-02-11 07:15:06,468 WARN [main] org.apache.hadoop.hive.ql.io.CombineHiveRecordReader: Multiple partitions found; not going to pass a part spec to LLAP IO: {{logdate=2020-02-03, hour=01, event=win}} and {{logdate=2020-02-03, hour=02, event=act}}

2020-02-11 07:15:06,468 INFO [main] org.apache.hadoop.hive.ql.io.CombineHiveRecordReader: succeeded in getting org.apache.hadoop.mapred.FileSplit
{code}





[jira] [Created] (HIVE-22433) Hive JDBC Storage Handler: Incorrect results fetched from BOOLEAN and TIMESTAMP DataType From JDBC Data Source

2019-10-29 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-22433:


 Summary: Hive JDBC Storage Handler: Incorrect results fetched from 
BOOLEAN and TIMESTAMP DataType From JDBC Data Source
 Key: HIVE-22433
 URL: https://issues.apache.org/jira/browse/HIVE-22433
 Project: Hive
  Issue Type: Bug
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
 Fix For: 4.0.0


Steps to Reproduce:
{code:java}
//Derby table:
create table testtbl(a BOOLEAN, b TIMESTAMP);

// Insert into the table via the JDBC connector
// data in db
true 2019-11-11 12:00:00

//Hive table:
CREATE EXTERNAL TABLE `hive_table`(   
  a BOOLEAN, b TIMESTAMP
 )   
STORED BY  
  'org.apache.hive.storage.jdbc.JdbcStorageHandler'   
TBLPROPERTIES (
  'hive.sql.database.type'='DERBY',  
  'hive.sql.dbcp.password'='', 
  'hive.sql.dbcp.username'='', 
  'hive.sql.jdbc.driver'='',  
  'hive.sql.jdbc.url'='',  
  'hive.sql.table'='testtbl');

//Hive query:
select * from hive_table;

// result from select query

false 2019-11-11 20:00:00

{code}





[jira] [Created] (HIVE-22431) Hive JDBC Storage Handler: java.lang.ClassCastException on accessing TINYINT, SMALLINT Data Type From JDBC Data Source

2019-10-29 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-22431:


 Summary: Hive JDBC Storage Handler: java.lang.ClassCastException 
on accessing TINYINT, SMALLINT Data Type From JDBC Data Source
 Key: HIVE-22431
 URL: https://issues.apache.org/jira/browse/HIVE-22431
 Project: Hive
  Issue Type: Bug
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman


Steps to Reproduce:
{code:java}
//MySQL table:
create table testtbl(a TINYINT, b SMALLINT);

// Insert to table

//Hive table:
CREATE EXTERNAL TABLE `hive_table`(   
  a TINYINT, b SMALLINT
 )
ROW FORMAT SERDE   
  'org.apache.hive.storage.jdbc.JdbcSerDe' 
STORED BY  
  'org.apache.hive.storage.jdbc.JdbcStorageHandler'   
TBLPROPERTIES (
  'hive.sql.database.type'='MYSQL',  
  'hive.sql.dbcp.password'='hive', 
  'hive.sql.dbcp.username'='hive', 
  'hive.sql.jdbc.driver'='com.mysql.jdbc.Driver',  
  'hive.sql.jdbc.url'='jdbc:mysql://hadoop/test',  
  'hive.sql.table'='testtbl');

//Hive query:
select * from hive_table;


{code}
*Failed with exception 
java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.ClassCastException: java.lang.Integer cannot be cast to 
java.lang.Byte*

*Failed with exception 
java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.ClassCastException: java.lang.Integer cannot be cast to 
java.lang.Short*
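
The failure mode can be reproduced without Hive at all (hypothetical sketch; JDBC drivers commonly surface TINYINT/SMALLINT columns as java.lang.Integer from ResultSet.getObject()): a direct cast of the boxed Integer to Byte or Short throws, while converting through Number works.

```java
public class CastDemo {
    public static void main(String[] args) {
        Object fromJdbc = Integer.valueOf(5); // what a driver may hand back for TINYINT
        try {
            Byte b = (Byte) fromJdbc;         // direct cast of the boxed value
            System.out.println("cast ok: " + b);
        } catch (ClassCastException e) {
            System.out.println("ClassCastException: Integer is not a Byte");
        }
        // Safe conversion: go through Number instead of a direct cast.
        byte safe = ((Number) fromJdbc).byteValue();
        System.out.println("safe value: " + safe);
    }
}
```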





[jira] [Created] (HIVE-22409) Logging: Implement QueryID Based Hive Logging

2019-10-26 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-22409:


 Summary: Logging: Implement QueryID Based Hive Logging
 Key: HIVE-22409
 URL: https://issues.apache.org/jira/browse/HIVE-22409
 Project: Hive
  Issue Type: Improvement
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman


Currently all Hive logs are logged in ${sys:hive.log.dir}/${sys:hive.log.file}, which is basically a single log file. Over time it becomes tedious to search the logs, since the logs of multiple Hive queries are interleaved in a single file.

Hence we propose queryID-based Hive logging, where the logs of different queries are written to separate log files based on their queryID.
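
One possible shape for such routing (an illustrative sketch, assuming Log4j2's RoutingAppender and a queryId key being set in the logging ThreadContext; the appender names and pattern are made up, not an actual Hive config):

```xml
<Routing name="query-routing">
  <Routes pattern="$${ctx:queryId}">
    <!-- When no queryId is set, fall back to an existing appender, e.g. DRFA -->
    <Route ref="DRFA" key="$${ctx:queryId}"/>
    <!-- Otherwise write to a per-query log file -->
    <Route>
      <File name="query-${ctx:queryId}"
            fileName="${sys:hive.log.dir}/hive_${ctx:queryId}.log">
        <PatternLayout pattern="%d{ISO8601} %5p [%t] %c{2}: %m%n"/>
      </File>
    </Route>
  </Routes>
</Routing>
```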

CC [~prasanth_j] [~gopalv] [~sseth]  





[jira] [Created] (HIVE-22392) Hive JDBC Storage Handler: Support For Writing Data to JDBC Data Source

2019-10-22 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-22392:


 Summary: Hive JDBC Storage Handler: Support For Writing Data to 
JDBC Data Source
 Key: HIVE-22392
 URL: https://issues.apache.org/jira/browse/HIVE-22392
 Project: Hive
  Issue Type: New Feature
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman


The JDBC Storage Handler supports reading from a JDBC data source in Hive, but writing to a JDBC data source is currently not supported. Hence, this adds support for simple insert queries so that data can be written back to the JDBC data source.





[jira] [Created] (HIVE-21454) Tez default configs get overwritten by MR default configs

2019-03-15 Thread Syed Shameerur Rahman (JIRA)
Syed Shameerur Rahman created HIVE-21454:


 Summary: Tez default configs get overwritten by MR default configs
 Key: HIVE-21454
 URL: https://issues.apache.org/jira/browse/HIVE-21454
 Project: Hive
  Issue Type: Bug
Reporter: Syed Shameerur Rahman


Due to changes done in HIVE-17781, Tez default configs such as tez.counters.max (default value 1200) get overwritten by mapreduce.job.counters.max (default value 120).


