[jira] [Created] (HIVE-24313) Optimise stats collection for file sizes on cloud storage

2020-10-27 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-24313:
---

 Summary: Optimise stats collection for file sizes on cloud storage
 Key: HIVE-24313
 URL: https://issues.apache.org/jira/browse/HIVE-24313
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Reporter: Rajesh Balamohan


When stats information is not present (e.g. external tables), RelOptHiveTable 
computes basic stats at runtime.

The following is the code path.

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/RelOptHiveTable.java#L598]
{code:java}
Statistics stats = StatsUtils.collectStatistics(hiveConf, partitionList,
    hiveTblMetadata, hiveNonPartitionCols, nonPartColNamesThatRqrStats,
    colStatsCached, nonPartColNamesThatRqrStats, true);
{code}
[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L322]
{code:java}
for (Partition p : partList.getNotDeniedPartns()) {
  BasicStats basicStats = basicStatsFactory.build(Partish.buildFor(table, p));
  partStats.add(basicStats);
}
{code}
[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStats.java#L205]

 
{code:java}
try {
  ds = getFileSizeForPath(path);
} catch (IOException e) {
  ds = 0L;
}
{code}
 

For a table & query with a large number of partitions, this takes a long time 
to compute statistics and increases compilation time. It would be good to 
parallelise it via ForkJoinPool, e.g. 
{{partList.getNotDeniedPartns().parallelStream().forEach((p) -> ...)}}
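A minimal, self-contained sketch of the idea (class and helper names are illustrative, not Hive's; the remote-filesystem lookup is simulated): parallelStream() runs the per-partition size lookups on the common ForkJoinPool, overlapping the high-latency calls instead of issuing them sequentially.

```java
import java.util.*;

// Hypothetical sketch of the proposed fix; getFileSizeForPath() stands in
// for the remote-filesystem call that BasicStats makes per partition.
public class ParallelBasicStats {

    static long getFileSizeForPath(String path) {
        // simulated lookup; in Hive this would hit cloud storage
        return path.length() * 1000L;
    }

    static long totalSize(List<String> partitionPaths) {
        // parallelStream() runs on the common ForkJoinPool, overlapping the
        // high-latency calls instead of issuing them one at a time
        return partitionPaths.parallelStream()
                .mapToLong(ParallelBasicStats::getFileSizeForPath)
                .sum();
    }

    public static void main(String[] args) {
        // "p=1" and "p=2" are 3 chars (3000 each), "p=30" is 4 chars (4000)
        System.out.println(totalSize(Arrays.asList("p=1", "p=2", "p=30")));
        // 10000
    }
}
```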

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-24296) NDV adjusted twice causing reducer task underestimation

2020-10-22 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-24296:
---

 Summary: NDV adjusted twice causing reducer task underestimation
 Key: HIVE-24296
 URL: https://issues.apache.org/jira/browse/HIVE-24296
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L2550]

 

{{StatsRulesProcFactory::updateColStats}}:
{code:java}
if (ratio <= 1.0) {
  newDV = (long) Math.ceil(ratio * oldDV);
}

cs.setCountDistint(newDV);
{code}
Though RelOptHiveTable already has the latest statistics, the NDV is adjusted 
again in {{StatsRulesProcFactory::updateColStats}}, and the adjustment is done 
at linear scale.

 

Because of this, the downstream vertex gets fewer tasks, causing latency 
issues.

E.g. TPC-DS Q10 at 10 TB scale. Attaching a snippet of "explain analyze" which 
shows the stats underestimation.

"Reducer 13" is underestimated 10x when compared to runtime details. The 
projected NDV from RelOptHiveTable was around 65989699.

However, due to the ratio calculation in StatsRulesProcFactory, it gets 
readjusted to ((948122598 / 14291978461) * 65989699) ~= 4377723.

It would be good to remove the static readjustment in StatsRulesProcFactory.
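The readjustment above can be reproduced with a small standalone sketch. The method mirrors the ratio logic quoted earlier (outside Hive; names are illustrative):

```java
public class NdvReadjustment {

    // Minimal reproduction of the linear NDV scaling in
    // StatsRulesProcFactory::updateColStats: newDV = ceil(ratio * oldDV)
    static long adjust(long selectedRows, long totalRows, long oldDV) {
        double ratio = (double) selectedRows / (double) totalRows;
        // the quoted snippet only rescales when the ratio is <= 1.0
        return ratio <= 1.0 ? (long) Math.ceil(ratio * oldDV) : oldDV;
    }

    public static void main(String[] args) {
        // Numbers from the Q10 example: the 65,989,699 NDV projected by the
        // CBO gets scaled down roughly 15x, which is what later underestimates
        // Reducer 13
        System.out.println(adjust(948122598L, 14291978461L, 65989699L));
        // prints ~4.37M, the underestimated value
    }
}
```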
{noformat}
Edges:
Map 10 <- Map 9 (BROADCAST_EDGE)
Map 12 <- Map 9 (BROADCAST_EDGE)
Map 2 <- Map 7 (BROADCAST_EDGE)
Map 8 <- Map 9 (BROADCAST_EDGE), Reducer 6 (BROADCAST_EDGE)
Reducer 11 <- Map 10 (SIMPLE_EDGE)
Reducer 13 <- Map 12 (SIMPLE_EDGE)
Reducer 3 <- Map 1 (BROADCAST_EDGE), Map 2 (CUSTOM_SIMPLE_EDGE), Map 8 
(CUSTOM_SIMPLE_EDGE), Reducer 11 (BROADCAST_EDGE), Reducer 13 (BROADCAST_EDGE)
Reducer 4 <- Reducer 3 (SIMPLE_EDGE)
Reducer 5 <- Reducer 4 (SIMPLE_EDGE)
Reducer 6 <- Map 2 (CUSTOM_SIMPLE_EDGE)


Map 12
Map Operator Tree:
TableScan
  alias: catalog_sales
  filterExpr: cs_ship_customer_sk is not null (type: boolean)
  Statistics: Num rows: 14327953968/552509183 Data size: 
228959459440 Basic stats: COMPLETE Column stats: COMPLETE
  Filter Operator
predicate: cs_ship_customer_sk is not null (type: boolean)
Statistics: Num rows: 14291978461/551122492 Data size: 
228384573968 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
  expressions: cs_ship_customer_sk (type: bigint), 
cs_sold_date_sk (type: bigint)
  outputColumnNames: _col0, _col1
  Statistics: Num rows: 14291978461/551122492 Data size: 
228384573968 Basic stats: COMPLETE Column stats: COMPLETE
  Map Join Operator
condition map:
 Inner Join 0 to 1
keys:
  0 _col1 (type: bigint)
  1 _col0 (type: bigint)
outputColumnNames: _col0
input vertices:
  1 Map 9
Statistics: Num rows: 948122598/551122492 Data size: 
7297899376 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator
  keys: _col0 (type: bigint)
  minReductionHashAggr: 0.99
  mode: hash
  outputColumnNames: _col0
  Statistics: Num rows: 126954025/61576194 Data size: 
977191880 Basic stats: COMPLETE Column stats: COMPLETE
  Reduce Output Operator
key expressions: _col0 (type: bigint)
null sort order: a
sort order: +
Map-reduce partition columns: _col0 (type: bigint)
Statistics: Num rows: 126954025/61576194 Data size: 
977191880 Basic stats: COMPLETE Column stats: COMPLETE

...
...
Reducer 13
Execution mode: vectorized, llap
Reduce Operator Tree:
  Group By Operator
keys: KEY._col0 (type: bigint)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 4377725/40166690 Data size: 33696280 
Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
  expressions: true (type: boolean), _col0 (type: bigint)
  outputColumnNames: _col0, _col1
  Statistics: Num rows: 4377725/40166690 Data size: 51207180 
Basic stats: COMPLETE Column stats: COMPLETE
  Reduce Output Operator
key expressions: _col1 (type: bigint)
null sort order: a
sort order: +

[jira] [Created] (HIVE-24290) Explain analyze can be slow in cloud storage

2020-10-20 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-24290:
---

 Summary: Explain analyze can be slow in cloud storage
 Key: HIVE-24290
 URL: https://issues.apache.org/jira/browse/HIVE-24290
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


"explain analyze" takes a lot longer to exit via the following path, 
specifically in cloud environments (where the local volume could be EFS).  
HIVE-24270 is a related ticket as well.

 
{noformat}
at java.io.UnixFileSystem.delete0(Native Method)
at java.io.UnixFileSystem.delete(UnixFileSystem.java:265)
at java.io.File.delete(File.java:1043)
at org.apache.hadoop.fs.FileUtil.deleteImpl(FileUtil.java:229)
at org.apache.hadoop.fs.FileUtil.fullyDeleteContents(FileUtil.java:270)
at org.apache.hadoop.fs.FileUtil.fullyDelete(FileUtil.java:182)
at org.apache.hadoop.fs.FileUtil.fullyDelete(FileUtil.java:153)
at 
org.apache.hadoop.fs.RawLocalFileSystem.delete(RawLocalFileSystem.java:453)
at 
org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:685)
at 
org.apache.hadoop.hive.ql.stats.fs.FSStatsAggregator.closeConnection(FSStatsAggregator.java:115)
at 
org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.aggregateStats(ExplainSemanticAnalyzer.java:261)
at 
org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:156)
at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:288)
at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:221)
at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:104)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:188)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:600)
at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:546)
at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:540)
at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:127)

 {noformat}





[jira] [Created] (HIVE-24289) RetryingMetaStoreClient should not retry connecting to HMS on genuine errors

2020-10-19 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-24289:
---

 Summary: RetryingMetaStoreClient should not retry connecting to 
HMS on genuine errors
 Key: HIVE-24289
 URL: https://issues.apache.org/jira/browse/HIVE-24289
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


When there is a genuine error from HMS, the call should not be retried in 
RetryingMetaStoreClient. 

For example, the following query would be retried multiple times (~20+ times) 
in HMS, causing a huge delay in processing, even though the constraint is 
already available in HMS. 

It should just throw the exception to the client and stop retrying in such 
cases.

{noformat}
alter table web_sales add constraint tpcds_bin_partitioned_orc_1_ws_s_hd 
foreign key  (ws_ship_hdemo_sk) references household_demographics (hd_demo_sk) 
disable novalidate rely;

org.apache.hadoop.hive.ql.metadata.HiveException: 
org.apache.thrift.TApplicationException: Internal error processing 
add_foreign_key
at org.apache.hadoop.hive.ql.metadata.Hive.addForeignKey(Hive.java:5914)
..
...
Caused by: org.apache.thrift.TApplicationException: Internal error processing 
add_foreign_key
   at 
org.apache.thrift.TApplicationException.read(TApplicationException.java:111)
   at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
   at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_add_foreign_key(ThriftHiveMetastore.java:1872)
{noformat}

https://github.com/apache/hive/blob/master/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/RetryingMetaStoreClient.java#L256

For example, if the exception message contains "Internal error processing", 
the client could stop retrying.
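A hedged sketch of what such a check could look like (the predicate and the message matching are illustrative; RetryingMetaStoreClient's actual retry loop is more involved):

```java
public class RetryDecision {

    // Hypothetical predicate: treat TApplicationException messages that
    // report "Internal error processing <method>" as genuine, non-retryable
    // server-side failures, as the ticket suggests.
    static boolean isRetryable(Exception e) {
        String msg = e.getMessage();
        return msg == null || !msg.contains("Internal error processing");
    }

    public static void main(String[] args) {
        Exception genuine = new RuntimeException(
            "Internal error processing add_foreign_key");
        Exception transientErr = new RuntimeException("Connection reset");
        System.out.println(isRetryable(genuine));      // false: fail fast
        System.out.println(isRetryable(transientErr)); // true: worth retrying
    }
}
```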





[jira] [Created] (HIVE-24262) Optimise NullScanTaskDispatcher for cloud storage

2020-10-12 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-24262:
---

 Summary: Optimise NullScanTaskDispatcher for cloud storage
 Key: HIVE-24262
 URL: https://issues.apache.org/jira/browse/HIVE-24262
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


{noformat}
select count(DISTINCT ss_sold_date_sk) from store_sales;

--
VERTICES  MODESTATUS  TOTAL  COMPLETED  RUNNING  PENDING  
FAILED  KILLED
--
Map 1 .. container SUCCEEDED  1  100
   0   0
Reducer 2 .. container SUCCEEDED  1  100
   0   0
--
VERTICES: 02/02  [==>>] 100%  ELAPSED TIME: 5.55 s
--
INFO  : Status: DAG finished successfully in 5.44 seconds
INFO  :
INFO  : Query Execution Summary
INFO  : 
--
INFO  : OPERATIONDURATION
INFO  : 
--
INFO  : Compile Query 102.02s
INFO  : Prepare Plan0.51s
INFO  : Get Query Coordinator (AM)  0.01s
INFO  : Submit Plan 0.33s
INFO  : Start DAG   0.56s
INFO  : Run DAG 5.44s
INFO  : 
--

{noformat}

The reason is that it ends up doing an "isEmptyPath" check for every 
partition path, which takes a lot of time in the compilation phase.


If all the paths share the same parent directory, we could do a single 
recursive listing (instead of listing each directory sequentially, one at a 
time), which is far cheaper on cloud storage systems.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/NullScanTaskDispatcher.java#L158

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/NullScanTaskDispatcher.java#L121

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/NullScanTaskDispatcher.java#L101
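A simplified, illustrative sketch of the single-listing idea. The recursive listing is simulated with a string list; in Hive it would be one recursive FileSystem listing of the common parent, replacing one isEmptyPath() call per partition:

```java
import java.util.*;

public class NullScanListing {

    // Given one recursive listing of the common parent, mark every partition
    // directory that contains at least one file. All other partitions are
    // empty and eligible for the null-scan optimisation.
    static Set<String> nonEmptyPartitions(String parent,
                                          List<String> recursiveListing,
                                          List<String> partitionDirs) {
        Set<String> nonEmpty = new HashSet<>();
        for (String file : recursiveListing) {   // one listing, consulted once
            for (String dir : partitionDirs) {
                if (file.startsWith(parent + "/" + dir + "/")) {
                    nonEmpty.add(dir);
                }
            }
        }
        return nonEmpty;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
            "warehouse/t/ds=1/000000_0", "warehouse/t/ds=3/000000_0");
        Set<String> res = nonEmptyPartitions("warehouse/t", files,
            Arrays.asList("ds=1", "ds=2", "ds=3"));
        System.out.println(res.contains("ds=2")); // false: ds=2 is empty
    }
}
```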


With a temp hacky fix, it comes down to 2 seconds from 100+ seconds.

{noformat}
INFO  : Dag name: select count(DISTINCT ss_sold_...store_sales (Stage-1)
INFO  : Status: Running (Executing on YARN cluster with App id 
application_1602500203747_0003)

--
VERTICES  MODESTATUS  TOTAL  COMPLETED  RUNNING  PENDING  
FAILED  KILLED
--
Map 1 .. container SUCCEEDED  1  100
   0   0
Reducer 2 .. container SUCCEEDED  1  100
   0   0
--
VERTICES: 02/02  [==>>] 100%  ELAPSED TIME: 1.23 s
--
INFO  : Status: DAG finished successfully in 1.20 seconds
INFO  :
INFO  : Query Execution Summary
INFO  : 
--
INFO  : OPERATIONDURATION
INFO  : 
--
INFO  : Compile Query   0.85s
INFO  : Prepare Plan0.17s
INFO  : Get Query Coordinator (AM)  0.00s
INFO  : Submit Plan 0.03s
INFO  : Start DAG   0.03s
INFO  : Run DAG 1.20s
INFO  : 
--
{noformat}







[jira] [Created] (HIVE-24234) Improve checkHashModeEfficiency in VectorGroupByOperator

2020-10-06 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-24234:
---

 Summary: Improve checkHashModeEfficiency in VectorGroupByOperator
 Key: HIVE-24234
 URL: https://issues.apache.org/jira/browse/HIVE-24234
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


Currently, {{VectorGroupByOperator::checkHashModeEfficiency}} compares the 
number of hashtable entries with the number of input records that have been 
processed. For grouping sets, it accounts for the grouping set length as well.

The issue is that the condition becomes invalid after processing a large 
number of input records, which prevents the system from switching over to 
streaming mode. 

E.g. assume 500,000 input records processed, with 9 grouping sets, and 
100,000 entries in the hashtable. The hashtable would never cross 4,500,000 
entries, as the max size itself is 1M by default. 

It would be good to compare the input records (adjusted for grouping sets) 
with the number of output records (along with the hashtable size) to determine 
hashing or streaming mode.

E.g. Q67.
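A simplified sketch of the condition described above (names and the exact threshold are illustrative; the real logic lives in VectorGroupByOperator):

```java
public class HashModeCheck {

    // With grouping sets, each input row produces `groupingSets` keys, so the
    // entry count is compared against inputRecords * groupingSets.
    static boolean keepHashMode(long hashEntries, long inputRecords,
                                int groupingSets, float minReduction) {
        long effectiveInput = inputRecords * (long) groupingSets;
        // hash mode counts as "efficient" while entries stay a small
        // fraction of the (grouping-set-adjusted) input
        return hashEntries < effectiveInput * minReduction;
    }

    public static void main(String[] args) {
        // 500k input rows x 9 grouping sets = 4.5M effective input, but the
        // hashtable is capped at 1M entries by default, so once enough input
        // has been processed this condition can never fail and the operator
        // never switches to streaming mode
        System.out.println(keepHashMode(100_000, 500_000, 9, 0.5f)); // true
    }
}
```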







[jira] [Created] (HIVE-24212) Refactor to take advantage of listStatus optimisations in cloud storage connectors

2020-09-30 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-24212:
---

 Summary: Refactor to take advantage of listStatus optimisations in 
cloud storage connectors
 Key: HIVE-24212
 URL: https://issues.apache.org/jira/browse/HIVE-24212
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


https://issues.apache.org/jira/browse/HADOOP-17022, 
https://issues.apache.org/jira/browse/HADOOP-17281, 
https://issues.apache.org/jira/browse/HADOOP-16830 etc. help in reducing the 
number of roundtrips to remote systems in cloud storage.

Creating this ticket to do minor refactoring to take advantage of the above 
optimizations.







[jira] [Created] (HIVE-24207) LimitOperator can leverage ObjectCache to bail out quickly

2020-09-29 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-24207:
---

 Summary: LimitOperator can leverage ObjectCache to bail out quickly
 Key: HIVE-24207
 URL: https://issues.apache.org/jira/browse/HIVE-24207
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


{noformat}
select  ss_sold_date_sk from store_sales, date_dim where date_dim.d_year in 
(1998,1998+1,1998+2) and store_sales.ss_sold_date_sk = date_dim.d_date_sk limit 
100;

 select distinct ss_sold_date_sk from store_sales, date_dim where 
date_dim.d_year in (1998,1998+1,1998+2) and store_sales.ss_sold_date_sk = 
date_dim.d_date_sk limit 100;

 {noformat}

Queries like the above generate a large number of map tasks. Currently they 
don't bail out after generating enough data. 

It would be good to make use of ObjectCache to retain the number of records 
generated, so that LimitOperator/VectorLimitOperator can bail out in later 
tasks during the operator's init phase itself. 

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorLimitOperator.java#L57

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/LimitOperator.java#L58
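A hypothetical sketch of the idea, with a plain ConcurrentHashMap standing in for the Tez ObjectCache (key and method names are illustrative):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class LimitCacheSketch {

    // A per-query cache keyed by operator id stores how many rows have been
    // produced so far. Later tasks consult it in init() and bail out
    // immediately once the limit is already satisfied.
    static final ConcurrentHashMap<String, AtomicLong> CACHE =
        new ConcurrentHashMap<>();

    static void recordRows(String opId, long rows) {
        CACHE.computeIfAbsent(opId, k -> new AtomicLong()).addAndGet(rows);
    }

    static boolean shouldBailOut(String opId, long limit) {
        return CACHE.computeIfAbsent(opId, k -> new AtomicLong()).get() >= limit;
    }

    public static void main(String[] args) {
        recordRows("LIM_1", 100);            // an earlier task hit the limit
        // a later task can skip its scan entirely in init()
        System.out.println(shouldBailOut("LIM_1", 100)); // true
    }
}
```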





[jira] [Created] (HIVE-24205) Optimise CuckooSetBytes

2020-09-28 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-24205:
---

 Summary: Optimise CuckooSetBytes
 Key: HIVE-24205
 URL: https://issues.apache.org/jira/browse/HIVE-24205
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan
 Attachments: Screenshot 2020-09-28 at 4.29.24 PM.png

{{FilterStringColumnInList, StringColumnInList}}  etc use CuckooSetBytes for 
lookup.

!Screenshot 2020-09-28 at 4.29.24 PM.png|width=714,height=508!

One option to optimise would be to add boundary conditions on "length", using 
the min/max key lengths stored alongside the hashes. This would significantly 
reduce the number of hash computations that need to happen. E.g. 
[TPCH-Q12|https://github.com/hortonworks/hive-testbench/blob/hdp3/sample-queries-tpch/tpch_query12.sql#L20]
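An illustrative sketch of the boundary check, using a plain HashSet instead of the cuckoo hash; the point is the cheap length-based reject before any hash computation:

```java
import java.nio.charset.StandardCharsets;
import java.util.*;

public class LengthBoundedLookup {

    // Track min/max byte length of the stored keys and reject probes outside
    // that range before doing any hashing.
    private final Set<String> set = new HashSet<>();
    private int minLen = Integer.MAX_VALUE;
    private int maxLen = 0;

    void insert(byte[] key) {
        minLen = Math.min(minLen, key.length);
        maxLen = Math.max(maxLen, key.length);
        set.add(new String(key, StandardCharsets.UTF_8));
    }

    boolean lookup(byte[] key) {
        if (key.length < minLen || key.length > maxLen) {
            return false;            // cheap reject, no hash computation
        }
        return set.contains(new String(key, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        LengthBoundedLookup l = new LengthBoundedLookup();
        l.insert("AIR".getBytes(StandardCharsets.UTF_8));     // 3 bytes
        l.insert("TRUCK".getBytes(StandardCharsets.UTF_8));   // 5 bytes
        // "REG AIR" is 7 bytes, outside [3, 5]: rejected without hashing
        System.out.println(
            l.lookup("REG AIR".getBytes(StandardCharsets.UTF_8))); // false
    }
}
```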





[jira] [Created] (HIVE-24116) LLAP: Provide an opportunity for preempted tasks to get better locality in next iteration

2020-09-03 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-24116:
---

 Summary: LLAP: Provide an opportunity for preempted tasks to get 
better locality in next iteration
 Key: HIVE-24116
 URL: https://issues.apache.org/jira/browse/HIVE-24116
 Project: Hive
  Issue Type: Improvement
  Components: llap
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan


In certain DAGs, tasks get preempted as higher priority tasks need to be 
executed. These preempted tasks are scheduled to run later, but they end up 
missing locality information. Ref: HIVE-24061

Remote storage reads can be avoided if these preempted tasks are given an 
opportunity to get better locality in the next iteration.

 





[jira] [Created] (HIVE-24075) Optimise KeyValuesInputMerger

2020-08-25 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-24075:
---

 Summary: Optimise KeyValuesInputMerger
 Key: HIVE-24075
 URL: https://issues.apache.org/jira/browse/HIVE-24075
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


Comparisons in KeyValuesInputMerger can be reduced.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/tools/KeyValuesInputMerger.java#L165

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/tools/KeyValuesInputMerger.java#L150

If the reader comparisons in the queue turn out the same, we could reuse 
{{nextKVReaders}} in the subsequent iteration instead of doing the 
comparisons all over again.

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/tools/KeyValuesInputMerger.java#L178]

 





[jira] [Created] (HIVE-24061) Improve llap task scheduling for better cache hit rate

2020-08-24 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-24061:
---

 Summary: Improve llap task scheduling for better cache hit rate 
 Key: HIVE-24061
 URL: https://issues.apache.org/jira/browse/HIVE-24061
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


TaskInfo is initialized with the "requestTime and locality delay". When lots 
of vertices are at the same level, "taskInfo" details are available upfront. 
By the time scheduling happens, "requestTime + localityDelay" won't be higher 
than the current time. Due to this, the scheduler misses the locality delay 
window and ends up choosing a random node. This misses cache hits and reads 
data from remote storage.

E.g. this pattern was observed in Q75 of TPC-DS.

Related lines of interest in the scheduler: 
https://github.com/apache/hive/blob/master/llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java
{code:java}
boolean shouldDelayForLocality =
    request.shouldDelayForLocality(schedulerAttemptTime);
..
..
boolean shouldDelayForLocality(long schedulerAttemptTime) {
  return localityDelayTimeout > schedulerAttemptTime;
}
{code}
 

Ideally, "localityDelayTimeout" should be adjusted based on its first 
scheduling opportunity.
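A minimal sketch of the proposed adjustment (method names are illustrative): derive the timeout from the first scheduling attempt rather than the request time, so the check quoted above can still delay for locality even when the task was created long before it is considered.

```java
public class LocalityDelaySketch {

    // Base the locality timeout on the first scheduling attempt instead of
    // task creation time, so preregistered tasks still get a locality window.
    static long localityDelayTimeout(long firstSchedulingAttemptTime,
                                     long localityDelayMs) {
        return firstSchedulingAttemptTime + localityDelayMs;
    }

    // Same shape as the check in LlapTaskSchedulerService quoted above.
    static boolean shouldDelayForLocality(long schedulerAttemptTime,
                                          long timeout) {
        return timeout > schedulerAttemptTime;
    }

    public static void main(String[] args) {
        long firstAttempt = 10_000;
        long timeout = localityDelayTimeout(firstAttempt, 250);
        // an attempt shortly after the first one still waits for locality,
        // instead of immediately falling through to a random node
        System.out.println(shouldDelayForLocality(10_100, timeout)); // true
    }
}
```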





[jira] [Created] (HIVE-24029) MV fails for queries with subqueries

2020-08-11 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-24029:
---

 Summary: MV fails for queries with subqueries
 Key: HIVE-24029
 URL: https://issues.apache.org/jira/browse/HIVE-24029
 Project: Hive
  Issue Type: Sub-task
  Components: Materialized views
Reporter: Rajesh Balamohan



{noformat}
 explain create materialized view q16 as select  
   count(distinct cs_order_number) as `order count`
  ,sum(cs_ext_ship_cost) as `total shipping cost`
  ,sum(cs_net_profit) as `total net profit`
from
   catalog_sales cs1
  ,date_dim
  ,customer_address
  ,call_center
where
d_date between '1999-4-01' and 
   (cast('1999-4-01' as date) + 60 days)
and cs1.cs_ship_date_sk = d_date_sk
and cs1.cs_ship_addr_sk = ca_address_sk
and ca_state = 'IL'
and cs1.cs_call_center_sk = cc_call_center_sk
and cc_county in ('Richland County','Bronx County','Maverick County','Mesa 
County',
  'Raleigh County'
)
and exists (select *
from catalog_sales cs2
where cs1.cs_order_number = cs2.cs_order_number
  and cs1.cs_warehouse_sk <> cs2.cs_warehouse_sk)
and not exists(select *
   from catalog_returns cr1
   where cs1.cs_order_number = cr1.cr_order_number)
{noformat}

Error
{noformat}
Error: Error while compiling statement: FAILED: SemanticException [Error 
10249]: Line 24:8 Unsupported SubQuery Expression 'cr_order_number': Only 1 
SubQuery expression is supported. (state=42000,code=10249)
{noformat}





[jira] [Created] (HIVE-24028) MV query fails with CalciteViewSemanticException

2020-08-11 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-24028:
---

 Summary: MV query fails with CalciteViewSemanticException
 Key: HIVE-24028
 URL: https://issues.apache.org/jira/browse/HIVE-24028
 Project: Hive
  Issue Type: Sub-task
  Components: Materialized views
Reporter: Rajesh Balamohan



{noformat}
explain create materialized view qmv39 as 
with inv as
(select w_warehouse_name,w_warehouse_sk,i_item_sk,d_moy
   ,stdev,mean, case mean when 0 then null else stdev/mean end cov
 from(select w_warehouse_name,w_warehouse_sk,i_item_sk,d_moy
,stddev_samp(inv_quantity_on_hand) stdev,avg(inv_quantity_on_hand) 
mean
  from inventory
  ,item
  ,warehouse
  ,date_dim
  where inv_item_sk = i_item_sk
and inv_warehouse_sk = w_warehouse_sk
and inv_date_sk = d_date_sk
and d_year =2000
  group by w_warehouse_name,w_warehouse_sk,i_item_sk,d_moy) foo
 where case mean when 0 then 0 else stdev/mean end > 1)
select inv1.w_warehouse_sk,inv1.i_item_sk,inv1.d_moy,inv1.mean, inv1.cov
,inv2.w_warehouse_sk,inv2.i_item_sk,inv2.d_moy,inv2.mean, inv2.cov
from inv inv1,inv inv2
where inv1.i_item_sk = inv2.i_item_sk
  and inv1.w_warehouse_sk =  inv2.w_warehouse_sk
  and inv1.d_moy=2
  and inv2.d_moy=2+1
{noformat}

{noformat}
Error: Error while compiling statement: FAILED: SemanticException 
org.apache.hadoop.hive.ql.optimizer.calcite.CalciteViewSemanticException: 
Duplicate column name: w_warehouse_sk (state=42000,code=4)
{noformat}





[jira] [Created] (HIVE-24027) Add support for `intersect` keyword in MV

2020-08-11 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-24027:
---

 Summary: Add support for `intersect` keyword in MV
 Key: HIVE-24027
 URL: https://issues.apache.org/jira/browse/HIVE-24027
 Project: Hive
  Issue Type: Sub-task
  Components: Materialized views
Reporter: Rajesh Balamohan



{noformat}
explain create materialized view mv as  select distinct c_last_name, 
c_first_name, d_date
from store_sales, date_dim, customer
  where store_sales.ss_sold_date_sk = date_dim.d_date_sk
  and store_sales.ss_customer_sk = customer.c_customer_sk
  and d_month_seq between 1186 and 1186 + 11
  intersect
select distinct c_last_name, c_first_name, d_date
from catalog_sales, date_dim, customer
  where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
  and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
  and d_month_seq between 1186 and 1186 + 11
  intersect
select distinct c_last_name, c_first_name, d_date
from web_sales, date_dim, customer
  where web_sales.ws_sold_date_sk = date_dim.d_date_sk
  and web_sales.ws_bill_customer_sk = customer.c_customer_sk
  and d_month_seq between 1186 and 1186 + 11
{noformat}


This query fails with the following error msg

{noformat}
Error: Error while compiling statement: FAILED: SemanticException Cannot enable 
automatic rewriting for materialized view. Statement has unsupported operator: 
intersect. (state=42000,code=4)
{noformat}






[jira] [Created] (HIVE-24022) Optimise HiveMetaStoreAuthorizer.createHiveMetaStoreAuthorizer

2020-08-10 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-24022:
---

 Summary: Optimise 
HiveMetaStoreAuthorizer.createHiveMetaStoreAuthorizer
 Key: HIVE-24022
 URL: https://issues.apache.org/jira/browse/HIVE-24022
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


For a table with 3000+ partitions, analyze table takes a lot longer, as 
HiveMetaStoreAuthorizer creates a new HiveConf for every partition request.

 

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/security/authorization/plugin/metastore/HiveMetaStoreAuthorizer.java#L319]

 

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/security/authorization/plugin/metastore/HiveMetaStoreAuthorizer.java#L447]
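A hypothetical memoization sketch ("Conf" stands in for HiveConf, whose construction is expensive because it re-parses configuration resources; the cache key is illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AuthorizerConfCache {

    // Stand-in for HiveConf; the real constructor parses config resources,
    // which is the per-partition cost the ticket describes.
    static class Conf {
        Conf() { /* expensive resource parsing in the real class */ }
    }

    private static final Map<String, Conf> CACHE = new ConcurrentHashMap<>();

    // Build the configuration once per authorizer instance rather than once
    // per partition request.
    static Conf getConf(String authorizerId) {
        return CACHE.computeIfAbsent(authorizerId, id -> new Conf());
    }

    public static void main(String[] args) {
        Conf first = getConf("metastore-authorizer");
        Conf second = getConf("metastore-authorizer");
        System.out.println(first == second); // true: constructed only once
    }
}
```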






[jira] [Created] (HIVE-23953) Use task counter information to compute keycount during hashtable loading

2020-07-29 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23953:
---

 Summary: Use task counter information to compute keycount during 
hashtable loading
 Key: HIVE-23953
 URL: https://issues.apache.org/jira/browse/HIVE-23953
 Project: Hive
  Issue Type: Bug
Reporter: Rajesh Balamohan


There are cases where the compiler misestimates the key count, resulting in a 
number of hashtable resizes at runtime.

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/fast/VectorMapJoinFastHashTableLoader.java#L128]

In such cases, it would be good to get the "approximate_input_records" 
counter (TEZ-4207) from upstream to compute the key count more accurately at 
runtime.
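An illustrative sketch of how the runtime counter could feed the hashtable's initial capacity (the method, the fallback rule, and the sizing formula are assumptions, not Hive's code):

```java
public class KeyCountEstimate {

    // Prefer the runtime "approximate_input_records" counter over the
    // compile-time estimate when sizing the hashtable, to avoid repeated
    // resizes caused by misestimation.
    static int initialCapacity(long compileTimeKeyCount,
                               long approxInputRecords, float loadFactor) {
        long keys = approxInputRecords > 0 ? approxInputRecords
                                           : compileTimeKeyCount;
        // size so the table stays under its load factor without resizing
        return (int) Math.min(Integer.MAX_VALUE,
                              (long) (keys / loadFactor) + 1);
    }

    public static void main(String[] args) {
        // the compiler guessed 1k keys but the counter reports 1M at runtime
        System.out.println(initialCapacity(1_000, 1_000_000, 0.75f));
    }
}
```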

 





[jira] [Created] (HIVE-23952) Reuse VectorAggregationBuffer to reduce GC pressure in VectorGroupByOperator

2020-07-29 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23952:
---

 Summary: Reuse VectorAggregationBuffer to reduce GC pressure in 
VectorGroupByOperator
 Key: HIVE-23952
 URL: https://issues.apache.org/jira/browse/HIVE-23952
 Project: Hive
  Issue Type: Bug
Reporter: Rajesh Balamohan
 Attachments: Screenshot 2020-07-30 at 7.38.13 AM.png

!Screenshot 2020-07-30 at 7.38.13 AM.png|width=1171,height=892!

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorGroupByOperator.java]
{code:java}
aggregationBuffer = allocateAggregationBuffer(); {code}
Flushed-out aggregation buffers could be reused here, instead of allocating 
new ones every time, to reduce GC pressure.
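A hypothetical free-list sketch of the reuse idea (AggBuffer is a stand-in for the real aggregation buffer type):

```java
import java.util.ArrayDeque;

public class BufferPoolSketch {

    // Stand-in for a vectorized aggregation buffer.
    static class AggBuffer {
        long sum;
        void reset() { sum = 0; }
    }

    // Flushed buffers are cleared and pooled for reuse instead of being
    // reallocated, cutting allocation churn and GC pressure.
    private final ArrayDeque<AggBuffer> pool = new ArrayDeque<>();

    AggBuffer allocate() {
        AggBuffer b = pool.poll();
        return b != null ? b : new AggBuffer();
    }

    void release(AggBuffer b) {
        b.reset();
        pool.push(b);
    }

    public static void main(String[] args) {
        BufferPoolSketch p = new BufferPoolSketch();
        AggBuffer a = p.allocate();
        a.sum = 42;
        p.release(a);                          // flushed: cleared and pooled
        System.out.println(p.allocate() == a); // true: same object reused
    }
}
```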





[jira] [Created] (HIVE-23936) Provide approximate number of input records to be processed in broadcast reader

2020-07-27 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23936:
---

 Summary: Provide approximate number of input records to be 
processed in broadcast reader
 Key: HIVE-23936
 URL: https://issues.apache.org/jira/browse/HIVE-23936
 Project: Hive
  Issue Type: Bug
Reporter: Rajesh Balamohan


There are cases when broadcasted data is loaded into a hashtable in upstream 
applications (e.g. Hive). Apps tend to predict the number of entries in the 
hashtable diligently, but there are cases where these estimates are hard to 
get right at compile time.

 

Tez can help in such cases by providing an "approximate number of input 
records" counter for data processed via UnorderedKVInput. This would avoid 
expensive rehashing when hashtable sizes are not estimated correctly. It would 
be good to start with the broadcast case first and then move on to the 
unordered partitioned case later.

 

This would help in predicting the number of entries at runtime and can yield 
better estimates for the hashtable.





[jira] [Created] (HIVE-23917) Reset key access count during eviction in VectorGroupByOperator

2020-07-23 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23917:
---

 Summary: Reset key access count during eviction in 
VectorGroupByOperator
 Key: HIVE-23917
 URL: https://issues.apache.org/jira/browse/HIVE-23917
 Project: Hive
  Issue Type: Bug
Reporter: Rajesh Balamohan


Follow up of https://issues.apache.org/jira/browse/HIVE-23843

There can be a case (depending on the data) where a large number of entries 
in the aggregation map exceed the average access count, which could prevent 
the 10% flushing limit from being reached. Resetting the access count on the 
evicted entries would help prevent this.
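An illustrative sketch of eviction with the proposed reset (the map and the averaging are simplified stand-ins for the operator's internal structures):

```java
import java.util.*;

public class AccessCountEviction {

    // key -> access count, standing in for the aggregation map's bookkeeping
    static final Map<String, Integer> access = new HashMap<>();

    // Evict entries accessed less than average; reset the counters of the
    // survivors so a few "hot" old keys cannot keep the average high enough
    // to block future flushes.
    static List<String> evictBelowAverage() {
        double avg = access.values().stream()
            .mapToInt(Integer::intValue).average().orElse(0);
        List<String> evicted = new ArrayList<>();
        for (Iterator<Map.Entry<String, Integer>> it =
                 access.entrySet().iterator(); it.hasNext(); ) {
            Map.Entry<String, Integer> e = it.next();
            if (e.getValue() < avg) {
                evicted.add(e.getKey());
                it.remove();
            } else {
                e.setValue(0);   // the proposed reset for surviving entries
            }
        }
        return evicted;
    }

    public static void main(String[] args) {
        access.put("a", 100);
        access.put("b", 1);
        System.out.println(evictBelowAverage()); // [b]; "a" is reset to 0
    }
}
```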





[jira] [Created] (HIVE-23884) SemanticAnalyze exception when addressing field with table name in group by

2020-07-20 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23884:
---

 Summary: SemanticAnalyze exception when addressing field with 
table name in group by
 Key: HIVE-23884
 URL: https://issues.apache.org/jira/browse/HIVE-23884
 Project: Hive
  Issue Type: Bug
Reporter: Rajesh Balamohan


{noformat}
explain cbo 
select  `item`.`i_item_id`,
`store`.`s_state`, grouping(s_state) `g_state` from  
`tpcds_bin_partitioned_orc_1`.`store`, 
`tpcds_bin_partitioned_orc_1`.`item`
where `store`.`s_state` in ('AL','IN', 'SC', 'NY', 'OH', 'FL')
group by rollup (`item`.`i_item_id`, `s_state`)

CBO PLAN:


HiveProject(i_item_id=[$0], s_state=[$1], g_state=[grouping($2, 0:BIGINT)])
  HiveAggregate(group=[{0, 1}], groups=[[{0, 1}, {0}, {}]], 
GROUPING__ID=[GROUPING__ID()])
HiveJoin(condition=[true], joinType=[inner], algorithm=[none], cost=[not 
available])
  HiveProject(i_item_id=[$1])
HiveTableScan(table=[[tpcds_bin_partitioned_orc_1, item]], 
table:alias=[item])
  HiveProject(s_state=[$24])
HiveFilter(condition=[IN($24, _UTF-16LE'AL', _UTF-16LE'IN', 
_UTF-16LE'SC', _UTF-16LE'NY', _UTF-16LE'OH', _UTF-16LE'FL')])
  HiveTableScan(table=[[tpcds_bin_partitioned_orc_1, store]], 
table:alias=[store])
{noformat}
 

However, using the fully qualified field name *`store`.`s_state`* in the 
rollup throws a SemanticAnalyzer exception:

 
{noformat}
explain cbo 
select  `item`.`i_item_id`,
`store`.`s_state`, grouping(s_state) `g_state` from  
`tpcds_bin_partitioned_orc_1`.`store`, 
`tpcds_bin_partitioned_orc_1`.`item`
where `store`.`s_state` in ('AL','IN', 'SC', 'NY', 'OH', 'FL')
group by rollup (`item`.`i_item_id`, `store`.`s_state`)

Error: Error while compiling statement: FAILED: RuntimeException [Error 10409]: 
Expression in GROUPING function not present in GROUP BY (state=42000,code=10409)

{noformat}
The exception below is based on 3.x, but it should mostly occur in master as well.

Related ticket: https://issues.apache.org/jira/browse/HIVE-15996
{noformat}
Caused by: java.lang.RuntimeException: Expression in GROUPING function not 
present in GROUP BY
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer$2.post(SemanticAnalyzer.java:3296)
 ~[hive-exec-3.1xyz]
at org.antlr.runtime.tree.TreeVisitor.visit(TreeVisitor.java:66) 
~[antlr-runtime-3.5.2.jar:3.5.2]
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.rewriteGroupingFunctionAST(SemanticAnalyzer.java:3305)
 ~[hive-exec-3.1xyz]
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:4616)
 ~[hive-exec-3.1xyz]
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:4392)
 ~[hive-exec-3.1xyz]
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPostGroupByBodyPlan(SemanticAnalyzer.java:11026)
 ~[hive-exec-3.1xyz]
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:10965)
 ~[hive-exec-3.1xyz]
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11894)
 ~[hive-exec-3.1xyz]
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11764)
 ~[hive-exec-3.1xyz]
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genOPTree(SemanticAnalyzer.java:12568)
 ~[hive-exec-3.1xyz]
at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:707)
 ~[hive-exec-3.1xyz]
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:12669)
 ~[hive-exec-3.1xyz]
at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:426)
 ~[hive-exec-3.1xyz]
at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:288)
 ~[hive-exec-3.1xyz]
at 
org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:170)
 ~[hive-exec-3.1xyz]
at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:288)
 ~[hive-exec-3.1xyz]
at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:221) 
~[hive-exec-3.1xyz]
at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:104) 
~[hive-exec-3.1xyz]
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:188) 
~[hive-exec-3.1xyz]
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:598) 
~[hive-exec-3.1xyz]
at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:544) 
~[hive-exec-3.1xyz]
at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:538) 
~[hive-exec-3.1xyz]
at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:127)
 ~[hive-exec-3.1xyz
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23878) Aggregate after join throws off MV rewrite

2020-07-19 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23878:
---

 Summary: Aggregate after join throws off MV rewrite 
 Key: HIVE-23878
 URL: https://issues.apache.org/jira/browse/HIVE-23878
 Project: Hive
  Issue Type: Sub-task
  Components: Materialized views
Reporter: Rajesh Balamohan


E.g. Q81, Q30, Q45, Q68: in all these queries, MV rewrites are disabled for the 
{{customer, customer-address}} MV.





[jira] [Created] (HIVE-23870) Optimise multiple text conversions in WritableHiveCharObjectInspector.getPrimitiveJavaObjec

2020-07-17 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23870:
---

 Summary: Optimise multiple text conversions in 
WritableHiveCharObjectInspector.getPrimitiveJavaObjec
 Key: HIVE-23870
 URL: https://issues.apache.org/jira/browse/HIVE-23870
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan
 Attachments: image-2020-07-17-11-31-38-241.png

Observed this when creating a materialized view.

[https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/WritableHiveCharObjectInspector.java#L85]

The same content is converted to Text multiple times.

!image-2020-07-17-11-31-38-241.png|width=1048,height=936!
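A minimal sketch of the idea (not Hive's actual patch): wrap the expensive conversion so the same input value is converted only once and the result is reused. `MemoizedConverter` and the byte-array conversion are illustrative stand-ins for the String-to-Text conversion in getPrimitiveJavaObject.

```java
import java.util.function.Function;

// Illustrative sketch: remember the last converted value and reuse it when the
// same input is seen again, instead of re-running the conversion per call.
final class MemoizedConverter<I, O> {
    private final Function<I, O> convert;
    private I lastInput;
    private O lastOutput;
    int conversions = 0; // exposed only to show the saving

    MemoizedConverter(Function<I, O> convert) {
        this.convert = convert;
    }

    synchronized O apply(I input) {
        // Only convert when the input actually changed.
        if (lastInput == null || !lastInput.equals(input)) {
            lastOutput = convert.apply(input);
            lastInput = input;
            conversions++;
        }
        return lastOutput;
    }
}
```

Repeated lookups of the same char value then pay the conversion cost once rather than on every call.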





[jira] [Created] (HIVE-23843) Improve key evictions in VectorGroupByOperator

2020-07-13 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23843:
---

 Summary: Improve key evictions in VectorGroupByOperator
 Key: HIVE-23843
 URL: https://issues.apache.org/jira/browse/HIVE-23843
 Project: Hive
  Issue Type: Improvement
  Components: Hive
Reporter: Rajesh Balamohan


Keys in {{mapKeysAggregationBuffers}} are evicted in random order. Tasks also 
run into GC issues when a large number of keys is involved in group-bys. It would 
be good to provide an option for LRU-based eviction of 
{{mapKeysAggregationBuffers}}.
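A minimal sketch of LRU-ordered eviction for a bounded buffer map, assuming eviction by entry count; the class name is illustrative, not Hive's implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: LinkedHashMap with accessOrder=true keeps entries in
// least-recently-used order, so removeEldestEntry evicts the LRU key instead
// of an arbitrary one.
class LruAggregationBuffers<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruAggregationBuffers(int maxEntries) {
        super(16, 0.75f, true); // accessOrder=true -> LRU iteration order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least-recently-used buffer once the cap is exceeded.
        return size() > maxEntries;
    }
}
```

Any get() or put() of an existing key moves it to the most-recently-used end, so hot group-by keys survive eviction.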





[jira] [Created] (HIVE-23805) ValidReadTxnList need not be constructed multiple times in AcidUtils::getAcidState

2020-07-06 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23805:
---

 Summary: ValidReadTxnList need not be constructed multiple times 
in AcidUtils::getAcidState 
 Key: HIVE-23805
 URL: https://issues.apache.org/jira/browse/HIVE-23805
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan
 Attachments: Screenshot 2020-07-06 at 4.53.44 PM.png

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java#L1273]

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java#L1286]

 
{code:java}
String s = conf.get(ValidTxnList.VALID_TXNS_KEY);
if (!Strings.isNullOrEmpty(s)) {
  ...
  validTxnList.readFromString(s);
}
{code}
 

 

!Screenshot 2020-07-06 at 4.53.44 PM.png|width=610,height=621!

The AM spends a good amount of CPU parsing the same validTxnList multiple times.
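The parse-once idea can be sketched as follows. This is an illustration, not the actual fix: `parse` is a stand-in for `ValidReadTxnList.readFromString`, and a real change would more likely hoist the parsed object out of getAcidState than cache globally.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: parse each distinct VALID_TXNS_KEY string once and
// reuse the parsed result on subsequent calls.
final class TxnListCache {
    static int parseCount = 0; // illustration only

    private static final Map<String, long[]> CACHE = new ConcurrentHashMap<>();

    // Stand-in parser: "1:2:3" -> [1, 2, 3]
    static long[] parse(String s) {
        parseCount++;
        String[] parts = s.split(":");
        long[] out = new long[parts.length];
        for (int i = 0; i < parts.length; i++) {
            out[i] = Long.parseLong(parts[i]);
        }
        return out;
    }

    static long[] get(String validTxnsValue) {
        // computeIfAbsent runs the parser at most once per distinct string.
        return CACHE.computeIfAbsent(validTxnsValue, TxnListCache::parse);
    }
}
```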

 





[jira] [Created] (HIVE-23788) FilterStatsRule misestimate causes hashtable computation to rehash often

2020-07-01 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23788:
---

 Summary: FilterStatsRule misestimate causes hashtable computation 
to rehash often
 Key: HIVE-23788
 URL: https://issues.apache.org/jira/browse/HIVE-23788
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


Depending on the available statistics, FilterStatsRule at times estimates the row 
count as numRows/3. This causes a lower keyCount to be projected for hashtable 
computation, leading to frequent rehashing.

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L952]

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L1192]

E.g. TPCDS Q74 @ 10TB: as part of evaluating the "t_s_firstyear.year_total > 0, 
t_w_secyear.year_total / t_w_firstyear.year_total, t_s_secyear.year_total / 
t_s_firstyear.year_total" conditions, it projects 1/3rd of the rows, causing 
rehashing of the hashtable in the downstream vertex.

We may have to check whether stats can be projected correctly for these columns.

 





[jira] [Created] (HIVE-23754) LLAP: Add LoggingHandler in ShuffleHandler pipeline for better debuggability

2020-06-23 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23754:
---

 Summary: LLAP: Add LoggingHandler in ShuffleHandler pipeline for 
better debuggability
 Key: HIVE-23754
 URL: https://issues.apache.org/jira/browse/HIVE-23754
 Project: Hive
  Issue Type: Improvement
 Environment: 
[https://github.com/apache/hive/blob/master/llap-server/src/java/org/apache/hadoop/hive/llap/shufflehandler/ShuffleHandler.java#L616]

 

For corner case debugging, it would be helpful to understand when netty 
processed OPEN/BOUND/CLOSE/RECEIVED/CONNECTED events along with payload details.

Adding a "LoggingHandler" to the ChannelPipeline can help in debugging.

 
Reporter: Rajesh Balamohan








[jira] [Created] (HIVE-23739) ShuffleHandler: Unordered partitioned data could be evicted immediately after transfer

2020-06-22 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23739:
---

 Summary: ShuffleHandler: Unordered partitioned data could be 
evicted immediately after transfer
 Key: HIVE-23739
 URL: https://issues.apache.org/jira/browse/HIVE-23739
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


[https://github.com/apache/hive/blob/master/llap-server/src/java/org/apache/hadoop/hive/llap/shufflehandler/ShuffleHandler.java#L1019]

This is not optimal for unordered partitioned data, as it ends up evicting the 
data immediately after transfer (e.g. Q78).





[jira] [Created] (HIVE-23738) DBLockManager::lock() : Move lock request to debug level

2020-06-22 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23738:
---

 Summary: DBLockManager::lock() : Move lock request to debug level
 Key: HIVE-23738
 URL: https://issues.apache.org/jira/browse/HIVE-23738
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/lockmgr/DbLockManager.java#L102]

 

For Q78 @ 30TB scale, it ends up dumping a couple of MBs of logs at INFO level 
to print the lock request type. If possible, this should be moved to DEBUG level.





[jira] [Created] (HIVE-23735) Reducer misestimate for export command

2020-06-21 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23735:
---

 Summary: Reducer misestimate for export command
 Key: HIVE-23735
 URL: https://issues.apache.org/jira/browse/HIVE-23735
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L6869

{code}
if (dest_tab.getNumBuckets() > 0) {
...
}
{code}

For the "export" command, HS2 creates a dummy table and gets "1" as the number 
of buckets for it.

{noformat}
set hive.stats.autogather=false;
export table sample_table to '/tmp/export/sampe_db/t1';
{noformat}

This throws off the reducer estimate and always ends up with '1' reducer task.







[jira] [Created] (HIVE-23597) VectorizedOrcAcidRowBatchReader::ColumnizedDeleteEventRegistry reads delete delta directories multiple times

2020-06-03 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23597:
---

 Summary: 
VectorizedOrcAcidRowBatchReader::ColumnizedDeleteEventRegistry reads delete 
delta directories multiple times
 Key: HIVE-23597
 URL: https://issues.apache.org/jira/browse/HIVE-23597
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java#L1562]
{code:java}
try {
final Path[] deleteDeltaDirs = getDeleteDeltaDirsFromSplit(orcSplit);
if (deleteDeltaDirs.length > 0) {
  int totalDeleteEventCount = 0;
  for (Path deleteDeltaDir : deleteDeltaDirs) {
{code}
 

Consider a directory layout like the following. This was created by a simple set 
of "insert --> update --> select" queries.

 
{noformat}
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/base_001
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/base_002
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_003_003_
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_004_004_
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_005_005_
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_006_006_
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_007_007_
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_008_008_
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_009_009_
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_010_010_
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_011_011_
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_012_012_
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_013_013_
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_003_003_
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_004_004_
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_005_005_
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_006_006_
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_007_007_
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_008_008_
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_009_009_
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_010_010_
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_011_011_
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_012_012_
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_013_013_
 {noformat}
 

OrcSplit contains all the delete delta folder information. For a directory 
layout like this, it would create {{~12 splits}}. For every split, it 
constructs a "ColumnizedDeleteEventRegistry" in VectorizedOrcAcidRowBatchReader 
and ends up reading all these delete delta folders again.
 In this case, they would be read approximately {{121 times!}} in total.

This causes huge delays when running simple queries like "{{select * from tab_x}}" 
on cloud storage.

 





[jira] [Created] (HIVE-23559) Optimise Hive::moveAcidFiles for cloud storage

2020-05-27 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23559:
---

 Summary: Optimise Hive::moveAcidFiles for cloud storage
 Key: HIVE-23559
 URL: https://issues.apache.org/jira/browse/HIVE-23559
 Project: Hive
  Issue Type: Bug
Reporter: Rajesh Balamohan


[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L4752]

It ends up transferring the DELTA, DELETE_DELTA and BASE prefixes sequentially 
from the staging to the final location.

This causes delays even for simple update statements that update a small number 
of records on cloud storage.





[jira] [Created] (HIVE-23551) Acid: Update queries should purge dir cache entry in AcidUtils

2020-05-26 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23551:
---

 Summary: Acid: Update queries should purge dir cache entry in 
AcidUtils
 Key: HIVE-23551
 URL: https://issues.apache.org/jira/browse/HIVE-23551
 Project: Hive
  Issue Type: Bug
  Components: Hive
Reporter: Rajesh Balamohan


Update statements create delta folders at the end of the execution. When 
{{insert overwrite}} followed by {{update}} is executed, it does not see any 
open txns and ends up caching the {{base}} folder. However, the delta folder 
which gets created at the end of the statement never makes it to the cache. 
This produces wrong results.
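The required invalidation can be sketched as follows. This is an illustration of the idea only, with hypothetical names: a statement that writes a new delta purges the cached directory snapshot for that table so the next reader re-lists instead of serving the stale "base only" view.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: a directory-snapshot cache that write statements must
// purge after creating delta dirs, so stale entries never serve readers.
final class DirSnapshotCache {
    private final Map<String, String> snapshots = new ConcurrentHashMap<>();

    String lookup(String tablePath) {
        return snapshots.get(tablePath);
    }

    void cache(String tablePath, String snapshot) {
        snapshots.put(tablePath, snapshot);
    }

    // Called at the end of update/delete/merge, after delta dirs are created.
    void purgeOnWrite(String tablePath) {
        snapshots.remove(tablePath);
    }
}
```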





[jira] [Created] (HIVE-23521) REPL: Optimise partition loading during bootstrap

2020-05-20 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23521:
---

 Summary: REPL: Optimise partition loading during bootstrap
 Key: HIVE-23521
 URL: https://issues.apache.org/jira/browse/HIVE-23521
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


When bootstrapping from a large "REPL dump" with ~10K partitions, it executes 
"addPartition" sequentially and takes a very long time, as it communicates with 
HMS and registers the partition for every call.

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/repl/bootstrap/load/table/LoadPartitions.java#L399]

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/repl/bootstrap/load/table/LoadPartitions.java#L165]

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/repl/bootstrap/load/table/LoadPartitions.java#L210]

When bootstrap loading has to deal with this DDL, it would be good to collate all 
partitions into a single call to HMS. This would help reduce the overall runtime.
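A minimal sketch of the collation step, assuming the metastore client exposes a bulk add (IMetaStoreClient has an add_partitions taking a list); the batch size and the generic helper are illustrative, not the actual patch.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: split the full partition list into batches so each HMS
// round-trip registers many partitions instead of one.
final class PartitionBatcher {
    static <T> List<List<T>> batches(List<T> items, int batchSize) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            out.add(items.subList(i, Math.min(i + batchSize, items.size())));
        }
        return out;
    }
}
```

Each batch would then be passed to a single bulk add-partitions call instead of ~10K individual ones.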

 





[jira] [Created] (HIVE-23520) REPL: repl dump could add support for immutable dataset

2020-05-20 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23520:
---

 Summary: REPL: repl dump could add support for immutable dataset
 Key: HIVE-23520
 URL: https://issues.apache.org/jira/browse/HIVE-23520
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


Currently, "REPL DUMP" ends up copying the entire dataset along with partition 
information, stats etc. into its dump folder. However, there are cases (e.g. large 
reference datasets) where we need a way to retain just the metadata along with 
partition information & stats.





[jira] [Created] (HIVE-23499) REPL: repl load should honor "hive.repl.dump.metadata.only=true"

2020-05-18 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23499:
---

 Summary: REPL: repl load should honor 
"hive.repl.dump.metadata.only=true"
 Key: HIVE-23499
 URL: https://issues.apache.org/jira/browse/HIVE-23499
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


"{{hive.repl.dump.metadata.only=true}}" is currently not honored during "{{repl 
load}}"; it ends up copying all files even if this option is specified in 
"repl load". E.g.
{noformat}
repl load airline_ontime_orc into another_airline_ontime_orc with 
('hive.repl.rootdir'='s3a://blah/', 'hive.repl.dump.metadata.only'='true'); 
{noformat}





[jira] [Created] (HIVE-23488) Optimise PartitionManagementTask::Msck::repair

2020-05-17 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23488:
---

 Summary: Optimise PartitionManagementTask::Msck::repair
 Key: HIVE-23488
 URL: https://issues.apache.org/jira/browse/HIVE-23488
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan
 Attachments: Screenshot 2020-05-18 at 5.06.15 AM.png

Ends up fetching table information twice.

!Screenshot 2020-05-18 at 5.06.15 AM.png|width=1084,height=754!

 

[https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/Msck.java#L113]

[https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreChecker.java#L234]

 





[jira] [Created] (HIVE-23487) Optimise PartitionManagementTask

2020-05-17 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23487:
---

 Summary: Optimise PartitionManagementTask
 Key: HIVE-23487
 URL: https://issues.apache.org/jira/browse/HIVE-23487
 Project: Hive
  Issue Type: Improvement
  Components: Metastore
Reporter: Rajesh Balamohan
 Attachments: Screenshot 2020-05-18 at 4.19.48 AM.png

Msck.init for every table takes more time than the actual table repair. This 
was observed on a system which had lots of DBs and tables.

 

  !Screenshot 2020-05-18 at 4.19.48 AM.png|width=1014,height=732!





[jira] [Created] (HIVE-23468) LLAP: Optimise OrcEncodedDataReader to avoid FS init to NN

2020-05-14 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23468:
---

 Summary: LLAP: Optimise OrcEncodedDataReader to avoid FS init to NN
 Key: HIVE-23468
 URL: https://issues.apache.org/jira/browse/HIVE-23468
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


OrcEncodedDataReader materializes the supplier to check whether it is an HDFS 
filesystem or not. This causes an unwanted call to the NN even when the cache is 
completely warmed up.

[https://github.com/apache/hive/blob/master/llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/OrcEncodedDataReader.java#L540]

[https://github.com/apache/hive/blob/9f40d7cc1d889aa3079f3f494cf810fabe326e44/ql/src/java/org/apache/hadoop/hive/ql/io/HdfsUtils.java#L107]

The workaround is to set "hive.llap.io.use.fileid.path=false" to avoid this case.

The IO elevator could then get a 100% cache hit from the FileSystem impl in the 
warmed-up scenario.





[jira] [Created] (HIVE-23459) Reduce number of listPath calls in AcidUtils::getAcidState

2020-05-13 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23459:
---

 Summary: Reduce number of listPath calls in AcidUtils::getAcidState
 Key: HIVE-23459
 URL: https://issues.apache.org/jira/browse/HIVE-23459
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan
 Attachments: image-2020-05-13-13-57-27-270.png

There are at least 3 places where listPaths is invoked on the FS (highlighted in 
the following profile).

!image-2020-05-13-13-57-27-270.png|width=869,height=626!

 

Dir caching works mainly for the BI strategy and when there are no delta files. It 
would be good to consider reducing the number of NN calls to reduce getSplits time.





[jira] [Created] (HIVE-23451) FileSinkOperator calls deleteOnExit (hdfs call) twice for the same file

2020-05-12 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23451:
---

 Summary: FileSinkOperator calls deleteOnExit (hdfs call) twice for 
the same file
 Key: HIVE-23451
 URL: https://issues.apache.org/jira/browse/HIVE-23451
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L826]

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L797]

An NN call can be avoided here (relevant mainly for small queries).
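The duplicate-call guard can be sketched as follows. This is an illustration only; the real change would simply drop the duplicate call site, and the counter here merely stands in for the NN round-trip of fs.deleteOnExit.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: remember which paths already had deleteOnExit called
// and skip the second NameNode round-trip for the same file.
final class DeleteOnExitOnce {
    int namenodeCalls = 0; // stands in for actual fs.deleteOnExit() calls
    private final Set<String> registered = ConcurrentHashMap.newKeySet();

    void deleteOnExit(String path) {
        // Set.add returns false when the path was already registered.
        if (registered.add(path)) {
            namenodeCalls++;
        }
    }
}
```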





[jira] [Created] (HIVE-23449) LLAP: Reduce mkdir and config creations in submitWork hotpath

2020-05-12 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23449:
---

 Summary: LLAP: Reduce mkdir and config creations in submitWork 
hotpath
 Key: HIVE-23449
 URL: https://issues.apache.org/jira/browse/HIVE-23449
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan
 Attachments: Screenshot 2020-05-12 at 1.09.35 PM.png

!Screenshot 2020-05-12 at 1.09.35 PM.png|width=885,height=558!

 

For short jobs, submitWork is on the hot path. It can lazy-load the conf and skip 
the dir creations (which need to happen only when DirWatcher is enabled).





[jira] [Created] (HIVE-23446) LLAP: Reduce IPC connection misses to AM for short queries

2020-05-11 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23446:
---

 Summary: LLAP: Reduce IPC connection misses to AM for short queries
 Key: HIVE-23446
 URL: https://issues.apache.org/jira/browse/HIVE-23446
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


[https://github.com/apache/hive/blob/master/llap-server/src/java/org/apache/hadoop/hive/llap/daemon/impl/QueryInfo.java#L343]

 

The umbilical UGI pool is maintained at the QueryInfo level. When there are lots of 
short queries, this misses the IPC cache and ends up recreating 
threads/connections to the same AM.

It would be good to maintain this pool in {{ContainerRunnerImpl}} instead and 
recycle entries as needed.

 





[jira] [Created] (HIVE-23430) Optimise ASTNode::getChildren

2020-05-10 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23430:
---

 Summary: Optimise ASTNode::getChildren 
 Key: HIVE-23430
 URL: https://issues.apache.org/jira/browse/HIVE-23430
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan
 Attachments: image-2020-05-11-09-09-37-119.png

!image-2020-05-11-09-09-37-119.png|width=1276,height=930!

 

The pink bars are from ASTNode::getChildren. Observed this when a large number of 
small queries were executed concurrently in HS2.





[jira] [Created] (HIVE-23429) LLAP: Optimize retrieving queryId details in LlapTaskCommunicator

2020-05-10 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23429:
---

 Summary: LLAP: Optimize retrieving queryId details in 
LlapTaskCommunicator
 Key: HIVE-23429
 URL: https://issues.apache.org/jira/browse/HIVE-23429
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


 
[https://github.com/apache/hive/blob/master/llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskCommunicator.java#L825]

 

For small jobs, unpacking the entire payload to look up the HIVEQUERYID parameter 
becomes a bottleneck. It would be good to share HIVEQUERYID in the dagConf and 
retrieve it via DagInfo.

 

 





[jira] [Created] (HIVE-23376) Avoid repeated SHA computation in GenericUDTFGetSplits for hive-exec jar

2020-05-06 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23376:
---

 Summary: Avoid repeated SHA computation in GenericUDTFGetSplits 
for hive-exec jar
 Key: HIVE-23376
 URL: https://issues.apache.org/jira/browse/HIVE-23376
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan
 Attachments: image-2020-05-06-16-37-48-615.png

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDTFGetSplits.java#L706]

 

 

!image-2020-05-06-16-37-48-615.png|width=946,height=639!





[jira] [Created] (HIVE-23318) TxnHandler need not delete from MATERIALIZATION_REBUILD_LOCKS on need basis

2020-04-29 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23318:
---

 Summary: TxnHandler need not delete from 
MATERIALIZATION_REBUILD_LOCKS on need basis
 Key: HIVE-23318
 URL: https://issues.apache.org/jira/browse/HIVE-23318
 Project: Hive
  Issue Type: Improvement
  Components: Hive
Reporter: Rajesh Balamohan


Observed the following queries even though materialized views (or any related 
feature) were not used.

TxnHandler need not issue these deletes as part of txn commit. Skipping them would 
also help reduce SQL parsing time on the server side.

{noformat}
  delete from MATERIALIZATION_REBUILD_LOCKS where mrl_txn_id = 120398082
  delete from MATERIALIZATION_REBUILD_LOCKS where mrl_txn_id = 120398084
  delete from MATERIALIZATION_REBUILD_LOCKS where mrl_txn_id = 120398112
{noformat}





[jira] [Created] (HIVE-23294) Remove sync bottleneck in TezConfigurationFactory

2020-04-24 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23294:
---

 Summary: Remove sync bottleneck in TezConfigurationFactory
 Key: HIVE-23294
 URL: https://issues.apache.org/jira/browse/HIVE-23294
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan
 Attachments: Screenshot 2020-04-24 at 1.53.20 PM.png

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezConfigurationFactory.java#L53]

[https://github.com/apache/hadoop/blob/master/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/conf/Configuration.java#L1628]

It ends up taking a lock for property lookups in the config. For short-running 
queries with concurrency, this is an issue.
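One way to avoid the per-lookup lock can be sketched as follows. This is an illustration of the idea under stated assumptions, not Hive's patch: take one synchronized snapshot of the Properties object up front and serve subsequent lookups from an unsynchronized map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Illustrative sketch: copy the (synchronized) Properties once, then answer
// all later lookups from a plain HashMap without contending on the lock.
final class ConfigSnapshot {
    private final Map<String, String> view;

    ConfigSnapshot(Properties props) {
        Map<String, String> copy = new HashMap<>();
        synchronized (props) { // one lock acquisition for the whole copy
            for (String name : props.stringPropertyNames()) {
                copy.put(name, props.getProperty(name));
            }
        }
        this.view = copy;
    }

    String get(String key) { // lock-free read
        return view.get(key);
    }
}
```

The trade-off is that the snapshot goes stale if the Properties object is mutated afterwards, which is acceptable when the config is effectively frozen at query submission.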

 

!Screenshot 2020-04-24 at 1.53.20 PM.png|width=1086,height=459!

 

 





[jira] [Created] (HIVE-23292) Reduce PartitionDesc payload in MapWork

2020-04-24 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23292:
---

 Summary: Reduce PartitionDesc payload in MapWork
 Key: HIVE-23292
 URL: https://issues.apache.org/jira/browse/HIVE-23292
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/plan/MapWork.java#L105





[jira] [Created] (HIVE-23282) Reduce number of DB calls in ObjectStore::getPartitionsByExprInternal

2020-04-23 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23282:
---

 Summary: Reduce number of DB calls in 
ObjectStore::getPartitionsByExprInternal
 Key: HIVE-23282
 URL: https://issues.apache.org/jira/browse/HIVE-23282
 Project: Hive
  Issue Type: Improvement
  Components: Metastore
Reporter: Rajesh Balamohan
 Attachments: image-2020-04-23-14-07-06-077.png

ObjectStore::getPartitionsByExprInternal internally uses Table information to get 
the partition keys, table name and catalog name.

 

For this, it ends up populating the entire table data from the DB (including skew 
columns, parameters, sort and bucket cols etc.), which makes it a much more 
expensive call. It would be good to either have a lightweight object with just the 
basic information or reduce the payload on the Table object itself.

[https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L3327]

[https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L3669]

 

!image-2020-04-23-14-07-06-077.png|width=665,height=592!





[jira] [Created] (HIVE-23281) ObjectStore::convertToStorageDescriptor can be optimised to reduce calls to DB for ACID tables

2020-04-23 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23281:
---

 Summary: ObjectStore::convertToStorageDescriptor can be optimised 
to reduce calls to DB for ACID tables
 Key: HIVE-23281
 URL: https://issues.apache.org/jira/browse/HIVE-23281
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


[https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1980]

 

[https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1982]

 

SkewInfo, bucketCols, ordering etc. are not needed for ACID tables. It may be 
good to check for transactional tables and skip these calls in table lookups.

This should help in reducing DB calls.





[jira] [Created] (HIVE-23277) HiveProtoLogger should carry out JSON conversion in its own thread

2020-04-22 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23277:
---

 Summary: HiveProtoLogger should carry out JSON conversion in its 
own thread
 Key: HIVE-23277
 URL: https://issues.apache.org/jira/browse/HIVE-23277
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan
 Attachments: Screenshot 2020-04-23 at 11.27.42 AM.png

!Screenshot 2020-04-23 at 11.27.42 AM.png|width=623,height=423!





[jira] [Created] (HIVE-23261) Check whether encryption is enabled in the cluster before moving files

2020-04-20 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23261:
---

 Summary: Check whether encryption is enabled in the cluster before 
moving files
 Key: HIVE-23261
 URL: https://issues.apache.org/jira/browse/HIVE-23261
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


Similar to HIVE-23212, there is an unwanted check of encryption paths during 
file move operation.

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L4546]

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23218) LlapRecordReader queue limit computation is not optimal

2020-04-15 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23218:
---

 Summary: LlapRecordReader queue limit computation is not optimal
 Key: HIVE-23218
 URL: https://issues.apache.org/jira/browse/HIVE-23218
 Project: Hive
  Issue Type: Improvement
  Components: llap
Reporter: Rajesh Balamohan


After decoding in {{OrcEncodedDataConsumer::decodeBatch}}, data is enqueued into a 
queue in LlapRecordReader. The limit for this queue is computed in 
LlapRecordReader; if it is too small, producers end up waiting 100ms at a time 
until the queue has capacity.

https://github.com/apache/hive/blob/master/llap-server/src/java/org/apache/hadoop/hive/llap/io/api/impl/LlapRecordReader.java#L168

https://github.com/apache/hive/blob/master/llap-server/src/java/org/apache/hadoop/hive/llap/io/api/impl/LlapRecordReader.java#L590

https://github.com/apache/hive/blob/master/llap-server/src/java/org/apache/hadoop/hive/llap/io/api/impl/LlapRecordReader.java#L260

{{determineQueueLimit}} takes all columns into consideration even though only a few 
columns are needed for the projection. Here is an example.

{noformat}

create table test_acid(a1 string, a2 string, a3 string, a4 string, a5 string, 
a6 string, a7 string, a8 string, a9 string, a10 string,
a11 string, a22 string, a33 string, a44 string, a55 string, a66 string, a77 
string, a88 string, a99 string, a100 string,
a111 decimal(25,2), a222 decimal(25,2), a333 decimal(25,2), a444 decimal(25,2), 
a555 decimal(25,2), a666 decimal(25,2), a777 decimal(25,2),
 a888 decimal(25,2), a999 decimal(25,2), a1000 decimal(25,2)) stored as orc;

insert into table test_acid values 
("a1","a2","a3","a4","a5","a6","a7","a8","a9","a10",
"a11","a22","a33","a44","a55","a66","a77","a88","a99","a100",
10.23,10.23,10.23,10.23,10.23,10.23,10.23,10.23,10.23,10.23
);

select a44, count(*) from test_acid where a44 like "a4%" group by a44 order by 
a44;

{noformat}

For this query, the predicted queue size would be "138", since the computation 
takes all fields into account instead of just the 2 that are needed. This 
causes unwanted delays in adding data to the queue.
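A rough sketch of the alternative (the weighting scheme and method name are assumptions, not the real determineQueueLimit formula): derive the per-row weight from only the projected columns, so wide tables with narrow projections get a larger queue.

```java
public class QueueLimitSketch {

    /**
     * Divides the memory budget by the summed weight of only the projected
     * columns; projecting fewer columns yields a larger queue limit.
     */
    static int determineQueueLimit(long memoryBudget, int[] columnWeights, int[] projectedCols) {
        long rowWeight = 0;
        for (int col : projectedCols) {
            rowWeight += columnWeights[col]; // count only what the query reads
        }
        if (rowWeight == 0) {
            rowWeight = 1; // guard against empty projections
        }
        return (int) Math.max(1, memoryBudget / rowWeight);
    }
}
```

With a 1000-unit budget and three equally weighted columns, projecting one column triples the computed limit compared to counting all three.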




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23212) SemanticAnalyzer::getStagingDirectoryPathname should check for encryption zone only when needed

2020-04-15 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23212:
---

 Summary: SemanticAnalyzer::getStagingDirectoryPathname should 
check for encryption zone only when needed
 Key: HIVE-23212
 URL: https://issues.apache.org/jira/browse/HIVE-23212
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Reporter: Rajesh Balamohan


[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L2572]

 

When the cluster does not have encryption zones configured, this ends up making 2 
unnecessary calls to the NN. It would be good to guard it with a config option, or to 
check the KMS config from HDFS and invoke it only when needed.
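One possible shape of the guard (the class and method names here are hypothetical): resolve once whether a KMS is configured and skip the per-query NN lookups when it is not.

```java
public class EncryptionZoneGuard {
    private volatile Boolean kmsConfigured; // cached after the first lookup

    /**
     * kmsUri stands in for the value of the HDFS KMS setting; when it is
     * absent, the staging-dir encryption-zone checks can be skipped entirely.
     */
    boolean needsEncryptionCheck(String kmsUri) {
        if (kmsConfigured == null) {
            kmsConfigured = kmsUri != null && !kmsUri.isEmpty();
        }
        return kmsConfigured;
    }
}
```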



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23210) Fix shortestjobcomparator when jobs submitted have 1 task in their vertices

2020-04-15 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23210:
---

 Summary: Fix shortestjobcomparator when jobs submitted have 1 task 
in their vertices
 Key: HIVE-23210
 URL: https://issues.apache.org/jira/browse/HIVE-23210
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


In latency-sensitive queries, many jobs can have vertices with 1 task. 
Currently, ShortestJobFirstComparator does not handle this correctly and returns 
tasks in random order.

[https://github.com/apache/hive/blob/master/llap-server/src/java/org/apache/hadoop/hive/llap/daemon/impl/comparator/ShortestJobFirstComparator.java#L51]

This causes delay in the job runtime. I will attach a simple test case shortly.
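A sketch of one possible fix (the {pendingTasks, totalTasks, submitTime} array layout is an assumption for illustration, not the actual comparator's fields): when the pending-task ratios tie, as they do for single-task vertices, break the tie by submission time instead of leaving the order arbitrary.

```java
import java.util.Comparator;

public class ShortestJobSketch {

    // Each entry: {pendingTasks, totalTasks, submitTimeMillis}.
    static final Comparator<long[]> CMP = (a, b) -> {
        // Compare pending/total ratios via cross-multiplication (no division).
        int ratioCmp = Long.compare(a[0] * b[1], b[0] * a[1]);
        if (ratioCmp != 0) {
            return ratioCmp;
        }
        return Long.compare(a[2], b[2]); // tie-break: earlier job first
    };
}
```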



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23208) Update guaranteed capacity in ZK only when WM is enabled

2020-04-14 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23208:
---

 Summary: Update guaranteed capacity in ZK only when WM is enabled
 Key: HIVE-23208
 URL: https://issues.apache.org/jira/browse/HIVE-23208
 Project: Hive
  Issue Type: Improvement
  Components: llap
Reporter: Rajesh Balamohan


[https://github.com/apache/hive/blob/master/llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java#L1091]

 

Even when WM is not enabled, it ends up updating ZK for every DAG completion 
event. For short-running queries with concurrency, this results in a large 
number of calls to ZK.

It would be good to invoke this only when WM is enabled.
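The proposed change is essentially a one-line guard; a minimal sketch (the names are illustrative, not the LlapTaskSchedulerService API):

```java
public class WmUpdateGuard {
    static int zkUpdates = 0; // stand-in counter for ZK writes

    /** Only touch ZK on DAG completion when workload management is enabled. */
    static void onDagComplete(boolean wmEnabled) {
        if (!wmEnabled) {
            return; // short-running queries skip ZK entirely
        }
        zkUpdates++; // stand-in for the guaranteed-capacity update in ZK
    }
}
```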

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23196) Reduce number of delete calls to NN during Context::clear

2020-04-14 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23196:
---

 Summary: Reduce number of delete calls to NN during Context::clear
 Key: HIVE-23196
 URL: https://issues.apache.org/jira/browse/HIVE-23196
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


{{Context::clear()}} ends up deleting the same directories (or their subdirs) 
multiple times. It would be good to reduce the number of delete calls to the NN for 
latency-sensitive queries. This also has an impact on concurrent queries.

{noformat}
2020-04-14T04:22:28,703 DEBUG [7c6a6b09-ab37-4bc8-93a5-5da6fb154899 
HiveServer2-Handler-Pool: Thread-378] ql.Context: Deleting result dir: 
hdfs://nn1:8020/tmp/hive/rbalamohan/7c6a6b09-ab37-4bc8-93a5-5da6fb154899/hive_2020-04-14_04-22-24_335_8573832618972595103-13/-mr-1
2020-04-14T04:22:28,721 DEBUG [7c6a6b09-ab37-4bc8-93a5-5da6fb154899 
HiveServer2-Handler-Pool: Thread-378] ql.Context: Deleting scratch dir: 
hdfs://nn1:8020/tmp/hive/rbalamohan/7c6a6b09-ab37-4bc8-93a5-5da6fb154899/hive_2020-04-14_04-22-24_335_8573832618972595103-13
2020-04-14T04:22:28,737 DEBUG [7c6a6b09-ab37-4bc8-93a5-5da6fb154899 
HiveServer2-Handler-Pool: Thread-378] ql.Context: Deleting scratch dir: 
hdfs://nn1:8020/tmp/hive/rbalamohan/7c6a6b09-ab37-4bc8-93a5-5da6fb154899/hive_2020-04-14_04-22-24_335_8573832618972595103-13/-mr-1/.hive-staging_hive_2020-04-14_04-22-24_335_8573832618972595103-13
{noformat}
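One way to cut the redundant calls, sketched here under the assumption that scratch-dir deletes are recursive: sort the candidate paths and drop any path already covered by a kept ancestor, so each top-level scratch directory is deleted exactly once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class ScratchDirDeduper {

    /** Keeps only paths whose ancestors are not already in the result. */
    static List<String> topLevelOnly(List<String> paths) {
        TreeSet<String> sorted = new TreeSet<>(paths); // parents sort before children
        List<String> result = new ArrayList<>();
        for (String p : sorted) {
            String lastKept = result.isEmpty() ? null : result.get(result.size() - 1);
            if (lastKept == null || !p.startsWith(lastKept + "/")) {
                result.add(p); // not covered by a kept ancestor's recursive delete
            }
        }
        return result;
    }
}
```

In the log above, all three deletes collapse to the single scratch-dir delete.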



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23154) Fix race condition in Utilities::mvFileToFinalPath

2020-04-08 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23154:
---

 Summary: Fix race condition in Utilities::mvFileToFinalPath
 Key: HIVE-23154
 URL: https://issues.apache.org/jira/browse/HIVE-23154
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Reporter: Rajesh Balamohan
 Attachments: HIVE-23154.1.patch

Utilities::mvFileToFinalPath is used for moving files from the "/_tmp.-ext" folder to 
the "/-ext" folder. Tasks write data to "_tmp"; before being written to the final 
destination, files are moved to the "-ext" folder. As part of this, there are checks 
to ensure that run-away task outputs are not copied to the "-ext" folder.

Currently, there is a race condition between computing the snapshot of files to 
be copied and the rename operation. The same issue exists in the "insert into" 
case as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23140) Optimise file move in CTAS

2020-04-06 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23140:
---

 Summary: Optimise file move in CTAS 
 Key: HIVE-23140
 URL: https://issues.apache.org/jira/browse/HIVE-23140
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
 Attachments: HIVE-23140.1.patch

FileSinkOperator can be optimized to run the file move operation (/_tmp.-ext --> 
/-ext-10002) in parallel. Currently it invokes 
{{Utilities.moveSpecifiedFileStatus}} and renames files sequentially, causing 
delays on cloud storage. FS rename can be used instead (S3A internally performs 
rename in parallel).
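A sketch of the parallel variant (the pool size and the rename function are placeholders, not the actual Utilities code): submit each rename to a small executor so per-file cloud-storage latencies overlap instead of adding up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.UnaryOperator;

public class ParallelMoveSketch {

    /** Applies renameFn to every file concurrently; results keep input order. */
    static List<String> moveAll(List<String> files, UnaryOperator<String> renameFn, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String f : files) {
                futures.add(pool.submit(() -> renameFn.apply(f))); // overlap rename latency
            }
            List<String> out = new ArrayList<>();
            for (Future<String> fut : futures) {
                out.add(fut.get()); // collect in submission order
            }
            return out;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```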



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23083) Enable fast serialization in xprod edge

2020-03-26 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23083:
---

 Summary: Enable fast serialization in xprod edge
 Key: HIVE-23083
 URL: https://issues.apache.org/jira/browse/HIVE-23083
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Reporter: Rajesh Balamohan
 Attachments: Screenshot 2020-03-26 at 2.28.34 PM.png

{noformat}
select count(*) from store_sales, store, customer, customer_address where  
ss_store_sk = s_store_sk and s_market_id=10 and ss_customer_sk = c_customer_sk 
and c_birth_country <> upper(ca_country);
{noformat}
This uses "org/apache/hadoop/io/serializer/WritableSerialization" instead of 
TezBytesWritableSerialization.

 

!Screenshot 2020-03-26 at 2.28.34 PM.png|width=812,height=488!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23029) LLAP: Shuffle Handler should support Index Cache configuration

2020-03-16 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23029:
---

 Summary: LLAP: Shuffle Handler should support Index Cache 
configuration
 Key: HIVE-23029
 URL: https://issues.apache.org/jira/browse/HIVE-23029
 Project: Hive
  Issue Type: Improvement
  Components: llap
Reporter: Rajesh Balamohan
 Attachments: Screenshot 2020-03-16 at 12.08.44 PM.jpg

!Screenshot 2020-03-16 at 12.08.44 PM.jpg|width=580,height=405!

 

Queries like Q78 at large scale miss the index cache with unordered edges (24 * 
1009 = 24216 entries, while the default 10 MB cache size can accommodate only 
400+ entries).

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23027) Fix syntax error in llap package.py

2020-03-15 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23027:
---

 Summary: Fix syntax error in llap package.py
 Key: HIVE-23027
 URL: https://issues.apache.org/jira/browse/HIVE-23027
 Project: Hive
  Issue Type: Bug
Reporter: Rajesh Balamohan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23002) Optimise LazyBinaryUtils.writeVLong

2020-03-09 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23002:
---

 Summary: Optimise LazyBinaryUtils.writeVLong
 Key: HIVE-23002
 URL: https://issues.apache.org/jira/browse/HIVE-23002
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan
 Attachments: Screenshot 2020-03-10 at 5.01.34 AM.jpg

[https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinaryUtils.java#L420]

It would be good to add a method which accepts scratch bytes.

 

  !Screenshot 2020-03-10 at 5.01.34 AM.jpg|width=452,height=321!
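A sketch of what such an overload could look like (this is a generic base-128 varint written into a caller-supplied buffer, not necessarily Hive's exact wire encoding): the caller reuses one scratch array across calls, so nothing is allocated per value.

```java
public class VLongScratchSketch {

    /**
     * Encodes v as a little-endian base-128 varint into scratch and
     * returns the number of bytes written; scratch must hold >= 10 bytes.
     */
    static int writeVLong(byte[] scratch, long v) {
        int i = 0;
        long u = v;
        do {
            byte b = (byte) (u & 0x7f);
            u >>>= 7;
            if (u != 0) {
                b |= (byte) 0x80; // continuation bit: more bytes follow
            }
            scratch[i++] = b;
        } while (u != 0);
        return i;
    }
}
```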



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22984) Optimise FetchOperator when fetching large number of records

2020-03-05 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22984:
---

 Summary: Optimise FetchOperator when fetching large number of 
records
 Key: HIVE-22984
 URL: https://issues.apache.org/jira/browse/HIVE-22984
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Reporter: Rajesh Balamohan
 Attachments: image-2020-03-05-19-11-16-318.png

!image-2020-03-05-19-11-16-318.png|width=676,height=456!

 
[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FetchTask.java#L149]

 
[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java#L545]
 
 
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22975) Optimise TopNKeyFilter with boundary checks

2020-03-04 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22975:
---

 Summary: Optimise TopNKeyFilter with boundary checks
 Key: HIVE-22975
 URL: https://issues.apache.org/jira/browse/HIVE-22975
 Project: Hive
  Issue Type: Improvement
  Components: Hive
Reporter: Rajesh Balamohan
 Attachments: Screenshot 2020-03-04 at 3.26.45 PM.jpg

!Screenshot 2020-03-04 at 3.26.45 PM.jpg|width=507,height=322!

 

It would be good to add boundary checks to reduce the cycles spent in the TopN 
filter. E.g., Q43 spends a good amount of time in TopN.
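The boundary check amounts to comparing each incoming key against the current N-th smallest before touching the heap; a minimal sketch over long keys (the real filter operates on composite keys):

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class TopNBoundarySketch {
    private final int n;
    // Max-heap of the N smallest keys seen so far; peek() is the boundary.
    private final PriorityQueue<Long> heap = new PriorityQueue<>(Comparator.reverseOrder());

    TopNBoundarySketch(int n) {
        this.n = n;
    }

    /** Returns true if key belongs among the current top-N smallest keys. */
    boolean canForward(long key) {
        if (heap.size() < n) {
            heap.add(key);
            return true;
        }
        if (key >= heap.peek()) {
            return false; // boundary check: cheap reject, no heap update
        }
        heap.poll();
        heap.add(key);
        return true;
    }
}
```

Once the heap is full, most rows take the cheap-reject path without any heap mutation.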



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22966) LLAP: Consider including waitTime for comparing attempts in same vertex

2020-03-02 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22966:
---

 Summary: LLAP: Consider including waitTime for comparing attempts 
in same vertex
 Key: HIVE-22966
 URL: https://issues.apache.org/jira/browse/HIVE-22966
 Project: Hive
  Issue Type: Improvement
  Components: llap
Reporter: Rajesh Balamohan
 Attachments: HIVE-22966.1.patch

When attempts are compared within the same vertex, the comparator should pick the 
attempt with the longest wait time to avoid starvation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22927) LLAP should filter guaranteed tasks for killing in node heartbeat

2020-02-25 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22927:
---

 Summary: LLAP should filter guaranteed tasks for killing in node 
heartbeat 
 Key: HIVE-22927
 URL: https://issues.apache.org/jira/browse/HIVE-22927
 Project: Hive
  Issue Type: Bug
Reporter: Rajesh Balamohan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22924) Optimise VectorGroupByOperator::prepareBatchAggregationBufferSets

2020-02-24 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22924:
---

 Summary: Optimise 
VectorGroupByOperator::prepareBatchAggregationBufferSets 
 Key: HIVE-22924
 URL: https://issues.apache.org/jira/browse/HIVE-22924
 Project: Hive
  Issue Type: Bug
  Components: Hive
Reporter: Rajesh Balamohan
 Attachments: Screenshot 2020-02-24 at 1.40.02 PM.jpg

{{VectorHashKeyWrapperGeneral::equals}} comparison becomes expensive in 
[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorGroupByOperator.java#L496]

 

!Screenshot 2020-02-24 at 1.40.02 PM.jpg|width=494,height=328!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22896) Increase fast hashtable size on detecting initial collision

2020-02-17 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22896:
---

 Summary: Increase fast hashtable size on detecting initial 
collision
 Key: HIVE-22896
 URL: https://issues.apache.org/jira/browse/HIVE-22896
 Project: Hive
  Issue Type: Bug
  Components: Hive
Reporter: Rajesh Balamohan


This would help avoid collisions and burn fewer CPU cycles 
during probing.
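A toy illustration of the policy (the hashing details are deliberately simplified; this is not Hive's fast hashtable): on detecting a slot collision during insert, double the capacity and rehash immediately instead of waiting for the usual load-factor threshold.

```java
public class GrowOnCollisionSketch {
    private long[] keys;
    private boolean[] used;

    GrowOnCollisionSketch(int capacity) {
        keys = new long[capacity];
        used = new boolean[capacity];
    }

    private int slot(long k) {
        // Fibonacci-style multiplicative hash, reduced to the table size.
        return (int) Math.floorMod(k * 0x9E3779B97F4A7C15L, (long) keys.length);
    }

    void put(long k) {
        int idx = slot(k);
        if (used[idx] && keys[idx] != k) {
            resize(keys.length * 2); // collision detected: grow eagerly
            put(k);
            return;
        }
        used[idx] = true;
        keys[idx] = k;
    }

    boolean contains(long k) {
        int idx = slot(k);
        return used[idx] && keys[idx] == k;
    }

    private void resize(int newCapacity) {
        long[] oldKeys = keys;
        boolean[] oldUsed = used;
        keys = new long[newCapacity];
        used = new boolean[newCapacity];
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldUsed[i]) {
                put(oldKeys[i]); // rehash survivors into the larger table
            }
        }
    }
}
```

The trade-off is memory for probe speed: after the eager grow, lookups stay single-probe.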



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22886) NPE in Hive.isOutdatedMaterializedView

2020-02-12 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22886:
---

 Summary: NPE in Hive.isOutdatedMaterializedView
 Key: HIVE-22886
 URL: https://issues.apache.org/jira/browse/HIVE-22886
 Project: Hive
  Issue Type: Bug
  Components: Hive
Reporter: Rajesh Balamohan


{noformat}
parse.CalcitePlanner: Exception loading materialized views
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.metadata.Hive.filterAugmentMaterializedViews(Hive.java:1701)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at 
org.apache.hadoop.hive.ql.metadata.Hive.getPreprocessedMaterializedViewsFromRegistry(Hive.java:1649)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.applyMaterializedViewRewriting(CalcitePlanner.java:2071)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:1851)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:1734)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at 
org.apache.calcite.tools.Frameworks.lambda$withPlanner$0(Frameworks.java:130) 
~[calcite-core-1.21.0.jar:1.21.0]
at 
org.apache.calcite.prepare.CalcitePrepareImpl.perform(CalcitePrepareImpl.java:915)
 ~[calcite-core-1.21.0.jar:1.21.0]
at org.apache.calcite.tools.Frameworks.withPrepare(Frameworks.java:179) 
~[calcite-core-1.21.0.jar:1.21.0]
at org.apache.calcite.tools.Frameworks.withPlanner(Frameworks.java:125) 
~[calcite-core-1.21.0.jar:1.21.0]
at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner.logicalPlan(CalcitePlanner.java:1495)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:471)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:12483)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:361)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:286)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at 
org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:171)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:286)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:219) 
~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:103) 
~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:215) 
~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:828) 
~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:774) 
~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:768) 
~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:125)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at 
org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:203)
 ~[hive-service-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at 
org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:325)
 ~[hive-service-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at java.security.AccessController.doPrivileged(Native Method) 
~[?:1.8.0_112]
at javax.security.auth.Subject.doAs(Subject.java:422) ~[?:1.8.0_112]
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1688)
 ~[hadoop-common-3.1.0.3.0.0.0-1634.jar:?]
at 
org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:345)
 ~[hive-service-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
~[?:1.8.0_112]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
~[?:1.8.0_112]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
~[?:1.8.0_112]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
~[?:1.8.0_112]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]
Caused by: 

[jira] [Created] (HIVE-22879) Optimise jar file loading in CalcitePlanner

2020-02-11 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22879:
---

 Summary: Optimise jar file loading in CalcitePlanner
 Key: HIVE-22879
 URL: https://issues.apache.org/jira/browse/HIVE-22879
 Project: Hive
  Issue Type: Improvement
  Components: CBO
Reporter: Rajesh Balamohan


{{CalcitePlanner}} internally uses {{org.codehaus.janino.UnitCompiler}} (a Calcite 
dependency), which appears to load the jars in every thread. Need to check 
whether this can be avoided.

Here is an example.

{noformat}
at java.util.zip.ZipFile.getEntry(Native Method)
at java.util.zip.ZipFile.getEntry(ZipFile.java:310)
- locked <0x0005c1af21c0> (a java.util.jar.JarFile)
at java.util.jar.JarFile.getEntry(JarFile.java:240)
at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1005)
at sun.misc.URLClassPath.getResource(URLClassPath.java:212)
at java.net.URLClassLoader$1.run(URLClassLoader.java:365)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
- locked <0x0005caa3be88> (a java.lang.Object)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at 
org.codehaus.janino.ClassLoaderIClassLoader.findIClass(ClassLoaderIClassLoader.java:89)
at org.codehaus.janino.IClassLoader.loadIClass(IClassLoader.java:312)
- locked <0x000686136868> (a 
org.codehaus.janino.ClassLoaderIClassLoader)
at 
org.codehaus.janino.UnitCompiler.findTypeByName(UnitCompiler.java:8556)
at 
org.codehaus.janino.UnitCompiler.reclassifyName(UnitCompiler.java:8478)
at 
org.codehaus.janino.UnitCompiler.reclassifyName(UnitCompiler.java:8471)
at org.codehaus.janino.UnitCompiler.reclassify(UnitCompiler.java:8331)
at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:6855)
at org.codehaus.janino.UnitCompiler.access$14200(UnitCompiler.java:215)
at 
org.codehaus.janino.UnitCompiler$22$2$1.visitAmbiguousName(UnitCompiler.java:6497)
at 
org.codehaus.janino.UnitCompiler$22$2$1.visitAmbiguousName(UnitCompiler.java:6494)
at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4224)
at 
org.codehaus.janino.UnitCompiler$22$2.visitLvalue(UnitCompiler.java:6494)
at 
org.codehaus.janino.UnitCompiler$22$2.visitLvalue(UnitCompiler.java:6490)
at org.codehaus.janino.Java$Lvalue.accept(Java.java:4148)
at 
org.codehaus.janino.UnitCompiler$22.visitRvalue(UnitCompiler.java:6490)
at 
org.codehaus.janino.UnitCompiler$22.visitRvalue(UnitCompiler.java:6469)
at org.codehaus.janino.Java$Rvalue.accept(Java.java:4116)
at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:6469)
at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:9026)
at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:7106)
at org.codehaus.janino.UnitCompiler.access$15800(UnitCompiler.java:215)
at 
org.codehaus.janino.UnitCompiler$22$2.visitMethodInvocation(UnitCompiler.java:6517)
at 
org.codehaus.janino.UnitCompiler$22$2.visitMethodInvocation(UnitCompiler.java:6490)
at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5073)
at 
org.codehaus.janino.UnitCompiler$22.visitRvalue(UnitCompiler.java:6490)
at 
org.codehaus.janino.UnitCompiler$22.visitRvalue(UnitCompiler.java:6469)
at org.codehaus.janino.Java$Rvalue.accept(Java.java:4116)
at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:6469)
at 
org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:9237)
at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:9123)
at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:9025)
at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:5062)
at org.codehaus.janino.UnitCompiler.access$9100(UnitCompiler.java:215)
at 
org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4423)
at 
org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4396)
at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5073)
at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4396)
at 
org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5662)
at 
org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5622)
at 

[jira] [Created] (HIVE-22878) Add caching of table constraints, foreignKeys in CachedStore

2020-02-11 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22878:
---

 Summary: Add caching of table constraints, foreignKeys in 
CachedStore
 Key: HIVE-22878
 URL: https://issues.apache.org/jira/browse/HIVE-22878
 Project: Hive
  Issue Type: Improvement
  Components: Hive
Reporter: Rajesh Balamohan
 Attachments: Screenshot 2020-02-12 at 9.24.27 AM.jpg, Screenshot 
2020-02-12 at 9.25.33 AM.jpg

All pink bars are misses from cachedstore.

!Screenshot 2020-02-12 at 9.24.27 AM.jpg|width=428,height=314!

 

!Screenshot 2020-02-12 at 9.25.33 AM.jpg|width=648,height=470!

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22858) HMS broken with mysql

2020-02-09 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22858:
---

 Summary: HMS broken with mysql
 Key: HIVE-22858
 URL: https://issues.apache.org/jira/browse/HIVE-22858
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


Using commit 7bb1d1edfcba558958265ec47245bc529eaee2d8 (Jan 27) of Apache master.

Encountered the following exception when creating a new database in Hive (with 
MySQL backing the HMS).

https://issues.apache.org/jira/browse/HIVE-22663 may be related to this.

 
{noformat}
org.apache.hadoop.hive.metastore.api.MetaException: Unable to select from 
transaction database com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: 
You have an error in your SQL syntax; check the manual that corresponds to your 
MySQL server version for the right syntax to use near '"NEXT_TXN_ID" for 
update' at line 1
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
at com.mysql.jdbc.Util.getInstance(Util.java:386)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1054)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4237)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4169)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2617)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2778)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2828)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2777)
at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1651)
at 
org.apache.hive.com.zaxxer.hikari.pool.ProxyStatement.executeQuery(ProxyStatement.java:108)
at 
org.apache.hive.com.zaxxer.hikari.pool.HikariProxyStatement.executeQuery(HikariProxyStatement.java)
at 
org.apache.hadoop.hive.metastore.txn.TxnHandler.openTxns(TxnHandler.java:599)
at 
org.apache.hadoop.hive.metastore.txn.TxnHandler.openTxns(TxnHandler.java:555)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.open_txns(HiveMetaStore.java:7956)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:147)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:108)
at com.sun.proxy.$Proxy28.open_txns(Unknown Source)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$open_txns.getResult(ThriftHiveMetastore.java:19779)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$open_txns.getResult(ThriftHiveMetastore.java:19764)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at 
org.apache.hadoop.hive.metastore.TUGIBasedProcessor$1.run(TUGIBasedProcessor.java:111)
at 
org.apache.hadoop.hive.metastore.TUGIBasedProcessor$1.run(TUGIBasedProcessor.java:107)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1688)
at 
org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:119)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
at 
org.apache.hadoop.hive.metastore.txn.TxnHandler.openTxns(TxnHandler.java:565) 
~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.open_txns(HiveMetaStore.java:7956)
 ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
~[?:1.8.0_112]
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
~[?:1.8.0_112]
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 ~[?:1.8.0_112]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_112]
at 

[jira] [Created] (HIVE-22850) Optimise lock acquisition in TxnHandler

2020-02-07 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22850:
---

 Summary: Optimise lock acquisition in TxnHandler
 Key: HIVE-22850
 URL: https://issues.apache.org/jira/browse/HIVE-22850
 Project: Hive
  Issue Type: Improvement
  Components: Hive
Reporter: Rajesh Balamohan


With concurrent queries, the time taken for lock acquisition increases 
substantially. As part of lock acquisition, {{TxnHandler::checkLock}} gets 
invoked. This involves taking a mutex and comparing the requested locks with 
the existing locks in the {{HIVE_LOCKS}} table.

With concurrent queries, the time taken for this check increases, which 
significantly increases the time other threads wait for the mutex. In a 
synthetic workload, it was on the order of 10+ seconds. This codepath can be 
optimized when all lock requests are SHARED_READ.
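A sketch of the fast path (the lock-type names mirror Hive's, but the conflict logic here is deliberately reduced): when both the requested locks and all currently held locks are SHARED_READ, the request can be granted without the full per-lock conflict scan.

```java
import java.util.List;

public class CheckLockSketch {
    enum LockType { SHARED_READ, SHARED_WRITE, EXCLUSIVE }

    /**
     * Fast path: SHARED_READ never conflicts with SHARED_READ, so an
     * all-read request against all-read holders can be granted immediately.
     */
    static boolean canGrantFast(List<LockType> requested, List<LockType> existing) {
        boolean allSharedRead =
                requested.stream().allMatch(t -> t == LockType.SHARED_READ)
                && existing.stream().allMatch(t -> t == LockType.SHARED_READ);
        return allSharedRead; // false means: fall back to the full conflict scan
    }
}
```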



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22825) Reduce directory lookup cost for acid tables

2020-02-04 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22825:
---

 Summary: Reduce directory lookup cost for acid tables
 Key: HIVE-22825
 URL: https://issues.apache.org/jira/browse/HIVE-22825
 Project: Hive
  Issue Type: Improvement
  Components: Hive
Reporter: Rajesh Balamohan


With object stores, directory lookups are expensive. For ACID tables, it 
would be good to have a directory cache to reduce the number of lookup calls.
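A minimal sketch of such a cache (no TTL or invalidation shown, which a real implementation would need): memoize listings by path so repeated lookups of the same ACID table directory avoid extra object-store LIST calls.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class DirListingCache {
    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();
    int storeCalls = 0; // counts actual object-store LIST operations

    /** Returns the cached listing, invoking the lister only on a miss. */
    List<String> list(String path, Function<String, List<String>> lister) {
        return cache.computeIfAbsent(path, p -> {
            storeCalls++;
            return lister.apply(p);
        });
    }
}
```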



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22800) GenericUDFOPDTIPlus should support varchar

2020-01-31 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22800:
---

 Summary: GenericUDFOPDTIPlus should support varchar
 Key: HIVE-22800
 URL: https://issues.apache.org/jira/browse/HIVE-22800
 Project: Hive
  Issue Type: Improvement
  Components: Hive
Reporter: Rajesh Balamohan


{noformat}
create table test(d_date varchar(10)) stored as orc;

select d_date + INTERVAL(5) DAY from test;

Error: Error while compiling statement: FAILED: SemanticException [Error 
10014]: Line 1:7 Wrong arguments '5': No matching method for class 
org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPDTIPlus with (varchar(10), 
interval_day_time) (state=42000,code=10014)
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22799) HiveMetaStoreAuthorizer parses conf on every HMS invocation

2020-01-31 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22799:
---

 Summary: HiveMetaStoreAuthorizer parses conf on every HMS 
invocation
 Key: HIVE-22799
 URL: https://issues.apache.org/jira/browse/HIVE-22799
 Project: Hive
  Issue Type: Improvement
  Components: Metastore
Reporter: Rajesh Balamohan


The stack trace below may not exactly match the master branch, but master is 
not very different.

{noformat}
at org.apache.hadoop.util.StringInterner.weakIntern(StringInterner.java:71)
at 
org.apache.hadoop.conf.Configuration$Parser.handleEndElement(Configuration.java:3273)
at 
org.apache.hadoop.conf.Configuration$Parser.parseNext(Configuration.java:3354)
at 
org.apache.hadoop.conf.Configuration$Parser.parse(Configuration.java:3137)
at 
org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:3030)
at 
org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2991)
at 
org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2871)
- locked <0x0005cbe60ad0> (a org.apache.hadoop.mapred.JobConf)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1389)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1361)
at org.apache.hadoop.mapred.JobConf.setJar(JobConf.java:518)
at org.apache.hadoop.mapred.JobConf.setJarByClass(JobConf.java:536)
at org.apache.hadoop.mapred.JobConf.(JobConf.java:430)
at org.apache.hadoop.hive.conf.HiveConf.initialize(HiveConf.java:5482)
at org.apache.hadoop.hive.conf.HiveConf.(HiveConf.java:5450)
at 
org.apache.hadoop.hive.ql.security.authorization.plugin.metastore.HiveMetaStoreAuthorizer.createHiveMetaStoreAuthorizer(HiveMetaStoreAuthorizer.java:450)
at 
org.apache.hadoop.hive.ql.security.authorization.plugin.metastore.HiveMetaStoreAuthorizer.onEvent(HiveMetaStoreAuthorizer.java:100)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.firePreEvent(HiveMetaStore.java:3835)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_database_req(HiveMetaStore.java:1655)
at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:147)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:108)
at com.sun.proxy.$Proxy28.get_database_req(Unknown Source)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$get_database_req.getResult(ThriftHiveMetastore.java:15671)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$get_database_req.getResult(ThriftHiveMetastore.java:15655)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
{noformat}
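The stack trace shows a fresh HiveConf (and its XML resource parsing) being constructed on every HMS event. A common fix is to build the configuration once and reuse it; the sketch below shows the memoization pattern with a placeholder object, since the actual HiveMetaStoreAuthorizer fields are not reproduced here.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: double-checked lazy initialization so the expensive
// conf construction (XML parsing in the stack trace) happens once, not per
// onEvent() call.
public class ConfCacheSketch {
  private static final AtomicInteger builds = new AtomicInteger();
  private static volatile Object cachedConf;

  static Object buildConf() {
    builds.incrementAndGet(); // stands in for new HiveConf(...) and its parsing
    return new Object();
  }

  public static Object getConf() {
    Object c = cachedConf;
    if (c == null) {
      synchronized (ConfCacheSketch.class) {
        if (cachedConf == null) {
          cachedConf = buildConf();
        }
        c = cachedConf;
      }
    }
    return c;
  }

  public static int buildCount() { return builds.get(); }
}
```

A caveat for the real code: if callers mutate the conf per request, they would need a cheap copy of the cached instance rather than the shared one.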



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22798) Fix/Optimize: PrimitiveTypeInfo::getPrimitiveTypeEntry

2020-01-31 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22798:
---

 Summary: Fix/Optimize: PrimitiveTypeInfo::getPrimitiveTypeEntry
 Key: HIVE-22798
 URL: https://issues.apache.org/jira/browse/HIVE-22798
 Project: Hive
  Issue Type: Improvement
  Components: Hive
Reporter: Rajesh Balamohan
 Attachments: image-2020-01-31-14-22-45-372.png

!image-2020-01-31-14-22-45-372.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22792) Fix NPE in VectorColumnOutputMapping.finalize

2020-01-29 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22792:
---

 Summary: Fix NPE in VectorColumnOutputMapping.finalize
 Key: HIVE-22792
 URL: https://issues.apache.org/jira/browse/HIVE-22792
 Project: Hive
  Issue Type: Improvement
  Components: Vectorization
Reporter: Rajesh Balamohan
 Attachments: Screenshot 2020-01-30 at 1.06.37 PM.png

 !Screenshot 2020-01-30 at 1.06.37 PM.png! 

Vectorizer already invokes finalize() explicitly, making 
{{vectorColumnMapping}} null.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22789) Table details are retrieved twice during query compilation phase

2020-01-28 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22789:
---

 Summary: Table details are retrieved twice during query 
compilation phase
 Key: HIVE-22789
 URL: https://issues.apache.org/jira/browse/HIVE-22789
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Reporter: Rajesh Balamohan


https://issues.apache.org/jira/browse/HIVE-22366 takes care of normalizing 
table details so that {{tabNameToTabObject}} is effectively used in 
CalcitePlanner.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java#L513
 effectively invalidates the entire {{tabNameToTabObject}} cache and forces 
the details to be recomputed. It would be good to check whether this can be 
avoided.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22786) Agg with distinct can be optimised in HASH mode

2020-01-27 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22786:
---

 Summary: Agg with distinct can be optimised in HASH mode
 Key: HIVE-22786
 URL: https://issues.apache.org/jira/browse/HIVE-22786
 Project: Hive
  Issue Type: Improvement
  Components: Hive
Reporter: Rajesh Balamohan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22753) Fix gradual mem leak: Operationlog related appenders should be cleared up on errors

2020-01-20 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22753:
---

 Summary: Fix gradual mem leak: Operationlog related appenders 
should be cleared up on errors 
 Key: HIVE-22753
 URL: https://issues.apache.org/jira/browse/HIVE-22753
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Reporter: Rajesh Balamohan
 Attachments: image-2020-01-21-11-14-37-911.png

In case of an exception in SQLOperation, the operation log does not get 
cleared up. This causes a gradual build-up of HushableRandomAccessFileAppender 
instances, causing HS2 to OOM after some time.

!image-2020-01-21-11-14-37-911.png|width=431,height=267!

Prod instance mem

 

Each HushableRandomAccessFileAppender holds an internal ref to 
RandomAccessFileAppender, which holds a 256 KB bytebuffer, causing the mem 
leak.

Related ticket: HIVE-18820



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22752) HiveMetastore addWriteNotificationLog should be invoked only when listeners are enabled

2020-01-20 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22752:
---

 Summary: HiveMetastore addWriteNotificationLog should be invoked 
only when listeners are enabled
 Key: HIVE-22752
 URL: https://issues.apache.org/jira/browse/HIVE-22752
 Project: Hive
  Issue Type: Improvement
  Components: Metastore
Reporter: Rajesh Balamohan


[https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java#L8109]

 

Even when listeners are turned off, this code gets executed and causes load 
on the system. It should be guarded by listener checks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22751) Move locking in HiveServer2::isDeregisteredWithZooKeeper to ZooKeeperHiveHelper

2020-01-20 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22751:
---

 Summary: Move locking in HiveServer2::isDeregisteredWithZooKeeper 
to ZooKeeperHiveHelper
 Key: HIVE-22751
 URL: https://issues.apache.org/jira/browse/HIVE-22751
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Reporter: Rajesh Balamohan


[https://github.com/apache/hive/blob/master/service/src/java/org/apache/hive/service/server/HiveServer2.java#L620]

[https://github.com/apache/hive/blob/master/service/src/java/org/apache/hive/service/cli/session/SessionManager.java#L597]

 

When queries are run in beeline and then closed, it causes unwanted delays in 
shutting down beeline. Here is a thread dump from the server side, which shows 
HiveServer2 lock contention.

 

It would be good to move synchronization to 
"zooKeeperHelper.isDeregisteredWithZooKeeper"

 
{noformat}
"main" #1 prio=5 os_prio=0 tid=0x7f78b0078800 nid=0x2d1c waiting on 
condition [0x7f78b968c000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0xac8d5ff0> (a 
java.util.concurrent.FutureTask)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
at java.util.concurrent.FutureTask.get(FutureTask.java:191)
at 
org.apache.hadoop.hive.ql.exec.tez.TezSessionPool.startUnderInitLock(TezSessionPool.java:187)
at 
org.apache.hadoop.hive.ql.exec.tez.TezSessionPool.start(TezSessionPool.java:123)
- locked <0xa9c5f2a8> (a java.lang.Object)
at 
org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.startPool(TezSessionPoolManager.java:115)
at 
org.apache.hive.service.server.HiveServer2.initAndStartTezSessionPoolManager(HiveServer2.java:790)
at 
org.apache.hive.service.server.HiveServer2.startOrReconnectTezSessions(HiveServer2.java:763)
at 
org.apache.hive.service.server.HiveServer2.start(HiveServer2.java:687)
- locked <0xa99bd568> (a 
org.apache.hive.service.server.HiveServer2)
at 
org.apache.hive.service.server.HiveServer2.startHiveServer2(HiveServer2.java:1016)
at 
org.apache.hive.service.server.HiveServer2.access$1400(HiveServer2.java:137)
at 
org.apache.hive.service.server.HiveServer2$StartOptionExecutor.execute(HiveServer2.java:1294)
at 
org.apache.hive.service.server.HiveServer2.main(HiveServer2.java:1138)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:318)
at org.apache.hadoop.util.RunJar.main(RunJar.java:232)
"HiveServer2-HttpHandler-Pool: Thread-50" #50 prio=5 os_prio=0 
tid=0x7f78b3e60800 nid=0x2fa7 waiting for monitor entry [0x7f7884edf000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.hive.service.server.HiveServer2.isDeregisteredWithZooKeeper(HiveServer2.java:600)
- waiting to lock <0xa99bd568> (a 
org.apache.hive.service.server.HiveServer2)
at 
org.apache.hive.service.cli.session.SessionManager.closeSessionInternal(SessionManager.java:631)
at 
org.apache.hive.service.cli.session.SessionManager.closeSession(SessionManager.java:621)
- locked <0xaa1970b0> (a 
org.apache.hive.service.cli.session.SessionManager)
at 
org.apache.hive.service.cli.CLIService.closeSession(CLIService.java:244)
at 
org.apache.hive.service.cli.thrift.ThriftCLIService.CloseSession(ThriftCLIService.java:527)
at 
org.apache.hive.service.rpc.thrift.TCLIService$Processor$CloseSession.getResult(TCLIService.java:1517)
at 
org.apache.hive.service.rpc.thrift.TCLIService$Processor$CloseSession.getResult(TCLIService.java:1502)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.thrift.server.TServlet.doPost(TServlet.java:83)
at 
org.apache.hive.service.cli.thrift.ThriftHttpServlet.doPost(ThriftHttpServlet.java:237)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
at 
org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:224)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
{noformat}

[jira] [Created] (HIVE-22725) Lazy evaluate HiveMetastore::fireReadTablePreEvent table computation

2020-01-13 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22725:
---

 Summary: Lazy evaluate HiveMetastore::fireReadTablePreEvent table 
computation
 Key: HIVE-22725
 URL: https://issues.apache.org/jira/browse/HIVE-22725
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


"TransactionalValidationListener" gets added in the pre-event listeners of HMS 
by default. 

[https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java#L559]

This causes issues for short select queries, as table details are computed 
for any partition lookup.

[https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java#L4984]

It would be good to lazy evaluate table lookup in this codepath.
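One way to express the lazy evaluation is a memoizing supplier around the expensive table lookup, so a listener that never inspects the table never pays for the metastore round trip. This is a generic sketch; the names are not the actual HiveMetaStore API.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch: wrap the table lookup in a Supplier that computes at
// most once, on first use.
public class LazyTableSketch {
  public static <T> Supplier<T> memoize(Supplier<T> delegate) {
    return new Supplier<T>() {
      private T value;
      private boolean computed;
      @Override public synchronized T get() {
        if (!computed) {
          value = delegate.get();
          computed = true;
        }
        return value;
      }
    };
  }

  // Demo: two get() calls through the memoized supplier cost one lookup.
  public static int demoCallCount() {
    AtomicInteger calls = new AtomicInteger();
    Supplier<String> table = memoize(() -> "tbl" + calls.incrementAndGet());
    table.get();
    table.get();
    return calls.get();
  }
}
```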



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22724) ObjectStore: Reduce number of DB calls

2020-01-13 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22724:
---

 Summary: ObjectStore: Reduce number of DB calls
 Key: HIVE-22724
 URL: https://issues.apache.org/jira/browse/HIVE-22724
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22720) AuthenticationProviderFactory shouldn

2020-01-12 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22720:
---

 Summary: AuthenticationProviderFactory shouldn
 Key: HIVE-22720
 URL: https://issues.apache.org/jira/browse/HIVE-22720
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22719) Remove Log from HiveConf::getLogIdVar

2020-01-12 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22719:
---

 Summary: Remove Log from HiveConf::getLogIdVar
 Key: HIVE-22719
 URL: https://issues.apache.org/jira/browse/HIVE-22719
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan
 Attachments: ExecuteStatemt_getResult.jpg

The log statement is in the hot path when executing a large number of tiny 
SQL statements.

!ExecuteStatemt_getResult.jpg|width=260,height=177!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22657) Add log message when stats have to be computed during calcite

2019-12-18 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22657:
---

 Summary: Add log message when stats have to be computed during 
calcite
 Key: HIVE-22657
 URL: https://issues.apache.org/jira/browse/HIVE-22657
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


When stats are not available, {{RelOptHiveTable::getColStat}} computes stats 
on the fly. However, this turns out to be a lot slower on cloud storage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22609) Reduce number of FS getFileStatus calls in AcidUtils::getHdfsDirSnapshots

2019-12-09 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22609:
---

 Summary: Reduce number of FS getFileStatus calls in 
AcidUtils::getHdfsDirSnapshots
 Key: HIVE-22609
 URL: https://issues.apache.org/jira/browse/HIVE-22609
 Project: Hive
  Issue Type: Bug
Reporter: Rajesh Balamohan


[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java#L1380]

An ACID delta folder contains {{_orc_acid_version}} and {{bucket_0}} files. 
For both these files, the parent dir is the same, so the number of 
getFileStatus calls in such cases can be halved.
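The dedup amounts to grouping child files by parent before issuing status calls, so each delta directory is stat-ed once regardless of how many files it holds. A generic sketch with string paths (the real code works on Hadoop {{Path}}/{{FileStatus}} objects):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: one getFileStatus per distinct parent directory,
// instead of one per child file.
public class ParentStatusSketch {
  public static Map<String, Integer> statusCallsByParent(List<String> files) {
    Map<String, List<String>> byParent = new HashMap<>();
    for (String f : files) {
      String parent = f.substring(0, f.lastIndexOf('/'));
      byParent.computeIfAbsent(parent, k -> new ArrayList<>()).add(f);
    }
    Map<String, Integer> calls = new HashMap<>();
    for (String parent : byParent.keySet()) {
      calls.put(parent, 1); // a single stat covers every child of this parent
    }
    return calls;
  }
}
```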



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22548) Optimise Utilities.removeTempOrDuplicateFiles when moving files to final location

2019-11-26 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22548:
---

 Summary: Optimise Utilities.removeTempOrDuplicateFiles when moving 
files to final location
 Key: HIVE-22548
 URL: https://issues.apache.org/jira/browse/HIVE-22548
 Project: Hive
  Issue Type: Improvement
  Components: Hive
Affects Versions: 3.1.2
Reporter: Rajesh Balamohan


{{Utilities.removeTempOrDuplicateFiles}} is very slow with cloud storage, as 
it executes {{listStatus}} twice and also runs in single-threaded mode.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1629
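The parallelization could look roughly like this sketch, with a lambda standing in for {{FileSystem.listStatus}}; real code would also have to cap parallelism and propagate IOExceptions, neither of which is shown here.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: list each directory once, across directories in
// parallel, instead of two sequential listStatus passes.
public class ParallelListSketch {
  public static Map<String, Integer> listAll(List<String> dirs,
      Function<String, Integer> listing) {
    Map<String, Integer> results = new ConcurrentHashMap<>();
    // parallelStream uses the common ForkJoinPool; a dedicated bounded pool
    // would be safer against saturating cloud-store request limits.
    dirs.parallelStream().forEach(d -> results.put(d, listing.apply(d)));
    return results;
  }
}
```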



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22520) MS-SQL server: Load partition throws error in TxnHandler (ACID dataset)

2019-11-20 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22520:
---

 Summary: MS-SQL server: Load partition throws error in TxnHandler 
(ACID dataset)
 Key: HIVE-22520
 URL: https://issues.apache.org/jira/browse/HIVE-22520
 Project: Hive
  Issue Type: Bug
  Components: Metastore
Affects Versions: 3.1.2
Reporter: Rajesh Balamohan


When loading an ACID table with MS-SQL server as the backend, it throws the 
following exception.

 
{noformat}
 thrift.ProcessFunction: Internal error processing add_dynamic_partitions
org.apache.hadoop.hive.metastore.api.MetaException: Unable to insert into from 
transaction database com.microsoft.sqlserver.jdbc.SQLServerException: The 
incoming request has too many parameters. The server supports a maximum of 2100 
parameters. Reduce the number of parameters and resend the request.
at 
com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:254)
at 
com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1608)
at 
com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:578)
at 
com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:508)
at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7240)
at 
com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:2869)
at 
com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:243)
at 
com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:218)
at 
com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeUpdate(SQLServerPreparedStatement.java:461)
at 
com.zaxxer.hikari.pool.ProxyPreparedStatement.executeUpdate(ProxyPreparedStatement.java:61)
at 
com.zaxxer.hikari.pool.HikariProxyPreparedStatement.executeUpdate(HikariProxyPreparedStatement.java)
at 
org.apache.hadoop.hive.metastore.txn.TxnHandler.addDynamicPartitions(TxnHandler.java:3149)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.add_dynamic_partitions(HiveMetaStore.java:7824)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:147)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:108)
at com.sun.proxy.$Proxy32.add_dynamic_partitions(Unknown Source)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$add_dynamic_partitions.getResult(ThriftHiveMetastore.java:19038)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$add_dynamic_partitions.getResult(ThriftHiveMetastore.java:19022)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at 
org.apache.hadoop.hive.metastore.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:48)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}

https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/txn/TxnHandler.java#L3258
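The usual fix for SQL Server's 2100-parameter limit is to split the values behind an {{IN (...)}} clause (or a multi-row insert) into batches and issue one statement per batch. The batch size below is illustrative; it only needs to stay under the server limit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: chunk a parameter list so no single prepared statement
// exceeds the backend's parameter cap.
public class ParamBatchSketch {
  public static <T> List<List<T>> batch(List<T> values, int maxParams) {
    List<List<T>> batches = new ArrayList<>();
    for (int i = 0; i < values.size(); i += maxParams) {
      batches.add(values.subList(i, Math.min(i + maxParams, values.size())));
    }
    return batches;
  }
}
```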
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22485) Cross product should set the conf in UnorderedPartitionedKVEdgeConfig

2019-11-12 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22485:
---

 Summary: Cross product should set the conf in 
UnorderedPartitionedKVEdgeConfig
 Key: HIVE-22485
 URL: https://issues.apache.org/jira/browse/HIVE-22485
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


SSL and other options are not sent correctly if this is not set up.

 

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DagUtils.java#L545



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22482) o.a.h.hive.q.i.AcidUtils.isInsertOnlyTable should not be computed in FileSinkOperator for every record

2019-11-12 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22482:
---

 Summary: o.a.h.hive.q.i.AcidUtils.isInsertOnlyTable should not be 
computed in FileSinkOperator for every record
 Key: HIVE-22482
 URL: https://issues.apache.org/jira/browse/HIVE-22482
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan


 

 
{noformat}

at java.util.Hashtable.get(Hashtable.java:367)
- locked <0x0006f4827098> (a java.util.Properties)
at java.util.Properties.getProperty(Properties.java:969)
at 
org.apache.hadoop.hive.ql.io.AcidUtils.isInsertOnlyTable(AcidUtils.java:2104)
at 
org.apache.hadoop.hive.ql.plan.FileSinkDesc.isMmTable(FileSinkDesc.java:333)
at 
org.apache.hadoop.hive.ql.exec.FileSinkOperator.areAllTrue(FileSinkOperator.java:1047)
at 
org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:966)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:938)
at 
org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:938)
at 
org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:128)
at 
org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:152)
at 
org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:552)
at 
org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:92)
at 
org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:76)
at 
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:426)
at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267)
at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250) 
{noformat}
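The stack trace shows every record paying for a synchronized {{Properties.getProperty}} lookup. The fix is to evaluate the flag once when the operator is initialized and reuse the cached boolean per record; the sketch below shows the pattern with a counter standing in for the expensive lookup (names are illustrative, not the FileSinkOperator fields).

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: memoize the insert-only check per operator instance so
// the synchronized table-properties lookup is not hit on every row.
public class InsertOnlySketch {
  private static final AtomicInteger propertyReads = new AtomicInteger();
  private Boolean insertOnly; // cached after first evaluation

  private boolean computeFromProperties() {
    propertyReads.incrementAndGet(); // expensive synchronized lookup in real code
    return true;
  }

  public boolean isInsertOnlyTable() {
    if (insertOnly == null) {
      insertOnly = computeFromProperties();
    }
    return insertOnly;
  }

  public static int reads() { return propertyReads.get(); }

  // Demo: processing two "rows" triggers a single properties read.
  public static int demoReadsForTwoRows() {
    InsertOnlySketch op = new InsertOnlySketch();
    int before = reads();
    op.isInsertOnlyTable();
    op.isInsertOnlyTable();
    return reads() - before;
  }
}
```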



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22465) Add ssl-client conf in TezConfigurationFactory

2019-11-06 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22465:
---

 Summary: Add ssl-client conf in TezConfigurationFactory
 Key: HIVE-22465
 URL: https://issues.apache.org/jira/browse/HIVE-22465
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22389) Repl: Optimise ReplDumpTask.incrementalDump

2019-10-22 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22389:
---

 Summary: Repl: Optimise ReplDumpTask.incrementalDump
 Key: HIVE-22389
 URL: https://issues.apache.org/jira/browse/HIVE-22389
 Project: Hive
  Issue Type: Sub-task
  Components: repl
Reporter: Rajesh Balamohan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22388) Repl: Optimise ReplicationSemanticAnalyzer.analyzeReplLoad

2019-10-22 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22388:
---

 Summary: Repl: Optimise ReplicationSemanticAnalyzer.analyzeReplLoad
 Key: HIVE-22388
 URL: https://issues.apache.org/jira/browse/HIVE-22388
 Project: Hive
  Issue Type: Sub-task
  Components: repl
Reporter: Rajesh Balamohan


This is slow with 1000 tables in the DB.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22386) Repl: Optimise ReplDumpTask::bootStrapDump

2019-10-22 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22386:
---

 Summary: Repl: Optimise ReplDumpTask::bootStrapDump
 Key: HIVE-22386
 URL: https://issues.apache.org/jira/browse/HIVE-22386
 Project: Hive
  Issue Type: Sub-task
Reporter: Rajesh Balamohan


{{ReplDumpTask::bootStrapDump}} dumps one table at a time within a database. 
The data is written in separate folders per table. This can be optimized to 
write in parallel.
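Since each table writes to its own folder, the per-table dumps are independent and can be submitted to a bounded pool. A sketch of the shape, with {{dumpTable}} standing in for the real per-table dump work and the pool size being an arbitrary example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: dump tables in parallel during bootstrap instead of
// sequentially, one task per table.
public class ParallelDumpSketch {
  public static int dumpAll(List<String> tables) {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    try {
      List<Future<String>> futures = new ArrayList<>();
      for (String table : tables) {
        futures.add(pool.submit(() -> dumpTable(table)));
      }
      for (Future<String> f : futures) {
        try {
          f.get(); // propagate any per-table dump failure
        } catch (Exception e) {
          throw new RuntimeException(e);
        }
      }
      return futures.size();
    } finally {
      pool.shutdown();
    }
  }

  static String dumpTable(String table) {
    return "dumped " + table; // writes the table's folder in the real code
  }
}
```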



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22387) Repl: Reduce FS lookups in repl bootstrap

2019-10-22 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-22387:
---

 Summary: Repl: Reduce FS lookups in repl bootstrap
 Key: HIVE-22387
 URL: https://issues.apache.org/jira/browse/HIVE-22387
 Project: Hive
  Issue Type: Sub-task
  Components: repl
Reporter: Rajesh Balamohan


During bootstrap, {{dbRoot}} is obtained per database. It need not be 
validated for every table dump (in {{TableExport.Paths}}).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

