[jira] [Created] (HIVE-24252) Improve decision model for using semijoin reducers

2020-10-09 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-24252:
--

 Summary: Improve decision model for using semijoin reducers
 Key: HIVE-24252
 URL: https://issues.apache.org/jira/browse/HIVE-24252
 Project: Hive
  Issue Type: Improvement
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


After a few experiments with the TPC-DS 10TB dataset, we observed that in some 
cases semijoin reducers were not effective; they didn't reduce the number of 
records, or they reduced the relation only a tiny bit. 

In some cases we can make the semijoin reducer more effective by adding more 
columns, but this also requires a bigger bloom filter, so the decision about how 
many columns to include in the bloom filter becomes more delicate.

The current decision model always chooses multi-column semijoin reducers when 
they are available, but this may not always be beneficial if a single column 
can already reduce the target relation significantly.
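
For illustration, a refined decision model could compare the estimated benefit 
of each candidate before committing to the multi-column variant. The sketch 
below is hypothetical ({{Candidate}} and {{estimatedSelectivity}} are not 
existing Hive classes); it only captures the idea of preferring fewer columns 
when they already reduce the target enough.

{code:java}
import java.util.List;

/** Hypothetical sketch: prefer the candidate with the fewest columns whose
 *  estimated reduction of the target relation is already "good enough". */
final class SemijoinReducerChoice {
  interface Candidate {
    double estimatedSelectivity(); // assumed: fraction of target rows surviving (0.0 = best)
    int columnCount();             // more columns -> bigger bloom filter
  }

  static Candidate choose(List<Candidate> candidates, double goodEnough) {
    Candidate best = null;
    for (Candidate c : candidates) {
      if (c.estimatedSelectivity() <= goodEnough
          && (best == null || c.columnCount() < best.columnCount())) {
        best = c; // comparable benefit at a smaller bloom filter cost
      }
    }
    return best; // null: no candidate is effective enough to be worth its overhead
  }
}
{code}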



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-24251) Improve bloom filter size estimation for multi column semijoin reducers

2020-10-09 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-24251:
--

 Summary: Improve bloom filter size estimation for multi column 
semijoin reducers
 Key: HIVE-24251
 URL: https://issues.apache.org/jira/browse/HIVE-24251
 Project: Hive
  Issue Type: Improvement
  Components: Query Planning
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


There are various cases where the expected size of the bloom filter is largely 
underestimated, making the semijoin reducer completely ineffective. This is more 
relevant for multi-column semijoin reducers since the current 
[code|https://github.com/apache/hive/blob/d61c9160ffa5afbd729887c3db690eccd7ef8238/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFBloomFilter.java#L273]
 does not take multiple columns into account.
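
As a sketch of what a multi-column aware estimate could look like (assuming 
independent columns; this is not the current Hive logic), the expected number of 
distinct entries can be approximated by the product of per-column NDVs capped by 
the row count of the source relation:

{code:java}
/** Hedged sketch: expected distinct bloom filter entries for a multi-column key. */
static long expectedEntries(long sourceRowCount, long... columnNDVs) {
  long product = 1;
  for (long ndv : columnNDVs) {
    product *= Math.max(ndv, 1);
    if (product <= 0 || product >= sourceRowCount) {
      return sourceRowCount; // saturate: no more distinct keys than source rows
    }
  }
  return product; // assumes column independence, so this tends to overestimate
}
{code}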



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-24221) Use vectorizable expression to combine multiple columns in semijoin bloom filters

2020-10-01 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-24221:
--

 Summary: Use vectorizable expression to combine multiple columns 
in semijoin bloom filters
 Key: HIVE-24221
 URL: https://issues.apache.org/jira/browse/HIVE-24221
 Project: Hive
  Issue Type: Improvement
  Components: Query Planning
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


Currently, multi-column semijoin reducers use an n-ary call to 
GenericUDFMurmurHash to combine multiple values into one, which is then used as 
an entry to the bloom filter. However, there are no vectorized operators that 
handle n-ary inputs; the same goes for the vectorized implementation of 
GenericUDFMurmurHash introduced in HIVE-23976. 

The goal of this issue is to choose an alternative way of combining multiple 
values into a single bloom filter entry using only vectorized operators.
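
One candidate direction (an illustration, not a settled design) is to fold the 
n-ary call into a chain of binary calls, since each binary call can be backed by 
a vectorized expression; Hive's built-in {{hash}} is used below purely for 
illustration:

{code:sql}
-- Hypothetical rewrite: instead of the n-ary hash(fname, lname, country),
-- combine the values pairwise so that every call is binary.
SELECT hash(hash(fname, lname), country) AS bloom_entry
FROM author;
{code}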



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-24180) 'hive.txn.heartbeat.threadpool.size' is deprecated in HiveConf with no alternative

2020-09-18 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-24180:
--

 Summary: 'hive.txn.heartbeat.threadpool.size' is deprecated in 
HiveConf with no alternative
 Key: HIVE-24180
 URL: https://issues.apache.org/jira/browse/HIVE-24180
 Project: Hive
  Issue Type: Bug
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


HiveConf.ConfVars#HIVE_TXN_HEARTBEAT_THREADPOOL_SIZE appears deprecated, with 
javadoc pointing to MetastoreConf.TXN_HEARTBEAT_THREADPOOL_SIZE, but there is no 
such configuration variable in MetastoreConf.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-24179) Memory leak in HS2 DbTxnManager when compiling SHOW LOCKS statement

2020-09-18 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-24179:
--

 Summary: Memory leak in HS2 DbTxnManager when compiling SHOW LOCKS 
statement
 Key: HIVE-24179
 URL: https://issues.apache.org/jira/browse/HIVE-24179
 Project: Hive
  Issue Type: Bug
  Components: HiveServer2
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis
 Fix For: 4.0.0
 Attachments: summary.png

The problem can be reproduced by repeatedly executing a SHOW LOCKS statement and 
monitoring the heap memory of HS2. For a small heap (e.g., 2g) it only takes a 
few minutes before the server crashes with an OutOfMemoryError such as the one 
shown below.

{noformat}
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOf(Arrays.java:3332)
at 
java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at 
java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at 
org.apache.maven.surefire.booter.ForkedChannelEncoder.encodeMessage(ForkedChannelEncoder.j
at 
org.apache.maven.surefire.booter.ForkedChannelEncoder.setOutErr(ForkedChannelEncoder.java:
at 
org.apache.maven.surefire.booter.ForkedChannelEncoder.stdErr(ForkedChannelEncoder.java:166
at 
org.apache.maven.surefire.booter.ForkingRunListener.writeTestOutput(ForkingRunListener.jav
at 
org.apache.maven.surefire.report.ConsoleOutputCapture$ForwardingPrintStream.write(ConsoleO
at 
org.apache.logging.log4j.core.util.CloseShieldOutputStream.write(CloseShieldOutputStream.j
at 
org.apache.logging.log4j.core.appender.OutputStreamManager.writeToDestination(OutputStream
at 
org.apache.logging.log4j.core.appender.OutputStreamManager.flushBuffer(OutputStreamManager
at 
org.apache.logging.log4j.core.appender.OutputStreamManager.flush(OutputStreamManager.java:
at 
org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.directEncodeEvent(Abst
at 
org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.tryAppend(AbstractOutp
at 
org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.append(AbstractOutputS
at 
org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:
at 
org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:12
at 
org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(Appender
at 
org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84)
at 
org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:543)
at 
org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:502)
at 
org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:485)
at 
org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:460)
at 
org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletio
at org.apache.logging.log4j.core.Logger.log(Logger.java:162)
at 
org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2190)
at 
org.apache.logging.log4j.spi.AbstractLogger.logMessageTrackRecursion(AbstractLogger.java:2
at 
org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2127)
at 
org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2008)
at 
org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1867)
at org.apache.logging.slf4j.Log4jLogger.info(Log4jLogger.java:179)
{noformat}

The heap dump shows (summary.png) that most of the memory is consumed by 
{{Hashtable$Entry}} and {{ConcurrentHashMap$Node}} objects coming from Hive 
configurations referenced by {{DbTxnManager}}. 

The latter are not eligible for garbage collection since at 
[construction|https://github.com/apache/hive/blob/975c832b6d069559c5b406a4aa8def3180fe4e75/ql/src/java/org/apache/hadoop/hive/ql/lockmgr/DbTxnManager.java#L212]
 time they are implicitly captured by a callback stored inside 
ShutdownHookManager.

When the {{DbTxnManager}} is closed properly the leak is not present, since the 
callback is 
[removed|https://github.com/apache/hive/blob/975c832b6d069559c5b406a4aa8def3180fe4e75/ql/src/java/org/apache/hadoop/hive/ql/lockmgr/DbTxnManager.java#L882]
 from ShutdownHookManager. 
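
The retention pattern looks roughly like the sketch below (hedged illustration; 
{{shutdownRuntime}} is a hypothetical name): the registered hook captures the 
configuration and keeps it strongly reachable until the JVM exits or the hook is 
removed.

{code:java}
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hive.common.util.ShutdownHookManager;

class LeakIllustration {
  private final Runnable hook;

  LeakIllustration(HiveConf conf) {
    // The lambda captures `conf`, keeping it strongly reachable from
    // ShutdownHookManager until the hook is removed -- which never happens for
    // the DbTxnManager instances created while compiling SHOW LOCKS.
    hook = () -> shutdownRuntime(conf);        // shutdownRuntime is hypothetical
    ShutdownHookManager.addShutdownHook(hook); // registered at construction time
  }

  void close() {
    ShutdownHookManager.removeShutdownHook(hook); // only runs on a proper close()
  }

  private void shutdownRuntime(HiveConf conf) { /* hypothetical cleanup */ }
}
{code}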

{{SHOW LOCKS}} statements create 
([ShowDbLocksAnalyzer|https://github.com/apache/hive/blob/975c832b6d069559c5b406a4aa8def3180fe4e75/ql/src/java/org/apache/hadoop/hive/ql/ddl/table/lock/show/ShowDbLocksAnalyzer.java#L52],
 

[jira] [Created] (HIVE-24167) NPE in query 14 while generating plan for sub query predicate

2020-09-15 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-24167:
--

 Summary: NPE in query 14 while generating plan for sub query 
predicate
 Key: HIVE-24167
 URL: https://issues.apache.org/jira/browse/HIVE-24167
 Project: Hive
  Issue Type: Bug
  Components: CBO
Reporter: Stamatis Zampetakis


TPC-DS query 14 (cbo_query14.q and query14.q) fails with an NPE on the metastore 
with the partitioned TPC-DS 30TB dataset while generating the plan for a 
subquery predicate. 

The problem can be reproduced using the PR in HIVE-23965.

The current stacktrace shows that the NPE appears while trying to display the 
debug message, but even if this line didn't exist the query would fail again 
later on.

{noformat}
java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:10867)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11765)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11635)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlanForSubQueryPredicate(SemanticAnalyzer.java:3375)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFilterPlan(SemanticAnalyzer.java:3473)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:10819)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11765)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11625)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11625)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11635)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genOPTree(SemanticAnalyzer.java:12417)
at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:718)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:12519)
at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:443)
at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:301)
at 
org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:171)
at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:301)
at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:220)
at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:104)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:173)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:414)
at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:363)
at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:357)
at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:129)
at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:231)
at 
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
at org.apache.hadoop.hive.cli.CliDriver.processCmd1(CliDriver.java:203)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:129)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:424)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:355)
at 
org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:740)
at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:710)
at 
org.apache.hadoop.hive.cli.control.CorePerfCliDriver.runTest(CorePerfCliDriver.java:103)
at 
org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:157)
at 
org.apache.hadoop.hive.cli.TestTezTPCDS30TBPerfCliDriver.testCliDriver(TestTezTPCDS30TBPerfCliDriver.java:83)
{noformat}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-24112) TestMiniLlapLocalCliDriver[dynamic_semijoin_reduction_on_aggcol] is flaky

2020-09-02 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-24112:
--

 Summary: 
TestMiniLlapLocalCliDriver[dynamic_semijoin_reduction_on_aggcol] is flaky
 Key: HIVE-24112
 URL: https://issues.apache.org/jira/browse/HIVE-24112
 Project: Hive
  Issue Type: Bug
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis
 Fix For: 4.0.0


http://ci.hive.apache.org/job/hive-flaky-check/96/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-24104) NPE due to null key columns in ReduceSink after deduplication

2020-09-01 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-24104:
--

 Summary: NPE due to null key columns in ReduceSink after 
deduplication
 Key: HIVE-24104
 URL: https://issues.apache.org/jira/browse/HIVE-24104
 Project: Hive
  Issue Type: Bug
  Components: Physical Optimizer
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


In some cases the {{ReduceSinkDeDuplication}} optimization creates ReduceSink 
operators where the key columns are null. This can lead to NPEs in various 
places in the code. 

The following stacktrace shows an example where an NPE is raised due to the key 
columns being null.

{noformat}
java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.plan.ExprNodeDesc$ExprNodeDescEqualityWrapper.equals(ExprNodeDesc.java:141)
at java.util.AbstractList.equals(AbstractList.java:523)
at 
org.apache.hadoop.hive.ql.optimizer.SetReducerParallelism.process(SetReducerParallelism.java:101)
at 
org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
at 
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105)
at 
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89)
at 
org.apache.hadoop.hive.ql.lib.ForwardWalker.walk(ForwardWalker.java:74)
at 
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:120)
at 
org.apache.hadoop.hive.ql.parse.TezCompiler.runStatsDependentOptimizations(TezCompiler.java:492)
at 
org.apache.hadoop.hive.ql.parse.TezCompiler.optimizeOperatorPlan(TezCompiler.java:226)
at 
org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCompiler.java:161)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:12643)
at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:443)
at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:301)
at 
org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:171)
at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:301)
at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:220)
at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:104)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:173)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:414)
at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:363)
at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:357)
at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:129)
at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:231)
at 
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
at org.apache.hadoop.hive.cli.CliDriver.processCmd1(CliDriver.java:203)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:129)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:424)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:355)
at 
org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:740)
at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:710)
at 
org.apache.hadoop.hive.cli.control.CoreCliDriver.runTest(CoreCliDriver.java:170)
at 
org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:157)
at 
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver(TestMiniLlapLocalCliDriver.java:62)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.apache.hadoop.hive.cli.control.CliAdapter$2$1.evaluate(CliAdapter.java:135)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at 
org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
at 

[jira] [Created] (HIVE-24031) Infinite planning time on syntactically big queries

2020-08-12 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-24031:
--

 Summary: Infinite planning time on syntactically big queries
 Key: HIVE-24031
 URL: https://issues.apache.org/jira/browse/HIVE-24031
 Project: Hive
  Issue Type: Bug
  Components: Query Planning
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis
 Fix For: 4.0.0


Syntactically big queries (~1 million tokens), such as the query shown below, 
lead to very long (seemingly infinite) planning times.

{code:sql}
select posexplode(array('item1', 'item2', ..., 'item1M'));
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-24018) Review necessity of AggregationDesc#setGenericUDAFWritableEvaluator for bloom filter aggregations

2020-08-07 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-24018:
--

 Summary: Review necessity of 
AggregationDesc#setGenericUDAFWritableEvaluator for bloom filter aggregations
 Key: HIVE-24018
 URL: https://issues.apache.org/jira/browse/HIVE-24018
 Project: Hive
  Issue Type: Improvement
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


A few places in the code have the following pattern:
{code:java}
GenericUDAFBloomFilterEvaluator bloomFilterEval = new GenericUDAFBloomFilterEvaluator();
...
AggregationDesc bloom = new AggregationDesc("bloom_filter", bloomFilterEval, p, false, mode);
bloom.setGenericUDAFWritableEvaluator(bloomFilterEval);
{code}
where the bloom filter evaluator is passed both in the constructor of the 
aggregation and, directly after, via a setter. The use of the setter is 
necessary, otherwise the query fails at runtime, but the pattern is a bit 
confusing. 

Investigate if there is a way to avoid passing the evaluator twice. 
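
One possible direction, sketched below under the assumption that 
{{AggregationDesc}} itself can be modified (this is not current Hive code), is 
to derive the writable evaluator inside the constructor so callers no longer 
pass the same object twice:

{code:java}
// Hypothetical constructor change in AggregationDesc:
public AggregationDesc(String genericUDAFName, GenericUDAFEvaluator evaluator,
    List<ExprNodeDesc> parameters, boolean distinct, GenericUDAFEvaluator.Mode mode) {
  this.genericUDAFName = genericUDAFName;
  this.parameters = parameters;
  this.distinct = distinct;
  this.mode = mode;
  setGenericUDAFEvaluator(evaluator);
  if (evaluator instanceof GenericUDAFBloomFilterEvaluator) {
    setGenericUDAFWritableEvaluator(evaluator); // previously the caller's duty
  }
}
{code}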

To reproduce the failure, remove the setter and run the following test.
{noformat}
mvn test -Dtest=TestMiniLlapLocalCliDriver 
-Dqfile=vectorized_dynamic_semijoin_reduction.q -Dtest.output.overwrite 
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-24016) Share bloom filter construction branch in multi column semijoin reducers

2020-08-07 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-24016:
--

 Summary: Share bloom filter construction branch in multi column 
semijoin reducers
 Key: HIVE-24016
 URL: https://issues.apache.org/jira/browse/HIVE-24016
 Project: Hive
  Issue Type: Improvement
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


In HIVE-21196, we added a transformation capable of merging single-column 
semijoin reducers into a multi-column semijoin reducer.

Currently, it transforms the subplan SB0 into subplan SB1.

+SB0+
{noformat}
  / RS -> TS_1[Editor] 
 / SEL[fname] - GB - RS - GB -  RS -> TS_0[Author] 
 SOURCE 
 \ SEL[lname] - GB - RS - GB -  RS -> TS_0[Author]
  \ RS -> TS_1[Editor]

TS_0[Author] - FIL[in_bloom(fname) ^ in_bloom(lname)]
TS_1[Editor] - FIL[in_bloom(fname) ^ in_bloom(lname)]  
{noformat}

+SB1+
{noformat}
 / SEL[fname,lname] - GB - RS - GB - RS -> TS[Author] - FIL[in_bloom(hash(fname,lname))]
 SOURCE
 \ SEL[fname,lname] - GB - RS - GB - RS -> TS[Editor] - FIL[in_bloom(hash(fname,lname))]
{noformat}

Observe that in SB1 we could share the common path that creates the bloom 
filter (SEL - GB - RS - GB) to obtain a plan like SB2.

+SB2+
{noformat}
   / RS -> TS[Author] - FIL[in_bloom(hash(fname,lname))]
 SOURCE - SEL[fname,lname] - GB - RS - GB -
   \ RS -> TS[Editor] - FIL[in_bloom(hash(fname,lname))]
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23999) Unify the code creating single and multi column semijoin reducers

2020-08-06 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23999:
--

 Summary: Unify the code creating single and multi column semijoin 
reducers
 Key: HIVE-23999
 URL: https://issues.apache.org/jira/browse/HIVE-23999
 Project: Hive
  Issue Type: Improvement
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


In HIVE-21196, we added a transformation capable of merging single-column 
semijoin reducers into a multi-column semijoin reducer.

The code for creating multi-column semijoin reducers in SemiJoinReductionMerge 
presents some similarities with the code creating single-column semijoin 
reducers in DynamicPartitionPruningOptimization.

Possibly we could refactor the respective parts to unify the creation logic of 
semijoin reducers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23976) Enable vectorization for multi-col semi join reducers

2020-08-03 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23976:
--

 Summary: Enable vectorization for multi-col semi join reducers
 Key: HIVE-23976
 URL: https://issues.apache.org/jira/browse/HIVE-23976
 Project: Hive
  Issue Type: Improvement
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


HIVE-21196 introduces multi-column semijoin reducers in the query engine. 
However, the implementation relies on GenericUDFMurmurHash, which is not 
vectorized, so the respective operators cannot be executed in vectorized mode. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23965) Improve plan regression tests using TPCDS30TB metastore dump and custom configs

2020-07-31 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23965:
--

 Summary: Improve plan regression tests using TPCDS30TB metastore 
dump and custom configs
 Key: HIVE-23965
 URL: https://issues.apache.org/jira/browse/HIVE-23965
 Project: Hive
  Issue Type: Improvement
Reporter: Stamatis Zampetakis


The existing regression tests (HIVE-12586) based on TPC-DS have certain 
shortcomings:

The table statistics do not reflect cardinalities from a specific TPC-DS scale 
factor (SF). Some tables are from a 30TB dataset, others from a 200GB dataset, 
and others from a 3GB dataset. This mix leads to plans that may never appear 
when using an actual TPC-DS dataset. 

The existing statistics do not contain information about partitions, something 
that can have a big impact on the resulting plans.

The existing regression tests rely more or less on the default configuration 
(hive-site.xml). In real-life scenarios, though, some configurations differ and 
may impact the choices of the optimizer.

This issue aims to address the above shortcomings by using a curated TPCDS30TB 
metastore dump along with some custom hive configurations. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23964) SemanticException in query 30 while generating logical plan

2020-07-31 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23964:
--

 Summary: SemanticException in query 30 while generating logical 
plan
 Key: HIVE-23964
 URL: https://issues.apache.org/jira/browse/HIVE-23964
 Project: Hive
  Issue Type: Bug
Reporter: Stamatis Zampetakis
 Attachments: cbo_query30_stacktrace.txt

{{Invalid table alias or column reference 'c_last_review_date'}} is thrown when 
running TPC-DS query 30 (cbo_query30.q, query30.q) on the metastore with the 
partitioned TPC-DS 30TB dataset. 

The respective stacktrace is attached to this case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23963) UnsupportedOperationException in queries 74 and 84 while applying HiveCardinalityPreservingJoinRule

2020-07-31 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23963:
--

 Summary: UnsupportedOperationException in queries 74 and 84 while 
applying HiveCardinalityPreservingJoinRule
 Key: HIVE-23963
 URL: https://issues.apache.org/jira/browse/HIVE-23963
 Project: Hive
  Issue Type: Bug
  Components: CBO
Reporter: Stamatis Zampetakis
 Attachments: cbo_query74_stacktrace.txt, cbo_query84_stacktrace.txt

The following TPC-DS queries: 
* cbo_query74.q
* cbo_query84.q 
* query74.q 
* query84.q 

fail on the metastore with the partitioned TPC-DS 30TB dataset.

The stacktraces for cbo_query74 and cbo_query84 show that the problem 
originates while applying HiveCardinalityPreservingJoinRule.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23946) Improve control flow and error handling in QTest dataset loading/unloading

2020-07-29 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23946:
--

 Summary: Improve control flow and error handling in QTest dataset 
loading/unloading
 Key: HIVE-23946
 URL: https://issues.apache.org/jira/browse/HIVE-23946
 Project: Hive
  Issue Type: Improvement
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


This issue focuses mainly on the following methods:
[QTestDatasetHandler#initDataset|https://github.com/apache/hive/blob/6fbd54c0af60276d49b237defb550938c9c32610/itests/util/src/main/java/org/apache/hadoop/hive/ql/dataset/QTestDatasetHandler.java#L76]
[QTestDatasetHandler#unloadDataset|https://github.com/apache/hive/blob/6fbd54c0af60276d49b237defb550938c9c32610/itests/util/src/main/java/org/apache/hadoop/hive/ql/dataset/QTestDatasetHandler.java#L95]

related to QTest dataset loading and unloading.

The boolean return type in these methods is redundant since they either fail or 
return true (they never return false).

The methods should throw an Exception instead of an AssertionError to indicate 
failure. This allows code higher up the stack to perform proper recovery and 
properly report the failure. At the moment, if an AssertionError is raised from 
these methods, dependent code (e.g., 
[CoreCliDriver|https://github.com/apache/hive/blob/6fbd54c0af60276d49b237defb550938c9c32610/itests/util/src/main/java/org/apache/hadoop/hive/cli/control/CoreCliDriver.java#L188])
 fails to notice that the query has failed. 

In case of a failure in loading/unloading, the environment (instance and class 
variables) is not properly cleaned up, leading to failures in all subsequent 
tests.
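
In sketch form (hypothetical signature and helper names; the real methods take 
more context), the proposed shape is:

{code:java}
// Hedged sketch: no redundant boolean return, a checked exception instead of
// AssertionError, and cleanup that runs even when loading fails.
public void initDataset(String datasetName) throws Exception {
  try {
    loadDataset(datasetName);           // hypothetical helper
  } catch (Exception e) {
    cleanupDatasetState(datasetName);   // leave no stale instance/class variables
    throw e;                            // callers like CoreCliDriver can report it
  }
}
{code}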





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23940) Add TPCH tables (scale factor 0.001) as qt datasets

2020-07-27 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23940:
--

 Summary: Add TPCH tables (scale factor 0.001) as qt datasets
 Key: HIVE-23940
 URL: https://issues.apache.org/jira/browse/HIVE-23940
 Project: Hive
  Issue Type: Improvement
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


Currently there are only two TPCH tables (lineitem, part) in the qt datasets 
and the data do not reflect an actual scale factor. 

The TPC-H schema is quite popular and having all tables is useful for creating 
meaningful and understandable queries. 

Moreover, keeping the standard proportions yields query plans that remain 
meaningful when the scale factor changes and makes it easier to compare the 
correctness of the results against other databases.  

The goal of this issue is to add all TPCH tables with their data at scale 
factor 0.001 as qt datasets.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23934) Refactor TezCompiler#markSemiJoinForDPP to avoid redundant operations in nested while

2020-07-26 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23934:
--

 Summary: Refactor TezCompiler#markSemiJoinForDPP to avoid 
redundant operations in nested while
 Key: HIVE-23934
 URL: https://issues.apache.org/jira/browse/HIVE-23934
 Project: Hive
  Issue Type: Improvement
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


Most of the code inside the nested while loop can be extracted and computed 
only once in the external loop. Moreover, there are catch clauses for NPE which 
seem rather predictable and could probably be avoided by proper checks.  
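
Schematically, the refactoring pattern is the following (hedged sketch with 
hypothetical helpers, not the actual TezCompiler code):

{code:java}
for (ReduceSinkOperator rs : semijoinRsOps) {
  TableScanOperator ts = lookupTargetTableScan(rs); // hypothetical; computed once per rs
  if (ts == null || ts.getConf() == null) {
    continue; // explicit check instead of a catch (NullPointerException) clause
  }
  for (Operator<? extends OperatorDesc> branch : ts.getChildOperators()) {
    markForDPPIfBeneficial(rs, branch); // only per-branch work stays in the inner loop
  }
}
{code}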

The goal of this issue is to refactor the TezCompiler#markSemiJoinForDPP method 
to avoid redundant operations and improve code readability. As a side effect of 
this refactoring the method will be slightly more efficient, although the 
difference is unlikely to be observable in practice.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23781) Incomplete partition column stats in CachedStore may lead to wrong aggregate stats

2020-06-30 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23781:
--

 Summary: Incomplete partition column stats in CachedStore may lead 
to wrong aggregate stats
 Key: HIVE-23781
 URL: https://issues.apache.org/jira/browse/HIVE-23781
 Project: Hive
  Issue Type: Bug
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


Requesting aggregate stats from the Metastore ({{RawStore#get_aggr_stats_for}}) 
may return wrong results when the backing implementation is CachedStore and 
column statistics are missing from the cache.
 
The suspicious code lies inside {{CachedStore#mergeColStatsForPartitions}}, 
which returns an [empty 
object|https://github.com/apache/hive/blob/31ee14644bf6105360d6266baa8c6c8060d38ea3/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/cache/CachedStore.java#L2267]
 when no stats are found in the cache. This is considered a valid value by the 
consumer, so no additional lookup is performed in the rawstore to fetch the 
actual values.

Moreover, in the case where the cache holds values for some partitions but not 
for all those requested, the result will be wrong, assuming that the underlying 
rawstore has information about the requested partitions.
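
A hedged sketch of the expected control flow (the names below are hypothetical 
and the types simplified; the real fix may look different): a partial or empty 
cache answer should be treated as "no answer" so that the raw store is 
consulted.

{code:java}
AggrStats aggrStatsFor(List<String> partNames, List<String> colNames) throws MetaException {
  List<ColumnStatisticsObj> cached =
      sharedCache.getPartitionColStats(dbName, tblName, partNames, colNames);
  if (cached == null || cached.size() < partNames.size()) {
    // Some requested partitions are missing from the cache: fall back to the
    // raw store instead of aggregating over an incomplete (or empty) result.
    return rawStore.get_aggr_stats_for(catName, dbName, tblName, partNames, colNames);
  }
  return mergeColStatsForPartitions(cached); // safe: the cache covered every partition
}
{code}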



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23768) Metastore's update service wrongly strips partition column stats from the cache

2020-06-26 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23768:
--

 Summary: Metastore's update service wrongly strips partition 
column stats from the cache
 Key: HIVE-23768
 URL: https://issues.apache.org/jira/browse/HIVE-23768
 Project: Hive
  Issue Type: Bug
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


Metastore's update service wrongly strips partition column stats from the cache 
in an attempt to update them. The issue may go unnoticed since missing stats do 
not lead to query failures. 

However, missing stats can significantly alter the query plan, affecting 
performance. Moreover, they lead to flakiness since sometimes the stats are 
present and sometimes they are not, so the same query can have a different plan 
over time. 

Normally, missing elements from the cache shouldn't be a correctness problem 
since we can always fall back to the raw stats. Unfortunately, there are many 
interconnections with other parts of the code (e.g., code to obtain aggregate 
statistics) where this contract breaks.   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23742) Remove unintentional execution of TPC-DS query39 in qtests

2020-06-22 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23742:
--

 Summary: Remove unintentional execution of TPC-DS query39 in qtests
 Key: HIVE-23742
 URL: https://issues.apache.org/jira/browse/HIVE-23742
 Project: Hive
  Issue Type: Task
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


TPC-DS queries under clientpositive/perf are meant only to check plan 
regressions, so they should never actually be executed; thus the execution part 
should be removed from query39.q and cbo_query39.q.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23684) Large underestimation in NDV stats when input and join cardinality ratio is big

2020-06-12 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23684:
--

 Summary: Large underestimation in NDV stats when input and join 
cardinality ratio is big
 Key: HIVE-23684
 URL: https://issues.apache.org/jira/browse/HIVE-23684
 Project: Hive
  Issue Type: Bug
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


Large underestimations of NDV values may occur after a join operation since the 
current logic will decrease the original NDV values proportionally.

The 
[code|https://github.com/apache/hive/blob/1271d08a3c51c021fa710449f8748b8cdb12b70f/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L2558]
 compares the number of rows of each relation before the join with the number 
of rows after the join and extracts a ratio for each side. Based on this ratio, 
it adapts (reduces) the NDVs accordingly.
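
In other words, the adaptation amounts to something like the following (a sketch 
of the described behaviour, not the exact Hive code):

{code:java}
/** Proportional decrease of a column's NDV after a join. */
static long adaptNDV(long originalNDV, long inputRowCount, long joinRowCount) {
  double ratio = (double) joinRowCount / inputRowCount; // << 1 for selective joins
  return (long) Math.ceil(originalNDV * ratio);         // NDV shrinks by the same factor
}
{code}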

Consider for instance the following query:
{code:sql}
select inv_warehouse_sk
 , inv_item_sk
 , stddev_samp(inv_quantity_on_hand) stdev
 , avg(inv_quantity_on_hand) mean
from inventory
   , date_dim
where inv_date_sk = d_date_sk
  and d_year = 1999
  and d_moy = 2
group by inv_warehouse_sk, inv_item_sk;
{code}
For the sake of the discussion, I outline below some relevant stats (from 
TPCDS30tb):
 T(inventory) = 1627857000
 T(date_dim) = 73049
 T(inventory JOIN date_dim[d_year=1999 AND d_moy=2]) = 24948000
 V(inventory, inv_date_sk) = 261
 V(inventory, inv_item_sk) = 42
 V(inventory, inv_warehouse_sk) = 27
 V(date_dim, d_date_sk) = 73049

For instance, in this query the join between inventory and date_dim has ~24M 
rows while inventory has ~1.6B, so the NDVs of the columns coming from inventory 
are reduced by a factor of ~65, and we end up with V(JOIN, inv_item_sk) = ~6K 
while the real one is 231000.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23534) NPE in RetryingMetaStoreClient#invoke when catching MetaException with no message

2020-05-22 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23534:
--

 Summary: NPE in RetryingMetaStoreClient#invoke when catching 
MetaException with no message
 Key: HIVE-23534
 URL: https://issues.apache.org/jira/browse/HIVE-23534
 Project: Hive
  Issue Type: Bug
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


The RetryingMetaStoreClient#invoke method catches MetaException and attempts to 
classify it by checking the message. However, there are cases (e.g., various 
places in 
[ObjectStore|https://github.com/apache/hive/blob/716f1f9a945a9a11e6702754667660d27e0a5cf4/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L3916])
 where the message of the MetaException is null, and this leads to an NPE.
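
A minimal defensive sketch (assuming the classification is message-based, as 
described above; the method name and pattern are hypothetical):

{code:java}
import java.util.regex.Pattern;
import org.apache.hadoop.hive.metastore.api.MetaException;

/** Null-safe check before classifying a MetaException by its message. */
static boolean messageMatches(MetaException e, Pattern classificationPattern) {
  String msg = e.getMessage();
  return msg != null && classificationPattern.matcher(msg).matches(); // no NPE on null
}
{code}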



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23532) NPE when fetching incomplete column statistics from the metastore

2020-05-22 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23532:
--

 Summary: NPE when fetching incomplete column statistics from the 
metastore
 Key: HIVE-23532
 URL: https://issues.apache.org/jira/browse/HIVE-23532
 Project: Hive
  Issue Type: Bug
Reporter: Stamatis Zampetakis


Certain operations may store incomplete column statistics in the metastore. 
Fetching those statistics back from the metastore leads to a 
{{NullPointerException}}.

For instance, consider a column "name" of type string. If we have statistics 
for this column then the following info must be available:
* maxColLen
* avgColLen
* numNulls
* numDVs

Executing the following statement on a table with no stats updates a subset of 
the statistics for this column:

{code:sql}
ALTER TABLE example UPDATE STATISTICS for column name SET ('numDVs'='242', 
'numNulls'='5');
{code}

Fetching this kind of statistics leads to an NPE that sometimes pops up in the 
client and other times is buried in the logs, leading to incomplete column 
stats during optimization and execution of a query.

Usually the stacktrace is similar to the one below:
{noformat}
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.metadata.Hive.getTableColumnStatistics(Hive.java:5251)
at 
org.apache.hadoop.hive.ql.ddl.table.info.desc.DescTableOperation.getColumnDataColPathSpecified(DescTableOperation.java:216)
at 
org.apache.hadoop.hive.ql.ddl.table.info.desc.DescTableOperation.execute(DescTableOperation.java:94)
at org.apache.hadoop.hive.ql.ddl.DDLTask.execute(DDLTask.java:80)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213)
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105)
at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:362)
at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:335)
at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:246)
at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:109)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:723)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:492)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:486)
at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:164)
at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:230)
at 
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:256)
at org.apache.hadoop.hive.cli.CliDriver.processCmd1(CliDriver.java:201)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:127)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:422)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:353)
at 
org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:730)
at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:700)
at 
org.apache.hadoop.hive.cli.control.CoreCliDriver.runTest(CoreCliDriver.java:170)
at 
org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:157)
at 
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver(TestMiniLlapLocalCliDriver.java:62)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.apache.hadoop.hive.cli.control.CliAdapter$2$1.evaluate(CliAdapter.java:135)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
 

[jira] [Created] (HIVE-23485) Bound GroupByOperator stats using largest NDV among columns

2020-05-17 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23485:
--

 Summary: Bound GroupByOperator stats using largest NDV among 
columns
 Key: HIVE-23485
 URL: https://issues.apache.org/jira/browse/HIVE-23485
 Project: Hive
  Issue Type: Improvement
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


Consider the following SQL query:

{code:sql}
select id, name from person group by id, name;
{code}

and assume that the person table contains the following tuples:

{code:sql}
insert into person values (0, 'A') ;
insert into person values (1, 'A') ;
insert into person values (2, 'B') ;
insert into person values (3, 'B') ;
insert into person values (4, 'B') ;
insert into person values (5, 'C') ;
{code}

If we know the number of distinct values (NDV) for all columns in the group by 
clause, then we can infer a lower bound for the total number of rows by taking 
the maximum NDV of the involved columns. 
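
A sketch of the bound, using the numbers of this example (NDV(id)=6, 
NDV(name)=3, so at least 6 rows):

{code:java}
/** Lower-bound a GROUP BY estimate by the largest NDV among its key columns. */
static long boundGroupByRows(long currentEstimate, long inputRows, long... keyNDVs) {
  long lowerBound = 1;
  for (long ndv : keyNDVs) {
    lowerBound = Math.max(lowerBound, ndv); // e.g. max(6, 3) = 6
  }
  // never exceed the input row count, never go below the largest NDV
  return Math.min(inputRows, Math.max(currentEstimate, lowerBound));
}
{code}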

Currently the query in the scenario above has the following plan:

{noformat}
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)

Stage-0
  Fetch Operator
limit:-1
Stage-1
  Reducer 2 vectorized
  File Output Operator [FS_11]
Group By Operator [GBY_10] (rows=3 width=92)
  Output:["_col0","_col1"],keys:KEY._col0, KEY._col1
<-Map 1 [SIMPLE_EDGE] vectorized
  SHUFFLE [RS_9]
PartitionCols:_col0, _col1
Group By Operator [GBY_8] (rows=3 width=92)
  Output:["_col0","_col1"],keys:id, name
  Select Operator [SEL_7] (rows=6 width=92)
Output:["id","name"]
TableScan [TS_0] (rows=6 width=92)
  
default@person,person,Tbl:COMPLETE,Col:COMPLETE,Output:["id","name"]{noformat}

Observe that the stats for the group by report 3 rows, but given that the ID 
attribute is part of the aggregation the result cannot have fewer than 6 rows.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23479) Avoid regenerating JdbcSchema for every table in a query

2020-05-15 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23479:
--

 Summary: Avoid regenerating JdbcSchema for every table in a query
 Key: HIVE-23479
 URL: https://issues.apache.org/jira/browse/HIVE-23479
 Project: Hive
  Issue Type: Improvement
  Components: Query Planning
Reporter: Stamatis Zampetakis


Currently {{CalcitePlanner}} generates a complete {{JdbcSchema}} for every 
{{JdbcTable}} in the query.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java#L3174

This wastes some resources since every call to {{JdbcSchema#getTable}} needs to 
communicate with the database to bring back the tables belonging to the schema. 
Moreover, the fact that a schema is created during planning is 
counter-intuitive since in principle the schema shouldn't change.  
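
A possible remedy, sketched with a hypothetical cache (neither the field nor the 
key exist in CalcitePlanner today), is to build each {{JdbcSchema}} once per 
data source and reuse it for all its tables:

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;
import org.apache.calcite.adapter.jdbc.JdbcSchema;

/** Hedged sketch: memoize the JdbcSchema per data source so that repeated
 *  JdbcTable lookups in one query do not re-list tables from the database. */
final class JdbcSchemaCache {
  private final Map<String, JdbcSchema> cache = new ConcurrentHashMap<>();

  JdbcSchema schemaFor(String dataSourceKey, Supplier<JdbcSchema> factory) {
    // getTable() on a cached schema avoids another round trip to the database
    return cache.computeIfAbsent(dataSourceKey, k -> factory.get());
  }
}
{code}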



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23456) Upgrade Calcite version to 1.23.0

2020-05-12 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23456:
--

 Summary: Upgrade Calcite version to 1.23.0
 Key: HIVE-23456
 URL: https://issues.apache.org/jira/browse/HIVE-23456
 Project: Hive
  Issue Type: Task
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23453) IntelliJ compile errors in StaticPermanentFunctionChecker and TestVectorGroupByOperator

2020-05-12 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23453:
--

 Summary: IntelliJ compile errors in StaticPermanentFunctionChecker 
and TestVectorGroupByOperator
 Key: HIVE-23453
 URL: https://issues.apache.org/jira/browse/HIVE-23453
 Project: Hive
  Issue Type: Bug
  Components: Hive
 Environment: IntelliJ IDEA 2020.1.1 built 201.7223.91
jdk 1.8.0_251
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


The following errors appear when compiling the code using IntelliJ:

{noformat}
TestVectorGroupByOperator: Error:(89, 32) java: package com.sun.tools.javac.util does not exist
StaticPermanentFunctionChecker: Error:(31, 19) java: package com.sun.jdi does not exist
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)