Re: Hive SQL extension

2020-10-22 Thread Stamatis Zampetakis
Hi Peter,

I am nowhere near being an expert but just wanted to share my thoughts.

If I understand correctly, you would like some syntactic sugar in Hive to
support partitioning as per Iceberg. I cannot tell whether that is really
useful or not, but from my point of view it doesn't seem like a good idea to
introduce another layer of parsing before the actual parser (I don't know if
there is one already). For instance, how are you going to handle the situation
where there are syntax errors in the sugared part, and what should the end
user see?

No matter how it is added, if you give users the possibility to write such
queries, it becomes part of the Hive syntax and, as such, a job for the
parser.

Best,
Stamatis


On Thu, Oct 22, 2020 at 9:49 AM Peter Vary  wrote:

> Hi Hive experts,
>
> I would like to extend Hive SQL language to provide a way to create
> Iceberg partitioned tables like this:
>
> create table iceberg_test(
> level string,
> event_time timestamp,
> message string,
> register_time date,
> telephone array 
> )
> partition by spec(
> level identity,
> event_time identity,
> event_time hour,
> register_time day
> )
> stored as iceberg;
>
>
> The problem is that this syntax is very specific to Iceberg, and I think
> it is not a good idea to change the Hive syntax globally to accommodate a
> specific use-case.
> The following CREATE TABLE statement could achieve the same thing:
>
> create table iceberg_test(
> level string,
> event_time timestamp,
> message string,
> register_time date,
> telephone array 
> )
> STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
> TBLPROPERTIES ('iceberg.mr.table.partition.spec'='...');
>
>
> I am looking for a way to rewrite the original (syntactically invalid for
> Hive) query into a new (syntactically correct) one.
>
> I was checking the hooks as a possible solution, but I have found that:
>
>- HiveDriverRunHook.preDriverRun can get the original / syntactically
>incorrect query, but I have found no way to rewrite it to a syntactically
>correct one (it looks like a read-only query)
>- HiveSemanticAnalyzerHook can rewrite the AST, but it needs a
>syntactically correct query to start with
>
>
> Any other ideas on how to achieve the goals above? Either with hooks, or with
> any other way?
>
> Thanks,
> Peter
>


[jira] [Created] (HIVE-24252) Improve decision model for using semijoin reducers

2020-10-09 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-24252:
--

 Summary: Improve decision model for using semijoin reducers
 Key: HIVE-24252
 URL: https://issues.apache.org/jira/browse/HIVE-24252
 Project: Hive
  Issue Type: Improvement
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


After a few experiments with the TPC-DS 10TB dataset, we observed that in some
cases semijoin reducers were not effective; they didn't reduce the number of
records, or they reduced the relation only a tiny bit.

In some cases we can make the semijoin reducer more effective by adding more
columns, but this also requires a bigger bloom filter, so the decision about
how many columns to include in the bloom filter becomes more delicate.

The current decision model always chooses multi-column semijoin reducers if
they are available, but this may not always be beneficial if a single column
can already reduce the target relation significantly.
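
A hypothetical sketch of such an improved decision (the names and the 0.5
threshold below are assumptions for illustration, not Hive code):

{code:java}
/**
 * Prefer the cheaper single-column semijoin reducer when its best column is
 * already selective enough; use the multi-column reducer (and its bigger
 * bloom filter) only when it buys additional reduction.
 */
static boolean preferMultiColumn(double bestSingleColSelectivity,
                                 double multiColSelectivity) {
  final double goodEnough = 0.5; // assumed: single column keeps <= 50% of rows
  if (bestSingleColSelectivity <= goodEnough) {
    return false;
  }
  return multiColSelectivity < bestSingleColSelectivity;
}
{code}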



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-24251) Improve bloom filter size estimation for multi column semijoin reducers

2020-10-09 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-24251:
--

 Summary: Improve bloom filter size estimation for multi column 
semijoin reducers
 Key: HIVE-24251
 URL: https://issues.apache.org/jira/browse/HIVE-24251
 Project: Hive
  Issue Type: Improvement
  Components: Query Planning
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


There are various cases where the expected size of the bloom filter is largely
underestimated, making the semijoin reducer completely ineffective. This is
more relevant for multi-column semijoin reducers since the current
[code|https://github.com/apache/hive/blob/d61c9160ffa5afbd729887c3db690eccd7ef8238/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFBloomFilter.java#L273]
does not take them into account.
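
For reference, a sketch of the textbook sizing formula (not the Hive code
linked above): the number of bits must grow with the number of expected
entries, which for a multi-column reducer is the number of distinct column
combinations rather than a single column's NDV.

{code:java}
/**
 * Textbook bloom filter sizing: bits m for n expected entries and target
 * false-positive probability p, m = -n * ln(p) / (ln 2)^2. Underestimating
 * n produces a filter that is too small and therefore ineffective.
 */
static long optimalNumOfBits(long expectedEntries, double fpp) {
  return (long) (-expectedEntries * Math.log(fpp) / (Math.log(2) * Math.log(2)));
}
{code}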



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-24221) Use vectorizable expression to combine multiple columns in semijoin bloom filters

2020-10-01 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-24221:
--

 Summary: Use vectorizable expression to combine multiple columns 
in semijoin bloom filters
 Key: HIVE-24221
 URL: https://issues.apache.org/jira/browse/HIVE-24221
 Project: Hive
  Issue Type: Improvement
  Components: Query Planning
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


Currently, multi-column semijoin reducers use an n-ary call to
GenericUDFMurmurHash to combine multiple values into one, which is then used
as the entry to the bloom filter. However, there are no vectorized operators
that handle n-ary inputs. The same goes for the vectorized implementation of
GenericUDFMurmurHash introduced in HIVE-23976.

The goal of this issue is to choose an alternative way of combining multiple
values into one entry for the bloom filter, using only vectorized operators.
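
One possible direction, sketched below purely as an assumption (hash() is a
stand-in, not the actual UDF): fold the n-ary call into a left-deep chain of
binary calls, so each step maps onto a vectorizable binary operator.

{code:java}
// Illustrative: combine n values as hash(hash(hash(c0, c1), c2), ...) with
// binary steps instead of a single n-ary hash call.
static long combine(long[] columnHashes) {
  long h = columnHashes[0];
  for (int i = 1; i < columnHashes.length; i++) {
    h = hash(h, columnHashes[i]); // binary step, vectorizable in isolation
  }
  return h;
}

// Stand-in for a binary murmur-style hash of two values.
static long hash(long a, long b) {
  long x = a * 0x9E3779B97F4A7C15L ^ b;
  return x ^ (x >>> 32);
}
{code}

Note that both the build and probe sides of the semijoin would have to use the
same combination scheme, since a chain of binary hashes does not produce the
same value as the original n-ary hash.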



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-24180) 'hive.txn.heartbeat.threadpool.size' is deprecated in HiveConf with no alternative

2020-09-18 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-24180:
--

 Summary: 'hive.txn.heartbeat.threadpool.size' is deprecated in 
HiveConf with no alternative
 Key: HIVE-24180
 URL: https://issues.apache.org/jira/browse/HIVE-24180
 Project: Hive
  Issue Type: Bug
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


HiveConf.ConfVars#HIVE_TXN_HEARTBEAT_THREADPOOL_SIZE appears as deprecated,
with its javadoc pointing to MetastoreConf.TXN_HEARTBEAT_THREADPOOL_SIZE, but
there is no such configuration variable in MetastoreConf.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-24179) Memory leak in HS2 DbTxnManager when compiling SHOW LOCKS statement

2020-09-18 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-24179:
--

 Summary: Memory leak in HS2 DbTxnManager when compiling SHOW LOCKS 
statement
 Key: HIVE-24179
 URL: https://issues.apache.org/jira/browse/HIVE-24179
 Project: Hive
  Issue Type: Bug
  Components: HiveServer2
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis
 Fix For: 4.0.0
 Attachments: summary.png

The problem can be reproduced by repeatedly executing a SHOW LOCKS statement
and monitoring the heap memory of HS2. For a small heap (e.g., 2g) it only
takes a few minutes before the server crashes with an OutOfMemoryError such as
the one shown below.

{noformat}
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
    at java.lang.StringBuilder.append(StringBuilder.java:136)
    at org.apache.maven.surefire.booter.ForkedChannelEncoder.encodeMessage(ForkedChannelEncoder.j
    at org.apache.maven.surefire.booter.ForkedChannelEncoder.setOutErr(ForkedChannelEncoder.java:
    at org.apache.maven.surefire.booter.ForkedChannelEncoder.stdErr(ForkedChannelEncoder.java:166
    at org.apache.maven.surefire.booter.ForkingRunListener.writeTestOutput(ForkingRunListener.jav
    at org.apache.maven.surefire.report.ConsoleOutputCapture$ForwardingPrintStream.write(ConsoleO
    at org.apache.logging.log4j.core.util.CloseShieldOutputStream.write(CloseShieldOutputStream.j
    at org.apache.logging.log4j.core.appender.OutputStreamManager.writeToDestination(OutputStream
    at org.apache.logging.log4j.core.appender.OutputStreamManager.flushBuffer(OutputStreamManager
    at org.apache.logging.log4j.core.appender.OutputStreamManager.flush(OutputStreamManager.java:
    at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.directEncodeEvent(Abst
    at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.tryAppend(AbstractOutp
    at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.append(AbstractOutputS
    at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:
    at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:12
    at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(Appender
    at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84)
    at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:543)
    at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:502)
    at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:485)
    at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:460)
    at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletio
    at org.apache.logging.log4j.core.Logger.log(Logger.java:162)
    at org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2190)
    at org.apache.logging.log4j.spi.AbstractLogger.logMessageTrackRecursion(AbstractLogger.java:2
    at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2127)
    at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2008)
    at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1867)
    at org.apache.logging.slf4j.Log4jLogger.info(Log4jLogger.java:179)
{noformat}

The heap dump (summary.png) shows that most of the memory is consumed by
{{Hashtable$Entry}} and {{ConcurrentHashMap$Node}} objects coming from Hive
configurations referenced by {{DbTxnManager}}.

The latter are not eligible for garbage collection since at
[construction|https://github.com/apache/hive/blob/975c832b6d069559c5b406a4aa8def3180fe4e75/ql/src/java/org/apache/hadoop/hive/ql/lockmgr/DbTxnManager.java#L212]
time they are implicitly captured by a callback stored inside
ShutdownHookManager.

When the {{DbTxnManager}} is closed properly, the leak does not occur since
the callback is
[removed|https://github.com/apache/hive/blob/975c832b6d069559c5b406a4aa8def3180fe4e75/ql/src/java/org/apache/hadoop/hive/ql/lockmgr/DbTxnManager.java#L882]
from ShutdownHookManager.
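
A minimal sketch of the leak pattern (names illustrative, not the exact Hive
code):

{code:java}
// The callback captures 'conf', keeping it reachable from a GC root until
// the hook is removed again.
HiveConf conf = new HiveConf();
Runnable onShutdown = () -> System.out.println(conf.size()); // captures conf
ShutdownHookManager.addShutdownHook(onShutdown);
// Without a matching ShutdownHookManager.removeShutdownHook(onShutdown) in
// close(), every DbTxnManager pins its captured configuration in memory.
{code}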

{{SHOW LOCKS}} statements create 
([ShowDbLocksAnalyzer|https://github.com/apache/hive/blob/975c832b6d069559c5b406a4aa8def3180fe4e75/ql/src/java/org/apache/hadoop/hive/ql/ddl/table/lock/show/ShowDbLocksAnalyzer.java#L52],
 
[ShowLocksAnalyzer|https://github.com/apache/hive/blob

[jira] [Created] (HIVE-24167) NPE in query 14 while generating plan for sub query predicate

2020-09-15 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-24167:
--

 Summary: NPE in query 14 while generating plan for sub query 
predicate
 Key: HIVE-24167
 URL: https://issues.apache.org/jira/browse/HIVE-24167
 Project: Hive
  Issue Type: Bug
  Components: CBO
Reporter: Stamatis Zampetakis


TPC-DS query 14 (cbo_query14.q and query14.q) fails with an NPE on the metastore
with the partitioned TPC-DS 30TB dataset while generating the plan for a
subquery predicate.

The problem can be reproduced using the PR in HIVE-23965.

The stacktrace shows that the NPE appears while trying to display a debug
message, but even if this line didn't exist it would fail again later on.

{noformat}
java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:10867)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11765)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11635)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlanForSubQueryPredicate(SemanticAnalyzer.java:3375)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFilterPlan(SemanticAnalyzer.java:3473)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:10819)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11765)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11625)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11625)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11635)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genOPTree(SemanticAnalyzer.java:12417)
    at org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:718)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:12519)
    at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:443)
    at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:301)
    at org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:171)
    at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:301)
    at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:220)
    at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:104)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:173)
    at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:414)
    at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:363)
    at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:357)
    at org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:129)
    at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:231)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd1(CliDriver.java:203)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:129)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:424)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:355)
    at org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:740)
    at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:710)
    at org.apache.hadoop.hive.cli.control.CorePerfCliDriver.runTest(CorePerfCliDriver.java:103)
    at org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:157)
    at org.apache.hadoop.hive.cli.TestTezTPCDS30TBPerfCliDriver.testCliDriver(TestTezTPCDS30TBPerfCliDriver.java:83)
{noformat}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-24112) TestMiniLlapLocalCliDriver[dynamic_semijoin_reduction_on_aggcol] is flaky

2020-09-02 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-24112:
--

 Summary: 
TestMiniLlapLocalCliDriver[dynamic_semijoin_reduction_on_aggcol] is flaky
 Key: HIVE-24112
 URL: https://issues.apache.org/jira/browse/HIVE-24112
 Project: Hive
  Issue Type: Bug
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis
 Fix For: 4.0.0


http://ci.hive.apache.org/job/hive-flaky-check/96/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-24104) NPE due to null key columns in ReduceSink after deduplication

2020-09-01 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-24104:
--

 Summary: NPE due to null key columns in ReduceSink after 
deduplication
 Key: HIVE-24104
 URL: https://issues.apache.org/jira/browse/HIVE-24104
 Project: Hive
  Issue Type: Bug
  Components: Physical Optimizer
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


In some cases the {{ReduceSinkDeDuplication}} optimization creates ReduceSink
operators where the key columns are null. This can lead to NPEs in various
places in the code.

The following stacktrace shows an example where an NPE is raised due to the
key columns being null.

{noformat}
java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.plan.ExprNodeDesc$ExprNodeDescEqualityWrapper.equals(ExprNodeDesc.java:141)
    at java.util.AbstractList.equals(AbstractList.java:523)
    at org.apache.hadoop.hive.ql.optimizer.SetReducerParallelism.process(SetReducerParallelism.java:101)
    at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
    at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105)
    at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89)
    at org.apache.hadoop.hive.ql.lib.ForwardWalker.walk(ForwardWalker.java:74)
    at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:120)
    at org.apache.hadoop.hive.ql.parse.TezCompiler.runStatsDependentOptimizations(TezCompiler.java:492)
    at org.apache.hadoop.hive.ql.parse.TezCompiler.optimizeOperatorPlan(TezCompiler.java:226)
    at org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCompiler.java:161)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:12643)
    at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:443)
    at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:301)
    at org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:171)
    at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:301)
    at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:220)
    at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:104)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:173)
    at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:414)
    at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:363)
    at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:357)
    at org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:129)
    at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:231)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd1(CliDriver.java:203)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:129)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:424)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:355)
    at org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:740)
    at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:710)
    at org.apache.hadoop.hive.cli.control.CoreCliDriver.runTest(CoreCliDriver.java:170)
    at org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:157)
    at org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver(TestMiniLlapLocalCliDriver.java:62)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.apache.hadoop.hive.cli.control.CliAdapter$2$1.evaluate(CliAdapter.java:135)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild

[jira] [Created] (HIVE-24031) Infinite planning time on syntactically big queries

2020-08-12 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-24031:
--

 Summary: Infinite planning time on syntactically big queries
 Key: HIVE-24031
 URL: https://issues.apache.org/jira/browse/HIVE-24031
 Project: Hive
  Issue Type: Bug
  Components: Query Planning
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis
 Fix For: 4.0.0


Syntactically big queries (~1 million tokens), such as the query shown below,
lead to extremely long (seemingly infinite) planning times.

{code:sql}
select posexplode(array('item1', 'item2', ..., 'item1M'));
{code}
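
As an aside, a tiny helper of the kind one could use to reproduce this (an
assumption for illustration, not part of the report):

{code:java}
// Generates "select posexplode(array('item1', ..., 'itemN'));" for large N.
static String bigQuery(int n) {
  StringBuilder sb = new StringBuilder("select posexplode(array(");
  for (int i = 1; i <= n; i++) {
    if (i > 1) {
      sb.append(", ");
    }
    sb.append("'item").append(i).append('\'');
  }
  return sb.append("));").toString();
}
{code}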



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-24018) Review necessity of AggregationDesc#setGenericUDAFWritableEvaluator for bloom filter aggregations

2020-08-07 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-24018:
--

 Summary: Review necessity of 
AggregationDesc#setGenericUDAFWritableEvaluator for bloom filter aggregations
 Key: HIVE-24018
 URL: https://issues.apache.org/jira/browse/HIVE-24018
 Project: Hive
  Issue Type: Improvement
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


A few places in the code have the following pattern:
{code:java}
GenericUDAFBloomFilterEvaluator bloomFilterEval = new GenericUDAFBloomFilterEvaluator();
...
AggregationDesc bloom = new AggregationDesc("bloom_filter", bloomFilterEval, p, false, mode);
bloom.setGenericUDAFWritableEvaluator(bloomFilterEval);
{code}
where the bloom filter evaluator is passed both to the constructor of the
aggregation and, directly after, via a setter. The setter call is necessary,
otherwise the query fails at runtime, but the pattern is a bit confusing.

Investigate if there is a way to avoid the double passing of the evaluator. 

To reproduce the failure, remove the setter and run the following test.
{noformat}
mvn test -Dtest=TestMiniLlapLocalCliDriver 
-Dqfile=vectorized_dynamic_semijoin_reduction.q -Dtest.output.overwrite 
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-24016) Share bloom filter construction branch in multi column semijoin reducers

2020-08-07 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-24016:
--

 Summary: Share bloom filter construction branch in multi column 
semijoin reducers
 Key: HIVE-24016
 URL: https://issues.apache.org/jira/browse/HIVE-24016
 Project: Hive
  Issue Type: Improvement
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


In HIVE-21196, we added a transformation capable of merging single-column
semijoin reducers into a multi-column semijoin reducer.

Currently it transforms subplan SB0 into subplan SB1.

+SB0+
{noformat}
  / RS -> TS_1[Editor] 
 / SEL[fname] - GB - RS - GB -  RS -> TS_0[Author] 
 SOURCE 
 \ SEL[lname] - GB - RS - GB -  RS -> TS_0[Author]
  \ RS -> TS_1[Editor]

TS_0[Author] - FIL[in_bloom(fname) ^ in_bloom(lname)]
TS_1[Editor] - FIL[in_bloom(fname) ^ in_bloom(lname)]  
{noformat}

+SB1+
{noformat}
 / SEL[fname,lname] - GB - RS - GB - RS -> TS[Author] - FIL[in_bloom(hash(fname,lname))]
 SOURCE
 \ SEL[fname,lname] - GB - RS - GB - RS -> TS[Editor] - FIL[in_bloom(hash(fname,lname))]
{noformat}

Observe that in SB1 we could share the common path that creates the bloom
filter (SEL - GB - RS - GB) to obtain a plan like SB2.

+SB2+
{noformat}
   / RS -> TS[Author] - FIL[in_bloom(hash(fname,lname))]
 SOURCE - SEL[fname,lname] - GB - RS - GB -
   \ RS -> TS[Editor] - FIL[in_bloom(hash(fname,lname))]
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23999) Unify the code creating single and multi column semijoin reducers

2020-08-06 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23999:
--

 Summary: Unify the code creating single and multi column semijoin 
reducers
 Key: HIVE-23999
 URL: https://issues.apache.org/jira/browse/HIVE-23999
 Project: Hive
  Issue Type: Improvement
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


In HIVE-21196, we added a transformation capable of merging single-column
semijoin reducers into a multi-column semijoin reducer.

The code for creating multi-column semijoin reducers in SemiJoinReductionMerge
shares some similarities with the code creating single-column semijoin
reducers in DynamicPartitionPruningOptimization.

We could possibly refactor the respective parts to unify the creation logic of
semijoin reducers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23976) Enable vectorization for multi-col semi join reducers

2020-08-03 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23976:
--

 Summary: Enable vectorization for multi-col semi join reducers
 Key: HIVE-23976
 URL: https://issues.apache.org/jira/browse/HIVE-23976
 Project: Hive
  Issue Type: Improvement
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


HIVE-21196 introduces multi-column semi-join reducers in the query engine.
However, the implementation relies on GenericUDFMurmurHash, which is not
vectorized, so the respective operators cannot be executed in vectorized mode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Hive TPC-DS metastore dumps in Postgres

2020-07-31 Thread Stamatis Zampetakis
There is now a PR [1] with various improvements over the last update. Feel
free to check it out and let me know what you think.

Best,
Stamatis

[1] https://github.com/apache/hive/pull/1347

On Mon, Jun 22, 2020 at 5:32 PM Stamatis Zampetakis 
wrote:

> Hey guys,
>
> I put up a small project on GitHub [1] with Hive metastore dumps from
> tpcds10tb/tpcds30tb (+partitioning) and some scripts to quickly spin up a
> dockerized Postgres with those loaded.
>
> Personally, I find it useful to check the plans of TPC-DS queries using
> the usual qtest mechanism (without external tools and tapping into a real
> cluster) having at hand beefy stats + partitioning info. The driver and
> other changes needed to run these tests are located in [2].
>
> I am sharing it here in case it might be of use to somebody else.
>
> The two main commands that you will need if you wanna try this out:
> docker build --tag postgres-tpcds-metastore:1.0 .
> mvn test -Dtest=TestTezPerfDBCliDriver -Dtest.output.overwrite=true
> -Dtest.metastore.db=postgres.tpcds
>
> Small caveat: Currently in [2] the dockerized Postgres is restarted for
> every query, which makes things slow. This will be fixed later on.
>
> Best,
> Stamatis
>
> [1] https://github.com/zabetak/hive-postgres-metastore
> [2] https://github.com/zabetak/hive/tree/qtest_postgres_driver
>


[jira] [Created] (HIVE-23965) Improve plan regression tests using TPCDS30TB metastore dump and custom configs

2020-07-31 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23965:
--

 Summary: Improve plan regression tests using TPCDS30TB metastore 
dump and custom configs
 Key: HIVE-23965
 URL: https://issues.apache.org/jira/browse/HIVE-23965
 Project: Hive
  Issue Type: Improvement
Reporter: Stamatis Zampetakis


The existing regression tests (HIVE-12586) based on TPC-DS have certain 
shortcomings:

The table statistics do not reflect cardinalities from a specific TPC-DS scale
factor (SF). Some tables are from a 30TB dataset, others from a 200GB dataset,
and others from a 3GB dataset. This mix leads to plans that may never appear
when using an actual TPC-DS dataset.

The existing statistics do not contain information about partitions, something
that can have a big impact on the resulting plans.

The existing regression tests rely more or less on the default configuration
(hive-site.xml). In real-life scenarios, though, some of the configurations
differ and may impact the choices of the optimizer.

This issue aims to address the above shortcomings by using a curated TPCDS30TB 
metastore dump along with some custom hive configurations. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23964) SemanticException in query 30 while generating logical plan

2020-07-31 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23964:
--

 Summary: SemanticException in query 30 while generating logical 
plan
 Key: HIVE-23964
 URL: https://issues.apache.org/jira/browse/HIVE-23964
 Project: Hive
  Issue Type: Bug
Reporter: Stamatis Zampetakis
 Attachments: cbo_query30_stacktrace.txt

Invalid table alias or column reference 'c_last_review_date' is thrown when
running TPC-DS query 30 (cbo_query30.q, query30.q) on the metastore with the
partitioned TPC-DS 30TB dataset.

The respective stacktrace is attached to this case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23963) UnsupportedOperationException in queries 74 and 84 while applying HiveCardinalityPreservingJoinRule

2020-07-31 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23963:
--

 Summary: UnsupportedOperationException in queries 74 and 84 while 
applying HiveCardinalityPreservingJoinRule
 Key: HIVE-23963
 URL: https://issues.apache.org/jira/browse/HIVE-23963
 Project: Hive
  Issue Type: Bug
  Components: CBO
Reporter: Stamatis Zampetakis
 Attachments: cbo_query74_stacktrace.txt, cbo_query84_stacktrace.txt

The following TPC-DS queries: 
* cbo_query74.q
* cbo_query84.q 
* query74.q 
* query84.q 

fail on the metastore with the partitioned TPC-DS 30TB dataset.

The stacktraces for cbo_query74 and cbo_query84 show that the problem 
originates while applying HiveCardinalityPreservingJoinRule.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23946) Improve control flow and error handling in QTest dataset loading/unloading

2020-07-29 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23946:
--

 Summary: Improve control flow and error handling in QTest dataset 
loading/unloading
 Key: HIVE-23946
 URL: https://issues.apache.org/jira/browse/HIVE-23946
 Project: Hive
  Issue Type: Improvement
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


This issue focuses mainly on the following methods:
[QTestDatasetHandler#initDataset| 
https://github.com/apache/hive/blob/6fbd54c0af60276d49b237defb550938c9c32610/itests/util/src/main/java/org/apache/hadoop/hive/ql/dataset/QTestDatasetHandler.java#L76]
[QTestDatasetHandler#unloadDataset|https://github.com/apache/hive/blob/6fbd54c0af60276d49b237defb550938c9c32610/itests/util/src/main/java/org/apache/hadoop/hive/ql/dataset/QTestDatasetHandler.java#L95]

related to QTest dataset loading and unloading.

The boolean return type in these methods is redundant since they either fail or 
return true (they never return false).

The methods should throw an Exception instead of an AssertionError to indicate
failure. This allows code higher up the stack to perform proper recovery and
to properly report the failure. At the moment, if an AssertionError is raised
from these methods, dependent code (e.g.,
[CoreCliDriver|https://github.com/apache/hive/blob/6fbd54c0af60276d49b237defb550938c9c32610/itests/util/src/main/java/org/apache/hadoop/hive/cli/control/CoreCliDriver.java#L188])
fails to notice that the query has failed.

In case of a failure in loading/unloading, the environment (instance and class
variables) is not properly cleaned up, leading to failures in all subsequent
tests.
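
A sketch of the suggested shape of these methods (hypothetical signature and
helpers, just to illustrate the direction):

{code:java}
// Instead of a boolean that is always true (or an AssertionError), throw an
// exception that code higher up the stack can catch, recover from, and
// report as a proper test failure.
void initDataset(String dataset) throws Exception {
  try {
    runLoadScript(dataset);   // assumed helper
  } catch (Exception e) {
    cleanupEnvironment();     // assumed helper: reset instance/class state
    throw new Exception("Failed to load dataset " + dataset, e);
  }
}
{code}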





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23940) Add TPCH tables (scale factor 0.001) as qt datasets

2020-07-27 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23940:
--

 Summary: Add TPCH tables (scale factor 0.001) as qt datasets
 Key: HIVE-23940
 URL: https://issues.apache.org/jira/browse/HIVE-23940
 Project: Hive
  Issue Type: Improvement
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


Currently there are only two TPCH tables (lineitem, part) in the qt datasets,
and the data do not reflect an actual scale factor.

The TPC-H schema is quite popular, and having all tables is useful for
creating meaningful and understandable queries.

Moreover, keeping the standard proportions yields query plans that remain
meaningful when the scale factor changes, and makes it easier to compare the
correctness of the results against other databases.

The goal of this issue is to add all TPCH tables with their data at scale
factor 0.001 as qt datasets.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23934) Refactor TezCompiler#markSemiJoinForDPP to avoid redundant operations in nested while

2020-07-26 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23934:
--

 Summary: Refactor TezCompiler#markSemiJoinForDPP to avoid 
redundant operations in nested while
 Key: HIVE-23934
 URL: https://issues.apache.org/jira/browse/HIVE-23934
 Project: Hive
  Issue Type: Improvement
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


Most of the code inside the nested while loop can be extracted and computed
only once in the outer loop. Moreover, there are catch clauses for NPEs which
seem rather predictable and could probably be avoided by proper checks.

The goal of this issue is to refactor the TezCompiler#markSemiJoinForDPP
method to avoid redundant operations and improve code readability. As a side
effect of this refactoring the method will be slightly more efficient,
although this is unlikely to make an observable difference in practice.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23781) Incomplete partition column stats in CachedStore may lead to wrong aggregate stats

2020-06-30 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23781:
--

 Summary: Incomplete partition column stats in CachedStore may lead 
to wrong aggregate stats
 Key: HIVE-23781
 URL: https://issues.apache.org/jira/browse/HIVE-23781
 Project: Hive
  Issue Type: Bug
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


Requesting aggregate stats from the Metastore ({{RawStore#get_aggr_stats_for}}) 
may return wrong results when the backing implementation is CachedStore and 
column statistics are missing from the cache.
 
The suspicious code lies inside {{CachedStore#mergeColStatsForPartitions}},
which returns an [empty
object|https://github.com/apache/hive/blob/31ee14644bf6105360d6266baa8c6c8060d38ea3/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/cache/CachedStore.java#L2267]
when no stats are found in the cache. This is considered a valid value by the
consumer, so no additional lookup is performed in the raw store to fetch the
actual values.

Moreover, in the case where the cache holds values for some but not all of the
requested partitions, the result will be wrong, assuming that the underlying
raw store has information about the requested partitions.
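
A sketch of the kind of fix implied above (names assumed for illustration):
the merge logic should distinguish a cache miss from genuinely empty stats, so
the consumer knows when to fall back to the raw store.

{code:java}
// Hypothetical: return null on a cache miss instead of an empty stats
// object, so the caller consults the raw store for the actual values.
List<ColumnStatisticsObj> getMergedColStats(String partKey) {
  List<ColumnStatisticsObj> cached = cache.get(partKey);
  if (cached == null) {
    return rawStoreLookup(partKey); // assumed helper: fetch from the raw store
  }
  return cached;
}
{code}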



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23768) Metastore's update service wrongly strips partition column stats from the cache

2020-06-26 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23768:
--

 Summary: Metastore's update service wrongly strips partition 
column stats from the cache
 Key: HIVE-23768
 URL: https://issues.apache.org/jira/browse/HIVE-23768
 Project: Hive
  Issue Type: Bug
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


The Metastore's update service wrongly strips partition column stats from the
cache in an attempt to update them. The issue may go unnoticed since missing
stats do not lead to query failures.

However, they can significantly alter the query plan, affecting performance.
Moreover, they lead to flakiness: sometimes the stats are present and
sometimes they are not, so the same query can have a different plan over time.

Normally, missing elements in the cache shouldn't be a correctness problem
since we can always fall back to the raw stats. Unfortunately, there are many
interconnections with other parts of the code (e.g., the code to obtain
aggregate statistics) where this contract breaks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: 【Why the NULL will be filterd in HQL】

2020-06-26 Thread Stamatis Zampetakis
Hello,

I think it would be easier to understand the problem if you have a query at
hand:

SELECT id FROM author WHERE fname != 'Victor'

with the author table containing the following rows (id | fname):

0 | Victor
1 | null
2 | Alex

The query should return only 2 in every standard-compliant SQL database:

Victor != Victor evaluates to FALSE
null != Victor evaluates to UNKNOWN
Alex != Victor evaluates to TRUE

The WHERE clause removes tuples for which the condition evaluates to FALSE
or UNKNOWN, and this is the normal behavior.
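
If the intent is to also get back the rows where fname is null, the predicate
has to say so explicitly; for example (standard workaround, using the table
above):

SELECT id FROM author WHERE fname != 'Victor' OR fname IS NULL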

Best,
Stamatis


On Fri, Jun 26, 2020 at 2:36 AM 忝忝向仧 <153488...@qq.com> wrote:

> Hi,all:
>
>
> I want to know why Hive needs to filter out NULL when using '<' or
> '!=' in HQL?
> Normally, in Oracle or other databases the NULL will not be filtered when
> using '<' or '!='.
> Could this be a JIRA?
> Thanks!


[jira] [Created] (HIVE-23742) Remove unintentional execution of TPC-DS query39 in qtests

2020-06-22 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23742:
--

 Summary: Remove unintentional execution of TPC-DS query39 in qtests
 Key: HIVE-23742
 URL: https://issues.apache.org/jira/browse/HIVE-23742
 Project: Hive
  Issue Type: Task
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


TPC-DS queries under clientpositive/perf are meant only to check plan
regressions, so they should never really be executed; thus the execution part
should be removed from query39.q and cbo_query39.q.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Hive TPC-DS metastore dumps in Postgres

2020-06-22 Thread Stamatis Zampetakis
Hey guys,

I put up a small project on GitHub [1] with Hive metastore dumps from
tpcds10tb/tpcds30tb (+partitioning) and some scripts to quickly spin up a
dockerized Postgres with those loaded.

Personally, I find it useful to check the plans of TPC-DS queries using the
usual qtest mechanism (without external tools and tapping into a real
cluster) having at hand beefy stats + partitioning info. The driver and
other changes needed to run these tests are located in [2].

I am sharing it here in case it might be of use to somebody else.

The two main commands that you will need if you wanna try this out:
docker build --tag postgres-tpcds-metastore:1.0 .
mvn test -Dtest=TestTezPerfDBCliDriver -Dtest.output.overwrite=true
-Dtest.metastore.db=postgres.tpcds

Small caveat: Currently in [2] the dockerized Postgres is restarted for
every query, which makes things slow. This will be fixed later on.

Best,
Stamatis

[1] https://github.com/zabetak/hive-postgres-metastore
[2] https://github.com/zabetak/hive/tree/qtest_postgres_driver


Re: HIVE building on ARM

2020-06-18 Thread Stamatis Zampetakis
Hello Chinna,

The hudson-jobadmin privilege can be granted by PMC chairs.
I don't know if there is any particular policy in Hive on who should have
this privilege so I guess you should request it from Ashutosh.

Best,
Stamatis

On Thu, Jun 18, 2020 at 12:05 PM Zoltan Haindrich  wrote:

> Hey Chinna!
>
> On 6/18/20 11:43 AM, Chinna Rao Lalam wrote:
> > As you said, migrating this job to the new ci-hadoop instance looks good
> as
> > Hadoop also shares the same armN slaves.
>
> Sounds great!
>
> I am able to log in to the new ci-hadoop instance with Apache LDAP
> credentials, but I am not able to see the job creation option. Should I
> request access, or is the process for creating a job different than in
> Jenkins? Please guide me to create the new job in the ci-hadoop instance.
> I will migrate this job after connecting the armN slaves to the new system.
>
>
> I've also logged in - and apparently I have create-job rights; I'm happy to
> help, but the best would be for you to self-service yourself :)
> I think you may be missing the "hudson-jobadmin" privilege.
> Probably Gavin (or someone on the infra team) could help you with that..
> to talk to them quickly - you can reach them on the #asfinfra channel (on
> the asf-slack).
>
> The migration effort is coordinated thru the hadoop-migrations mailing
> list (I've cc-ed that list)
> you may want to subscribe to it by sending a mail to:
> hadoop-migrations-subscr...@infra.apache.org
>
> cheers,
> Zoltan
>
>
>
> >
> > Thanks
> > Chinna
> >
> > On Wed, Jun 17, 2020 at 11:57 AM Zhenyu Zheng  >
> > wrote:
> >
> >> Hi Zoltan,
> >>
> >> Thanks a lot for the information, so it looks like one possible solution
> >> is, as you suggest, to move the current ARM2 and ARM3 (those two were
> >> donated to builds.apache.org by us) to the new ci-hadoop cluster and set
> >> up the jobs just as has been done in the current Jenkins.
> >>
> >> I will also ask our team members who work on other projects to find out
> >> what the status of other projects is.
> >>
> >> BR,
> >>
> >> On Tue, Jun 16, 2020 at 6:41 PM Zoltan Haindrich  wrote:
> >>
> >>> Hey,
> >>>
> >>> There is an effort by the Apache Infra to change the way Jenkins stuff
> is
> >>> organized; a couple months ago Gavin wrote an email about it:
> >>>
> >>>
> http://mail-archives.apache.org/mod_mbox/tez-dev/202004.mbox/%3ccan0gg1dodepzatjz9bofe-2ver7qg7h0hmvyjmsldgjr8_r...@mail.gmail.com%3E
> >>> The resources for running these jobs are coming from the H0~H21 slaves
> >>> which will be migrated to the new jenkins master eventually.
> >>>
>   >> So please
>   >> suggest which direction we can move in, and can you share some
>   >> details about the new ci-hadoop instance.
> >>>
> >>> Since Hadoop testing is also happening on ARM - I think the best would
> be
> >>> to also migrate the armN slaves and the Hive arm nightly over to the
> new
> >>> ci-hadoop instance.
> >>>
> >>> On 6/16/20 8:40 AM, Zhenyu Zheng wrote:
 Thanks for the info, I wonder where the resources of ci-hadoop and
 hive-test-kube come from? Do they include ARM resources?
> >>>
> >>> Interesting question; the resources for Hive testing are donated by
> >>> Cloudera.
> >>> About the ARM workers I think Chinna could provide more details.
> >>> ...I've no idea who sponsors the Hxx slaves
> >>>
 Can you provide some more information about how the new hive-test-kube is
 running?
> >>> It's basically a Jenkins instance which is using kubernetes pods to run
> >>> things.
> >>> The whole thing is running on a GKE cluster.
> >>> While I was working on it I collected stuff needed for it in this repo:
> >>> https://github.com/kgyrtkirk/hive-test-kube/
> >>> it should be possible to start a new deployment using that stuff
> >>>
> >>> cheers,
> >>> Zoltan
> >>>
> 
>  BR,
>  Kevin Zheng
> 
>  On Tue, Jun 16, 2020 at 12:41 PM Chinna Rao Lalam <
>  lalamchinnara...@gmail.com> wrote:
> 
> > Hi Zoltan,
> >
> > Thanks for the update.
> >
> > The current https://builds.apache.org/job/Hive-linux-ARM-trunk/ job is
> > targeting to run Hive tests daily on "arm" slaves; it is using 2 arm
> > slaves, to find any potential issues with "arm" and fix them. So please
> > suggest which direction we can move in, and can you share some details
> > about the new ci-hadoop instance.
> >
> > Thanks,
> > Chinna
> >
> > On Mon, Jun 15, 2020 at 3:56 PM Zoltan Haindrich 
> wrote:
> >
> >> Hey all,
> >>
> >> In a ticket (INFRA-20416) Gavin asked me if we are completely off
> >> builds.apache.org - when I went over the jobs I saw that
> >> https://builds.apache.org/job/Hive-linux-ARM-trunk/ is running
> there
> >> once a day.
> >>
> >> Since builds.apache.org will be shut down sometime in the future - we
> >> should move this job to the new ci-hadoop instance or to

[jira] [Created] (HIVE-23684) Large underestimation in NDV stats when input and join cardinality ratio is big

2020-06-12 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23684:
--

 Summary: Large underestimation in NDV stats when input and join 
cardinality ratio is big
 Key: HIVE-23684
 URL: https://issues.apache.org/jira/browse/HIVE-23684
 Project: Hive
  Issue Type: Bug
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


Large underestimations of NDV values may occur after a join operation since the 
current logic will decrease the original NDV values proportionally.

The
[code|https://github.com/apache/hive/blob/1271d08a3c51c021fa710449f8748b8cdb12b70f/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L2558]
compares the number of rows of each relation before the join with the number
of rows after the join and extracts a ratio for each side. Based on this
ratio, it adapts (reduces) the NDVs accordingly.

Consider for instance the following query:
{code:sql}
select inv_warehouse_sk
 , inv_item_sk
 , stddev_samp(inv_quantity_on_hand) stdev
 , avg(inv_quantity_on_hand) mean
from inventory
   , date_dim
where inv_date_sk = d_date_sk
  and d_year = 1999
  and d_moy = 2
group by inv_warehouse_sk, inv_item_sk;
{code}
For the sake of the discussion, I outline below some relevant stats (from 
TPCDS30tb):
 T(inventory) = 1627857000
 T(date_dim) = 73049
 T(inventory JOIN date_dim[d_year=1999 AND d_moy=2]) = 24948000
 V(inventory, inv_date_sk) = 261
 V(inventory, inv_item_sk) = 42
 V(inventory, inv_warehouse_sk) = 27
 V(date_dim, d_date_sk) = 73049

For instance, in this query the join between inventory and date_dim has ~24M
rows while inventory has ~1.5B, so the NDVs of the columns coming from
inventory are reduced by a factor of ~100, and we end up with
V(JOIN, inv_item_sk) = ~6K while the real one is 231000.
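
A sketch of the proportional scaling described above (illustrative, not the
exact Hive code), which shows why the estimate collapses when the row ratio is
small even though the column keeps most of its distinct values:

{code:java}
// ndv' = ndv * (joinRows / inputRows): with joinRows ~24.9M and inputRows
// ~1.6B, every NDV from inventory shrinks by the same factor, regardless of
// how many distinct values actually survive the join.
static long scaleNdv(long ndv, long inputRows, long joinRows) {
  double ratio = Math.min(1.0, joinRows / (double) inputRows);
  return Math.max(1L, (long) Math.ceil(ndv * ratio));
}
{code}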



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Replace ptest with hive-test-kube

2020-06-05 Thread Stamatis Zampetakis
+1 for failing fast, starting with findbugs and eventually covering the
important points from checkstyle.

Best,
Stamatis

On Fri, Jun 5, 2020 at 9:35 AM Zoltan Haindrich  wrote:

>
> Hey Mustafa!
>
> Those checks are not executed anymore in the new system. I always found it
> a bit confusing to have a comment which reports checkstyle/findbugs/etc
> issues while getting a green test run was almost impossible due to the high
> number of randomly failing tests.
> So I don't think it's viable that someone will re-submit the patch with
> style changes.
>
> I think the old approach is a "soft" way of enforcing code quality - in
> which I personally don't believe: code quality should be enforced by
> rules/quality gates/etc.
>
> So I would like to take a different approach...I think we should definitely
> re-introduce these checks - however without "tolerance" being built-in.
> This will most likely mean that we will have to soften the ruleset first;
> but then we may gradually raise the bar to a higher level.
>
> The "without tolerance" will mean that this will be checked during (or
> right after) the build phase - so if you make quality mistakes you will
> just not get test results
> (it will also save resources).
>
> Yesterday Laszlo opened a ticket about fixing findbugs issues - if we do
> fix those issues but never enforce failing the build, someone might just
> add a few more.
>
> To increase code quality throughout the project I think we could take a
> bottom-up approach:
> * first patch:
>* fix things in a low-level module (like common or storage-api)
>* it should also add the necessary maven changes to enforce things
> during precommit (up to that module)
> * followups:
>* raise the bar to higher level modules
>
> Obviously we can't do this for something like checkstyle which detects a
> myriad of small issues:
> * the ruleset should be shrunk to something which needs a reasonable
> amount of work to start enforcing
> * later we can enable further rules/fix all of them in the project
>
> What do you think about this?
>
> cheers,
> Zoltan
>
>
>
> On 6/5/20 2:47 AM, Mustafa IMAN wrote:
> > Thank you Zoltan for all this work.
> > I see many PRs are merged based on the new workflow already. The old
> > workflow generates many reports like ASF license/findbugs/checkstyle
> etc. I
> > don't see these in the new Github PR workflow. I am concerned the
> codebase
> > is going to suffer from lack of these reports very quickly. Do these
> checks
> > still happen but are not visible?
> >
> > On Tue, Jun 2, 2020 at 4:41 AM Zoltan Haindrich  wrote:
> >
> >> Hello,
> >>
> >> I would like to note that you may login to the jenkins instance - to
> >> start/kill builds (or create new jobs).
> >> I've configured github oauth - but since team membership can't be
> queried
> >> from the "apache organization" - it's harder to configure all "hive
> >> committers".
> >> However...I think I've made it available for most of us - if you can't
> >> start builds/etc just let me know your github user and I'll add it.
> >>
> >> cheers,
> >> Zoltan
> >>
> >>
> >>
> >> On 5/29/20 2:32 PM, Zoltan Haindrich wrote:
> >>> Hey all!
> >>>
> >>> The patch is now in master - so every new PR or a push on it will
> >> trigger a new run.
> >>>
> >>> Please decide which one would you like to use - open a PR to see the
> new
> >> one work...or upload a patch file to the jira - but please don't do
> both;
> >> because in that case 2
> >>> execution will happen.
> >>>
> >>> The job execution time (2-4 hours) of a single run is a bit higher than
> >> the usual on the ptest server - this is mostly to increase throughput.
> >>>
> >>> The patch also disabled a set of tests; I will send the full list of
> >> skipped tests shortly.
> >>>
> >>> cheers,
> >>> Zoltan
> >>>
> >>>
> >>> On 5/27/20 1:50 PM, Zoltan Haindrich wrote:
>  Hello all!
> 
>  The new stuff is ready to be switched on-to. It needs to be merged
> into
> >> master - and after that anyone who opens a PR will get a run by the new
> >> HiveQA infra.
>  I propose to run the 2 systems side-by-side for some time - the
> regular
> >> master builds will start; and we will see how frequently that is
> polluted
> >> by flaky tests.
> 
>  Note that the current patch also disables around ~25 more tests to
> >> increase stability - to get a better overview about the disabled tests I
> >> think the "direction of the
>  information flow" should be altered; what I mean by that is: instead
> of
> >> just throwing in a jira for "disable test x" and opening a new one like
> >> "fix test x"; only open
>  the latter and place the jira reference into the ignore message;
> >> meanwhile also add a regular report about the actually disabled tests -
> so
> >> people who do know about the
>  importance of a particular test can get involved.
> 
>  Note: the builds.apache.org instance will be shutdown somewhere in
> the
> >> future as 

Re: [DISCUSS] Disable ptest job

2020-06-05 Thread Stamatis Zampetakis
Hi Zoltan,

The sooner we move away from the old system the better. It will also help us
detect and solve any problems with the new approach faster if more people are
using it.

Also it will be cool to have junit5 :D

Best,
Stamatis


On Fri, Jun 5, 2020 at 10:44 AM Zoltan Haindrich  wrote:

> Hey all!
>
> So far I've seen only 1 issue with the new system: there were 2 occurrences
> in the last week when a build was affected by some kind of kubernetes issue
> which took down an executor - the logs pointed to some kind of kubelet
> issue; since the GKE master had been upgraded to 1.16 while the node pools
> were still running 1.15, that could have been the cause of it. Yesterday
> I upgraded all the node-pools.
>
> Because I've seen that people now sometimes open both a PR and upload a
> patch to the jira as well; I would like to propose to disable the PTest job
> on builds.apache.org on
> Monday.
>
> Note: This will also unblock to apply the junit5 patch - and could open up
> the possibility to sometimes exclude a set of tests from execution - the
> total test execution
> time is around 24 hours - from which 8 hours is spent running replication
> tests. Since most changes will touch replication stuff they could be made
> optional.
>
> cheers,
> Zoltan
>
>


Re: Open old PRs

2020-06-02 Thread Stamatis Zampetakis
Hello,

I am very happy working with the new system. Many thanks Zoltan!

I find the bot a good idea and I think it's worth trying out.
One thing to watch out for is the case where contributors are willing to push
their work forward but there are no reviewers available to look at each case.
I think people will reply to the bot once or twice, but I don't think they
will do it for much longer, so we could take this into account in the
configuration of the bot.

Regarding the squash-merge option, there might be a small caveat. I don't know
if it is possible to retain the information about the person who performed
the merge.
According to the discussion in [1], it seems that the committer in this case
will appear to be the GitHub account.
This might not be a big problem for Hive since the reviewer's name is part
of the commit message, so the credit and responsibility are not lost.

Best,
Stamatis

[1] https://github.com/isaacs/github/issues/1303



On Tue, Jun 2, 2020 at 9:26 PM Zoltan Haindrich  wrote:

>
>
> On 6/2/20 9:15 PM, David Mollitor wrote:
> > I use a personal account for GitHub and it's not synced with my official
> > Apache account.  How do I go about registering my Apache account with
> > GitHub so I can merge through their interface?
>
> IIRC I've linked my account by using this interface:
> https://gitbox.apache.org/setup/
>
> >
> > In the meanwhile, can you assist with a merge here? :)
> >
>
> sure; I think you should also add dmolli...@apache.org as a secondary
> email to your github account
>
> About the open pr stuff: I still think our best approach of handling those
> things would be to close most of that 400 or so PRs...easiest would be to
> install the bot (at
> least temporarily)
> https://issues.apache.org/jira/browse/HIVE-23590
> what do you think?
>
> cheers,
> Zoltan
>
>
> > https://github.com/apache/hive/pull/1045
> >
> > Thanks!
> >
> > On Tue, Jun 2, 2020 at 10:21 AM Zoltan Haindrich  wrote:
> >
> >>
> >>
> >> On 6/2/20 3:10 PM, David Mollitor wrote:
> >>> I think we might want to take one manual pass across the board.  It
> will
> >>> most likely take more than 7 days to get through them all, so it may be
> >>> closing things that are legitimate.
> >>
> >> yeah...a manual pass would be good; I went thru around 10 or so before
> >> I wrote the first mail in this thread...
> >> and I definitely don't want to go thru 400 - so I would prefer the bot
> :D
> >>
> >>>
> >>> One low hanging fruit (that applied to one of my PRs).  The JIRA it was
> >>> associated with was already closed.  Is there a way to target those?
> >>
> >> yes; there might certainly be a lot of those...(that's why I've estimated
> >> 1/3 to be applicable)
> >> but filtering out even this is an awful lot of work (or it might involve
> >> writing a "bot")...
> >> if it's important enough the contributor could reopen / rebase the
> patch.
> >> We could try to communicate the non-hostile intention in the message
> >> placed by the bot.
> >> The current message is the stale PRs would get is:
> >> "This pull request has been automatically marked as stale because it has
> >> not had recent activity. It will be closed if no further activity
> occurs."
> >>
> >>> Also, I have submitted my first PR to test out the new system.  It
> >>> has passed tests.  Ashutoshc has generously provided a +1.  What's the
> >>> next step to get it merged into the master?  Do I download the patch
> from
> >>> Github and apply manually using my Apache credentials?  Is the "merge"
> >>> feature setup in Github?  As I understand it, GitHub is only mirroring
> >> the
> >>> Apache git system.  Whatever the process we need an update in the
> >>> HowToContribute docs.
> >>
> >> That's an interesting question; the github repo is linked to the apache
> >> repo - so you may push/merge/whatever on the github interface; it will
> work.
> >> Github supports 3 modes to merge PRs:
> >> * We should definitely disable the "merge" option as that will just
> >> create an international railway station from our history :)
> >> * rebase doesn't make it easier for reviewers to keep track of new
> >> changes...because the PR owner has to continuously force-push the branch
> >> * squash merge works great - and I remembered that it changes the author
> >> to the user pushing the "squash" button; however right now it seems that
> >> it changes the author to the "user who opened the pr" which looks
> >> good enough for me!
> >> (I've added the neccessary .asf.yaml changes to the existing PR)
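> >> For reference, the stanza should look roughly like this (going from
> >> memory of the INFRA docs - worth double-checking the exact keys):
> >>
> >> github:
> >>   enabled_merge_buttons:
> >>     squash: true
> >>     merge: false
> >>     rebase: false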
> >>
> >> cheers,
> >> Zoltan
> >>
> >>> https://github.com/apache/hive/pull/1045
> >>>
> >>
> https://cwiki.apache.org/confluence/display/Hive/HowToContribute#HowToContribute-ApplyingaPatch
> >>>
> >>>
> >>> Thanks!
> >>>
> >>> On Tue, Jun 2, 2020 at 4:58 AM Zoltan Haindrich  wrote:
> >>>
>  I think to use "probot" we would need to ask Infra to configure the
>  "probot" GitHub app.
>  It seems to me that the stale plugin from GitHub Actions provides almost
>  the same 

[jira] [Created] (HIVE-23534) NPE in RetryingMetaStoreClient#invoke when catching MetaException with no message

2020-05-22 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23534:
--

 Summary: NPE in RetryingMetaStoreClient#invoke when catching 
MetaException with no message
 Key: HIVE-23534
 URL: https://issues.apache.org/jira/browse/HIVE-23534
 Project: Hive
  Issue Type: Bug
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


The RetryingMetaStoreClient#invoke method catches MetaException and attempts 
to classify it by checking the exception message. However, there are cases 
(e.g., various places in 
[ObjectStore|https://github.com/apache/hive/blob/716f1f9a945a9a11e6702754667660d27e0a5cf4/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L3916])
 where the message of the MetaException is null, and this leads to an NPE.
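
A minimal sketch of the pitfall, assuming a message-based classifier along the 
lines described above; the method name and the matched string are illustrative, 
not Hive's actual logic:

{code:java}
import org.apache.hadoop.hive.metastore.api.MetaException;

public class MetaExceptionClassifier {
  // getMessage() can return null when the MetaException was created without
  // a message; calling msg.contains(...) unguarded throws the NPE.
  static boolean looksTransient(MetaException e) {
    String msg = e.getMessage(); // may be null
    return msg != null && msg.contains("Connection refused");
  }
}
{code}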



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23532) NPE when fetching incomplete column statistics from the metastore

2020-05-22 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23532:
--

 Summary: NPE when fetching incomplete column statistics from the 
metastore
 Key: HIVE-23532
 URL: https://issues.apache.org/jira/browse/HIVE-23532
 Project: Hive
  Issue Type: Bug
Reporter: Stamatis Zampetakis


Certain operations may store incomplete column statistics in the metastore. 
Fetching those statistics back from the metastore leads to a 
{{NullPointerException}}.

For instance, consider a column "name" of type string. If we have statistics 
for this column then the following info must be available:
* maxColLen
* avgColLen
* numNulls
* numDVs

Executing the following statement on a table with no stats updates a subset of 
the statistics for this column:

{code:sql}
ALTER TABLE example UPDATE STATISTICS for column name SET ('numDVs'='242', 
'numNulls'='5');
{code}

Fetching this kind of statistics leads to an NPE that sometimes pops up in the 
client and other times is buried in the logs, leading to incomplete column 
stats during optimization and execution of a query.
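
A defensive sketch of the kind of guard a reader of these statistics needs. The 
Thrift-generated metastore objects track which fields were set; the helper 
below is illustrative rather than Hive's actual code:

{code:java}
import org.apache.hadoop.hive.metastore.api.StringColumnStatsData;

public class PartialStatsGuard {
  // Only trust a field that was actually populated; stats written via
  // ALTER TABLE ... UPDATE STATISTICS may carry a partial object.
  static long numDVsOrDefault(StringColumnStatsData stats, long fallback) {
    return (stats != null && stats.isSetNumDVs()) ? stats.getNumDVs() : fallback;
  }
}
{code}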

Usually the stacktrace is similar to the one below:
{noformat}
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.metadata.Hive.getTableColumnStatistics(Hive.java:5251)
at 
org.apache.hadoop.hive.ql.ddl.table.info.desc.DescTableOperation.getColumnDataColPathSpecified(DescTableOperation.java:216)
at 
org.apache.hadoop.hive.ql.ddl.table.info.desc.DescTableOperation.execute(DescTableOperation.java:94)
at org.apache.hadoop.hive.ql.ddl.DDLTask.execute(DDLTask.java:80)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213)
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105)
at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:362)
at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:335)
at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:246)
at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:109)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:723)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:492)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:486)
at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:164)
at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:230)
at 
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:256)
at org.apache.hadoop.hive.cli.CliDriver.processCmd1(CliDriver.java:201)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:127)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:422)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:353)
at 
org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:730)
at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:700)
at 
org.apache.hadoop.hive.cli.control.CoreCliDriver.runTest(CoreCliDriver.java:170)
at 
org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:157)
at 
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver(TestMiniLlapLocalCliDriver.java:62)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.apache.hadoop.hive.cli.control.CliAdapter$2$1.evaluate(CliAdapter.java:135)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunne

[jira] [Created] (HIVE-23485) Bound GroupByOperator stats using largest NDV among columns

2020-05-17 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23485:
--

 Summary: Bound GroupByOperator stats using largest NDV among 
columns
 Key: HIVE-23485
 URL: https://issues.apache.org/jira/browse/HIVE-23485
 Project: Hive
  Issue Type: Improvement
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


Consider the following SQL query:

{code:sql}
select id, name from person group by id, name;
{code}

and assume that the person table contains the following tuples:

{code:sql}
insert into person values (0, 'A') ;
insert into person values (1, 'A') ;
insert into person values (2, 'B') ;
insert into person values (3, 'B') ;
insert into person values (4, 'B') ;
insert into person values (5, 'C') ;
{code}

If we know the number of distinct values (NDV) for all columns in the group by 
clause then we can infer a lower bound for the total number of rows by taking 
the maximum NDV of the involved columns. 

Currently the query in the scenario above has the following plan:

{noformat}
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)

Stage-0
  Fetch Operator
limit:-1
Stage-1
  Reducer 2 vectorized
  File Output Operator [FS_11]
Group By Operator [GBY_10] (rows=3 width=92)
  Output:["_col0","_col1"],keys:KEY._col0, KEY._col1
<-Map 1 [SIMPLE_EDGE] vectorized
  SHUFFLE [RS_9]
PartitionCols:_col0, _col1
Group By Operator [GBY_8] (rows=3 width=92)
  Output:["_col0","_col1"],keys:id, name
  Select Operator [SEL_7] (rows=6 width=92)
Output:["id","name"]
TableScan [TS_0] (rows=6 width=92)
  
default@person,person,Tbl:COMPLETE,Col:COMPLETE,Output:["id","name"]{noformat}

Observe that the stats for the group by report 3 rows, but given that the ID 
attribute is part of the aggregation, the result cannot have fewer than 6 rows.
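
A minimal sketch of the proposed bound (not Hive's actual estimator): the 
group-by output has at least as many rows as the largest NDV among the grouping 
columns, and at most as many rows as its input.

{code:java}
public class GroupByBound {
  static long boundGroupByRows(long currentEstimate, long inputRows,
      long[] ndvPerGroupingColumn) {
    long maxNdv = 0;
    for (long ndv : ndvPerGroupingColumn) {
      maxNdv = Math.max(maxNdv, ndv);
    }
    // For the example above: inputRows=6, ndv(id)=6, ndv(name)=3,
    // currentEstimate=3 -> bound is 6.
    return Math.min(inputRows, Math.max(currentEstimate, maxNdv));
  }
}
{code}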




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23479) Avoid regenerating JdbcSchema for every table in a query

2020-05-15 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23479:
--

 Summary: Avoid regenerating JdbcSchema for every table in a query
 Key: HIVE-23479
 URL: https://issues.apache.org/jira/browse/HIVE-23479
 Project: Hive
  Issue Type: Improvement
  Components: Query Planning
Reporter: Stamatis Zampetakis


Currently {{CalcitePlanner}} generates a complete {{JdbcSchema}} for every 
{{JdbcTable}} in the query.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java#L3174

This wastes resources, since every call to {{JdbcSchema#getTable}} needs to 
communicate with the database to bring back the tables belonging to the schema. 
Moreover, the fact that a schema is created during planning is 
counter-intuitive, since in principle the schema shouldn't change.
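
A hypothetical memoization sketch of the improvement: build each schema at most 
once per planning session instead of once per table. The generic parameter 
stands in for Calcite's {{JdbcSchema}}; the cache key and factory are 
illustrative.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class SchemaCache<S> {
  private final Map<String, S> schemasByName = new ConcurrentHashMap<>();

  // Reuse an already-built schema; only the first request per schema name
  // pays the cost of contacting the database.
  public S getOrCreate(String schemaName, Function<String, S> factory) {
    return schemasByName.computeIfAbsent(schemaName, factory);
  }
}
{code}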



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23456) Upgrade Calcite version to 1.23.0

2020-05-12 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23456:
--

 Summary: Upgrade Calcite version to 1.23.0
 Key: HIVE-23456
 URL: https://issues.apache.org/jira/browse/HIVE-23456
 Project: Hive
  Issue Type: Task
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23453) IntelliJ compile errors in StaticPermanentFunctionChecker and TestVectorGroupByOperator

2020-05-12 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23453:
--

 Summary: IntelliJ compile errors in StaticPermanentFunctionChecker 
and TestVectorGroupByOperator
 Key: HIVE-23453
 URL: https://issues.apache.org/jira/browse/HIVE-23453
 Project: Hive
  Issue Type: Bug
  Components: Hive
 Environment: IntelliJ IDEA 2020.1.1 built 201.7223.91
jdk 1.8.0_251
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


The following errors appear when compiling the code using IntelliJ:

TestVectorGroupByOperator: Error:(89, 32) java: package 
com.sun.tools.javac.util does not exist

StaticPermanentFunctionChecker: Error:(31, 19) java: package com.sun.jdi does 
not exist



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: java.lang.IllegalAccessError:tried to access method com.goole.common.collect.Iterators.empty()Lcom/google/commmon/collect/UnmodifiableIterator;from class org.apache.hadoop.hive.ql.exec.FetchOperat

2020-05-09 Thread Stamatis Zampetakis
Hello,

According to the release page [1], Hive 2.3.3 works with Hadoop 2.x.y (not
3.x.y), so if you want to run with Hadoop 3.2.1 try a newer Hive version.

Other than that, the error looks like a classpath problem related to Guava.
I guess you have one Guava version coming from Hive and another version
coming from Hadoop. Try removing one of them. For instance:
> cd apache-hive-2.3.3-bin/lib
> rm guava*

Even if you solve the problem above you will most likely bump into another,
so it is better to choose compatible versions.
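
If you want to confirm which jar a conflicting class is actually loaded from,
a small diagnostic along these lines (plain JDK API, nothing Hive-specific)
can help:

import com.google.common.collect.Iterators;

public class WhichJar {
  public static void main(String[] args) {
    // Prints the location (jar) the Guava Iterators class was loaded from.
    System.out.println(Iterators.class.getProtectionDomain()
        .getCodeSource().getLocation());
  }
}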

Best,
Stamatis

[1] https://hive.apache.org/downloads.html

On Sat, May 9, 2020 at 6:40 AM qq <987626...@qq.com> wrote:

> Hello:
>   The version of Hive is 2.3.3, and the version of Hadoop is 3.2.1.
> When I execute show tables on the command line of Hive Beeline, the
> following error occurs:
> Error message in picture
> How can I solve it?
>
>  Thanks.
>  I am looking forward to your reply!
>
>