[jira] [Created] (HIVE-23941) Refactor TypeCheckProcFactory to be database agnostic

2020-07-27 Thread Steve Carlin (Jira)
Steve Carlin created HIVE-23941:
---

 Summary: Refactor TypeCheckProcFactory to be database agnostic
 Key: HIVE-23941
 URL: https://issues.apache.org/jira/browse/HIVE-23941
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Reporter: Steve Carlin


Part of the code has already been refactored to be database agnostic (e.g. 
HiveFunctionHelper).  



Further refactoring is needed on TypeCheckProcFactory, which should also be 
database agnostic.
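As a rough sketch of what such a refactor might look like (the interface and method names below are illustrative, not Hive's actual HiveFunctionHelper API): the engine-specific lookups that TypeCheckProcFactory performs can sit behind a small interface, so the type-checking code never calls a database-specific registry directly.

```java
// Hypothetical sketch of the "database agnostic" split, with made-up names.
// The type-checking code depends only on the interface; each engine supplies
// its own implementation of the lookup rules.
interface FunctionHelper {
    // Canonicalize a function name according to the engine's rules.
    String resolveFunctionName(String name);
}

class HiveFunctionHelperSketch implements FunctionHelper {
    @Override
    public String resolveFunctionName(String name) {
        // Hive treats function names case-insensitively.
        return name.toLowerCase();
    }
}
```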



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Time to Remove Hive-on-Spark

2020-07-27 Thread David
Hello Xuefu,

I am not part of the Cloudera Hive product team, though I volunteer to
work on small projects from time to time.  Perhaps someone from that team
can chime in with some of their thoughts, but personally, I think that in
the long run, there will be more of a merge between Hive-on-Spark and other
Spark-native offerings.  I'm not sure what the differentiation will be
going forward.  With that said, are there any developers on this mailing
list who are willing to take on the maintenance effort of keeping HoS
moving forward?

http://www.russellspitzer.com/2017/05/19/Spark-Sql-Thriftserver/
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.4/bk_spark-component-guide/content/config-sts.html


Thanks.

On Thu, Jul 23, 2020 at 12:35 PM Xuefu Zhang  wrote:

> Previous reasoning seemed to suggest a lack of user adoption. Now we are
> concerned about ongoing maintenance effort. Both are valid considerations.
> However, I think we should have ways to find out the answers. Therefore, I
> suggest the following be carried out:
>
> 1. Send out the proposal (removing Hive on Spark) to users including
> u...@hive.apache.org and get their feedback.
> 2. Ask if any developers on this mailing list are willing to take on the
> maintenance effort.
>
> I'm concerned about user impact because I can still see issues being
> reported on HoS from time to time. I'm more concerned about the future of
> Hive if we narrow Hive's neutrality on execution engines, which will possibly
> force more Hive users to migrate to other alternatives such as Spark SQL,
> which is already eroding Hive's user base.
>
> Being open and neutral used to be among Hive's most admired strengths.
>
> Thanks,
> Xuefu
>
>
> On Wed, Jul 22, 2020 at 8:46 AM Alan Gates  wrote:
>
> > An important point here is I don't believe David is proposing to remove
> > Hive on Spark from the 2 or 3 lines, but only from trunk.  Continuing to
> > support it in existing 2 and 3 lines makes sense, but since no one has
> > maintained it on trunk for some time and it does not work with many of
> the
> > newer features it should be removed from trunk.
> >
> > Alan.
> >
> > On Tue, Jul 21, 2020 at 4:10 PM Chao Sun  wrote:
> >
> > > Thanks David. FWIW Uber is still running Hive on Spark (2.3.4) on a
> very
> > > large scale in production right now and I don't think we have any plan
> to
> > > change it soon.
> > >
> > >
> > >
> > > On Tue, Jul 21, 2020 at 11:28 AM David  wrote:
> > >
> > > > Hello,
> > > >
> > > > Thanks for the feedback.
> > > >
> > > > Just a quick recap: I did propose this @dev and I received unanimous
> > +1's
> > > > from the community.  After a couple months, I created the PR.
> > > >
> > > > Certainly open to discussion, but there hasn't been any discussion
> thus
> > > far
> > > > because there have been no objections until this point.
> > > >
> > > > HoS has low adoption, heavy technical debt, and the manner in which
> its
> > > > build process is set up is impeding some other work that is not even
> > > related
> > > > to HoS.
> > > >
> > > > We can deprecate in Hive 3.x and remove in Hive 4.x.  The plan would
> be
> > > to
> > > > use Tez moving forward.
> > > >
> > > > My point about the vendor's move to Tez is that HoS adoption is very
> > low,
> > > > it's only going lower, and while I don't know the specifics of it,
> > there
> > > > must be some migration plan in place there (i.e., it must be possible
> > to
> > > do
> > > > it already).
> > > >
> > > > Thanks,
> > > > David
> > > >
> > > > On Tue, Jul 21, 2020 at 12:23 PM Xuefu Zhang 
> wrote:
> > > >
> > > > > Hi David,
> > > > >
> > > > > While a vendor may not support a component in an open source
> project,
> > > > > removing it or not is a decision by and for the community. I
> > certainly
> > > > > understand that the vendor you mentioned has contributed a great
> deal
> > > > > (including my personal effort while working there), but it's not up to
> > the
> > > > > vendor to make a call like what is proposed here.
> > > > >
> > > > > As a community, we should have gone through a thorough discussion
> and
> > > > > reached a consensus before actually making such a big change, in my
> > > > > opinion.
> > > > >
> > > > > Thanks,
> > > > > Xuefu
> > > > >
> > > > > On Tue, Jul 21, 2020 at 8:49 AM David  wrote:
> > > > >
> > > > > > Hey,
> > > > > >
> > > > > > Thanks for the input.
> > > > > >
> > > > > > FYI. Cloudera (Cloudera + Hortonworks) have removed HoS from
> their
> > > > latest
> > > > > > offering.
> > > > > >
> > > > > > "Tez is now the only supported execution engine, existing queries
> > > that
> > > > > > change execution mode to Spark or MapReduce within a session, for
> > > > > example,
> > > > > > fail."
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.cloudera.com/cdp/latest/upgrade-post/topics/ug_hive_configuration_changes.html
> > > > > >
> > > > > >
> > > > > > So I don't know who will be supporting this feature moving
> forward,
> > > but
> > 

[jira] [Created] (HIVE-23940) Add TPCH tables (scale factor 0.001) as qt datasets

2020-07-27 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23940:
--

 Summary: Add TPCH tables (scale factor 0.001) as qt datasets
 Key: HIVE-23940
 URL: https://issues.apache.org/jira/browse/HIVE-23940
 Project: Hive
  Issue Type: Improvement
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


Currently there are only two TPCH tables (lineitem, part) in the qt datasets, and 
the data do not reflect an actual scale factor. 

The TPC-H schema is quite popular, and having all of its tables makes it easier 
to create meaningful and understandable queries. 

Moreover, keeping the standard proportions yields query plans that remain 
meaningful when the scale factor changes, and makes it easier to compare the 
correctness of the results against other databases.  

The goal of this issue is to add all TPCH tables with their data at scale 
factor 0.001 as qt datasets.
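For context, a back-of-the-envelope sketch of what "standard proportions" implies at SF 0.001. The cardinalities at SF 1 below are the approximate values from the TPC-H specification (lineitem varies slightly around 6M, and the fixed-size nation/region tables are omitted); the scaled counts are simple proportional estimates, not the exact row counts a generator would emit.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Approximate TPC-H row counts at a given scale factor,
// derived from the per-table cardinalities at SF 1.
class TpchCardinality {
    static final Map<String, Long> ROWS_AT_SF1 = new LinkedHashMap<>();
    static {
        ROWS_AT_SF1.put("lineitem", 6_000_000L); // approximate; varies slightly
        ROWS_AT_SF1.put("orders",   1_500_000L);
        ROWS_AT_SF1.put("partsupp",   800_000L);
        ROWS_AT_SF1.put("part",       200_000L);
        ROWS_AT_SF1.put("customer",   150_000L);
        ROWS_AT_SF1.put("supplier",    10_000L);
    }

    static long rowsAt(String table, double scaleFactor) {
        return Math.round(ROWS_AT_SF1.get(table) * scaleFactor);
    }
}
```

So at SF 0.001 the datasets stay tiny (thousands of rows at most) while the relative table sizes match the standard, which is what keeps the plans comparable across scale factors.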



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23939) SharedWorkOptimizer: taking the union of columns in mergeable TableScans

2020-07-27 Thread Krisztian Kasa (Jira)
Krisztian Kasa created HIVE-23939:
-

 Summary: SharedWorkOptimizer: taking the union of columns in 
mergeable TableScans
 Key: HIVE-23939
 URL: https://issues.apache.org/jira/browse/HIVE-23939
 Project: Hive
  Issue Type: Improvement
  Components: Physical Optimizer
Reporter: Krisztian Kasa
Assignee: Krisztian Kasa


{code}
POSTHOOK: query: explain
select case when (select count(*) 
  from store_sales 
  where ss_quantity between 1 and 20) > 409437
then (select avg(ss_ext_list_price) 
  from store_sales 
  where ss_quantity between 1 and 20) 
else (select avg(ss_net_paid_inc_tax)
  from store_sales
  where ss_quantity between 1 and 20) end bucket1 ,
   case when (select count(*)
  from store_sales
  where ss_quantity between 21 and 40) > 4595804
then (select avg(ss_ext_list_price)
  from store_sales
  where ss_quantity between 21 and 40) 
else (select avg(ss_net_paid_inc_tax)
  from store_sales
  where ss_quantity between 21 and 40) end bucket2,
   case when (select count(*)
  from store_sales
  where ss_quantity between 41 and 60) > 7887297
then (select avg(ss_ext_list_price)
  from store_sales
  where ss_quantity between 41 and 60)
else (select avg(ss_net_paid_inc_tax)
  from store_sales
  where ss_quantity between 41 and 60) end bucket3,
   case when (select count(*)
  from store_sales
  where ss_quantity between 61 and 80) > 10872978
then (select avg(ss_ext_list_price)
  from store_sales
  where ss_quantity between 61 and 80)
else (select avg(ss_net_paid_inc_tax)
  from store_sales
  where ss_quantity between 61 and 80) end bucket4,
   case when (select count(*)
  from store_sales
  where ss_quantity between 81 and 100) > 43571537
then (select avg(ss_ext_list_price)
  from store_sales
  where ss_quantity between 81 and 100)
else (select avg(ss_net_paid_inc_tax)
  from store_sales
  where ss_quantity between 81 and 100) end bucket5
from reason
where r_reason_sk = 1
POSTHOOK: type: QUERY
POSTHOOK: Input: default@reason
POSTHOOK: Input: default@store_sales
POSTHOOK: Output: hdfs://### HDFS PATH ###
Plan optimized by CBO.

Vertex dependency in root stage
Reducer 10 <- Reducer 34 (CUSTOM_SIMPLE_EDGE), Reducer 9 (CUSTOM_SIMPLE_EDGE)
Reducer 11 <- Reducer 10 (CUSTOM_SIMPLE_EDGE), Reducer 18 (CUSTOM_SIMPLE_EDGE)
Reducer 12 <- Reducer 11 (CUSTOM_SIMPLE_EDGE), Reducer 24 (CUSTOM_SIMPLE_EDGE)
Reducer 13 <- Reducer 12 (CUSTOM_SIMPLE_EDGE), Reducer 30 (CUSTOM_SIMPLE_EDGE)
Reducer 14 <- Reducer 13 (CUSTOM_SIMPLE_EDGE), Reducer 19 (CUSTOM_SIMPLE_EDGE)
Reducer 15 <- Reducer 14 (CUSTOM_SIMPLE_EDGE), Reducer 25 (CUSTOM_SIMPLE_EDGE)
Reducer 16 <- Reducer 15 (CUSTOM_SIMPLE_EDGE), Reducer 31 (CUSTOM_SIMPLE_EDGE)
Reducer 18 <- Map 17 (CUSTOM_SIMPLE_EDGE)
Reducer 19 <- Map 17 (CUSTOM_SIMPLE_EDGE)
Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE), Reducer 20 (CUSTOM_SIMPLE_EDGE)
Reducer 20 <- Map 17 (CUSTOM_SIMPLE_EDGE)
Reducer 21 <- Map 17 (CUSTOM_SIMPLE_EDGE)
Reducer 22 <- Map 17 (CUSTOM_SIMPLE_EDGE)
Reducer 24 <- Map 23 (CUSTOM_SIMPLE_EDGE)
Reducer 25 <- Map 23 (CUSTOM_SIMPLE_EDGE)
Reducer 26 <- Map 23 (CUSTOM_SIMPLE_EDGE)
Reducer 27 <- Map 23 (CUSTOM_SIMPLE_EDGE)
Reducer 28 <- Map 23 (CUSTOM_SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE), Reducer 26 (CUSTOM_SIMPLE_EDGE)
Reducer 30 <- Map 29 (CUSTOM_SIMPLE_EDGE)
Reducer 31 <- Map 29 (CUSTOM_SIMPLE_EDGE)
Reducer 32 <- Map 29 (CUSTOM_SIMPLE_EDGE)
Reducer 33 <- Map 29 (CUSTOM_SIMPLE_EDGE)
Reducer 34 <- Map 29 (CUSTOM_SIMPLE_EDGE)
Reducer 4 <- Reducer 3 (CUSTOM_SIMPLE_EDGE), Reducer 32 (CUSTOM_SIMPLE_EDGE)
Reducer 5 <- Reducer 21 (CUSTOM_SIMPLE_EDGE), Reducer 4 (CUSTOM_SIMPLE_EDGE)
Reducer 6 <- Reducer 27 (CUSTOM_SIMPLE_EDGE), Reducer 5 (CUSTOM_SIMPLE_EDGE)
Reducer 7 <- Reducer 33 (CUSTOM_SIMPLE_EDGE), Reducer 6 (CUSTOM_SIMPLE_EDGE)
Reducer 8 <- Reducer 22 (CUSTOM_SIMPLE_EDGE), Reducer 7 (CUSTOM_SIMPLE_EDGE)
Reducer 9 <- Reducer 28 (CUSTOM_SIMPLE_EDGE), Reducer 8 (CUSTOM_SIMPLE_EDGE)

Stage-0
  Fetch Operator
limit:-1
Stage-1
  Reducer 16
  File Output Operator [FS_154]
Select Operator [SEL_153] (rows=2 width=560)
  Output:["_col0","_col1","_col2","_col3","_col4"]
  Merge Join Operator [MERGEJOIN_185] (rows=2 width=1140)
Conds:(Left 
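As an illustrative sketch of the optimization named in the summary (this is not Hive's actual SharedWorkOptimizer code): the repeated scans of store_sales above each project a different column, so merging them into one TableScan requires projecting the union of the columns needed by every branch.

```java
import java.util.Set;
import java.util.TreeSet;

// Toy model of merging two mergeable TableScans: the merged scan must
// read every column that either original scan needed.
class ScanMergeSketch {
    static Set<String> mergedProjection(Set<String> scanA, Set<String> scanB) {
        Set<String> union = new TreeSet<>(scanA); // sorted for stable output
        union.addAll(scanB);
        return union;
    }
}
```

For example, a scan needing {ss_quantity, ss_ext_list_price} merged with one needing {ss_quantity, ss_net_paid_inc_tax} yields a single scan over three columns instead of two scans over two each.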

[jira] [Created] (HIVE-23938) LLAP: JDK11 - some GC log file rotation related jvm arguments cannot be used anymore

2020-07-27 Thread Jira
László Bodor created HIVE-23938:
---

 Summary: LLAP: JDK11 - some GC log file rotation related jvm 
arguments cannot be used anymore
 Key: HIVE-23938
 URL: https://issues.apache.org/jira/browse/HIVE-23938
 Project: Hive
  Issue Type: Improvement
Reporter: László Bodor
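For context, a hedged illustration of the kind of change the summary refers to: JDK 9's unified logging (JEP 158/271) removed the old GC-log-rotation -XX flags, so JDK 11 rejects them at startup. The paths and sizes below are example values, not LLAP's actual configuration.

```
# JDK 8 style GC log rotation (these -XX flags no longer exist on JDK 11,
# and the JVM fails to start when they are passed):
#   -Xloggc:/var/log/llap/gc.log -XX:+UseGCLogFileRotation \
#   -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M

# Approximate JDK 11 unified-logging equivalent, with rotation folded
# into the -Xlog option itself:
#   -Xlog:gc*:file=/var/log/llap/gc.log:time,uptime:filecount=10,filesize=10M
```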






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


hive-test-kube not running tests

2020-07-27 Thread Zoltan Haindrich

Hey All!

I guess some of you have noticed that builds sometimes don't start.
This was caused by my having undersized the volume that holds the build artifacts when I originally created it, so the build results filled up the disk and the 
system could not record results anymore.


I wanted to fix this by enabling branch indexing, because that would also remove old builds (I just removed some garbage manually when I first saw this), but it 
would cause a bit of trouble for us right now, so I've left it disabled.


I've resized the volume to provide adequate space for at least a few months.

Without branch indexing the system will not notice which PRs were missed. I'll get to that soon; I've already prototyped a solution, but deploying it will rerun 
tests on all open PRs.


cheers,
Zoltan


[jira] [Created] (HIVE-23937) Take null ordering into consideration when pushing TNK through inner joins

2020-07-27 Thread Attila Magyar (Jira)
Attila Magyar created HIVE-23937:


 Summary: Take null ordering into consideration when pushing TNK 
through inner joins
 Key: HIVE-23937
 URL: https://issues.apache.org/jira/browse/HIVE-23937
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Affects Versions: 4.0.0
Reporter: Attila Magyar
Assignee: Attila Magyar
 Fix For: 4.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23936) Provide approximate number of input records to be processed in broadcast reader

2020-07-27 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23936:
---

 Summary: Provide approximate number of input records to be 
processed in broadcast reader
 Key: HIVE-23936
 URL: https://issues.apache.org/jira/browse/HIVE-23936
 Project: Hive
  Issue Type: Bug
Reporter: Rajesh Balamohan


There are cases where broadcast data is loaded into a hashtable in upstream 
applications (e.g. Hive). Apps tend to estimate the number of entries in the 
hashtable diligently, but those estimates can be very hard to get right at 
compile time.

 

Tez can help in such cases by providing an "approximate number of input 
records" counter for the data to be processed in UnorderedKVInput. This would 
avoid expensive rehashing when hashtable sizes are not estimated correctly. It 
would be good to start with the broadcast case first and then move on to the 
unordered partitioned case later.

 

This would help in predicting the number of entries at runtime and yield 
better estimates for the hashtable size.
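A minimal sketch of why such a counter helps (the names here are illustrative, not the actual Tez or Hive API): if the reader can report an approximate record count up front, the consumer can create its hashtable at the right capacity and avoid repeated rehashing as entries are inserted.

```java
import java.util.HashMap;

// Presize a HashMap so that `approxRecords` entries fit without a resize
// at the default 0.75 load factor (the same idea behind Guava's
// newHashMapWithExpectedSize helper).
class HashTableSizing {
    static int capacityFor(long approxRecords) {
        return (int) Math.ceil(approxRecords / 0.75);
    }

    static <K, V> HashMap<K, V> presized(long approxRecords) {
        return new HashMap<>(capacityFor(approxRecords));
    }
}
```

Without the counter the table might start at a default capacity and rehash many times on a large broadcast input; with it, insertion cost stays flat.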



--
This message was sent by Atlassian Jira
(v8.3.4#803005)