Questions on LLAP and hive.server2.enable.doAs

2018-05-17 Thread Sungwoo Park
Hello,

I have a couple of questions on LLAP and hive.server2.enable.doAs. I've
learned that LLAP does not support hive.server2.enable.doAs=true, but what
if we disable LLAP IO? If LLAP IO is disabled and no cache is used in LLAP
daemons,  I guess it should be okay to allow  hive.server2.enable.doAs=true
because no data is shared between different queries inside LLAP daemons.

So, my questions are:

1. Does LLAP always disallow hive.server2.enable.doAs=true, whether LLAP IO
is enabled or not?
2. If LLAP disallows  hive.server2.enable.doAs=true even when LLAP IO is
disabled (i.e., when cache size is set to zero), why?

From my own experiments, LLAP runs fast enough even without LLAP IO. In
such scenarios, setting hive.server2.enable.doAs=true can be useful.
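
For reference, here is a minimal sketch of the settings under discussion,
written as set commands for brevity (hive.server2.enable.doAs is a
HiveServer2-side setting that normally goes into hive-site.xml, and I am
assuming hive.llap.io.enabled and hive.llap.io.memory.size are the keys used
to disable LLAP IO and its cache):

set hive.server2.enable.doAs=true;

-- disable LLAP IO entirely, so that no data is cached inside LLAP daemons
set hive.llap.io.enabled=false;

-- alternatively, keep LLAP IO enabled but set the cache size to zero
set hive.llap.io.memory.size=0;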

Thanks in advance.

--- Sungwoo


Re: Questions on LLAP and hive.server2.enable.doAs

2018-05-17 Thread Sungwoo Park
For question 1, if hive.server2.enable.doAs is set to true, the AppMaster
fails to connect to LLAP daemons (from my experiments).

--- Sungwoo

On Fri, May 18, 2018 at 1:02 AM, Sungwoo Park <glap...@gmail.com> wrote:

> Hello,
>
> I have a couple of questions on LLAP and hive.server2.enable.doAs. I've
> learned that LLAP does not support hive.server2.enable.doAs=true, but
> what if we disable LLAP IO? If LLAP IO is disabled and no cache is used in
> LLAP daemons,  I guess it should be okay to allow
> hive.server2.enable.doAs=true because no data is shared between different
> queries inside LLAP daemons.
>
> So, my questions are:
>
> 1. Does LLAP always disallow hive.server2.enable.doAs=true, whether LLAP
> IO is enabled or not?
> 2. If LLAP disallows  hive.server2.enable.doAs=true even when LLAP IO is
> disabled (i.e., when cache size is set to zero), why?
>
> From my own experiments, LLAP runs fast enough even without LLAP IO. In
> such scenarios, setting hive.server2.enable.doAs=true can be useful.
>
> Thanks in advance.
>
> --- Sungwoo
>


Re: MERGE performances issue

2018-05-24 Thread Sungwoo Park
Hive-MR3 could be a solution for you. It supports everything that you
mention in the previous post. I have written a blog article discussing the
pros and cons of Hive-MR3 with respect to Hive-LLAP.

https://mr3.postech.ac.kr/blog/2018/05/19/comparison-hivemr3-llap/

--- Sungwoo

On Thu, May 10, 2018 at 4:44 AM, Nicolas Paris  wrote:

>
> True. I was using Hive 1.2.1. Then I tested Hive 2.10. The point is I am
> quite unclear on whether Hive 2.X is equivalent to Hive LLAP or not. My
> concern with Hive LLAP is that I cannot use it combined with Kerberos
> security since the LLAP daemon is hosted by Hive, and apparently cannot do
> "doAs" to impersonate other users.
>
> If there is a way to use Hive 2.X without LLAP and benefit from all the
> features except in-memory computation, that would be a good point for me.
>
>


Announce: Hive-MR3 0.2

2018-05-24 Thread Sungwoo Park
Hello Hive users,

I am pleased to announce the release of Hive-MR3 0.2.

Hive-MR3 now supports LLAP I/O. I have published a blog article that
compares the stability and performance of Hive-MR3 and Hive-LLAP:
https://mr3.postech.ac.kr/blog/2018/05/19/comparison-hivemr3-llap/

From the blog article, the pros of Hive-MR3 with respect to Hive-LLAP are:
1. Higher stability
2. Faster execution
3. Elastic allocation of cluster resources
4. Support for hive.server2.enable.doAs=true
5. Better support for concurrency

You can download Hive-MR3 0.2 at:
https://mr3.postech.ac.kr/download/home/

Any comment on Hive-MR3 0.2 will be appreciated. Thanks a lot!

--- Sungwoo


CachedStore for hive.metastore.rawstore.impl in Hive 3.0

2018-06-12 Thread Sungwoo Park
Hello Hive users,

I am experiencing a problem with MetaStore in Hive 3.0.

1. Start MetaStore
with hive.metastore.rawstore.impl=org.apache.hadoop.hive.metastore.ObjectStore.

2. Generate TPC-DS data.

3. TPC-DS queries run okay and produce correct results. E.g., from query 1:
+---+
|   c_customer_id   |
+---+
| CHAA  |
| DCAA  |
| DDAA  |
...
| AAAILIAA  |
+---+
100 rows selected (69.901 seconds)

However, query compilation takes a long time
(https://issues.apache.org/jira/browse/HIVE-16520).

4. Now, restart MetaStore with
hive.metastore.rawstore.impl=org.apache.hadoop.hive.metastore.cache.CachedStore.

5. TPC-DS queries run okay, but produce wrong results. E.g., from query 1:
++
| c_customer_id  |
++
++
No rows selected (37.448 seconds)

What I noticed is that with hive.metastore.rawstore.impl=CachedStore,
HiveServer2 produces log messages such as the following:

2018-06-12T23:50:04,223  WARN [b3041385-0290-492f-aef8-c0249de328ad
HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats for
tpcds_bin_partitioned_orc_1000@date_dim, Columns: d_date_sk, d_year
2018-06-12T23:50:04,223  INFO [b3041385-0290-492f-aef8-c0249de328ad
HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for
tpcds_bin_partitioned_orc_1000@date_dim, Columns: d_date_sk, d_year
2018-06-12T23:50:04,225  WARN [b3041385-0290-492f-aef8-c0249de328ad
HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats for
tpcds_bin_partitioned_orc_1000@store, Columns: s_state, s_store_sk
2018-06-12T23:50:04,225  INFO [b3041385-0290-492f-aef8-c0249de328ad
HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for
tpcds_bin_partitioned_orc_1000@store, Columns: s_state, s_store_sk
2018-06-12T23:50:04,226  WARN [b3041385-0290-492f-aef8-c0249de328ad
HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats for
tpcds_bin_partitioned_orc_1000@customer, Columns: c_customer_sk,
c_customer_id
2018-06-12T23:50:04,226  INFO [b3041385-0290-492f-aef8-c0249de328ad
HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for
tpcds_bin_partitioned_orc_1000@customer, Columns: c_customer_sk,
c_customer_id

2018-06-12T23:50:05,158 ERROR [b3041385-0290-492f-aef8-c0249de328ad
HiveServer2-Handler-Pool: Thread-59] annotation.StatsRulesProcFactory:
Invalid column stats: No of nulls > cardinality
2018-06-12T23:50:05,159 ERROR [b3041385-0290-492f-aef8-c0249de328ad
HiveServer2-Handler-Pool: Thread-59] annotation.StatsRulesProcFactory:
Invalid column stats: No of nulls > cardinality
2018-06-12T23:50:05,160 ERROR [b3041385-0290-492f-aef8-c0249de328ad
HiveServer2-Handler-Pool: Thread-59] annotation.StatsRulesProcFactory:
Invalid column stats: No of nulls > cardinality

However, even after computing column stats, queries still return wrong
results, despite the fact that the above log messages disappear.
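
For completeness, this is roughly how I compute the column stats mentioned
above (a sketch only; the database name is taken from the log messages, and
the tables are those reported in the warnings):

use tpcds_bin_partitioned_orc_1000;

-- basic table statistics
analyze table date_dim compute statistics;
analyze table store compute statistics;
analyze table customer compute statistics;

-- column statistics (covers d_date_sk/d_year, s_state/s_store_sk,
-- c_customer_sk/c_customer_id among others)
analyze table date_dim compute statistics for columns;
analyze table store compute statistics for columns;
analyze table customer compute statistics for columns;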

I guess I am missing some configuration parameters (because I imported
hive-site.xml from Hive 2). Any suggestion would be appreciated.

Thanks a lot,

--- Sungwoo Park


Re: issues with Hive 3 simple select from an ORC table

2018-06-12 Thread Sungwoo Park
This is a diff file that lets me compile Hive 3.0 on Hadoop 2.8.0 (and also
run it on Hadoop 2.7.x).

diff --git a/pom.xml b/pom.xml
index c57ff58..8445288 100644
--- a/pom.xml
+++ b/pom.xml
@@ -146,7 +146,7 @@
 19.0
 2.4.11
 1.3.166
-3.1.0
+2.8.0

 
${basedir}/${hive.path.to.root}/testutils/hadoop
 1.3
 2.0.0-alpha4
@@ -1212,7 +1212,7 @@
   true
 
   
-  true
+  false
 
   
   
diff --git
a/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionState.java
b/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionState.java
index b13f73b..21d8541 100644
--- a/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionState.java
+++ b/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionState.java
@@ -277,7 +277,7 @@ protected void openInternal(String[] additionalFilesNotFromConf,
 } else {
   this.resources = new HiveResources(createTezDir(sessionId,
"resources"));
   ensureLocalResources(conf, additionalFilesNotFromConf);
-  LOG.info("Created new resources: " + resources);
+  LOG.info("Created new resources: " + this.resources);
 }

 // unless already installed on all the cluster nodes, we'll have to
@@ -639,7 +639,6 @@ public void ensureLocalResources(Configuration conf, String[] newFilesNotFromCon
* @throws Exception
*/
   void close(boolean keepDagFilesDir) throws Exception {
-console = null;
 appJarLr = null;

 try {
@@ -665,6 +664,7 @@ void close(boolean keepDagFilesDir) throws Exception {
 }
   }
 } finally {
+  console = null;
   try {
 cleanupScratchDir();
   } finally {
diff --git a/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java
b/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java
index 84ae157..be66787 100644
--- a/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java
+++ b/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java
@@ -160,7 +160,9 @@ public int execute(DriverContext driverContext) {
   if (userName == null) {
 userName = "anonymous";
   } else {
-groups = UserGroupInformation.createRemoteUser(userName).getGroups();
+groups = Arrays.asList(UserGroupInformation.createRemoteUser(userName).getGroupNames());
+// TODO: for Hadoop 2.8.0+, just call getGroups():
+//   groups = UserGroupInformation.createRemoteUser(userName).getGroups();
   }
   MappingInput mi = new MappingInput(userName, groups,
   ss.getHiveVariables().get("wmpool"),
ss.getHiveVariables().get("wmapp"));
diff --git
a/ql/src/java/org/apache/hadoop/hive/ql/hooks/HiveProtoLoggingHook.java
b/ql/src/java/org/apache/hadoop/hive/ql/hooks/HiveProtoLoggingHook.java
index 1ae8194..aaf0c62 100644
--- a/ql/src/java/org/apache/hadoop/hive/ql/hooks/HiveProtoLoggingHook.java
+++ b/ql/src/java/org/apache/hadoop/hive/ql/hooks/HiveProtoLoggingHook.java
@@ -472,7 +472,7 @@ static EventLogger getInstance(HiveConf conf) {
   if (instance == null) {
 synchronized (EventLogger.class) {
   if (instance == null) {
-instance = new EventLogger(conf, SystemClock.getInstance());
+instance = new EventLogger(conf, new SystemClock());
 ShutdownHookManager.addShutdownHook(instance::shutdown);
   }
 }
diff --git a/ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java
b/ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java
index 183515a..2f393c3 100644
--- a/ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java
+++ b/ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java
@@ -1051,7 +1051,9 @@ else if (prev != null && next.maxWriteId == prev.maxWriteId
  */
 Collections.sort(original, (HdfsFileStatusWithId o1, HdfsFileStatusWithId o2) -> {
   //this does "Path.uri.compareTo(that.uri)"
-  return o1.getFileStatus().compareTo(o2.getFileStatus());
+  return o1.getFileStatus().getPath().compareTo(o2.getFileStatus().getPath());
+  // TODO: for Hadoop 2.8+
+  // return o1.getFileStatus().compareTo(o2.getFileStatus());
 });

 // Note: isRawFormat is invalid for non-ORC tables. It will always return true, so we're good.
diff --git
a/ql/src/test/org/apache/hadoop/hive/ql/hooks/TestHiveProtoLoggingHook.java
b/ql/src/test/org/apache/hadoop/hive/ql/hooks/TestHiveProtoLoggingHook.java
index 5e117fe..4367107 100644
---
a/ql/src/test/org/apache/hadoop/hive/ql/hooks/TestHiveProtoLoggingHook.java
+++
b/ql/src/test/org/apache/hadoop/hive/ql/hooks/TestHiveProtoLoggingHook.java
@@ -76,7 +76,7 @@ public void setup() throws Exception {
   @Test
   public void testPreEventLog() throws Exception {
 context.setHookType(HookType.PRE_EXEC_HOOK);
-EventLogger evtLogger = new EventLogger(conf, SystemClock.getInstance());
+EventLogger evtLogger = new EventLogger(conf, new SystemClock());
 evtLogger.handle(context);
 evtLogger.shutdown();

@@ -105,7 

Re: Question on accessing LLAP as data cache from external containers

2018-01-31 Thread Sungwoo Park
Thanks for the link. My question was how to access the LLAP daemon from
YARN containers to retrieve data for Hive jobs.

For example, a Hive job may start Tez containers, which then retrieve data
from LLAP running concurrently. In the current implementation, this is
unrealistic (because every task can simply be sent to the LLAP daemon), but I
wonder if this is feasible in principle (with a bit of hacking).

On Tue, Jan 30, 2018 at 3:42 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> Are you looking for sth like this:
> https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/
> CentralizedCacheManagement.html
>
> To answer your original question: why not implement the whole job in Hive?
> Or orchestrate using Oozie: some parts in MR and some in Hive.
>
> On 30. Jan 2018, at 05:15, Sungwoo Park <glap...@gmail.com> wrote:
>
> Hello all,
>
> I wonder if an external YARN container can send requests to LLAP daemon to
> read data from its in-memory cache. For example, YARN containers owned by a
> typical MapReduce job (e.g., TeraSort) could fetch data directly from LLAP
> instead of contacting HDFS. In this scenario, LLAP daemon just serves IO
> requests from YARN containers and does not run its executors to perform
> non-trivial computation.
>
> If this is feasible, LLAP daemon can be shared by all services running in
> the cluster. Any comment would be appreciated. Thanks a lot.
>
> -- Gla Park
>
>


Question on accessing LLAP as data cache from external containers

2018-01-29 Thread Sungwoo Park
 Hello all,

I wonder if an external YARN container can send requests to LLAP daemon to
read data from its in-memory cache. For example, YARN containers owned by a
typical MapReduce job (e.g., TeraSort) could fetch data directly from LLAP
instead of contacting HDFS. In this scenario, LLAP daemon just serves IO
requests from YARN containers and does not run its executors to perform
non-trivial computation.

If this is feasible, LLAP daemon can be shared by all services running in
the cluster. Any comment would be appreciated. Thanks a lot.

-- Gla Park


Re: Announce: MR3 0.3, and performance comparison with Hive-LLAP, Presto, Spark, Hive on Tez

2018-08-16 Thread Sungwoo Park
The article can be found at:

https://mr3.postech.ac.kr/blog/2018/08/15/comparison-llap-presto-spark-mr3/

-- Sungwoo Park

On Thu, Aug 16, 2018 at 10:53 PM, Sungwoo Park  wrote:

> Hello Hive users,
>
> I am pleased to announce the release of MR3 0.3. A new feature of MR3 0.3
> is its support for Hive 3.0.0 on Hadoop 2.7/2.8/2.9. I have also published
> a blog article that uses the TPC-DS benchmark to compare the following six
> systems:
>
> 1) Hive-LLAP included in HDP 2.6.4
> 2) Presto 0.203e
> 3) Spark 2.2.0 included in HDP 2.6.4
> 4) Hive 3.0.0 on Tez
> 5) Hive 3.0.0 on MR3
> 6) Hive 2.3.3 on MR3
>
> You can download MR3 0.3 at:
>
> https://mr3.postech.ac.kr/download/home/
>
> Thank you for your interest!
>
> --- Sungwoo Park
>
>


Announce: MR3 0.3, and performance comparison with Hive-LLAP, Presto, Spark, Hive on Tez

2018-08-16 Thread Sungwoo Park
Hello Hive users,

I am pleased to announce the release of MR3 0.3. A new feature of MR3 0.3
is its support for Hive 3.0.0 on Hadoop 2.7/2.8/2.9. I have also published
a blog article that uses the TPC-DS benchmark to compare the following six
systems:

1) Hive-LLAP included in HDP 2.6.4
2) Presto 0.203e
3) Spark 2.2.0 included in HDP 2.6.4
4) Hive 3.0.0 on Tez
5) Hive 3.0.0 on MR3
6) Hive 2.3.3 on MR3

You can download MR3 0.3 at:

https://mr3.postech.ac.kr/download/home/

Thank you for your interest!

--- Sungwoo Park


Fwd: Hive generating different DAGs from the same query

2018-07-19 Thread Sungwoo Park
Hello Zoltan,

I further tested, and found no Exception (such as
MapJoinMemoryExhaustionError) during the run. So, the query ran fine. My
conclusion is that a query can update some internal states of HiveServer2,
affecting DAG generation for subsequent queries. Moreover, the same query
may or may not affect DAG generation.

This issue is not related to query reexecution, as even with query
reexecution disabled (hive.query.reexecution.enabled set to false), I still
see this problem occurring.
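
For reference, these are the re-execution settings involved, as a sketch (the
first is the one I disabled; the second is the scope setting that Zoltan
mentions below, shown here with what I believe is its default value):

set hive.query.reexecution.enabled=false;

-- plans should not change permanently unless this is wider than 'query'
set hive.query.reexecution.stats.persist.scope=query;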

--- Sungwoo Park

On Fri, Jul 13, 2018 at 4:48 PM, Zoltan Haindrich  wrote:

> Hello Sungwoo!
>
> I think it's possible that reoptimization is kicking in, because the first
> execution has bumped into an exception.
>
> I think the plans should not be changing permanently; unless
> "hive.query.reexecution.stats.persist.scope" is set to a wider scope than
> query.
>
> To check that indeed reoptimization is happening(or not) look for:
>
> cat > patterns << EOF
> org.apache.hadoop.hive.ql.exec.mapjoin.MapJoinMemoryExhaustionError
> reexec
> Driver.java:execute
> SessionState.java:printError
> EOF
>
> cat patterns
>
> fgrep -Ff patterns --color=yes /var/log/hive/hiveserver2.log | grep -v
> DEBUG
>
> cheers,
> Zoltan
>
>
> On 07/11/2018 10:40 AM, Sungwoo Park wrote:
>
>> Hello,
>>
>> I am running the TPC-DS benchmark using Hive 3.0, and I find that Hive
>> sometimes produces different DAGs from the same query. These are the two
>> scenarios for the experiment. The execution engine is tez, and the TPC-DS
>> scale factor is 3TB.
>>
>> 1. Run query 19 to query 24 sequentially in the same session. The first
>> part of query 24 takes about 156 seconds:
>>
>> 100 rows selected (58.641 seconds) <-- query 19
>> 100 rows selected (16.117 seconds)
>> 100 rows selected (9.841 seconds)
>> 100 rows selected (35.195 seconds)
>> 1 row selected (258.441 seconds)
>> 59 rows selected (213.156 seconds)
>> 4,643 rows selected (156.982 seconds) <-- the first part of query 24
>> 1,656 rows selected (136.382 seconds)
>>
>> 2. Now run query 1 to query 24 sequentially in the same session. This
>> time the first part of query 24 takes more than 1000 seconds:
>>
>> 100 rows selected (94.981 seconds) <-- query 1
>> 2,513 rows selected (30.804 seconds)
>> 100 rows selected (11.076 seconds)
>> 100 rows selected (225.646 seconds)
>> 100 rows selected (44.186 seconds)
>> 52 rows selected (11.436 seconds)
>> 100 rows selected (21.968 seconds)
>> 11 rows selected (14.05 seconds)
>> 1 row selected (35.619 seconds)
>> 100 rows selected (27.062 seconds)
>> 100 rows selected (134.098 seconds)
>> 100 rows selected (7.65 seconds)
>> 1 row selected (14.54 seconds)
>> 100 rows selected (143.965 seconds)
>> 100 rows selected (101.676 seconds)
>> 100 rows selected (19.742 seconds)
>> 1 row selected (245.381 seconds)
>> 100 rows selected (71.617 seconds)
>> 100 rows selected (23.017 seconds)
>> 100 rows selected (10.888 seconds)
>> 100 rows selected (11.149 seconds)
>> 100 rows selected (7.919 seconds)
>> 100 rows selected (29.527 seconds)
>> 1 row selected (220.516 seconds)
>> 59 rows selected (204.363 seconds)
>> 4,643 rows selected (1008.514 seconds) <-- the first part of query 24
>> 1,656 rows selected (141.279 seconds)
>>
>> Here are a few findings from the experiment:
>>
>> 1. The two DAGs for the first part of query 24 are quite similar, but
>> actually different. The DAG from the first scenario contains 17 vertices,
>> whereas the DAG from the second scenario contains 18 vertices, skipping
>> some part of map-side join that is performed in the first scenario.
>>
>> 2. The configuration (HiveConf) inside HiveServer2 is precisely the same
>> before running the first part of query 24 (except for minor keys).
>>
>> So, I wonder how Hive can produce different DAGs from the same query. For
>> example, is there some internal configuration key in HiveConf that
>> enables/disables some optimization depending on the accumulated statistics
>> in HiveServer2? (I haven't tested it yet, but I can also test with Hive
>> 2.x.)
>>
>> Thank you in advance,
>>
>> --- Sungwoo Park
>>
>>


Re: Does Hive 3.0 only works with hadoop3.x.y?

2018-07-19 Thread Sungwoo Park
I would say yes (because I am actually running Hive 3.0 on Hadoop 2.7.6 and
HDP 2.7.5), provided that you make small changes to the source code to Hive
3.0. However, I have not tested Hive 3.0 on Spark.

--- Sungwoo

On Thu, Jul 19, 2018 at 10:34 PM, 彭鱼宴 <461292...@qq.com> wrote:

> Hi Sungwoo,
>
> Just want to confirm, does that mean I just need to update the hive
> version, without updating the hadoop version?
>
> Thanks!
>
> Best,
> Zhefu Peng
>
>
> ------ 原始邮件 --
> *发件人:* "Sungwoo Park";
> *发送时间:* 2018年7月19日(星期四) 晚上8:20
> *收件人:* "user";
> *主题:* Re: Does Hive 3.0 only works with hadoop3.x.y?
>
> Hive 3.0 makes a few function calls that depend on Hadoop 3.x, but they are
> easy to replace with code that compiles okay on Hadoop 2.8+. I am currently
> running Hive 3.0 on Hadoop 2.7.6 and HDP 2.6.4 to test with the TPC-DS
> benchmark, and have not encountered any compatibility issue yet. I
> previously posted a diff file that lets us compile Hive 3.0 on Hadoop
> 2.8+.
>
> http://mail-archives.apache.org/mod_mbox/hive-user/201806.mbox/%3CCAKHFPXDDFn52buKetHzSXTtjzX3UMHf%3DQvxm9QNNkv9r5xBs-Q%40mail.gmail.com%3E
>
> --- Sungwoo Park
>
>
> On Thu, Jul 19, 2018 at 8:21 PM, 彭鱼宴 <461292...@qq.com> wrote:
>
>> Hi,
>>
>> I already deployed hive 2.2.0 on our hadoop cluster. And recently, we
>> deployed the spark cluster with 2.3.0, aiming at using the feature that
>> hive on spark engine. However, when I checked the website of hive release,
>> I found the text below:
>> 21 May 2018 : release 3.0.0 available
>> <https://hive.apache.org/downloads.html#21-may-2018-release-300-available>
>>
>> This release works with Hadoop 3.x.y.
>>
>> Now the hadoop version we deployed is hadoop 2.7.6. I wonder, does Hive
>> 3.0 only work with hadoop 3.x.y? Or, if we want to use hive 3.0, we have to
>> update the hadoop version to 3.x.y?
>>
>> Looking forward to your reply and help.
>>
>> Best,
>>
>> Zhefu Peng
>>
>
>


Re: Does Hive 3.0 only works with hadoop3.x.y?

2018-07-19 Thread Sungwoo Park
Hive 3.0 makes a few function calls that depend on Hadoop 3.x, but they are
easy to replace with code that compiles okay on Hadoop 2.8+. I am currently
running Hive 3.0 on Hadoop 2.7.6 and HDP 2.6.4 to test with the TPC-DS
benchmark, and have not encountered any compatibility issue yet. I
previously posted a diff file that lets us compile Hive 3.0 on Hadoop
2.8+.

http://mail-archives.apache.org/mod_mbox/hive-user/201806.mbox/%3CCAKHFPXDDFn52buKetHzSXTtjzX3UMHf%3DQvxm9QNNkv9r5xBs-Q%40mail.gmail.com%3E


--- Sungwoo Park


On Thu, Jul 19, 2018 at 8:21 PM, 彭鱼宴 <461292...@qq.com> wrote:

> Hi,
>
> I already deployed hive 2.2.0 on our hadoop cluster. And recently, we
> deployed the spark cluster with 2.3.0, aiming at using the feature that
> hive on spark engine. However, when I checked the website of hive release,
> I found the text below:
> 21 May 2018 : release 3.0.0 available
> <https://hive.apache.org/downloads.html#21-may-2018-release-300-available>
>
> This release works with Hadoop 3.x.y.
>
> Now the hadoop version we deployed is hadoop 2.7.6. I wonder, does Hive
> 3.0 only work with hadoop 3.x.y? Or, if we want to use hive 3.0, we have to
> update the hadoop version to 3.x.y?
>
> Looking forward to your reply and help.
>
> Best,
>
> Zhefu Peng
>


Re: Hive generating different DAGs from the same query

2018-09-11 Thread Sungwoo Park
Hello Gopal,

I have been looking further into this issue, and have found that the
non-deterministic behavior of Hive in
generating DAGs is actually due to the logic in
AggregateStatsCache.findBestMatch() called from
AggregateStatsCache.get(), as well as the disproportionate distribution of
Nulls in __HIVE_DEFAULT_PARTITION__
(in the case of the TPC-DS dataset).

Here is what is happening. Let me use web_sales table and ws_web_site_sk
column in the 10TB TPC-DS dataset as
a running example.

1. In the course of running TPC-DS queries, Hive asks MetaStore about the
column statistics of 1823 partNames
in the web_sales/ws_web_site_sk combination, either without
__HIVE_DEFAULT_PARTITION__ or with
__HIVE_DEFAULT_PARTITION__.

  --- Without __HIVE_DEFAULT_PARTITION__, it reports a total of 901180
nulls.

  --- With __HIVE_DEFAULT_PARTITION__, however, it reports a total of
1800087 nulls, almost twice as many.

2. The first call to MetaStore returns the correct result, but all
subsequent requests are likely to
return the same result from the cache, irrespective of the inclusion of
__HIVE_DEFAULT_PARTITION__. This is
because AggregateStatsCache.findBestMatch() treats
__HIVE_DEFAULT_PARTITION__ in the same way as other
partNames, and the difference in the size of partNames[] is just 1. The
outcome depends on the duration of
intervening queries, so everything is now non-deterministic.

3. If a wrong value of numNulls is returned, Hive generates a different
DAG, which usually takes much longer
than the correct one (e.g., 150s to 1000s for the first part of Query 24,
and 40s to 120s for Query 5).  I
guess the problem is particularly pronounced here because of the huge
number of nulls in
__HIVE_DEFAULT_PARTITION__. It is ironic to see that the query optimizer is
so efficient that a single wrong
guess of numNulls creates a very inefficient DAG.

Note that this behavior cannot be avoided by setting
hive.metastore.aggregate.stats.cache.max.variance to zero
because the difference in the number of partNames[] between the argument
and the entry in the cache is just 1.

I think that AggregateStatsCache.findBestMatch() should treat
__HIVE_DEFAULT_PARTITION__ in a special way, by
not returning the result in the cache if there is a difference in the
inclusion of partName
__HIVE_DEFAULT_PARTITION__ (or should provide the user with an option to
activate this feature). However, I am
testing only with the TPC-DS data, so please take my claim with a grain of
salt.
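
As an illustration, the discrepancy itself can be reproduced with plain
queries (a sketch assuming the standard partitioned TPC-DS schema, where
web_sales is partitioned on ws_sold_date_sk and __HIVE_DEFAULT_PARTITION__
holds the rows whose partition key is null):

-- nulls in ws_web_site_sk over all partitions, including __HIVE_DEFAULT_PARTITION__
select count(*) from web_sales where ws_web_site_sk is null;

-- nulls in ws_web_site_sk excluding __HIVE_DEFAULT_PARTITION__
select count(*) from web_sales
where ws_web_site_sk is null and ws_sold_date_sk is not null;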

--- Sungwoo


On Fri, Jul 20, 2018 at 2:54 PM Gopal Vijayaraghavan 
wrote:

> > My conclusion is that a query can update some internal states of
> HiveServer2, affecting DAG generation for subsequent queries.
>
> Other than the automatic reoptimization feature, there's two other
> potential suspects.
>
> First one would be to disable the in-memory stats cache's variance param,
> which might be triggering some residual effects.
>
> hive.metastore.aggregate.stats.cache.max.variance
>
> I set it to 0.0 when I suspect that feature is messing with the runtime
> plans or just disable the cache entirely with
>
> set hive.metastore.aggregate.stats.cache.enabled=false;
>
> Other than that, query24 is an interesting query.
>
> Is probably one of the corner cases where the predicate push-down is
> actually hurting the shared work optimizer.
>
> Also cross-check if you have accidentally loaded store_sales with
> ss_item_sk(int) and if the item i_item_sk is a bigint (type mismatches will
> trigger a slow join algorithm, but without any consistency issues).
>
> Cheers,
> Gopal
>
>
>


Re: Announce: MR3 0.3, and performance comparison with Hive-LLAP, Presto, Spark, Hive on Tez

2018-09-11 Thread Sungwoo Park
Thank you for reading the article. I plan to publish the result of running
concurrent queries with the release of MR3 0.4 (which implements a feature
affecting the performance of concurrent queries).

--- Sungwoo

On Sat, Sep 8, 2018 at 8:06 AM Nicolas Paris 
wrote:

>
> On Thu, Aug 16, 2018 at 10:55:19PM +0900, Sungwoo Park wrote:
> > The article  compare the following six systems:
>
> Great article, as usual. Would have been great to also compare
> concurrent queries. In particular, I guess presto on that point perform
> the best. That metric is major since such technology is usually shared
> by many users.
>
> --
> nicolas
>


Hive generating different DAGs from the same query

2018-07-11 Thread Sungwoo Park
Hello,

I am running the TPC-DS benchmark using Hive 3.0, and I find that Hive
sometimes produces different DAGs from the same query. These are the two
scenarios for the experiment. The execution engine is tez, and the TPC-DS
scale factor is 3TB.

1. Run query 19 to query 24 sequentially in the same session. The first
part of query 24 takes about 156 seconds:

100 rows selected (58.641 seconds) <-- query 19
100 rows selected (16.117 seconds)
100 rows selected (9.841 seconds)
100 rows selected (35.195 seconds)
1 row selected (258.441 seconds)
59 rows selected (213.156 seconds)
4,643 rows selected (156.982 seconds) <-- the first part of query 24
1,656 rows selected (136.382 seconds)

2. Now run query 1 to query 24 sequentially in the same session. This time
the first part of query 24 takes more than 1000 seconds:

100 rows selected (94.981 seconds) <-- query 1
2,513 rows selected (30.804 seconds)
100 rows selected (11.076 seconds)
100 rows selected (225.646 seconds)
100 rows selected (44.186 seconds)
52 rows selected (11.436 seconds)
100 rows selected (21.968 seconds)
11 rows selected (14.05 seconds)
1 row selected (35.619 seconds)
100 rows selected (27.062 seconds)
100 rows selected (134.098 seconds)
100 rows selected (7.65 seconds)
1 row selected (14.54 seconds)
100 rows selected (143.965 seconds)
100 rows selected (101.676 seconds)
100 rows selected (19.742 seconds)
1 row selected (245.381 seconds)
100 rows selected (71.617 seconds)
100 rows selected (23.017 seconds)
100 rows selected (10.888 seconds)
100 rows selected (11.149 seconds)
100 rows selected (7.919 seconds)
100 rows selected (29.527 seconds)
1 row selected (220.516 seconds)
59 rows selected (204.363 seconds)
4,643 rows selected (1008.514 seconds) <-- the first part of query 24
1,656 rows selected (141.279 seconds)

Here are a few findings from the experiment:

1. The two DAGs for the first part of query 24 are quite similar, but
actually different. The DAG from the first scenario contains 17 vertices,
whereas the DAG from the second scenario contains 18 vertices, skipping
some part of map-side join that is performed in the first scenario.

2. The configuration (HiveConf) inside HiveServer2 is precisely the same
before running the first part of query 24 (except for minor keys).

So, I wonder how Hive can produce different DAGs from the same query. For
example, is there some internal configuration key in HiveConf that
enables/disables some optimization depending on the accumulated statistics
in HiveServer2? (I haven't tested it yet, but I can also test with Hive
2.x.)

Thank you in advance,

--- Sungwoo Park


[Announce] Hive-MR3: Hive running on top of MR3

2018-04-04 Thread Sungwoo Park
Hello Hive users,

I am pleased to announce MR3 and Hive-MR3. Please visit the following
webpage for everything on MR3 and Hive-MR3:

https://mr3.postech.ac.kr/
http://datamonad.com

Here is a description of MR3 and Hive-MR3 from the webpage:

MR3 is a new execution engine for Hadoop. Similar in spirit to Tez, it can
be thought of as an enhancement of Tez with simpler design, better
performance, and more features. MR3 is ready for production use as it
supports all major features from Tez such as Kerberos-based security,
authentication and authorization, fault-tolerance, and recovery. MR3 is
implemented in Scala.

Hive-MR3 is an extension of Hive that runs on top of MR3. In order to
exploit new features in MR3, Hive-MR3 is built on a modified backend of
Hive. In comparison with Hive-on-Tez, Hive-MR3 generally runs faster for
sequential queries by virtue of the simple architectural design of
ApplicationMaster in MR3. In particular, it makes better use of
computing resources and thus yields a higher throughput for concurrent
queries.

--- Sungwoo Park


Re: [Announce] Hive-MR3: Hive running on top of MR3

2018-04-05 Thread Sungwoo Park
Yes, the idea is almost the same -- LLAP daemons can accept tasks from
different Tez AMs, whereas MR3 containers can accept tasks from different
DAGs. A minor difference is that in the case of MR3, a single shared AM can
manage multiple concurrent DAGs. As a result, there is no need to start a
new AM for each Beeline connection.

Hive-LLAP on MR3 is currently under development, and will be released as
part of Hive-MR3 0.2. In the meanwhile, let me test Hive-LLAP on Tez and
Hive-MR3 1.0 for performance and report the result in the MR3 blog.

--- Sungwoo

On Thu, Apr 5, 2018 at 12:38 AM, Thai Bui <blquyt...@gmail.com> wrote:

> It would be interesting to see how this compares to Hive LLAP on Tez.
> Since the llap daemons contain a queue of tasks that is shared amongst many
> Tez AMs, it could have similar characteristics to the way MR3 is sharing
> the containers between the AMs.
>
> On Wed, Apr 4, 2018 at 10:06 AM Sungwoo Park <glap...@gmail.com> wrote:
>
>> Hello Hive users,
>>
>> I am pleased to announce MR3 and Hive-MR3. Please visit the following
>> webpage for everything on MR3 and Hive-MR3:
>>
>> https://mr3.postech.ac.kr/
>> http://datamonad.com
>>
>> Here is a description of MR3 and Hive-MR3 from the webpage:
>>
>> MR3 is a new execution engine for Hadoop. Similar in spirit to Tez, it
>> can be thought of as an enhancement of Tez with simpler design, better
>> performance, and more features. MR3 is ready for production use as it
>> supports all major features from Tez such as Kerberos-based security,
>> authentication and authorization, fault-tolerance, and recovery. MR3 is
>> implemented in Scala.
>>
>> Hive-MR3 is an extension of Hive that runs on top of MR3. In order to
>> exploit new features in MR3, Hive-MR3 is built on a modified backend of
>> Hive. In comparison with Hive-on-Tez, Hive-MR3 generally runs faster for
>> sequential queries by virtue of the simple architectural design of
>> ApplicationMaster in MR3. In particular, it makes better use of
>> computing resources and thus yields a higher throughput for concurrent
>> queries.
>>
>> --- Sungwoo Park
>>
>> --
> Thai
>


Re: Ways to reduce launching time of query in Hive 2.2.1

2018-04-16 Thread Sungwoo Park
Do you use Tez session pool along with LLAP (as Thai suggests in the
previous reply)? If a new query finds an idle AM in Tez session pool, there
will be no launch cost for AM. If no idle AM is found or if you specify a
queue name, a new AM should start in order to serve the query. This is
explained in detail in the following article (see 'Understanding #4'):

https://community.hortonworks.com/articles/56636/hive-understanding-concurrent-sessions-queue-alloc.html

Hence, if not enough AMs are available in Tez session pool, new queries
will have to wait until old queries are finished. If there are not many
concurrent queries, I guess using Tez session pool will solve your issue.
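
For reference, the session pool is controlled by settings like the following
(a sketch; they normally go into hive-site.xml for HiveServer2, and the
values here are only examples):

set hive.server2.tez.initialize.default.sessions=true;
set hive.server2.tez.default.queues=default;
set hive.server2.tez.sessions.per.default.queue=4;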

In a highly concurrent setting, Hive-MR3 practically eliminates this
limitation. In Hive-MR3, HiveServer2 in shared session mode launches a
single AppMaster to be shared by all incoming queries, so there is no
launch cost. Containers are also shared by all queries and thus run like
daemons.

https://mr3.postech.ac.kr/hivemr3/features/hiveserver2/

Hive-MR3 0.1 does not support LLAP IO yet, but Hive-MR3 0.2 (to be released
by the end of this month) will support LLAP IO.

--- Sungwoo Park




On Mon, Apr 16, 2018 at 11:33 PM, Anup Tiwari <anupsdtiw...@gmail.com>
wrote:

> Hi All,
>
> We have a use case where we need to return output in < 10 sec. We have
> evaluated different set of tool for execution and they work find but they
> do not cover all cases as well as they are not reliable(since they are in
> evolving phase). But Hive works well in this context.
>
> Using Hive LLAP, we have reduced query time to 6-7sec. But query launching
> takes ~12-15 sec due to which response time becomes 18-21 sec.
>
> Is there any way we can reduce this launching time?
>
> Please note that we have tried prewarm containers but when we are
> launching query from hive client then it is not picking containers from
> already initialized containers rather it launches its own.
>
> Please let me know how can we overcome this issue since this is the only
> problem which is stopping us from using Hive. Any links/description is
> really appreciated.
>
>
> Regards,
> Anup Tiwari
>


Re: FW: NPE in hive 2.3.x during window operator

2018-03-25 Thread Sungwoo Park
Not suggesting a solution or a workaround, but I am curious if you produced
this NPE when running with LLAP, or even without LLAP (Hive 2.3.2).

For your information, running the TPC-DS benchmark (1TB scale) without LLAP
did not produce this NPE. But in other environments, the benchmark produced
this NPE.

--- Sungwoo Park

On Wed, Mar 21, 2018 at 9:24 AM, Anuj Lal <a...@lendingclub.com> wrote:

>
>
>
> We are also facing the issue as described  in
>
> https://issues.apache.org/jira/browse/HIVE-18786?page=
> com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel
>
>
>
> Any body facing similar and know about any work around
>
>
>
> AL
>
>
>
>
> DISCLAIMER: The information transmitted is intended only for the person or
> entity to which it is addressed and may contain confidential and/or
> privileged material. Any review, re-transmission, dissemination or other
> use of, or taking of any action in reliance upon this information by
> persons or entities other than the intended recipient is prohibited except
> as set forth below. If you received this in error, please contact the
> sender and destroy any copies of this document and any attachments. Email
> sent or received by LendingClub Corporation, its subsidiaries and
> affiliates is subject to archival, monitoring and/or review by and/or
> disclosure to someone other than the recipient.
>


Announce: MR3 0.4 released

2018-11-01 Thread Sungwoo Park
I am pleased to announce the release of MR3 0.4. A new feature of MR3 0.4
is its support for Hive 3.1.0 and Hadoop 3.1.0.

As with previous releases, I have published a blog article that evaluates
the performance of popular SQL-on-Hadoop systems. It compares the following
systems using both sequential tests and concurrency tests:

1. Hive-LLAP included in HDP 2.6.4
1'. Hive-LLAP included in HDP 3.0.1
2. Presto 0.203e (with cost-based optimization enabled)
2'. Presto 0.208e (with cost-based optimization enabled)
3. SparkSQL 2.2.0 included in HDP 2.6.4
3'. SparkSQL 2.3.1 included in HDP 3.0.1
4. Hive 3.1.0 running on top of Tez
4'. Hive on Tez included in HDP 3.0.1
5. Hive 3.1.0 running on top of MR3 0.4
6. Hive 2.3.3 running on top of MR3 0.4

I use three clusters (11 nodes, 21 nodes, 42 nodes) in the experiment. The
blog article can be found at:

https://mr3.postech.ac.kr/blog/2018/10/31/performance-evaluation-0.4/

You can download MR3 0.4 at:

https://mr3.postech.ac.kr/download/home/

--- Sungwoo Park


Announce: MR3 0.6 released

2019-03-23 Thread Sungwoo Park
I am pleased to announce the release of MR3 0.6. New key features are:

- In Hive on Kubernetes, DAGAppMaster can run in its own Pod.
- MR3-UI requires only Timeline Server.
- Hive on MR3 is much more stable because it supports memory monitoring
when loading hash tables for Map-side join.

You can download MR3 0.6 at:

https://mr3.postech.ac.kr/download/home/

With the release of MR3 0.6, I ran experiments to compare the performance
of Impala, Presto, and Hive on MR3. The result can be found in a new
article:

https://mr3.postech.ac.kr/blog/2019/03/22/performance-evaluation-0.6/

I hope you enjoy reading the article.

--- Sungwoo


Announce: MR3 0.5 released (with Hive on Kubernetes)

2019-02-20 Thread Sungwoo Park
I am pleased to announce the release of MR3 0.5.

A new feature of MR3 0.5 is its native support for Kubernetes. Now the user
can run Hive on Kubernetes by using MR3 as its execution engine. The
documentation on Hive on Kubernetes can be found at:

https://mr3.postech.ac.kr/hivek8s/home/

MR3 0.5 also supports Hive 3.1.1 and Hive 2.3.4.

You can download MR3 0.5 at:

https://mr3.postech.ac.kr/download/home/

--- Sungwoo Park


Re: Hive on Tez vs Impala

2019-04-15 Thread Sungwoo Park
I tested the performance of Impala 2.12.0+cdh5.15.2+0 in Cloudera CDH
5.15.2 a while ago. I compared it with Hive 3.1.1 on MR3 (where MR3 is a
new execution engine for Hadoop and Kubernetes). You can find the result at:

https://mr3.postech.ac.kr/blog/2019/03/22/performance-evaluation-0.6/

On average, Hive on MR3 is about 30% faster than Hive on Tez on sequential
queries. For concurrent queries, the throughput of Hive on MR3 is about
three times higher than Hive on Tez (when tested with 16 concurrent
queries). You can find the result at:

https://mr3.postech.ac.kr/blog/2018/10/30/performance-evaluation-0.4/

--- Sungwoo Park

On Mon, Apr 15, 2019 at 8:44 PM Artur Sukhenko 
wrote:

> Hi,
> We are using CDH 5, with Impala  2.7.0-cdh5.9.1  and Hive 1.1 (MapReduce)
> I can't find the info regarding Hive on Tez performance compared to Impala.
> Does someone know or compared it?
>
> Thanks
>
> Artur Sukhenko
>


Announce: MR3 0.7 released

2019-04-27 Thread Sungwoo Park
I am pleased to announce the release of MR3 0.7. New key features are:

- MR3 is fully fault-tolerant. (
https://mr3.postech.ac.kr/mr3/features/fault-tolerance/)
- The logic for node blacklisting has been simplified. (
https://mr3.postech.ac.kr/mr3/features/blacklisting/)

I have tested the correctness of Hive on MR3 by cross-checking the results
of the TPC-DS benchmark against Impala and Presto. Nothing particularly
surprising, except that for some queries, Impala returns different results
than Hive on MR3 and Presto. Hive on MR3 is consistent with Presto for the
TPC-DS benchmark.

I have created an FAQ page:

  https://mr3.postech.ac.kr/faq/home/

You can download MR3 0.7 at:

  https://mr3.postech.ac.kr/download/home/

--- Sungwoo Park


Re: Announce: MR3 0.8 released

2019-06-27 Thread Sungwoo Park
I have created a quick start guide showing how to run Hive-MR3 on
Kubernetes using Minikube on a single machine. If you are interested in
trying Hive on Kubernetes on your laptop, please check out this page:

https://mr3.postech.ac.kr/quickstart/hivek8s/run-k8s/

--- Sungwoo

On Wed, Jun 26, 2019 at 7:56 PM Sungwoo Park  wrote:

> I am pleased to announce the release of MR3 0.8. New features are:
>
> -- Hive on MR3 on Yarn fully supports recovery:
> https://mr3.postech.ac.kr/hivemr3/features/recovery/
>
> -- Hive on MR3 on Yarn supports high availability in which multiple
> HiveServer2 instances share a common DAGAppMaster (and a common pool of
> ContainerWorkers):
> https://mr3.postech.ac.kr/hivemr3/features/high-availability/
>
> -- Hive on Kubernetes supports Apache Ranger and Timeline Server:
> https://mr3.postech.ac.kr/hivek8s/guide/run-timeline/
> https://mr3.postech.ac.kr/hivek8s/guide/run-ranger/
>
> From the release notes:
>
> A new DAGAppMaster properly recovers DAGs that have not been completed in
> the previous DAGAppMaster.
> Fault tolerance after fetch failures works much faster.
> On Kubernetes, the shutdown handler of DAGAppMaster deletes all running
> Pods.
> On both Yarn and Kubernetes, MR3Client automatically connects to a new
> DAGAppMaster after an initial DAGAppMaster is killed.
> Hive 3 for MR3 supports high availability on Yarn via ZooKeeper.
> On both Yarn and Kubernetes, multiple HiveServer2 instances can share a
> common MR3 DAGAppMaster (and thus all its ContainerWorkers as well).
> Hive on Kubernetes supports Apache Ranger.
> Hive on Kubernetes supports Timeline Server.
>
> You can download MR3 0.8 at:
>
>   https://mr3.postech.ac.kr/download/home/
>
> --- Sungwoo Park
>


Fwd: Announce: MR3 0.8 released

2019-06-26 Thread Sungwoo Park
I am pleased to announce the release of MR3 0.8. New features are:

-- Hive on MR3 on Yarn fully supports recovery:
https://mr3.postech.ac.kr/hivemr3/features/recovery/

-- Hive on MR3 on Yarn supports high availability in which multiple
HiveServer2 instances share a common DAGAppMaster (and a common pool of
ContainerWorkers):
https://mr3.postech.ac.kr/hivemr3/features/high-availability/

-- Hive on Kubernetes supports Apache Ranger and Timeline Server:
https://mr3.postech.ac.kr/hivek8s/guide/run-timeline/
https://mr3.postech.ac.kr/hivek8s/guide/run-ranger/

From the release notes:

A new DAGAppMaster properly recovers DAGs that have not been completed in
the previous DAGAppMaster.
Fault tolerance after fetch failures works much faster.
On Kubernetes, the shutdown handler of DAGAppMaster deletes all running
Pods.
On both Yarn and Kubernetes, MR3Client automatically connects to a new
DAGAppMaster after an initial DAGAppMaster is killed.
Hive 3 for MR3 supports high availability on Yarn via ZooKeeper.
On both Yarn and Kubernetes, multiple HiveServer2 instances can share a
common MR3 DAGAppMaster (and thus all its ContainerWorkers as well).
Hive on Kubernetes supports Apache Ranger.
Hive on Kubernetes supports Timeline Server.

You can download MR3 0.8 at:

  https://mr3.postech.ac.kr/download/home/

--- Sungwoo Park


Fwd: Article on the correctness of Hive on MR3, Presto, and Impala

2019-06-26 Thread Sungwoo Park
I have published a new article on the correctness of Hive on MR3, Presto,
and Impala:

https://mr3.postech.ac.kr/blog/2019/06/26/correctness-hivemr3-presto-impala/

Hope you enjoy reading the article.

--- Sungwoo


Re: Article on the correctness of Hive on MR3, Presto, and Impala

2019-06-26 Thread Sungwoo Park
I think yes -- if you would like to scrutinize the results, perhaps sorting
and conducting diff would be the best way. If you would like to test the
results quickly with a bit of uncertainty allowed, I guess comparing the
number of rows would be sufficient because two different results are
unlikely to contain the same number of rows, e.g., 440704 rows.

--- Sungwoo

On Wed, Jun 26, 2019 at 9:01 PM Edward Capriolo 
wrote:

> I like the approach of applying an arbitrary limit. Hive's q files tend to
> add an ordering to everything. Would it make sense to simply order by
> multiple columns in the result set and conduct a large diff on them?
>
> On Wednesday, June 26, 2019, Sungwoo Park  wrote:
>
>> I have published a new article on the correctness of Hive on MR3, Presto,
>> and Impala:
>>
>>
>> https://mr3.postech.ac.kr/blog/2019/06/26/correctness-hivemr3-presto-impala/
>>
>> Hope you enjoy reading the article.
>>
>> --- Sungwoo
>>
>>
>
> --
> Sorry this was sent from mobile. Will do less grammar and spell check than
> usual.
>


Re: Filters with IN clause are getting omitted

2019-04-23 Thread Sungwoo Park
Not a solution to the problem on HDP 2.6.5, but I have tested the first
script in Hive 2.3.4 and Hive 3.1.1. On Hive 2.3.4, it returns 1 row, and
on Hive 3.1.1, it returns no row. So, I guess the bug is still in HDP 2.6.5.

--- Sungwoo

On Tue, Apr 23, 2019 at 7:40 PM Rajat Khandelwal  wrote:

> Hi
>
> I've recently noticed incorrect behaviour from Hive Query Planner. The
> simplest example I could construct is as follows
>
> SELECT
> tbl3.col2 AS current_regularity_streak
> FROM (select 1 col1) tbl1
> LEFT JOIN
> (select 1 col1) tbl2 ON tbl1.col1 = tbl2.col1
> LEFT JOIN (select 1 col1, 1 col2) tbl3 ON tbl1.col1 = tbl3.col1
> WHERE
> tbl1.col1 in (select 1 col1 union all select 2)
> AND
> tbl3.col2 >= 2
>
>
> The query should logically return no rows, but it does! It returns 1 row
> with 1 column, with value = 1. The value=1 should have been filtered out by
> the filter tbl3.col2 >= 2
>
> On further examination, I believe the culprit is the IN clause. If I
> remove this, the query works correctly and returns 0 rows.
>
> SELECT
> tbl3.col2 AS current_regularity_streak
> FROM (select 1 col1) tbl1
> LEFT JOIN
> (select 1 col1) tbl2 ON tbl1.col1 = tbl2.col1
> LEFT JOIN (select 1 col1, 1 col2) tbl3 ON tbl1.col1 = tbl3.col1
> WHERE
> tbl3.col2 >= 2
>
> Is this a known issue? I couldn't find anything on JIRA/Stack
> overflow/Google.
>
> I further analyzed using EXPLAIN FORMATTED and noticed that the plan of
> the first query doesn't contain the >=2 predicate. The plan of the second
> query does. I wonder how the planner could omit the filter clause
> altogether?
>
> I'm using HDP 2.6.5.10-2.
>


Presto 317 vs Hive on MR3 0.10 (snapshot)

2019-08-22 Thread Sungwoo Park
Hello Hive users,

I have published a new article that compares Presto 317 and Hive 3.1.1 on
MR3 0.10 (snapshot).

https://mr3.postech.ac.kr/blog/2019/08/22/comparison-presto317-0.10/

I haven't tested myself, but I guess Hive-LLAP also runs much faster than
Presto.

--- Sungwoo


Re: Apache Hive 2.3.4 - Issue with combination of Like operator & newline (\n) character in data

2019-07-29 Thread Sungwoo Park
Not a solution, but one can use \n in the search string, e.g.:

select * from default.withdraw where id like '%withdraw\ncash';
select * from default.withdraw where id like '%withdraw%\ncash';
select * from default.withdraw where id like '%withdraw%\n%cash';

--- Sungwoo


On Tue, Jul 30, 2019 at 12:58 AM Shankar Mane  wrote:

> Can anyone looking at this issue ?
>
> On Sat, Jul 20, 2019 at 9:08 AM Shankar Mane  wrote:
>
>> Have created jira at https://issues.apache.org/jira/browse/HIVE-22008
>> 
>>
>> On Wed, 17 Jul 2019, 16:44 Shankar Mane,  wrote:
>>
>>> Hi All,
>>>
>>> I am facing some issues while using Like operator & newline (\n)
>>> character. Below is the in details description :
>>>
>>> *-- Hive Queries
>>> *
>>>
>>> create table default.withdraw(
>>> id string
>>> ) stored as parquet;
>>>
>>>
>>> insert into default.withdraw select 'withdraw\ncash';
>>>
>>>
>>> *--1)  result = success*
>>>
>>> hive> select * from default.withdraw where id like '%withdraw%';
>>> OK
>>> withdraw
>>> cash
>>> Time taken: 0.078 seconds, Fetched: 1 row(s)
>>>
>>>
>>> *--2) **result = wrong*
>>>
>>> hive> select * from default.withdraw where id like '%withdraw%cash';
>>> OK
>>> Time taken: 0.066 seconds
>>>
>>>
>>> *--3) **result = success*
>>>
>>> hive> select * from default.withdraw where id like '%cash%';
>>> OK
>>> withdraw
>>> cash
>>> Time taken: 0.086 seconds, Fetched: 1 row(s)
>>>
>>>
>>>
>>> *-- Presto Queries
>>> -*
>>> FYI - Presto (v0.221) is using above table meta store. We tested above
>>> queries on presto too.
>>>
>>> *--1)  **result = **success*
>>>
>>> presto> select * from default.withdraw where id like '%withdraw%';
>>>id
>>> --
>>> withdraw
>>> cash
>>> (1 row)
>>>
>>>
>>> *--2) **result = **success*
>>>
>>> presto> select * from default.withdraw where id like '%withdraw%cash';
>>>id
>>> --
>>> withdraw
>>> cash
>>> (1 row)
>>>
>>>
>>> *--3) **result = **success*
>>>
>>> presto> select * from default.withdraw where id like '%cash%';
>>>id
>>> --
>>> withdraw
>>> cash
>>> (1 row)
>>>
>>> *-- *
>>> *-- *
>>>
>>> Please help here in case i am missing anything.
>>>
>>> regards,
>>> shankar
>>>
>>>
>>>


Announce: MR3 0.9 released

2019-07-25 Thread Sungwoo Park
I am pleased to announce the release of MR3 0.9. New features are:

* LLAP I/O works properly on Kubernetes.
https://mr3.postech.ac.kr/hivek8s/features/llap-io/

* SSL (Secure Sockets Layer) is fully supported on Kubernetes. Now one can
run HiveServer2, Ranger, Timeline Server, and Metastore securely.
https://mr3.postech.ac.kr/hivek8s/guide/enable-ssl/

* Hive on Kubernetes can read from multiple HDFS sources.
https://mr3.postech.ac.kr/hivek8s/guide/nonsecure-hdfs/

* Multiple HiveServer2 instances, each with its own Metastore, can share
DAGAppMaster and the ContainerWorker pool (thus simulating a serverless
environment.) https://mr3.postech.ac.kr/hivek8s/guide/multiple-metastores/

* UDFs work okay on Kubernetes.

You can download MR3 0.9 at:

  https://mr3.postech.ac.kr/download/home/

--- Sungwoo Park


Re: Announce: MR3 0.8 released

2019-06-28 Thread Sungwoo Park
I have uploaded a video demonstrating Hive on Kubernetes using MR3:

https://youtu.be/1NB7GtI8NXM

--- Sungwoo

On Fri, Jun 28, 2019 at 4:44 AM Sungwoo Park  wrote:

> I have created a quick start guide showing how to run Hive-MR3 on
> Kubernetes using Minikube on a single machine. If you are interested in
> trying Hive on Kubernetes on your laptop, please check out this page:
>
> https://mr3.postech.ac.kr/quickstart/hivek8s/run-k8s/
>
> --- Sungwoo
>
> On Wed, Jun 26, 2019 at 7:56 PM Sungwoo Park  wrote:
>
>> I am pleased to announce the release of MR3 0.8. New features are:
>>
>> -- Hive on MR3 on Yarn fully supports recovery:
>> https://mr3.postech.ac.kr/hivemr3/features/recovery/
>>
>> -- Hive on MR3 on Yarn supports high availability in which multiple
>> HiveServer2 instances share a common DAGAppMaster (and a common pool of
>> ContainerWorkers):
>> https://mr3.postech.ac.kr/hivemr3/features/high-availability/
>>
>> -- Hive on Kubernetes supports Apache Ranger and Timeline Server:
>> https://mr3.postech.ac.kr/hivek8s/guide/run-timeline/
>> https://mr3.postech.ac.kr/hivek8s/guide/run-ranger/
>>
>> From the release notes:
>>
>> A new DAGAppMaster properly recovers DAGs that have not been completed in
>> the previous DAGAppMaster.
>> Fault tolerance after fetch failures works much faster.
>> On Kubernetes, the shutdown handler of DAGAppMaster deletes all running
>> Pods.
>> On both Yarn and Kubernetes, MR3Client automatically connects to a new
>> DAGAppMaster after an initial DAGAppMaster is killed.
>> Hive 3 for MR3 supports high availability on Yarn via ZooKeeper.
>> On both Yarn and Kubernetes, multiple HiveServer2 instances can share a
>> common MR3 DAGAppMaster (and thus all its ContainerWorkers as well).
>> Hive on Kubernetes supports Apache Ranger.
>> Hive on Kubernetes supports Timeline Server.
>>
>> You can download MR3 0.8 at:
>>
>>   https://mr3.postech.ac.kr/download/home/
>>
>> --- Sungwoo Park
>>
>


Hive-LLAP in HDP 3.1.4 vs Hive 3/4 on MR3 0.10

2019-11-03 Thread Sungwoo Park
I have published a new article that compares: Hive-LLAP in HDP 3.1.4, Hive
3.1.2 on MR3 0.10, and Hive 4.0.0-SNAPSHOT on MR3 0.10. You can find the
result at:

https://mr3.postech.ac.kr/blog/2019/11/03/hive-performance-0.10/

Cheers,

--- Sungwoo


Re: Hive Not Returning YARN Application Results Correctly Nor Inserting Into Local Tables

2019-11-06 Thread Sungwoo Park
For the problem of not returning the result to the console, I think it
occurs because the default file system is set to local file system, not to
HDFS. Perhaps hive.exec.scratchdir is already set to /tmp/hive, but if the
default file system is local, FileSinkOperator writes the final result to
the local file system of the container where it is running. Then
HiveServer2 tries to read from a subdirectory under /tmp/hive of its own
local file system, thus returning an empty result. (The query 'select *
from ...' works okay because it is taken care of by HiveServer2 itself.)

I can think of two solutions: 1) set the default file system to HDFS (e.g.,
by updating core-site.xml); 2) embed the file system directly into
hive.exec.scratchdir (e.g., by setting it to hdfs:///tmp/hive).
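
As a sketch of the two options (host names and paths below are only
placeholders):

-- option 1: in core-site.xml, set the default file system to HDFS, e.g.
--   fs.defaultFS = hdfs://<namenode-host>:8020

-- option 2: qualify the scratch directory with the file system explicitly
set hive.exec.scratchdir=hdfs:///tmp/hive;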

--- gla

On Thu, Nov 7, 2019 at 3:12 AM Aaron Grubb 
wrote:

> Hello all,
>
>
>
> I'm running a from-scratch cluster on AWS EC2. I have an external table
> (partitioned) defined with data on S3. I'm able to query this table and
> receive results to the console with a simple select * statement:
>
>
>
>
> 
>
> hive> set hive.execution.engine=tez;
>
> hive> select * from external_table where partition_1='1' and
> partition_2='2';
>
> [correct results returned]
>
>
> 
>
>
>
> Running a query that requires Tez doesn't return the results to the
> console:
>
>
>
>
> 
>
> hive> set hive.execution.engine=tez;
>
> hive> select count(*) from external_table where partition_1='1' and
> partition_2='2';
>
> Status: Running (Executing on YARN cluster with App id
> application_1572972524483_0012)
>
>
>
> OK
>
> +--+
>
> | _c0 |
>
> +--+
>
> +--+
>
> No rows selected (8.902 seconds)
>
>
> 
>
>
>
> However, if I dig in the logs and on the filesystem, I can find the
> results from that query:
>
>
>
>
> 
>
> (yarn.resourcemanager.log)
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root
> OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS
> APPID=application_1572972524483_0022
> CONTAINERID=container_1572972524483_0022_01_02 RESOURCE= vCores:1> QUEUENAME=default
>
> (container_folder/syslog_attempt) [TezChild] |exec.FileSinkOperator|: New
> Final Path: FS file:/tmp/[REALLY LONG FILE PATH]/00_0
>
> [root #] cat /tmp/[REALLY LONG FILE PATH]/00_0
>
> SEQ"org.apache.hadoop.io.BytesWritableorg.apache.hadoop.io.Textl▒ꩇ1som}▒▒
> j¹▒ 2060
>
>
> 
>
>
>
> 2060 is the correct count for the partition.
>
>
>
> Now, oddly enough, I'm able to get the results from the application if I
> insert overwrite directory on HDFS:
>
>
>
>
> 
>
> hive> set hive.execution.engine=tez;
>
> hive> INSERT OVERWRITE DIRECTORY '/tmp/local_out' select count(*) from
> external_table where partition_1='1' and partition_2='2';
>
> [root #] hdfs dfs -cat /tmp/local_out/00_0
>
> 2060
>
>
> 
>
>
>
> However, attempting to insert overwrite local directory fails:
>
>
>
>
> 
>
> hive> set hive.execution.engine=tez;
>
> hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' select count(*)
> from external_table where partition_1='1' and partition_2='2';
>
> [root #] cat /tmp/local_out/00_0
>
> cat: /tmp/local_out/00_0: No such file or directory
>
>
> 
>
>
>
> If I cat the container result file for this query, it's only the number,
> no class name or special characters:
>
>
>
>
> 
>
> [root #] cat /tmp/[REALLY LONG FILE PATH]/00_0
>
> 2060
>
>
> 
>
>
>
> The only out-of-place log message I can find comes from the YARN
> ResourceManager log:
>
>
>
>
> 
>
> (yarn.resourcemanager.log) INFO
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root
> OPERATION=AM Released Container 

Announce: MR3 0.11 released

2019-12-04 Thread Sungwoo Park
I am pleased to announce the release of MR3 0.11. The new feature in MR3
0.11 is the support for autoscaling. Autoscaling works on both Hadoop and
Kubernetes.

I have added a quick start guide on running Hive on MR3 on Amazon EKS with
autoscaling. Autoscaling on Amazon EKS (which uses Kubernetes Autoscaler)
is much faster than on Amazon EMR. For example, on EKS, a scale-out
operation takes about 2 minutes to attach a new node, and a scale-in
operation takes slightly longer than 1 minute to retire an existing node.
In contrast, on EMR, the user should wait at least 5 minutes to trigger
scale-in/out. So if you would like to try autoscaling with Hive on MR3, we
suggest EKS instead of EMR.

https://mr3.postech.ac.kr/quickstart/aws/run-eks-autoscaling/

You can download MR3 0.11 at:

https://mr3.postech.ac.kr/download/home/

Cheers,

--- Sungwoo Park


Re: Hive 1.1.0 support on hive metastore 2.3.0

2019-12-09 Thread Sungwoo Park
I didn't try to run multiple versions of Hive on the same cluster. If your
installation of Hive uses Tez installed on the Hadoop system, I guess
running multiple versions of Hive might not be easy because different
versions of Hive use different versions of Tez (especially if you want to
run Hive-LLAP). For me, I use Hive on MR3 (well, because we created MR3).
You can run all of Hive 1, 2, 3 concurrently on the same cluster. And there
is no installation process -- just unpack the tarballs and edit the
configuration files, and you are all set.

--- Sungwoo

On Mon, Dec 9, 2019 at 6:27 PM Priyam Gupta  wrote:

> Thanks Park for sharing the tests that you did. I will try out for the
> specific version for my use cases.
>
> Is there any better way/approach where we can have a single meta store and
> can launch multiple hive clusters of different versions point to the same
> metastore db.
>
> Thanks.
>
>
>
>
> On Mon, Dec 9, 2019 at 12:46 PM Sungwoo Park  wrote:
>
>> Not a definitive answer, but my test result might help. I tested with
>> HiveServer2 1.2.2 and Metastore 2.3.6. Queries in the TPC-DS benchmark
>> (which only read data and never update) run okay. Creating new tables and
>> loading data to tables also work okay. So, I guess for basic uses of Hive,
>> running HiveServer2 1.2 against Metastore 2.3 should be fine.
>>
>> What I am not sure about is whether queries involving transactions work
>> okay when HiveServer2 1.2 connects to Metastore 2.3. Please note that I
>> have not tested with Hive 1.1. Perhaps those familiar with internals of
>> Metastore in this mailing list could give more accurate answers.
>>
>> As a side note, HiveServer2 2.3 can connect to Metastore 3.1. However,
>> HiveServer2 1.2 seems to have a problem with connecting to Metastore 3.1.
>>
>> Cheers,
>>
>> --- gla
>>
>> On Mon, Dec 9, 2019 at 2:38 PM Priyam Gupta  wrote:
>>
>>>
>>>
>>> Hi,
>>>
>>> In order to have single hive metastore supporting different clusters of
>>> hive, I have upgraded my hive metastore schema (on postgres) from version
>>> 1.1.0  to version 2.3.0. Will hive cluster on version 1.1.0 work if I point
>>> it to upgraded metastore db which is as per hive version 2.3.0.
>>>
>>>
>>> Thanks,
>>> Priyam
>>>
>>


Re: Hive 1.1.0 support on hive metastore 2.3.0

2019-12-08 Thread Sungwoo Park
Not a definitive answer, but my test result might help. I tested with
HiveServer2 1.2.2 and Metastore 2.3.6. Queries in the TPC-DS benchmark
(which only read data and never update) run okay. Creating new tables and
loading data to tables also work okay. So, I guess for basic uses of Hive,
running HiveServer2 1.2 against Metastore 2.3 should be fine.

What I am not sure about is whether queries involving transactions work
okay when HiveServer2 1.2 connects to Metastore 2.3. Please note that I
have not tested with Hive 1.1. Perhaps those familiar with internals of
Metastore in this mailing list could give more accurate answers.

As a side note, HiveServer2 2.3 can connect to Metastore 3.1. However,
HiveServer2 1.2 seems to have a problem with connecting to Metastore 3.1.

Cheers,

--- gla

On Mon, Dec 9, 2019 at 2:38 PM Priyam Gupta  wrote:

>
>
> Hi,
>
> In order to have single hive metastore supporting different clusters of
> hive, I have upgraded my hive metastore schema (on postgres) from version
> 1.1.0  to version 2.3.0. Will hive cluster on version 1.1.0 work if I point
> it to upgraded metastore db which is as per hive version 2.3.0.
>
>
> Thanks,
> Priyam
>


Announce: MR3 0.10 released

2019-10-18 Thread Sungwoo Park
I am pleased to announce the release of MR3 0.10. New features are:

* TaskScheduler supports a new scheduling policy (specified by
mr3.taskattempt.queue.scheme) which significantly improves the throughput
for concurrent queries. From our internal experiments using TPC-DS
benchmarks, the new scheduling policy increases the throughput 20 to 30
percent at concurrency level 16 to 32. Now Hive on MR3 (based on Hive 3.1.2)
delivers almost twice the throughput of Hive-LLAP included in HDP 3.

* Compaction sends DAGs to MR3, instead of MapReduce, when
hive.mr3.compaction.using.mr3 is set to true. As it no longer depends on
MapReduce for compaction, Hive on MR3 supports compaction on Kubernetes as
well.

* Helm charts are supported on Kubernetes.

* MR3 supports Hive 3.1.2 and 2.3.6. It also supports Hive 4.0.0-SNAPSHOT.

* Scripts for running HPL/SQL are included, both for Hadoop and Kubernetes.

* LlapDecider and ConvertJoinMapJoin directly ask MR3 DAGAppMaster for
the current number of Task slots (similar to executors in Hive-LLAP) and
nodes.

* DAGAppMaster recovers from OutOfMemoryErrors due to the exhaustion of
threads.

You can download MR3 0.10 at:

https://mr3.postech.ac.kr/download/home/

Cheers,

--- Sungwoo Park


All-in-One Docker image for running Hive on MR3 + Ranger + Timeline Server

2019-10-18 Thread Sungwoo Park
We have built a Docker image for running all of Hive on MR3 (based on Hive
3.1.2), Ranger 2.0, and Timeline Server in Hadoop 2.7.7. The Docker image
should be useful for quickly testing Hive features, playing with Ranger,
visualizing DAGs in MR3, and so on, all on a single machine.

For creating containers from the Docker image, see:

https://mr3.postech.ac.kr/quickstart/docker/run-docker/

For the instruction on building the Docker image, see:

https://mr3.postech.ac.kr/quickstart/docker/build-docker/

See the Docker Hub repository for a pre-built image
(glaparkdocker/hivemr3-all).

https://hub.docker.com/u/glaparkdocker

Cheers,

--- Sungwoo Park


Re: How to decide Hive Cluster capacity

2019-12-21 Thread Sungwoo Park
I think this problem of choosing a cluster capacity is really challenging
because the desired cluster capacity depends not only on the size of the
dataset but also on the complexity of queries. For example, the execution
time of the TPC-DS queries on the same dataset can range from sub-10
seconds to thousands of seconds. Moreover, the desired cluster capacity may
fluctuate over time. For example, one may want a large cluster during busy
hours, but a small cluster at night. So, I think it is case-by-case and
depends on the size of the dataset, types of queries executed, and the
amount of workload in terms of the number of concurrent queries, and so on.

Because of the difficulty of choosing the right cluster capacity and
unpredictability of workload, I think people are looking for solutions with
autoscaling on public clouds, where the cluster capacity increases and
decreases automatically. I guess most of the commercial solutions offered
on public clouds support autoscaling in one way or another.

--- Sungwoo


On Wed, Dec 18, 2019 at 2:40 AM Sai Teja Desu <
saiteja.d...@globalfoundries.com> wrote:

> Hello All,
>
> I'm looking for a methodology on what basis we should decide the cluster
> capacity for Hive.
>
> Can anyone recommend best practices to choose a cluster capacity for
> querying data efficiently in Hive. Please note that, we have external
> tables in Hive pointing to S3, so we just use Hive for querying the data.
>
> *Thanks,*
> *Sai.*
>


Re: rename output error during hive query on AWSs3-external table

2020-02-04 Thread Sungwoo Park
Not a solution, but looking at the source code of S3AFileSystem.java
(Hadoop 2.8.5), I think the Exception raised inside S3AFileSystem.rename()
is swallowed and only a new HiveException is reported. So, in order to find
out the root cause, I guess you might need to set Log level to DEBUG and
see what Exception is raised inside S3AFileSystem.innerRename() (which is
called from rename()). S3AFileSystem.innerRename() is implemented as
copy-and-delete, so nasty things could happen inside the method.
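
As a sketch, DEBUG logging for the S3A connector can usually be turned on by
adding a line like the following to the log4j properties used by the
Hive/Hadoop processes (the exact file to edit depends on your setup):

log4j.logger.org.apache.hadoop.fs.s3a=DEBUG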

Related Hive JIRA: https://issues.apache.org/jira/browse/HIVE-20517

--- gla



On Tue, Feb 4, 2020 at 5:05 PM Souvikk Roy  wrote:

> Hello,
>
> We are using some external tables backed by aws S3. And we are
> intermittently getting this error, most likely at the last stage of the
> reduce, I see some similar posts in net but could not find any solution, Is
> there any way yo solve it:
>
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to rename output
> from:
> s3a://xxx/pn/.hive-staging_hive_2020-01-27_08-05-54_195_618381349208890136-176126/_task_tmp.-ext-1/year=2019/product=xxx/abc=2019-12-16/xxx=BCD/yz=x/pattern=xxx/_tmp.01_0
> to:
> s3a://xxx/tables/xxx//xxx/xx//xx/.hive-staging_hive_2020-01-27_08-05-54_195_6
> at
> org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.commit(FileSinkOperator.java:236)
> at
> org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.access$200(FileSinkOperator.java:133)
> at
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:1014)
> at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:598)
> at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
> at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
> at
> org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:287)
> at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:453)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
>
> thanks
> Souvik
>


Announce: MR3 1.0 released

2020-02-19 Thread Sungwoo Park
MR3 1.0 has been released by DataMonad. The documentation on MR3 is now at
MR3docs, and the old site will no longer be updated.

https://www.datamonad.com/
https://mr3docs.datamonad.com/

With the release of MR3 1.0, I have written an article on testing MR3. It's
a long article, but could be useful to those interested in trying MR3 in
production.

https://www.datamonad.com/post/2020-02-19-testing-mr3/

Cheers,

--- Sungwoo Park


Re: Issues with aggregating on map values

2020-02-21 Thread Sungwoo Park
I tested the example on Hive 2.3.6, and it returned correct results. Hive
3.1.2 and 4.0.0-SNAPSHOT also returned correct results. So, I guess, if
this is a bug, it was introduced somewhere around Hive 3.0 and fixed in
3.1.2.

On Hive 2.3.6, I used these commands instead:
create table dummy(a string); insert into table dummy values ('a');
insert into foo select map("country", "USA"), 10 from dummy;
insert into foo select map("country", "USA"), 20 from dummy;
insert into foo select map("country", "UK"), 30 from dummy;

Cheers,

--- Sungwoo


On Wed, Feb 12, 2020 at 10:44 PM Nakul Khanna (BLOOMBERG/ LONDON) <
nkhann...@bloomberg.net> wrote:

> Hey Zoltan,
>
> Thanks for the response. When I call "select version()" I get:
>
> 3.1.0.3.1.4.0-315 re8d79f440455fa4400daf79974666b3055f1730f
>
> So a couple of patch versions old - any idea if this was a known bug
> before?
>
> Regards,
> Nakul
>
> From: user@hive.apache.org At: 02/12/20 12:31:33
> To: Nakul Khanna (BLOOMBERG/ LONDON ) ,
> user@hive.apache.org
> Cc: Jacky Lee (BLOOMBERG/ PRINCETON ) , He Chen
> (BLOOMBERG/ PRINCETON ) , Peter Babinski
> (BLOOMBERG/ PRINCETON ) , Bernat Gabor
> (BLOOMBERG/ LONDON ) , Shashank Singh (BLOOMBERG/
> PRINCETON ) 
> Subject: Re: Issues with aggregating on map values
>
> Hey Nakul!
>
> It's not clear which version you are using;
> I've checked this issue on apache/master and the 3.1.2 release - and both
> of
> them returned accurate results.
> You could execute: 'select version()' ; or run 'hive --version' in a
> commandline
>
> cheers,
> Zoltan
>
> On 2/11/20 11:38 AM, Nakul Khanna (BLOOMBERG/ LONDON) wrote:
> > Creating the table
> >
> > CREATE TABLE foo
> > (tags MAP<string,string>, size int);
> > INSERT INTO foo VALUES
> > (map("country", "USA"), 10),
> > (map("country", "USA"), 20),
> > (map("country", "UK"), 30);
> >
> > SELECT * FROM foo
> >
> > ++---+
> > | foo.tags | foo.size |
> > ++---+
> > | {"country":"USA"} | 10 |
> > | {"country":"USA"} | 20 |
> > | {"country":"UK"} | 30 |
> > ++---+
> >
> > Aggregating the Table
> >
> > SELECT DISTINCT tags["country"] from foo;
> >
> > +---+
> > | _c0 |
> > +---+
> > | USA |
> > | NULL |
> > +---+
> >
> > SELECT tags["country"], sum(size) FROM foo GROUP BY tags["country"];
> >
> > +---+--+
> > | _c0 | _c1 |
> > +---+--+
> > | USA | 10 |
> > | NULL | 50 |
> > +---+--+
> >
> > And even more strangely, with a subquery:
> >
> > SELECT flattened.country, sum(flattened.size)
> > FROM (
> > SELECT tags["country"] as country, size
> > FROM foo
> > WHERE tags["country"] IS NOT NULL
> > ) as flattened
> > GROUP BY flattened.country;
> >
> > ++--+
> > | flattened.country | _c1 |
> > ++--+
> > | USA | 10 |
> > ++--+
> >
> > ---
> >
> > Is there any way to stop this from happening and get the correct
> aggregation
> behaviour? The only method I've found is to create a new table using the
> query,
> write that to
> > disk and then do the aggregation on that.
>
>
>


Re: UDF get_splits()

2020-04-05 Thread Sungwoo Park
Hello Eric,

Now I understand why get_splits() returns LlapInputSplit[]. It seems that
get_splits() is not part of public API as there is no documentation about
it (and the user cannot do anything useful using the result of get_splits()
anyway).

Thank you very much for your email and saving me a lot of time!

--- Sungwoo

On Mon, Apr 6, 2020 at 4:38 AM Eric Wohlstadter  wrote:

> Hi Sungoo,
>  get_splits() is used by the HiveWarehouseConnector, you can see an
> example here:
> https://github.com/hortonworks/hive-warehouse-connector/blob/HDP-3.1.5.20-4-tag/src/main/java/com/hortonworks/spark/sql/hive/llap/HiveWarehouseDataSourceReader.java
>
> That being said, I'm not sure if this UDF is technically supported as a
> public API by the Hive community, so you may want to check about that.
>
> Eric
>
> On Sun, Apr 5, 2020 at 11:52 AM Sungwoo Park  wrote:
>
>> Hello,
>>
>> I would like to learn the use of UDF get_splits(). I tried such queries
>> as:
>>
>> select get_splits("select * from web_returns", 1) ;
>> select get_splits("select count(*) from web_returns", 1);
>>
>> These queries just return InputSplit objects, and I would like to see an
>> example that uses the result of calling get_splits(). If someone could
>> share a practical example using get_splits(), I would appreciate it very
>> much.
>>
>> Thanks,
>>
>> --- Sungwoo
>>
>


UDF get_splits()

2020-04-05 Thread Sungwoo Park
Hello,

I would like to learn the use of UDF get_splits(). I tried such queries as:

select get_splits("select * from web_returns", 1) ;
select get_splits("select count(*) from web_returns", 1);

These queries just return InputSplit objects, and I would like to see an
example that uses the result of calling get_splits(). If someone could
share a practical example using get_splits(), I would appreciate it very
much.

Thanks,

--- Sungwoo


Re: Count bug in Hive 3.0.0.3.1

2020-04-28 Thread Sungwoo Park
I have tested the script with Hive 2.3.6, Hive 3.1.2, and Hive
4.0.0-SNAPSHOT (all with minor modifications), and have not found any
problem. So, I guess all the master branches are fine.

If Hive 3.0.0.3.1 is the release included in HDP 3.0.0 or HDP 3.0.1, I
remember that this Hive-LLAP/Tez release was not stable. So, it could be a
problem specific to the release in HDP 3.0.0/3.0.1.

--- Sungwoo

On Tue, Apr 28, 2020 at 4:01 PM Peter Vary  wrote:

> Hi Deepak,
>
> If I were you, I would test your repro case on the master branch.
>
>- If it is fixed, I think you should try to find the fix which solves
>the problem and cherry-pick the fix to branch-3 and branch-3.1 so the fix
>is there in the next release.
>- If the problem is still present on the master branch, then take a
>look at this page
>https://cwiki.apache.org/confluence/display/Hive/HowToContribute This
>describes the development method for Hive.
>
>
> Thanks,
> Peter
>
> On Apr 28, 2020, at 07:22, Deepak Krishna  wrote:
>
> Hi team,
>
> We came across a bug related to count function. We are using hive
> 3.0.0.3.1 with Tez 0.9.0.3.1. PFA the queries to replicate the issue.
>
> Please register this as a bug and let us know if we can support in anyway
> to fix the issue. It would also be helpful to know if there are any other
> workarounds for this issue.
>
> Thanks and Regards,
> Deepak Krishna
>
>
>
>
> Deepak Krishna
> Big Data Engineer
>
> [image: Tel.]
> [image: Fax] +49 721 98993-
> [image: E-Mail] hs-d...@solute.de
>
>
> solute GmbH
> Zeppelinstraße 15
> 76185 Karlsruhe
> Germany
>
> [image: Logo Solute]
>
> Marken der solute GmbH | brands of solute GmbH
> [image: Marken]
> Geschäftsführer | Managing Director: Dr. Thilo Gans, Bernd Vermaaten
> Webseite | www.solute.de  
> Sitz | Registered Office: Karlsruhe
> Registergericht | Register Court: Amtsgericht Mannheim
> Registernummer | Register No.: HRB 110579
> USt-ID | VAT ID: DE234663798
>
> *Informationen zum Datenschutz | Information about privacy policy*
> https://www.solute.de/ger/datenschutz/grundsaetze-der-datenverarbeitung.php
>
>
> 
>
>
>


Question on metadata before and after compaction

2020-10-08 Thread Sungwoo Park
Hi,

I have a question on the consistency between data (e.g., on HDFS) and
metadata kept by Metastore before and after compaction.

Here is a scenario:

1. We back up the database for Metastore (before performing compaction).
2. We perform compaction.
3. After performing compaction, we lose the database for Metastore by
accident.

The question is: can we restore and use the backup database for Metastore
along with the data after compaction? As incremental changes/deltas are
stored on data storage, I guess this might be feasible, but not sure. Any
comment would be appreciated.

Thanks,

--- Sungwoo


Video demo of fault tolerance in Hive on MR3 on Kubernetes

2020-07-29 Thread Sungwoo Park
Hi everyone,

We created a video demo of fault tolerance in Hive on MR3 on Kubernetes,
using Hive 3.1.2 and MR3 1.1. Hope you enjoy it!

https://youtu.be/uoZGsMUlhew

Cheers,

--- Sungwoo


Re: Hive metastore

2020-07-14 Thread Sungwoo Park
Hello,

We use just TCP readiness/liveness probes checking the Metastore listener
port (specified by hive.metastore.port or metastore.thrift.port). I don't
know if an HTTP endpoint is available for Metastore.

readinessProbe:
  tcpSocket:
    port: 9083
  initialDelaySeconds: 10
  periodSeconds: 20
livenessProbe:
  tcpSocket:
    port: 9083
  initialDelaySeconds: 20
  periodSeconds: 20

Best,

--- Sungwoo

On Wed, Jul 15, 2020 at 2:23 AM Eric Pogash  wrote:

> Ping on this, does anyone know of a health endpoint?
>
> Eric
>
>
> On Wed, Jul 8, 2020 at 3:04 PM Eric Pogash  wrote:
>
>> Hello,
>>
>> I'm looking to establish a readiness and liveness probe in kubernetes
>> where we are hosting a hive standalone metastore. Is there an http health
>> endpoint available for the standalone hive metastore that I can use? If
>> not, what is the recommended approach here?
>>
>> Best,
>> Eric Pogash
>>
>


MR3 1.1 released

2020-07-19 Thread Sungwoo Park
We are pleased to announce the release of MR3 1.1. Three main improvements
in MR3 1.1 are:

1. Hive on MR3 on Kubernetes now runs almost as fast as Hive on MR3 on
Hadoop. For experimental results, please see a new blog article "Why you
should run Hive on Kubernetes, even in a Hadoop cluster".

https://www.datamonad.com/post/2020-07-19-why-hive-k8s/

2. Fetch delays rarely occur thanks to: 1) speculative execution and 2)
support for multiple shuffle handlers in a single ContainerWorker. For
experimental results, please see the following page in MR3docs:

https://mr3docs.datamonad.com/docs/mr3/features/fetchdelay/

3. Hive 4 on MR3 is now stable (after applying HIVE-23114 which fixes a
design bug reported by the MR3 team).

Please visit https://mr3docs.datamonad.com/ for the full documentation on
MR3. The quick start guide for Hive on MR3 on Kubernetes has been updated
for MR3 1.1 (https://mr3docs.datamonad.com/docs/quick/).

For Hive on MR3 on Kubernetes, one can quickly test with a pre-built Docker
image available at DockerHub (https://hub.docker.com/u/mr3project) and the
scripts available at GitHub (https://github.com/mr3project/mr3-run-k8s/).

Thank you!

--- Sungwoo

=== Features and improvements in MR3 1.1

## MR3

Mapping DAGs to Task queues:
https://mr3docs.datamonad.com/docs/mr3/features/dag-scheduling/
Multiple shuffle handlers in a single ContainerWorker:
https://mr3docs.datamonad.com/docs/mr3/features/shufflehandler/
Speculative execution:
https://mr3docs.datamonad.com/docs/mr3/features/speculative/
Eliminating fetch delays:
https://mr3docs.datamonad.com/docs/mr3/features/fetchdelay/
Running shuffle handlers in a separate process on Kubernetes:
https://mr3docs.datamonad.com/docs/mr3/guide/use-shufflehandler/

## Hive on MR3 on Kubernetes

Fast recovery in Hive on MR3 on Kubernetes:
https://mr3docs.datamonad.com/docs/k8s/features/recovery/
Using HDFS instead of PersistentVolumes:
https://mr3docs.datamonad.com/docs/k8s/advanced/use-hdfs/
Using Amazon S3 instead of PersistentVolumes:
https://mr3docs.datamonad.com/docs/k8s/advanced/use-s3/
Configuring kernel parameters in Docker containers:
https://mr3docs.datamonad.com/docs/k8s/advanced/configure-kernel/
Performance tuning for Hive on MR3 on Kubernetes:
https://mr3docs.datamonad.com/docs/k8s/advanced/performance-tuning/
Using S3 instead of EFS on Amazon EKS:
https://mr3docs.datamonad.com/docs/k8s/eks/use-s3/

=== Release notes for MR3 1.1

## MR3

  - Support DAG scheduling schemes (specified by `mr3.dag.queue.scheme`).
  - Optimize DAGAppMaster by freeing memory for messages to Tasks when
fault tolerance is disabled (with `mr3.am.task.max.failed.attempts` set to
1).
  - Fix a minor memory leak in DaemonTask (which also prevents MR3 from
running more than 2^30 DAGs when using the shuffle handler).
  - Improve the chance of assigning TaskAttempts to ContainerWorkers that
match location hints.
  - TaskScheduler can use location hints produced by `ONE_TO_ONE` edges.
  - TaskScheduler can use location hints from HDFS when assigning
TaskAttempts to ContainerWorker Pods on Kubernetes (with `
mr3.convert.container.address.host.name`).
  - Introduce `mr3.k8s.pod.cpu.cores.max.multiplier` to specify the
multiplier for the limit of CPU cores.
  - Introduce `mr3.k8s.pod.memory.max.multiplier` to specify the multiplier
for the limit of memory.
  - Introduce `mr3.k8s.pod.worker.security.context.sysctls` to configure
kernel parameters of ContainerWorker Pods using init containers.
  - Support speculative execution of TaskAttempts (with
`mr3.am.task.concurrent.run.threshold.percent`).
  - A ContainerWorker can run multiple instances of shuffle handlers each
with a different port. The configuration key
`mr3.use.daemon.shufflehandler` now specifies the number of shuffle handler
instances in each ContainerWorker.
  - With speculative execution and multiple instances of shuffle handlers
in a single ContainerWorker, fetch delays rarely occur.
  - A ContainerWorker Pod can run shuffle handlers in a separate container
(with `mr3.k8s.shuffle.process.ports`).
  - On Kubernetes, DAGAppMaster uses ReplicationController instead of Pod,
thus making recovery much faster.
  - On Kubernetes, ConfigMaps `mr3conf-configmap-master` and
`mr3conf-configmap-worker` survive MR3, so the user should delete them
manually.
  - Java 8u251/8u252 can be used on Kubernetes 1.17 and later.

## Hive on MR3 on Hadoop

  - CrossProductHandler asks MR3 DAGAppMaster to set
TEZ_CARTESIAN_PRODUCT_MAX_PARALLELISM (Cf. HIVE-16690, Hive 3/4).
  - Hive 4 on MR3 is stable (currently using 4.0.0-SNAPSHOT).
  - No longer support Hive 1.

## Hive on MR3 on Kubernetes

  - Ranger uses a local directory (emptyDir volume) for logging.
  - The open file limit for Solr (in Ranger) is not limited to 1024.
  - HiveServer2 and DAGAppMaster create readiness and liveness probes.


MR3 1.2 released

2020-10-29 Thread Sungwoo Park
Hello Hive users,

MR3 1.2 has been released. A few improvements in this release are:

1. MR3 can publish Prometheus metrics.
2. On Kubernetes, the user can change the total resources for workers
dynamically (e.g., by using Prometheus metrics). This feature can be
combined with autoscaling in order to further improve resource utilization.
3. Hive on MR3 can run on AWS Fargate. This can be useful when running
queries sporadically.
4. LLAP I/O with NVMe caching works properly, which is particularly useful
on Amazon EKS.
5. Ranger 2.0.0 and 2.1.0 are supported.

For details of changes, please see the release notes in MR3docs:

https://mr3docs.datamonad.com/docs/release/

--- Sungwoo


Re: Maintaining Hive 2 and 3 branches,

2021-03-18 Thread Sungwoo Park
Hi Peter,

- Are these patches you mention below bugfixes, or new features on Hive
> 3.1.3? (This might be a typo as I think the last Hive release is 3.1.2)
>

They are a collection of bug-fixes and improvements picked up from
master/branch-3 branches. The list is mostly based on the additional
commits found in HDP 3.1.5 and Qubole Hive 3 relative to Hive 3.1.2. I
mistakenly mentioned Hive 3.1.3 because it applies up to the last few
commits in branch-3.1 which set Hive version to 3.1.3 in pom.xml. We could
think of the list as about 210 commits applied to Hive 3.1.2.


> - Could you backport these patches to the apache branch-3, and branch-3.1?
>
- Is there any reason not to?
>

I could backport these patches to branch-3.1, but I can think of two
potential problems. For branch-3, we need a separate list of commits.

1) Stability of the branch after applying these patches.
As ordinary users of Hive, we cannot convince ourselves that the patches
can be applied safely because we don't have definitive criteria like
"everything is okay if the code passes these tests". So I think either we
should be given such criteria or someone else (Hive committers) should
manually inspect individual patches and test results again.

2) As our repo is essentially a fork from Hive 3.1.2, we cannot apply these
patches to branch-3.1 in their current form.


> I am asking this because I think the best way to move forward is to
> consolidate these backports to a single repo, preferably to the apache one,
> so everyone can benefit from it.
>

Indeed. I hope we will figure out how to make progress for this problem.

Thanks,

--- Sungwoo


Maintaining Hive 2 and 3 branches,

2021-03-18 Thread Sungwoo Park
Hello Hive users,

After attending the Hive meetup yesterday (huge thanks to the organizers!),
I thought that perhaps many organizations were maintaining their own Hive 2
and 3 branches by backporting important patches to vanilla Hive. Ideally it
would be great if all the important patches were regularly merged to Hive 2
and 3 branches (e.g., branch-2.3 and branch-3.1), but I guess this would
take a lot of time and effort on the Hive committer side, and it also seems
like at the moment, most of the efforts are directed at the master branch.

I find this process of backporting patches to Hive 2 and 3 branches to be
quite challenging and time-consuming, especially for those "outsiders" who
have not implemented/reviewed the patches. The problem is two-fold: 1) you
have to decide what patches to apply and in what order; 2) you have to run
all the tests to make sure that new patches are compatible with the code
base and do not introduce new bugs.

1) is not easy because sometimes a patch from the master branch fails to
merge because of missing dependencies. In such a case, you have to go back
to the history of commits, identify those dependency commits, and merge
them first. Depending on the level of changes made in the patch, this can
be a big pain.
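
As a rough sketch of this workflow (the JIRA key and commit names here are
just placeholders):

git log master --oneline --grep='HIVE-12345'   # locate the patch commit on master
git checkout branch-3.1
git cherry-pick <dependency-commit> <patch-commit>   # dependencies first, then the patch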

2) can also be a problem if applying a new patch produces different test
results. Sometimes a patch is merged with no conflicts, but some tests
fail. Besides it may take a lot of time to run tests themselves.

So, I wonder if anyone could share their experience and wisdom on how to
maintain Hive 2 and 3 branches, or share their git repos. For us, we have
applied about 210 patches to Hive 3.1.3 (since Nov 2, 2020), and are in the
middle of applying additional 100+ patches. You can find our work at the
following repo. (You can ignore the last commit which is internal to our
work.)

https://github.com/mr3project/hive-mr3/commits/master3

Thanks,

--- Sungwoo Park


MR3 1.3 released

2021-08-18 Thread Sungwoo Park
We are pleased to announce the release of MR3 1.3.

Highlights in this release are:

1) MR3
On both Hadoop and Kubernetes, there is no limit on the aggregate memory of
ContainerWorkers, so MR3 can run in a cluster of any size.

2) Hive on MR3
We have backported about 350 patches to Apache Hive 3.1.2.

3) Spark on MR3
Spark on MR3 is a new major application of MR3. For an introduction to
Spark on MR3, please see our blog article:
  https://www.datamonad.com/post/2021-08-18-spark-mr3/

For more details, please see the release notes:
  https://mr3docs.datamonad.com/docs/release/

--- Sungwoo


Re: Future release of hive

2021-09-16 Thread Sungwoo Park
Hello,

Hive PMC members or committers could share insider knowledge about the
status of the Hive project, but here is my impression on Hive 3.1.2 as an
outsider.

Hive 3.1.2 is widely used in production, but not maintained seriously. (You
could just check out the # of commits in branch-3.1 for the last couple of
years). Many critical patches have not been backported, and some patches
are committed even without proper testing. As a result, running Hive 3.1.2
in production would require you to maintain your own fork of Hive 3.1.2,
backporting patches as necessary. Or you could use a commercial solution
like CDP. There is nothing unusual here, as Hive is an open source project.

On the other hand, bugs and performance issues in Hive 3.1.2 are constantly
reported in Hive JIRAs, while important bug-fixes and performance
improvements are contributed by many individuals. What seems to happen
afterwards is that the contributed code is either merged only in the master
branch or not accepted at all. (I see quite a few important patches stay
unnoticed without being discussed.) Occasionally you see new Hive JIRAs
reporting bugs in Hive 3.1.2 which have actually been fixed in earlier
JIRAs that are not merged in branch-3.1. In order to take advantage of new
patches, one would have to backport them on their own. (I guess Hive PMC
is mostly focused on the master branch).

As for Hive 4.0, I know nothing about its status, but in the virtual meetup
last March, it was briefly mentioned that no concrete release plan was
ready. (I could be wrong, so someone could correct me.)

We are maintaining our own fork of Hive 3.1.2 which backported over 300
additional patches. More important patches from the master branch are
currently being backported. This repository is getting increasingly
popular, so it might be useful to you.

https://github.com/mr3project/hive-mr3

For dealing with the difficulty of operating Hive 3.1.2, there is Hive on
MR3 - no need to configure LLAP daemons, little dependence on the Hadoop
version, as fast as Hive-LLAP, and so on. Quickstart guides are available (
https://mr3docs.datamonad.com/docs/quick/hadoop/), and tutorials will be
published in the next release. If you can use Kubernetes, you can easily
run Hive on MR3 with Ranger 2.1.0/2.0.0 (
https://mr3docs.datamonad.com/docs/k8s/guide/).

Disclaimer: I am the main developer of MR3.

--- Sungwoo

On Tue, Sep 14, 2021 at 9:43 PM Antoine DUBOIS 
wrote:

> Hello
> After trying to use hive 3.1.2 for several weeks with ranger, I stop.
> It's seems way too complicated and tedious.
> I wonder when or even if there will be any more release in the 3.0 branch.
> I wonder if Hive 3.0 was just an experience as it seems maintenance is not
> really there.
> Is there any plan for Hive 4.0 or should I use Hive 2.8 knowing I'm using
> Hadoop 3 ?
> Any insight on hive release cycle woudl be awesome.
>
> i hope you have a nice day.
>
> Antoine DUBOIS
>
>


Patches to Hive 3.1.2,

2021-08-12 Thread Sungwoo Park
Hello Hive users,

We have updated the repository that backports patches to Hive 3.1.2. Now it
backports about 350 patches from the master branch to branch-3.1 of
November 2020. You can ignore the last two commits which add MR3 backend
and remove Hive on Spark.

https://github.com/mr3project/hive-mr3

The focus is mainly on fixing bugs in Hive 3.1.2 and stabilizing the
performance when using AWS S3. We will keep backporting more patches, so if
you think important patches are missing, please feel free to create issues.

Hope you find it useful!

--- Sungwoo


Re: Hive servers restarting every few hours

2021-10-13 Thread Sungwoo Park
Hi,

For 1, Hive 3.1.2 has a bug which leaks Metastore connections. This was
reported in HIVE-20600:

https://issues.apache.org/jira/browse/HIVE-20600

You might reproduce the bug by inserting values into a table and checking
the number of connections, e.g.:
0: jdbc:hive2://blue0:9852/> CREATE TABLE leak_test (id int, value string);
0: jdbc:hive2://blue0:9852/> insert into leak_test values (1, 'hello'), (2,
'world');
...
0: jdbc:hive2://blue0:9852/> insert into leak_test values (1, 'hello'), (2,
'world');

2021-08-09T02:15:04,263  INFO [HiveServer2-Background-Pool: Thread-250]
metastore.HiveMetaStoreClient: Closed a connection to metastore, current
connections: 20
2021-08-09T02:15:04,269  INFO [HiveServer2-Background-Pool: Thread-250]
metastore.HiveMetaStoreClient: Opened a connection to metastore, current
connections: 21

Applying HIVE-21206 can fix the bug:

https://issues.apache.org/jira/browse/HIVE-21206

--- Sungwoo


On Mon, Oct 11, 2021 at 8:34 PM Manikaran Kathuria <
kathuriamanika...@gmail.com> wrote:

> Hi,
> I hope everyone is doing good during this pandemic. I have some questions
> related to hive server configuration. In our current set up, we are running
> 6 hive server instances on k8s pods. We are using hive version *3.1.2 *with
> Java 8. The container memory associated with each pod is 24G. We are
> observing that the hive servers are crashing with the OOM Java heap error.
> We have set the max heap size to *12G*. We are using *Parallel GC
> collectors* i.e., PS Scavenge and PS MarkSweep for young gen and the old
> gen GCs respectively. Following are our observations-
> 1. The connections to hive metastore kept increasing. Before the server
> crashed, we have seen the number of connections to metastore as high as
> 1.2k. Connection leakage?
> 2. We have also observed that a few times the servers crashed because the
> container memory was full. As we have set max heap size to 12G, the servers
> crashing because native memory was full felt strange. On digging the
> process map from another instance using high native memory (chart of the
> memory used by hive server attached), we found that the memory was
> allocated to multiple* 64M blocks*.These 64M blocks are called *arenas*.
> We can limit the memory growth by using *jemalloc* instead of malloc from
> glibc or setting the *maximum number of allowed arenas*. Is it a common
> issue in hive servers? Any recommendations on how to solve this issue of
> high native memory being used?
> 3. Another observation, when the hive servers restarted, we found the Old
> gen space of heap was full but the memory committed to young gen was much
> lesser than the maximum memory allocated to young gen pool. To be specific
> about one of the instances, total heap: *12G: *Old Gen memory used: *8G: 
> *Young
> Gen Used *360M *(Committed: *708M, *Max: *4G)*. [Chart of heap memory
> usage attached]. This results in consecutive full GCs before the server
> crashes. Should we consider using some other GC? Any recommendations or
> tuning suggestions?
> Please find the attached charts.
> Any help would be highly appreciated.
>
> Thanks,
> Manikaran Kathuria
>


Re: Future release of hive

2021-09-21 Thread Sungwoo Park
Actually we can run Hive 3.1.2 with Ranger!

To run Hive 3.1.2 with Ranger 2.0.0, you could set:

hive.security.authorization.enabled=true
hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator
hive.security.authorization.manager=org.apache.ranger.authorization.hive.authorizer.RangerHiveAuthorizerFactory
hive.privilege.synchronizer=true

For Ranger 2.0.0, RangerHiveAuthorizerBase.getHivePolicyProvider() returns
null, so it is okay to set hive.privilege.synchronizer to true, and you
don't have to set up ZooKeeper.

To run Hive 3.1.2 with Ranger 2.1.0 but without ZooKeeper, you need to set
hive.privilege.synchronizer to false because
RangerHiveAuthorizer.getHivePolicyProvider() returns
RangerHivePolicyProvider. If hive.privilege.synchronizer is set to true,
ZooKeeper should be running.

So, with Ranger 2.0.0 or 2.1.0, you can run Hive 3.1.2 without ZooKeeper.
(Of course, you can run it with ZooKeeper, too.) It may take a while (like
a few seconds) for a new Ranger policy to be delivered to HiveServer2, but
this does not seem like an issue in practice.

--- Sungwoo

On Tue, Sep 21, 2021 at 6:50 PM Antoine DUBOIS 
wrote:

> Yes I can.
> You cannot use Ranger without having to configure an instance of zookeeper
> to run for unclear reasons.
>
> public void startPrivilegeSynchonizer(HiveConf hiveConf) throws Exception {
>
>   PolicyProviderContainer policyContainer = new PolicyProviderContainer();
>   HiveAuthorizer authorizer = SessionState.get().getAuthorizerV2();
>   if (authorizer.getHivePolicyProvider() != null) {
> policyContainer.addAuthorizer(authorizer);
>   }
>   if (hiveConf.get(MetastoreConf.ConfVars.PRE_EVENT_LISTENERS.getVarname()) 
> != null &&
>   
> hiveConf.get(MetastoreConf.ConfVars.PRE_EVENT_LISTENERS.getVarname()).contains(
>   
> "org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener")
>  &&
>   
> hiveConf.get(MetastoreConf.ConfVars.HIVE_AUTHORIZATION_MANAGER.getVarname())!=
>  null) {
> List<HiveMetastoreAuthorizationProvider> providers = 
> HiveUtils.getMetaStoreAuthorizeProviderManagers(
> hiveConf, HiveConf.ConfVars.HIVE_METASTORE_AUTHORIZATION_MANAGER, 
> SessionState.get().getAuthenticator());
> for (HiveMetastoreAuthorizationProvider provider : providers) {
>   if (provider.getHivePolicyProvider() != null) {
> policyContainer.addAuthorizationProvider(provider);
>   }
> }
>   }
> [...]
>
> if (policyContainer.size() > 0) {
>   zKClientForPrivSync = startZookeeperClient(hiveConf);
>   String rootNamespace = 
> hiveConf.getVar(HiveConf.ConfVars.HIVE_SERVER2_ZOOKEEPER_NAMESPACE);
>
>
> So as long as you are using ranger you must use zookeeper and
> configuration in this case is unclear.
> I never managed to make it work properly.
> It seems like version 3.1.2 is no longer developed or supported and only
> 2.x is still under developpement.
> Looks like cloudera buying HDP makes development less active in the end...
>
> --
> *De: *"Battula, Brahma Reddy" 
> *À: *user@hive.apache.org
> *Envoyé: *Vendredi 17 Septembre 2021 21:15:51
> *Objet: *Re: Future release of hive
>
>
>
> Can you please give more details on issues which you faced with hive-3.1.2
> and ranger-2.1.0..?
>
>
>
>
>
> *From: *Antoine DUBOIS 
> *Date: *Tuesday, 14 September 2021 at 6:20 PM
> *To: *user@hive.apache.org 
> *Subject: *Future release of hive
>
> Hello
>
> After trying to use hive 3.1.2 for several weeks with ranger, I stop.
> It's seems way too complicated and tedious.
>
> I wonder when or even if there will be any more release in the 3.0 branch.
>
> I wonder if Hive 3.0 was just an experience as it seems maintenance is not
> really there.
> Is there any plan for Hive 4.0 or should I use Hive 2.8 knowing I'm using
> Hadoop 3 ?
> Any insight on hive release cycle woudl be awesome.
>
>
>
> i hope you have a nice day.
>
>
>
> Antoine DUBOIS
>
>
>
>


Re: Future release of hive

2021-09-21 Thread Sungwoo Park
Sorry, I missed one thing -- you need to backport:
HIVE-20344: PrivilegeSynchronizer for SBA might hit AccessControlException
(Daniel Dai, reviewed by Vaibhav Gumashta)

--- Sungwoo

On Wed, Sep 22, 2021 at 12:24 AM Sungwoo Park  wrote:

> Actually we can run Hive 3.1.2 with Ranger!
>
> To run Hive 3.1.2 with Ranger 2.0.0, you could set:
>
> hive.security.authorization.enabled=true
>
> hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator
>
> hive.security.authorization.manager=org.apache.ranger.authorization.hive.authorizer.RangerHiveAuthorizerFactory
> hive.privilege.synchronizer=true
>
> For Ranger 2.0.0, RangerHiveAuthorizerBase.getHivePolicyProvider() returns
> null, so it is okay to set hive.privilege.synchronizer to true, and you
> don't have to set up ZooKeeper.
>
> To run Hive 3.1.2 with Ranger 2.1.0 but without ZooKeeper, you need to set
> hive.privilege.synchronizer to false because
> RangerHiveAuthorizer.getHivePolicyProvider() returns
> RangerHivePolicyProvider. If hive.privilege.synchronizer is set to true,
> ZooKeeper should be running.
>
> So, with Ranger 2.0.0 or 2.1.0, you can run Hive 3.1.2 without ZooKeeper.
> (Of course, you can run it with ZooKeeper, too.) It may take a while (like
> a few seconds) for a new Ranger policy to be delivered to HiveServer2, but
> this does not seem like an issue in practice.
>
> --- Sungwoo
>
> On Tue, Sep 21, 2021 at 6:50 PM Antoine DUBOIS 
> wrote:
>
>> Yes I can.
>> You cannot use Ranger without having to configure an instance of
>> zookeeper to run for unclear reasons.
>>
>> public void startPrivilegeSynchonizer(HiveConf hiveConf) throws Exception {
>>
>>   PolicyProviderContainer policyContainer = new PolicyProviderContainer();
>>   HiveAuthorizer authorizer = SessionState.get().getAuthorizerV2();
>>   if (authorizer.getHivePolicyProvider() != null) {
>> policyContainer.addAuthorizer(authorizer);
>>   }
>>   if (hiveConf.get(MetastoreConf.ConfVars.PRE_EVENT_LISTENERS.getVarname()) 
>> != null &&
>>   
>> hiveConf.get(MetastoreConf.ConfVars.PRE_EVENT_LISTENERS.getVarname()).contains(
>>   
>> "org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener")
>>  &&
>>   
>> hiveConf.get(MetastoreConf.ConfVars.HIVE_AUTHORIZATION_MANAGER.getVarname())!=
>>  null) {
>> List<HiveMetastoreAuthorizationProvider> providers = 
>> HiveUtils.getMetaStoreAuthorizeProviderManagers(
>> hiveConf, HiveConf.ConfVars.HIVE_METASTORE_AUTHORIZATION_MANAGER, 
>> SessionState.get().getAuthenticator());
>> for (HiveMetastoreAuthorizationProvider provider : providers) {
>>   if (provider.getHivePolicyProvider() != null) {
>> policyContainer.addAuthorizationProvider(provider);
>>   }
>> }
>>   }
>> [...]
>>
>> if (policyContainer.size() > 0) {
>>   zKClientForPrivSync = startZookeeperClient(hiveConf);
>>   String rootNamespace = 
>> hiveConf.getVar(HiveConf.ConfVars.HIVE_SERVER2_ZOOKEEPER_NAMESPACE);
>>
>>
>> So as long as you are using ranger you must use zookeeper and
>> configuration in this case is unclear.
>> I never managed to make it work properly.
>> It seems like version 3.1.2 is no longer developed or supported and only
>> 2.x is still under developpement.
>> Looks like cloudera buying HDP makes development less active in the end...
>>
>> --
>> *De: *"Battula, Brahma Reddy" 
>> *À: *user@hive.apache.org
>> *Envoyé: *Vendredi 17 Septembre 2021 21:15:51
>> *Objet: *Re: Future release of hive
>>
>>
>>
>> Can you please give more details on issues which you faced with
>> hive-3.1.2 and ranger-2.1.0..?
>>
>>
>>
>>
>>
>> *From: *Antoine DUBOIS 
>> *Date: *Tuesday, 14 September 2021 at 6:20 PM
>> *To: *user@hive.apache.org 
>> *Subject: *Future release of hive
>>
>> Hello
>>
>> After trying to use hive 3.1.2 for several weeks with ranger, I stop.
>> It's seems way too complicated and tedious.
>>
>> I wonder when or even if there will be any more release in the 3.0 branch.
>>
>> I wonder if Hive 3.0 was just an experience as it seems maintenance is
>> not really there.
>> Is there any plan for Hive 4.0 or should I use Hive 2.8 knowing I'm using
>> Hadoop 3 ?
>> Any insight on hive release cycle woudl be awesome.
>>
>>
>>
>> i hope you have a nice day.
>>
>>
>>
>> Antoine DUBOIS
>>
>>
>>
>>


Re: Hive-3 with hadoop-2.x.

2022-03-15 Thread Sungwoo Park
Up to MR3 version 1.2, Hive-MR3 supported Hive 3.1.2 on Hadoop 2.7+. From
MR3 version 1.3 on, we did not release distributions for Hadoop 2.7+
because all use cases in production were using Hadoop 3+. (However it's
still easy for us to build a distribution for Hadoop 2.7+.)

When we were implementing Hive-MR3 for Hadoop 2.7+, we did not find any
attempt to run vanilla Hive 3.1.2 on Hadoop 2.7+.

--- Sungwoo

On Tue, Mar 15, 2022 at 8:53 AM Purshotam Shah 
wrote:

> We have been running hive 1.2.1 and planning to migrate to hive-3.1.2.
> We are running on Hadoop-2.10.x.
> I tried to change the Hadoop version to 2.10.x in pom.xml and got
> compilation errors.
> Does hive-3 support Hadoop-2.x?
> Has anyone running hive-3 with Hadoop-2.10.x.
>
> Thanks,
>


Re: Too many S3 API calls for simple queries like select and create external table

2022-02-20 Thread Sungwoo Park
My understanding is that the additional calls to the S3 API are the price to pay
for using the Hadoop library, which only emulates a FileSystem on top of S3. S3 is
not a distributed file system like HDFS, so some of the API calls cannot be
optimized in an ideal way.

For (i), a more serious problem is the cost of traversing the entire
directory, which is totally unnecessary. This was fixed in HIVE-24849:

https://issues.apache.org/jira/browse/HIVE-24849

You can find several JIRAs that try to reduce the overhead of calling S3 at
a higher level, e.g.:

https://issues.apache.org/jira/browse/HIVE-25277
https://issues.apache.org/jira/browse/HIVE-24546

We can also remove some of S3 calls with a technique described in
HIVE-24546. However, I think, unless some optimization is implemented at
the level of Hadoop, the overhead cannot be completely eliminated.

--- Sungwoo

On Fri, Feb 18, 2022 at 7:43 PM Anup Tiwari  wrote:

> Hi Team,
>
> We are using Hive heavily for our ETL where our data is stored on S3 and
> so we have seen a strange behaviour between Hive & S3 interaction in terms
> of S3 API calls i.e. Actual number of API calls for simple select
> statements were too much compared to expected so let us know why it is
> behaving like this because if we execute the same select statement via
> Athena then the number of API calls are very less.
>
>
>
> *Background :-*
> We are incurring some S3 API cost and to understand each API call better,
> we decided to do simple testing.
>
> 1. We have a non partition table containing a lot of objects in parquet
> format on S3.
>
> 2. We copied one parquet file object(data) to a separate S3 bucket(target)
> so now our target bucket contains one parquet file data in following
> hierarchy on S3 :-
> s3:///Test/00_0   (Size of object : 1218 Bytes)
>
> 3. After that, we have executed following 3 command in Apache Hive 2.1.1
> managed by us on EC2 cluster :-
>
> (i) Create External table on top of above S3 location :-
>
> CREATE EXTERNAL TABLE `anup.Test`(
>   `id` int,
>   `cname` varchar(45),
>   `mef` decimal(10,3),
>   `mlpr` int,
>   `qperiod` int,
>   `validity` int,
>   `rpmult` decimal(10,3))
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
>   's3a:///Test' ;
>
> (ii) msck repair table Test(Just to test behaviour) ;
> (iii) Simple select statement :- select * from anup.Test ;
>
>
>
>
> *Results :-*
> Ideally, we were *expecting max 5-10 API calls* with below breakdown
>
>
> 1. Create External : max 2-3 API calls ; which could be GET.BUCKET,
> HEAD.OBJECTS(to check if Test exists or not) and then maybe PUT.OBJECTS to
> create "Test/" object.
> 2. msck repair : 1-2 API calls ; since we have single object behind table
> 3. select *  : 1-2 API calls ; since we have single object behind table
>
>
> But *Actual number of Total API calls was 37* and we have fetched this
> from S3 Access Logs via Athena. Breakdown of these calls are as follows :-
>
> 1. Create External : 9 API calls
> 2. msck repair : 3 API calls
> 3. select *  : 25 API calls
>
>
> Attaching actual results of S3 Access Logs for select command along with
> DEBUG logs of Hive for select statement.
>
> Let us know why so many API calls are happening for the Create External /
> select statement because if we execute the same select statement *via
> Athena* then the number of API calls are very less i.e. *2*.
>
>
>
>
> *Tools / S3 library details :-*
> Apache Hive 2.1.1 / Apache Hadoop 2.8.0 / hadoop-aws-2.8.0.jar /
> aws-java-sdk-s3-1.10.6.jar / aws-java-sdk-kms-1.10.6.jar /
> aws-java-sdk-core-1.10.6.jar
>
> Regards,
> Anup Tiwari
>


MR3 1.4 released,

2022-02-20 Thread Sungwoo Park
We are pleased to announce MR3 1.4 and MR3 App.

1.
We have backported over 600 patches to Apache Hive 3.1.

https://github.com/mr3project/hive-mr3

This repository is maintained as part of developing Hive on MR3, but can
also be used for building Apache Hive (by ignoring the last two commits).
This repository can be useful to those who want to maintain their own fork
of Hive 3.1.2.

2.
MR3 App is a web-based interface for running Hive on MR3 on Kubernetes.
After specifying configuration parameters, you can run Hive on MR3, Apache
Ranger, Grafana (with Prometheus), MR3-UI (with Timeline Server), and
Superset for BI. For running Hive on Kubernetes, MR3 App is the easiest
way. Any feedback will be greatly appreciated :-)

https://mr3docs.datamonad.com/docs/app/
https://app.datamonad.com

3.
For the release notes, please see:

https://mr3docs.datamonad.com/docs/release/

--- Sungwoo


Performance evaluation of Spark 2, Spark 3, Hive-LLAP, MR3 1.4

2022-04-07 Thread Sungwoo Park
Hi Hive users,

Here is our latest article on the performance of Spark 2, Spark 3, and Hive
3. Hope you find it interesting.

https://www.datamonad.com/post/2022-04-01-spark-hive-performance-1.4/

Spark 3 is catching up with Hive very fast, at least when executing
sequential queries. For interactive queries, Spark 3 is nearly as fast as
Hive-LLAP.

--- Sungwoo


Re: Announce: Hive-MR3 with Celeborn,

2023-11-01 Thread Sungwoo Park
On Thu, Nov 2, 2023 at 1:43 PM Sungwoo Park  wrote:

> Have you done comparison between uniffle and celeborn..?
>>
>
> We did not compare the performance of Uniffle and Celeborn (because
> Hive-MR3-Celeborn has been released but Hive-MR3-Uniffle is not complete
> yet). Much of the code in Hive-MR3-Celeborn is currently reused in
> Hive-MR3-Uniffle, so we think there are many architectural similarities
> between the two systems.
>
> We implemented our Celeborn extension first because a user of Hive-MR3
> wanted to use Celeborn which was already running in production. If any
> industrial user of Hive-MR3 wants to use Uniffle in production, please let
> us know.
>
> BTW, if you are using Hive-on-MapReduce or Hive-on-Tez, consider switching
> to Hive-on-Tez. You will see a huge increase (x3 to x10) in throughput.
>

Ooops, I meant switching to Hive-on-MR3 :-)


> Regards,
>
> --- Sungwoo
>
>


Re: Announce: Hive-MR3 with Celeborn,

2023-11-01 Thread Sungwoo Park
>
> Have you done comparison between uniffle and celeborn..?
>

We did not compare the performance of Uniffle and Celeborn (because
Hive-MR3-Celeborn has been released but Hive-MR3-Uniffle is not complete
yet). Much of the code in Hive-MR3-Celeborn is currently reused in
Hive-MR3-Uniffle, so we think there are many architectural similarities
between the two systems.

We implemented our Celeborn extension first because a user of Hive-MR3
wanted to use Celeborn which was already running in production. If any
industrial user of Hive-MR3 wants to use Uniffle in production, please let
us know.

BTW, if you are using Hive-on-MapReduce or Hive-on-Tez, consider switching
to Hive-on-Tez. You will see a huge increase (x3 to x10) in throughput.

Regards,

--- Sungwoo


Re: Announce: Hive-MR3 with Celeborn,

2023-11-02 Thread Sungwoo Park
Celeborn and Uniffle can also be seen as a move to separate local storage
from compute nodes.

1. In the old days, Hadoop was based on the idea of collocating compute and
storage.
2. Later a new paradigm of separating compute and storage emerged and got
popularized.
3. Now people want to not just separate compute and storage, but also
separate local storage from compute nodes.

In the future, all of shuffle/spill files might be stored in a dedicated
system like Celeborn and Uniffle. In our case of developing Hive-MR3, we
completely removed spill files for unordered edges thanks to the efficient
buffering in Celeborn.

Thanks,

--- Sungwoo

On Thu, Nov 2, 2023 at 7:31 PM Keyong Zhou  wrote:

> I think both Celeborn and Uniffle are good alternatives as a general
> shuffle service.
> I recommend that you try them : ). For any question about Celeborn, we're
> very glad
> to discuss in Celeborn's mail lists[1][2] or slack[3].
>
> [1] u...@celeborn.apache.org
> [2] d...@celeborn.apache.org
> [3]
> https://join.slack.com/t/apachecelebor-kw08030/shared_invite/zt-1ju3hd5j8-4Z5keMdzpcVMspe4UJzF4Q
>
> Thanks,
> Keyong Zhou
>
> On 2023/10/31 14:24:38 "Battula, Brahma Reddy" wrote:
> > Thanks for bringing up this. Good to see that it supports spark and
> flink.
> >
> > Have you done comparison between uniffle and celeborn..?
> >
> >
> > On 30/10/23, 8:01 AM, "Keyong Zhou"  zho...@apache.org>> wrote:
> >
> >
> > Great to hear this! It's encouraging that Celeborn helps MR3.
> >
> >
> > Celeborn is a general purpose remote shuffle service that stores and
> serves
> > shuffle data (and other intermediate data in the future) to help compute
> engines
> > better use disaggregated architecture, as well as become more efficient
> and
> > stable for huge shuffle sized jobs.
> >
> >
> > Currently Celeborn supports Hive on MR, and I think integrating with MR3
> > provides a good example to support Hive on Tez.
> >
> >
> > Thanks,
> > Keyong Zhou
> >
> >
> > On 2023/10/24 12:08:54 Sungwoo Park wrote:
> > > Hi Hive users,
> > >
> > > Before the impending release of MR3 1.8, we would like to announce the
> > > release of Hive-MR3 with Celeborn (Hive 3.1.3 on MR3 1.8 with Celeborn
> > > 0.3.1).
> > >
> > > Apache Celeborn [1] is remote shuffle service, similar to Magnet [2]
> and
> > > Apache Uniffle [3] (which was discussed in this Hive mailing list a
> while
> > > ago). Celeborn officially supports Spark and Flink, and we have
> implemented
> > > an MR3-extension for Celeborn.
> > >
> > > In addition to all the benefits of using remote shuffle service,
> > > Hive-MR3-Celeborn supports direct processing of mapper output on the
> > > reducer side, which means that reducers do not store mapper output on
> local
> > > disks (for unordered edges). In this way, Hive-MR3-Celeborn can
> eliminate
> > > over 95% of local disk writes when tested on the 10TB TPC-DS benchmark.
> > > This can be particularly useful when running Hive-MR3 on public clouds
> > > where fast local disk storage is expensive or not available.
> > >
> > > We have documented the usage of Hive-MR3-Celeborn in [4]. You can
> download
> > > Hive-MR3-Celeborn in [5].
> > >
> > > FYI, MR3 is an execution engine providing native support for Hadoop,
> > > Kubernetes, and standalone mode [6]. Hive-MR3, its main application,
> > > provides the performance of LLAP yet is very easy to install and
> operate.
> > > If you are using Hive-Tez for running ETL jobs, switching to Hive-MR3
> will
> > > give you a much higher throughput thanks to its advanced resource
> sharing
> > > model.
> > >
> > > We have recently opened a Slack channel. If interested, please join the
> > > Slack channel and ask any question on MR3:
> > >
> > >
> > > https://join.slack.com/t/mr3-help/shared_invite/zt-1wpqztk35-AN8JRDznTkvxFIjtvhmiNg
> > >
> > > Thank you,
> > >
> > > --- Sungwoo
> > >
> > > [1] https://celeborn.apache.org/ <https://celeborn.apache.org/>
> > > [2] https://www.vldb.org/pvldb/vol13/p3382-shen.pdf <
> https://www.vldb.org/pvldb/vol13/p3382-shen.pdf>
> > > [3] https://uniffle.apache.org/ <https://uniffle.apache.org/>
> > > [4] https://mr3docs.datamonad.com/docs/mr3/features/celeborn/ <
> https://mr3docs.datamonad.com/docs/mr3/features/celeborn/>
> > > [5] https://github.com/mr3project/mr3-release/releases/tag/v1.8 <
> https://github.com/mr3project/mr3-release/releases/tag/v1.8>
> > > [6] https://mr3docs.datamonad.com/ <https://mr3docs.datamonad.com/>
> > >
> >
> >
> >
> >
>


Fwd: Release of Hive 4 and TPC-DS benchmark

2023-11-03 Thread Sungwoo Park
Forwarded to user@hive as I think many people are curious about the release
of Hive 4.

-- Forwarded message -
From: Sungwoo Park 
Date: Sat, Nov 4, 2023 at 12:42 AM
Subject: Release of Hive 4 and TPC-DS benchmark
To: 


Hi everyone,

I would like to resume the discussion on the release of Hive 4 and
the result of the TPC-DS benchmark.

Currently there are four unresolved JIRAs marked 'hive-4.0.0-must' which must
be resolved before the release of Hive 4 ([1], [2], [3], [4]). The most urgent
one is perhaps HIVE-26654 [1], which reports failing queries in the TPC-DS
benchmark. (All these bugs were introduced after the release of Hive 3.1.2,
which passes all the TPC-DS tests.)

Originally we reported 7 failing cases in HIVE-26654. Since then, 3 cases have
been resolved, 2 cases have pull requests, and 2 cases don't have pull requests
yet.

1. Query 17: Resolved in HIVE-26655 [6]
2. Query 16, 69, 94: Resolved in HIVE-26659 [8]
3. Query 64: Resolved in HIVE-26968 [10]

4. Query 2: Pull request available in HIVE-27006 [5]
5. Query 71: Pull request available in HIVE-26986 [9]

6. Query 14: Reported in HIVE-24167 [7]
7. Query 97: Reported in HIVE-27269 [11]

Seonggon and I (in the MR3 team) have been working on these problems, and so
far we have submitted 4 pull requests. Two of them have been merged, but the
other two are not being reviewed (for query 2 and query 71). I'd appreciate it
very much if Hive committers could review the remaining pull requests.

The remaining problems are query 14 and query 97.

For query 14, I suggest that we take a simple workaround by setting
hive.optimize.cte.materialize.threshold to -1 by default, because nobody seems
to be working on this JIRA. If necessary, we could try to fix it after the
release of Hive 4.
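
For illustration only, here is a minimal sketch of the proposed workaround as
a hive-site.xml entry (the property name and value come from the discussion
above; whether this becomes the default is up to the community):

  <!-- sketch: disables CTE materialization; -1 comes from the proposal above -->
  <property>
    <name>hive.optimize.cte.materialize.threshold</name>
    <value>-1</value>
  </property>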

For query 97 (which we think is the most challenging one among all the
sub-JIRAs), we have a few choices:

1) Use a quick-fix solution by ignoring hive.mapjoin.hashtable.load.threads
when FullOuterJoin is used
2) Fix HIVE-25583 [12] which introduces this bug
3) Fix it properly

I suggest that we take a quick-fix solution and revisit the problem after the
release of Hive 4.

(We have also observed performance regressions in Hive, but I guess that is
another topic to discuss after fixing the correctness issues.)

Please let us know what you think.

Thanks,

--- Sungwoo

[1] https://issues.apache.org/jira/browse/HIVE-26654
[2] https://issues.apache.org/jira/browse/HIVE-27226
[3] https://issues.apache.org/jira/browse/HIVE-26505
[4] https://issues.apache.org/jira/browse/HIVE-22636
[5] https://issues.apache.org/jira/browse/HIVE-27006
[6] https://issues.apache.org/jira/browse/HIVE-26655
[7] https://issues.apache.org/jira/browse/HIVE-24167
[8] https://issues.apache.org/jira/browse/HIVE-26659
[9] https://issues.apache.org/jira/browse/HIVE-26986
[10] https://issues.apache.org/jira/browse/HIVE-26968
[11] https://issues.apache.org/jira/browse/HIVE-27269
[12] https://issues.apache.org/jira/browse/HIVE-25583


Announce: Hive-MR3 with Celeborn,

2023-10-24 Thread Sungwoo Park
Hi Hive users,

Before the impending release of MR3 1.8, we would like to announce the
release of Hive-MR3 with Celeborn (Hive 3.1.3 on MR3 1.8 with Celeborn
0.3.1).

Apache Celeborn [1] is a remote shuffle service, similar to Magnet [2] and
Apache Uniffle [3] (which was discussed in this Hive mailing list a while
ago). Celeborn officially supports Spark and Flink, and we have implemented
an MR3-extension for Celeborn.

In addition to all the benefits of using remote shuffle service,
Hive-MR3-Celeborn supports direct processing of mapper output on the
reducer side, which means that reducers do not store mapper output on local
disks (for unordered edges). In this way, Hive-MR3-Celeborn can eliminate
over 95% of local disk writes when tested on the 10TB TPC-DS benchmark.
This can be particularly useful when running Hive-MR3 on public clouds
where fast local disk storage is expensive or not available.

We have documented the usage of Hive-MR3-Celeborn in [4]. You can download
Hive-MR3-Celeborn in [5].

FYI, MR3 is an execution engine providing native support for Hadoop,
Kubernetes, and standalone mode [6]. Hive-MR3, its main application,
provides the performance of LLAP yet is very easy to install and operate.
If you are using Hive-Tez for running ETL jobs, switching to Hive-MR3 will
give you a much higher throughput thanks to its advanced resource sharing
model.

We have recently opened a Slack channel. If interested, please join the
Slack channel and ask any question on MR3:

https://join.slack.com/t/mr3-help/shared_invite/zt-1wpqztk35-AN8JRDznTkvxFIjtvhmiNg

Thank you,

--- Sungwoo

[1] https://celeborn.apache.org/
[2] https://www.vldb.org/pvldb/vol13/p3382-shen.pdf
[3] https://uniffle.apache.org/
[4] https://mr3docs.datamonad.com/docs/mr3/features/celeborn/
[5] https://github.com/mr3project/mr3-release/releases/tag/v1.8
[6] https://mr3docs.datamonad.com/


Re: Specifying YARN Node (Label) for LLAP AM

2023-08-19 Thread Sungwoo Park
Hello,

For more recent benchmark results, please see [1] where we compare Trino
418, Spark 3.4.0, and Hive 3.1.3 (on MR3 1.7) using TPC-DS 10TB. Spark
takes about 19600 seconds to complete all the queries, whereas Trino and
Hive take about 7400 seconds only. The experiment does not use Hive-LLAP,
but you may think of Hive on MR3 as a substitute for Hive-LLAP because
both systems are comparable in performance. In the experiment, we tried our
best to get the best performance of Trino and Spark, and did not
intentionally penalize them in order to favor Hive.

Spark is a great project with many cool features and a huge lively
community, but speed is no longer a key feature of Spark that would
differentiate itself from other competing technologies. As far as speed is
concerned, it seems that Spark folks are living inside their own world,
still believing in its so-called 'in-memory' computing technology.

Recently someone read the article [1] and summarily dismissed the results,
saying "it is simply impossible for Hive to run faster than Spark". I guess
many people still think that Hive is very slow and only good for ETL.

Regards,

--- Sungwoo
[1]
https://www.datamonad.com/post/2023-05-31-trino-spark-hive-performance-1.7/

On Sat, Aug 19, 2023 at 7:18 PM Aaron Grubb  wrote:

> Hi Mich,
>
> It's not a question of cannot but rather a) is it worth converting our
> pipelines from Hive to Spark and b) is Spark more performant than LLAP, and
> in both cases the answer seems to be no. 2016 is a lifetime ago in
> technological time and since then there's been a major release of Hive as
> well as many minor releases. When we started looking for our "big data
> processor" 2 years ago, we had evaluated Spark, Presto, AWS Athena and Hive
> on LLAP and all literature pointed to Hive on LLAP being the most
> performant, in particular when you're able to take advantage of the ORC
> footer caching. If you'd like to review some benchmarks, you can take a
> look at this [1] but the direct comparison between Spark and LLAP is done
> with a fork of Hive.
>
> Regards,
> Aaron
>
> [1] https://www.datamonad.com/post/2022-04-01-spark-hive-performance-1.4/
>
> On Fri, 2023-08-18 at 16:06 +0100, Mich Talebzadeh wrote:
>
> interesting!
>
> In 2016 I gave a presentation in London, in Future of DataOrganised by
> Hortonworks July 20, 2016,
>
> Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
> 
>
>
> Then I thought Spark as an underlying engine for Hive did the best job.
> However, I am not sure there has been many new developments to make Spark
> as the underlying engine for Hive. Any particular reason you cannot use
> Spark as the ET: tool with Hive providing the underlying storage? Spark has
> excellent APIs to work with hive including spark thrift server (which is
> under the bonnet Hive thrift server).
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 18 Aug 2023 at 15:45, Aaron Grubb  wrote:
>
> Hi Mich,
>
> Yes, that's correct
>
> On Fri, 2023-08-18 at 15:24 +0100, Mich Talebzadeh wrote:
>
> Hi,
>
> Are you using LLAP (Long live and prosper) as a Hive engine?
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 18 Aug 2023 at 15:09, Aaron Grubb  wrote:
>
> For those interested, I managed to define a way to launch the LLAP
> application master and daemons on separate, targeted machines. It was
> inspired by an article I found [1] and implemented using YARN Node Labels
> [2] and Placement Constraints [3] with a modification to the file
> scripts/llap/yarn/templates.py. Here are the basic instructions:
>
> 1. Configure YARN to enable placement constraints and node labels. You
> have the option of using 2 node labels or 1 node label + the default
> partition. The machines that are intended to run the daemons must have a
> label associated with them. If you 

Web-based interface for running Hive on Amazon EKS and Kubernetes

2022-06-06 Thread Sungwoo Park
Hi Hive users,

We created MR3 Cloud, a web-based interface for executing Hive on Amazon
EKS and Kubernetes. After specifying parameters in an interactive way, the
user can download YAML files for creating an EKS cluster and Kubernetes
objects. The user can create all the following components at once:
HiveServer2 + Metastore (Hive 3.1.2 + over 600 patches), Ranger, Superset,
Grafana (with Prometheus), MR3-UI (with Timeline).

MR3 Cloud: https://cloud.datamonad.com/
Documentation: https://mr3docs.datamonad.com/docs/cloud/
Quick start guide for Amazon EKS:
https://mr3docs.datamonad.com/docs/quick/aws/run-eks-cloud/
Quick start guide for K8s:
https://mr3docs.datamonad.com/docs/quick/k8s/cloud-k8s/

Thanks & Any feedback will be appreciated very much.

--- Sungwoo


Re: [DISCUSS] End of life for Hive 1.x, 2.x, 3.x

2022-05-10 Thread Sungwoo Park
We maintain our own fork of Hive 3 because we are not always adding new
commits to the tip of the branch. To backport a new patch, sometimes we
have to add new commits between existing commits, update earlier commits,
and so on. This makes it impractical to keep adding new patches only to the
tip of the branch while reverting commits if necessary. Maintaining the
Hive 3 branch would mean frequent force-updates, which might produce more
problems. (If this is not an issue, we could try to completely rebuild the
Hive 3 branch.)

I hope the Apache community can make a concerted effort to figure out what
patches to include in Hive 3. For us, the challenge was 1) to decide which
patch to include; 2) to figure out its dependencies if any; 3) to resolve
conflicts. Testing was also another source of pain.

Thanks,

--- Sungwoo





On Tue, May 10, 2022 at 4:26 PM Peter Vary  wrote:

> When we were brainstorming about the future of the Hive 3 branch with
> Zoltan Haindrich, he mentioned this letter:
> https://lists.apache.org/thread/by9ppc2z8oqdzpqotzv5bs34yrxrd84l
>
> I think Sungwoo Park and his team makes a huge effort to maintain this
> branch, and maybe it would be better to help them do this inside the Apache
> Hive project. They should not need to maintain their own branch if there is
> no particular reason behind it, or we can remove those blockers. This could
> be beneficial for every Hive user who still uses Hive 3.
>
> @Sungwoo: Do you have any specific reason to keep you own fork of Hive 3?
>
> That would mean we could have a much better Hive 3.x branch than we have
> now.
>
> What do you think?
>
> Thanks,
> Peter
>
>
>
> On 2022. May 10., at 8:40, Battula, Brahma Reddy <
> bbatt...@visa.com.INVALID> wrote:
>
> Agree to Peter and sunchao..
>
> Even we are using the hive 3.x, we might contribute on bugfixes.
>
> Even I am +1 on 1.x EOL as it's hard to maintain so many releases and time
> to user's migrate to 2.x and 3.x.
>
>
> On 09/05/22, 10:51 PM, "Chao Sun"  wrote:
>
>Agree to Peter above. I know quite a few projects such as Spark,
>Iceberg and Trino/Presto are depending on Hive 2.x and 3.x, and
>periodically they may need new fixes in these. Upgrading them to use
>4.x seems not an option for now since the core classified artifact has
>been removed and the shading issue has to be solved before they can
>consume the new jar.
>
>On Mon, May 9, 2022 at 4:10 AM Peter Vary  wrote:
>
>
> Hi Team,
>
> My experience with the Iceberg community shows that there are some
> sizeable userbase around Hive 2.x. I have seen patches, contributions to
> Hive 2.3.x branches, and the tests are in much better shape there.
>
> I would definitely vote for EOL Hive 1.x, but until we have a stable 4.x,
> I would be cautious about slashing 2.x, 3.x branches.
>
> Just my 2 cents.
>
> Peter
>
> On 2022. May 9., at 10:51, Alessandro Solimando <
> alessandro.solima...@gmail.com> wrote:
>
> Hi Stamatis,
> thanks for bringing up this topic, I basically agree on everything you
> wrote.
>
> I just wanted to add that this kind of proposal might sound harsh, because
> in many contexts upgrading is a complex process, but it's in nobody's
> interest to keep release branches that are missing important
> fixes/improvements and that might not meet the quality standards that
> people expect, as mentioned.
>
> Since we don't have yet a stable 4.x release (only alpha for now) we might
> want to keep supporting the 3.x branch until the first 4.x stable release
> and EOL < 3.x branches, WDYT?
>
> Best regards,
> Alessandro
>
> On Fri, 6 May 2022 at 23:14, Stamatis Zampetakis 
> wrote:
>
>
> Hi all,
>
> The current master has many critical bug fixes as well as important
> performance improvements that are not backported (and most likely never
> will) to the maintenance branches.
>
> Backporting changes from master usually requires adapting the code and
> tests in questions making it a non-trivial and time consuming task.
>
> The ASF bylaws require PMCs to deliver high quality software which satisfy
> certain criteria. Cutting new releases from maintenance branches with known
> critical bugs is not compliant with the ASF.
>
> CI is unstable in all maintenance branches making the quality of a release
> questionable and merging new PRs rather difficult. Enabling and running it
> frequently in all maintenance branches would require a big amount of
> resources on top of what we already need for master.
>
> History has shown that it is very difficult or impossible to properly
> maintain multiple release branches for Hive.
>
> I think it would be to the best interest of the project if the PM

Re: External table replication in Hive

2022-08-25 Thread Sungwoo Park
For 1, cherry-picking it to Hive 3 does not work. I tried to
backport HIVE-20911 to Hive 3, but it did not work because of so many
dependencies :-(

--- Sungwoo

On Thu, Aug 25, 2022 at 2:15 AM Bharathkrishna G M 
wrote:

> Hi,
>
> I want to replicate the Hive metastore to create a separate instance (for
> example, replicate the prod metadata to create a dev metastore).
>
> I'm using Hive version 3.1.2 and it only supports managed table
> replication and lacks external table replication.
>
> I want to try and get external tables also replicate on Hive 3. A couple
> of questions here:
>
> 1. I see the Pull request https://github.com/apache/hive/pull/506/files
> for external table replication on Hive 4. Is it viable to cherry-pick it to
> Hive 3, and is this something that has been tried out?
>
> 2. Also, I only want to replicate the metadata and not the data itself.
>
> 3. The replication commands seem to have changed with Hive 4, and I'm
> unable to find the new commands (there's no documentation), the latest
> documentation seen is
> https://cwiki.apache.org/confluence/display/Hive/HiveReplicationv2Development
> which does not cover the external tables.
>
> Kindly guide me on the above.
>
> Thanks,
> Bharath
>


MR3 1.5 released

2022-08-05 Thread Sungwoo Park
Hello Hive users,

MR3 1.5 has been released. Hive 3.1.3 (with more than 600 additional
patches backported) and Spark 3.2.2 are supported.

Hive/Spark on MR3 is a quick and ready solution for you if:

1. You want to migrate from Hadoop to Kubernetes, but continue to use Hive.
2. You want to run Hive and Spark sharing Metastore.
3. You want to maximize resource utilization of Spark applications.

For the quick start guide on Hive/Spark on MR3 on Kubernetes, please see:

https://mr3docs.datamonad.com/docs/quick/k8s/

For those curious about the performance of Hive vs Spark, please see:

https://www.datamonad.com/post/2022-04-01-spark-hive-performance-1.4/

Thank you,

--- MR3 team


Re: Hive 3 has big performance improvement from my test

2023-01-07 Thread Sungwoo Park
In fact, Hive 3 has been much faster than Spark for a long time. For
complex queries, Hive 3 is much faster than Presto (or Trino) as well. The
reality is different from common beliefs on Hive, Spark, and Presto. If
interested, see the result of performance comparison using the TPC-DS
benchmark.

Performance comparison in October 2018:
https://www.datamonad.com/post/2018-10-30-performance-evaluation-0.4/

Performance comparison in April 2022:
https://www.datamonad.com/post/2022-04-01-spark-hive-performance-1.4/

Sungwoo


On Fri, Jan 6, 2023 at 12:35 PM ypeng  wrote:

> Hello,
>
> Just from my personal testing, Hive 3.1.3 has much better performance than
> the old ones.
> It's even as fast as Spark by using the default mr engine.
> My test process and dataset,
> https://blog.crypt.pw/Another-10-million-dataset-testing-for-Spark-and-Hive
>
> Thanks.
>


Re: Hive 3 has big performance improvement from my test

2023-01-07 Thread Sungwoo Park
>
>
> [image: image.png]
>
> from your posting, the result is amazing. glad to know hive on mr3 has
> that nice performance.
>

Hive on MR3 is similar to Hive-LLAP in performance, so we can interpret the
above result as Hive being much faster than SparkSQL. For executing
concurrent queries, the performance gap is even greater. In my (rather
biased) opinion, the key weakness of Spark is 1) its poor performance when
executing concurrent queries and 2) its poor resource utilization when
executing multiple Spark applications concurrently.

We released Hive on MR3 1.6 a couple of weeks ago. Now we have backported
about 700 patches to Hive 3.1. If interested, please check it out:
https://www.datamonad.com/

Sungwoo


Re: Hive 3 has big performance improvement from my test

2023-01-08 Thread Sungwoo Park
Our claim is based on TPC-DS results reported in
https://www.datamonad.com/post/2022-04-01-spark-hive-performance-1.4/ for 1)
and https://www.datamonad.com/post/2021-08-18-spark-mr3/ for 2). Not sure
about what you mean by 'Hive ability on handling MR'. One may draw
different conclusions from the same experimental results and interpret our
TPC-DS results in a different way.

Sungwoo


On Sun, Jan 8, 2023 at 10:01 PM Mich Talebzadeh 
wrote:

> What bothers me is that you are making sweeping statements about Spark
> inability to handle quote " ... the key weakness of Spark is 1) its poor
> performance when executing concurrent queries and 2) its poor resource
> utilization when executing multiple Spark applications concurrently"
> and conversely overstating Hive ability on handling MR.
> In fairness anything published  in a public forum is fair game for analysis
> or criticism. Then you are expected to back it up. I cannot see how anyone
> could object to the statement: if you make a claim, be prepared to prove
> it.
>
> I am open minded on this so please clarify the above statement
>
> HTH
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sun, 8 Jan 2023 at 05:21, Sungwoo Park  wrote:
>
>>
>>> [image: image.png]
>>>
>>> from your posting, the result is amazing. glad to know hive on mr3 has
>>> that nice performance.
>>>
>>
>> Hive on MR3 is similar to Hive-LLAP in performance, so we can interpret
>> the above result as Hive being much faster than SparkSQL. For executing
>> concurrent queries, the performance gap is even greater. In my (rather
>> biased) opinion, the key weakness of Spark is 1) its poor performance when
>> executing concurrent queries and 2) its poor resource utilization when
>> executing multiple Spark applications concurrently.
>>
>> We released Hive on MR3 1.6 a couple of weeks ago. Now we have backported
>> about 700 patches to Hive 3.1. If interested, please check it out:
>> https://www.datamonad.com/
>>
>> Sungwoo
>>
>


Re: Specifying YARN Node (Label) for LLAP AM

2023-03-22 Thread Sungwoo Park
Hello,

A similar issue was discussed in the Tez mailing list a long time ago:

https://lists.apache.org/thread/0vjor12lpcncg43rn6vddw8yc1k62c81

Tez still does not support specifying node labels for AMs, but as explained
in the response, this is quite easy to implement if you can re-compile Tez.
(Hive-MR3 is still a valid option, with hundreds of patches backported to
Hive 3.1.3.)

--- Sungwoo



On Wed, Mar 22, 2023 at 7:21 PM Aaron Grubb  wrote:

> Hi all,
>
> I have a Hadoop cluster (3.3.4) with 6 nodes of equal resource size that
> run HDFS and YARN and 1 node with lower resources which only runs YARN that
> I use for Hive AMs, the LLAP AM, Spark AMs and Hive file merge containers.
> The HDFS nodes are set up such that the queue for LLAP on the YARN
> NodeManager is allocated resources exactly equal to what the LLAP daemons
> consume. However, when I need to re-launch LLAP, I currently have to stop
> the NodeManager processes on each HDFS node, then launch LLAP to guarantee
> that the application master ends up on the YARN-only machine, then start
> the NodeManager processes again to let the daemons start spawning on the
> nodes. This used to not be a problem because only Hive/LLAP was using YARN
> but now we've started using Spark in my company and I'm in a position where
> if LLAP happens to crash, I would need to wait for Spark jobs to finish
> before I can re-launch LLAP, which would put our ETL processes behind,
> potentially to unacceptable delays. I could allocate 1 vcore and 1024mb
> memory extra for the LLAP queue on each machine, however that would mean I
> have 5 vcores and 5gb RAM being reserved and unused at all times, so I was
> wondering if there's a way to specify which node to launch the LLAP AM on,
> perhaps through YARN node labels similar to the Spark
> "spark.yarn.am.nodeLabelExpression" configuration? Or even a way to specify
> the node machine through a different mechanism? My Hive version is 3.1.3.
>
> Thanks,
> Aaron
>


Running Hive on Kubernetes,

2023-02-23 Thread Sungwoo Park
Hello,

If you are interested in running Hive on Kubernetes (without requiring
Hadoop), we have updated the quick start guide on running Hive on MR3 on
Kubernetes.

The quick start guide shows step-by-step instructions for running
Metastore, HiveServer2, Ranger, MR3-UI, Grafana, with/without Kerberos and
with/without SSL.

The user can use either shell scripts, Helm charts, or TypeScript code
(which generates a single YAML after checking the consistency in the
configuration parameters).

If interested, please see:
https://mr3docs.datamonad.com/docs/quick/k8s/

--- Sungwoo


Hive on MR3 1.7 released

2023-05-29 Thread Sungwoo Park
Hi Hive users,

I am happy to announce the release of MR3 1.7. MR3 is an execution engine
for big data processing, and its main application Hive on MR3 is an
alternative to Hive-Tez and Hive-LLAP. I would like to summarize its main
features.

1. Hive on MR3 on Hadoop
Hive on MR3 is easy to install on Hadoop. In particular, you don't have to
upgrade Tez and Hadoop to their matching version.

2. Hive on MR3 on Kubernetes
MR3 provides native support for Kubernetes, and Hive on MR3 can run
directly on Kubernetes (without having to operate Hadoop on Kubernetes). On
public clouds like Amazon EKS, one can take advantage of autoscaling and
spot instances.

3. Hive on MR3 in standalone mode
From version 1.7, MR3 supports standalone mode which does not require a
resource manager like Hadoop and Kubernetes. By exploiting standalone mode,
one can run Hive on MR3 virtually in any type of cluster. Now installing
Hive on MR3 is as simple as installing Trino/Presto.

4. Performance
Based on 10TB TPC-DS benchmark, Hive on MR3 runs faster than Hive-LLAP
(8074s vs 8680s). It is slightly slower than Trino 418 (7424s vs 8074s),
but returns correct results on all 99 queries, while Trino fails or returns
wrong results on some queries.

5. Java 17
As an experimental feature, MR3 supports Java 17, and Hive on MR3 can run
with Java 17. From the same TPC-DS benchmark, upgrading Java from 8 to 17
yields about 8% speedup (from 8074s to 7415s).

6. Correctness
Hive on MR3 is based on Hive branch-3.1 and has backported over 700
patches. In addition to q-tests included in the source code of Hive, we use
TPC-DS benchmark to check the correctness of query compilation. Hive on MR3
returns correct results on all 99 queries. (Note: The current master branch
of Hive returns wrong results on some queries in TPC-DS.)

7. Misc
Other applications of MR3 include Spark on MR3 and MapReduce on MR3. For
example, you can run MapReduce jobs directly on Kubernetes!

For the full documentation (including quick start guide and release notes),
please see:

https://mr3docs.datamonad.com/

The git repository for Hive on MR3 can be used to build Hive on Tez as well
(by ignoring the last few commits):

https://github.com/mr3project/hive-mr3

Thanks,

--- Sungwoo


Performance Evaluation of Trino, Spark, and Hive on MR3

2023-05-31 Thread Sungwoo Park
Hello Hive users,

With the release of Hive on MR3 1.7, we published an article that compares
Trino, Spark, and Hive on MR3.

https://www.datamonad.com/post/2023-05-31-trino-spark-hive-performance-1.7/

Omitted in the article is the result of running Hive-LLAP included in HDP
3.1.4. In our experiment, Hive-LLAP spends 8680 seconds to execute all 99
queries in the TPC-DS benchmark. All the queries are completed
successfully, but query 70 returns 25 rows (rather than 124 rows).

Thanks,

--- Sungwoo


hive.query.reexecution.stats.persist.scope

2023-05-24 Thread Sungwoo Park
Hi Hive users,

Hive can persist runtime statistics by setting
hive.query.reexecution.stats.persist.scope to 'hiveserver' or 'metastore'
(instead of the default value 'query'). If you have experience using
this configuration key in production, could you share it here? (Like the
stability of query execution, speed improvement, etc.)
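
For reference, here is a minimal sketch of the setting as a hive-site.xml
entry (assuming HiveServer2 reads it at startup; 'metastore' would be the
other persistent option):

  <!-- sketch only: 'hiveserver' or 'metastore' persists stats beyond a single query -->
  <property>
    <name>hive.query.reexecution.stats.persist.scope</name>
    <value>hiveserver</value>
  </property>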

Thanks,

--- Sungwoo


Re: hive.query.reexecution.stats.persist.scope

2023-05-25 Thread Sungwoo Park
Hi Ayush,

Thank you for letting me know about HIVE-26978. To me, this bug seems like
a small price to pay for the huge benefit from persisting runtime
statistics. Setting the config to 'hiveserver' also seems like a good
compromise.

Thanks,

--- Sungwoo

On Thu, May 25, 2023 at 1:49 AM Ayush Saxena  wrote:

> Hi Sungwoo,
>
> I know one issue: if that config is set to Metastore in that case it does
> create some issues, those entries are persisted in the *RUNTIME_STATS *table
> and if you drop a table the entry still stays, so if you have some drop
> table and you recreate the table and shoot a query before 
> *RuntimeStatsCleanerTask
> *can clean the stale entry, your plans for the newer table will be
> screwed till then.
>
> There is a ticket for that here [1], most probably we need to find a way
> to drop those stats on drop table or make sure the newer tables can figure
> out the stats are stale, the catch is the *RUNTIME_STATS *doesn't have
> table name mapping so to do a drop that should be a good effort, I haven't
> spent much time investigating so there might be better ways as well.
>
> -Ayush
>
> [1] https://issues.apache.org/jira/browse/HIVE-26978
>
> On Wed, 24 May 2023 at 19:53, Sungwoo Park  wrote:
>
>> Hi Hive users,
>>
>> Hive can persist runtime statistics by setting
>> hive.query.reexecution.stats.persist.scope to 'hiveserver' or 'metastore'
>> (instead of the default value 'query'). If you have an experience of using
>> this configuration key in production, could you share it here? (Like the
>> stability of query execution, speed improvement, etc.)
>>
>> Thanks,
>>
>> --- Sungwoo
>>
>>


Blog article 'Performance Tuning for Single-table Queries'

2023-12-23 Thread Sungwoo Park
Hello Hive users,

I have published a new blog article 'Performance Tuning for Single-table
Queries'. It shows how to change configuration parameters of Hive and Tez
in order to make simple queries run faster than Spark. Although it
uses Hive on MR3, the technique equally applies to Hive on Tez and
Hive-LLAP.

https://www.datamonad.com/post/2023-12-23-optimize-bi-1.8/

Hope you find it useful.

Cheers,

--- Sungwoo


Re: MR3 1.8 released

2023-12-15 Thread Sungwoo Park
For Chinese users, MR3 1.8 is now shipped in HiDataPlus (along with
Celeborn).

https://mp.weixin.qq.com/s/65bgrnFpXtORlb4FjlPMWA

--- Sungwoo

On Sat, Dec 9, 2023 at 9:08 PM Sungwoo Park  wrote:

> MR3 1.8 released
>
> On behalf of the MR3 team, I am pleased to announce the release of MR3 1.8.
>
> MR3 is an execution engine similar in spirit to MapReduce and Tez which
> has been under development since 2015. Its main application is Hive on MR3.
> You can run Hive on MR3 on Hadoop, on Kubernetes, in standalone mode (which
> does not require Hadoop/Kubernetes), or on a local machine. You can also
> test Hive on MR3 in a single Docker container.
>
> From MR3 1.8, we assume Java 17 by default. For running Hive on MR3 on
> Hadoop, we continue to support Java 8 as well. For Kubernetes and
> standalone mode, we release Hive on MR3 built with Java 17 only.
>
> Please see the release notes for changes new in MR3 1.8. A major new
> feature is that Hive on MR3 can use Apache Celeborn for remote shuffle
> service.
>
> https://mr3docs.datamonad.com/docs/release/
>
> For the performance of Hive on MR3 1.8, please see a blog article "Hive on
> MR3 - from Java 8 to Java 17 (and beating Trino)". On the 10TB TPC-DS
> benchmark, Hive on MR3 1.8 finishes all the queries faster than Trino 418.
>
> https://www.datamonad.com/post/2023-12-09-hivemr3-java17-1.8/
>
> Thank you,
>
> --- Sungwoo
>


MR3 1.8 released

2023-12-09 Thread Sungwoo Park
MR3 1.8 released

On behalf of the MR3 team, I am pleased to announce the release of MR3 1.8.

MR3 is an execution engine similar in spirit to MapReduce and Tez which has
been under development since 2015. Its main application is Hive on MR3. You
can run Hive on MR3 on Hadoop, on Kubernetes, in standalone mode (which
does not require Hadoop/Kubernetes), or on a local machine. You can also
test Hive on MR3 in a single Docker container.

From MR3 1.8, we assume Java 17 by default. For running Hive on MR3 on
Hadoop, we continue to support Java 8 as well. For Kubernetes and
standalone mode, we release Hive on MR3 built with Java 17 only.

Please see the release notes for changes new in MR3 1.8. A major new
feature is that Hive on MR3 can use Apache Celeborn for remote shuffle
service.

https://mr3docs.datamonad.com/docs/release/

For the performance of Hive on MR3 1.8, please see a blog article "Hive on
MR3 - from Java 8 to Java 17 (and beating Trino)". On the 10TB TPC-DS
benchmark, Hive on MR3 1.8 finishes all the queries faster than Trino 418.

https://www.datamonad.com/post/2023-12-09-hivemr3-java17-1.8/

Thank you,

--- Sungwoo


Re: Docker Hive using tez without hdfs

2024-01-09 Thread Sungwoo Park
Hello,

I don't have an answer to your problem, but if your goal is to quickly test
Hive 3 using Docker, there is an alternative way which uses Hive on MR3.

https://mr3docs.datamonad.com/docs/quick/docker/

You can also run Hive on MR3 on Kubernetes.

Thanks,

--- Sungwoo



On Wed, Jan 10, 2024 at 3:25 PM Sanjay Gupta  wrote:

> Hi,
> Using following docker container to run meta , hiveserver2
>
> https://hub.docker.com/r/apache/hive
> https://github.com/apache/hive/blob/master/packaging/src/docker/
>
> I have configured hive-site.xml to se S3
> When I set in hive.execution.engine to mr hive-site.xml, hive is
> running fine and I can perform queries but setting to tez fails with
> error.
> There is no hdfs but it is running in local mode.
>
> 
> hive.execution.engine
> tez
> 
>
> Any idea how to fix this issue ?
>
> hive
> SLF4J: Actual binding is of type
> [org.apache.logging.slf4j.Log4jLoggerFactory]
> Hive Session ID = 03368207-1904-4c4c-b63e-b29dd28e0a71
>
> Logging initialized using configuration in
> jar:file:/opt/hive/lib/hive-common-3.1.3.jar!/hive-log4j2.properties
> Async: true
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/tez/dag/api/TezConfiguration
> at
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:661)
> at
> org.apache.hadoop.hive.ql.session.SessionState.beginStart(SessionState.java:591)
> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:747)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:683)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:308)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:222)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.tez.dag.api.TezConfiguration
> at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
>
>
> --
>
> Thanks
> Sanjay Gupta
>


Re: Docker Hive using tez without hdfs

2024-01-09 Thread Sungwoo Park
As far as I know, Hive-Tez supports local mode, but does not support standalone
mode (as Spark does). Hive-MR3 supports standalone mode, so you can run it in
any type of cluster.

--- Sungwoo

On Wed, Jan 10, 2024 at 4:22 PM Sanjay Gupta  wrote:

> I can run hive with mr engine in local mode. Does Hive + Tez also
> works in standalone mode ?
>
> On Tue, Jan 9, 2024 at 11:08 PM Sungwoo Park  wrote:
> >
> > Hello,
> >
> > I don't have an answer to your problem, but if your goal is to quickly
> test Hive 3 using Docker, there is an alternative way which uses Hive on
> MR3.
> >
> > https://mr3docs.datamonad.com/docs/quick/docker/
> >
> > You can also run Hive on MR3 on Kubernetes.
> >
> > Thanks,
> >
> > --- Sungwoo
> >
> >
> >
> > On Wed, Jan 10, 2024 at 3:25 PM Sanjay Gupta  wrote:
> >>
> >> Hi,
> >> Using following docker container to run meta , hiveserver2
> >>
> >> https://hub.docker.com/r/apache/hive
> >> https://github.com/apache/hive/blob/master/packaging/src/docker/
> >>
> >> I have configured hive-site.xml to se S3
> >> When I set in hive.execution.engine to mr hive-site.xml, hive is
> >> running fine and I can perform queries but setting to tez fails with
> >> error.
> >> There is no hdfs but it is running in local mode.
> >>
> >> 
> >> hive.execution.engine
> >> tez
> >> 
> >>
> >> Any idea how to fix this issue ?
> >>
> >> hive
> >> SLF4J: Actual binding is of type
> [org.apache.logging.slf4j.Log4jLoggerFactory]
> >> Hive Session ID = 03368207-1904-4c4c-b63e-b29dd28e0a71
> >>
> >> Logging initialized using configuration in
> >> jar:file:/opt/hive/lib/hive-common-3.1.3.jar!/hive-log4j2.properties
> >> Async: true
> >> Exception in thread "main" java.lang.NoClassDefFoundError:
> >> org/apache/tez/dag/api/TezConfiguration
> >> at
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:661)
> >> at
> org.apache.hadoop.hive.ql.session.SessionState.beginStart(SessionState.java:591)
> >> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:747)
> >> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:683)
> >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> >> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >> at java.lang.reflect.Method.invoke(Method.java:498)
> >> at org.apache.hadoop.util.RunJar.run(RunJar.java:308)
> >> at org.apache.hadoop.util.RunJar.main(RunJar.java:222)
> >> Caused by: java.lang.ClassNotFoundException:
> >> org.apache.tez.dag.api.TezConfiguration
> >> at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
> >> at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
> >> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
> >> at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
> >>
> >>
> >> --
> >>
> >> Thanks
> >> Sanjay Gupta
>
>
>
> --
>
> Thanks
> Sanjay Gupta
>


MR3 1.9 and performance evaluation of Trino 435 and Hive-MR3 1.9 using TPC-DS

2024-01-08 Thread Sungwoo Park
Hello Hive users,

MR3 1.9 has been released. For changes, please see the release notes:

https://mr3docs.datamonad.com/docs/release/
https://mr3docs.datamonad.com/docs/release/#patches-backported-in-mr3-19

We evaluated the performance of Trino 435 and Hive on MR3 1.9 using the
TPC-DS benchmark. Please see the blog article:

https://www.datamonad.com/post/2024-01-07-trino-hive-performance-1.9/

Thanks,

--- Sungwoo


Re: CachedStore for hive.metastore.rawstore.impl in Hive 3.0

2024-02-29 Thread Sungwoo Park
Thank you for sharing the result. (Does your result imply that HIVE-14187
is introducing an intended bug?)

Another issue that could be of interest to you is the connection leak problem
reported in HIVE-20600. Do you see the connection leak problem, or is it
not relevant to your environment (e.g., because you don't use HiveServer2)?

--- Sungwoo

On Fri, Mar 1, 2024 at 9:45 AM Takanobu Asanuma 
wrote:

> Hi Pau and Sungwoo,
>
> Thanks for sharing the information.
>
> We tested a set of simple queries which just referenced the Hive table and
> didn't execute any Hive jobs. The result is below.
>
> No.  Version  rawstore.impl  connectionPoolingType  HIVE-14187   QueryTime
> ---------------------------------------------------------------------------
> 1    1.2.1    ObjectStore    None                   Not Applied  11:38
> 2    3.1.3    ObjectStore    None                   Applied      34:00
> 3    3.1.3    CachedStore    None                   Applied      25:00
> 4    3.1.3    ObjectStore    HikariCP               Applied      21:10
> 5    3.1.3    CachedStore    HikariCP               Applied      14:30
> 6    3.1.3    ObjectStore    None                   Reverted     13:00
> 7    3.1.3    ObjectStore    HikariCP               Reverted     11:23
> ---------------------------------------------------------------------------
>
> Initially, we encountered an issue of Hive MetaStore slowness when we
> upgraded from environment No.1 to No.2. As shown in the table, environment
> No.2 showed the worst test results.
>
> A unique aspect of our environment is that we don't use connection
> pooling. After some investigation, we thought that the combination of
> HIVE-14187 and connectionPoolingType=None was negatively impacting
> performance.
> The fastest case in our tests was when we reverted HIVE-14187 and set
> connectionPoolingType=HikariCP (see No.7). Even with connectionPoolingType
> set to None, the environment where we reverted HIVE-14187 still performed
> reasonably well (see No.6).
>
> Please note our investigation is still ongoing and we haven't yet come to
> a conclusion.
>
> Regards,
> - Takanobu
>
> On Thu, Feb 29, 2024 at 12:18 PM Sungwoo Park  wrote:
>
>> We didn't make any other attempt to fix the problem and just decided not
>> to use CachedStore. However, I think our installation of Metastore based on
>> Hive 3.1.3 is running without any serious problems.
>>
>> Could you share how long it takes to compile typical queries in your
>> environment (with Hive 1 and with Hive 3)?
>>
>> FYI, in our environment, sometimes it takes about 10 seconds to compile a
>> query on TPC-DS 10TB datasets. Specifically, the average compilation time
>> of 103 queries is 1.7 seconds (as reported by Hive), and the longest
>> compilation time is 9.6 seconds (query 49). The compilation time includes
>> the time for accessing Metastore.
>>
>> Thanks,
>>
>> --- Sungwoo
>>
>>
>> On Wed, Feb 28, 2024 at 9:59 PM Takanobu Asanuma 
>> wrote:
>>
>>> Thanks for your detailed answer!
>>>
>>> In the original email, you reported "the query compilation takes long"
>>> in Hive 3.0, but has this issue been resolved in your fork of Hive 3.1.3?
>>> Thank you for sharing the issue with CachedStore and the JIRA tickets.
>>> I will also try out metastore.stats.fetch.bitvector=true.
>>>
>>> Regards,
>>> - Takanobu
>>>
>>> On Wed, Feb 28, 2024 at 6:49 PM Sungwoo Park  wrote:
>>>
>>>> Hello Takanobu,
>>>>
>>>> We did not test with vanilla Hive 3.1.3 and Metastore databases can be
>>>> different, so I don't know why Metastore responses are very slow. I can
>>>> only share some results of testing CachedStore in Metastore. Please note
>>>> that we did not use vanilla Hive 3.1.3 and instead used our own fork of
>>>> Hive 3.1.3 (which applies many additional patches).
>>>>
>>>> 1.
>>>> When CachedStore is enabled, column stats are not computed. As a
>>>> result, some queries generate very inefficient plans because of
>>>> wrong/inaccurate stats.
>>>>
>>>> Perhaps this is because not all patches for CachedStore have been
>>>> merged to Hive 3.1.3. For example, these patches are not merged. Or, there
>>>> might be some way to properly configure CachedStore so that it correctly
>>>> computes column stats.
>>>>
>>>> HIVE-20896: CachedStore fail to cache stats in multiple code paths
>>>> HIVE-21063: Support statistics in cachedStore for transactional table
>>>> HIVE-24258: Data mismatch between CachedStore and ObjectStore for
>>>> constraint
>>>>
>>>

Re: CachedStore for hive.metastore.rawstore.impl in Hive 3.0

2024-02-28 Thread Sungwoo Park
Hello Takanobu,

We did not test with vanilla Hive 3.1.3 and Metastore databases can be
different, so I don't know why Metastore responses are very slow. I can
only share some results of testing CachedStore in Metastore. Please note
that we did not use vanilla Hive 3.1.3 and instead used our own fork of
Hive 3.1.3 (which applies many additional patches).

1.
When CachedStore is enabled, column stats are not computed. As a result,
some queries generate very inefficient plans because of wrong/inaccurate
stats.

Perhaps this is because not all patches for CachedStore have been merged to
Hive 3.1.3. For example, these patches are not merged. Or, there might be
some way to properly configure CachedStore so that it correctly computes
column stats.

HIVE-20896: CachedStore fail to cache stats in multiple code paths
HIVE-21063: Support statistics in cachedStore for transactional table
HIVE-24258: Data mismatch between CachedStore and ObjectStore for constraint

So, we decided that CachedStore should not be enabled in Hive 3.1.3.
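
In other words, we keep the default store. A minimal sketch of the
hive-site.xml entry on the Metastore side (both class names appear verbatim
in the original report quoted below):

  <!-- sketch: keep the default ObjectStore instead of CachedStore -->
  <property>
    <name>hive.metastore.rawstore.impl</name>
    <value>org.apache.hadoop.hive.metastore.ObjectStore</value>
  </property>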

(If anyone is running Hive Metastore 3.1.3 in production with CachedStore
enabled, please let us know how you configure it.)

2.
Setting metastore.stats.fetch.bitvector=true can also help generate more
efficient query plans.
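
As a sketch, again as a hive-site.xml entry (we assume the Metastore service
picks up this key; please verify in your own environment):

  <!-- sketch: value taken from the suggestion above; verify before relying on it -->
  <property>
    <name>metastore.stats.fetch.bitvector</name>
    <value>true</value>
  </property>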

--- Sungwoo


On Wed, Feb 28, 2024 at 1:40 PM Takanobu Asanuma 
wrote:

> Hi Sungwoo Park,
>
> I'm sorry for the late reply to this old email.
> We are attempting to upgrade Hive MetaStore from Hive1 to Hive3, and
> noticed that the response of the Hive3 MetaStore is very slow.
> We suspect that HIVE-14187 might be causing this slowness.
> Could you tell me if you have resolved this problem? Are there still any
> problems when you enable CachedStore?
>
> Regards,
> - Takanobu
>
> On Wed, Jun 13, 2018 at 12:37 AM Sungwoo Park  wrote:
>
>> Hello Hive users,
>>
>> I am experiencing a problem with MetaStore in Hive 3.0.
>>
>> 1. Start MetaStore
>> with 
>> hive.metastore.rawstore.impl=org.apache.hadoop.hive.metastore.ObjectStore.
>>
>> 2. Generate TPC-DS data.
>>
>> 3. TPC-DS queries run okay and produce correct results. E.g., from query
>> 1:
>> +---+
>> |   c_customer_id   |
>> +---+
>> | CHAA  |
>> | DCAA  |
>> | DDAA  |
>> ...
>> | AAAILIAA  |
>> +---+
>> 100 rows selected (69.901 seconds)
>>
>> However, the query compilation takes long (
>> https://issues.apache.org/jira/browse/HIVE-16520).
>>
>> 4. Now, restart MetaStore with
>> hive.metastore.rawstore.impl=org.apache.hadoop.hive.metastore.cache.CachedStore.
>>
>> 5. TPC-DS queries run okay, but produce wrong results. E.g, from query 1:
>> ++
>> | c_customer_id  |
>> ++
>> ++
>> No rows selected (37.448 seconds)
>>
>> What I noticed is that with hive.metastore.rawstore.impl=CachedStore,
>> HiveServer2 produces such log messages:
>>
>> 2018-06-12T23:50:04,223  WARN [b3041385-0290-492f-aef8-c0249de328ad
>> HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats for
>> tpcds_bin_partitioned_orc_1000@date_dim, Columns: d_date_sk, d_year
>> 2018-06-12T23:50:04,223  INFO [b3041385-0290-492f-aef8-c0249de328ad
>> HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for
>> tpcds_bin_partitioned_orc_1000@date_dim, Columns: d_date_sk, d_year
>> 2018-06-12T23:50:04,225  WARN [b3041385-0290-492f-aef8-c0249de328ad
>> HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats for
>> tpcds_bin_partitioned_orc_1000@store, Columns: s_state, s_store_sk
>> 2018-06-12T23:50:04,225  INFO [b3041385-0290-492f-aef8-c0249de328ad
>> HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for
>> tpcds_bin_partitioned_orc_1000@store, Columns: s_state, s_store_sk
>> 2018-06-12T23:50:04,226  WARN [b3041385-0290-492f-aef8-c0249de328ad
>> HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats for
>> tpcds_bin_partitioned_orc_1000@customer, Columns: c_customer_sk,
>> c_customer_id
>> 2018-06-12T23:50:04,226  INFO [b3041385-0290-492f-aef8-c0249de328ad
>> HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for
>> tpcds_bin_partitioned_orc_1000@customer, Columns: c_customer_sk,
>> c_customer_id
>>
>> 2018-06-12T23:50:05,158 ERROR [b3041385-0290-492f-aef8-c0249de328ad
>> HiveServer2-Handler-Pool: Thread-59] annotation.StatsRulesProcFactory:
>> Invalid column stats: No of nulls > cardinality
>> 2018-06-12T23:50:05,159 ERROR [b3041385-0290-492f-aef8-c0249de328ad
>> HiveServer2-Handler-Pool: Thread-59] annotation.StatsRulesProcFactory:
>> Inva

Re: CachedStore for hive.metastore.rawstore.impl in Hive 3.0

2024-02-28 Thread Sungwoo Park
We didn't make any other attempt to fix the problem and just decided not to
use CachedStore. However, I think our installation of Metastore based on
Hive 3.1.3 is running without any serious problems.

Could you share how long it takes to compile typical queries in your
environment (with Hive 1 and with Hive 3)?

FYI, in our environment, sometimes it takes about 10 seconds to compile a
query on TPC-DS 10TB datasets. Specifically, the average compilation time
of 103 queries is 1.7 seconds (as reported by Hive), and the longest
compilation time is 9.6 seconds (query 49). The compilation time includes
the time for accessing Metastore.

Thanks,

--- Sungwoo


On Wed, Feb 28, 2024 at 9:59 PM Takanobu Asanuma 
wrote:

> Thanks for your detailed answer!
>
> In the original email, you reported "the query compilation takes long" in
> Hive 3.0, but has this issue been resolved in your fork of Hive 3.1.3?
> Thank you for sharing the issue with CachedStore and the JIRA tickets.
> I will also try out metastore.stats.fetch.bitvector=true.
>
> Regards,
> - Takanobu
>
> On Wed, Feb 28, 2024 at 6:49 PM Sungwoo Park  wrote:
>
>> Hello Takanobu,
>>
>> We did not test with vanilla Hive 3.1.3 and Metastore databases can be
>> different, so I don't know why Metastore responses are very slow. I can
>> only share some results of testing CachedStore in Metastore. Please note
>> that we did not use vanilla Hive 3.1.3 and instead used our own fork of
>> Hive 3.1.3 (which applies many additional patches).
>>
>> 1.
>> When CachedStore is enabled, column stats are not computed. As a result,
>> some queries generate very inefficient plans because of wrong/inaccurate
>> stats.
>>
>> Perhaps this is because not all patches for CachedStore have been merged
>> to Hive 3.1.3. For example, these patches are not merged. Or, there might
>> be some way to properly configure CachedStore so that it correctly computes
>> column stats.
>>
>> HIVE-20896: CachedStore fail to cache stats in multiple code paths
>> HIVE-21063: Support statistics in cachedStore for transactional table
>> HIVE-24258: Data mismatch between CachedStore and ObjectStore for
>> constraint
>>
>> So, we decided that CachedStore should not be enabled in Hive 3.1.3.
>>
>> (If anyone is running Hive Metastore 3.1.3 in production with CachedStore
>> enabled, please let us know how you configure it.)
>>
>> 2.
>> Setting metastore.stats.fetch.bitvector=true can also help generate more
>> efficient query plans.
>>
>> --- Sungwoo
>>
>>
>> On Wed, Feb 28, 2024 at 1:40 PM Takanobu Asanuma 
>> wrote:
>>
>>> Hi Sungwoo Park,
>>>
>>> I'm sorry for the late reply to this old email.
>>> We are attempting to upgrade Hive MetaStore from Hive1 to Hive3, and
>>> noticed that the response of the Hive3 MetaStore is very slow.
>>> We suspect that HIVE-14187 might be causing this slowness.
>>> Could you tell me if you have resolved this problem? Are there still any
>>> problems when you enable CachedStore?
>>>
>>> Regards,
>>> - Takanobu
>>>
>>> On Wed, Jun 13, 2018 at 12:37 AM Sungwoo Park  wrote:
>>>
>>>> Hello Hive users,
>>>>
>>>> I am experiencing a problem with MetaStore in Hive 3.0.
>>>>
>>>> 1. Start MetaStore
>>>> with 
>>>> hive.metastore.rawstore.impl=org.apache.hadoop.hive.metastore.ObjectStore.
>>>>
>>>> 2. Generate TPC-DS data.
>>>>
>>>> 3. TPC-DS queries run okay and produce correct results. E.g., from
>>>> query 1:
>>>> +---+
>>>> |   c_customer_id   |
>>>> +---+
>>>> | CHAA  |
>>>> | DCAA  |
>>>> | DDAA  |
>>>> ...
>>>> | AAAILIAA  |
>>>> +---+
>>>> 100 rows selected (69.901 seconds)
>>>>
>>>> However, the query compilation takes long (
>>>> https://issues.apache.org/jira/browse/HIVE-16520).
>>>>
>>>> 4. Now, restart MetaStore with
>>>> hive.metastore.rawstore.impl=org.apache.hadoop.hive.metastore.cache.CachedStore.
>>>>
>>>> 5. TPC-DS queries run okay, but produce wrong results. E.g, from query
>>>> 1:
>>>> ++
>>>> | c_customer_id  |
>>>> ++
>>>> ++
>>>> No rows selected (37.448 seconds)
>>>>
>>>> What I noticed is that with hive.metastore.

Hive-MR3 1.10 released

2024-03-19 Thread Sungwoo Park
Hello Hive users,

We have released Hive on MR3 1.10. MR3 is an execution engine similar to
MapReduce and Tez, and it supports Hadoop, Kubernetes, and standalone mode.
Hive-MR3 uses MR3 as the execution backend for Hive 3.1.3. If you are
interested, please give it a try.

In MR3 1.10, we have rewritten the shuffle library in Tez. In the previous
version, each task managed its fetchers independently of the others. Now all
fetchers inside a container are managed by a common shuffle server.

For those interested in performance comparison, here are the latest results
of testing Hive-MR3 1.9/1.10, Trino 435, and Spark 3.4.1 using the
(original) TPC-DS benchmark with 10TB scale. All the systems were tested
with Java 17.

Hive-MR3 1.9: total 6473 seconds, geo-mean 25.0 seconds.
Hive-MR3 1.10: total 6138 seconds, geo-mean 24.4 seconds.
Trino 435: total 6950 seconds, geo-mean 19.2 seconds. Query 23 returns
wrong results. Query 72 fails.
Spark 3.4.1 (using Parquet instead of ORC): total 19044 seconds, geo-mean
35.9 seconds.

Thank you,

--- Sungwoo


Re: [ANNOUNCE] Apache Hive 4.0.0 Released

2024-04-04 Thread Sungwoo Park
Congratulations and huge thanks to Apache Hive team and contributors for
releasing Hive 4. We have been watching the development of Hive 4 since the
release of Hive 3.1, and it's truly satisfying to witness the resolution of
all the critical issues at last after 5 years. Hive 4 comes with a lot of
new great features, and our initial performance benchmarking indicates that
it comes with a significant improvement over Hive 3 in terms of speed.

--- Sungwoo

On Wed, Apr 3, 2024 at 10:30 PM Okumin  wrote:

> I'm really excited to see the news! I can easily imagine the
> difficulty of testing and shipping Hive 4.0.0 with more than 5k
> commits. I'm proud to have witnessed this moment here.
>
> Thank you!
>
> On Wed, Apr 3, 2024 at 3:07 AM Naveen Gangam  wrote:
> >
> > Thank you for the tremendous amount of work put in by many many folks to
> make this release happen, including projects hive is dependent upon like
> tez.
> >
> > Thank you to all the PMC members, committers and contributors for all
> the work over the past 5+ years in shaping this release.
> >
> > THANK YOU!!!
> >
> > On Sun, Mar 31, 2024 at 8:54 AM Battula, Brahma Reddy 
> wrote:
> >>
> >> Thank you for your hard work and dedication in releasing Apache Hive
> version 4.0.0.
> >>
> >>
> >>
> >> Congratulations to the entire team on this achievement. Keep up the
> great work!
> >>
> >>
> >>
> >> Does this consider as GA.?
> >>
> >>
> >>
> >> And Looks we need to update in the following location also.?
> >>
> >> https://hive.apache.org/general/downloads/
> >>
> >>
> >>
> >>
> >>
> >> From: Denys Kuzmenko 
> >> Date: Saturday, March 30, 2024 at 00:07
> >> To: user@hive.apache.org , d...@hive.apache.org <
> d...@hive.apache.org>
> >> Subject: [ANNOUNCE] Apache Hive 4.0.0 Released
> >>
> >> The Apache Hive team is proud to announce the release of Apache Hive
> >>
> >> version 4.0.0.
> >>
> >>
> >>
> >> The Apache Hive (TM) data warehouse software facilitates querying and
> >>
> >> managing large datasets residing in distributed storage. Built on top
> >>
> >> of Apache Hadoop (TM), it provides, among others:
> >>
> >>
> >>
> >> * Tools to enable easy data extract/transform/load (ETL)
> >>
> >>
> >>
> >> * A mechanism to impose structure on a variety of data formats
> >>
> >>
> >>
> >> * Access to files stored either directly in Apache HDFS (TM) or in other
> >>
> >>   data storage systems such as Apache HBase (TM)
> >>
> >>
> >>
> >> * Query execution via Apache Hadoop MapReduce, Apache Tez and Apache
> Spark frameworks. (MapReduce is deprecated, and Spark has been removed so
> the text needs to be modified depending on the release version)
> >>
> >>
> >>
> >> For Hive release details and downloads, please visit:
> >>
> >> https://hive.apache.org/downloads.html
> >>
> >>
> >>
> >> Hive 4.0.0 Release Notes are available here:
> >>
> >>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12343343=Text=12310843
> >>
> >>
> >>
> >> We would like to thank the many contributors who made this release
> >>
> >> possible.
> >>
> >>
> >>
> >> Regards,
> >>
> >>
> >>
> >> The Apache Hive Team
>