Re: Trouble trying to get started with hive

2016-07-11 Thread Jörn Franke
Please use a Hadoop distribution to avoid these configuration issues (in the 
beginning).

> On 05 Jul 2016, at 12:06, Kari Pahula  wrote:
> 
> Hi. I'm trying to familiarize myself with Hadoop and various projects related 
> to it.
> 
> I've been following 
> https://cwiki.apache.org/confluence/display/Hive/GettingStarted
> 
> I'd like to start by giving a bit of feedback from the bits I have got to 
> work.
> 
> Should the document tell about using schematool?  The first hive command I 
> tried was hive, since that was what the guide had as the first command. That 
> didn't work without running schematool first, but I got more errors when I 
> tried schematool since hive had already created the metastore_db directory 
> and schematool got confused by that. I had to look up what went wrong and 
> found my answers on Stack Exchange. In the end, I removed metastore_db and 
> ran "bin/schematool -initSchema -dbType derby".
> 
> After that, the next problem I ran into was, when I tried to connect to the 
> database with beeline:
> Error: Failed to open new session: java.lang.RuntimeException: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException):
>  User: kaol is not allowed to impersonate anonymous (state=,code=0)
> 
> Again, I had to go and figure that I needed to have a conf/core-site.xml like 
> this:
> 
> <configuration>
>   <property>
>     <name>hadoop.proxyuser.hive.groups</name>
>     <value>*</value>
>   </property>
>   <property>
>     <name>hadoop.proxyuser.hive.hosts</name>
>     <value>*</value>
>   </property>
> </configuration>
> 
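> With that in place, a beeline connection along these lines should work
> (10000 being the default HiveServer2 port, the user name as in the error
> above):
> 
>   bin/beeline -u jdbc:hive2://localhost:10000 -n kaol
> 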
> Should this have worked out of the box?
> 
> This is as far as I've gotten and now I'm stumped with trying to create a 
> table. My command is "create table users (user_id int, item_id int, rating 
> int, stamp int);" but all I get as a response is
> Error: Error while processing statement: FAILED: Execution Error, return code 
> 1 from org.apache.hadoop.hive.ql.exec.DDLTask. 
> MetaException(message:file:/user/hive/warehouse/users is not a directory or 
> unable to create one) (state=08S01,code=1)
> 
> I've searched for this error and what I've found is that it's a permission 
> error. However, I've still not found what permissions I should give 
> /user/hive/warehouse to make it work. The last thing I tried was making the 
> directory world-writable, to no effect. What could I try next?
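> 
> For reference, the setup commands the GettingStarted guide itself gives for
> the warehouse directory are these (they target HDFS; the file:/ scheme in my
> error suggests Hive is pointing at the local filesystem instead):
> 
>   $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
>   $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
>   $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
>   $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse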


Re: Trouble trying to get started with hive

2016-07-11 Thread Shannon Ladymon
Kari,

Perhaps the Getting Started doc should be updated.  What information did
you find missing/incorrect?  Have you been able to get it working?

-Shannon

On Tue, Jul 5, 2016 at 3:06 AM, Kari Pahula  wrote:

> Hi. I'm trying to familiarize myself with Hadoop and various projects
> related to it.
>
> I've been following
> https://cwiki.apache.org/confluence/display/Hive/GettingStarted
>
> I'd like to start by giving a bit of feedback from the bits I have got to
> work.
>
> Should the document tell about using schematool?  The first hive command I
> tried was hive, since that was what the guide had as the first command.
> That didn't work without running schematool first, but I got more errors
> when I tried schematool since hive had already created the metastore_db
> directory and schematool got confused by that. I had to look up what went
> wrong and found my answers on Stack Exchange. In the end, I removed
> metastore_db and ran "bin/schematool -initSchema -dbType derby".
>
> After that, the next problem I ran into was, when I tried to connect to
> the database with beeline:
> Error: Failed to open new session: java.lang.RuntimeException:
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException):
> User: kaol is not allowed to impersonate anonymous (state=,code=0)
>
> Again, I had to go and figure that I needed to have a conf/core-site.xml
> like this:
> <configuration>
>   <property>
>     <name>hadoop.proxyuser.hive.groups</name>
>     <value>*</value>
>   </property>
>   <property>
>     <name>hadoop.proxyuser.hive.hosts</name>
>     <value>*</value>
>   </property>
> </configuration>
>
> Should this have worked out of the box?
>
> This is as far as I've gotten and now I'm stumped with trying to create a
> table. My command is "create table users (user_id int, item_id int, rating
> int, stamp int);" but all I get as a response is
> Error: Error while processing statement: FAILED: Execution Error, return
> code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
> MetaException(message:file:/user/hive/warehouse/users is not a directory or
> unable to create one) (state=08S01,code=1)
>
> I've searched for this error and what I've found is that it's a permission
> error. However, I've still not found what permissions I should give
> /user/hive/warehouse to make it work. The last thing I tried was making the
> directory world-writable, to no effect. What could I try next?
>


Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Mich Talebzadeh
Another point with Hive on Spark and Hive on Tez + LLAP; I am thinking aloud
:)


   1. I am using Hive on Spark and I have a table of 10GB, say, with 100
   users concurrently accessing the same partition of an ORC table (last one
   hour or so)
   2. Spark takes data and puts it in memory. I gather only data for that
   partition will be loaded for 100 users. In other words there will be 100
   copies.
   3. Spark, unlike an RDBMS, does not have the notion of a hot cache or Most
   Recently Used (MRU) / Least Recently Used (LRU) lists. So once a user
   finishes, the data is released from Spark memory. The next user will load
   that data again. Potentially this is somewhat wasteful of resources?
   4. With Tez we only have DAG. It is MR with DAG. So the same algorithm
   will be applied to 100 user sessions, but with no in-memory usage
   5. If I add LLAP, will that be more efficient in terms of memory usage
   compared to Hive or not? Will it keep the data in memory for reuse or not?
   6. What I don't understand is what makes Tez and LLAP more efficient
   compared to Spark!

Cheers

Dr Mich Talebzadeh



LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 11 July 2016 at 21:54, Mich Talebzadeh  wrote:

> In my test I did a like-for-like comparison, keeping the setup the same, namely:
>
>
>1. Table was a parquet table of 100 Million rows
>2. The same set up was used for both Hive on Spark and Hive on MR
>3. Spark was very impressive compared to MR on this particular test.
>
>
> Just to see any issues I created an ORC table in the image of the Parquet
> table (insert/select from Parquet to ORC) with stats updated for columns etc.
>
> These were the results of the same run using ORC table this time:
>
> hive> select max(id) from oraclehadoop.dummy;
>
> Starting Spark Job = b886b869-5500-4ef7-aab9-ae6fb4dad22b
> Query Hive on Spark job[1] stages:
> 2
> 3
> Status: Running (Hive on Spark job[1])
> Job Progress Format
> CurrentTime StageId_StageAttemptId:
> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
> [StageCost]
> 2016-07-11 21:35:45,020 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:48,033 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:51,046 Stage-2_0: 1(+8)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:52,050 Stage-2_0: 3(+8)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:53,055 Stage-2_0: 8(+4)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:54,060 Stage-2_0: 11(+1)/23Stage-3_0: 0/1
> 2016-07-11 21:35:55,065 Stage-2_0: 12(+0)/23Stage-3_0: 0/1
> 2016-07-11 21:35:56,071 Stage-2_0: 12(+8)/23Stage-3_0: 0/1
> 2016-07-11 21:35:57,076 Stage-2_0: 13(+8)/23Stage-3_0: 0/1
> 2016-07-11 21:35:58,081 Stage-2_0: 20(+3)/23Stage-3_0: 0/1
> 2016-07-11 21:35:59,085 Stage-2_0: 23/23 Finished   Stage-3_0: 0(+1)/1
> 2016-07-11 21:36:00,089 Stage-2_0: 23/23 Finished   Stage-3_0: 1/1
> Finished
> Status: Finished successfully in 16.08 seconds
> OK
> 1
> Time taken: 17.775 seconds, Fetched: 1 row(s)
>
> Repeat with MR engine
>
> hive> set hive.execution.engine=mr;
> Hive-on-MR is deprecated in Hive 2 and may not be available in the future
> versions. Consider using a different execution engine (i.e. spark, tez) or
> using Hive 1.X releases.
>
> hive> select max(id) from oraclehadoop.dummy;
> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in
> the future versions. Consider using a different execution engine (i.e.
> spark, tez) or using Hive 1.X releases.
> Query ID = hduser_20160711213100_8dc2afae-8644-4097-ba33-c7bd3c304bf8
> Total jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=
> In order to set a constant number of reducers:
>   set mapreduce.job.reduces=
> Starting Job = job_1468226887011_0008, Tracking URL =
> http://rhes564:8088/proxy/application_1468226887011_0008/
> Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill
> job_1468226887011_0008
> Hadoop job information for Stage-1: number of mappers: 23; number of
> reducers: 1
> 2016-07-11 21:37:00,061 Stage-1 map = 0%,  reduce = 0%
> 2016-07-11 21:37:06,440 Stage-1 map = 4%,  reduce = 0%, Cumulative CPU
> 16.48 sec
> 2016-07-11 21:37:14,751 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU
> 40.63 sec
> 2016-07-11 21:37:22,048 Stage-1 map = 13%,  reduce = 0%, 

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Mich Talebzadeh
In my test I did a like-for-like comparison, keeping the setup the same, namely:


   1. Table was a parquet table of 100 Million rows
   2. The same set up was used for both Hive on Spark and Hive on MR
   3. Spark was very impressive compared to MR on this particular test.


Just to see any issues I created an ORC table in the image of the Parquet
table (insert/select from Parquet to ORC) with stats updated for columns etc.

These were the results of the same run using ORC table this time:

hive> select max(id) from oraclehadoop.dummy;

Starting Spark Job = b886b869-5500-4ef7-aab9-ae6fb4dad22b
Query Hive on Spark job[1] stages:
2
3
Status: Running (Hive on Spark job[1])
Job Progress Format
CurrentTime StageId_StageAttemptId:
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
[StageCost]
2016-07-11 21:35:45,020 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
2016-07-11 21:35:48,033 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
2016-07-11 21:35:51,046 Stage-2_0: 1(+8)/23 Stage-3_0: 0/1
2016-07-11 21:35:52,050 Stage-2_0: 3(+8)/23 Stage-3_0: 0/1
2016-07-11 21:35:53,055 Stage-2_0: 8(+4)/23 Stage-3_0: 0/1
2016-07-11 21:35:54,060 Stage-2_0: 11(+1)/23Stage-3_0: 0/1
2016-07-11 21:35:55,065 Stage-2_0: 12(+0)/23Stage-3_0: 0/1
2016-07-11 21:35:56,071 Stage-2_0: 12(+8)/23Stage-3_0: 0/1
2016-07-11 21:35:57,076 Stage-2_0: 13(+8)/23Stage-3_0: 0/1
2016-07-11 21:35:58,081 Stage-2_0: 20(+3)/23Stage-3_0: 0/1
2016-07-11 21:35:59,085 Stage-2_0: 23/23 Finished   Stage-3_0: 0(+1)/1
2016-07-11 21:36:00,089 Stage-2_0: 23/23 Finished   Stage-3_0: 1/1
Finished
Status: Finished successfully in 16.08 seconds
OK
1
Time taken: 17.775 seconds, Fetched: 1 row(s)

Repeat with MR engine

hive> set hive.execution.engine=mr;
Hive-on-MR is deprecated in Hive 2 and may not be available in the future
versions. Consider using a different execution engine (i.e. spark, tez) or
using Hive 1.X releases.

hive> select max(id) from oraclehadoop.dummy;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the
future versions. Consider using a different execution engine (i.e. spark,
tez) or using Hive 1.X releases.
Query ID = hduser_20160711213100_8dc2afae-8644-4097-ba33-c7bd3c304bf8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapreduce.job.reduces=
Starting Job = job_1468226887011_0008, Tracking URL =
http://rhes564:8088/proxy/application_1468226887011_0008/
Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill
job_1468226887011_0008
Hadoop job information for Stage-1: number of mappers: 23; number of
reducers: 1
2016-07-11 21:37:00,061 Stage-1 map = 0%,  reduce = 0%
2016-07-11 21:37:06,440 Stage-1 map = 4%,  reduce = 0%, Cumulative CPU
16.48 sec
2016-07-11 21:37:14,751 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU
40.63 sec
2016-07-11 21:37:22,048 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU
58.88 sec
2016-07-11 21:37:30,412 Stage-1 map = 17%,  reduce = 0%, Cumulative CPU
80.72 sec
2016-07-11 21:37:37,707 Stage-1 map = 22%,  reduce = 0%, Cumulative CPU
103.43 sec
2016-07-11 21:37:45,999 Stage-1 map = 26%,  reduce = 0%, Cumulative CPU
125.93 sec
2016-07-11 21:37:54,300 Stage-1 map = 30%,  reduce = 0%, Cumulative CPU
147.17 sec
2016-07-11 21:38:01,538 Stage-1 map = 35%,  reduce = 0%, Cumulative CPU
166.56 sec
2016-07-11 21:38:08,807 Stage-1 map = 39%,  reduce = 0%, Cumulative CPU
189.29 sec
2016-07-11 21:38:17,115 Stage-1 map = 43%,  reduce = 0%, Cumulative CPU
211.03 sec
2016-07-11 21:38:24,363 Stage-1 map = 48%,  reduce = 0%, Cumulative CPU
235.68 sec
2016-07-11 21:38:32,638 Stage-1 map = 52%,  reduce = 0%, Cumulative CPU
258.27 sec
2016-07-11 21:38:40,916 Stage-1 map = 57%,  reduce = 0%, Cumulative CPU
278.44 sec
2016-07-11 21:38:49,206 Stage-1 map = 61%,  reduce = 0%, Cumulative CPU
300.35 sec
2016-07-11 21:38:58,524 Stage-1 map = 65%,  reduce = 0%, Cumulative CPU
322.89 sec
2016-07-11 21:39:07,889 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU
344.8 sec
2016-07-11 21:39:16,151 Stage-1 map = 74%,  reduce = 0%, Cumulative CPU
367.77 sec
2016-07-11 21:39:25,456 Stage-1 map = 78%,  reduce = 0%, Cumulative CPU
391.82 sec
2016-07-11 21:39:33,725 Stage-1 map = 83%,  reduce = 0%, Cumulative CPU
415.48 sec
2016-07-11 21:39:43,037 Stage-1 map = 87%,  reduce = 0%, Cumulative CPU
436.09 sec
2016-07-11 21:39:51,292 Stage-1 map = 91%,  reduce = 0%, Cumulative CPU
459.4 sec
2016-07-11 21:39:59,563 Stage-1 map = 96%,  reduce = 0%, Cumulative CPU
477.92 sec
2016-07-11 21:40:05,760 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU
491.72 sec
2016-07-11 21:40:10,921 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU
499.37 sec
MapReduce Total cumulative CPU time: 8 minutes 19 seconds 370 msec
Ended Job = 

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Gopal Vijayaraghavan

> Status: Finished successfully in 14.12 seconds
> OK
> 1
> Time taken: 14.38 seconds, Fetched: 1 row(s)

That might be an improvement over MR, but that still feels far too slow.


Parquet numbers are in general bad in Hive, but that's because the Parquet
reader gets no actual love from the devs. The community, if it wants to
keep using Parquet heavily, needs a Hive dev to go over to Parquet-mr and
cut a significant number of memory copies out of the reader.

The Spark 2.0 build for instance, has a custom Parquet reader for SparkSQL
which does this. SPARK-12854 does for Spark+Parquet what Hive 2.0 does for
ORC (actually, it looks more like hive's VectorizedRowBatch than
Tungsten's flat layouts).

But that reader cannot be used in Hive-on-Spark, because it is not a
public reader impl.


Not to pick an arbitrary dataset, my workhorse example is a TPC-H lineitem
at 10Gb scale on a single 16-core box.

hive(tpch_flat_orc_10)> select max(l_discount) from lineitem;
Query ID = gopal_20160711175917_f96371aa-2721-49c8-99a0-f7c4a1eacfda
Total jobs = 1
Launching Job 1 out of 1


Status: Running (Executing on YARN cluster with App id
application_1466700718395_0256)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 ..........      llap     SUCCEEDED     13         13        0        0       0       0
Reducer 2 ......      llap     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==>>] 100%  ELAPSED TIME: 0.71 s
----------------------------------------------------------------------------------------------
Status: DAG finished successfully in 0.71 seconds

Query Execution Summary
----------------------------------------------------------------------------------------------
OPERATION                                DURATION
----------------------------------------------------------------------------------------------
Compile Query                               0.21s
Prepare Plan                                0.13s
Submit Plan                                 0.34s
Start DAG                                   0.23s
Run DAG                                     0.71s
----------------------------------------------------------------------------------------------

Task Execution Summary
----------------------------------------------------------------------------------------------
  VERTICES   DURATION(ms)  CPU_TIME(ms)  GC_TIME(ms)  INPUT_RECORDS  OUTPUT_RECORDS
----------------------------------------------------------------------------------------------
 Map 1             604.00             0            0     59,957,438              13
 Reducer 2         105.00             0            0             13               0
----------------------------------------------------------------------------------------------

LLAP IO Summary
----------------------------------------------------------------------------------------------
  VERTICES  ROWGROUPS  META_HIT  META_MISS  DATA_HIT  DATA_MISS  ALLOCATION      USED  TOTAL_IO
----------------------------------------------------------------------------------------------
 Map 1           6036         0        146        0B    68.86MB    491.00MB  479.89MB     7.94s
----------------------------------------------------------------------------------------------

OK
0.1
Time taken: 1.669 seconds, Fetched: 1 row(s)
hive(tpch_flat_orc_10)>


This is running against a single 16-core box & I would assume it would
take <1.4s to read twice as much (13 tasks is barely touching the load
factors).

It would probably be a bit faster if the cache had hits, but in general
14s to read 100M rows is nearly an order of magnitude off where Hive 2.2.0 is.

Cheers,
Gopal

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Mich Talebzadeh
Appreciate all the comments.

Hive on Spark: Spark runs as an execution engine and is only used when you
query Hive; otherwise it is not running. I run it in YARN client mode. Let
me show you an example.

In hive-site.xml set the execution engine to spark. It requires some
configuration but it does work :)

Alternatively, log in to hive and do the settings there:


set hive.execution.engine=spark;
set spark.home=/usr/lib/spark-1.3.1-bin-hadoop2.6;
set spark.master=yarn-client;
set spark.executor.memory=3g;
set spark.driver.memory=3g;
set spark.executor.cores=8;
set spark.ui.port=;
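
The same engine switch can also be passed in from beeline against
HiveServer2, for example (connection URL illustrative, query as below):

beeline -u jdbc:hive2://localhost:10000 \
  --hiveconf hive.execution.engine=spark \
  -e "select max(id) from oraclehadoop.dummy_parquet;"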

Small test ride

First using Hive 2 on Spark 1.3.1 to find max(id) for a 100million rows
parquet table

hive> select max(id) from oraclehadoop.dummy_parquet;

Starting Spark Job = a7752b2b-d73a-45de-aced-ddf02810938d
Query Hive on Spark job[1] stages:
2
3
Status: Running (Hive on Spark job[1])
Job Progress Format
CurrentTime StageId_StageAttemptId:
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
[StageCost]
2016-07-11 17:41:52,386 Stage-2_0: 0(+8)/24 Stage-3_0: 0/1
2016-07-11 17:41:55,409 Stage-2_0: 1(+8)/24 Stage-3_0: 0/1
2016-07-11 17:41:56,420 Stage-2_0: 8(+4)/24 Stage-3_0: 0/1
2016-07-11 17:41:58,434 Stage-2_0: 10(+2)/24Stage-3_0: 0/1
2016-07-11 17:41:59,440 Stage-2_0: 12(+8)/24Stage-3_0: 0/1
2016-07-11 17:42:01,455 Stage-2_0: 17(+7)/24Stage-3_0: 0/1
2016-07-11 17:42:02,462 Stage-2_0: 20(+4)/24Stage-3_0: 0/1
2016-07-11 17:42:04,476 Stage-2_0: 23(+1)/24Stage-3_0: 0/1
2016-07-11 17:42:05,483 Stage-2_0: 24/24 Finished   Stage-3_0: 1/1
Finished

Status: Finished successfully in 14.12 seconds
OK
1
Time taken: 14.38 seconds, Fetched: 1 row(s)

--simply switch the engine in hive to MR

hive> set hive.execution.engine=mr;
Hive-on-MR is deprecated in Hive 2 and may
not be available in the future versions. Consider using a different
execution engine (i.e. spark, tez) or using Hive 1.X releases.

hive> select max(id) from oraclehadoop.dummy_parquet;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the
future versions. Consider using a different execution engine (i.e. spark,
tez) or using Hive 1.X releases.
Starting Job = job_1468226887011_0005, Tracking URL =
http://rhes564:8088/proxy/application_1468226887011_0005/
Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill
job_1468226887011_0005
Hadoop job information for Stage-1: number of mappers: 24; number of
reducers: 1
2016-07-11 17:42:46,904 Stage-1 map = 0%,  reduce = 0%
2016-07-11 17:42:56,328 Stage-1 map = 4%,  reduce = 0%, Cumulative CPU
31.76 sec
2016-07-11 17:43:05,676 Stage-1 map = 8%,  reduce = 0%, Cumulative CPU
61.78 sec
2016-07-11 17:43:16,091 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU
95.44 sec
2016-07-11 17:43:24,419 Stage-1 map = 17%,  reduce = 0%, Cumulative CPU
121.6 sec
2016-07-11 17:43:32,734 Stage-1 map = 21%,  reduce = 0%, Cumulative CPU
149.37 sec
2016-07-11 17:43:41,031 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU
177.62 sec
2016-07-11 17:43:48,305 Stage-1 map = 29%,  reduce = 0%, Cumulative CPU
204.92 sec
2016-07-11 17:43:56,580 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU
235.34 sec
2016-07-11 17:44:05,917 Stage-1 map = 38%,  reduce = 0%, Cumulative CPU
262.18 sec
2016-07-11 17:44:14,222 Stage-1 map = 42%,  reduce = 0%, Cumulative CPU
286.21 sec
2016-07-11 17:44:22,502 Stage-1 map = 46%,  reduce = 0%, Cumulative CPU
310.34 sec
2016-07-11 17:44:32,923 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU
346.26 sec
2016-07-11 17:44:43,301 Stage-1 map = 54%,  reduce = 0%, Cumulative CPU
379.11 sec
2016-07-11 17:44:53,674 Stage-1 map = 58%,  reduce = 0%, Cumulative CPU
417.9 sec
2016-07-11 17:45:04,001 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU
450.73 sec
2016-07-11 17:45:13,327 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU
476.7 sec
2016-07-11 17:45:22,656 Stage-1 map = 71%,  reduce = 0%, Cumulative CPU
508.97 sec
2016-07-11 17:45:33,002 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU
535.69 sec
2016-07-11 17:45:43,355 Stage-1 map = 79%,  reduce = 0%, Cumulative CPU
573.33 sec
2016-07-11 17:45:52,613 Stage-1 map = 83%,  reduce = 0%, Cumulative CPU
605.01 sec
2016-07-11 17:46:02,962 Stage-1 map = 88%,  reduce = 0%, Cumulative CPU
632.38 sec
2016-07-11 17:46:13,316 Stage-1 map = 92%,  reduce = 0%, Cumulative CPU
666.45 sec
2016-07-11 17:46:23,656 Stage-1 map = 96%,  reduce = 0%, Cumulative CPU
693.72 sec
2016-07-11 17:46:31,919 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU
714.71 sec
2016-07-11 17:46:36,060 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU
721.83 sec
MapReduce Total cumulative CPU time: 12 minutes 1 seconds 830 msec
Ended Job = job_1468226887011_0005
MapReduce Jobs Launched:
Stage-Stage-1: Map: 24  Reduce: 1   Cumulative CPU: 721.83 sec   HDFS Read:
400442823 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 12 minutes 1 seconds 830 msec
OK
1
Time taken: 239.532 seconds, Fetched: 1 row(s)


I leave it 

Fast database with writes per second and horizontal scaling

2016-07-11 Thread Ashok Kumar
Hi Gurus,
Advice appreciated from Hive gurus.
My colleague has been using Cassandra. However, he says it is too slow and not 
user friendly. MongoDB as a document database is pretty neat but not fast enough.
My main concern is fast writes per second and good scaling.

Hive on Spark or Tez?
How about HBase, or anything else?
Any expert advice warmly acknowledged.
thanking you

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Michael Segel
Just a clarification. 

Tez is ‘vendor’ independent.  ;-) 

Yeah… I know…  Anyone can support it.  Only Hortonworks has stacked the deck in 
their favor. 

Drill could be in the same boat, although there are now more committers who are 
not working for MapR. I’m not sure who outside of HW is supporting Tez. 

But I digress. 

Here in the Spark user list, I have to ask how do you run hive on spark? Is the 
execution engine … the spark context always running? (Client mode I assume) 
Are the executors always running?   Can you run multiple queries from multiple 
users in parallel? 

These are some of the questions that should be asked and answered when 
considering how viable spark is going to be as the engine under Hive… 

Thx

-Mike

> On May 29, 2016, at 3:35 PM, Mich Talebzadeh  
> wrote:
> 
> thanks. I think the problem is that the TEZ user group is exceptionally quiet. 
> Just sent an email to the Hive user group to see if anyone has managed to build a 
> vendor-independent version.
> 
> 
> Dr Mich Talebzadeh
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
> 
> On 29 May 2016 at 21:23, Jörn Franke  > wrote:
> Well I think it is different from MR. It has some optimizations which you do 
> not find in MR. Especially the LLAP option in Hive2 makes it interesting. 
> 
> I think hive 1.2 works with 0.7 and 2.0 with 0.8 . At least for 1.2 it is 
> integrated in the Hortonworks distribution. 
> 
> 
> On 29 May 2016, at 21:43, Mich Talebzadeh  > wrote:
> 
>> Hi Jorn,
>> 
>> I started building apache-tez-0.8.2 but got a few errors. A couple of guys from 
>> the TEZ user group kindly gave a hand but I could not get very far (or maybe I 
>> did not make enough effort) making it work.
>> 
>> That TEZ user group is very quiet as well.
>> 
>> My understanding is TEZ is MR with DAG but of course Spark has both plus 
>> in-memory capability.
>> 
>> It would be interesting to see what version of TEZ works as execution engine 
>> with Hive.
>> 
>> Vendors are divided on this (use Hive with TEZ) or use Impala instead of 
>> Hive etc as I am sure you already know.
>> 
>> Cheers,
>> 
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>>  
>> 
>> On 29 May 2016 at 20:19, Jörn Franke > > wrote:
>> Very interesting do you plan also a test with TEZ?
>> 
>> On 29 May 2016, at 13:40, Mich Talebzadeh > > wrote:
>> 
>>> Hi,
>>> 
>>> I did another study of Hive using Spark engine compared to Hive with MR.
>>> 
>>> Basically took the original table imported using Sqoop and created and 
>>> populated a new ORC table partitioned by year and month into 48 partitions 
>>> as follows:
>>> 
>>> 
>>> Connections use JDBC via beeline. Now for each partition using MR it takes 
>>> an average of 17 minutes as seen below for each PARTITION. Now that is 
>>> just an individual partition and there are 48 partitions.
>>> 
>>> In contrast doing the same operation with Spark engine took 10 minutes all 
>>> inclusive. I just gave up on MR. You can see the StartTime and FinishTime 
>>> from below
>>> 
>>> 
>>> 
>>> This by no means indicates that Spark is much better than MR, but it shows 
>>> that some very good results can be achieved using the Spark engine.
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> 
>>>  
>>> http://talebzadehmich.wordpress.com 
>>>  
>>> 
>>> On 24 May 2016 at 08:03, Mich Talebzadeh >> > wrote:
>>> Hi,
>>> 
>>> We use Hive as the database and use Spark as an all purpose query tool.
>>> 
>>> Whether Hive is the right database for the purpose or one is better off with 
>>> something like Phoenix on HBase, well the answer is it depends and your 
>>> mileage varies. 
>>> 
>>> So fit for purpose.
>>> 
>>> Ideally what one wants is to use the fastest method to get the results. How 
>>> fast is confined to our SLA agreements in production, and that keeps us 
>>> from unnecessary further work, as we technologists like to play around.
>>> 
>>> So in short, we use Spark most of the time and use Hive as the backend 
>>> engine for data storage, 

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Jörn Franke
I think LLAP should in the future be a general component, so LLAP + Spark can 
make sense. I see Tez and Spark not as competitors; they have different 
purposes. Hive+Tez+LLAP is not the same as Hive+Spark; I think it goes beyond 
that for interactive queries.
Tez - you should use a distribution (eg Hortonworks) - generally I would use a 
distribution for anything related to performance, testing etc., because doing 
your own installation is more complex and more difficult to maintain. Performance 
and also features will be worse if you do not use a distribution. Which one is 
up to your choice.

> On 11 Jul 2016, at 17:09, Mich Talebzadeh  wrote:
> 
> The presentation will go deeper into the topic. Otherwise, some thoughts of 
> mine. Feel free to comment and criticise :) 
> 
> I am a member of Spark Hive and Tez user groups plus one or two others
> Spark is by far the biggest in terms of community interaction
> Tez, typically one thread in a month
> Personally started building Tez for Hive from Tez source and gave up as it 
> was not working. This was my own build as opposed to a distro
> if Hive says you should use Spark or Tez then using Spark is a perfectly 
> valid choice
> If Tez & LLAP offers you a Spark (DAG + in-memory caching) under the bonnet 
> why bother.
> Yes I have seen some test results (Hive on Spark vs Hive on Tez) etc. but 
> they are a bit dated (not being unkind) and cannot be taken as is today. One 
> of their concerns, if I recall, was excessive CPU and memory usage of Spark, 
> but by the same token LLAP will add additional need for resources
> Essentially I am more comfortable using less of a technology stack than more. 
> With Hive and Spark (in this context) we have two. With Hive, Tez and LLAP, 
> we have three stacks to look after, which adds to skill cost as well.
> Yep. It is still good to keep it simple
> 
> My thoughts on this are that if you have a viable open source product like 
> Spark, which is becoming a sort of vogue in the Big Data space and moving very 
> fast, why look for another one? Hive does what it says on the tin and is a 
> good, reliable Data Warehouse.
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
>> On 11 July 2016 at 15:22, Ashok Kumar  wrote:
>> Hi Mich,
>> 
>> Your recent presentation in London on this topic "Running Spark on Hive or 
>> Hive on Spark"
>> 
>> Have you made any more interesting findings that you like to bring up?
>> 
>> If Hive is offering both Spark and Tez in addition to MR, what is stopping one 
>> from using Spark? I still don't get why TEZ + LLAP is going to be a better 
>> choice, from what you mentioned?
>> 
>> thanking you 
>> 
>> 
>> 
>> On Tuesday, 31 May 2016, 20:22, Mich Talebzadeh  
>> wrote:
>> 
>> 
>> Couple of points if I may and kindly bear with my remarks.
>> 
>> Whilst it will be very interesting to try TEZ with LLAP, as I read about LLAP:
>> 
>> "Sub-second queries require fast query execution and low setup cost. The 
>> challenge for Hive is to achieve this without giving up on the scale and 
>> flexibility that users depend on. This requires a new approach using a 
>> hybrid engine that leverages Tez and something new called  LLAP (Live Long 
>> and Process, #llap online).
>> 
>> LLAP is an optional daemon process running on multiple nodes, that provides 
>> the following:
>> Caching and data reuse across queries with compressed columnar data 
>> in-memory (off-heap)
>> Multi-threaded execution including reads with predicate pushdown and hash 
>> joins
>> High throughput IO using Async IO Elevator with dedicated thread and core 
>> per disk
>> Granular column level security across applications
>> "
>> OK so we have added an in-memory capability to TEZ by way of LLAP, in other 
>> words what Spark does already and BTW it does not require a daemon running 
>> on any host. Don't take me wrong. It is interesting but this sounds to me 
>> (without testing myself) adding caching capability to TEZ to bring it on par 
>> with SPARK.
>> 
>> Remember:
>> 
>> Spark -> DAG + in-memory caching
>> TEZ = MR on DAG
>> TEZ + LLAP => DAG + in-memory caching
>> 
>> OK it is another way of getting the same result. However, my concerns:
>> 
>> Spark has a wide user base. I judge this from Spark user group traffic
>> TEZ user group has no traffic I am afraid
>> LLAP I don't know
>> Sounds like Hortonworks promote TEZ and Cloudera does not want to know 
>> anything about Hive, and they promote Impala but that 

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Mich Talebzadeh
The presentation will go deeper into the topic. Otherwise, some thoughts of
mine. Feel free to comment and criticise :)


   1. I am a member of Spark Hive and Tez user groups plus one or two others
   2. Spark is by far the biggest in terms of community interaction
   3. Tez, typically one thread in a month
   4. Personally started building Tez for Hive from Tez source and gave up
   as it was not working. This was my own build as opposed to a distro
   5. if Hive says you should use Spark or Tez then using Spark is a
   perfectly valid choice
   6. If Tez & LLAP offers you a Spark (DAG + in-memory caching) under the
   bonnet why bother.
   7. Yes I have seen some test results (Hive on Spark vs Hive on Tez) etc.
   but they are a bit dated (not being unkind) and cannot be taken as is
   today. One of their concerns, if I recall, was excessive CPU and memory
   usage of Spark, but by the same token LLAP will add additional need for
   resources
   8. Essentially I am more comfortable using less of a technology stack
   than more. With Hive and Spark (in this context) we have two. With Hive,
   Tez and LLAP, we have three stacks to look after, which adds to skill cost
   as well.
   9. Yep. It is still good to keep it simple


My thoughts on this are that if you have a viable open source product like
Spark, which is becoming a sort of vogue in the Big Data space and moving very
fast, why look for another one? Hive does what it says on the tin and is a
good, reliable Data Warehouse.

HTH

Dr Mich Talebzadeh



LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 11 July 2016 at 15:22, Ashok Kumar  wrote:

> Hi Mich,
>
> Your recent presentation in London on this topic "Running Spark on Hive or
> Hive on Spark"
>
> Have you made any more interesting findings that you like to bring up?
>
> If Hive is offering both Spark and Tez in addition to MR, what is stopping
> one from using Spark? I still don't get why TEZ + LLAP is going to be a
> better choice, from what you mentioned?
>
> thanking you
>
>
>
> On Tuesday, 31 May 2016, 20:22, Mich Talebzadeh 
> wrote:
>
>
> Couple of points if I may and kindly bear with my remarks.
>
> Whilst it will be very interesting to try TEZ with LLAP, as I read about
> LLAP:
>
> "Sub-second queries require fast query execution and low setup cost. The
> challenge for Hive is to achieve this without giving up on the scale and
> flexibility that users depend on. This requires a new approach using a
> hybrid engine that leverages Tez and something new called  LLAP (Live Long
> and Process, #llap online).
>
> LLAP is an optional daemon process running on multiple nodes, that
> provides the following:
>
>- Caching and data reuse across queries with compressed columnar data
>in-memory (off-heap)
>- Multi-threaded execution including reads with predicate pushdown and
>hash joins
>- High throughput IO using Async IO Elevator with dedicated thread and
>core per disk
>- Granular column level security across applications
>- "
>
> OK so we have added an in-memory capability to TEZ by way of LLAP, in
> other words what Spark does already and BTW it does not require a daemon
> running on any host. Don't take me wrong. It is interesting but this sounds
> to me (without testing myself) adding caching capability to TEZ to bring it
> on par with SPARK.
>
> Remember:
>
> Spark -> DAG + in-memory caching
> TEZ = MR on DAG
> TEZ + LLAP => DAG + in-memory caching
>
> OK it is another way of getting the same result. However, my concerns:
>
>
>- Spark has a wide user base. I judge this from Spark user group
>traffic
>- TEZ user group has no traffic I am afraid
>- LLAP I don't know
>
> Sounds like Hortonworks promote TEZ and Cloudera does not want to know
> anything about Hive, and they promote Impala, but that sounds like a sinking
> ship these days.
>
> Having said that I will try TEZ + LLAP :) No pun intended
>
> Regards
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
>
> On 31 May 2016 at 08:19, Jörn Franke  wrote:
>
> Thanks very interesting explanation. Looking forward to test it.
>
> > On 31 May 2016, at 07:51, Gopal Vijayaraghavan 
> wrote:
> >
> >
> >> That being said all systems are evolving. 

Re: Hive Metastore on Amazon Aurora

2016-07-11 Thread Elliot West
Hi Mich,

Correct. We have proof of concepts up and running with both MySQL on RDS
and Aurora. We'd be keen to hear of experiences of others with Aurora in a
Hive metastore database role, primarily as a sanity check. In answer to
your specific points:

   1. 30GB
   2. We don't intend to use ACID in this scenario (currently).

For this application we particularly value:

   - Compatibility (with Hive)
   - High availability
   - Scalability
   - Ease of management

Thanks, Elliot.

On 11 July 2016 at 15:15, Mich Talebzadeh  wrote:

> Hi  Elliot,
>
> Am I correct that you want to put your Hive metastore on Amazon? Is the
> metastore (database/schema) sitting on MySQL, and you want to migrate
> your MySQL to the cloud now?
>
> Two questions that need to be verified
>
>
>1. How big is your current metadata?
>2. Do you do a lot of transaction activity using ORC files with
>Insert/Update/Delete that need to communicate with metastore with heartbeat
>etc?
>
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 11 July 2016 at 13:58, Elliot West  wrote:
>
>> Hello,
>>
>> Is anyone running the Hive metastore database on Amazon Aurora?:
>> https://aws.amazon.com/rds/aurora/details/. My expectation is that it
>> should work nicely as it is derived from MySQL but I'd be keen to hear of
>> user's experiences with this setup.
>>
>> Many thanks,
>>
>> Elliot.
>>
>
>


Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Ashok Kumar
Hi Mich,
Your recent presentation in London on this topic "Running Spark on Hive or Hive 
on Spark"
Have you made any more interesting findings that you like to bring up?
If Hive is offering both Spark and Tez in addition to MR, what is stopping one 
from using Spark? I still don't get why TEZ + LLAP is going to be a better choice, 
from what you mentioned?
thanking you 
 

On Tuesday, 31 May 2016, 20:22, Mich Talebzadeh  
wrote:
 

Couple of points if I may, and kindly bear with my remarks.
Whilst it will be very interesting to try TEZ with LLAP, as I read about LLAP:
"Sub-second queries require fast query execution and low setup cost. The 
challenge for Hive is to achieve this without giving up on the scale and 
flexibility that users depend on. This requires a new approach using a hybrid 
engine that leverages Tez and something new called  LLAP (Live Long and 
Process, #llap online).
LLAP is an optional daemon process running on multiple nodes, that provides the 
following:   
   - Caching and data reuse across queries with compressed columnar data 
in-memory (off-heap)
   - Multi-threaded execution including reads with predicate pushdown and hash 
joins
   - High throughput IO using Async IO Elevator with dedicated thread and core 
per disk
   - Granular column level security across applications
   - "
OK so we have added an in-memory capability to TEZ by way of LLAP, in other 
words what Spark does already and BTW it does not require a daemon running on 
any host. Don't take me wrong. It is interesting but this sounds to me (without 
testing myself) adding caching capability to TEZ to bring it on par with SPARK. 
Remember:
Spark -> DAG + in-memory caching
TEZ = MR on DAG
TEZ + LLAP => DAG + in-memory caching
OK it is another way of getting the same result. However, my concerns:
   
   - Spark has a wide user base. I judge this from Spark user group traffic
   - TEZ user group has no traffic I am afraid
   - LLAP I don't know
Sounds like Hortonworks promote TEZ and Cloudera does not want to know anything 
about Hive, and they promote Impala, but that sounds like a sinking ship these 
days.
Having said that I will try TEZ + LLAP :) No pun intended
Regards
Dr Mich Talebzadeh
LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com
On 31 May 2016 at 08:19, Jörn Franke  wrote:

Thanks very interesting explanation. Looking forward to test it.

> On 31 May 2016, at 07:51, Gopal Vijayaraghavan  wrote:
>
>
>> That being said all systems are evolving. Hive supports tez+llap which
>> is basically the in-memory support.
>
> There is a big difference between LLAP & SparkSQL, which has to do
> with access pattern needs.
>
> The first one is related to the lifetime of the cache - the Spark RDD
> cache is per-user-session which allows for further operation in that
> session to be optimized.
>
> LLAP is designed to be hammered by multiple user sessions running
> different queries, designed to automate the cache eviction & selection
> process. There's no user visible explicit .cache() to remember - it's
> automatic and concurrent.
>
> My team works with both engines, trying to improve it for ORC, but the
> goals of both are different.
>
> I will probably have to write a proper academic paper & get it
> edited/reviewed instead of send my ramblings to the user lists like this.
> Still, this needs an example to talk about.
>
> To give a qualified example, let's leave the world of single use clusters
> and take the use-case detailed here
>
> http://hortonworks.com/blog/impala-vs-hive-performance-benchmark/
>
>
> There are two distinct problems there - one is that a single day sees up to
> 100k independent user sessions running queries and that most queries cover
> the last hour (& possibly join/compare against a similar hour aggregate
> from the past).
>
> The problem with having independent 100k user-sessions from different
> connections was that the SparkSQL layer drops the RDD lineage & cache
> whenever a user ends a session.
>
> The scale problem in general for Impala was that even though the data size
> was in multiple terabytes, the actual hot data was approx <20Gb, which
> resides on <10 machines with locality.
>
> The same problem applies when you apply RDD caching with something
> un-replicated like Tachyon/Alluxio, since the same RDD will be so exceedingly
> popular that the machines which hold those blocks run extra hot.
>
> A cache model per-user session is entirely wasteful and a common cache +
> MPP model effectively overloads 2-3% of cluster, while leaving the other
> machines idle.
>
> LLAP was designed specifically to prevent that hotspotting, while
> maintaining the common cache model - within a few minutes after an hour
> ticks over, the whole cluster develops temporal popularity for the hot
> data and nearly every rack has at least one cached copy of the same data

Re: Hive Metastore on Amazon Aurora

2016-07-11 Thread Mich Talebzadeh
Hi  Elliot,

Am I correct that you want to put your Hive metastore on Amazon? Is the
metastore (database/schema) sitting on MySQL, and you want to migrate
your MySQL to the cloud now?

Two questions that need to be verified


   1. How big is your current metadata?
   2. Do you do a lot of transaction activity using ORC files with
   Insert/Update/Delete that need to communicate with metastore with heartbeat
   etc?


HTH


Dr Mich Talebzadeh



LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 11 July 2016 at 13:58, Elliot West  wrote:

> Hello,
>
> Is anyone running the Hive metastore database on Amazon Aurora?:
> https://aws.amazon.com/rds/aurora/details/. My expectation is that it
> should work nicely as it is derived from MySQL but I'd be keen to hear of
> user's experiences with this setup.
>
> Many thanks,
>
> Elliot.
>


Hive Metastore on Amazon Aurora

2016-07-11 Thread Elliot West
Hello,

Is anyone running the Hive metastore database on Amazon Aurora?:
https://aws.amazon.com/rds/aurora/details/. My expectation is that it
should work nicely as it is derived from MySQL but I'd be keen to hear of
user's experiences with this setup.

Many thanks,

Elliot.
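
Since Aurora speaks the MySQL protocol, my working assumption is that the
metastore would be configured exactly as for stock MySQL, along these lines
(endpoint, database and credentials here are illustrative):

# hive-site.xml carries the usual JDBC connection properties, e.g.:
#   javax.jdo.option.ConnectionURL = jdbc:mysql://<aurora-cluster-endpoint>:3306/hive_metastore
#   javax.jdo.option.ConnectionDriverName = com.mysql.jdbc.Driver
#   plus ConnectionUserName / ConnectionPassword
# then the schema is initialised as for any MySQL-backed metastore:
bin/schematool -dbType mysql -initSchema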