Hi Mich,

I think these comparisons are useful. One interesting aspect in this context 
could be hardware scalability; another is different types of computation. 
Furthermore, one could compare Spark and Tez+LLAP as execution engines. My gut 
feeling is that each one can be justified by different use cases.
Nevertheless, such comparisons should always carry a disclaimer, because Spark 
and Hive are not good for many concurrent lookups of single rows. They are 
also not good for frequent writes of small amounts of data (e.g. sensor data); 
here HBase could be more interesting. Other use cases can justify graph 
databases, such as Titan, or text analytics/data matching using Solr on Hadoop.
Finally, even if you have a lot of data, you need to consider whether you 
always have to process all of it. For instance, I have found valid use cases 
in practice where we decided to evaluate 10 machine learning models in 
parallel on only a sample of the data, and then evaluate only the "winning" 
model on the full data set (see the sketch below).
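
A minimal HiveQL sketch of the sampling step (the table name events and the 
1% rate are made up for illustration; TABLESAMPLE with bucketing on rand() is 
standard Hive syntax):

-- materialize a roughly 1% random sample to train the candidate models on
CREATE TABLE events_sample STORED AS ORC AS
SELECT * FROM events TABLESAMPLE (BUCKET 1 OUT OF 100 ON rand()) s;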

As always, it depends :) 

Best regards

P.S.: at least Hortonworks ships Spark 1.5 with Hive 1.2, and Spark 1.6 with 
Hive 1.2, in their distribution. Maybe they have described somewhere how they 
manage to bring the two together. You may also check Apache Bigtop (a 
vendor-neutral distribution) for how they managed to bring both together.

> On 23 May 2016, at 01:42, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> Hi,
>  
> I have done a number of extensive tests using the Spark shell with a Hive 
> database and ORC tables.
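> 
> As a minimal sketch of the kind of table involved (the DDL is illustrative 
> only; the real table's full schema is not shown here, but an ORC table with 
> an id column is what the tests below assume):
> 
> CREATE TABLE oraclehadoop.dummy (id INT)
> STORED AS ORC
> TBLPROPERTIES ("orc.compress"="SNAPPY");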
>  
> Now, one issue that we typically face is, and I quote:
>  
> "Spark is fast as it uses memory and DAGs. Great, but when we save data it 
> is not fast enough."
> 
> OK, but there is a solution now. If you use Spark with Hive and you are on a 
> decent version of Hive (>= 0.14), then you can also deploy Spark as the 
> execution engine for Hive. That will make your application run pretty fast, 
> as you no longer rely on the old MapReduce engine for Hive. In a nutshell, 
> you gain speed in both querying and storage.
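> 
> For reference, switching the engine is a session-level setting in beeline. 
> The property names below are the standard ones for configuring Hive on 
> Spark; the master and resource values are illustrative and depend on your 
> cluster:
> 
> set hive.execution.engine=spark;
> set spark.master=yarn-client;
> set spark.executor.memory=2g;
> set spark.executor.cores=2;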
>  
> I have made some comparisons of this set-up and I am sure some of you will 
> find them useful.
>  
> The version of Spark I use for Spark queries (Spark as a query tool) is 1.6.
> The version of Hive I use is Hive 2.
> The version of Spark I use as the Hive execution engine is 1.3.1. It works, 
> and frankly Spark 1.3.1 as an execution engine is adequate (until we sort 
> out the Hadoop libraries mismatch).
>  
> An example: I am using the Hive on Spark engine to find the min, max, 
> average, and standard deviation of IDs for a table with 1 billion rows:
>  
> 0: jdbc:hive2://rhes564:10010/default>  select min(id), max(id),avg(id), 
> stddev(id) from oraclehadoop.dummy;
> Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>  
>  
> Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151
>  
> INFO  : Completed compiling 
> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); 
> Time taken: 1.911 seconds
> INFO  : Executing 
> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006): 
> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
> INFO  : Total jobs = 1
> INFO  : Launching Job 1 out of 1
> INFO  : Starting task [Stage-1:MAPRED] in serial mode
>  
> Query Hive on Spark job[0] stages:
> 0
> 1
> Status: Running (Hive on Spark job[0])
> Job Progress Format
> CurrentTime StageId_StageAttemptId: 
> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount 
> [StageCost]
> 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1
> 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
> 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
> 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22    Stage-1_0: 0/1
> 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished       Stage-1_0: 0(+1)/1
> 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished       Stage-1_0: 1/1 
> Finished
> Status: Finished successfully in 53.25 seconds
> OK
> INFO  : Completed executing 
> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); 
> Time taken: 56.337 seconds
> +-----+------------+---------------+-----------------------+--+
> | c0  |     c1     |      c2       |          c3           |
> +-----+------------+---------------+-----------------------+--+
> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
> +-----+------------+---------------+-----------------------+--+
> 1 row selected (58.529 seconds)
>  
> 58 seconds for the first run with a cold cache is pretty good.
>  
> Now let us compare it with running the same query on the MapReduce engine:
>  
> 0: jdbc:hive2://rhes564:10010/default> set hive.execution.engine=mr;
> Hive-on-MR is deprecated in Hive 2 and may not be available in the future 
> versions. Consider using a different execution engine (i.e. spark, tez) or 
> using Hive 1.X releases.
> No rows affected (0.007 seconds)
> 0: jdbc:hive2://rhes564:10010/default>  select min(id), max(id),avg(id), 
> stddev(id) from oraclehadoop.dummy;
> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the 
> future versions. Consider using a different execution engine (i.e. spark, 
> tez) or using Hive 1.X releases.
> Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
> Total jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapreduce.job.reduces=<number>
> Starting Job = job_1463956731753_0005, Tracking URL = 
> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
> Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill 
> job_1463956731753_0005
> Hadoop job information for Stage-1: number of mappers: 22; number of 
> reducers: 1
> 2016-05-23 00:26:38,127 Stage-1 map = 0%,  reduce = 0%
> INFO  : Compiling 
> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc): 
> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
> INFO  : Semantic Analysis Completed
> INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:c0, 
> type:int, comment:null), FieldSchema(name:c1, type:int, comment:null), 
> FieldSchema(name:c2, type:double, comment:null), FieldSchema(name:c3, 
> type:double, comment:null)], properties:null)
> INFO  : Completed compiling 
> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); 
> Time taken: 0.144 seconds
> INFO  : Executing 
> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc): 
> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
> WARN  : Hive-on-MR is deprecated in Hive 2 and may not be available in the 
> future versions. Consider using a different execution engine (i.e. spark, 
> tez) or using Hive 1.X releases.
> WARN  : Hadoop command-line option parsing not performed. Implement the Tool 
> interface and execute your application with ToolRunner to remedy this.
> INFO  : number of splits:22
> INFO  : Submitting tokens for job: job_1463956731753_0005
> INFO  : The url to track the job: 
> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
> 2016-05-23 00:26:44,367 Stage-1 map = 5%,  reduce = 0%, Cumulative CPU 4.56 sec
> 2016-05-23 00:26:50,558 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU 9.17 sec
> 2016-05-23 00:26:56,747 Stage-1 map = 14%,  reduce = 0%, Cumulative CPU 14.04 sec
> 2016-05-23 00:27:02,944 Stage-1 map = 18%,  reduce = 0%, Cumulative CPU 18.64 sec
> 2016-05-23 00:27:08,105 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 23.25 sec
> 2016-05-23 00:27:14,298 Stage-1 map = 27%,  reduce = 0%, Cumulative CPU 27.84 sec
> 2016-05-23 00:27:20,484 Stage-1 map = 32%,  reduce = 0%, Cumulative CPU 32.56 sec
> 2016-05-23 00:27:26,659 Stage-1 map = 36%,  reduce = 0%, Cumulative CPU 37.1 sec
> 2016-05-23 00:27:32,839 Stage-1 map = 41%,  reduce = 0%, Cumulative CPU 41.74 sec
> 2016-05-23 00:27:39,003 Stage-1 map = 45%,  reduce = 0%, Cumulative CPU 46.32 sec
> 2016-05-23 00:27:45,173 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 50.93 sec
> 2016-05-23 00:27:50,316 Stage-1 map = 55%,  reduce = 0%, Cumulative CPU 55.55 sec
> 2016-05-23 00:27:56,482 Stage-1 map = 59%,  reduce = 0%, Cumulative CPU 60.25 sec
> 2016-05-23 00:28:02,642 Stage-1 map = 64%,  reduce = 0%, Cumulative CPU 64.86 sec
> 2016-05-23 00:28:08,814 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 69.41 sec
> 2016-05-23 00:28:14,977 Stage-1 map = 73%,  reduce = 0%, Cumulative CPU 74.06 sec
> 2016-05-23 00:28:21,134 Stage-1 map = 77%,  reduce = 0%, Cumulative CPU 78.72 sec
> 2016-05-23 00:28:27,282 Stage-1 map = 82%,  reduce = 0%, Cumulative CPU 83.32 sec
> 2016-05-23 00:28:33,437 Stage-1 map = 86%,  reduce = 0%, Cumulative CPU 87.9 sec
> 2016-05-23 00:28:38,579 Stage-1 map = 91%,  reduce = 0%, Cumulative CPU 92.52 sec
> 2016-05-23 00:28:44,759 Stage-1 map = 95%,  reduce = 0%, Cumulative CPU 97.35 sec
> 2016-05-23 00:28:49,915 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 99.6 sec
> 2016-05-23 00:28:54,043 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 101.4 sec
> MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec
> Ended Job = job_1463956731753_0005
> MapReduce Jobs Launched:
> Stage-Stage-1: Map: 22  Reduce: 1   Cumulative CPU: 101.4 sec   HDFS Read: 
> 5318569 HDFS Write: 46 SUCCESS
> Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
> OK
> INFO  : Completed executing 
> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); 
> Time taken: 142.525 seconds
> +-----+------------+---------------+-----------------------+--+
> | c0  |     c1     |      c2       |          c3           |
> +-----+------------+---------------+-----------------------+--+
> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
> +-----+------------+---------------+-----------------------+--+
> 1 row selected (142.744 seconds)
>  
> OK: Hive on the MapReduce engine took 142 seconds compared with 58 seconds 
> for Hive on Spark. So you can clearly gain a lot by using Hive on Spark.
>  
> Please also note that I did not use any vendor's build for this purpose. I 
> compiled Spark 1.3.1 myself.
>  
> HTH
>  
>  
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com/
>  
