I think we are going to move to a model in which the computation stack is
separate from the storage stack, and moreover something like Hive that
provides the means for persistent storage (well, HDFS is what actually stores
the data) will gain an in-memory capability, much like what Oracle TimesTen
IMDB does with its big brother Oracle. TimesTen is effectively designed to
provide in-memory analytics capability for Oracle 12c. The two work together
like an index or a materialized view: you write queries against tables, and
the optimizer figures out whether to use row-oriented storage and indexes
(classic Oracle) or non-indexed column storage (TimesTen) to answer them.
Just one optimizer.
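
For illustration only, and from memory of the Oracle 12c in-memory option
rather than TimesTen itself (the table and columns are made up):

ALTER TABLE sales INMEMORY;
-- same SQL as before; the optimizer decides per query whether to use the
-- row store and its indexes or the in-memory columnar copy
SELECT channel_id, SUM(amount_sold) FROM sales GROUP BY channel_id;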

I gather Hive will eventually be like that: it will decide, based on
frequency of access, where to look for the data. Yes, we may have 10 TB of
data on disk, but how much of it is frequently accessed (hot data)? The 80-20
rule? In reality it may be just 2 TB, or the most recent partitions, etc. The
rest is cold data.
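
On the Hive side, the sketch would be something like this (names made up):
keep everything in one table partitioned by year and month, and let partition
pruning confine queries to the hot partitions:

-- only the two most recent partitions are actually read
SELECT prod_id, SUM(amount_sold)
FROM sales
WHERE year = 2016 AND month IN (4, 5)
GROUP BY prod_id;

Today that pruning only saves disk I/O; the speculation above is that the
engine will eventually keep such hot partitions in memory as well.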

cheers



Dr Mich Talebzadeh



LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 30 May 2016 at 21:59, Michael Segel <msegel_had...@hotmail.com> wrote:

> And you have MapR supporting Apache Drill.
>
> So these are all alternatives to Spark, and it's not necessarily an either/or
> scenario. You can have both.
>
> On May 30, 2016, at 12:49 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> Yep, Hortonworks supports Tez for one reason or another, and I hope to test
> it as the query engine for Hive, though I think Spark will be faster because
> of its in-memory support.
>
> Also, if you are independent, then you are better off dealing with Spark and
> Hive without the need to support another stack like Tez.
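>
> For what it is worth, switching engines is just a per-session setting in
> beeline, so the same query can be tried against each one (tez of course
> needs the Tez libraries installed, which is the build issue discussed
> below):
>
> -- pick one of spark, tez or mr; takes effect for the current session only
> set hive.execution.engine=spark;
> select count(1) from oraclehadoop.dummy;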
>
> Cloudera supports Impala instead of Hive, but it is not something I have
> used.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 30 May 2016 at 20:19, Michael Segel <msegel_had...@hotmail.com> wrote:
>
>> Mich,
>>
>> Most people use vendor releases because they need to have the support.
>> Hortonworks is the vendor who has the most skin in the game when it comes
>> to Tez.
>>
>> If memory serves, Tez isn’t going to be M/R but a local execution engine?
>> Then LLAP is the in-memory piece to speed up Tez?
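>>
>> If so, I would guess turning it on in Hive 2 looks something like the
>> following, though I have not tried it myself and the property values are
>> from memory:
>>
>> -- LLAP runs on top of the Tez engine
>> set hive.execution.engine=tez;
>> -- route query fragments to the long-lived LLAP daemons
>> set hive.llap.execution.mode=all;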
>>
>> HTH
>>
>> -Mike
>>
>> On May 29, 2016, at 1:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>> Thanks. I think the problem is that the Tez user group is exceptionally
>> quiet. I have just sent an email to the Hive user group to see if anyone
>> has managed to build a vendor-independent version.
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 29 May 2016 at 21:23, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>>> Well, I think it is different from MR. It has some optimizations which
>>> you do not find in MR. Especially the LLAP option in Hive 2 makes it
>>> interesting.
>>>
>>> I think Hive 1.2 works with Tez 0.7, and Hive 2.0 with Tez 0.8. At least
>>> for 1.2 it is integrated in the Hortonworks distribution.
>>>
>>>
>>> On 29 May 2016, at 21:43, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>> Hi Jorn,
>>>
>>> I started building apache-tez-0.8.2 but got a few errors. A couple of
>>> guys from the Tez user group kindly gave a hand, but I could not get very
>>> far (or maybe I did not make enough of an effort) making it work.
>>>
>>> That Tez user group is very quiet as well.
>>>
>>> My understanding is that Tez is MR with DAG, but of course Spark has
>>> both, plus in-memory capability.
>>>
>>> It would be interesting to see which version of Tez works as the
>>> execution engine with Hive.
>>>
>>> Vendors are divided on this (use Hive with Tez, or use Impala instead of
>>> Hive, etc.), as I am sure you already know.
>>>
>>> Cheers,
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 29 May 2016 at 20:19, Jörn Franke <jornfra...@gmail.com> wrote:
>>>
>>>> Very interesting. Do you also plan a test with Tez?
>>>>
>>>> On 29 May 2016, at 13:40, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I did another study of Hive using the Spark engine compared to Hive with MR.
>>>>
>>>> Basically I took the original table imported using Sqoop, and created and
>>>> populated a new ORC table partitioned by year and month into 48
>>>> partitions, as follows:
>>>>
>>>> <sales_partition.PNG>
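>>>>
>>>> In outline, the DDL was along these lines (the exact statement is in the
>>>> screenshot above; the column list here is abbreviated and illustrative):
>>>>
>>>> CREATE TABLE sales_orc (prod_id INT, cust_id INT, amount_sold DECIMAL(10,2))
>>>> PARTITIONED BY (year INT, month INT)
>>>> STORED AS ORC;
>>>>
>>>> -- assuming the source table already carries year and month columns
>>>> SET hive.exec.dynamic.partition=true;
>>>> SET hive.exec.dynamic.partition.mode=nonstrict;
>>>> INSERT INTO TABLE sales_orc PARTITION (year, month)
>>>> SELECT prod_id, cust_id, amount_sold, year, month FROM sales;
>>>>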
>>>> Connections use JDBC via beeline. With MR it takes an average of 17
>>>> minutes per PARTITION, as seen below, and that is just one individual
>>>> partition out of the 48.
>>>>
>>>> In contrast, doing the same operation with the Spark engine took 10
>>>> minutes all inclusive. I just gave up on MR. You can see the StartTime
>>>> and FinishTime below:
>>>>
>>>> <image.png>
>>>>
>>>> This by no means indicates that Spark is much better than MR, but it
>>>> shows that some very good results can be achieved using the Spark engine.
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 24 May 2016 at 08:03, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We use Hive as the database and Spark as an all-purpose query tool.
>>>>>
>>>>> Whether Hive is the right database for the purpose, or whether one is
>>>>> better off with something like Phoenix on HBase, well, the answer is
>>>>> that it depends and your mileage varies.
>>>>>
>>>>> So fit for purpose.
>>>>>
>>>>> Ideally one wants to use the fastest method to get the results. How
>>>>> fast is bounded by our SLA agreements in production, and that keeps us
>>>>> from unnecessary further work, as we technologists like to play around.
>>>>>
>>>>> So in short, we use Spark most of the time and use Hive as the backend
>>>>> engine for data storage, mainly ORC tables.
>>>>>
>>>>> We use Hive on Spark, and with Hive 2 on Spark 1.3.1 we have a
>>>>> combination that works for now. Granted, it would help to use Hive 2 on
>>>>> Spark 1.6.1, but at the moment that is one of my projects.
>>>>>
>>>>> We do not use any vendor's products, as that enables us to avoid being
>>>>> tied down to yet another vendor after years of SAP, Oracle and MS
>>>>> dependency. Besides, there is some politics going on, with one vendor
>>>>> promoting Tez and another Spark as a backend. That is fine, but
>>>>> obviously we prefer to make an independent assessment ourselves.
>>>>>
>>>>> My gut feeling is that one needs to look at the use case. Recently we
>>>>> had to import a very large table from Oracle to Hive, and decided to
>>>>> use Spark 1.6.1 with Hive 2 on Spark 1.3.1; that worked fine. We just
>>>>> used a JDBC connection with a temp table and it was good. We could have
>>>>> used Sqoop but decided to settle for Spark, so it all depends on the
>>>>> use case.
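>>>>>
>>>>> In case it helps anyone, the temp table approach was essentially the
>>>>> standard Spark SQL JDBC source (connection details and object names
>>>>> here are illustrative, password elided):
>>>>>
>>>>> CREATE TEMPORARY TABLE oracle_dummy
>>>>> USING org.apache.spark.sql.jdbc
>>>>> OPTIONS (
>>>>>   url "jdbc:oracle:thin:@//oraclehost:1521/mydb",
>>>>>   dbtable "scott.dummy",
>>>>>   user "scott",
>>>>>   password "..."
>>>>> );
>>>>>
>>>>> -- write through to the Hive ORC table
>>>>> INSERT INTO TABLE oraclehadoop.dummy SELECT * FROM oracle_dummy;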
>>>>>
>>>>> HTH
>>>>>
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>
>>>>> On 24 May 2016 at 03:11, ayan guha <guha.a...@gmail.com> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> Thanks for the very useful stats.
>>>>>>
>>>>>> Did you do any benchmarking of using Spark as the backend engine for
>>>>>> Hive vs using the Spark thrift server (and running Spark code for Hive
>>>>>> queries)? We are using the latter, but it would be very useful to
>>>>>> remove the thrift server, if we can.
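>>>>>>
>>>>>> For context, our current setup is just the stock thrift server plus
>>>>>> beeline, roughly like this (the master, host and port shown are
>>>>>> defaults, not our real ones):
>>>>>>
>>>>>> $SPARK_HOME/sbin/start-thriftserver.sh --master yarn-client
>>>>>> beeline> !connect jdbc:hive2://localhost:10000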
>>>>>>
>>>>>> On Tue, May 24, 2016 at 9:51 AM, Jörn Franke <jornfra...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> Hi Mich,
>>>>>>>
>>>>>>> I think these comparisons are useful. One interesting aspect could be
>>>>>>> hardware scalability in this context, and additionally different
>>>>>>> types of computations. Furthermore, one could compare Spark and
>>>>>>> Tez+LLAP as execution engines. I have the gut feeling that each one
>>>>>>> can be justified by different use cases.
>>>>>>> Nevertheless, there should always be a disclaimer for such
>>>>>>> comparisons, because Spark and Hive are not good for a lot of
>>>>>>> concurrent lookups of single rows. Nor are they good for frequently
>>>>>>> writing small amounts of data (e.g. sensor data); here HBase could be
>>>>>>> more interesting. Other use cases can justify graph databases, such
>>>>>>> as Titan, or text analytics / data matching using Solr on Hadoop.
>>>>>>> Finally, even if you have a lot of data, you need to consider whether
>>>>>>> you always have to process everything. For instance, I have found
>>>>>>> valid use cases in practice where we decided to evaluate 10 machine
>>>>>>> learning models in parallel on only a sample of the data, and then
>>>>>>> evaluate only the "winning" model on the total of the data.
>>>>>>>
>>>>>>> As always it depends :)
>>>>>>>
>>>>>>> Best regards
>>>>>>>
>>>>>>> P.S.: at least Hortonworks has in their distribution Spark 1.5 with
>>>>>>> Hive 1.2, and Spark 1.6 with Hive 1.2. Maybe they have described
>>>>>>> somewhere how they manage bringing the two together. You may also
>>>>>>> check Apache Bigtop (a vendor-neutral distribution) for how they
>>>>>>> managed to bring both together.
>>>>>>>
>>>>>>> On 23 May 2016, at 01:42, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>>
>>>>>>> I have done a number of extensive tests using Spark-shell with Hive
>>>>>>> DB and ORC tables.
>>>>>>>
>>>>>>>
>>>>>>> Now one issue that we typically face is, and I quote:
>>>>>>>
>>>>>>> "Spark is fast as it uses memory and DAG. Great, but when we save
>>>>>>> data it is not fast enough."
>>>>>>>
>>>>>>> OK, but there is a solution now. If you use Spark with Hive and you
>>>>>>> are on a decent version of Hive (>= 0.14), then you can also deploy
>>>>>>> Spark as the execution engine for Hive. That will make your
>>>>>>> application run pretty fast, as you no longer rely on the old
>>>>>>> MapReduce engine for Hive. In a nutshell, you gain speed in both
>>>>>>> querying and storage.
>>>>>>>
>>>>>>>
>>>>>>> I have made some comparisons on this set-up and I am sure some of
>>>>>>> you will find it useful.
>>>>>>>
>>>>>>>
>>>>>>> The version of Spark I use for Spark queries (Spark as a query tool)
>>>>>>> is 1.6.
>>>>>>> The version of Hive I use is Hive 2.
>>>>>>> The version of Spark I use as the Hive execution engine is 1.3.1. It
>>>>>>> works, and frankly Spark 1.3.1 as an execution engine is adequate
>>>>>>> (until we sort out the Hadoop libraries mismatch).
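>>>>>>>
>>>>>>> For completeness, the Hive-side wiring is only a handful of
>>>>>>> properties (values here are illustrative; the spark-assembly jar also
>>>>>>> has to be visible to Hive, which is where the libraries mismatch
>>>>>>> bites):
>>>>>>>
>>>>>>> set hive.execution.engine=spark;
>>>>>>> set spark.master=yarn-client;
>>>>>>> set spark.executor.memory=2g;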
>>>>>>>
>>>>>>>
>>>>>>> As an example, I am using Hive on the Spark engine to find the min
>>>>>>> and max of IDs for a table with 1 billion rows:
>>>>>>>
>>>>>>>
>>>>>>> 0: jdbc:hive2://rhes564:10010/default>  select min(id),
>>>>>>> max(id),avg(id), stddev(id) from oraclehadoop.dummy;
>>>>>>> Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151
>>>>>>>
>>>>>>>
>>>>>>> INFO  : Completed compiling
>>>>>>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006);
>>>>>>> Time taken: 1.911 seconds
>>>>>>> INFO  : Executing
>>>>>>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006):
>>>>>>> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>>>>>>> INFO  : Query ID =
>>>>>>> hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>>>>>>> INFO  : Total jobs = 1
>>>>>>> INFO  : Launching Job 1 out of 1
>>>>>>> INFO  : Starting task [Stage-1:MAPRED] in serial mode
>>>>>>>
>>>>>>>
>>>>>>> Query Hive on Spark job[0] stages:
>>>>>>> 0
>>>>>>> 1
>>>>>>> Status: Running (Hive on Spark job[0])
>>>>>>> Job Progress Format
>>>>>>> CurrentTime StageId_StageAttemptId:
>>>>>>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
>>>>>>> [StageCost]
>>>>>>> 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1
>>>>>>> 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>>>>>>> 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>>>>>>> 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22    Stage-1_0: 0/1
>>>>>>> INFO  :
>>>>>>> Query Hive on Spark job[0] stages:
>>>>>>> INFO  : 0
>>>>>>> INFO  : 1
>>>>>>> INFO  :
>>>>>>> Status: Running (Hive on Spark job[0])
>>>>>>> INFO  : Job Progress Format
>>>>>>> CurrentTime StageId_StageAttemptId:
>>>>>>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
>>>>>>> [StageCost]
>>>>>>> INFO  : 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1
>>>>>>> INFO  : 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22    Stage-1_0:
>>>>>>> 0/1
>>>>>>> INFO  : 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22    Stage-1_0:
>>>>>>> 0/1
>>>>>>> INFO  : 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22    Stage-1_0:
>>>>>>> 0/1
>>>>>>> 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished       Stage-1_0:
>>>>>>> 0(+1)/1
>>>>>>> 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished       Stage-1_0:
>>>>>>> 1/1 Finished
>>>>>>> Status: Finished successfully in 53.25 seconds
>>>>>>> OK
>>>>>>> INFO  : 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished
>>>>>>> Stage-1_0: 0(+1)/1
>>>>>>> INFO  : 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished
>>>>>>> Stage-1_0: 1/1 Finished
>>>>>>> INFO  : Status: Finished successfully in 53.25 seconds
>>>>>>> INFO  : Completed executing
>>>>>>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006);
>>>>>>> Time taken: 56.337 seconds
>>>>>>> INFO  : OK
>>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>>> | c0  |     c1     |      c2       |          c3           |
>>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>>> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
>>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>>> 1 row selected (58.529 seconds)
>>>>>>>
>>>>>>>
>>>>>>> 58 seconds for the first run with a cold cache is pretty good.
>>>>>>>
>>>>>>>
>>>>>>> And let us compare it with running the same query on the map-reduce
>>>>>>> engine:
>>>>>>>
>>>>>>>
>>>>>>> 0: jdbc:hive2://rhes564:10010/default> set hive.execution.engine=mr;
>>>>>>> Hive-on-MR is deprecated in Hive 2 and may not be available in the
>>>>>>> future versions. Consider using a different execution engine (i.e. 
>>>>>>> spark,
>>>>>>> tez) or using Hive 1.X releases.
>>>>>>> No rows affected (0.007 seconds)
>>>>>>> 0: jdbc:hive2://rhes564:10010/default>  select min(id),
>>>>>>> max(id),avg(id), stddev(id) from oraclehadoop.dummy;
>>>>>>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available
>>>>>>> in the future versions. Consider using a different execution engine 
>>>>>>> (i.e.
>>>>>>> spark, tez) or using Hive 1.X releases.
>>>>>>> Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
>>>>>>> Total jobs = 1
>>>>>>> Launching Job 1 out of 1
>>>>>>> Number of reduce tasks determined at compile time: 1
>>>>>>> In order to change the average load for a reducer (in bytes):
>>>>>>>   set hive.exec.reducers.bytes.per.reducer=<number>
>>>>>>> In order to limit the maximum number of reducers:
>>>>>>>   set hive.exec.reducers.max=<number>
>>>>>>> In order to set a constant number of reducers:
>>>>>>>   set mapreduce.job.reduces=<number>
>>>>>>> Starting Job = job_1463956731753_0005, Tracking URL =
>>>>>>> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>>>>>>> Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill
>>>>>>> job_1463956731753_0005
>>>>>>> Hadoop job information for Stage-1: number of mappers: 22; number of
>>>>>>> reducers: 1
>>>>>>> 2016-05-23 00:26:38,127 Stage-1 map = 0%,  reduce = 0%
>>>>>>> INFO  : Compiling
>>>>>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc):
>>>>>>> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>>>>>>> INFO  : Semantic Analysis Completed
>>>>>>> INFO  : Returning Hive schema:
>>>>>>> Schema(fieldSchemas:[FieldSchema(name:c0, type:int, comment:null),
>>>>>>> FieldSchema(name:c1, type:int, comment:null), FieldSchema(name:c2,
>>>>>>> type:double, comment:null), FieldSchema(name:c3, type:double,
>>>>>>> comment:null)], properties:null)
>>>>>>> INFO  : Completed compiling
>>>>>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc);
>>>>>>> Time taken: 0.144 seconds
>>>>>>> INFO  : Executing
>>>>>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc):
>>>>>>> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>>>>>>> WARN  : Hive-on-MR is deprecated in Hive 2 and may not be available
>>>>>>> in the future versions. Consider using a different execution engine 
>>>>>>> (i.e.
>>>>>>> spark, tez) or using Hive 1.X releases.
>>>>>>> INFO  : WARNING: Hive-on-MR is deprecated in Hive 2 and may not be
>>>>>>> available in the future versions. Consider using a different execution
>>>>>>> engine (i.e. spark, tez) or using Hive 1.X releases.
>>>>>>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available
>>>>>>> in the future versions. Consider using a different execution engine 
>>>>>>> (i.e.
>>>>>>> spark, tez) or using Hive 1.X releases.
>>>>>>> INFO  : Query ID =
>>>>>>> hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
>>>>>>> INFO  : Total jobs = 1
>>>>>>> INFO  : Launching Job 1 out of 1
>>>>>>> INFO  : Starting task [Stage-1:MAPRED] in serial mode
>>>>>>> INFO  : Number of reduce tasks determined at compile time: 1
>>>>>>> INFO  : In order to change the average load for a reducer (in bytes):
>>>>>>> INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
>>>>>>> INFO  : In order to limit the maximum number of reducers:
>>>>>>> INFO  :   set hive.exec.reducers.max=<number>
>>>>>>> INFO  : In order to set a constant number of reducers:
>>>>>>> INFO  :   set mapreduce.job.reduces=<number>
>>>>>>> WARN  : Hadoop command-line option parsing not performed. Implement
>>>>>>> the Tool interface and execute your application with ToolRunner to 
>>>>>>> remedy
>>>>>>> this.
>>>>>>> INFO  : number of splits:22
>>>>>>> INFO  : Submitting tokens for job: job_1463956731753_0005
>>>>>>> INFO  : The url to track the job:
>>>>>>> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>>>>>>> INFO  : Starting Job = job_1463956731753_0005, Tracking URL =
>>>>>>> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>>>>>>> INFO  : Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job
>>>>>>> -kill job_1463956731753_0005
>>>>>>> INFO  : Hadoop job information for Stage-1: number of mappers: 22;
>>>>>>> number of reducers: 1
>>>>>>> INFO  : 2016-05-23 00:26:38,127 Stage-1 map = 0%,  reduce = 0%
>>>>>>> 2016-05-23 00:26:44,367 Stage-1 map = 5%,  reduce = 0%, Cumulative
>>>>>>> CPU 4.56 sec
>>>>>>> INFO  : 2016-05-23 00:26:44,367 Stage-1 map = 5%,  reduce = 0%,
>>>>>>> Cumulative CPU 4.56 sec
>>>>>>> 2016-05-23 00:26:50,558 Stage-1 map = 9%,  reduce = 0%, Cumulative
>>>>>>> CPU 9.17 sec
>>>>>>> INFO  : 2016-05-23 00:26:50,558 Stage-1 map = 9%,  reduce = 0%,
>>>>>>> Cumulative CPU 9.17 sec
>>>>>>> 2016-05-23 00:26:56,747 Stage-1 map = 14%,  reduce = 0%, Cumulative
>>>>>>> CPU 14.04 sec
>>>>>>> INFO  : 2016-05-23 00:26:56,747 Stage-1 map = 14%,  reduce = 0%,
>>>>>>> Cumulative CPU 14.04 sec
>>>>>>> 2016-05-23 00:27:02,944 Stage-1 map = 18%,  reduce = 0%, Cumulative
>>>>>>> CPU 18.64 sec
>>>>>>> INFO  : 2016-05-23 00:27:02,944 Stage-1 map = 18%,  reduce = 0%,
>>>>>>> Cumulative CPU 18.64 sec
>>>>>>> 2016-05-23 00:27:08,105 Stage-1 map = 23%,  reduce = 0%, Cumulative
>>>>>>> CPU 23.25 sec
>>>>>>> INFO  : 2016-05-23 00:27:08,105 Stage-1 map = 23%,  reduce = 0%,
>>>>>>> Cumulative CPU 23.25 sec
>>>>>>> 2016-05-23 00:27:14,298 Stage-1 map = 27%,  reduce = 0%, Cumulative
>>>>>>> CPU 27.84 sec
>>>>>>> INFO  : 2016-05-23 00:27:14,298 Stage-1 map = 27%,  reduce = 0%,
>>>>>>> Cumulative CPU 27.84 sec
>>>>>>> 2016-05-23 00:27:20,484 Stage-1 map = 32%,  reduce = 0%, Cumulative
>>>>>>> CPU 32.56 sec
>>>>>>> INFO  : 2016-05-23 00:27:20,484 Stage-1 map = 32%,  reduce = 0%,
>>>>>>> Cumulative CPU 32.56 sec
>>>>>>> 2016-05-23 00:27:26,659 Stage-1 map = 36%,  reduce = 0%, Cumulative
>>>>>>> CPU 37.1 sec
>>>>>>> INFO  : 2016-05-23 00:27:26,659 Stage-1 map = 36%,  reduce = 0%,
>>>>>>> Cumulative CPU 37.1 sec
>>>>>>> 2016-05-23 00:27:32,839 Stage-1 map = 41%,  reduce = 0%, Cumulative
>>>>>>> CPU 41.74 sec
>>>>>>> INFO  : 2016-05-23 00:27:32,839 Stage-1 map = 41%,  reduce = 0%,
>>>>>>> Cumulative CPU 41.74 sec
>>>>>>> 2016-05-23 00:27:39,003 Stage-1 map = 45%,  reduce = 0%, Cumulative
>>>>>>> CPU 46.32 sec
>>>>>>> INFO  : 2016-05-23 00:27:39,003 Stage-1 map = 45%,  reduce = 0%,
>>>>>>> Cumulative CPU 46.32 sec
>>>>>>> 2016-05-23 00:27:45,173 Stage-1 map = 50%,  reduce = 0%, Cumulative
>>>>>>> CPU 50.93 sec
>>>>>>> 2016-05-23 00:27:50,316 Stage-1 map = 55%,  reduce = 0%, Cumulative
>>>>>>> CPU 55.55 sec
>>>>>>> INFO  : 2016-05-23 00:27:45,173 Stage-1 map = 50%,  reduce = 0%,
>>>>>>> Cumulative CPU 50.93 sec
>>>>>>> INFO  : 2016-05-23 00:27:50,316 Stage-1 map = 55%,  reduce = 0%,
>>>>>>> Cumulative CPU 55.55 sec
>>>>>>> 2016-05-23 00:27:56,482 Stage-1 map = 59%,  reduce = 0%, Cumulative
>>>>>>> CPU 60.25 sec
>>>>>>> INFO  : 2016-05-23 00:27:56,482 Stage-1 map = 59%,  reduce = 0%,
>>>>>>> Cumulative CPU 60.25 sec
>>>>>>> 2016-05-23 00:28:02,642 Stage-1 map = 64%,  reduce = 0%, Cumulative
>>>>>>> CPU 64.86 sec
>>>>>>> INFO  : 2016-05-23 00:28:02,642 Stage-1 map = 64%,  reduce = 0%,
>>>>>>> Cumulative CPU 64.86 sec
>>>>>>> 2016-05-23 00:28:08,814 Stage-1 map = 68%,  reduce = 0%, Cumulative
>>>>>>> CPU 69.41 sec
>>>>>>> INFO  : 2016-05-23 00:28:08,814 Stage-1 map = 68%,  reduce = 0%,
>>>>>>> Cumulative CPU 69.41 sec
>>>>>>> 2016-05-23 00:28:14,977 Stage-1 map = 73%,  reduce = 0%, Cumulative
>>>>>>> CPU 74.06 sec
>>>>>>> INFO  : 2016-05-23 00:28:14,977 Stage-1 map = 73%,  reduce = 0%,
>>>>>>> Cumulative CPU 74.06 sec
>>>>>>> 2016-05-23 00:28:21,134 Stage-1 map = 77%,  reduce = 0%, Cumulative
>>>>>>> CPU 78.72 sec
>>>>>>> INFO  : 2016-05-23 00:28:21,134 Stage-1 map = 77%,  reduce = 0%,
>>>>>>> Cumulative CPU 78.72 sec
>>>>>>> 2016-05-23 00:28:27,282 Stage-1 map = 82%,  reduce = 0%, Cumulative
>>>>>>> CPU 83.32 sec
>>>>>>> INFO  : 2016-05-23 00:28:27,282 Stage-1 map = 82%,  reduce = 0%,
>>>>>>> Cumulative CPU 83.32 sec
>>>>>>> 2016-05-23 00:28:33,437 Stage-1 map = 86%,  reduce = 0%, Cumulative
>>>>>>> CPU 87.9 sec
>>>>>>> INFO  : 2016-05-23 00:28:33,437 Stage-1 map = 86%,  reduce = 0%,
>>>>>>> Cumulative CPU 87.9 sec
>>>>>>> 2016-05-23 00:28:38,579 Stage-1 map = 91%,  reduce = 0%, Cumulative
>>>>>>> CPU 92.52 sec
>>>>>>> INFO  : 2016-05-23 00:28:38,579 Stage-1 map = 91%,  reduce = 0%,
>>>>>>> Cumulative CPU 92.52 sec
>>>>>>> 2016-05-23 00:28:44,759 Stage-1 map = 95%,  reduce = 0%, Cumulative
>>>>>>> CPU 97.35 sec
>>>>>>> INFO  : 2016-05-23 00:28:44,759 Stage-1 map = 95%,  reduce = 0%,
>>>>>>> Cumulative CPU 97.35 sec
>>>>>>> 2016-05-23 00:28:49,915 Stage-1 map = 100%,  reduce = 0%, Cumulative
>>>>>>> CPU 99.6 sec
>>>>>>> INFO  : 2016-05-23 00:28:49,915 Stage-1 map = 100%,  reduce = 0%,
>>>>>>> Cumulative CPU 99.6 sec
>>>>>>> 2016-05-23 00:28:54,043 Stage-1 map = 100%,  reduce = 100%,
>>>>>>> Cumulative CPU 101.4 sec
>>>>>>> MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec
>>>>>>> Ended Job = job_1463956731753_0005
>>>>>>> MapReduce Jobs Launched:
>>>>>>> Stage-Stage-1: Map: 22  Reduce: 1   Cumulative CPU: 101.4 sec   HDFS
>>>>>>> Read: 5318569 HDFS Write: 46 SUCCESS
>>>>>>> Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
>>>>>>> OK
>>>>>>> INFO  : 2016-05-23 00:28:54,043 Stage-1 map = 100%,  reduce = 100%,
>>>>>>> Cumulative CPU 101.4 sec
>>>>>>> INFO  : MapReduce Total cumulative CPU time: 1 minutes 41 seconds
>>>>>>> 400 msec
>>>>>>> INFO  : Ended Job = job_1463956731753_0005
>>>>>>> INFO  : MapReduce Jobs Launched:
>>>>>>> INFO  : Stage-Stage-1: Map: 22  Reduce: 1   Cumulative CPU: 101.4
>>>>>>> sec   HDFS Read: 5318569 HDFS Write: 46 SUCCESS
>>>>>>> INFO  : Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
>>>>>>> INFO  : Completed executing
>>>>>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc);
>>>>>>> Time taken: 142.525 seconds
>>>>>>> INFO  : OK
>>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>>> | c0  |     c1     |      c2       |          c3           |
>>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>>> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
>>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>>> 1 row selected (142.744 seconds)
>>>>>>>
>>>>>>>
>>>>>>> OK, Hive on the map-reduce engine took 142 seconds compared with 58
>>>>>>> seconds for Hive on Spark, roughly 2.5 times faster. So you can
>>>>>>> obviously gain pretty well by using Hive on Spark.
>>>>>>>
>>>>>>>
>>>>>>> Please also note that I did not use any vendor's build for this
>>>>>>> purpose. I compiled Spark 1.3.1 myself.
>>>>>>>
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>>
>>>>>>>
>>>>>>> LinkedIn
>>>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>
>>>>>>>
>>>>>>> http://talebzadehmich.wordpress.com/
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best Regards,
>>>>>> Ayan Guha
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>
>
