RE: Hive using Spark engine vs native spark with hive integration.

2020-10-06 Thread Manu Jacob
Thank you so much Mich! Although a bit older, this is the most detailed 
comparison I’ve read on the subject. Thanks again.

Regards,
-Manu

From: Mich Talebzadeh 
Sent: Tuesday, October 06, 2020 12:37 PM
To: user 
Subject: Re: Hive using Spark engine vs native spark with hive integration.


Hi Manu,

In the past (July 2016), I gave a presentation in London, organised by what was 
then Hortonworks, titled "Query Engines for Hive: MR, Spark, Tez with LLAP – 
Considerations!"

The PDF presentation is here:
https://talebzadehmich.files.wordpress.com/2016/08/hive_on_spark_only.pdf
With the caveat that this was more than four years ago!

However, as of today I would recommend writing the code in Spark with Scala and 
running against Spark. You can try it using spark-shell to start with.

If you are reading from a Hive table or any other source such as CSV, there are 
plenty of examples on the Spark website: 
https://spark.apache.org/examples.html

Also I suggest that you use Scala as Spark itself is written in Scala (though 
Python is more popular with Data Science guys).

HTH




LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw







Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Tue, 6 Oct 2020 at 16:47, Manu Jacob <manu.ja...@sas.com> wrote:
Hi All,

Not sure whether I should ask this question in the Hive community or the Spark 
community.

We have a set of Hive scripts that run on EMR (Tez engine). We would like to 
experiment by moving some of them onto Spark. We are planning to try two 
options.

  1.  Keep the current HQL-based code, with the execution engine set to Spark.
  2.  Write pure Spark code in Scala/Python using Spark SQL and Hive integration.

The first approach helps us transition to Spark quickly, but we are not sure it 
is the best approach in terms of performance. We could not find any reasonable 
comparisons of these two approaches. It looks like writing pure Spark code 
gives us more control to add logic and to tune some of the performance 
features, such as caching and evicting data.
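
For illustration only, the second option with explicit cache control might 
look roughly like the Scala sketch below. It is not taken from the existing 
scripts: the database, table and column names (sales_db.orders, customer_id, 
region, amount) and the output table names are hypothetical, and it assumes 
the job can reach the Hive metastore.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

object HiveToSparkSketch {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() lets Spark SQL see tables registered in the Hive metastore
    val spark = SparkSession.builder()
      .appName("hive-to-spark-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical table and columns, standing in for whatever the HQL scripts query
    val orders = spark.sql("SELECT customer_id, region, amount FROM sales_db.orders")

    // Mark the data for caching because it is reused twice below;
    // it is materialised lazily, on the first action that touches it
    orders.cache()

    val byCustomer = orders.groupBy(col("customer_id")).agg(sum(col("amount")).as("total"))
    val byRegion   = orders.groupBy(col("region")).agg(sum(col("amount")).as("total"))

    // Hypothetical output tables
    byCustomer.write.mode("overwrite").saveAsTable("sales_db.totals_by_customer")
    byRegion.write.mode("overwrite").saveAsTable("sales_db.totals_by_region")

    // Explicit eviction once the cached data is no longer needed
    orders.unpersist()

    spark.stop()
  }
}
```

The cache()/unpersist() pair is the kind of control that is not available 
when the same HQL is simply run with a different execution engine.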


Any advice on this is much appreciated.


Thanks,
-Manu


Re: Hive using Spark engine vs native spark with hive integration.

2020-10-06 Thread Mich Talebzadeh
Hi Manu,

In the past (July 2016), I gave a presentation in London, organised by what
was then Hortonworks, titled "Query Engines for Hive: MR, Spark, Tez with
LLAP – Considerations!"

The PDF presentation is here:
https://talebzadehmich.files.wordpress.com/2016/08/hive_on_spark_only.pdf
With the caveat that this was more than four years ago!

However, as of today I would recommend writing the code in Spark with Scala
and running against Spark. You can try it using spark-shell to start with.

If you are reading from a Hive table or any other source such as CSV, there
are plenty of examples on the Spark website:
https://spark.apache.org/examples.html
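
As a minimal sketch of what that could look like in spark-shell (where a
SparkSession named spark is already defined), the lines below read one Hive
table and one CSV file. The table name and file path are hypothetical, and
reading the Hive table assumes the shell was started with Hive support and
can reach the metastore.

```scala
// Paste into spark-shell; `spark` is the SparkSession the shell provides.

// Hypothetical Hive table, looked up through the metastore
val hiveDf = spark.table("sales_db.orders")
hiveDf.printSchema()
hiveDf.show(5)

// Hypothetical CSV path; "header" and "inferSchema" are standard CSV reader options
val csvDf = spark.read.option("header", "true").option("inferSchema", "true").csv("/tmp/orders.csv")
csvDf.printSchema()
csvDf.show(5)
```

Both calls return a DataFrame, so the same transformations apply whichever
source the data came from.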

Also I suggest that you use Scala as Spark itself is written in Scala
(though Python is more popular with Data Science guys).

HTH









Re: Hive using Spark engine vs native spark with hive integration.

2020-10-06 Thread 刘虓
Hi,
If you are already running Hive with Tez, the performance gain from Spark
won't be obvious by comparison.
I'd recommend experimenting with Spark on something new until you have a
better understanding of it.
