Re: Running Hive on Spark

Rajesh Balamohan Wed, 13 Mar 2019 05:06:57 -0700

"Hive on Spark" uses Spark purely as an execution engine. It does not get
the benefits of codegen and the other optimizations of Spark.


If it is mainly for testing, the out-of-the-box (OOTB) parameters should
work without issues.

However, Tez has a considerable edge over Hive on Spark.

Some of the areas where Hive on Spark needs to catch up are:

* No support for auto reduce parallelism.
* Dynamic partition pruning is only partially supported.
* Fetchers can start only when all mappers are complete. This can be a huge
pain point in a lot of cases.
* You have to specify CombinedInputFormat to tackle small files, but that
has issues with splitting.
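
For reference, the settings behind these points look roughly like the sketch
below. It assumes the standard Hive configuration property names; whether
each one exists and how it behaves depends on the Hive version, and
CombineHiveInputFormat is an assumption about the input format meant above,
so check everything against your release before relying on it.

```sql
-- Sketch only: values are illustrative, not tuned recommendations.

-- Switch the execution engine (the EMR default mentioned in this thread is tez).
set hive.execution.engine=spark;

-- Dynamic partition pruning for Hive on Spark (only partially supported).
set hive.spark.dynamic.partition.pruning=true;

-- Combine small files into larger splits (assumed to be the format meant above).
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- For comparison, auto reduce parallelism when running on Tez.
set hive.execution.engine=tez;
set hive.tez.auto.reducer.parallelism=true;
```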

~Rajesh.B

On Tue, Mar 12, 2019 at 2:25 PM Daniel Mateus Pires <[email protected]>
wrote:

> Hi Rajesh,
>
> I'm trying to further my understanding of the various interactions and
> setups for Hive + Spark.
>
> My understanding so far is that running queries against the
> SparkThriftServer uses the SparkSQL engine, whereas HiveServer2 + Hive
> with the Spark execution engine uses Hive primitives and only uses Spark
> for the actual computations.
>
> I get your question about "why would I do that?", but my goal right now is
> to understand "what does it mean if I do that".
>
> Best regards
> Daniel
>
> On Tue 12 Mar 2019, 02:21 Rajesh Balamohan, <[email protected]> wrote:
>
>> Not sure why you are using SparkThriftServer. OOTB HiveServer2 would be
>> good enough for this.
>>
>> Is there any specific reason for moving from Tez to Spark as the
>> execution engine?
>>
>> ~Rajesh.B
>>
>> On Mon, Mar 11, 2019 at 9:45 PM Daniel Mateus Pires <[email protected]>
>> wrote:
>>
>>> Hi there,
>>>
>>> I would like to run Hive using Spark as the execution engine and I'm
>>> pretty confused by the setup.
>>>
>>> For reference I'm using AWS EMR.
>>>
>>> First, I'm confused about the difference between running Hive with Spark
>>> as its execution engine and sending queries through HiveServer2 (Thrift),
>>> versus using the SparkThriftServer (I thought it was built on top of
>>> HiveServer2). Could I read more about the differences somewhere?
>>>
>>> I followed these docs:
>>> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
>>> and after changing the execution engine from the EMR default (tez) to
>>> spark, I can see the difference on the HiveServer2 UI at port 10002 where
>>> now the steps show "spark" as the execution engine.
>>>
>>> However, I've set up the following config to get the Spark History Server
>>> displaying queries coming through JDBC, and I can see queries sent to the
>>> SparkThriftServer (port 10001) but not those sent to HiveServer2 with
>>> Spark as the execution engine (port 10000):
>>>
>>> set spark.eventLog.enabled=true;
>>> set spark.master=localhost:18080;
>>> set spark.eventLog.dir=hdfs:///var/log/spark/apps;
>>> set spark.executor.memory=512m;
>>> set spark.serializer=org.apache.spark.serializer.KryoSerializer;
>>>
>>> Thanks!
>>>
>>
