My take on this might sound a bit different. Here are a few points to consider
below:

1. Going through Hive JDBC means that the application is restricted by the #
of queries that can be compiled. HS2 can only compile one SQL statement at a
time, and if users submit bad SQL it can take a long time just to compile (not
map reduce). This reduces the query throughput, i.e. the # of queries you can
fire through the JDBC.
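
As a rough illustration of the JDBC route in point 1 (a sketch only, assuming a
plain HiveServer2 endpoint; the host, port, table and credentials below are
placeholders, not from this thread):

    import java.sql.DriverManager

    object HiveJdbcSketch {
      def main(args: Array[String]): Unit = {
        Class.forName("org.apache.hive.jdbc.HiveDriver")
        // Every statement submitted here has to be compiled by HS2 first,
        // which is the serial bottleneck described in point 1.
        val conn = DriverManager.getConnection(
          "jdbc:hive2://hs2-host:10000/default", "user", "")
        try {
          val stmt = conn.createStatement()
          val rs   = stmt.executeQuery("SELECT count(*) FROM some_table")
          while (rs.next()) println(rs.getLong(1))
        } finally conn.close()
      }
    }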

2. Going through Hive JDBC does have the advantage that the HMS service is
protected. The JIRA https://issues.apache.org/jira/browse/HIVE-13884 protects
HMS from crashing: retrieving metadata for a Hive table that has millions, or
simply put 1000s, of partitions can exceed the maximum array size the JVM can
hold for the retrieved metadata, and when that limit is hit HMS crashes. So in
effect this is good to have to protect HMS and the relational database on its
back end.
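
If I recall that JIRA correctly, HIVE-13884 added a metastore-side cap on how
many partitions a single request may fetch. A hive-site.xml sketch (the
property name is my recollection of that ticket and the value is an arbitrary
example, so please verify against the JIRA):

    <!-- Cap the number of partitions a single HMS request may return, so one
         query over a heavily partitioned table cannot blow up the HMS heap. -->
    <property>
      <name>hive.metastore.limit.partition.request</name>
      <value>5000</value>
    </property>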

Note: the Hive community has proposed moving the metastore database to HBase,
which scales, but I don't think this will be implemented any time soon.

3. Going through the SparkContext, Spark interfaces directly with the Hive
MetaStore. I have tried to lay out the sequence of the code flow below. The bit
I didn't have time to dive into: I believe that if the table is really large,
i.e. it has more than 32K partitions (the maximum value of a short), then some
sort of slicing occurs (I didn't have time to dig out this piece of code, but
from experience it does seem to happen).

Code flow:
Spark uses the Hive external catalog -> goo.gl/7CZcDw
HiveClient's getPartitions -> goo.gl/ZAEsqQ
HiveClientImpl's getPartitions -> goo.gl/msPrr5
The Hive call is made at -> goo.gl/TB4NFU
ThriftHiveMetastore.java -> get_partitions_ps_with_auth

A value of -1 is passed within Spark all the way down to the Hive Metastore
thrift call, so in effect, for large tables, up to 32K partitions are retrieved
at a time. This has also led to a few HMS crashes, but I have yet to confirm
whether this is really the cause.
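
For completeness, a minimal sketch of the SparkContext route from point 3
(database, table and partition filter are placeholders): Spark resolves the
table through its Hive external catalog and pulls partition metadata straight
from HMS over thrift; no HiveServer2 is involved, and the scan itself runs on
the executors rather than coming back through the driver as a JDBC result set.

    import org.apache.spark.sql.SparkSession

    object SparkMetastoreSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("hive-metastore-route")
          .enableHiveSupport()   // wires in the Hive external catalog noted above
          .getOrCreate()

        // Partition metadata for the matching partitions is fetched from HMS
        // via the thrift calls listed in the code flow; the data itself is
        // read by the executors directly from the table's storage.
        spark.sql(
          "SELECT count(*) FROM some_db.some_partitioned_table WHERE dt = '2017-10-13'"
        ).show()
      }
    }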


Based on the 3 points above, I would prefer to use the SparkContext. If the
cause of the crashes does indeed turn out to be retrieval of a high # of
partitions, then I may opt for the JDBC route.

Thanks
Kabeer.


On Fri, 13 Oct 2017 09:22:37 +0200, Nicolas Paris wrote:
>> In case a table has a few
>> million records, it all goes through the driver.
>
> This sounds clear in JDBC mode: the driver gets all the rows and then
> spreads the RDD over the executors.
>
> I'd say that most use cases deal with SQL to aggregate huge datasets
> and retrieve a small number of rows to then be transformed for ML tasks.
> Using JDBC then offers the robustness of Hive to produce a small aggregated
> dataset in Spark, while using Spark SQL uses RDDs to produce the small
> one from the huge one.
>
> It's not very clear how Spark SQL deals with a huge Hive table. Does it load
> everything into memory and crash, or does this never happen?
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


--
Sent using Dekko from my Ubuntu device
