Re: Is a Hive installation necessary for Spark SQL?

Mich Talebzadeh Sun, 25 Apr 2021 09:02:13 -0700

Hi,

I don't know much about delta but your statement below


df.createOrReplaceTempView("myTable")
res = spark.sql("select * from myTable")


The so-called TempView is a reference to a hash table in memory. That is,
you are mapping your dataframe* df *to a hash table in memory and it is
transient, meaning that it is not persistent.


I have never heard of partition discovery or optimisation of  sql with a
tempView. Hive comes into play when you persist the data to disk.


df.createOrReplaceTempView("myTable")

Then you can either store it as below on Hive

from pyspark.sql import HiveContext
df.repartition(2).partitionBy("ID").write.mode("overwrite").saveAsTable("test.hiveTable")
-- test database in Hive has to exist


or


fullyQualifiedTableName="test.hiveTable"

sqltext  = ""

if (spark.sql("SHOW TABLES IN test like 'hiveTable'").count() == 1):

  rows = spark.sql(f"""SELECT COUNT(1) FROM
{fullyQualifiedTableName}""").collect()[0][0]

  print ("number of rows is ",rows)

else:

  print(f"""\nTable {fullyQualifiedTableName} does not exist, creating
table ")

  sqltext = f"""

     CREATE TABLE {fullyQualifiedTableName}(

       ID INT

     , CLUSTERED INT

     , SCATTERED INT

     , RANDOMISED INT

     , RANDOM_STRING VARCHAR(50)

     , SMALL_VC VARCHAR(50)

     , PADDING  VARCHAR(4000)

    )

    STORED AS PARQUET

    """

  spark.sql(sqltext)

sqltext = f"""

  INSERT INTO TABLE {fullyQualifiedTableName}

  SELECT

          ID

        , CLUSTERED

        , SCATTERED

        , RANDOMISED

        , RANDOM_STRING

        , SMALL_VC

        , PADDING

  FROM myTable

  """

spark.sql(sqltext)

In either case there will be a table in the Hive test database called
myTable. The advantage of Hive under the hood with Spark SQL is that
Spark-hive has an efficient API that does not require JDBC connection from
Spark to the underlying database in Hive. Other databases that support JDBC
connection will need to use JDBC API from Spark


def writeTableWithJDBC(dataFrame, url, tableName, user, password, driver,
mode):

    try:

        dataFrame. \

            write. \

            format("jdbc"). \

            option("url", url). \

            option("dbtable", tableName). \

            option("user", user). \

            option("password", password). \

            option("driver", driver). \

            mode(mode). \

            save()

    except Exception as e:

        print(f"""{e}, quitting""")

        sys.exit(1)


So in summary it is a good idea to keep Hive there.


HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 25 Apr 2021 at 10:57, chia kang ren <kangren.c...@gmail.com> wrote:

> Does it make sense to keep a Hive installation when your parquet files
> come with a transactional metadata layer like Delta Lake / Apache Iceberg?
>
> My understanding from this:
> https://github.com/delta-io/delta/issues/85
>
> is that Hive is no longer necessary in a Spark cluster other than
> discovering where the table is stored. Hence, we can simply do something
> like:
> ```
> df = spark.read.delta($LOCATION)
> df.createOrReplaceTempView("myTable")
> res = spark.sql("select * from myTable")
> ```
> and this approach still gets all the benefits of having the metadata for
> partition discovery / SQL optimization? With Delta, the Hive metastore
> should only store a pointer from the table name to the path of the table,
> and all other metadata will come from the Delta log, which will be
> processed in Spark.
>
> One reason i can think of keeping Hive is to keep track of other data
> sources that don't necessarily have a Delta / Iceberg transactional
> metadata layer. But i'm not sure if it's still worth it, are there any use
> cases i might have missed out on keeping a Hive installation after
> migrating to Delta / Iceberg?
>
> Please correct me if i've used any terms wrongly.
>
> On Sun, Apr 25, 2021 at 5:42 PM chia kang ren <kangren.c...@gmail.com>
> wrote:
>
>> Does it make sense to keep a Hive installation when your parquet files
>> come with a transactional metadata layer like Delta Lake / Apache Iceberg?
>>
>> My understanding from this:
>> https://github.com/delta-io/delta/issues/85
>>
>> is that Hive is no longer necessary in a Spark cluster other than
>> discovering where the table is stored. Hence, we can simply do something
>> like:
>> ```
>> df = spark.read.delta($LOCATION)
>> df.createOrReplaceTempView("myTable")
>> res = spark.sql("select * from myTable")
>> ```
>> and this approach still gets all the benefits of having the metadata for
>> partition discovery / SQL optimization? With Delta, the Hive metastore
>> should only store a pointer from the table name to the path of the table,
>> and all other metadata will come from the Delta log, which will be
>> processed in Spark.
>>
>> One reason i can think of keeping Hive is to keep track of other data
>> sources that don't necessarily have a Delta / Iceberg transactional
>> metadata layer. But i'm not sure if it's still worth it, are there any use
>> cases i might have missed out on keeping a Hive installation after
>> migrating to Delta / Iceberg?
>>
>> Please correct me if i've used any terms wrongly.
>>
>

Re: Is a Hive installation necessary for Spark SQL?

Reply via email to