Hi, I don't know much about delta but your statement below
df.createOrReplaceTempView("myTable") res = spark.sql("select * from myTable") The so-called TempView is a reference to a hash table in memory. That is, you are mapping your dataframe* df *to a hash table in memory and it is transient, meaning that it is not persistent. I have never heard of partition discovery or optimisation of sql with a tempView. Hive comes into play when you persist the data to disk. df.createOrReplaceTempView("myTable") Then you can either store it as below on Hive from pyspark.sql import HiveContext df.repartition(2).partitionBy("ID").write.mode("overwrite").saveAsTable("test.hiveTable") -- test database in Hive has to exist or fullyQualifiedTableName="test.hiveTable" sqltext = "" if (spark.sql("SHOW TABLES IN test like 'hiveTable'").count() == 1): rows = spark.sql(f"""SELECT COUNT(1) FROM {fullyQualifiedTableName}""").collect()[0][0] print ("number of rows is ",rows) else: print(f"""\nTable {fullyQualifiedTableName} does not exist, creating table ") sqltext = f""" CREATE TABLE {fullyQualifiedTableName}( ID INT , CLUSTERED INT , SCATTERED INT , RANDOMISED INT , RANDOM_STRING VARCHAR(50) , SMALL_VC VARCHAR(50) , PADDING VARCHAR(4000) ) STORED AS PARQUET """ spark.sql(sqltext) sqltext = f""" INSERT INTO TABLE {fullyQualifiedTableName} SELECT ID , CLUSTERED , SCATTERED , RANDOMISED , RANDOM_STRING , SMALL_VC , PADDING FROM myTable """ spark.sql(sqltext) In either case there will be a table in the Hive test database called myTable. The advantage of Hive under the hood with Spark SQL is that Spark-hive has an efficient API that does not require JDBC connection from Spark to the underlying database in Hive. Other databases that support JDBC connection will need to use JDBC API from Spark def writeTableWithJDBC(dataFrame, url, tableName, user, password, driver, mode): try: dataFrame. \ write. \ format("jdbc"). \ option("url", url). \ option("dbtable", tableName). \ option("user", user). \ option("password", password). \ option("driver", driver). \ mode(mode). \ save() except Exception as e: print(f"""{e}, quitting""") sys.exit(1) So in summary it is a good idea to keep Hive there. HTH view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On Sun, 25 Apr 2021 at 10:57, chia kang ren <kangren.c...@gmail.com> wrote: > Does it make sense to keep a Hive installation when your parquet files > come with a transactional metadata layer like Delta Lake / Apache Iceberg? > > My understanding from this: > https://github.com/delta-io/delta/issues/85 > > is that Hive is no longer necessary in a Spark cluster other than > discovering where the table is stored. Hence, we can simply do something > like: > ``` > df = spark.read.delta($LOCATION) > df.createOrReplaceTempView("myTable") > res = spark.sql("select * from myTable") > ``` > and this approach still gets all the benefits of having the metadata for > partition discovery / SQL optimization? With Delta, the Hive metastore > should only store a pointer from the table name to the path of the table, > and all other metadata will come from the Delta log, which will be > processed in Spark. > > One reason i can think of keeping Hive is to keep track of other data > sources that don't necessarily have a Delta / Iceberg transactional > metadata layer. But i'm not sure if it's still worth it, are there any use > cases i might have missed out on keeping a Hive installation after > migrating to Delta / Iceberg? > > Please correct me if i've used any terms wrongly. > > On Sun, Apr 25, 2021 at 5:42 PM chia kang ren <kangren.c...@gmail.com> > wrote: > >> Does it make sense to keep a Hive installation when your parquet files >> come with a transactional metadata layer like Delta Lake / Apache Iceberg? >> >> My understanding from this: >> https://github.com/delta-io/delta/issues/85 >> >> is that Hive is no longer necessary in a Spark cluster other than >> discovering where the table is stored. Hence, we can simply do something >> like: >> ``` >> df = spark.read.delta($LOCATION) >> df.createOrReplaceTempView("myTable") >> res = spark.sql("select * from myTable") >> ``` >> and this approach still gets all the benefits of having the metadata for >> partition discovery / SQL optimization? With Delta, the Hive metastore >> should only store a pointer from the table name to the path of the table, >> and all other metadata will come from the Delta log, which will be >> processed in Spark. >> >> One reason i can think of keeping Hive is to keep track of other data >> sources that don't necessarily have a Delta / Iceberg transactional >> metadata layer. But i'm not sure if it's still worth it, are there any use >> cases i might have missed out on keeping a Hive installation after >> migrating to Delta / Iceberg? >> >> Please correct me if i've used any terms wrongly. >> >