df.write.saveAsTable("db_name.tbl_name") // works: spark-shell, latest Spark version 1.6.0
df.write.saveAsTable("db_name.tbl_name") // does NOT work: spark-shell, old Spark version 1.4
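For the older 1.4 release, one workaround is to switch the current database first and then save under an unqualified table name. A minimal sketch (assuming spark-shell with a HiveContext bound to sqlContext; db_name, tbl_name and df are placeholder names):

```scala
// Sketch for Spark 1.4, where a database-qualified name in saveAsTable is not supported.
// Assumes spark-shell with a HiveContext bound to sqlContext;
// db_name, tbl_name and df are placeholders.
import org.apache.spark.sql.SaveMode

sqlContext.sql("CREATE DATABASE IF NOT EXISTS db_name")
sqlContext.sql("USE db_name")

// With db_name as the current database, the unqualified name resolves inside it.
df.write.mode(SaveMode.Overwrite).saveAsTable("tbl_name")
```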
--
Jacky Wang
At 2016-02-21 17:35:53, "Mich Talebzadeh" <[email protected]> wrote:
I looked at the docs on this. It is not clear what goes on behind the scenes; there is very little documentation on it.
First, in Hive a database has to exist before it can be used, so sql(“use
mytable”) will not create a database for you.
Also, you cannot call your table mytable in a database named mytable!
Remember, in Hive the hierarchy is the database followed by its tables.
If you want to create a Hive database for this purpose (I would not do so for every
table), you can do:
scala> sql("create database if not exists mytable_db")
res10: org.apache.spark.sql.DataFrame = [result: string]
scala> sql("use mytable_db")
res12: org.apache.spark.sql.DataFrame = [result: string]
This puts you in the context of the mytable_db database.
If you do
hdfs dfs -ls /user/hive/warehouse
you will see that a directory called mytable_db.db has been created:
/user/hive/warehouse/mytable_db.db
Then you can create a table in mytable_db in Hive if you wish. The way I do it
personally is to register the DF as a temporary table and do an insert/select
into the Hive table.
scala> sql("use mytable_db")
res21: org.apache.spark.sql.DataFrame = [result: string]
scala> var sqltext = """
| CREATE TABLE mytable (
| INVOICENUMBER INT
| ,PAYMENTDATE timestamp
| ,NET DECIMAL(20,2)
| ,VAT DECIMAL(20,2)
| ,TOTAL DECIMAL(20,2)
| )
| COMMENT 'a test table'
| STORED AS ORC
| TBLPROPERTIES ( "orc.compress"="ZLIB" )
| """
sqltext: String =
"
CREATE TABLE mytable (
INVOICENUMBER INT
,PAYMENTDATE timestamp
,NET DECIMAL(20,2)
,VAT DECIMAL(20,2)
,TOTAL DECIMAL(20,2)
)
COMMENT 'a test table'
STORED AS ORC
TBLPROPERTIES ( "orc.compress"="ZLIB" )
"
scala> sql(sqltext)
res6: org.apache.spark.sql.DataFrame = [result: string]
My DF is called “a” below, so I register it as a temp table called tmp:
a.toDF.registerTempTable("tmp")
Then just insert/select into the Hive table mytable from tmp:
scala> sqltext = "INSERT INTO mytable SELECT * FROM tmp"
sqltext: String = INSERT INTO mytable SELECT * FROM tmp
scala> sql(sqltext)
res10: org.apache.spark.sql.DataFrame = []
scala> sql("select count(1) from mytable").show
+---+
|_c0|
+---+
| 65|
+---+
HTH
Dr Mich Talebzadeh
LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com
NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this message
shall not be understood as given or endorsed by Peridale Technology Ltd, its
subsidiaries or their employees, unless expressly so stated. It is the
responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Technology Ltd, its subsidiaries nor their employees
accept any responsibility.
From: Glen [mailto:[email protected]]
Sent: 21 February 2016 03:26
To: gen tang <[email protected]>
Cc: [email protected]
Subject: Re: how to set database in DataFrame.saveAsTable?
Any example code?
In pyspark:
sqlContext.sql("use mytable")
my_df.saveAsTable("tmp_spark_debug", mode="overwrite")
1. The code above does not seem to register the table in Hive. I have to create the
table from HDFS in Hive myself, and then it reports a format error: rcformat and parquet.
2. Rerunning saveAsTable with mode="overwrite" reports that the table already exists.
3. Sometimes it creates a directory in hive/warehouse/tmp_spark_debug, not in
hive/warehouse/mytable/tmp_spark_debug.
My goal is simple:
df.saveAsTable('blablabla') # create a Hive table in some database, so it
can be accessed from Hive.
I have tried many times; it seems there are lots of bugs in pyspark. Or is my
method wrong?
2016-02-21 10:04 GMT+08:00 gen tang <[email protected]>:
Hi,
You can use
sqlContext.sql("use <your database>")
before calling dataframe.saveAsTable.
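In spark-shell the same sequence looks like this (a sketch, assuming a HiveContext bound to sqlContext; my_db and df are hypothetical names):

```scala
// Select the target database before calling saveAsTable,
// so the unqualified table name resolves inside it.
// Assumes a HiveContext (sqlContext); my_db and df are placeholder names.
sqlContext.sql("CREATE DATABASE IF NOT EXISTS my_db")
sqlContext.sql("use my_db")
df.write.saveAsTable("tmp_spark_debug")
// The table data then lands under /user/hive/warehouse/my_db.db/tmp_spark_debug
```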
Hope it could be helpful
Cheers
Gen
On Sun, Feb 21, 2016 at 9:55 AM, Glen <[email protected]> wrote:
For a DataFrame in Spark, so that the table can be accessed from Hive.
--
Jacky Wang