OK, I solved the problem. When one uses spark-shell it starts with a HiveContext, so things work there. The caveat is that any Spark temp table created with registerTempTable("TABLE") has to be queried through the sqlContext that registered it; otherwise that table is NOT visible to HiveContext.sql.
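To illustrate that visibility rule, here is a minimal sketch (assuming Spark 1.x with the spark-hive module on the classpath; the object name, table name, and sample data are made up for illustration). Each context keeps its own temp-table registry, so a second context cannot see a table registered by the first:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object TempTableVisibility {
  def main(args: Array[String]) {
    val sc = new SparkContext(
      new SparkConf().setAppName("TempTableVisibility").setMaster("local[2]"))

    // Use a single HiveContext as the sqlContext for everything
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "name")
    df.registerTempTable("tmp")

    // Visible: queried through the same context that registered it
    sqlContext.sql("SELECT * FROM tmp").show()

    // NOT visible: a separate context has its own temp-table registry
    val otherContext = new HiveContext(sc)
    // otherContext.sql("SELECT * FROM tmp")  // would fail: "no such table tmp"
  }
}
```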
To make this work with projects built with SBT, sqlContext has to be created from HiveContext and NOT the other way round, as shown below:

  def main(args: Array[String]) {
    val conf = new SparkConf().
                 setAppName("ImportCSV").
                 setMaster("local[12]").
                 set("spark.driver.allowMultipleContexts", "true").
                 set("spark.hadoop.validateOutputSpecs", "false")
    val sc = new SparkContext(conf)
    // Create sqlContext based on HiveContext
    val sqlContext = new HiveContext(sc)

Data in the temp table can then be queried and saved as follows:

    val results = sqlContext.sql("SELECT * FROM tmp")  // tmp is the Spark temporary table here
    val output = "hdfs://rhes564:9000/user/hive/warehouse/test.db/t3"  // the path for the Hive table, which must exist
    results.write.format("orc").save(output)

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 7 March 2016 at 21:27, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> This is the code
>
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.sql.functions._
> //
> object ImportCSV {
>   def main(args: Array[String]) {
>     val conf = new SparkConf().
>                  setAppName("ImportCSV").
>                  setMaster("local[12]").
>                  set("spark.driver.allowMultipleContexts", "true").
>                  set("spark.hadoop.validateOutputSpecs", "false")
>     val sc = new SparkContext(conf)
>     val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>     import sqlContext.implicits._
>     val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>     val df = HiveContext.read.format("com.databricks.spark.csv").
>                option("inferSchema", "true").
>                option("header", "true").
>                load("/data/stg/table2")
>     val a = df.filter(col("Total") > "").map(x =>
>       (x.getString(0), x.getString(1),
>        x.getString(2).substring(1).replace(",", "").toDouble,
>        x.getString(3).substring(1).replace(",", "").toDouble,
>        x.getString(4).substring(1).replace(",", "").toDouble))
>     a.toDF.registerTempTable("tmp")
>     HiveContext.sql("use test")
>     HiveContext.sql("DROP TABLE IF EXISTS t3")
>     var sqltext : String = ""
>     sqltext = """
>     CREATE TABLE t3 (
>        INVOICENUMBER  INT
>       ,PAYMENTDATE    timestamp
>       ,NET            DECIMAL(20,2)
>       ,VAT            DECIMAL(20,2)
>       ,TOTAL          DECIMAL(20,2)
>     )
>     COMMENT 'from csv file from excel sheet'
>     STORED AS ORC
>     TBLPROPERTIES ( "orc.compress"="ZLIB" )
>     """
>     HiveContext.sql(sqltext)
>     HiveContext.sql("INSERT INTO TABLE t3 SELECT * FROM tmp")
>     HiveContext.sql("SELECT * FROM t3 ORDER BY 1").collect.foreach(println)
>   }
> }
>
> I am getting this error when running the above code with spark-submit:
>
> Exception in thread "main" org.apache.spark.sql.AnalysisException: no such table tmp; line 1 pos 35
>         at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>         at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:260)
>         at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:268)
>
> On 7 March 2016 at 21:00,
Holden Karau <hol...@pigscanfly.ca> wrote:

>> So what about if you just start with a hive context, and create your DF using the HiveContext?
>>
>> On Monday, March 7, 2016, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have done this in spark-shell and Hive itself, so it works.
>>>
>>> I am exploring whether I can do it programmatically. The problem I encountered was when I tried to register the DF as a temporary table. Trying to insert from the temporary table into the Hive table, I was getting the following error:
>>>
>>> sqltext = "INSERT INTO TABLE t3 SELECT * FROM tmp"
>>> sqlContext.sql(sqltext)
>>>
>>> Tables created with SQLContext must be TEMPORARY. Use a HiveContext instead.
>>>
>>> When I switched to HiveContext, it could not see the temporary table.
>>>
>>> So I decided to save the Spark table as follows:
>>>
>>> val a = df.filter(col("Total") > "").map(x =>
>>>   (x.getString(0), x.getString(1),
>>>    x.getString(2).substring(1).replace(",", "").toDouble,
>>>    x.getString(3).substring(1).replace(",", "").toDouble,
>>>    x.getString(4).substring(1).replace(",", "").toDouble))
>>>
>>> -- delete the file in hdfs if it already exists
>>> val hadoopConf = new org.apache.hadoop.conf.Configuration()
>>> val hdfs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://rhes564:9000"), hadoopConf)
>>> val output = "hdfs://rhes564:9000/user/hduser/t3_parquet"
>>> try { hdfs.delete(new org.apache.hadoop.fs.Path(output), true) } catch { case _ : Throwable => { } }
>>>
>>> -- save it as a Parquet file
>>> a.toDF.saveAsParquetFile(output)
>>>
>>> -- Hive table t3 is created as a simple textfile. ORC did not work!
>>> HiveContext.sql("LOAD DATA INPATH '/user/hduser/t3_parquet' into table t3")
>>>
>>> OK, that works, but it is very cumbersome.
>>>
>>> I checked the web but there are conflicting attempts to solve this issue.
>>>
>>> Please note that this can be done easily with spark-shell, as it has a built-in HiveContext.
>>>
>>> Thanks
>>>
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
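For completeness, the end-to-end flow this thread converges on — read the CSV, register the temp table, and insert into the Hive table, all through ONE HiveContext — can be sketched as follows. This is a sketch, not a tested program: it assumes Spark 1.x with spark-hive and the Databricks spark-csv package on the classpath, and reuses the paths, table names, and column handling from the code quoted above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions._

object ImportCSV {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("ImportCSV").setMaster("local[12]")
    val sc = new SparkContext(conf)

    // One context only: the HiveContext doubles as the sqlContext,
    // so the temp table registered below stays visible to the INSERT
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    val df = sqlContext.read.format("com.databricks.spark.csv").
      option("inferSchema", "true").
      option("header", "true").
      load("/data/stg/table2")

    // Strip the leading currency symbol and thousands separators,
    // then convert the money columns to Double
    val a = df.filter(col("Total") > "").map(x =>
      (x.getString(0), x.getString(1),
       x.getString(2).substring(1).replace(",", "").toDouble,
       x.getString(3).substring(1).replace(",", "").toDouble,
       x.getString(4).substring(1).replace(",", "").toDouble))

    // Registered and queried via the SAME context, so "tmp" is visible below
    a.toDF.registerTempTable("tmp")
    sqlContext.sql("use test")
    sqlContext.sql("INSERT INTO TABLE t3 SELECT * FROM tmp")
  }
}
```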