Re: Spark Dataframe and HIVE

Patrick Alwell Fri, 09 Feb 2018 07:41:00 -0800

Might sound silly, but are you using a Hive context?
What errors do the Hive query results return?


spark = SparkSession.builder.enableHiveSupport().getOrCreate()

The second part of your questions, you are creating a temp table and then 
subsequently creating another table from that temp view. Doesn’t seem like you 
are reading the table from the spark or hive warehouse.

This works fine for me; albeit I was using spark thrift to communicate with my 
directory of choice.

from pyspark import SparkContext
from pyspark.sql import SparkSession, Row, types
from pyspark.sql.types import *
from pyspark.sql import functions as f
from decimal import *
from datetime import datetime

# instantiate our sparkSession and context
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
sc = spark.sparkContext

# Generating customer orc table files
# load raw data as an RDD
customer_data = sc.textFile("/data/tpch/customer.tbl")
# map the data into an RDD split with pipe delimitations
customer_split = customer_data.map(lambda l: l.split("|"))
# map the split data with a row method; this is where we specificy column names 
and types
# default type is string- UTF8
# there are issues with converting string to date and these issues have been 
addressed
# in those tables with dates: See notes below
customer_row = customer_split.map( lambda r: Row(
    custkey=long(r[0]),
    name=r[1],
    address=r[2],
    nationkey=long(r[3]),
    phone=r[4],
    acctbal=Decimal(r[5]),
    mktsegment=r[6],
    comment=r[7]
))

# we can have Spark infer the schema, or apply a strict schema and identify 
whether or not we want null values
# in this case we don't want null values for keys; and we want explicit data 
types to support the TPCH tables/ data model
customer_schema = types.StructType([
       types.StructField('custkey',types.LongType(),False)
       ,types.StructField('name',types.StringType())
       ,types.StructField('address',types.StringType())
       ,types.StructField('nationkey',types.LongType(),False)
       ,types.StructField('phone',types.StringType())
       ,types.StructField('acctbal',types.DecimalType())
       ,types.StructField('mktsegment',types.StringType())
       ,types.StructField('comment',types.StringType())])

# we create a dataframe object by referencing our sparkSession class and the 
createDataFrame method
# this method takes two arguments by default (row, schema)
customer_df = spark.createDataFrame(customer_row,customer_schema)

# we can now write a file of type orc by referencing our dataframe object we 
created
customer_df.write.orc("/data/tpch/customer.orc")

# read that same file we created but create a seperate dataframe object
customer_df_orc = spark.read.orc("/data/tpch/customer.orc")

# reference the newly created dataframe object and create a tempView for QA 
purposes
customer_df_orc.createOrReplaceTempView("customer")

# reference the sparkSession class and SQL method in order to issue SQL 
statements to the materialized view
spark.sql("SELECT * FROM customer LIMIT 10").show()

From: "☼ R Nair (रविशंकर नायर)" <ravishankar.n...@gmail.com>
Date: Friday, February 9, 2018 at 7:03 AM
To: "user @spark/'user @spark'/spark users/user@spark" <user@spark.apache.org>
Subject: Re: Spark Dataframe and HIVE

An update: (Sorry I missed)

When I do

passion_df.createOrReplaceTempView("sampleview")

spark.sql("create table sample table as select * from sample view")

Now, I can see table and can query as well.

So why this do work from Spark and other method discussed below is not?

Thanks



On Fri, Feb 9, 2018 at 9:49 AM, ☼ R Nair (रविशंकर नायर) 
<ravishankar.n...@gmail.com<mailto:ravishankar.n...@gmail.com>> wrote:
All,

It has been three days continuously I am on this issue. Not getting any clue.

Environment: Spark 2.2.x, all configurations are correct. hive-site.xml is in 
spark's conf.

1) Step 1: I created a data frame DF1 reading a csv file.

2) Did  manipulations on DF1. Resulting frame is passion_df.

3) passion_df.write.format("orc").saveAsTable("sampledb.passion")

4) The metastore shows the hive table., when I do "show tables" in HIVE, I can 
see table name

5) I can't select in HIVE, though I can select from SPARK as spark.sql("select 
* from sampledb.passion")

Whats going on here? Please help. Why I am not seeing data from HIVE prompt?
The "describe formatted " command on the table in HIVE shows he data is is in 
default warehouse location ( /user/hive/warehouse) since I set it.

I am not getting any definite answer anywhere. Many suggestions and answers 
given in Stackoverflow et al.Nothing really works.

So asking experts here for some light on this, thanks

Best,
Ravion





--
[mage removed by sender.]

Re: Spark Dataframe and HIVE

Reply via email to