RE: What is missing here to use sql in spark?

Mendelson, Assaf Mon, 02 Jan 2017 05:24:08 -0800

sqlContext.sql("select distinct CARRIER from flight201601") defines a DataFrame, 
which is lazily evaluated.
This means the call only returns the DataFrame object (which is what you got); 
nothing is computed until you invoke an action on it.
To see the results, do:
sqlContext.sql("select distinct CARRIER from flight201601").show()
or
df = sqlContext.sql("select distinct CARRIER from flight201601")
df.show()


Assaf


From: Raymond Xie [mailto:xie3208...@gmail.com]
Sent: Monday, January 02, 2017 6:23 AM
To: user
Subject: What is missing here to use sql in spark?

Happy new year!

Below is my script:

pyspark --packages com.databricks:spark-csv_2.10:1.4.0
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', 
inferschema='true').load('file:///root/Downloads/data/flight201601short2.csv')
df.show(5)
df.registerTempTable("flight201601")
sqlContext.sql("select distinct CARRIER from flight201601")

df.show(5) is below:

+----+-------+-----+------------+-----------+----------+--------------+----------+-------+--------+------+
|YEAR|QUARTER|MONTH|DAY_OF_MONTH|DAY_OF_WEEK|   FL_DATE|UNIQUE_CARRIER|AIRLINE_ID|CARRIER|TAIL_NUM|FL_NUM|
+----+-------+-----+------------+-----------+----------+--------------+----------+-------+--------+------+
|2016|      1|    1|           6|          3|2016-01-06|            AA|     19805|     AA|  N4YBAA|    43|
|2016|      1|    1|           7|          4|2016-01-07|            AA|     19805|     AA|  N434AA|    43|
|2016|      1|    1|           8|          5|2016-01-08|            AA|     19805|     AA|  N541AA|    43|
|2016|      1|    1|           9|          6|2016-01-09|            AA|     19805|     AA|  N489AA|    43|
|2016|      1|    1|          10|          7|2016-01-10|            AA|     19805|     AA|  N439AA|    43|
+----+-------+-----+------------+-----------+----------+--------------+----------+-------+--------+------+

The final result is NOT what I am expecting. It currently shows the following:

>>> sqlContext.sql("select distinct CARRIER from flight201601")
DataFrame[CARRIER: string]

I am expecting the distinct CARRIER values to be listed:

AA
BB
CC
...

flight201601short2.csv is attached here for your reference.


Thank you very much.



------------------------------------------------
Sincerely yours,


Raymond
