sqlContext.sql("select distinct CARRIER from flight201601") only defines a DataFrame, and DataFrames are lazily evaluated. The call therefore returns the DataFrame object itself (which is what you got: DataFrame[CARRIER: string]) without running the query. To see the results, call an action such as show():

    sqlContext.sql("select distinct CARRIER from flight201601").show()

or:

    df = sqlContext.sql("select distinct CARRIER from flight201601")
    df.show()
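
In case the "lazy" part is unclear, here is a rough toy model of it in plain Python. This is not Spark code and not how Spark is implemented: LazyFrame, its executed flag and the three sample rows are all made up for illustration. It only shows the pattern: building the query computes nothing, printing the object describes it, and an action such as show() or collect() does the work.

```python
# Toy model of lazy evaluation. NOT Spark code: LazyFrame is a hypothetical
# class and the rows below are invented sample data.

class LazyFrame:
    def __init__(self, rows, column):
        self._rows = rows        # source data, untouched until an action runs
        self._column = column    # column the "query" selects
        self.executed = False    # becomes True once an action has run

    def __repr__(self):
        # Describe the result schema without computing any rows.
        return "DataFrame[%s: string]" % self._column

    def collect(self):
        # Action: only here is the distinct computation performed.
        self.executed = True
        seen = []
        for row in self._rows:
            value = row[self._column]
            if value not in seen:
                seen.append(value)
        return seen

    def show(self):
        # Action: compute the distinct values and print one per line.
        for value in self.collect():
            print(value)


rows = [{"CARRIER": "AA"}, {"CARRIER": "AA"}, {"CARRIER": "BB"}]
query = LazyFrame(rows, "CARRIER")

print(repr(query))       # only a description of the query
print(query.executed)    # no action has run yet
query.show()             # the action triggers the computation
```

In real PySpark, show() and collect() are the corresponding actions on a DataFrame.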

Assaf

From: Raymond Xie [mailto:xie3208...@gmail.com]
Sent: Monday, January 02, 2017 6:23 AM
To: user
Subject: What is missing here to use sql in spark?

Happy new year!

Below is my script:

pyspark --packages com.databricks:spark-csv_2.10:1.4.0

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('file:///root/Downloads/data/flight201601short2.csv')
df.show(5)
df.registerTempTable("flight201601")
sqlContext.sql("select distinct CARRIER from flight201601")

df.show(5) is below:

+----+-------+-----+------------+-----------+----------+--------------+----------+-------+--------+------+
|YEAR|QUARTER|MONTH|DAY_OF_MONTH|DAY_OF_WEEK|   FL_DATE|UNIQUE_CARRIER|AIRLINE_ID|CARRIER|TAIL_NUM|FL_NUM|
+----+-------+-----+------------+-----------+----------+--------------+----------+-------+--------+------+
|2016|      1|    1|           6|          3|2016-01-06|            AA|     19805|     AA|  N4YBAA|    43|
|2016|      1|    1|           7|          4|2016-01-07|            AA|     19805|     AA|  N434AA|    43|
|2016|      1|    1|           8|          5|2016-01-08|            AA|     19805|     AA|  N541AA|    43|
|2016|      1|    1|           9|          6|2016-01-09|            AA|     19805|     AA|  N489AA|    43|
|2016|      1|    1|          10|          7|2016-01-10|            AA|     19805|     AA|  N439AA|    43|
+----+-------+-----+------------+-----------+----------+--------------+----------+-------+--------+------+

The final result is NOT what I am expecting, it currently shows the following:

>>> sqlContext.sql("select distinct CARRIER from flight201601")
DataFrame[CARRIER: string]

I am expecting the distinct CARRIER will be created:

AA
BB
CC
...

flight201601short2.csv is attached here for your reference.

Thank you very much.

------------------------------------------------
Sincerely yours,
Raymond