Hello,
I've got a Spark SQL DataFrame containing a "key" column. The queries I
want to run start by filtering on the key range. My question, in outline: is
it possible to sort the dataset by key, so as to do efficient key-range
filters, before subsequently running a more complex SQL query?
I'm aware that such efficient filtering is possible for key-value RDDs,
i.e. RDDs over Tuple2, via the sortByKey and filterByRange methods of
OrderedRDDFunctions. My workflow currently looks as follows:
// Create a DataFrame.
val df: DataFrame = sqlContext.sql("SELECT * FROM ...")
// Key the rows on the "key" column.
val keyValRDD = df.rdd.map( (r: Row) => (r.getAs[String]("key"), r) )
// Sort by key - and cache.
val keyValRDDSorted = keyValRDD.sortByKey().cache
// Define a function to run an SQL query on a key range.
def queryRange(lower: String, upper: String, sql: String, tableName: String) = {
  // filterByRange only touches the partitions that can contain the range.
  val rangeRDD = keyValRDDSorted.filterByRange(lower, upper)
  val rangeDF = sqlContext.createDataFrame(rangeRDD.map{ _._2 }, df.schema)
  rangeDF.createTempView(tableName)
  sqlContext.sql(sql)
}
// Invoke multiple times.
queryRange(...)
queryRange(...)
...
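For concreteness, an invocation might look like the following (the range bounds, queries, and table names here are made up for illustration):

```scala
// Hypothetical usage: run two different queries over two key ranges.
queryRange("aaa", "mmm", "SELECT key, COUNT(*) FROM r1 GROUP BY key", "r1")
queryRange("nnn", "zzz", "SELECT * FROM r2 WHERE key LIKE 'n%'", "r2")
```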
This works, and is efficient in that only the partitions containing the
relevant key range are processed. However, I understand that when a
DataFrame is cached, Spark SQL uses an optimised in-memory columnar format,
whereas plain RDDs are cached as deserialised objects. The above workflow
can't take advantage of this, since it is the key-value RDD, not the
DataFrame, that is cached.
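For comparison, the pure-DataFrame version I have in mind is sketched below; what I can't tell is whether the plain filter on the sorted, cached DataFrame would skip out-of-range data the way filterByRange does:

```scala
// Sketch of a DataFrame-only alternative: sort by key and cache the
// DataFrame itself, so the in-memory columnar format is used.
val dfSorted = df.sort("key").cache()

def queryRangeDF(lower: String, upper: String, sql: String, tableName: String) = {
  // A plain filter; unclear whether this prunes partitions as
  // filterByRange does on a sorted RDD.
  val rangeDF = dfSorted.filter(dfSorted("key") >= lower && dfSorted("key") <= upper)
  rangeDF.createTempView(tableName)
  sqlContext.sql(sql)
}
```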
So, my specific question: Is there a more efficient way of achieving the
desired result?
Any pointers would be much appreciated.
Many thanks,
Michael
PS: This question was also asked on StackOverflow:
http://stackoverflow.com/questions/40129411/efficient-filtering-on-spark-sql-dataframes-with-ordered-keys