Hi Michael.

What type of table is the underlying table? Is it HBase, Hive ORC, or something else?

By key, I take it you mean a unique ID or something similar. You then run
multiple scans on the tempTable, which stores its data in an in-memory
columnar format.

As far as I know, that in-memory columnar caching is the main storage
optimisation a tempTable gives you.
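
For example, something along these lines keeps the registered table in Spark
SQL's in-memory columnar cache (just a rough sketch using your df and
sqlContext; the table name and the key range below are illustrative):

// Register the dataframe and cache it in Spark SQL's columnar store.
df.createOrReplaceTempView("tempTable")
sqlContext.cacheTable("tempTable")

// Subsequent queries against tempTable read from the in-memory cache.
val result = sqlContext.sql("SELECT * FROM tempTable WHERE key BETWEEN 'a' AND 'k'")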

Have you tried using predicate push-down on the underlying table itself?
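
If the underlying table is, say, Hive ORC or Parquet, a filter on the key
column applied directly to the source can be pushed down to the storage layer
so that only the relevant files/stripes are read. A rough sketch (the table
name is illustrative, and I am assuming an ORC/Parquet-backed table):

import org.apache.spark.sql.functions.col

// Filter on the source table itself; Spark can push this predicate
// down to the ORC/Parquet reader.
val rangeDF = sqlContext.table("my_hive_table")
  .filter(col("key") >= lower && col("key") <= upper)

rangeDF.createOrReplaceTempView(tableName)
sqlContext.sql(sql)

You can check rangeDF.explain() to see whether the filter actually shows up
as a pushed filter in the physical plan.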

HTH




Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 31 October 2016 at 10:06, Michael David Pedersen <michael.d.peder...@googlemail.com> wrote:

> Hello,
>
> I've got a Spark SQL dataframe containing a "key" column. The queries I
> want to run start by filtering on the key range. My question in outline: is
> it possible to sort the dataset by key so as to do efficient key range
> filters, before subsequently running a more complex SQL query?
>
> I'm aware that such efficient filtering is possible for key-value RDDs,
> i.e. RDDs over Tuple2, using PairRDDFunctions. My workflow currently looks
> as follows:
>
> // Create a dataframe
> val df: DataFrame = sqlContext.sql("SELECT * FROM ...")
> val keyValRDD = df.rdd.map( (r: Row) => (r.getAs[String]("key"), r) )
>
> // Sort by key - and cache.
> val keyValRDDSorted = keyValRDD.sortByKey().cache
>
> // Define a function to run SQL query on a range.
> def queryRange(lower: String, upper: String, sql: String, tableName: String) = {
>     val rangeRDD = keyValRDDSorted.filterByRange(lower, upper)
>     val rangeDF = sqlContext.createDataFrame(rangeRDD.map{ _._2 }, df.schema)
>     rangeDF.createTempView(tableName)
>     sqlContext.sql(sql)
> }
>
> // Invoke multiple times.
> queryRange(...)
> queryRange(...)
> ...
>
> This works, and is efficient in that only the partitions containing the
> relevant key range are processed. However, I understand that Spark SQL uses
> an optimised storage format as compared to plain RDDs. The above workflow
> can't take advantage of this, as it is the key-value RDD that is cached.
>
> So, my specific question: Is there a more efficient way of achieving the
> desired result?
>
> Any pointers would be much appreciated.
>
> Many thanks,
> Michael
>
> PS: This question was also asked on StackOverflow -
> http://stackoverflow.com/questions/40129411/efficient-filtering-on-spark-sql-dataframes-with-ordered-keys.
>
