Konstantin Shaposhnikov created SPARK-6489:
----------------------------------------------
             Summary: Optimize lateral view with explode to not read unnecessary columns
                 Key: SPARK-6489
                 URL: https://issues.apache.org/jira/browse/SPARK-6489
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.3.0
            Reporter: Konstantin Shaposhnikov

Currently a query with "lateral view explode(...)" results in an execution plan that reads all columns of the underlying RDD, even those the query never uses.

E.g., given a *ppl* table that is a DataFrame created from the Person case class:

{code}
case class Person(val name: String, val age: Int, val data: Array[Int])
{code}

the following SQL:

{code}
select name, sum(d) from ppl lateral view explode(data) d as d group by name
{code}

executes as follows:

{noformat}
== Physical Plan ==
Aggregate false, [name#0], [name#0,SUM(PartialSum#8L) AS _c1#3L]
 Exchange (HashPartitioning [name#0], 200)
  Aggregate true, [name#0], [name#0,SUM(CAST(d#6, LongType)) AS PartialSum#8L]
   Project [name#0,d#6]
    Generate explode(data#2), true, false
     PhysicalRDD [name#0,age#1,data#2], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35
{noformat}

Note that the *age* column is not needed to produce the output, but it is still read from the underlying RDD: the PhysicalRDD scan lists all three columns, and pruning only happens above the Generate node.

A sample program to demonstrate the issue:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

case class Person(val name: String, val age: Int, val data: Array[Int])

object ExplodeDemo extends App {
  val ppl = Array(
    Person("A", 20, Array(10, 12, 19)),
    Person("B", 25, Array(7, 8, 4)),
    Person("C", 19, Array(12, 4, 232)))

  val conf = new SparkConf().setMaster("local[2]").setAppName("sql")
  val sc = new SparkContext(conf)
  val sqlCtx = new HiveContext(sc)
  import sqlCtx.implicits._

  val df = sc.makeRDD(ppl).toDF
  df.registerTempTable("ppl")

  // Prints the plan above; note age#1 in the PhysicalRDD scan.
  val s = sqlCtx.sql("select name, sum(d) from ppl lateral view explode(data) d as d group by name")
  s.explain(true)
}
{code}
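A possible workaround, as a sketch only: project just the needed columns in a subquery before applying the lateral view, so the scan has a chance to drop the unused *age* column. Whether the 1.3.0 optimizer actually pushes this projection down below Generate has not been verified; the proper fix would be for column pruning itself to handle Generate.

{code}
// Hypothetical workaround sketch, reusing sqlCtx from the demo above.
// Pre-project name and data so age is (hopefully) never requested
// from the underlying RDD.
val pruned = sqlCtx.sql(
  """select name, sum(d)
    |from (select name, data from ppl) t
    |lateral view explode(data) d as d
    |group by name""".stripMargin)
pruned.explain(true)
{code}

Comparing the two explain outputs makes it easy to check whether *age#1* still appears in the PhysicalRDD scan.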