Re: Disable Spark SQL Optimizations for unit tests
I found some ways to get faster unit tests. In the meantime they had gone up to about an hour. Apparently defining columns in a for loop makes Catalyst very slow, as it blows up the logical plan with many nested projections:

    final def castInts(dfIn: DataFrame, castToInts: String*): DataFrame = {
      var df = dfIn
      for (toBeCasted <- castToInts) {
        df = df.withColumn(toBeCasted, df(toBeCasted).cast(IntegerType))
      }
      df
    }

This is much faster, as it produces a single projection:

    final def castInts(dfIn: DataFrame, castToInts: String*): DataFrame = {
      val columns = dfIn.columns.map { c =>
        if (castToInts.contains(c)) {
          dfIn(c).cast(IntegerType)
        } else {
          dfIn(c)
        }
      }
      dfIn.select(columns: _*)
    }

After I applied this consistently to other similar functions, the unit tests went down from 60 to 18 minutes.

Another way to break up the SQL optimizations was to simply save an intermediate dataframe to HDFS and read it back from there. This is quite counterintuitive, but it brought the unit tests down further, from 18 minutes to 5.

Is there any other way to add a barrier for Catalyst optimizations? As in A -> B -> C: only optimize A -> B and B -> C, but not the complete A -> C?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Disable-Spark-SQL-Optimizations-for-unit-tests-tp28380p28426.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
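[Editor's note: one in-memory alternative to the HDFS round-trip, sketched here as an assumption rather than something confirmed in this thread, is to rebuild the DataFrame from its underlying RDD. That discards the accumulated logical plan, so Catalyst only ever optimizes the halves on either side of the barrier. The `PlanBarrier` object and `barrier` name below are illustrative, not Spark API:]

```scala
import org.apache.spark.sql.DataFrame

// Sketch: an optimization "barrier" without writing to HDFS.
// Rebuilding the DataFrame from its RDD[Row] and schema starts a fresh
// logical plan, so for A -> B -> C Catalyst sees A -> B and B -> C
// separately instead of one large combined plan.
object PlanBarrier {
  def barrier(df: DataFrame): DataFrame = {
    val spark = df.sparkSession
    spark.createDataFrame(df.rdd, df.schema)
  }
}
```

Since Spark 2.1 there is also `df.checkpoint()`, which eagerly materializes the data and truncates the plan, but it requires `spark.sparkContext.setCheckpointDir(...)` to be configured first, so the RDD round-trip may be the lighter option in tests.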
Disable Spark SQL Optimizations for unit tests
Hi,

Can the Spark SQL optimizations be disabled somehow?

In our project we started four weeks ago to write Scala / Spark / DataFrame code. We currently have only around 10% of the planned project scope, and we are already waiting 10 minutes (Spark 2.1.0, everything cached) to 30 minutes (Spark 1.6, nothing cached) for a single unit test run to finish. We have, for example, one Scala file with maybe 80 lines of code (several joins, several subtrees reused in different places) that takes up to 6 minutes to be optimized (the Catalyst output is also > 100 MB). The input for our unit tests is usually 2 to 3 rows. That is the motivation to disable the optimizer in unit tests.

I have found this unanswered SO post <http://stackoverflow.com/questions/33984152/how-to-speed-up-spark-sql-unit-tests>, but not much more on that topic. I have also found this SimpleTestOptimizer <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L151>, which sounds perfect, but I have no idea how to instantiate a SparkSession so that it uses that one.

Does nobody else have this problem? Is there something fundamentally wrong with our approach?

Regards,
Stefan Ackermann

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Disable-Spark-SQL-Optimizations-for-unit-tests-tp28380.html
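[Editor's note: the SimpleTestOptimizer lives in Spark's internal Catalyst package and cannot be plugged into a SparkSession without forking the session state. A less invasive knob, offered here as a hedged sketch rather than a confirmed fix for this thread, is the internal setting `spark.sql.optimizer.maxIterations`, which caps how many times Catalyst reruns its fixed-point rule batches; setting it to 1 runs each batch at most once. The `TestSpark` object name is illustrative:]

```scala
import org.apache.spark.sql.SparkSession

// Sketch: a test-only session that limits Catalyst's optimizer work.
// "spark.sql.optimizer.maxIterations" is an internal Spark SQL setting
// (default 100); with 1, each fixed-point optimizer batch runs only once,
// which trades optimized plans for faster planning on tiny test inputs.
object TestSpark {
  def session(): SparkSession =
    SparkSession.builder()
      .master("local[1]")
      .appName("unit-tests")
      .config("spark.sql.optimizer.maxIterations", "1")
      .getOrCreate()
}
```

In later Spark versions (2.4+) there is also `spark.sql.optimizer.excludedRules` for disabling individual rules by name, but that option did not exist in 1.6 or 2.1.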