I found some ways to speed up our unit tests, which had meanwhile crept up
to about an hour.

Apparently redefining columns one at a time in a for loop makes Catalyst
very slow, as it blows up the logical plan with a long chain of projections:

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.types.IntegerType

  // Slow: each withColumn call stacks another projection onto the plan
  final def castInts(dfIn: DataFrame, castToInts: String*): DataFrame = {
    var df = dfIn
    for (toBeCasted <- castToInts) {
      df = df.withColumn(toBeCasted, df(toBeCasted).cast(IntegerType))
    }
    df
  }

This is much faster:

  // Fast: a single select, so Catalyst sees only one projection
  final def castInts(dfIn: DataFrame, castToInts: String*): DataFrame = {
    val columns = dfIn.columns.map { c =>
      if (castToInts.contains(c)) {
        dfIn(c).cast(IntegerType)
      } else {
        dfIn(c)
      }
    }
    dfIn.select(columns: _*)
  }

After applying this consistently to other similar functions, the unit tests
went down from 60 to 18 minutes.
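For reference, the same single-select idea can be generalized to cast
several columns to arbitrary target types in one pass. This is a sketch of
such a helper (the name castColumns and the Map-based signature are my own,
not from our codebase):

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.types.DataType

  // One select over all columns, casting only the ones listed in
  // `targets`; Catalyst sees a single projection instead of a chain.
  def castColumns(dfIn: DataFrame, targets: Map[String, DataType]): DataFrame = {
    val columns = dfIn.columns.map { c =>
      targets.get(c) match {
        case Some(t) => dfIn(c).cast(t).as(c) // keep the original column name
        case None    => dfIn(c)
      }
    }
    dfIn.select(columns: _*)
  }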

Another way to break the chain of SQL optimizations was to save an
intermediate dataframe to HDFS and read it back from there. This is quite
counterintuitive, but it brought the unit tests down further, from 18
minutes to 5.
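A sketch of that write-and-read-back barrier (the helper name and the
choice of Parquet are my own; any materializing format should behave the
same way):

  import org.apache.spark.sql.{DataFrame, SparkSession}

  // Materialize the intermediate result on disk and read it back, so
  // Catalyst plans the downstream stages from a fresh scan instead of
  // folding everything into one big plan.
  def materialize(spark: SparkSession, df: DataFrame, path: String): DataFrame = {
    df.write.mode("overwrite").parquet(path)
    spark.read.parquet(path)
  }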

Is there any other way to add a barrier for Catalyst optimizations? As in:
for A -> B -> C, only optimize A -> B and B -> C, but not the complete
A -> C?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Disable-Spark-SQL-Optimizations-for-unit-tests-tp28380p28426.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
