Dear all,

We are running into failing jobs when processing large amounts of data.
A simple local test that reproduces the question is available at
https://gist.github.com/copy-of-rezo/6a137e13a1e4f841e7eb
It generates two sets of key-value pairs, joins them, takes the distinct values and finally counts the result:

import org.apache.spark.{SparkConf, SparkContext}

object Spill {
  def generate = {
    for {
      j <- 1 to 10
      i <- 1 to 200
    } yield (j, i)
  }

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName(getClass.getSimpleName)
    conf.set("spark.shuffle.spill", "true")
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)
    println(generate)
    val dataA = sc.parallelize(generate)
    val dataB = sc.parallelize(generate)
    val dst = dataA.join(dataB).distinct().count()
    println(dst)
  }
}

We compiled it locally and ran it three times with different memory settings:

1) --executor-memory 10M --driver-memory 10M --num-executors 1 --executor-cores 1

It fails with "java.lang.OutOfMemoryError: GC overhead limit exceeded" at
  .....
  org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)

2) --executor-memory 20M --driver-memory 20M --num-executors 1 --executor-cores 1

It works OK.

3) --executor-memory 10M --driver-memory 10M --num-executors 1 --executor-cores 1

This time with less data: i goes from 1 to 100 instead of 1 to 200, which halves the input data and reduces the joined data by a factor of 4:

def generate = {
  for {
    j <- 1 to 10
    i <- 1 to 100 // previous value was 200
  } yield (j, i)
}

This version works OK.

We don't understand why 10 MB is not enough for such a simple operation on roughly 32,000 bytes of ints (2 datasets * 10 * 200 records * 2 ints * 4 bytes). 10 MB of RAM is enough once we halve the data volume (2,000 records of (Int, Int) in total). Why doesn't spilling to disk cover this case?
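For what it's worth, below is a rough back-of-the-envelope sketch (plain Scala, no Spark involved) of how we estimate the sizes in play. The byte figures are our own assumption of 4 bytes per Int with no object or serialization overhead, so they are only indicative:

// Rough estimate of input size vs. number of records the join produces.
// Not a measurement of Spark's actual memory usage.
object JoinSizeEstimate {
  def generate(maxI: Int) =
    for {
      j <- 1 to 10
      i <- 1 to maxI
    } yield (j, i)

  def main(args: Array[String]): Unit = {
    for (maxI <- Seq(200, 100)) {
      val data = generate(maxI)
      // Join on the key j: every value of a key pairs with every value of the
      // same key on the other side, so the output grows quadratically per key.
      val joinedRecords = data
        .groupBy(_._1)
        .map { case (_, vs) => vs.size.toLong * vs.size }
        .sum
      val inputBytes  = 2L * data.size * 2 * 4  // two datasets of (Int, Int)
      val joinedBytes = joinedRecords * 3 * 4   // (Int, (Int, Int)) per joined record
      println(s"i up to $maxI: input ~$inputBytes bytes, " +
              s"join produces $joinedRecords records (~$joinedBytes bytes of raw ints)")
    }
  }
}

With i up to 200 this gives about 400,000 joined records against 2,000 input records per dataset, and halving i cuts the joined records by the factor of 4 we mentioned above, which is the ratio we were reasoning about when comparing the two runs.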