Hi all,

We are actively testing and benchmarking Spark for production use. Here are some questions about problems we have encountered so far:
1. By default, 66% of the executor memory is used for RDD caching. If there is no explicit caching in the code (e.g. rdd.cache(), rdd.persist(StorageLevel.MEMORY_AND_DISK), etc.), is this RAM wasted, or does Spark allocate it dynamically or use it for some automatic caching? (See the first sketch below the signature.)

2. We have some code like the following (a basic join):

   val rddA = sc.textFile(..).map(...)
   val rddB = sc.textFile(...).filter(...).map(...)
   val c = rddA.join(rddB).map(...)
   c.count()

   Everything goes well except the last task of the join: the job always gets stuck there and eventually hits an OOM. Even when we give it enough memory and the job finally passes, the last task of the join takes significantly longer than the other tasks (several minutes vs. 200 ms). Any pointers on this problem? (See the second sketch below the signature.)

3. For the above join, is it best practice to make rddA and rddB co-partitioned before the join?

   val rddA = sc.textFile(..).map(...).partitionBy(new HashPartitioner(128))
   val rddB = sc.textFile(...).filter(...).map(...).partitionBy(new HashPartitioner(128)) // like above

   Note that with this manual co-partitioning and various numbers of partitions, the problem above persists.

We use the spark-ec2 script with Spark 0.9.0, and all data are on the ephemeral-hdfs.

Thanks!

--
JU Han
Data Engineer @ Botify.com
+33 0619608888
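
P.S. Re question 1: I assume the 66% figure corresponds to the spark.storage.memoryFraction setting. A minimal sketch of how we could lower it through SparkConf, if that is the right knob (the app name and the 0.3 value are just placeholders):

   import org.apache.spark.{SparkConf, SparkContext}

   // Sketch only: shrink the fraction of the heap reserved for the RDD block cache
   // in a job that never calls cache()/persist(). 0.3 is an arbitrary example value.
   val conf = new SparkConf()
     .setAppName("join-benchmark")                   // placeholder app name
     .set("spark.storage.memoryFraction", "0.3")
   val sc = new SparkContext(conf)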
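
P.P.S. Re question 2: in case it helps, here is a rough sketch of how we could inspect the join-key distribution, on the guess that one very heavy key might explain the single slow task. rddA is the keyed RDD from the question; the import is needed for the pair-RDD operations in a 0.9.x application:

   import org.apache.spark.SparkContext._

   // Sketch only: count records per join key and print the heaviest keys,
   // to check whether a few keys dominate the data going into the join.
   val keyCounts = rddA.mapValues(_ => 1L).reduceByKey(_ + _)         // (key, count)
   val heaviest = keyCounts.map(_.swap).sortByKey(ascending = false).take(10)
   heaviest.foreach { case (count, key) => println(s"$key -> $count records") }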