Hi all,

We are actively testing and benchmarking Spark for production use. Here are some questions about problems we have encountered so far:
1. By default, 66% of the executor memory is used for RDD caching. If there is no explicit caching in the code (e.g. rdd.cache(), rdd.persist(StorageLevel.MEMORY_AND_DISK), etc.), is this RAM wasted, or does Spark allocate it dynamically or use it for some automatic caching? (See the first sketch below the signature.)

2. We have some code like the following (a basic join):

   val rddA = sc.textFile(..).map(...)
   val rddB = sc.textFile(...).filter(...).map(...)
   val c = rddA.join(rddB).map(...)
   c.count()

   Everything goes well except the last task of the join: the job always gets stuck there and eventually hits an OOM. Even when we give it enough memory and the job finally passes, the last task of the join takes significantly longer than the other tasks (several minutes vs. 200 ms). Any pointers on this problem? (See the second sketch below the signature.)

3. For the above join, is it best practice to make rddA and rddB co-partitioned before the join?

   val rddA = sc.textFile(..).map(...).partitionBy(new HashPartitioner(128))
   val rddB = sc.textFile(...).filter(...).map(...).partitionBy(new HashPartitioner(128)) // like above

   Note that with this manual co-partitioning and various numbers of partitions, the problem above persists.

We use the spark-ec2 script with Spark 0.9.0, and all data are on the ephemeral-hdfs.

Thanks!

--
JU Han
Data Engineer @ Botify.com
+33 0619608888
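
P.S. Re question 1: I assume the 66% figure corresponds to the spark.storage.memoryFraction setting. A minimal sketch of how we could lower it through SparkConf, if that is the right knob (the app name and the 0.3 value are just placeholders):

   import org.apache.spark.{SparkConf, SparkContext}

   // Sketch only: shrink the fraction of the heap reserved for the RDD block cache
   // in a job that never calls cache()/persist(). 0.3 is an arbitrary example value.
   val conf = new SparkConf()
     .setAppName("join-benchmark")                   // placeholder app name
     .set("spark.storage.memoryFraction", "0.3")
   val sc = new SparkContext(conf)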
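
P.P.S. Re question 2: in case it helps, here is a rough sketch of how we could inspect the join-key distribution, on the guess that one very heavy key might explain the single slow task. rddA is the keyed RDD from the question; the import is needed for the pair-RDD operations in a 0.9.x application:

   import org.apache.spark.SparkContext._

   // Sketch only: count records per join key and print the heaviest keys,
   // to check whether a few keys dominate the data going into the join.
   val keyCounts = rddA.mapValues(_ => 1L).reduceByKey(_ + _)         // (key, count)
   val heaviest = keyCounts.map(_.swap).sortByKey(ascending = false).take(10)
   heaviest.foreach { case (count, key) => println(s"$key -> $count records") }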