I am relatively new to Spark. I am using the Spark Java API to process data, and I am having trouble with a data set that I don't think is particularly large: I am joining four datasets of around 3-4 GB each (around 12 GB in total).
The workflow is:

    x = RDD1.keyBy(x).partitionBy(new HashPartitioner(10)).cache();
    y = RDD2.keyBy(x).partitionBy(new HashPartitioner(10)).cache();
    z = RDD3.keyBy(x).partitionBy(new HashPartitioner(10)).cache();
    o = RDD4.keyBy(y).partitionBy(new HashPartitioner(10)).cache();

    out = x.join(y).join(z).keyBy(y).partitionBy(new HashPartitioner(10)).cache().join(o);
    out.saveAsObjectFile("Out");

The job appears to hang at the "out=" step indefinitely. I am using Kryo for serialization, running in local mode with SPARK_MEM=90g on a machine with 16 CPUs and 108 GB RAM, and saving the output to Hadoop. I have also tried a standalone cluster with 2 workers, each with 8 CPUs and 52 GB RAM. My VMs are on Google Cloud.

Below is the table of completed stages:

    Stage Id  Description                   Submitted         Duration  Tasks: Succeeded/Total  Input     Shuffle Write
    8         keyBy at ProcessA.java:1094   10/27/2014 12:40  2.0 min   10/10
    3         filter at ProcessA.java:1079  10/27/2014 12:40  2.0 min   10/10
    2         keyBy at ProcessA.java:1071   10/27/2014 12:39  39 s      11/11                   268.4 MB  25.7 MB
    1         filter at ProcessA.java:1103  10/27/2014 12:39  16 s      9/9                     58.8 MB   30.4 MB
    7         keyBy at ProcessA.java:1045   10/27/2014 12:39  32 s      24/24                   2.8 GB    573.8 MB
    6         keyBy at ProcessA.java:1045   10/27/2014 12:39  40 s      11/11                   268.4 MB  24.5 MB

Some things I don't understand: I see entries in the log files indicating that the in-memory map is being spilled to disk, and the spill size is greater than the input. I am not sure how to avoid or reduce that. I also tried cluster mode and observed the same behavior there, which makes me question whether the tasks are running in parallel or serially.

    14/10/27 14:11:33 INFO collection.ExternalAppendOnlyMap: Thread 94 spilling in-memory map of 1000 MB to disk (15 times so far)
    14/10/27 14:11:34 INFO collection.ExternalAppendOnlyMap: Thread 107 spilling in-memory map of 2351 MB to disk (2 times so far)
    14/10/27 14:11:36 INFO collection.ExternalAppendOnlyMap: Thread 94 spilling in-memory map of 1000 MB to disk (16 times so far)
    14/10/27 14:11:37 INFO collection.ExternalAppendOnlyMap: Thread 91 spilling in-memory map of 4781 MB to disk (10 times so far)
    14/10/27 14:11:38 INFO collection.ExternalAppendOnlyMap: Thread 112 spilling in-memory map of 1243 MB to disk (10 times so far)
    14/10/27 14:11:39 INFO collection.ExternalAppendOnlyMap: Thread 94 spilling in-memory map of 983 MB to disk (17 times so far)
    14/10/27 14:11:39 INFO collection.ExternalAppendOnlyMap: Thread 96 spilling in-memory map of 75546 MB to disk (11 times so far)
    14/10/27 14:11:56 INFO collection.ExternalAppendOnlyMap: Thread 106 spilling in-memory map of 2324 MB to disk (7 times so far)
    14/10/27 14:11:56 INFO collection.ExternalAppendOnlyMap: Thread 112 spilling in-memory map of 1729 MB to disk (11 times so far)
    14/10/27 14:11:58 INFO collection.ExternalAppendOnlyMap: Thread 96 spilling in-memory map of 2410 MB to disk (12 times so far)
    14/10/27 14:11:58 INFO collection.ExternalAppendOnlyMap: Thread 91 spilling in-memory map of 1211 MB to disk

I would appreciate any pointers in the right direction!

By the way, I also see error messages like "Not enough space to cache partition rdd_21_4", which suggests that perhaps nothing is getting cached (per http://mail-archives.apache.org/mod_mbox/spark-issues/201409.mbox/%3cjira.12744773.1412020990000.148323.1412021014...@atlassian.jira%3E).
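For reference, here is the workflow above rewritten as a compilable sketch against the Spark Java API. The input paths, the comma-separated string records, and the keyX/keyY extractors are placeholders for illustration, not my real code; the one deliberate change is that the same HashPartitioner instance is passed to every partitionBy and join, so the co-partitioned joins do not need to reshuffle either side.

    import org.apache.spark.HashPartitioner;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class ProcessASketch {

        // Placeholder key extractors standing in for whatever keyBy(x) / keyBy(y) do.
        static String keyX(String record) { return record.split(",")[0]; }
        static String keyY(String record) { return record.split(",")[1]; }

        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                .setAppName("ProcessA")
                .setMaster("local[16]")
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // One shared partitioner: pair RDDs partitioned by the same partitioner
            // can be joined without reshuffling either side.
            HashPartitioner part = new HashPartitioner(10);

            JavaPairRDD<String, String> x =
                sc.textFile("rdd1").keyBy(ProcessASketch::keyX).partitionBy(part).cache();
            JavaPairRDD<String, String> y =
                sc.textFile("rdd2").keyBy(ProcessASketch::keyX).partitionBy(part).cache();
            JavaPairRDD<String, String> z =
                sc.textFile("rdd3").keyBy(ProcessASketch::keyX).partitionBy(part).cache();
            JavaPairRDD<String, String> o =
                sc.textFile("rdd4").keyBy(ProcessASketch::keyY).partitionBy(part).cache();

            // Join x, y, z on the first key, then re-key the combined record by the
            // second key (taken from y's value) before joining against o.
            JavaPairRDD<String, String> xyz = x.join(y, part)
                .join(z, part)
                .mapToPair(t -> new Tuple2<String, String>(
                    keyY(t._2()._1()._2()),
                    t._2()._1()._1() + "," + t._2()._1()._2() + "," + t._2()._2()))
                .partitionBy(part)
                .cache();

            JavaPairRDD<String, Tuple2<String, String>> out = xyz.join(o, part);
            out.saveAsObjectFile("Out");

            sc.stop();
        }
    }

Note that with HashPartitioner(10) and ~12 GB of input, each partition holds on the order of a gigabyte, which would line up with the spill sizes in the log above; a larger partition count would shrink each in-memory map.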