I am relatively new to Spark. I am using the Spark Java API to process data, and I am having trouble with a data set that I don't think is particularly large: I am joining four datasets of around 3-4 GB each (around 12 GB in total).
The workflow is:

    x = RDD1.keyBy(x).partitionBy(new HashPartitioner(10)).cache();
    y = RDD2.keyBy(x).partitionBy(new HashPartitioner(10)).cache();
    z = RDD3.keyBy(x).partitionBy(new HashPartitioner(10)).cache();
    o = RDD4.keyBy(y).partitionBy(new HashPartitioner(10)).cache();

    out = x.join(y).join(z).keyBy(y).partitionBy(new HashPartitioner(10)).cache().join(o);
    out.saveAsObjectFile("Out");

The job appears to hang at the "out=" step indefinitely. I am using Kryo for serialization, running in local mode with SPARK_MEM=90g on a machine with 16 CPUs and 108 GB RAM, and saving the output to Hadoop. I have also tried a standalone cluster with 2 workers, each with 8 CPUs and 52 GB RAM. My VMs are on Google Cloud.

Below is the table of completed stages:

    Stage Id  Description                   Submitted         Duration  Tasks: Succeeded/Total  Input     Shuffle Write
    8         keyBy at ProcessA.java:1094   10/27/2014 12:40  2.0 min   10/10
    3         filter at ProcessA.java:1079  10/27/2014 12:40  2.0 min   10/10
    2         keyBy at ProcessA.java:1071   10/27/2014 12:39  39 s      11/11                   268.4 MB  25.7 MB
    1         filter at ProcessA.java:1103  10/27/2014 12:39  16 s      9/9                     58.8 MB   30.4 MB
    7         keyBy at ProcessA.java:1045   10/27/2014 12:39  32 s      24/24                   2.8 GB    573.8 MB
    6         keyBy at ProcessA.java:1045   10/27/2014 12:39  40 s      11/11                   268.4 MB  24.5 MB

Some things I don't understand: I see entries in the log files indicating that the in-memory map is being spilled to disk, and the spill size is greater than the input. I am not sure how to avoid or reduce that. I also tried cluster mode and observed the same behavior there, which makes me question whether the tasks are running in parallel or serially.

    14/10/27 14:11:33 INFO collection.ExternalAppendOnlyMap: Thread 94 spilling in-memory map of 1000 MB to disk (15 times so far)
    14/10/27 14:11:34 INFO collection.ExternalAppendOnlyMap: Thread 107 spilling in-memory map of 2351 MB to disk (2 times so far)
    14/10/27 14:11:36 INFO collection.ExternalAppendOnlyMap: Thread 94 spilling in-memory map of 1000 MB to disk (16 times so far)
    14/10/27 14:11:37 INFO collection.ExternalAppendOnlyMap: Thread 91 spilling in-memory map of 4781 MB to disk (10 times so far)
    14/10/27 14:11:38 INFO collection.ExternalAppendOnlyMap: Thread 112 spilling in-memory map of 1243 MB to disk (10 times so far)
    14/10/27 14:11:39 INFO collection.ExternalAppendOnlyMap: Thread 94 spilling in-memory map of 983 MB to disk (17 times so far)
    14/10/27 14:11:39 INFO collection.ExternalAppendOnlyMap: Thread 96 spilling in-memory map of 75546 MB to disk (11 times so far)
    14/10/27 14:11:56 INFO collection.ExternalAppendOnlyMap: Thread 106 spilling in-memory map of 2324 MB to disk (7 times so far)
    14/10/27 14:11:56 INFO collection.ExternalAppendOnlyMap: Thread 112 spilling in-memory map of 1729 MB to disk (11 times so far)
    14/10/27 14:11:58 INFO collection.ExternalAppendOnlyMap: Thread 96 spilling in-memory map of 2410 MB to disk (12 times so far)
    14/10/27 14:11:58 INFO collection.ExternalAppendOnlyMap: Thread 91 spilling in-memory map of 1211 MB to disk

I would appreciate any pointers in the right direction!

By the way, I also see error messages like "Not enough space to cache partition rdd_21_4", which suggests that perhaps nothing is getting cached (per http://mail-archives.apache.org/mod_mbox/spark-issues/201409.mbox/%3cjira.12744773.1412020990000.148323.1412021014...@atlassian.jira%3E).
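For reference, here is the workflow above rewritten as a compilable sketch against the Spark Java API. The input paths, the comma-separated string records, and the keyX/keyY extractors are placeholders for illustration, not my real code; the one deliberate change is that the same HashPartitioner instance is passed to every partitionBy and join, so the co-partitioned joins do not need to reshuffle either side.

    import org.apache.spark.HashPartitioner;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class ProcessASketch {

        // Placeholder key extractors standing in for whatever keyBy(x) / keyBy(y) do.
        static String keyX(String record) { return record.split(",")[0]; }
        static String keyY(String record) { return record.split(",")[1]; }

        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                .setAppName("ProcessA")
                .setMaster("local[16]")
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // One shared partitioner: pair RDDs partitioned by the same partitioner
            // can be joined without reshuffling either side.
            HashPartitioner part = new HashPartitioner(10);

            JavaPairRDD<String, String> x =
                sc.textFile("rdd1").keyBy(ProcessASketch::keyX).partitionBy(part).cache();
            JavaPairRDD<String, String> y =
                sc.textFile("rdd2").keyBy(ProcessASketch::keyX).partitionBy(part).cache();
            JavaPairRDD<String, String> z =
                sc.textFile("rdd3").keyBy(ProcessASketch::keyX).partitionBy(part).cache();
            JavaPairRDD<String, String> o =
                sc.textFile("rdd4").keyBy(ProcessASketch::keyY).partitionBy(part).cache();

            // Join x, y, z on the first key, then re-key the combined record by the
            // second key (taken from y's value) before joining against o.
            JavaPairRDD<String, String> xyz = x.join(y, part)
                .join(z, part)
                .mapToPair(t -> new Tuple2<String, String>(
                    keyY(t._2()._1()._2()),
                    t._2()._1()._1() + "," + t._2()._1()._2() + "," + t._2()._2()))
                .partitionBy(part)
                .cache();

            JavaPairRDD<String, Tuple2<String, String>> out = xyz.join(o, part);
            out.saveAsObjectFile("Out");

            sc.stop();
        }
    }

Note that with HashPartitioner(10) and ~12 GB of input, each partition holds on the order of a gigabyte, which would line up with the spill sizes in the log above; a larger partition count would shrink each in-memory map.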