I am running Spark 1.0.0, Tachyon 0.5 and Hadoop 1.0.4. I am selecting a subset of a large dataset and trying to run queries on the cached SchemaRDD. Strangely, in the web UI, I see the following:

150 Partitions

Block Name   Storage Level                       Size in Memory   Size on Disk   Executors
rdd_30_68    Memory Deserialized 1x Replicated   307.5 MB         0.0 B          ip-172-31-45-100.ec2.internal:37796
rdd_30_133   Memory Deserialized 1x Replicated   216.0 MB         0.0 B          ip-172-31-45-101.ec2.internal:55947
rdd_30_18    Memory Deserialized 1x Replicated   194.2 MB         0.0 B          ip-172-31-42-159.ec2.internal:43543
rdd_30_24    Memory Deserialized 1x Replicated   173.3 MB         0.0 B          ip-172-31-45-101.ec2.internal:55947
rdd_30_70    Memory Deserialized 1x Replicated   168.2 MB         0.0 B          ip-172-31-18-220.ec2.internal:39847
rdd_30_105   Memory Deserialized 1x Replicated   154.1 MB         0.0 B          ip-172-31-45-102.ec2.internal:36700
rdd_30_79    Memory Deserialized 1x Replicated   153.9 MB         0.0 B          ip-172-31-45-99.ec2.internal:59538
rdd_30_60    Memory Deserialized 1x Replicated   4.2 MB           0.0 B          ip-172-31-45-102.ec2.internal:36700
rdd_30_99    Memory Deserialized 1x Replicated   112.0 B          0.0 B          ip-172-31-45-102.ec2.internal:36700
rdd_30_90    Memory Deserialized 1x Replicated   112.0 B          0.0 B          ip-172-31-45-102.ec2.internal:36700
rdd_30_9     Memory Deserialized 1x Replicated   112.0 B          0.0 B          ip-172-31-18-220.ec2.internal:39847
rdd_30_89    Memory Deserialized 1x Replicated   112.0 B          0.0 B          ip-172-31-45-102.ec2.internal:36700

What is strange to me is that the size in memory is 112 bytes for almost every block, except for 8 of them. (I have 9 data files in Hadoop, which are well distributed across 64 MB blocks.) The tasks processing the RDD get stuck after the first few tasks finish. I wonder whether this is because Spark has reached the large blocks and is trying to process each of them in a single task on one worker.

Any suggestions on how I can make the block sizes more even? And why are my Hadoop blocks nicely even while the cached Spark RDD has such an uneven distribution?

Any help is appreciated.

Regards,
Ram

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cached-RDD-Block-Size-Uneven-Distribution-tp11286.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
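
P.S. To put a rough number on the skew, here is a minimal sketch in plain Python that summarizes the block sizes from the listing above. It only covers the 12 rows shown, not all 150 partitions. The helper names (`to_bytes`, `skew_summary`) and the 1 KB "near-empty" threshold are my own choices for illustration and not part of Spark, and I am assuming the UI reports sizes in 1024-based units.

```python
# Block sizes copied from the web UI listing (12 of the 150 partitions).
BLOCKS = {
    "rdd_30_68": "307.5 MB",
    "rdd_30_133": "216.0 MB",
    "rdd_30_18": "194.2 MB",
    "rdd_30_24": "173.3 MB",
    "rdd_30_70": "168.2 MB",
    "rdd_30_105": "154.1 MB",
    "rdd_30_79": "153.9 MB",
    "rdd_30_60": "4.2 MB",
    "rdd_30_99": "112.0 B",
    "rdd_30_90": "112.0 B",
    "rdd_30_9": "112.0 B",
    "rdd_30_89": "112.0 B",
}

# Assumption: the UI uses 1024-based units.
UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def to_bytes(text):
    """Convert a UI size string such as '307.5 MB' to a number of bytes."""
    value, unit = text.split()
    return float(value) * UNITS[unit]


def skew_summary(blocks, empty_threshold=1024):
    """Summarize how unevenly the cached data is spread over the blocks.

    empty_threshold is a hypothetical cutoff (in bytes) below which a
    block is counted as effectively empty.
    """
    sizes = sorted((to_bytes(s) for s in blocks.values()), reverse=True)
    total = sum(sizes)
    return {
        "blocks": len(sizes),
        "near_empty": sum(1 for s in sizes if s <= empty_threshold),
        "largest_share": sizes[0] / total,  # fraction held by the biggest block
        "total_bytes": total,
    }


if __name__ == "__main__":
    for key, value in skew_summary(BLOCKS).items():
        print(key, value)
```

If the skew is what it looks like, one thing I may try is calling `repartition` on the RDD before caching it, though I have not yet verified whether that helps in this setup.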