I have a simple S3 job that reads a text file and does a line count. Specifically, I'm doing *sc.textFile("s3n://mybucket/myfile").count*. The file is about 1.2GB. My setup is a standalone Spark cluster with 4 workers, each with 2 cores / 16GB RAM. I'm using branch-1.2 code built against Hadoop 2.4 (though I'm not actually using HDFS, just straight S3 => Spark).
The whole count takes on the order of a couple of minutes, which seems extremely slow. I've been looking into it and have noticed two things so far. I'm hoping the community has seen this before and knows what to do.

1) Every executor seems to make an S3 call to read the *entire file* before making another call to read just its split. Here's a paste I've cleaned up to show just one task: http://goo.gl/XCfyZA. I've verified that this happens in every task. It takes a long time (40-50 seconds), and I don't see why it is doing this.

2) I've tried a few values for the numPartitions parameter. Anything below 21 seems to get ignored. Under the hood, FileInputFormat is doing something that always ends up with at least 21 partitions of ~64MB or so. I've also tried 40, 60, and 100 partitions, and performance only gets worse as I increase the count beyond 21. I would like to try 8 just to see, but again I don't see how to force it below 21.

Thanks for the help,
- Nitay
Founder & CTO
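
For reference on point 2, here is a rough sketch of the split arithmetic that Hadoop's old-API FileInputFormat (org.apache.hadoop.mapred) appears to use. It is a simplified illustration, not the actual Hadoop code. The object and method names are made up for the example, the 64 MiB block size is assumed from the ~64MB partitions described above, and the 1300 MiB file size is a hypothetical value, not the real file's size.

```scala
// Simplified sketch of the split-size rule in the old-API FileInputFormat.
// All names here are hypothetical; all sizes are in bytes.
object SplitSketch {
  val MiB: Long = 1024L * 1024L

  // splitSize = max(minSize, min(goalSize, blockSize)),
  // where goalSize = totalSize / requestedSplits.
  def splitSize(totalSize: Long, requestedSplits: Int,
                blockSize: Long, minSize: Long = 1L): Long = {
    val goalSize = totalSize / math.max(requestedSplits, 1)
    math.max(minSize, math.min(goalSize, blockSize))
  }

  // Count splits of the given size, letting the final split run up to
  // 10% over (the "slop" factor) instead of creating a tiny extra split.
  def numSplits(totalSize: Long, size: Long): Int = {
    var remaining = totalSize
    var n = 0
    while (remaining.toDouble / size > 1.1) {
      remaining -= size
      n += 1
    }
    if (remaining > 0) n += 1
    n
  }
}

// Hypothetical inputs: a 1300 MiB file and an assumed 64 MiB block size.
val total = 1300L * SplitSketch.MiB
val block = 64L * SplitSketch.MiB
val sizeFor8  = SplitSketch.splitSize(total, 8, block)   // capped at the block size
val sizeFor40 = SplitSketch.splitSize(total, 40, block)  // smaller than the block size
```

Under these assumptions, the requested count only sets a goal size, and the split size is capped at the block size. That would explain why asking for fewer partitions than totalSize / blockSize has no effect, while asking for more does. If the rule holds, getting fewer splits would need a larger block size or a larger minimum split size rather than a smaller numPartitions.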