I have a simple S3 job to read a text file and do a line count.
Specifically I'm doing *sc.textFile("s3n://mybucket/myfile").count*. The
file is about 1.2GB. My setup is a standalone Spark cluster with 4 workers,
each with 2 cores / 16GB RAM. I'm using branch-1.2 code built against
Hadoop 2.4 (though I'm not actually using HDFS, just straight S3 => Spark).
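For reference, this is essentially what I'm running in spark-shell (with a
rough timing wrapper around it; the bucket/file names are the same
placeholders as above):

    // What I'm running, roughly, in spark-shell:
    val start = System.currentTimeMillis
    val n = sc.textFile("s3n://mybucket/myfile").count
    println("lines: " + n + ", took " + (System.currentTimeMillis - start) + " ms")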

The whole count is taking on the order of a couple of minutes, which seems
extremely slow.
I've been looking into it and so far have noticed two things, hoping the
community has seen this before and knows what to do...

1) Every executor seems to make an S3 call to read the *entire file* before
making another call to read just its split. Here's a paste I've cleaned up
to show just one task: http://goo.gl/XCfyZA. I've verified this happens in
every task, and that full-file read alone takes 40-50 seconds. I don't see
why it's doing this.
2) I've tried a few numPartitions parameters. Anything below 21 seems to
get ignored; under the hood FileInputFormat always ends up producing at
least 21 partitions of ~64MB or so. I've also tried 40, 60, and 100
partitions and performance only gets worse as I go beyond 21. I'd like to
try 8 just to see, but I don't see how to force it below 21 (see the sketch
right after this list for what I'm considering).
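In case it's relevant, here's a rough sketch of the two workarounds I'm
thinking about for (2); I haven't verified either on this job, so treat it
as an assumption. My understanding is that FileInputFormat picks a split
size of max(minSplitSize, min(totalSize / numSplits, blockSize)), and with
fs.s3n.block.size defaulting to 64MB that would explain the ~21 floor for a
1.2GB file:

    // Option A (assumes fs.s3n.block.size is respected by the s3n
    // FileSystem when computing splits): raise the reported block size
    // so each split is ~150MB instead of 64MB. Value is in bytes.
    sc.hadoopConfiguration.set("fs.s3n.block.size", (150 * 1024 * 1024).toString)
    val countA = sc.textFile("s3n://mybucket/myfile").count

    // Option B: let Hadoop pick its ~21 splits, then merge partitions
    // down to 8 without a shuffle before counting.
    val countB = sc.textFile("s3n://mybucket/myfile").coalesce(8).count

Option B obviously doesn't avoid the 21 S3 reads, it just changes how many
tasks process them, so A is the one I'd really like to confirm works.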

Thanks for the help,
- Nitay
Founder & CTO
