Err, I meant #1 :)

- Nitay
Founder & CTO
On Sat, Nov 22, 2014 at 10:20 AM, Nitay Joffe <ni...@actioniq.co> wrote:
> Anyone have any thoughts on this? Trying to understand especially #2, whether it's a legit bug or something I'm doing wrong.
>
> - Nitay
> Founder & CTO
>
> On Thu, Nov 20, 2014 at 11:54 AM, Nitay Joffe <ni...@actioniq.co> wrote:
>> I have a simple S3 job to read a text file and do a line count. Specifically I'm doing *sc.textFile("s3n://mybucket/myfile").count*. The file is about 1.2GB. My setup is a standalone Spark cluster with 4 workers, each with 2 cores / 16GB RAM. I'm using branch-1.2 code built against Hadoop 2.4 (though I'm not actually using HDFS, just straight S3 => Spark).
>>
>> The whole count is taking on the order of a couple of minutes, which seems extremely slow. I've been looking into it and so far have noticed two things, hoping the community has seen this before and knows what to do...
>>
>> 1) Every executor seems to make an S3 call to read the *entire file* before making another call to read just its split. Here's a paste I've cleaned up to show just one task: http://goo.gl/XCfyZA. I've verified this happens in every task. It takes a long time (40-50 seconds), and I don't see why it is doing this.
>> 2) I've tried a few numPartitions parameters. When I make the parameter anything below 21 it seems to get ignored. Under the hood, FileInputFormat is doing something that always ends up with at least 21 partitions of ~64MB or so. I've also tried 40, 60, and 100 partitions and have seen that the performance only gets worse as I increase it beyond 21. I would like to try 8 just to see, but I don't see how to force it to go below 21.
>>
>> Thanks for the help,
>> - Nitay
>> Founder & CTO
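
[Editor's note] For reference, below is a minimal, self-contained sketch of the job described in the thread. The bucket/key (s3n://mybucket/myfile) is the same placeholder used in the original mail, and the object/app name S3LineCount is made up for illustration. The second argument to textFile is the minPartitions hint discussed in point 2; it is only a lower bound, since Hadoop's FileInputFormat still computes the actual splits.

    import org.apache.spark.{SparkConf, SparkContext}

    object S3LineCount {
      def main(args: Array[String]): Unit = {
        // Master URL is taken from spark-submit; bucket/key below is a placeholder.
        val conf = new SparkConf().setAppName("S3LineCount")
        val sc = new SparkContext(conf)

        // The second argument is minPartitions: a minimum handed to Hadoop's
        // FileInputFormat, which still decides the actual split count
        // (here ~21 splits of ~64MB each for the 1.2GB file).
        val lines = sc.textFile("s3n://mybucket/myfile", 21)
        println("line count: " + lines.count())

        sc.stop()
      }
    }

Because minPartitions is only a hint upward, passing a value below the split count FileInputFormat computes (21 here) has no effect, which matches the behavior reported in point 2.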