Err I meant #1 :)

- Nitay
Founder & CTO


On Sat, Nov 22, 2014 at 10:20 AM, Nitay Joffe <ni...@actioniq.co> wrote:

> Anyone have any thoughts on this? Trying to understand, especially for #2,
> whether it's a legit bug or something I'm doing wrong.
>
> - Nitay
> Founder & CTO
>
>
> On Thu, Nov 20, 2014 at 11:54 AM, Nitay Joffe <ni...@actioniq.co> wrote:
>
>> I have a simple S3 job to read a text file and do a line count.
>> Specifically, I'm doing *sc.textFile("s3n://mybucket/myfile").count*. The
>> file is about 1.2GB. My setup is a standalone Spark cluster with 4 workers,
>> each with 2 cores / 16GB RAM. I'm using branch-1.2 code built against
>> Hadoop 2.4 (though I'm not actually using HDFS, just straight S3 => Spark).
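>>
>> For reference, here's roughly what I'm running from the spark-shell
>> (bucket and file names are placeholders; the second form is what I use
>> when experimenting with the partition count mentioned in #2 below):
>>
>>   // sc is the SparkContext the spark-shell provides
>>   val lines = sc.textFile("s3n://mybucket/myfile")   // ~1.2GB text file on S3
>>   println(lines.count)
>>
>>   // same job, but passing an explicit partition count
>>   val lines8 = sc.textFile("s3n://mybucket/myfile", 8)
>>   println(lines8.count)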
>>
>> The whole count is taking on the order of a couple of minutes, which
>> seems extremely slow.
>> I've been looking into it and so far have noticed two things, hoping the
>> community has seen this before and knows what to do...
>>
>> 1) Every executor seems to make an S3 call to read the *entire file* before
>> making another call to read just its split. Here's a paste I've cleaned up
>> to show just one task: http://goo.gl/XCfyZA. I've verified this happens in
>> every task, and it takes a long time (40-50 seconds). I don't see why it is
>> doing this.
>> 2) I've tried a few numPartitions parameters. Any value below 21 seems to
>> get ignored: under the hood, FileInputFormat always ends up producing at
>> least 21 partitions of ~64MB each. I've also tried 40, 60, and 100
>> partitions and the performance only gets worse as I go beyond 21. I'd like
>> to try 8 just to see, but I don't see how to force it below 21 (see the
>> sketch below).
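>>
>> If I'm reading FileInputFormat's getSplits right, the split size works out
>> to max(minSplitSize, min(totalSize / numSplits, blockSize)), and the s3n
>> block size defaults to 64MB, so a ~1.2GB file can't drop much below the
>> ~21 splits I'm seeing no matter what numPartitions I pass. Two things I'm
>> planning to try next (untested sketches, bucket/file names are
>> placeholders as above):
>>
>>   // (a) coalesce after the read: the file is still read as ~21 splits,
>>   // but they get grouped into 8 tasks for the count
>>   val a = sc.textFile("s3n://mybucket/myfile").coalesce(8).count
>>
>>   // (b) raise the block size FileInputFormat sees for s3n (default 64MB),
>>   // so the splits themselves get bigger and fewer; not sure this is
>>   // actually picked up for s3n, but worth a shot
>>   sc.hadoopConfiguration.setLong("fs.s3n.block.size", 256L * 1024 * 1024)
>>   val b = sc.textFile("s3n://mybucket/myfile").count
>>
>> (a) only changes the task count, not the reads, while (b) changes the
>> splits themselves, so they may behave quite differently.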
>>
>> Thanks for the help,
>> - Nitay
>> Founder & CTO
>>
>>
>
