Concerning your second question: I believe you are trying to set the number of
partitions with something like this:

    rdd = sc.textFile(..., 8)

but methods like `textFile()` don't take a fixed number of partitions.
The second argument is a *minimum* number of partitions (minPartitions).
Since your file has 21 blocks of data, Spark creates exactly 21 partitions
(which is greater than 8, as expected). To set an exact number of partitions,
use `repartition()` or the more general `coalesce()` (see example [1]).

[1]:
http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#coalesce
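
Here's a minimal PySpark sketch of the difference; the path is just the one
from your log, and 8 is only an example target count:

    # The 8 here is only a lower bound (minPartitions), so with 21 input
    # blocks you still end up with 21 partitions.
    rdd = sc.textFile("s3n://mybucket/myfile", 8)
    print(rdd.getNumPartitions())    # likely 21, not 8

    # Force exactly 8 partitions. coalesce() avoids a full shuffle when
    # only reducing the partition count.
    rdd8 = rdd.coalesce(8)
    print(rdd8.getNumPartitions())   # 8

    # repartition(8) is equivalent to coalesce(8, shuffle=True); it always
    # shuffles, which is usually unnecessary when merely reducing partitions.
    rdd8s = rdd.repartition(8)

Since you're only going down from 21 to 8, the non-shuffling `coalesce(8)`
should be the cheaper option.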



On Sat, Nov 22, 2014 at 10:04 PM, Ashish Rangole <arang...@gmail.com> wrote:

> What makes you think that each executor is reading the whole file? If that
> were the case, then the count returned to the driver would be the actual
> count multiplied by the number of executors. Is that what you see when you
> compare it with the actual number of lines in the input file? If the count
> returned matches the actual count, then you probably don't have an
> extra-read problem.
>
> I also see this in your logs, which indicates a read that starts at an
> offset and reads one split's worth (64MB) of data:
>
> 14/11/20 15:39:45 [Executor task launch worker-1 ] INFO HadoopRDD: Input
> split: s3n://mybucket/myfile:335544320+67108864
> On Nov 22, 2014 7:23 AM, "Nitay Joffe" <ni...@actioniq.co> wrote:
>
>> Err I meant #1 :)
>>
>> - Nitay
>> Founder & CTO
>>
>>
>> On Sat, Nov 22, 2014 at 10:20 AM, Nitay Joffe <ni...@actioniq.co> wrote:
>>
>>> Anyone have any thoughts on this? Trying to understand especially #2 if
>>> it's a legit bug or something I'm doing wrong.
>>>
>>> - Nitay
>>> Founder & CTO
>>>
>>>
>>> On Thu, Nov 20, 2014 at 11:54 AM, Nitay Joffe <ni...@actioniq.co> wrote:
>>>
>>>> I have a simple S3 job to read a text file and do a line count.
>>>> Specifically, I'm doing *sc.textFile("s3n://mybucket/myfile").count*. The
>>>> file is about 1.2GB. My setup is a standalone Spark cluster with 4 workers,
>>>> each with 2 cores / 16GB RAM. I'm using branch-1.2 code built against
>>>> Hadoop 2.4 (though I'm not actually using HDFS, just straight S3 => Spark).
>>>>
>>>> The whole count is taking on the order of a couple of minutes, which
>>>> seems extremely slow.
>>>> I've been looking into it and so far have noticed two things, hoping
>>>> the community has seen this before and knows what to do...
>>>>
>>>> 1) Every executor seems to make an S3 call to read the *entire file* before
>>>> making another call to read just its split. Here's a paste I've cleaned up
>>>> to show just one task: http://goo.gl/XCfyZA. I've verified this happens in
>>>> every task. It takes a long time (40-50 seconds), and I don't see why it is
>>>> doing this.
>>>> 2) I've tried a few numPartitions parameters. When I make the parameter
>>>> anything below 21 it seems to get ignored. Under the hood FileInputFormat
>>>> is doing something that always ends up with at least 21 partitions of ~64MB
>>>> or so. I've also tried 40, 60, and 100 partitions and have seen that the
>>>> performance only gets worse as I increase it beyond 21. I would like to try
>>>> 8 just to see, but again I don't see how to force it to go below 21.
>>>>
>>>> Thanks for the help,
>>>> - Nitay
>>>> Founder & CTO
>>>>
>>>>
>>>
>>
