Part of it can be that the disk cannot keep up with your CPUs; the other is
stragglers: some partitions might be bigger than others, etc.
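
The straggler theory is easy to check. A minimal sketch (assuming a live
SparkContext, with `data_partitions` standing in for the pair RDD from the
code quoted below) that prints per-partition record counts via `glom`:

```scala
// glom() turns each partition into a single Array, so the array lengths
// are the per-partition record counts. One count that dwarfs the rest
// means a straggler task will keep a single CPU busy at the end of the job.
val counts = data_partitions.glom().map(_.length).collect()
counts.zipWithIndex.foreach { case (n, i) =>
  println("partition " + i + ": " + n + " records")
}
```

If one partition is far larger than the rest, that one task dominates the
tail of the job, which would match the "only one CPU busy at the end" symptom.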


--
Reynold Xin, AMPLab, UC Berkeley
http://rxin.org
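
One more note on the writer below: deriving the filename from the first record
of a partition is subtle, because a Scala Iterator can only be traversed once,
and reusing it after `take` is formally unspecified. A minimal plain-Scala
sketch of the pitfall and of the `BufferedIterator` alternative (the data here
is illustrative):

```scala
// Pitfall: taking the first element to read its key consumes that element,
// so the first record of every partition would never reach the writer.
val records = Iterator(("fileA", "r1"), ("fileA", "r2"), ("fileA", "r3"))
val key = records.take(1).next()._1   // consumes ("fileA", "r1")

// Fix: BufferedIterator.head peeks at the first element without consuming it.
val safe = Iterator(("fileA", "r1"), ("fileA", "r2"), ("fileA", "r3")).buffered
val filename = safe.head._1           // "fileA" -- nothing consumed
println(safe.size)                    // 3: all records still available
```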



On Mon, Sep 23, 2013 at 8:50 PM, Xiang Huo <[email protected]> wrote:

> What I am doing is splitting a large RDD into several smaller ones and
> writing each small one to a separate file. I use the Partitioner and
> PairRDDFunctions classes to do this. What is very strange is that every
> time I run my program, all CPUs run at full load at the beginning, but by
> the end only one CPU is working at full load. I guess the cause is the
> part that writes to disk, but I have no idea how to improve it.
>
> This is the code I use to write each partition to a file:
>  data_partitions.foreachPartition { r =>
>    if (r.nonEmpty) {
>      // Buffer the iterator so the key of the first record can be read
>      // without consuming it (an Iterator can only be traversed once).
>      val it = r.buffered
>      val filename = it.head._1
>      println("Filename: " + filename)
>      val outFile = new java.io.FileWriter(outPath + filename)
>      it.foreach(record =>
>        outFile.write(new ParseDNSFast().antiConvert(record._2) + "\n"))
>      outFile.close()
>    }
>  }
>
> Any help is appreciated!
>
>
>
>
> 2013/9/23 Reynold Xin <[email protected]>
>
>> It's probably just because your application is only using 2 threads.
>> Spark should be allocating a large enough thread pool, but the RDDs you
>> are operating on might have only 2 partitions, for example.
>>
>> To give it a try, do
>>
>> sc.parallelize(1 to 10, 20).mapPartitions { iter =>
>>   Thread.sleep(10000000); iter
>> }.count
>>
>> And see if you have 20 tasks being launched.
>>
>>
>>
>> --
>> Reynold Xin, AMPLab, UC Berkeley
>> http://rxin.org
>>
>>
>>
>> On Sun, Sep 22, 2013 at 9:48 PM, Xiang Huo <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I am trying to run a Spark program on a server. It is not a cluster,
>>> just a single server. I want to configure my Spark program to use at
>>> most 20 CPUs, because this machine is also shared with other users.
>>>
>>> I know I can set local[K] as the value of the Master URL to limit how
>>> many worker threads the program uses. But after I run my program, only
>>> one or two CPUs are used, and the program takes a long time to run with
>>> so few CPUs in use.
>>>
>>> Has anyone met a similar situation, or does anyone have any suggestions?
>>>
>>> Thanks.
>>>
>>> Xiang
>>> --
>>> Xiang Huo
>>> Department of Computer Science
>>> University of Illinois at Chicago(UIC)
>>> Chicago, Illinois
>>> US
>>> Email: [email protected]
>>>            or [email protected]
>>>
>>
>>
>
>
> --
> Xiang Huo
> Department of Computer Science
> University of Illinois at Chicago(UIC)
> Chicago, Illinois
> US
> Email: [email protected]
>            or [email protected]
>
