Is there any function that can be used to write each partition into a file directly? I think my method, creating a java.io.FileWriter for each partition, is not an efficient way to do this.
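For example, would saveAsHadoopFile with a custom MultipleTextOutputFormat be the right direction? Below is a minimal sketch of what I have in mind (assuming data_partitions is keyed by file name as in my snippet quoted below, that antiConvert returns a String, and that writing through the Hadoop output path is acceptable; KeyBasedOutput is just a name I made up):

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Route each record to an output file named after its key.
class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  // Use the key itself as the file name under the output directory.
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString
  // Drop the key so only the value appears in the file.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
}

data_partitions
  .map { case (filename, record) => (filename, new ParseDNSFast().antiConvert(record)) }
  .saveAsHadoopFile(outPath, classOf[String], classOf[String], classOf[KeyBasedOutput])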
Thanks.

2013/9/23 Reynold Xin <[email protected]>

> Part of it can be that the disk cannot keep up with your CPU, and the
> other is stragglers: some partitions might be bigger, etc.
>
> --
> Reynold Xin, AMPLab, UC Berkeley
> http://rxin.org
>
>
> On Mon, Sep 23, 2013 at 8:50 PM, Xiang Huo <[email protected]> wrote:
>
>> What I am doing is splitting a large RDD into several small ones and
>> writing each small one into a separate file. I use the Partitioner and
>> PairRDDFunctions classes to do this. What is very strange is that every
>> time I run my program, all CPUs run at full workload at the beginning,
>> but at the end only one CPU is working at full workload. I guess the
>> reason is the part that writes to disk, but I have no idea how to
>> improve it.
>>
>> This is the code I use to write each partition into a file:
>>
>> data_partitions.foreachPartition(r => {
>>   if (r.nonEmpty) {
>>     val filename = r.take(1).toArray.apply(0)._1
>>     println("Filename: " + filename)
>>     val outFile = new java.io.FileWriter(outPath + filename)
>>     r.map(record => new ParseDNSFast().antiConvert(record._2))
>>       .foreach(r => outFile.write(r + "\n"))
>>     outFile.close
>>   }
>> })
>>
>> Any help is appreciated!
>>
>> 2013/9/23 Reynold Xin <[email protected]>
>>
>>> It's probably just because your application is only using 2 threads.
>>> Spark should be allocating a large enough thread pool, but the RDDs
>>> you are operating on may have only 2 partitions, for example.
>>>
>>> To give it a try, run
>>>
>>> sc.parallelize(1 to 10, 20).mapPartitions { iter =>
>>>   Thread.sleep(10000000); iter }.count
>>>
>>> and see whether 20 tasks are launched.
>>>
>>> --
>>> Reynold Xin, AMPLab, UC Berkeley
>>> http://rxin.org
>>>
>>>
>>> On Sun, Sep 22, 2013 at 9:48 PM, Xiang Huo <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am trying to run a Spark program on a server. It is not a cluster,
>>>> just a single server. I want to configure my Spark program to use at
>>>> most 20 CPUs, because the machine is also shared with other users.
>>>>
>>>> I know I can set local[K] as the master URL to limit how many worker
>>>> threads the program uses. But when I run my program, only one or two
>>>> CPUs are used, and the program takes a long time to run with so few
>>>> CPUs.
>>>>
>>>> Has anyone met a similar situation, or does anyone have a suggestion?
>>>>
>>>> Thanks.
>>>>
>>>> Xiang
>>>> --
>>>> Xiang Huo
>>>> Department of Computer Science
>>>> University of Illinois at Chicago (UIC)
>>>> Chicago, Illinois
>>>> US
>>>> Email: [email protected]
>>>> or [email protected]
>>
>> --
>> Xiang Huo
>> Department of Computer Science
>> University of Illinois at Chicago (UIC)
>> Chicago, Illinois
>> US
>> Email: [email protected]
>> or [email protected]

--
Xiang Huo
Department of Computer Science
University of Illinois at Chicago (UIC)
Chicago, Illinois
US
Email: [email protected]
or [email protected]
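P.S. While re-reading my foreachPartition writer quoted above, I noticed two likely problems: r.take(1) advances the iterator, so the first record of every partition is silently dropped, and the FileWriter is unbuffered. A sketch of a revised version (same data_partitions, outPath, and ParseDNSFast as above):

data_partitions.foreachPartition { iter =>
  if (iter.hasNext) {
    // Peek at the first key without consuming the record.
    val buffered = iter.buffered
    val filename = buffered.head._1
    val out = new java.io.BufferedWriter(new java.io.FileWriter(outPath + filename))
    try {
      buffered.foreach { case (_, record) =>
        out.write(new ParseDNSFast().antiConvert(record))
        out.newLine()
      }
    } finally {
      out.close()   // close even if a write fails
    }
  }
}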
