Part of it can be that the disk cannot keep up with your CPUs, and the other part is stragglers: some partitions might be bigger than others, etc.
--
Reynold Xin, AMPLab, UC Berkeley
http://rxin.org


On Mon, Sep 23, 2013 at 8:50 PM, Xiang Huo <[email protected]> wrote:

> What I am doing is splitting a large RDD into several small ones and
> writing each small one into a separate file. I use the Partitioner and
> PairRDDFunctions classes to do this. What is very strange is that every
> time I run my program, all CPUs run at full load at the beginning, but at
> the end only one CPU is working at full load. I guess the reason is the
> part that writes to disk, but I have no idea how to improve it.
>
> This is the code I use to write each partition into a file:
>
>     data_partitions.foreachPartition(r => {
>       if (r.nonEmpty) {
>         val filename = r.take(1).toArray.apply(0)._1
>         println("Filename: " + filename)
>         val outFile = new java.io.FileWriter(outPath + filename)
>         r.map(record => new ParseDNSFast().antiConvert(record._2))
>          .foreach(r => outFile.write(r + "\n"))
>         outFile.close
>       }
>     })
>
> Any help is appreciated!
>
>
> 2013/9/23 Reynold Xin <[email protected]>
>
>> It's probably just because your application is only using 2 threads.
>> Spark should be allocating a large enough thread pool, but the RDDs you
>> are operating on may have only 2 partitions, for example.
>>
>> To give it a try, do
>>
>>     sc.parallelize(1 to 10, 20).mapPartitions { iter =>
>>       Thread.sleep(10000000); iter }.count
>>
>> and see if you have 20 tasks being launched.
>>
>>
>> --
>> Reynold Xin, AMPLab, UC Berkeley
>> http://rxin.org
>>
>>
>> On Sun, Sep 22, 2013 at 9:48 PM, Xiang Huo <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I am trying to run a Spark program on a server. It is not a cluster,
>>> just a single server. I want to configure my Spark program to use at
>>> most 20 CPUs, because this machine is also shared by other users.
>>>
>>> I know I can set local[K] as the Master URL to limit how many worker
>>> threads the program uses. But after I run my program, only one or two
>>> CPUs are used, and the program runs for a long time when only one or
>>> two CPUs are used.
>>>
>>> Has anyone met a similar situation, or does anyone have a suggestion?
>>>
>>> Thanks.
>>>
>>> Xiang
>>> --
>>> Xiang Huo
>>> Department of Computer Science
>>> University of Illinois at Chicago (UIC)
>>> Chicago, Illinois
>>> US
>>> Email: [email protected]
>>> or [email protected]
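[Note for readers of this archived thread] The per-partition writer quoted above has two fragile spots: in Scala, `take(1)` consumes from the iterator, so reusing `r` afterwards is unreliable, and the `FileWriter` is never closed if a write throws. A minimal sketch of a safer helper, written in plain Scala so it runs without a Spark cluster; the names `writePartition` and the `(filename, line)` pair shape are assumptions for illustration, and the original's `ParseDNSFast().antiConvert` step is assumed to have been applied before calling it:

```scala
import java.io.{File, FileWriter}

object PartitionWriter {
  // Writes one partition's records to a single file named after the first
  // record's key. Assumes every record in the partition shares the same key,
  // as in the Partitioner-based setup described in the thread.
  def writePartition(records: Iterator[(String, String)], outPath: String): Unit = {
    if (records.hasNext) {
      val buffered = records.buffered
      val filename = buffered.head._1 // peek at the key without consuming it
      val outFile = new FileWriter(new File(outPath, filename))
      try {
        buffered.foreach { case (_, line) => outFile.write(line + "\n") }
      } finally {
        outFile.close() // close even if a write throws
      }
    }
  }
}
```

Inside Spark this would be called from `foreachPartition`, e.g. `data_partitions.foreachPartition(it => PartitionWriter.writePartition(it.map(rec => (rec._1, format(rec._2))), outPath))`, where `format` stands in for whatever record-to-string conversion the job needs.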
