Is there any function that can be used to write each partition into a file directly? I think my method, creating a java.io.FileWriter for each partition, is not an efficient way to do this.
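For example, would saveAsHadoopFile with a custom MultipleTextOutputFormat be the right direction? Below is a minimal sketch of what I have in mind (assuming data_partitions is keyed by file name as in my snippet quoted below, that antiConvert returns a String, and that writing through the Hadoop output path is acceptable; KeyBasedOutput is just a name I made up):

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Route each record to an output file named after its key.
class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  // Use the key itself as the file name under the output directory.
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString
  // Drop the key so only the value appears in the file.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
}

data_partitions
  .map { case (filename, record) => (filename, new ParseDNSFast().antiConvert(record)) }
  .saveAsHadoopFile(outPath, classOf[String], classOf[String], classOf[KeyBasedOutput])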
Thanks.

2013/9/23 Reynold Xin <[email protected]>

> Part of it can be that the disk cannot keep up with your CPU, and the
> other is stragglers: some partitions might be bigger, etc.
>
> --
> Reynold Xin, AMPLab, UC Berkeley
> http://rxin.org
>
>
> On Mon, Sep 23, 2013 at 8:50 PM, Xiang Huo <[email protected]> wrote:
>
>> What I am doing is splitting a large RDD into several small ones and
>> writing each small one into a separate file. I use the Partitioner and
>> PairRDDFunctions classes to do this. What is very strange is that every
>> time I run my program, all CPUs run at full workload at the beginning,
>> but at the end only one CPU is working at full workload. I guess the
>> reason is the part that writes to disk, but I have no idea how to
>> improve it.
>>
>> This is the code I use to write each partition into a file:
>>
>> data_partitions.foreachPartition(r => {
>>   if (r.nonEmpty) {
>>     val filename = r.take(1).toArray.apply(0)._1
>>     println("Filename: " + filename)
>>     val outFile = new java.io.FileWriter(outPath + filename)
>>     r.map(record => new ParseDNSFast().antiConvert(record._2))
>>       .foreach(r => outFile.write(r + "\n"))
>>     outFile.close
>>   }
>> })
>>
>> Any help is appreciated!
>>
>> 2013/9/23 Reynold Xin <[email protected]>
>>
>>> It's probably just because your application is only using 2 threads.
>>> Spark should be allocating a large enough thread pool, but the RDDs
>>> you are operating on may have only 2 partitions, for example.
>>>
>>> To give it a try, run
>>>
>>> sc.parallelize(1 to 10, 20).mapPartitions { iter =>
>>>   Thread.sleep(10000000); iter }.count
>>>
>>> and see whether 20 tasks are launched.
>>>
>>> --
>>> Reynold Xin, AMPLab, UC Berkeley
>>> http://rxin.org
>>>
>>>
>>> On Sun, Sep 22, 2013 at 9:48 PM, Xiang Huo <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am trying to run a Spark program on a server. It is not a cluster,
>>>> just a single server. I want to configure my Spark program to use at
>>>> most 20 CPUs, because the machine is also shared with other users.
>>>>
>>>> I know I can set local[K] as the master URL to limit how many worker
>>>> threads the program uses. But when I run my program, only one or two
>>>> CPUs are used, and the program takes a long time to run with so few
>>>> CPUs.
>>>>
>>>> Has anyone met a similar situation, or does anyone have a suggestion?
>>>>
>>>> Thanks.
>>>>
>>>> Xiang
>>>> --
>>>> Xiang Huo
>>>> Department of Computer Science
>>>> University of Illinois at Chicago (UIC)
>>>> Chicago, Illinois
>>>> US
>>>> Email: [email protected]
>>>> or [email protected]
>>
>> --
>> Xiang Huo
>> Department of Computer Science
>> University of Illinois at Chicago (UIC)
>> Chicago, Illinois
>> US
>> Email: [email protected]
>> or [email protected]

--
Xiang Huo
Department of Computer Science
University of Illinois at Chicago (UIC)
Chicago, Illinois
US
Email: [email protected]
or [email protected]
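P.S. While re-reading my foreachPartition writer quoted above, I noticed two likely problems: r.take(1) advances the iterator, so the first record of every partition is silently dropped, and the FileWriter is unbuffered. A sketch of a revised version (same data_partitions, outPath, and ParseDNSFast as above):

data_partitions.foreachPartition { iter =>
  if (iter.hasNext) {
    // Peek at the first key without consuming the record.
    val buffered = iter.buffered
    val filename = buffered.head._1
    val out = new java.io.BufferedWriter(new java.io.FileWriter(outPath + filename))
    try {
      buffered.foreach { case (_, record) =>
        out.write(new ParseDNSFast().antiConvert(record))
        out.newLine()
      }
    } finally {
      out.close()   // close even if a write fails
    }
  }
}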
