The command

    val rdd1 = sc.parallelize(Range(0, N, 1)) // N ~ 1e3
    val rdd2 = rdd1.cartesian(rdd1)
will create a "CartesianRDD" which has no. of partitions = N^2 (N defined by your code). Q: Is the number of part-* files determined by rdd2.partitions.length? Yes. Q: Is there a way to keep the size of each part-* file a constant (eg 64 MB) regardless of other parameters, including number of available cores and scheduled tasks? Lets say your RDD has N MB of data. You can create your own CustmRDD and by overriding getPartitions and partitioner- you can create no .of partitions = N/64 and distribute the data equally (64MB). and perform rdd.operationToTransformItInto_CustmRDD. PS: There might be an operation/RDD that already does the same, I am not aware of it as of now. Please let me know, if you are able to figure it out. Thanks and Regards, Archit Thakur. On Tue, Jan 14, 2014 at 11:51 PM, Aureliano Buendia <[email protected]>wrote: > > > > On Tue, Jan 14, 2014 at 5:00 PM, Archit Thakur > <[email protected]>wrote: > >> Hadoop block size decreased, do you mean HDFS block size? That is not >> possible. >> > > Sorry for terminology mix up. In my question 'hadoop block size' should > probably be replaced by 'RDD partitions number'. > > I'm getting a large number of small files (named part-*), and I'd like to > get a smaller number of larger files. > > I used something like: > > val rdd1 = sc.parallelize(Range(0, N, 1)) // N ~ 1e3 > val rdd2 = rdd1.cartesian(rdd1) > > Is the number of part-* files determined by rdd2.partitions.length? > > Is there a way to keep the size of each part-* file a constant (eg 64 MB) > regardless of other parameters, including number of available cores and > scheduled tasks? > > >> Block size of HDFS is never affected by your spark jobs. >> >> "For a big number of tasks, I get a very high number of 1 MB files >> generated by saveAsSequenceFile()." >> >> What do you mean by "big number of tasks" >> >> No. of files generated by saveAsSequenceFile() increases if your >> partitions of RDD are increased. >> >> Are you using your custom RDD? If Yes, you would have overridden the >> method getPartitions - Check that. >> If not, you might have used an operation where you specify your >> partitioner or no. of output partitions, eg. groupByKey() - Check that. >> >> "How is it possible to control the block size by spark?" Do you mean "How >> is it possible to control the output partitions of an RDD?" >> >> >> On Tue, Jan 14, 2014 at 7:59 AM, Aureliano Buendia >> <[email protected]>wrote: >> >>> Hi, >>> >>> Does the output hadoop block size depend on spark tasks number? >>> >>> In my application, when the number of tasks increases, hadoop block size >>> decreases. For a big number of tasks, I get a very high number of 1 MB >>> files generated by saveAsSequenceFile(). >>> >>> How is it possible to control the block size by spark? >>> >> >> >

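Sketch 2 - aiming for roughly 64 MB part-* files with the built-in coalesce(),
not the custom getPartitions approach described above. It continues from the
rdd2 of the previous sketch; estimatedSizeMB and the output path are made-up
placeholders, since you have to estimate the serialized size of the data
yourself.

    // Hedged sketch using the built-in coalesce(), not a custom RDD.
    // estimatedSizeMB is a made-up estimate: Spark does not report the
    // serialized output size up front, so you have to supply one.
    val targetFileMB = 64
    val estimatedSizeMB = 2048
    val numPartitions = math.max(1, estimatedSizeMB / targetFileMB)

    // Without shuffle = true, coalesce can only lower the partition count;
    // repartition(n) is shorthand for coalesce(n, shuffle = true).
    val resized = rdd2.coalesce(numPartitions, shuffle = true)

    // One part-* file per partition, so each file comes out near 64 MB
    // only if the size estimate and the data spread are roughly right.
    resized.saveAsSequenceFile("hdfs:///tmp/resized-demo")  // placeholder path

How close the files land to 64 MB depends entirely on how good the size
estimate is and on how evenly the shuffle spreads the records; nothing here
enforces an exact file size.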