On Tue, Jan 14, 2014 at 5:00 PM, Archit Thakur <[email protected]> wrote:
> Hadoop block size decreased, do you mean HDFS block size? That is not
> possible.

Sorry for the terminology mix-up. In my question, 'hadoop block size' should
probably be replaced by 'number of RDD partitions'. I'm getting a large
number of small files (named part-*), and I'd like to get a smaller number
of larger files. I used something like:

val rdd1 = sc.parallelize(Range(0, N, 1)) // N ~ 1e3
val rdd2 = rdd1.cartesian(rdd1)

Is the number of part-* files determined by rdd2.partitions.length? Is there
a way to keep the size of each part-* file constant (e.g. 64 MB) regardless
of other parameters, including the number of available cores and scheduled
tasks?

> Block size of HDFS is never affected by your Spark jobs.
>
> "For a big number of tasks, I get a very high number of 1 MB files
> generated by saveAsSequenceFile()."
>
> What do you mean by "big number of tasks"?
>
> The number of files generated by saveAsSequenceFile() increases if the
> number of partitions of your RDD is increased.
>
> Are you using a custom RDD? If yes, you would have overridden the method
> getPartitions - check that.
> If not, you might have used an operation where you specify your
> partitioner or the number of output partitions, e.g. groupByKey() - check
> that.
>
> "How is it possible to control the block size by spark?" Do you mean "How
> is it possible to control the output partitions of an RDD?"
>
>
> On Tue, Jan 14, 2014 at 7:59 AM, Aureliano Buendia
> <[email protected]> wrote:
>
>> Hi,
>>
>> Does the output Hadoop block size depend on the number of Spark tasks?
>>
>> In my application, when the number of tasks increases, the Hadoop block
>> size decreases. For a big number of tasks, I get a very high number of
>> 1 MB files generated by saveAsSequenceFile().
>>
>> How is it possible to control the block size by spark?
>>
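For reference, a minimal self-contained sketch (not part of the original
thread) of the usual pattern: saveAsSequenceFile() writes one part-* file per
partition, so rdd2.partitions.length is the file count, and coalescing to a
smaller partition count before saving yields fewer, larger files. The master
URL, application name, output path, and target of 16 files below are made-up
example values; to aim for roughly 64 MB per file you would pick a target
near (estimated total output bytes) / (64 MB).

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // implicit conversions used by saveAsSequenceFile on pair RDDs

object CoalesceBeforeSave {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[4]", "coalesce-before-save") // example master / app name

    val N = 1000
    val rdd1 = sc.parallelize(Range(0, N, 1))
    val rdd2 = rdd1.cartesian(rdd1) // partition count = rdd1.partitions.length squared

    // Each partition becomes one part-* file when the RDD is saved.
    println("partitions (and therefore part-* files): " + rdd2.partitions.length)

    // Merge partitions before saving so fewer, larger files are written.
    val numOutputFiles = 16 // example value, not derived from the thread
    rdd2.coalesce(numOutputFiles).saveAsSequenceFile("/tmp/rdd2-seq") // example output path

    sc.stop()
  }
}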
