Hadoop block size decreased? Do you mean the HDFS block size? That is not possible: the HDFS block size is never affected by your Spark jobs.

"For a big number of tasks, I get a very high number of 1 MB files generated by saveAsSequenceFile()." What do you mean by "big number of tasks"? The number of files generated by saveAsSequenceFile() grows with the number of partitions of the RDD: one output file is written per partition.

Are you using a custom RDD? If yes, you will have overridden the method getPartitions - check what it returns. If not, you might have used an operation where you specify a partitioner or a number of output partitions, e.g. groupByKey() - check that.

"How is it possible to control the block size by spark?" Do you mean "How is it possible to control the output partitions of an RDD?"
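If that is the question, here is a minimal sketch of the two usual knobs: passing an explicit partition count to the shuffle operation, or coalescing just before writing. The local master, the counts, and the output path are made up for illustration:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._ // implicits that add saveAsSequenceFile to pair RDDs

    object OutputPartitions {
      def main(args: Array[String]) {
        val sc = new SparkContext("local[4]", "output-partitions")

        // 200 partitions here means saveAsSequenceFile() would write 200 part files.
        val pairs = sc.parallelize(1 to 100000, 200).map(i => (i, i.toString))

        // Knob 1: give the shuffle operation an explicit partition count.
        val grouped = pairs.groupByKey(16)
        println("grouped has " + grouped.partitions.size + " partitions") // 16

        // Knob 2: shrink the partition count just before writing;
        // coalesce() avoids a full shuffle when reducing partitions.
        pairs.coalesce(16).saveAsSequenceFile("/tmp/seq-output") // 16 output files

        sc.stop()
      }
    }

If you ever need to increase the partition count instead, coalesce(n, shuffle = true) will do a full shuffle.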
On Tue, Jan 14, 2014 at 7:59 AM, Aureliano Buendia <[email protected]> wrote:

> Hi,
>
> Does the output hadoop block size depend on spark tasks number?
>
> In my application, when the number of tasks increases, hadoop block size
> decreases. For a big number of tasks, I get a very high number of 1 MB
> files generated by saveAsSequenceFile().
>
> How is it possible to control the block size by spark?
>
