Hadoop block size decreased? Do you mean the HDFS block size? That is not possible: the HDFS block size is never affected by your Spark jobs.

"For a big number of tasks, I get a very high number of 1 MB files generated by saveAsSequenceFile()." What do you mean by "big number of tasks"? The number of files generated by saveAsSequenceFile() grows with the number of partitions of the RDD: one output file is written per partition.

Are you using a custom RDD? If yes, you will have overridden the method getPartitions - check what it returns. If not, you might have used an operation where you specify a partitioner or a number of output partitions, e.g. groupByKey() - check that.

"How is it possible to control the block size by spark?" Do you mean "How is it possible to control the output partitions of an RDD?"
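If that is the question, here is a minimal sketch of the two usual knobs: passing an explicit partition count to the shuffle operation, or coalescing just before writing. The local master, the counts, and the output path are made up for illustration:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._ // implicits that add saveAsSequenceFile to pair RDDs

    object OutputPartitions {
      def main(args: Array[String]) {
        val sc = new SparkContext("local[4]", "output-partitions")

        // 200 partitions here means saveAsSequenceFile() would write 200 part files.
        val pairs = sc.parallelize(1 to 100000, 200).map(i => (i, i.toString))

        // Knob 1: give the shuffle operation an explicit partition count.
        val grouped = pairs.groupByKey(16)
        println("grouped has " + grouped.partitions.size + " partitions") // 16

        // Knob 2: shrink the partition count just before writing;
        // coalesce() avoids a full shuffle when reducing partitions.
        pairs.coalesce(16).saveAsSequenceFile("/tmp/seq-output") // 16 output files

        sc.stop()
      }
    }

If you ever need to increase the partition count instead, coalesce(n, shuffle = true) will do a full shuffle.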
On Tue, Jan 14, 2014 at 7:59 AM, Aureliano Buendia <[email protected]> wrote:

> Hi,
>
> Does the output hadoop block size depend on spark tasks number?
>
> In my application, when the number of tasks increases, hadoop block size
> decreases. For a big number of tasks, I get a very high number of 1 MB
> files generated by saveAsSequenceFile().
>
> How is it possible to control the block size by spark?
>
