On Tue, Jan 14, 2014 at 5:00 PM, Archit Thakur <[email protected]> wrote:

> Hadoop block size decreased, do you mean HDFS block size? That is not
> possible.
>

Sorry for the terminology mix-up. In my question, 'hadoop block size' should
really read 'number of RDD partitions'.

I'm getting a large number of small files (named part-*), and I'd like to
get a smaller number of larger files.

I used something like:

val rdd1 = sc.parallelize(Range(0, N, 1)) // N ~ 1e3, so rdd1 has ~1e3 elements
val rdd2 = rdd1.cartesian(rdd1)           // all pairs: ~1e6 (Int, Int) elements

Is the number of part-* files determined by rdd2.partitions.length?
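For reference, here is roughly what I plan to check, based on my (possibly
wrong) understanding that saveAsSequenceFile() writes one part-* file per
partition and that cartesian() multiplies the partition counts of its inputs;
the output path below is just a placeholder:

// Sketch only: see whether the part-* file count tracks the partition count.
// (Assumes the usual SparkContext._ implicits are in scope, as in the shell.)
// My assumption: rdd2 should have rdd1.partitions.length squared partitions.
println(rdd1.partitions.length)
println(rdd2.partitions.length)
rdd2.saveAsSequenceFile("hdfs:///tmp/cartesian-test") // then count the part-* files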

Is there a way to keep the size of each part-* file roughly constant (e.g. 64
MB), regardless of other parameters such as the number of available cores and
scheduled tasks?
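
In case it helps clarify what I'm after, this is the kind of thing I'm
considering on my end (the target of 64 partitions is an arbitrary guess, not
something I've worked out to actually give 64 MB files):

// Sketch only: coalesce to a fixed, smaller number of partitions before
// saving, so each part-* file comes out larger. To really hit ~64 MB per
// file I suppose I'd have to estimate the total output size first and
// divide by 64 MB to pick the partition count.
val fewer = rdd2.coalesce(64) // merges partitions without a shuffle
fewer.saveAsSequenceFile("hdfs:///tmp/output") // placeholder path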


> The HDFS block size is never affected by your Spark jobs.
>
> "For a big number of tasks, I get a very high number of 1 MB files
> generated by saveAsSequenceFile()."
>
> What do you mean by "big number of tasks"?
>
> The number of files generated by saveAsSequenceFile() increases as the
> number of RDD partitions increases.
>
> Are you using a custom RDD? If yes, you would have overridden the
> getPartitions method - check that.
> If not, you might have used an operation where you specify a partitioner
> or the number of output partitions, e.g. groupByKey() - check that.
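
(Checking my understanding of this point: I'm not using a custom RDD, just
parallelize and cartesian as above. By specifying the number of output
partitions, do you mean something along these lines? The value 32 is an
arbitrary example on my part.)

// Sketch of what I take the suggestion to mean: pass an explicit partition
// count to a shuffle operation such as groupByKey.
val grouped = rdd2.groupByKey(32)  // explicitly request 32 output partitions
println(grouped.partitions.length) // should print 32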
>
> "How is it possible to control the block size by spark?" Do you mean "How
> is it possible to control the output partitions of an RDD?"
>
>
> On Tue, Jan 14, 2014 at 7:59 AM, Aureliano Buendia
> <[email protected]> wrote:
>
>> Hi,
>>
>> Does the output Hadoop block size depend on the number of Spark tasks?
>>
>> In my application, when the number of tasks increases, the Hadoop block
>> size decreases. For a big number of tasks, I get a very high number of 1 MB
>> files generated by saveAsSequenceFile().
>>
>> How is it possible to control the block size by spark?
>>
>
>
