You can reduce the processing overhead by using a splittable compression
format for your Hive table data, something like 4mc
<https://github.com/carlomedas/4mc>. Alternatively, you can use Hadoop's
getmerge utility to merge the small files periodically.
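
For example, a periodic merge could look roughly like this (a sketch only:
the paths are made up, and getmerge does a raw byte concatenation, so it is
only safe for formats like plain text; each SequenceFile carries its own
header, so concatenating SequenceFiles byte-for-byte will not yield a valid
SequenceFile):

  # Pull one partition's small files down into a single local file,
  # then push the merged result back up to a staging path in HDFS.
  # All paths here are hypothetical.
  hadoop fs -getmerge /PATH/TO/TABLE/DIR/partition_column=2015-11-01 /tmp/merged_000
  hadoop fs -put /tmp/merged_000 /PATH/TO/STAGING/partition_column=2015-11-01/000000_0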

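For the SequenceFile tables, a safer sketch is to rewrite each partition
onto itself and let Hive's merge settings combine the small output files
(the table and column names below are placeholders, and this assumes a
non-bucketed partition, since merging would break the one-file-per-bucket
layout):

  -- Compact one partition by rewriting it in place.
  -- hive.merge.* are standard Hive settings; sizes are in bytes.
  SET hive.merge.mapfiles=true;
  SET hive.merge.mapredfiles=true;
  SET hive.merge.smallfiles.avgsize=134217728;   -- ~128 MB, matching the block size
  SET hive.merge.size.per.task=268435456;

  INSERT OVERWRITE TABLE my_table PARTITION (partition_column='2015-11-01')
  SELECT col1, col2
  FROM my_table
  WHERE partition_column = '2015-11-01';
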
Thanks,
Chetna Chaudhari

On 11 November 2015 at 10:56, reveen joe <impdocs2...@gmail.com> wrote:

> Hi,
>
> Most of our Hive tables are SequenceFile tables, and there are currently
> many small files ranging from *1-4 MB* under the partition directories
> (created by insert-overwrite). I am assuming this is due to two reasons:
>
> 1. Some of our tables are bucketed, so an individual file is created for
> each bucket of data in a given partition.
>
> 2. In places where we have set the number of reducers, each reducer
> produces one file.
>
> So, the directory structure looks like the one below.
>
> /PATH/TO/TABLE/DIR/partition_column=2015-11-01/000000_0
> /PATH/TO/TABLE/DIR/partition_column=2015-11-01/000001_0
> /PATH/TO/TABLE/DIR/partition_column=2015-11-01/000002_0
> /PATH/TO/TABLE/DIR/partition_column=2015-11-01/000003_0
> ...
> /PATH/TO/TABLE/DIR/partition_column=2015-11-01/000379_0
>
> The block size of the cluster is 128 MB. I know that a SequenceFile can
> store the file name as the key and the file content as the value, but in
> this case these are all independent files.
>
> Am I right that this adds overhead to any further processing of this
> data, since each file needs its own JVM spun up for a map task, and also
> because of the disk I/O overhead?
>
> If so, what would be the best way to combine the small files under a
> partition directory? Thank you.
>
>
