You can reduce the processing overhead by using a splittable compression format for your Hive table data, such as 4mc <https://github.com/carlomedas/4mc>. Alternatively, you can use Hadoop's getmerge utility to merge the small files periodically.
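For the getmerge route, a minimal sketch might look like the following (it reuses the partition path from the message below; the local staging path is hypothetical). Note that getmerge does a raw byte concatenation of every file in the source directory into one local file, so for SequenceFile data you would want to verify that the merged output is still readable by the table's input format before removing the small originals:

    # Concatenate every file in one partition directory into a single local file.
    # getmerge copies bytes as-is; it does not rewrite SequenceFile headers.
    hadoop fs -getmerge \
        /PATH/TO/TABLE/DIR/partition_column=2015-11-01 \
        /tmp/partition_2015-11-01.merged

    # After checking that the merged file is readable, upload it back to the
    # partition directory and only then delete the small files it replaces.
    hadoop fs -put /tmp/partition_2015-11-01.merged \
        /PATH/TO/TABLE/DIR/partition_column=2015-11-01/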
Thanks,
Chetna Chaudhari

On 11 November 2015 at 10:56, reveen joe <impdocs2...@gmail.com> wrote:
> Hi,
>
> Most of our Hive tables are SequenceFile tables, and there are currently
> many small files ranging from *1-4 MB* under the partition directories
> (created by insert-overwrite). I am assuming this is due to two reasons:
>
> 1. Some of our tables are bucketed, so an individual file is created for
> each bucket of data in a given partition.
>
> 2. The places where we have set the number of reducers produce one file
> per reducer.
>
> So the directory structure looks like the one below.
>
> /PATH/TO/TABLE/DIR/partition_column=2015-11-01/000000_0
> /PATH/TO/TABLE/DIR/partition_column=2015-11-01/000001_0
> /PATH/TO/TABLE/DIR/partition_column=2015-11-01/000002_0
> /PATH/TO/TABLE/DIR/partition_column=2015-11-01/000003_0
> ............................................................................................
> /PATH/TO/TABLE/DIR/partition_column=2015-11-01/000379_0
>
> The block size of the cluster is 128 MB. I know that a SequenceFile can
> store the file name as key and the file content as value, but in this
> case they are independent files.
>
> Am I right that this would add overhead to further processing of this
> data, because each file would need to spin up a JVM to start a map task
> against that file, and also because of the disk I/O overhead?
>
> If so, what could be the best remedy to combine the small files under a
> partition directory? Thank you.