There's nothing intrinsically wrong with a large output file that's in a splittable format such as Avro. Are your downstream queries too slow? Are you using any kind of compression?
Within an Avro file there are blocks of Avro objects. Each block can be compressed, and splits can occur only on a block boundary. I haven't found out how to set those block sizes from within Hive; we've never needed to. Generally speaking, you will get one file per reducer. To get more reducers, define bucketing on your table and tune the number of buckets until the files come out at the size you want. For the bucket column, pick a high-cardinality column that you are likely to join on (a sketch of what that could look like is at the end of this message). Let us know how it turns out.

- Douglas

From: Slava Markeyev <slava.marke...@upsight.com>
Reply-To: <user@hive.apache.org>
Date: Fri, 9 Jan 2015 17:04:08 -0800
To: <user@hive.apache.org>
Subject: Re: Hive Insert overwrite creating a single file with large block size

You can control the HDFS block size by setting dfs.block.size. However, I think you might be asking how to control the size and number of files generated on insert. Is that correct? (Some session settings along those lines are sketched after the quoted question below.)

On Fri, Jan 9, 2015 at 4:41 PM, Buntu Dev <buntu...@gmail.com> wrote:

I've got a bunch of small Avro files (<5 MB) and a table defined against those files. I created a new table and did an INSERT OVERWRITE selecting from the existing table, but did not find any option to specify the file block size. It currently creates a single file per partition. How do I specify the output block size during the INSERT OVERWRITE? Thanks!

--
Slava Markeyev | Engineering | Upsight
http://www.linkedin.com/in/slavamarkeyev
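Regarding the bucketing suggestion above, here is a minimal sketch. The table names (events_bucketed, events_small_files), column names, bucket count, and partition column are all hypothetical illustrations, not from the thread; STORED AS AVRO assumes Hive 0.14+ (older versions would spell out the Avro SerDe):

    -- Hypothetical bucketed target table; the bucket column should be a
    -- high-cardinality column you are likely to join on.
    -- Tune the bucket count to hit your target file size.
    CREATE TABLE events_bucketed (
      user_id  BIGINT,
      event_ts STRING,
      payload  STRING
    )
    PARTITIONED BY (dt STRING)
    CLUSTERED BY (user_id) INTO 16 BUCKETS
    STORED AS AVRO;

    -- Make Hive honor the bucket definition on insert (needed on older releases)
    SET hive.enforce.bucketing = true;
    -- Allow dynamic partitioning on the dt column used below
    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;

    -- Each bucket is written by its own reducer, so roughly 16 files per partition
    INSERT OVERWRITE TABLE events_bucketed PARTITION (dt)
    SELECT user_id, event_ts, payload, dt
    FROM events_small_files;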
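On Slava's point about dfs.block.size versus the number and size of files produced by the insert, the following is a sketch of the usual levers, again with hypothetical table and column names (events_compacted, events_small_files, user_id). The property spellings are the ones common around the time of this thread; newer Hadoop releases prefer dfs.blocksize:

    -- HDFS block size for files written by this session (128 MB here)
    SET dfs.block.size = 134217728;
    -- Roughly how much data each reducer, and thus each output file, should handle (~256 MB)
    SET hive.exec.reducers.bytes.per.reducer = 268435456;

    -- A straight SELECT insert can run map-only and ignore the reducer setting;
    -- DISTRIBUTE BY forces a shuffle so rows are spread across reducers,
    -- giving several output files per partition instead of a single one.
    INSERT OVERWRITE TABLE events_compacted PARTITION (dt)
    SELECT user_id, event_ts, payload, dt
    FROM events_small_files
    DISTRIBUTE BY user_id;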