There's nothing intrinsically wrong with a large output file that's in a 
splittable format such as Avro. Are your downstream queries too slow?
Are you using some kind of compression?

Within an Avro file there are blocks of Avro objects. Each block can be 
compressed, and splits can occur only on a block boundary.
I haven't found out how to set those block sizes from within Hive; we've never 
had to (from Hive).
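
If compression isn't already on, the Avro writer does honor the usual Hive 
output-compression settings. A minimal sketch (Snappy is just an example codec; 
deflate works too):

  -- Enable compressed Avro blocks for INSERTs in this session.
  SET hive.exec.compress.output=true;
  SET avro.output.codec=snappy;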

Generally speaking, you will get one file per reducer. To get more reducers, 
you can define bucketing on your table and tune the number of buckets to get 
files of the size you want.
For your bucket column, pick a high-cardinality column that you will likely 
join on.
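
Roughly something like this (table, column names, and the bucket count are 
placeholders, not a recommendation):

  -- Have Hive plan one reducer per bucket when loading the table.
  SET hive.enforce.bucketing=true;

  -- STORED AS AVRO needs Hive 0.14+; older versions use the AvroSerDe row format.
  CREATE TABLE events_bucketed (
    user_id    STRING,
    event_time BIGINT
  )
  CLUSTERED BY (user_id) INTO 32 BUCKETS
  STORED AS AVRO;

  -- Each of the 32 reducers writes one file.
  INSERT OVERWRITE TABLE events_bucketed
  SELECT user_id, event_time FROM events_small_files;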

Let us know how it turns out.

- Douglas

From: Slava Markeyev <slava.marke...@upsight.com>
Reply-To: <user@hive.apache.org>
Date: Fri, 9 Jan 2015 17:04:08 -0800
To: <user@hive.apache.org>
Subject: Re: Hive Insert overwrite creating a single file with large block size

You can control block size by setting dfs.block.size. However, I think you 
might be asking how to control the size of and number of files generated on 
insert. Is that correct?
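
For the HDFS block size, something along these lines from within a Hive session 
should work (the size here is arbitrary):

  -- Request a 256 MB HDFS block size for files written by this session.
  -- dfs.block.size is the older property name; newer Hadoop also accepts dfs.blocksize.
  SET dfs.block.size=268435456;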

On Fri, Jan 9, 2015 at 4:41 PM, Buntu Dev <buntu...@gmail.com> wrote:
I've got a bunch of small Avro files (<5 MB) and a table defined over those files. 
I created a new table and did an 'INSERT OVERWRITE' selecting from the existing 
table, but did not find any option to provide the file block size. It currently 
creates a single file per partition.
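
For context, the statements are roughly of this form (table and column names are 
placeholders):

  CREATE TABLE events_compacted (user_id STRING, event_time BIGINT)
  PARTITIONED BY (dt STRING)
  STORED AS AVRO;

  -- Dynamic partitioning so each dt value lands in its own partition.
  SET hive.exec.dynamic.partition=true;
  SET hive.exec.dynamic.partition.mode=nonstrict;

  INSERT OVERWRITE TABLE events_compacted PARTITION (dt)
  SELECT user_id, event_time, dt FROM events_small_files;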

How do I specify the output block size during the 'INSERT OVERWRITE'?

Thanks!



--

Slava Markeyev | Engineering | Upsight

<http://www.linkedin.com/in/slavamarkeyev>
