I think merging the files afterwards is the right approach. Setting hive.merge.mapredfiles to true worked for me. The query will still generate many (e.g. 32) files, but Hive then runs a second job that merges them. Also, in my queries, I have the TRANSFORM and USING clauses after INSERT OVERWRITE. I don't know whether that makes a difference. Something like this (untested):

FROM (
  FROM src
  SELECT key, value
) tmap
INSERT OVERWRITE TABLE dest1
SELECT TRANSFORM(key, value)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.TypedBytesSerDe'
  USING '/bin/cat'
  AS (tkey, tvalue)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.TypedBytesSerDe'
  RECORDREADER 'org.apache.hadoop.hive.ql.exec.TypedBytesRecordReader'

On Mon, Aug 15, 2011 at 12:10 PM, Loren Siebert <lo...@siebert.org> wrote:

> I'm running into an issue with Hive's TRANSFORM where the output always
> gets split among 32 files. Somebody else also ran into a similar issue and
> we posted on the CDH group last week (http://bit.ly/nR4tyg), but I'm
> mentioning it here as it's Hive-specific.
>
> I'm doing something structurally identical to this sample query from the
> Hive manual
> (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform):
>
> FROM (
>   FROM src
>   SELECT TRANSFORM(src.key, src.value)
>     ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.TypedBytesSerDe'
>     USING '/bin/cat'
>     AS (tkey, tvalue)
>     ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.TypedBytesSerDe'
>     RECORDREADER 'org.apache.hadoop.hive.ql.exec.TypedBytesRecordReader'
> ) tmap
> INSERT OVERWRITE TABLE dest1 SELECT tkey, tvalue
>
> In my case, 32 reducers are launched, and dest1 always ends up with 32
> files. If I set hive.exec.reducers.max=1, it does launch only 1 reducer
> (instead of 32), but I still get 32 teeny output files. Setting the
> various "hive.merge.*" options does not seem to have any effect.
>
> Is there something else I should be doing to get the output to be in one
> large file instead of 32 small ones?

--
Dave Brondsema
Lead Software Engineer - sf.net
Geeknet
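
For reference, here is a minimal sketch of the session settings behind the merge approach described above. Only hive.merge.mapredfiles is the setting named in this thread; the other two properties and the value shown are assumptions added for illustration, so check them against the defaults of your Hive version before relying on them.

```sql
-- Merge small output files at the end of a map-reduce job
-- (the setting named in this thread).
SET hive.merge.mapredfiles=true;

-- Assumption: also merge the outputs of map-only jobs.
SET hive.merge.mapfiles=true;

-- Assumption: target size, in bytes, of each merged file (illustrative value).
SET hive.merge.size.per.task=256000000;
```

These would be issued in the same session before running the INSERT OVERWRITE query, so that the follow-up merge job is scheduled for it.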