On Wed, Jun 1, 2011 at 1:12 PM, Igor Tatarinov <[email protected]> wrote:
> Can you pre-aggregate your historical data to reduce the number of files?
>
> We used to partition our data by date, but that created too many output
> files, so now we partition by month.
>
> I do find it odd that Hive (0.6) can't merge compressed output files. We
> could have gotten away with daily partitioning if Hive could merge small
> files. I tried disabling compression, but it actually caused some execution
> problems (perhaps xcievers-related, I am not sure).
>
> On Wed, Jun 1, 2011 at 12:38 AM, Junxian Yan <[email protected]> wrote:
>
>> Today I tried CombineHiveInputFormat and set the max split size for hadoop
>> input. It seems I can get the expected number of map tasks. But another
>> problem is that CPU usage by the map tasks is very high, almost 100%.
>>
>> I just ran a query with a simple WHERE condition over test files whose
>> total size is about 30M, spread across about 10 thousand small files. The
>> execution time was over 700s. It's killing us. Because the files are
>> generated by Flume, all of them are seq files.
>>
>> R
>>
>> On Tue, May 31, 2011 at 2:55 AM, Junxian Yan <[email protected]> wrote:
>>
>>> Hi Guys
>>>
>>> I use Flume to store log files and Hive to query them.
>>>
>>> Flume always stores small files with the suffix .seq. Now I have over 35
>>> thousand seq files. Every time I launch a query script, 35 thousand map
>>> tasks are created, and it takes a very long time for them to complete.
>>>
>>> I also tried setting CombineHiveInputFormat, but with that option the
>>> task seems to execute slowly. The total size of the data folder is over
>>> 700M, and in my testing env I only have 3 data nodes. I also tried adding
>>> mapred.map.tasks=5 after the CombineHiveInputFormat setting, but it
>>> doesn't seem to work: there is always only one map task when
>>> CombineHiveInputFormat is set.
>>>
>>> Can you please show me a solution in which I can set the map task number
>>> freely?
>>>
>>> BTW: the Hadoop version is 0.20 and Hive is 0.5.
>>>
>>> Richard

We have open sourced our filecrusher/optimizer; your post reminded me to
throw our new V2 version over the open source fence.

http://www.jointhegrid.com/hadoop_filecrush/index.jsp

I know many are looking for an in-hive solution, but file crusher does the
job for us.

Edward
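
[For reference, the combine setup Junxian describes would look roughly like
the following in a Hive session. This is a sketch: the split-size values are
illustrative, and note that mapred.map.tasks is only a hint that the combine
input format ignores; it is the split-size properties that actually bound the
number of map tasks.]

    set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
    -- Cap the bytes per combined split. With ~700M of input, a ~150M cap
    -- should yield roughly 5 map tasks (illustrative value, not a tuning
    -- recommendation).
    set mapred.max.split.size=157286400;
    -- Knobs honored when grouping blocks into combined splits:
    set mapred.min.split.size.per.node=157286400;
    set mapred.min.split.size.per.rack=157286400;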
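
[Igor's pre-aggregation suggestion, as a HiveQL sketch; the table and column
names here are hypothetical.]

    -- Monthly-partitioned summary table: one partition per month instead of
    -- one per day keeps the partition and file count down.
    CREATE TABLE logs_monthly (host STRING, hits BIGINT)
    PARTITIONED BY (month STRING);

    -- Roll a month of raw daily data up into the summary table.
    INSERT OVERWRITE TABLE logs_monthly PARTITION (month='2011-05')
    SELECT host, COUNT(1)
    FROM logs_raw
    WHERE dt >= '2011-05-01' AND dt <= '2011-05-31'
    GROUP BY host;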
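
[On the small-file merging Igor mentions: these are the Hive settings that
control output merging when it does apply; per his report, it does not kick
in for compressed output in 0.6. The target size below is illustrative.]

    set hive.merge.mapfiles=true;      -- merge small files from map-only jobs
    set hive.merge.mapredfiles=true;   -- merge small files from map-reduce jobs
    set hive.merge.size.per.task=256000000;  -- target size of merged output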
