Hi guys, I use Flume to store log files and Hive to query them.
Flume always stores small files with the suffix .seq, and by now I have over 35 thousand seq files. Every time I launch my query script, 35 thousand map tasks are created, and it takes a very long time to complete.

I also tried setting CombineHiveInputFormat, but with that option the query runs slowly, because the total size of the data folder is over 700 MB and in my testing environment I only have 3 data nodes. I also tried adding mapred.map.tasks=5 after the CombineHiveInputFormat setting, but it doesn't seem to work: there is always only one map task when CombineHiveInputFormat is set.

Can you please show me a solution that lets me set the number of map tasks freely?

BTW: the Hadoop version is 0.20 and Hive is 0.5.

Richard
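For reference, this is roughly what I tried in my script (the table name here is just a placeholder, not my real table):

```sql
-- Combine the many small .seq files into fewer input splits
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- Attempted to force more map tasks, but this seems to have no effect
-- once CombineHiveInputFormat is active:
SET mapred.map.tasks=5;

-- placeholder query over the Flume log table
SELECT COUNT(*) FROM flume_logs;
```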
