On Wed, Jun 1, 2011 at 1:12 PM, Igor Tatarinov <[email protected]> wrote:
> Can you pre-aggregate your historical data to reduce the number of files?
>
> We used to partition our data by date, but that created too many output
> files, so now we partition by month.
>
> I do find it odd that Hive (0.6) can't merge compressed output files. We
> could have gotten away with daily partitioning if Hive could merge small
> files. I tried disabling compression, but it actually caused some execution
> problems (perhaps xcievers-related, I am not sure).
>
> On Wed, Jun 1, 2011 at 12:38 AM, Junxian Yan <[email protected]> wrote:
>
>> Today I tried CombineHiveInputFormat and set the max split size for hadoop
>> input. It seems I can get the expected number of map tasks. But another
>> problem is that CPU usage by the map tasks is very high, almost 100%.
>>
>> I just ran a query with a simple WHERE condition over test files whose
>> total size is about 30M, spread across about 10 thousand small files. The
>> execution time was over 700s. It's killing us. Because the files are
>> generated by Flume, all of them are seq files.
>>
>> R
>>
>> On Tue, May 31, 2011 at 2:55 AM, Junxian Yan <[email protected]> wrote:
>>
>>> Hi Guys
>>>
>>> I use Flume to store log files and Hive to query them.
>>>
>>> Flume always stores small files with the suffix .seq. Now I have over 35
>>> thousand seq files. Every time I launch a query script, 35 thousand map
>>> tasks are created, and it takes a very long time for them to complete.
>>>
>>> I also tried setting CombineHiveInputFormat, but with that option the
>>> task seems to execute slowly. The total size of the data folder is over
>>> 700M, and in my testing env I only have 3 data nodes. I also tried adding
>>> mapred.map.tasks=5 after the CombineHiveInputFormat setting, but it
>>> doesn't seem to work: there is always only one map task when
>>> CombineHiveInputFormat is set.
>>>
>>> Can you please show me a solution in which I can set the map task number
>>> freely?
>>>
>>> BTW: the Hadoop version is 0.20 and Hive is 0.5.
>>>
>>> Richard

We have open sourced our filecrusher/optimizer; your post reminded me to
throw our new V2 version over the open source fence.

http://www.jointhegrid.com/hadoop_filecrush/index.jsp

I know many are looking for an in-hive solution, but file crusher does the
job for us.

Edward
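
[For reference, the combine setup Junxian describes would look roughly like
the following in a Hive session. This is a sketch: the split-size values are
illustrative, and note that mapred.map.tasks is only a hint that the combine
input format ignores; it is the split-size properties that actually bound the
number of map tasks.]

    set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
    -- Cap the bytes per combined split. With ~700M of input, a ~150M cap
    -- should yield roughly 5 map tasks (illustrative value, not a tuning
    -- recommendation).
    set mapred.max.split.size=157286400;
    -- Knobs honored when grouping blocks into combined splits:
    set mapred.min.split.size.per.node=157286400;
    set mapred.min.split.size.per.rack=157286400;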
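
[Igor's pre-aggregation suggestion, as a HiveQL sketch; the table and column
names here are hypothetical.]

    -- Monthly-partitioned summary table: one partition per month instead of
    -- one per day keeps the partition and file count down.
    CREATE TABLE logs_monthly (host STRING, hits BIGINT)
    PARTITIONED BY (month STRING);

    -- Roll a month of raw daily data up into the summary table.
    INSERT OVERWRITE TABLE logs_monthly PARTITION (month='2011-05')
    SELECT host, COUNT(1)
    FROM logs_raw
    WHERE dt >= '2011-05-01' AND dt <= '2011-05-31'
    GROUP BY host;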
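
[On the small-file merging Igor mentions: these are the Hive settings that
control output merging when it does apply; per his report, it does not kick
in for compressed output in 0.6. The target size below is illustrative.]

    set hive.merge.mapfiles=true;      -- merge small files from map-only jobs
    set hive.merge.mapredfiles=true;   -- merge small files from map-reduce jobs
    set hive.merge.size.per.task=256000000;  -- target size of merged output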
