Hello everyone,

Thanks for sharing these valuable inputs. I am working on a similar kind of task; it would be really helpful if you could share the command for increasing the heap size of the hive-cli/launching process.
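The closest I have pieced together so far is the sketch below, based on the stock bin/hive wrapper, so please correct me if it is off (HADOOP_HEAPSIZE is in MB; 2048 and -Xmx2g are placeholder values to tune for your partition count):

    # Heap for the hive-cli/launching JVM, picked up by bin/hive
    # via hadoop-env.sh / hive-env.sh
    export HADOOP_HEAPSIZE=2048

    # Or pass JVM options to the client process directly
    export HADOOP_CLIENT_OPTS="-Xmx2g"

    bin/hive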
Thanks,
Saurabh

Sent from my iPhone; please excuse typos.

> On 18-Jul-2014, at 8:23 pm, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
> Unleash ze file crusha!
>
> https://github.com/edwardcapriolo/filecrush
>
>> On Fri, Jul 18, 2014 at 10:51 AM, diogo <di...@uken.com> wrote:
>>
>> Sweet, great answers, thanks.
>>
>> Indeed, I have a small number of partitions, but lots of small files, ~20 MB
>> each. I'll make sure to combine them. Also, increasing the heap size of the
>> CLI process has already helped speed it up.
>>
>> Thanks again.
>>
>>> On Fri, Jul 18, 2014 at 10:26 AM, Edward Capriolo <edlinuxg...@gmail.com>
>>> wrote:
>>>
>>> The planning phase needs to do work for every Hive partition and every
>>> Hadoop file. If you have a lot of 'small' files or many partitions, this
>>> can take a long time. The planning phase that happens on the JobTracker is
>>> also single-threaded, and the new YARN stack adds back-and-forth to
>>> allocate containers.
>>>
>>> Sometimes raising the heap for the hive-cli/launching process helps,
>>> because the default heap of 1 GB may not leave much room for all of the
>>> partition information, and the extra headroom makes this go faster.
>>> Sometimes setting the min split size higher launches fewer map tasks,
>>> which speeds everything up.
>>>
>>> So the answer: try to tune everything. Start Hive like this:
>>>
>>> bin/hive -hiveconf hive.root.logger=DEBUG,console
>>>
>>> and record where the longest stretches with no output are; that is what
>>> you should try to tune first.
>>>
>>>> On Fri, Jul 18, 2014 at 9:36 AM, diogo <di...@uken.com> wrote:
>>>>
>>>> This is probably a simple question, but I'm noticing that for queries
>>>> that run on 1+ TB of data, it can take Hive up to 30 minutes to actually
>>>> start the first map-reduce stage. What is it doing? I imagine it's
>>>> gathering information about the data somehow; this 'startup' time is
>>>> clearly a function of the amount of data I'm trying to process.
>>>>
>>>> Cheers,
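P.S. For anyone finding this thread in the archives: the min split size Edward mentions can be passed the same way as the logger setting above, via -hiveconf. A sketch using the MR1-era property names and a placeholder 256 MB value (newer Hadoop spells it mapreduce.input.fileinputformat.split.minsize):

    # Larger splits mean fewer map tasks; tune the byte value to your data
    bin/hive -hiveconf mapred.min.split.size=268435456 \
             -hiveconf mapred.max.split.size=268435456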