Re: Performance tuning in hive

Abhishek Fri, 28 Sep 2012 13:46:04 -0700

Thanks, Bejoy. I did it, thank you.

Sent from my iPhone


On Sep 28, 2012, at 2:52 PM, "Bejoy KS" <bejoy...@yahoo.com> wrote:

> Hi Abhishek
> 
> I don't think Partitioned By and Clustered By are supported in CTAS.
> 
> You need to create the bucketed table separately, then enable 
> hive.enforce.bucketing, and after that use a SELECT statement on the parent 
> table to load data into the bucketed one.
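> 
> A minimal sketch of those steps, assuming a hypothetical parent table 
> named orders with columns (order_id, customer_id, amount), a hypothetical 
> target table named orders_bucketed, and an arbitrary bucket count of 32:
> 
> ```sql
> -- 1. Create the bucketed table on its own (not through CTAS).
> CREATE TABLE orders_bucketed (
>   order_id    BIGINT,
>   customer_id BIGINT,
>   amount      DOUBLE
> )
> CLUSTERED BY (customer_id) INTO 32 BUCKETS;
> 
> -- 2. Make Hive honour the bucket definition when writing.
> SET hive.enforce.bucketing = true;
> 
> -- 3. Load from the parent table with a plain SELECT.
> INSERT OVERWRITE TABLE orders_bucketed
> SELECT order_id, customer_id, amount
> FROM orders;
> ```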
> 
> Regards
> Bejoy KS
> 
> Sent from handheld, please excuse typos.
> From: Abhishek <abhishek.dod...@gmail.com>
> Date: Fri, 28 Sep 2012 11:14:56 -0400
> To: Bejoy Ks<bejoy...@yahoo.com>
> ReplyTo: user@hive.apache.org
> Cc: user@hive.apache.org<user@hive.apache.org>
> Subject: Re: Performance tuning in hive
> 
> Hi Bejoy,
> 
> How do I use CTAS with Clustered By?
> 
> I am getting the following error when running Create Table As Select:
> 
> CTAS does not support partitioning in the target table.
> 
> Regards
> Abhi
> 
> Sent from my iPhone
> 
> On Sep 28, 2012, at 5:32 AM, Bejoy KS <bejoy...@yahoo.com> wrote:
> 
>> Hi Abhishek
>> 
>> Which optimization you should choose depends entirely on your queries, or the 
>> kind of queries fired on those tables. Based on that, you need to bucket and 
>> index them to get better performance. From a bird's-eye view, 
>> bucketing + indexing + map joins would be a good combination if it suits 
>> your data set.
>>  
>> Regards,
>> Bejoy KS
>> 
>> From: Abhishek <abhishek.dod...@gmail.com>
>> To: "user@hive.apache.org" <user@hive.apache.org> 
>> Cc: "user@hive.apache.org" <user@hive.apache.org> 
>> Sent: Friday, September 28, 2012 5:16 AM
>> Subject: Re: Performance tuning in hive
>> 
>> Hi Bejoy,
>> 
>> Thanks for the reply. Can I know whether a combination of
>> 1) indexing and bucketing
>>    or
>> 2) bucketing with RC file
>>    or
>> 3) sequence file with bucketing and indexing
>>    or
>> 4) map join with indexes
>>    or
>> any other combination of the above, or of options not mentioned, would give 
>> better performance?
>> 
>> Regards
>> Abhi
>> 
>> Sent from my iPhone
>> 
>> On Sep 27, 2012, at 2:44 PM, Bejoy KS <bejoy...@yahoo.com> wrote:
>> 
>>> Hi Abhishek
>>> 
>>> You can have a look at join optimizations as well as group by optimizations.
>>> 
>>> Join optimization - Based on your data sets you can go with a map-side 
>>> join, a bucketed map join, or a sort merge join.
>>> To enable map join -> set hive.auto.convert.join = true;
>>> 
>>> To enable bucketed map join -> set hive.optimize.bucketmapjoin = true; 
>>> (The prerequisite here is that both tables are bucketed on the join 
>>> column.)
>>> If the data in the buckets is sorted, then you can go with a sort merge join 
>>> as well; you need to enable the following properties:
>>>   set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
>>>   set hive.optimize.bucketmapjoin = true;
>>>   set hive.optimize.bucketmapjoin.sortedmerge = true;
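>>> 
>>> As an illustrative sketch only, with hypothetical tables orders_b and 
>>> customers_b and an arbitrary bucket count of 32, tables suited to these 
>>> joins could be declared and queried like this:
>>> 
>>> ```sql
>>> -- Both tables bucketed (and sorted) on the join column, same bucket count.
>>> CREATE TABLE orders_b (order_id BIGINT, customer_id BIGINT, amount DOUBLE)
>>> CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 32 BUCKETS;
>>> 
>>> CREATE TABLE customers_b (customer_id BIGINT, name STRING)
>>> CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 32 BUCKETS;
>>> 
>>> SET hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
>>> SET hive.optimize.bucketmapjoin = true;
>>> SET hive.optimize.bucketmapjoin.sortedmerge = true;
>>> 
>>> -- The MAPJOIN hint names the table to be joined on the map side.
>>> SELECT /*+ MAPJOIN(c) */ o.order_id, c.name, o.amount
>>> FROM orders_b o
>>> JOIN customers_b c ON o.customer_id = c.customer_id;
>>> ```
>>> 
>>> The tables still have to be populated with hive.enforce.bucketing (and 
>>> hive.enforce.sorting for the sorted case) set to true, so that the files 
>>> on disk actually match the declared layout.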
>>> 
>>> For details you can refer to the following URL:
>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
>>> 
>>> Group By optimization - You can go ahead with a few group by optimizations 
>>> as well. A few pointers are here:
>>> http://mail-archives.apache.org/mod_mbox/hive-user/201209.mbox/%3cb55ff166-239e-4e39-bf92-3ae59eb78...@gmail.com%3E
>>> 
>>> 
>>> Hive Indexes - Joins and group bys get optimized better with buckets. Based 
>>> on your queries, you need to predetermine how your tables should be 
>>> bucketed. Indexing also gives you a great performance advantage for queries 
>>> that involve group by and where. Join optimization using indexes is in 
>>> progress:
>>> https://issues.apache.org/jira/browse/HIVE-2845
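>>> 
>>> A hedged sketch of a compact index, using a hypothetical table named 
>>> orders and a hypothetical index name idx_orders_customer:
>>> 
>>> ```sql
>>> CREATE INDEX idx_orders_customer
>>> ON TABLE orders (customer_id)
>>> AS 'COMPACT'
>>> WITH DEFERRED REBUILD;
>>> 
>>> -- Build (or refresh) the index data.
>>> ALTER INDEX idx_orders_customer ON orders REBUILD;
>>> 
>>> -- Let the optimizer consider the index for WHERE filters.
>>> SET hive.optimize.index.filter = true;
>>> ```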
>>> 
>>> 
>>> RC File or Sequence File is a choice to be made based on the query 
>>> patterns. If you are querying only a few columns, then RC files give you a 
>>> performance edge, but if the queries span pretty much all 
>>> columns, then use the more general Sequence Files.
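>>> 
>>> For illustration, the storage format is chosen per table at creation 
>>> time; the table and column names below are hypothetical:
>>> 
>>> ```sql
>>> -- Columnar layout: helps when queries read only a few columns.
>>> CREATE TABLE events_rc (event_id BIGINT, user_id BIGINT, payload STRING)
>>> STORED AS RCFILE;
>>> 
>>> -- Row-oriented layout: a general choice when most columns are read.
>>> CREATE TABLE events_seq (event_id BIGINT, user_id BIGINT, payload STRING)
>>> STORED AS SEQUENCEFILE;
>>> ```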
>>> 
>>>  
>>> Regards,
>>> Bejoy KS
>>> 
>>> From: Abhishek <abhishek.dod...@gmail.com>
>>> To: Hive <user@hive.apache.org> 
>>> Sent: Thursday, September 27, 2012 7:03 PM
>>> Subject: Performance tuning in hive
>>> 
>>> Hi all,
>>> 
>>> I am trying to improve the performance of some queries in Hive. The 
>>> queries mostly contain left outer joins, group by, conditional checks, and 
>>> union all. I have overridden some properties in the Hive shell:
>>> 
>>> Set io.sort.mb=512
>>> Set io.sort.factor=100
>>> Set mapred.child.java.opts=-Xmx2048m
>>> Set hive.map.aggr=true
>>> Set hive.exec.parallel=true
>>> Set mapred.job.reuse.jvm.num.tasks=-1
>>> Set hive.mapred.map.speculative.execution=false
>>> Set hive.mapred.reduce.speculative.execution=false
>>> 
>>> I got some performance gain.
>>> 
>>> I still want to improve the performance of these queries.
>>> 
>>> Which of the following gives me better performance?
>>> 
>>> Rcfile
>>> Indexing
>>> Bucketing
>>> Sequence file 
>>> Combination of above
>>> 
>>> Or 
>>> 
>>> Some configuration parameter tuning
>>> 
>>> Which of the above yields the best performance?
>>> 
>>> Thanks in advance.
>>> 
>>> Regards
>>> Abhi
>>> 
>>> 
>>> 
>> 
>>
