RE: Optimization on bucketized/sorted tables

mdefoinplatel.ext Fri, 23 Mar 2012 03:39:23 -0700

Hi Mark;
Many thanks for your support !

Amongst the four query types I mentioned I was at least expecting that the 
first one will be optimized. I noticed the hive.optimize.groupby parameter and 
I thought this case was sorted out already.


So if I understand you well, the only point in storing a table bucketized 
and/or sorted at the moment resides in the optimization of the join operations, 
right ? Could you please give me an example of such a query that would benefit 
from maintaining the tables sorted ?

Cheers,
Michael

-----Message d'origine-----
De : Mark Grover [mailto:mgro...@oanda.com] 
Envoyé : jeudi 22 mars 2012 17:58
À : user@hive.apache.org
Objet : Re: Optimization on bucketized/sorted tables

Hi Michael,
This JIRA is along the lines of your questions:
https://issues.apache.org/jira/browse/HIVE-2846

The following is based on my understanding so take it with a grain of salt:-)
You're right. The 4 kinds of queries you pointed out can be potentially be 
optimized if the source table(s) are bucketed and/or sorted by the appropriate 
columns. Another query that could be optimized based on bucketing/sorting is 
join. This is presently being done in bucketed map joins and sort merge joins.

However, like the JIRA ticket mentions, the bucketing/sorting information isn't 
presently stored in the metastore, so the queries can't make use of them 
without specifying hints like joins do.

To answer your last question, I think you have to explicitly (at least for now) 
mention DISTRIBUTE BY and SORT BY in your query and that's what I do in my 
queries too.

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation 

www: oanda.com www: fxtrade.com 

"Best Trading Platform" - World Finance's Forex Awards 2009. 
"The One to Watch" - Treasury Today's Adam Smith Awards 2009. 


----- Original Message -----
From: "mdefoinplatel ext" <mdefoinplatel....@orange.com>
To: user@hive.apache.org
Sent: Tuesday, March 20, 2012 10:19:41 AM
Subject: Optimization on bucketized/sorted tables




Hi folks, 

I have several questions about optimization in Hive, they are mainly related to 
bucketized/sorted tables. 



Let say I have a table T bucketized on user_id and sorted by user_id, time. 



CREATE TABLE T 

( user_id BIGINT, 

time INT 

) 

CLUSTERED BY(user_id) SORTED BY(user_id, time) INTO 64 BUCKETS; 





In a general way, I wonder which of the following operations will benefit from 
the fact that T is bucketized and sorted. 



1) Group by 

SELECT user_id, count(time) FROM T GROUP BY user_id; 



2) Distribute by 

SELECT user_id, time FROM T DISTRIBUTE BY user_id; 



3) Distribute by, Sort by 

SELECT user_id, time FROM T DISTRIBUTE BY user_id SORT BY user_id, time; 



4) Insert into a bucketized/sorted table 



CREATE TABLE T2 

( user_id BIGINT, 

time INT 

) 

CLUSTERED BY(user_id) SORTED BY(user_id, time) INTO 64 BUCKETS; 



set hive.enforce.bucketing = true; 



INSERT OVERWRITE TABLE T2 SELECT T.user_id, T.time FROM T; 





Finally, on a slightly more specific topic… 



Let say I want to perform the ‘sessionization’ on the table T and I am planning 
to call a python script to do that job. 

To get a valid answer I must ensure that the data are sorted by user_id,time 
and that all the data for a given user_id are processed by a single call to my 
script. 



I am planning to run the following query: 

FROM (SELECT user_Id, time FROM T DISTRIBUTE BY user_id SORT BY user_id, time) 
s SELECT TRANSFORM (s.user_id, s.time) USING 'python session.py' AS user_id, 
avg_session, nb_session; 



So I wonder first if this is the correct approach and second if the ‘ 
DISTRIBUTE BY user_id SORT BY user_id, time ’ clauses are required knowing that 
T is already bucketized and sorted on the right columns. 



Many thanks in advance for your help, 

Michael 

_________________________________________________________________________________________________________________________

Ce message et ses pieces jointes peuvent contenir des informations 
confidentielles ou privilegiees et ne doivent donc
pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce 
message par erreur, veuillez le signaler
a l'expediteur et le detruire ainsi que les pieces jointes. Les messages 
electroniques etant susceptibles d'alteration,
France Telecom - Orange decline toute responsabilite si ce message a ete 
altere, deforme ou falsifie. Merci.

This message and its attachments may contain confidential or privileged 
information that may be protected by law;
they should not be distributed, used or copied without authorisation.
If you have received this email in error, please notify the sender and delete 
this message and its attachments.
As emails may be altered, France Telecom - Orange is not liable for messages 
that have been modified, changed or falsified.
Thank you.
_________________________________________________________________________________________________________________________

Ce message et ses pieces jointes peuvent contenir des informations 
confidentielles ou privilegiees et ne doivent donc
pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce 
message par erreur, veuillez le signaler
a l'expediteur et le detruire ainsi que les pieces jointes. Les messages 
electroniques etant susceptibles d'alteration,
France Telecom - Orange decline toute responsabilite si ce message a ete 
altere, deforme ou falsifie. Merci.

This message and its attachments may contain confidential or privileged 
information that may be protected by law;
they should not be distributed, used or copied without authorisation.
If you have received this email in error, please notify the sender and delete 
this message and its attachments.
As emails may be altered, France Telecom - Orange is not liable for messages 
that have been modified, changed or falsified.
Thank you.

RE: Optimization on bucketized/sorted tables

Reply via email to