Thanks for the information, Ruben.

1. I found the issue https://issues.apache.org/jira/browse/HIVE-1642
Does it mean that the MAPJOIN hint has been obsolete since 2010 and I can 
avoid this hint entirely?

2. Sorry for the stupid questions, but I still can't understand bucketing. 
Partitioning is OK: it is HDFS folders, and I am able to understand how it 
improves query execution. But what is bucketing in terms of storing data?

3. I am embarrassed to ask such stupid questions, but is there a 'how Hive 
works' manual or something like that?

And again, sorry for my bad English.

Vyacheslav

Tue, 29 May 2012 10:02:14 +0200 from Ruben de Vries <ruben.devr...@hyves.nl>:
> Partitioning can greatly increase performance for WHERE clauses, since Hive 
> can skip reading the data in partitions which do not meet the condition.
> 
> For example, if you partition by date (I use an INT column dateint, set to 
> YYYYMMDD) and you query with WHERE dateint >= 20120101, then Hive won't even 
> have to touch any of the data from before 2012-01-01. In my case that means 
> I don't read the previous 2 years of data, which reduces the time the query 
> takes by about 70% :-)
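
A minimal sketch of the partitioning described above. The table name `page_views` and its columns are hypothetical; only the `dateint` partition column in YYYYMMDD form comes from the message.

```sql
-- Hypothetical table; only the dateint partition column is taken from the message.
-- Each distinct dateint value is stored in its own HDFS directory,
-- for example .../page_views/dateint=20120101/
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (dateint INT);

-- The filter is on the partition column, so Hive can skip the
-- directories for every date before 2012-01-01 without reading them.
SELECT COUNT(*)
FROM page_views
WHERE dateint >= 20120101;
```

The saving comes from pruning whole directories, so it only applies when the WHERE clause filters on the partition column itself.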
> 
> 
> 
> Buckets are the second awesome way of getting a big optimization in, 
> specifically for joins! If the 2 tables you're joining are both bucketed on 
> their join column, the join will also be much faster.
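
As a rough illustration of the bucketed join described above: in terms of storage, each bucket is a file inside the table (or partition) directory, and a row is assigned to a bucket by hashing the bucketing column modulo the number of buckets. The table names, columns, the staging table `visits_raw`, and the bucket count of 32 are assumptions for illustration only.

```sql
-- Hypothetical tables, both bucketed on the join column user_id.
CREATE TABLE visits (
  user_id BIGINT,
  url     STRING
)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

CREATE TABLE profiles (
  user_id BIGINT,
  country STRING
)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

-- In Hive releases of this era, this setting makes INSERT write
-- one file per bucket so the data really matches the declaration.
SET hive.enforce.bucketing = true;

-- visits_raw is a hypothetical unbucketed staging table.
INSERT OVERWRITE TABLE visits
SELECT user_id, url FROM visits_raw;

-- Rows with the same user_id fall into the same bucket number in
-- both tables, so matching buckets can be joined with each other.
SELECT v.url, p.country
FROM visits v
JOIN profiles p ON (v.user_id = p.user_id);
```

Whether a given join actually takes advantage of the buckets depends on the Hive version and on settings such as `hive.optimize.bucketmapjoin`, so it is worth checking the plan with EXPLAIN.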
> 
> Another good join optimization is MAPJOIN. If one of the tables you're 
> joining is rather small (below 30 MB), you can force a MAPJOIN or enable 
> automatic map joins. I personally prefer explicit behavior over automagic, 
> so I use a hint:
> 
> SELECT /*+ MAPJOIN(the_small_table) */ fields FROM table JOIN 
> the_small_table, etc.
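
The two options above could look like this. `orders` and `countries` are hypothetical table names, and the threshold value is only an example; the defaults of these settings vary between Hive versions.

```sql
-- Explicit hint: the + must directly follow /* for the comment
-- to be read as a hint. c is the alias of the small table.
SELECT /*+ MAPJOIN(c) */ o.order_id, c.name
FROM orders o
JOIN countries c ON (o.country_id = c.country_id);

-- Automatic alternative: let Hive convert the join itself when the
-- small table is below the size threshold (given in bytes).
SET hive.auto.convert.join = true;
SET hive.mapjoin.smalltable.filesize = 25000000;

SELECT o.order_id, c.name
FROM orders o
JOIN countries c ON (o.country_id = c.country_id);
```

In both forms the small table is loaded into memory on each mapper, so the join needs no reduce phase.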
> 
> SORTED BY is for sorting within buckets; I think it's only relevant if 
> you're doing a lot of ordering.
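
A small sketch of what SORTED BY looks like in a table definition, assuming a hypothetical `events` table and an arbitrary bucket count of 32:

```sql
-- Hypothetical table: rows are bucketed by user_id and, inside each
-- bucket file, kept in ascending event_time order.
CREATE TABLE events (
  user_id    BIGINT,
  event_time STRING,
  action     STRING
)
CLUSTERED BY (user_id) SORTED BY (event_time ASC) INTO 32 BUCKETS;
```

The clause only records how the data is expected to be laid out; the job that fills the table still has to write the rows bucketed and sorted for the declaration to be true.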
> 
> 
> 
> I'm assuming sequencefiles are faster, but I wouldn't really know :( need 
> someone else to tell us more about that ;)
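
One way to find out for a particular data set is to load the same rows into tables with different formats and time the same query against each. This is a sketch for such a comparison, not a recommendation; the table names are hypothetical.

```sql
-- Hypothetical tables holding the same data in two formats.
CREATE TABLE logs_text (line STRING)
STORED AS TEXTFILE;

CREATE TABLE logs_seq (line STRING)
STORED AS SEQUENCEFILE;

-- Write compressed output. BLOCK compresses groups of records
-- together rather than each record separately.
SET hive.exec.compress.output = true;
SET io.seqfile.compression.type = BLOCK;

-- Copy the text data into the SequenceFile table.
INSERT OVERWRITE TABLE logs_seq
SELECT line FROM logs_text;
```

After the copy, running the same SELECT against `logs_text` and `logs_seq` gives a direct comparison of size on disk and query time.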
> 
> 
> 
> 
> 
> -----Original Message-----
> From: Avdeev V. M. [mailto:ls...@list.ru]
> Sent: Monday, May 28, 2012 7:17 AM
> To: user@hive.apache.org
> Subject: table design and performance questions
> 
> 
> 
> Question from a novice.
> 
> 
> 
> Where can I read about table design best practices? I have a measure table 
> with millions of rows and many dimension tables with fewer than 1000 rows 
> each. I can't figure out the optimal design for both kinds of tables. Are 
> there performance tuning guides or a performance FAQ?
> 
> 
> 
> Specifically
> 
> 1) PARTITIONED BY, CLUSTERED BY, SORTED BY statements. In which cases does 
> using these statements make sense?
> 
> 2) The DDL language manual says 'This can improve performance on certain 
> kinds of queries.' about the CLUSTERED BY statement. What kinds of queries 
> can be improved?
> 
> 3) What is preferable - SEQUENCEFILE, RCFILE or TEXTFILE - in terms of 
> performance? What aspects should be taken into account when choosing a file 
> format?
> 
> 4) The Compressed storage article says 'Keeping data compressed in Hive 
> tables has, in some cases, been known to give better performance than 
> uncompressed storage;' and again, what are these cases?
> 
> 
> 
> Thanks!
> 
> Vyacheslav
> 
> 
