Thanks for the information, Ruben.

1. I found the issue https://issues.apache.org/jira/browse/HIVE-1642. Does it mean that the MAPJOIN hint has been obsolete since 2010 and that I can avoid this hint entirely?
2. Sorry for the stupid questions, but I still can't understand bucketing. Partitioning is OK: partitions are HDFS folders, and I am able to understand how that improves query execution. But what is bucketing in terms of storing data?

3. I am embarrassed to ask such stupid questions, but is there a 'how Hive works' manual or something like that?

And again, sorry for my bad English.

Vyacheslav

Tue, 29 May 2012 10:02:14 +0200 from Ruben de Vries <ruben.devr...@hyves.nl>:
> Partitioning can greatly increase performance for WHERE clauses since Hive
> can omit parsing the data in the partitions which do not meet the requirement.
>
> For example if you partition by date (I do it by INT dateint, in which case I
> set dateint to be YYYYMMDD) and you do WHERE dateint >= 20120101 then it
> won't even have to touch any of the data from before 2012-01-01 and in my
> case that means I don't parse the last 2 years of data, reducing the time the
> query takes by about 70% :-)
>
> Buckets are the second awesome way of getting a big optimization in,
> specifically for joins! If you have 2 tables you're joining onto each other
> then if they're both bucketed on their join column it will also greatly
> increase speed.
>
> Another good join optimization is MAPJOIN, if one of the tables you're
> joining is rather small (below 30 MB) then you can force it to MAPJOIN or you
> can enable automatic mapjoin, I personally prefer explicit behavior instead
> of automagic so use a hint:
>
> SELECT /*+ MAPJOIN(the_small_table) */ fields FROM table JOIN
> the_small_table, etc.
>
> Sorted by is for sorting within buckets, only relevant if you're doing a lot
> of ordering I think.
>
> I'm assuming sequencefiles are faster, but I wouldn't really know :( need
> someone else to tell us more about that ;)
>
> -----Original Message-----
> From: Avdeev V. M. [mailto:ls...@list.ru]
> Sent: Monday, May 28, 2012 7:17 AM
> To: user@hive.apache.org
> Subject: table design and performance questions
>
> Question from a novice.
>
> Where can I read about table design best practices? I have a measure table with
> millions of rows and many dimension tables with less than 1000 rows each. I
> can't find out how to get an optimal design for both kinds of tables. Are there
> performance tuning guides or a performance FAQ?
>
> Specifically
>
> 1) PARTITIONED BY, CLUSTERED BY, SORTED BY statements. In which cases does using
> these statements make sense?
>
> 2) DDL language manual says 'This can improve performance on certain kinds of
> queries.' about CLUSTERED BY statement. What kind of queries can be improved?
>
> 3) What is preferable - SEQUENCEFILE, RCFILE or TEXTFILE - in terms of
> performance? What aspects should be taken into account when choosing a file
> format?
>
> 4) Compressed storage article says 'Keeping data compressed in Hive tables
> has, in some cases, known to give better performance that uncompressed
> storage;' and again - What are these cases?
>
> Thanks!
>
> Vyacheslav
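
For reference, the partitioning, bucketing and MAPJOIN advice quoted above could be sketched in HiveQL roughly as follows. This is only a sketch: the table names (measure, small_dim), the column names and types, and the bucket count of 32 are assumptions for illustration and do not come from this thread. In storage terms, each partition is a directory and each bucket is a file inside it, with rows assigned to a bucket by hashing the CLUSTERED BY column.

```sql
-- Hypothetical small dimension table (name and columns are assumptions).
CREATE TABLE small_dim (
  id   INT,
  name STRING
);

-- Hypothetical measure table (names, types and bucket count are assumptions).
CREATE TABLE measure (
  dim_id INT,
  amount DOUBLE
)
PARTITIONED BY (dateint INT)              -- one HDFS directory per dateint value
CLUSTERED BY (dim_id) SORTED BY (dim_id)  -- rows hashed on dim_id into bucket files
INTO 32 BUCKETS
STORED AS SEQUENCEFILE;

-- In Hive versions of this era, this setting makes INSERTs honor the buckets.
SET hive.enforce.bucketing = true;

-- Partition pruning: only directories with dateint >= 20120101 are read.
SELECT COUNT(*) FROM measure WHERE dateint >= 20120101;

-- Explicit map join: the small table is loaded into memory on each mapper.
SELECT /*+ MAPJOIN(d) */ d.name, SUM(m.amount)
FROM measure m JOIN small_dim d ON (m.dim_id = d.id)
WHERE m.dateint >= 20120101
GROUP BY d.name;
```

The alternative to the hint mentioned above, automatic conversion, is controlled by the hive.auto.convert.join setting.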