My conclusion is that from performance point of view as of Hive 1.2.1 bucketing has limited scope if any. I cannot comment on Hive 2 as I have not got it.
Thanks, Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUr V8Pw Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15 http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908. pdf Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7. co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4 Publications due shortly: Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8 Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly http://talebzadehmich.wordpress.com NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Technology Ltd, its subsidiaries nor their employees accept any responsibility. -----Original Message----- From: Gopal Vijayaraghavan [mailto:go...@hortonworks.com] On Behalf Of Gopal Vijayaraghavan Sent: 26 January 2016 21:36 To: user@hive.apache.org Subject: Re: Hive Bucketing > Ok so what is the resolution here? My understanding is that bucketing >does not improve the performance. Is that correct? There are no right answers here - I spend a lot of time fixing over-zealous optimization attempts <http://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5> If you use bucketing to speed up a query without understanding the interplay with other parameters involved (total # of partitions, size of each partition, bucket col type, skew towards buckets), then it generally ends up in disaster. And depending on whether you pay for HDP support or not, I might turn out to be your tow-truck driver. Bucketing in hive-1.0 is not a general performance feature - it was intended as a scalability feature for JOINs. Bucketed map-joins can be really slow if you're limited to MapReduce (https://issues.apache.org/jira/browse/HIVE-4488). Even excluding the execution on cluster, both Spark & Mapreduce serializes the lookup tables before the query kicks off. That means there is a pause before the big table can be read (or even tasks scheduled) - Tez pipelines the scheduling with the generation, so it's not so bad (through different EdgeManagers). But a bucket map-join is still slower than a regular map-join within a single task, because the Tez can cache the hashtable for the regular join as it is the same one for any split it encounters in the vertex. *SO*, if your mapjoins are OOM'ing you might want to consider bucketing - otherwise they're wasted CPUs for JOINs. With Hive-2.0, if Tez thinks your map-join might OOM, it might bucket data at runtime & produce a dynamic version of a bucketed map-join https://issues.apache.org/jira/browse/HIVE-10673 This feature by itself pays for all the complexity Tez has with its runtime edge reconfiguration. Back to filters. You can get speedups in filter queries with bucketing, even in Hive-1.0 (if the data is sorted & clustered on same col). Even then, the split-elimination with predicate lookups is unavailable to any engine using CombineHiveInputFormat (i.e MapReduce & Spark). So you have to be using ORC+Tez & sorting aligned precisely along your lookup direction. If you're on Parquet, it gets a little worse (+~3s or so for each task wave spin up) and skip reading those rows entirely. Cheers, Gopal