> Ok, so what is the resolution here? My understanding is that bucketing does not improve the performance. Is that correct?
There are no right answers here - I spend a lot of time fixing over-zealous optimization attempts <http://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5>

If you use bucketing to speed up a query without understanding the interplay with the other parameters involved (total # of partitions, size of each partition, bucket col type, skew towards buckets), then it generally ends up in disaster. And depending on whether you pay for HDP support or not, I might turn out to be your tow-truck driver.

Bucketing in hive-1.0 is not a general performance feature - it was intended as a scalability feature for JOINs.

Bucketed map-joins can be really slow if you're limited to MapReduce (https://issues.apache.org/jira/browse/HIVE-4488). Even excluding execution on the cluster, both Spark & MapReduce serialize the lookup tables before the query kicks off. That means there is a pause before the big table can be read (or even tasks scheduled) - Tez pipelines the scheduling with the generation, so it's not so bad (through different EdgeManagers).

But a bucket map-join is still slower than a regular map-join within a single task, because Tez can cache the hashtable for the regular join, as it is the same one for any split it encounters in the vertex.

*SO*, if your map-joins are OOM'ing, you might want to consider bucketing - otherwise it's wasted CPU for JOINs.

With Hive-2.0, if Tez thinks your map-join might OOM, it can bucket the data at runtime & produce a dynamic version of a bucketed map-join (https://issues.apache.org/jira/browse/HIVE-10673). This feature by itself pays for all the complexity Tez has with its runtime edge reconfiguration.

Back to filters. You can get speedups in filter queries with bucketing, even in Hive-1.0 (if the data is sorted & clustered on the same col). Even then, the split elimination with predicate lookups is unavailable to any engine using CombineHiveInputFormat (i.e. MapReduce & Spark). So you have to be using ORC+Tez, with the sort order aligned precisely along your lookup direction, to skip reading those rows entirely. If you're on Parquet, it gets a little worse (+~3s or so of spin-up for each task wave).

Cheers,
Gopal
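
PS: here's a minimal sketch of the kind of layout & knobs I'm describing, assuming Hive-1.x running on Tez - the table name, column names and bucket count are made up for illustration:

  -- bucketed & sorted ORC table, so filters and map-joins on user_id
  -- line up with the clustering (names are illustrative only)
  CREATE TABLE clicks (
    user_id BIGINT,
    url     STRING,
    ts      TIMESTAMP
  )
  CLUSTERED BY (user_id) SORTED BY (user_id) INTO 32 BUCKETS
  STORED AS ORC;

  -- Hive-1.x needs these set at insert time, or the buckets are a lie
  SET hive.enforce.bucketing=true;
  SET hive.enforce.sorting=true;

  -- run on Tez with ORC predicate push-down, so a filter on the
  -- sort/cluster column can skip row-groups instead of scanning them
  SET hive.execution.engine=tez;
  SET hive.optimize.index.filter=true;

  SELECT count(*) FROM clicks WHERE user_id = 12345;

  -- only reach for the bucketed map-join variant if the regular
  -- map-join hashtable is what's OOM'ing your tasks
  SET hive.auto.convert.join=true;
  SET hive.optimize.bucketmapjoin=true;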