> Ok so what is the resolution here? My understanding is that bucketing
> does not improve the performance. Is that correct?

There are no right answers here - I spend a lot of time fixing
over-zealous optimization attempts:

<http://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5>

If you use bucketing to speed up a query without understanding the
interplay with the other parameters involved (total # of partitions, size
of each partition, bucket column type, skew across buckets), then it
generally ends in disaster. And depending on whether you pay for HDP
support or not, I might turn out to be your tow-truck driver.
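For reference, here's a rough sketch of where those knobs show up in the
DDL (table & column names are made up):

    -- hypothetical example: the bucket column & count are fixed at DDL
    -- time, so skew in user_id has to be understood up front
    CREATE TABLE clicks (
      user_id BIGINT,
      url     STRING
    )
    PARTITIONED BY (dt STRING)    -- total # of partitions
    CLUSTERED BY (user_id)        -- bucket col (watch the type & skew)
    SORTED BY (user_id)
    INTO 32 BUCKETS               -- controls the size of each bucket file
    STORED AS ORC;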

Bucketing in Hive-1.0 is not a general performance feature - it was
intended as a scalability feature for JOINs.

Bucketed map-joins can be really slow if you're limited to MapReduce
(https://issues.apache.org/jira/browse/HIVE-4488).


Even excluding the execution on the cluster, both Spark & MapReduce
serialize the lookup tables before the query kicks off.

That means there is a pause before the big table can be read (or even
tasks scheduled) - Tez pipelines the scheduling with the hashtable
generation (through different EdgeManagers), so it's not so bad.

But a bucket map-join is still slower than a regular map-join within a
single task, because Tez can cache the hashtable for the regular join -
it is the same one for every split the task encounters in the vertex.

*SO*, if your mapjoins are OOM'ing you might want to consider bucketing -
otherwise the buckets are wasted CPU for JOINs.
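If you do go down that route, the switches look roughly like this (my
recollection of the Hive-1.x names - check your version's docs):

    SET hive.auto.convert.join=true;       -- allow map-join conversion
    SET hive.optimize.bucketmapjoin=true;  -- load one bucket's hashtable
                                           -- per task, instead of the
                                           -- whole small table
    -- on Tez, the bucket map-join conversion is gated separately
    SET hive.convert.join.bucket.mapjoin.tez=true;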

With Hive-2.0, if Tez thinks your map-join might OOM, it can bucket the
data at runtime & produce a dynamic version of a bucketed map-join:

https://issues.apache.org/jira/browse/HIVE-10673


This feature by itself pays for all the complexity Tez has with its
runtime edge reconfiguration.
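As I understand the patch, that behaviour hangs off a single switch:

    -- from HIVE-10673: if the broadcast hashtable won't fit in memory,
    -- Tez re-partitions both sides at runtime into a dynamically
    -- partitioned hash join instead of failing the map-join
    SET hive.optimize.dynamic.partition.hashjoin=true;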

Back to filters.

You can get speedups in filter queries with bucketing, even in Hive-1.0
(if the data is sorted & clustered on the same col).

Even then, the split-elimination with predicate lookups is unavailable to
any engine using CombineHiveInputFormat (i.e. MapReduce & Spark).

So you have to be using ORC+Tez, with the sort order aligned precisely
along your lookup direction.
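Concretely, something along these lines (same made-up table as above) is
what lines the sort order up with the filter:

    -- push the filter down into the ORC row indexes, so stripes/splits
    -- whose min-max stats can't match are never read
    SET hive.optimize.index.filter=true;

    -- fast only because the table was SORTED BY (user_id) - the matching
    -- rows are packed into a handful of stripes
    SELECT count(*) FROM clicks WHERE user_id = 12345;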

If you're on Parquet, it gets a little worse - the tasks still spin up
(+~3s or so for each task wave) only to skip reading those rows entirely.

Cheers,
Gopal

