Randomly distributed tables make HAWQ 2.x more elastic: big queries use more
resources, while small queries use fewer. A hash-distributed table, however,
must use exactly as many virtual segments as its bucket number, no matter
whether the query cost is large or small. As a result, even a scan on a very
small table needs a large number (by default 6 * #nodes) of virtual segments.
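To make the contrast concrete, here is an illustrative sketch (not HAWQ source code; the sizing constants are assumptions for illustration) of how many virtual segments a scan acquires under each policy:

```python
# Illustrative sketch, NOT the actual HAWQ resource negotiator.
# Assumptions for illustration: a random-table scan sizes itself at
# roughly one vseg per 128 MB scanned; a hash table is pinned to its
# bucket number, which defaults to 6 per node.

def vsegs_for_scan(policy, data_size_mb, num_nodes,
                   vseg_size_mb=128, hash_vsegs_per_node=6):
    """Return the number of virtual segments a table scan would use."""
    if policy == "hash":
        # Fixed by the table's bucket number, regardless of query size.
        return hash_vsegs_per_node * num_nodes
    # Randomly distributed: scales with the amount of data scanned.
    return max(1, -(-data_size_mb // vseg_size_mb))  # ceiling division

# A 10 MB table on a 10-node cluster:
print(vsegs_for_scan("random", 10, 10))  # 1 vseg for the tiny query
print(vsegs_for_scan("hash", 10, 10))    # always 6 * 10 = 60 vsegs
```

The point is only the shape of the two curves: the random policy degrades gracefully for small inputs, while the hash policy pays the full bucket count every time.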
Moreover, HDFS blocks are distributed across different nodes. To achieve a
better local disk read ratio, the data locality algorithm assigns each block
of a randomly distributed table to an appropriate virtual segment. For a
hash-distributed table, however, all the blocks in one file must be processed
by the same virtual segment (a property of hash distribution), so we have to
assign hash tables at file granularity. If the blocks of a file are located
on quite different nodes, the data locality ratio becomes low, which often
happens after expanding or shrinking the cluster.
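The block-granularity vs. file-granularity difference can be sketched as follows (again an illustrative model, not the actual HAWQ scheduler; a block read counts as "local" when the chosen virtual segment runs on a node holding one of its replicas):

```python
# Illustrative model of locality assignment, NOT the HAWQ implementation.
# Each file is a list of blocks; each block is the list of node names
# holding a replica of it.

def locality_ratio_block_level(files, vseg_nodes):
    """Random tables: each block is assigned independently, so a block
    is local whenever any vseg node holds one of its replicas."""
    blocks = [replicas for f in files for replicas in f]
    local = sum(1 for replicas in blocks
                if any(node in vseg_nodes for node in replicas))
    return local / len(blocks)

def locality_ratio_file_level(files, vseg_nodes):
    """Hash tables: all blocks of a file go to one vseg, so at best we
    place that vseg on the node holding the most of the file's blocks."""
    blocks_total = local_total = 0
    for f in files:
        best = max(sum(1 for replicas in f if node in replicas)
                   for node in vseg_nodes)
        local_total += best
        blocks_total += len(f)
    return local_total / blocks_total

# One file of 4 blocks whose replicas ended up on 4 different nodes
# (e.g. after cluster expansion); vsegs are available on all 4 nodes.
files = [[["n1"], ["n2"], ["n3"], ["n4"]]]
vsegs = {"n1", "n2", "n3", "n4"}
print(locality_ratio_block_level(files, vsegs))  # 1.0
print(locality_ratio_file_level(files, vsegs))   # 0.25
```

With per-block assignment every block finds a co-located vseg, but once the whole file is pinned to a single vseg, only the blocks that happen to live on that one node are read locally.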
So even though a hash-distributed table can avoid some data motion, we still
recommend using randomly distributed tables as the default in HAWQ 2.x.
On Wed, Sep 21, 2016 at 2:14 PM, Vineet Goel <vvin...@apache.org> wrote:
> Hi all,
> I have received a fair number of questions on the topic of handling data
> locality and co-located joins in HAWQ 2. Most of the questions come from
> users accustomed to HAWQ 1.x, which defaulted to HASH-distributed tables
> distributed by a key and hence resulted in local joins in most cases, for
> better performance.
> With the new architecture and RANDOM distribution policy as the default, I
> thought it would be good to crowd-source some useful info here from the
> community on how performance is achieved with the new architecture and data
> distribution policy. Questions include how data movement is minimized,
> how/when dynamic redistribution is utilized, how joins are co-located, etc.
> Can someone start by providing insights on this topic?