Thank you very much for your quick answers!
The intention behind this is to improve the execution time and
(primarily) to examine the impact of block-co-location (research
project) for this particular query (simplified):
select A.x, B.y, A.z from tableA as A inner join tableB as B on A.id=B.id
The "real" query includes three joins and the data size is in pb-range.
Therefore several nodes (5 in the test environment with less data) are
used (without any load balancer).
Could you give me some hints what code changes are required and which
files are affected? I don't know how to give Impala the information that
it should only join the local data blocks on each node and then pass it
to the "final" node which receives all intermediate results. I hope you
can help me to get this working. That would be awesome!
Am 12.03.2018 um 18:38 schrieb Alexander Behm:
I suppose one exception is if your data lives only on a single node.
Then you can set num_nodes=1 and make sure to send the query request
to the impalad running on the same data node as the target data. Then
you should get a local join.
On Mon, Mar 12, 2018 at 9:30 AM, Alexander Behm
<alex.b...@cloudera.com <mailto:alex.b...@cloudera.com>> wrote:
Such a specific block arrangement is very uncommon for typical
Impala setups, so we don't attempt to recognize and optimize this
narrow case. In particular, such an arrangement tends to be short
lived if you have the HDFS balancer turned on.
Without making code changes, there is no way today to remove the
data exchanges and make sure that the scheduler assigns scan
splits to nodes in the desired way (co-located, but with possible
In what way is the current setup unacceptable to you? Is this
pre-mature optimization? If you have certain performance
expectations/requirements for specific queries we might be able to
help you improve those. If you want to pursue this route, please
help us by posting complete query profiles.
On Mon, Mar 12, 2018 at 6:29 AM, Philipp Krause
In order to prevent network traffic, I'd like to perform local
joins on each node instead of exchanging the data and perform
a join over the complete data afterwards. My query is
basically a join over three three tables on an ID attribute.
The blocks are perfectly distributed, so that e.g. Table A -
Block 0 and Table B - Block 0 are on the same node. These
blocks contain all data rows with an ID range [0,1]. Table A -
Block 1 and Table B - Block 1 with an ID range [2,3] are on
another node etc. So I want to perform a local join per node
because any data exchange would be unneccessary (except for
the last step when the final node recevieves all results of
the other nodes). Is this possible?
At the moment the query plan includes multiple data exchanges,
although the blocks are already perfectly distributed (manually).
I would be grateful for any help!