Thank you very much for your quick answers!
The intention behind this is to improve the execution time and (primarily) to examine the impact of block-co-location (research project) for this particular query (simplified):

select A.x, B.y, A.z from tableA as A inner join tableB as B on

The "real" query includes three joins and the data size is in pb-range. Therefore several nodes (5 in the test environment with less data) are used (without any load balancer).

Could you give me some hints what code changes are required and which files are affected? I don't know how to give Impala the information that it should only join the local data blocks on each node and then pass it to the "final" node which receives all intermediate results. I hope you can help me to get this working. That would be awesome!

Best regards

Am 12.03.2018 um 18:38 schrieb Alexander Behm:
I suppose one exception is if your data lives only on a single node. Then you can set num_nodes=1 and make sure to send the query request to the impalad running on the same data node as the target data. Then you should get a local join.

On Mon, Mar 12, 2018 at 9:30 AM, Alexander Behm < <>> wrote:

    Such a specific block arrangement is very uncommon for typical
    Impala setups, so we don't attempt to recognize and optimize this
    narrow case. In particular, such an arrangement tends to be short
    lived if you have the HDFS balancer turned on.

    Without making code changes, there is no way today to remove the
    data exchanges and make sure that the scheduler assigns scan
    splits to nodes in the desired way (co-located, but with possible
    load imbalance).

    In what way is the current setup unacceptable to you? Is this
    pre-mature optimization? If you have certain performance
    expectations/requirements for specific queries we might be able to
    help you improve those. If you want to pursue this route, please
    help us by posting complete query profiles.


    On Mon, Mar 12, 2018 at 6:29 AM, Philipp Krause
    <>> wrote:

        Hello everyone!

        In order to prevent network traffic, I'd like to perform local
        joins on each node instead of exchanging the data and perform
        a join over the complete data afterwards. My query is
        basically a join over three three tables on an ID attribute.
        The blocks are perfectly distributed, so that e.g. Table A -
        Block 0  and Table B - Block 0  are on the same node. These
        blocks contain all data rows with an ID range [0,1]. Table A -
        Block 1  and Table B - Block 1 with an ID range [2,3] are on
        another node etc. So I want to perform a local join per node
        because any data exchange would be unneccessary (except for
        the last step when the final node recevieves all results of
        the other nodes). Is this possible?
        At the moment the query plan includes multiple data exchanges,
        although the blocks are already perfectly distributed (manually).
        I would be grateful for any help!

        Best regards
        Philipp Krause

Reply via email to