Replacing BroadcastNestedLoopJoin

Eric Beabes Thu, 12 Aug 2021 16:10:29 -0700

We’ve two datasets that look like this:

Dataset A: App specific data that contains (among other fields): ip_address



Dataset B: Location data that contains start_ip_address_int,
end_ip_address_int, latitude, longitude

We’re (left) joining these two datasets as: A.ip_address >=
B.start_ip_address_int AND A.ip_address <= B.end_ip_address_int. When
there's a match, we pick latitude & longitude from Dataset B.

This works fine but it takes a LONG time (over 20 minutes) to complete for
SMALL datasets.

Dataset A => Usually contains 110,000

Dataset B => Contains 12.5 Million rows. This is “static” data. Hasn’t
changed since August 2020.

When we looked at the DAG, it seems a BroadcastNestedLoopJoin is getting
used which supposedly is very slow. It seems Spark selects it by default
when we have “in equal” conditions such as “greater than”, “less than”.

What’s the best way to speed up this process? Thanks in advance.

Replacing BroadcastNestedLoopJoin

Reply via email to