We’ve two datasets that look like this: Dataset A: App specific data that contains (among other fields): ip_address
Dataset B: Location data that contains start_ip_address_int, end_ip_address_int, latitude, longitude We’re (left) joining these two datasets as: A.ip_address >= B.start_ip_address_int AND A.ip_address <= B.end_ip_address_int. When there's a match, we pick latitude & longitude from Dataset B. This works fine but it takes a LONG time (over 20 minutes) to complete for SMALL datasets. Dataset A => Usually contains 110,000 Dataset B => Contains 12.5 Million rows. This is “static” data. Hasn’t changed since August 2020. When we looked at the DAG, it seems a BroadcastNestedLoopJoin is getting used which supposedly is very slow. It seems Spark selects it by default when we have “in equal” conditions such as “greater than”, “less than”. What’s the best way to speed up this process? Thanks in advance.