Why not do that with Spark SQL, so the executors do the work in parallel,
rather than running a sequential filter on the driver?
SELECT * FROM A LEFT JOIN B ON A.fk = B.fk WHERE B.pk IS NULL LIMIT k
If you were sorting just so you could iterate in order, this might save you a
couple of sorts too.
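In case it helps, here is a rough PySpark sketch of that idea, not a tested implementation. It assumes both RDDs hold (fk, pk) tuples and that pyspark is installed; the helper names anti_join_sql and rows_missing_from_b, and the view names A and B, are made up for illustration.

```python
def anti_join_sql(left, right, key, right_pk, k):
    """Build the left-join / IS NULL query that keeps rows of `left` with no match in `right`."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return (
        f"SELECT * FROM {left} "
        f"LEFT JOIN {right} ON {left}.{key} = {right}.{key} "
        f"WHERE {right}.{right_pk} IS NULL "
        f"LIMIT {int(k)}"
    )


def rows_missing_from_b(spark, rdd_a, rdd_b, k):
    """Run the query on the executors and bring at most k rows back to the driver.

    `spark` is assumed to be an existing SparkSession; the (fk, pk) column
    layout is an assumption, adjust it to the real schema.
    """
    spark.createDataFrame(rdd_a, ["fk", "pk"]).createOrReplaceTempView("A")
    spark.createDataFrame(rdd_b, ["fk", "pk"]).createOrReplaceTempView("B")
    # Only the k result rows are collected, so driver memory stays small.
    return spark.sql(anti_join_sql("A", "B", "fk", "pk", k)).collect()
```

One caveat: LIMIT without an ORDER BY returns an arbitrary k rows, so if "top k" means the first k in some ordering, the query would need an ORDER BY on that column before the LIMIT.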
https://rich
Hi all,
I am aware that collect() returns the whole list to the driver, which causes an
OOM when the list is too big.
Is toLocalIterator safe to use with a very big list? I want to access all the
values one by one.
Basically, the goal is to compare two sorted RDDs (A and B) to find the top k
entries