Hi guys,
I'm trying to do a cross join (cartesian product) with 3 tables stored as
parquet. Each table has 1 column, a long key.
Table A has 60,000 keys with 1000 partitions
Table B has 1000 keys with 1 partition
Table C has 4 keys with 1 partition
The output should be 240 million rows (60,000 x 1,000 x 4 key combinations).
How should I partition and order the joins, and more generally, how does a
cartesian product work in Spark?
dfA.join(dfB).join(dfC)?
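For concreteness, this is roughly what I'm running (just a sketch; the
parquet paths are placeholders and sqlContext is the spark-shell one):

// Each table is a single long key column stored as parquet.
val dfA = sqlContext.read.parquet("/data/tableA") // 60,000 keys, 1000 partitions
val dfB = sqlContext.read.parquet("/data/tableB") // 1,000 keys, 1 partition
val dfC = sqlContext.read.parquet("/data/tableC") // 4 keys, 1 partition

// join with no condition shows up as CartesianProduct in the plan
val product = dfA.join(dfB).join(dfC)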
This seems very slow, and sometimes the executors crash (I'm assuming from
running out of memory). I'm running with low parallelism: few cores (5) and
a big heap (30 GB).
Is there another way to do this join? Should I somehow repartition the
records?
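For reference, this is the kind of alternative I had in mind (only a
sketch, assuming B and C are small enough to fit in executor memory;
broadcast is org.apache.spark.sql.functions.broadcast):

import org.apache.spark.sql.functions.broadcast

// Hint Spark to ship the two tiny tables to every executor, so each
// task pairs its slice of A with local copies of B and C instead of
// going through a shuffled CartesianProduct.
val product = dfA.join(broadcast(dfB)).join(broadcast(dfC))

// Or spread the 240M output rows over more tasks by repartitioning A
// first, so no single task materializes too much output at once.
val product2 = dfA
  .repartition(1000)
  .join(broadcast(dfB))
  .join(broadcast(dfC))

Would that be the right direction, or is there a better pattern for this?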
Cheers,
~N
Plan from the above operation:
== Physical Plan ==
CartesianProduct
 CartesianProduct
  Repartition 800, true
   Project [Rand -2281254270918005092 AS _c0#159,KEYA#162L]
    PhysicalRDD [KEYA#162L], MapPartitionsRDD[148] at
  InMemoryColumnarTableScan [KEYB#117], (InMemoryRelation [KEYB#117], true, 2000, StorageLevel(true, true, false, true, 1), (PhysicalRDD [KEYB#55], MapPartitionsRDD[26] at), None)
 InMemoryColumnarTableScan [KEYC#120], (InMemoryRelation [KEYC#120], true, 2000, StorageLevel(true, true, false, true, 1), (Repartition 1, true), None)