I have about 15-20 joins to perform. Each of these tables is on the order of 6 million to 66 million rows, and the number of columns ranges from 20 to 400.
I read the Parquet files to obtain SchemaRDDs, then use the join functionality on two SchemaRDDs at a time, joining each previous result with the next SchemaRDD. Any ideas on how to deal with such a join-intensive Spark SQL process? Any advice on better ways to handle the joins? I will appreciate all input. Thanks!
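For reference, here is a minimal sketch of the chained-join pattern described above, assuming Spark 1.x (the SchemaRDD era) with an existing `SparkContext` named `sc`; the file paths, table names, and the join key `id` are all placeholders, not real schema details:

```scala
import org.apache.spark.sql.SQLContext

// Sketch only: paths, table names, and the join key are hypothetical.
val sqlContext = new SQLContext(sc)

val customers = sqlContext.parquetFile("customers.parquet") // yields a SchemaRDD
val orders    = sqlContext.parquetFile("orders.parquet")

customers.registerTempTable("customers")
orders.registerTempTable("orders")

// Project only the columns you need BEFORE joining -- with 20-400
// columns per table, pruning early sharply reduces shuffled data.
val joined = sqlContext.sql(
  """SELECT c.id, c.name, o.amount
     FROM customers c
     JOIN orders o ON c.id = o.id""")

// If the next join reuses this intermediate result, cache it so the
// whole chain of earlier joins is not recomputed.
joined.cache()
joined.registerTempTable("joined")
// ...then join "joined" against the next table the same way.
```

The same idea extends to the full chain: register each intermediate result as a temp table and join it against the next one, caching whenever a result feeds more than one downstream join.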