Dear Sparkers,

I'm loading data from 5 sources into DataFrames (using the official connectors): Parquet, MongoDB, Cassandra, MySQL and CSV. I'm then joining those DataFrames in two different orders:
- mongo * cassandra * jdbc * parquet * csv (random order)
- parquet * csv * cassandra * jdbc * mongo (optimized order)
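For the interested readers, here is a rough sketch of the kind of size-based reasoning behind the optimized order. This is plain Python, not Spark, and all row counts and the uniform join selectivity are made-up numbers for illustration only: with a uniform selectivity, joining the small sources first keeps every intermediate result small, and a greedy "smallest next intermediate" heuristic reproduces the order above.

```python
# Toy model (plain Python, NOT Spark) of a left-deep join chain.
# All row counts and the uniform selectivity below are hypothetical.
rows = {"parquet": 1_000, "csv": 5_000, "cassandra": 50_000,
        "jdbc": 200_000, "mongo": 1_000_000}
SEL = 1e-5  # assumed fraction of each cross product that survives a join

def left_deep_cost(order):
    """Sum of the intermediate result sizes when joining left to right."""
    inter = rows[order[0]]
    total = 0.0
    for name in order[1:]:
        inter = inter * rows[name] * SEL  # estimated size after this join
        total += inter
    return total

def greedy_order():
    """Greedily pick the source that minimises the next intermediate size.
    With a uniform selectivity this degenerates to 'smallest source first'."""
    remaining = dict(rows)
    first = min(remaining, key=remaining.get)
    order, inter = [first], remaining.pop(first)
    while remaining:
        nxt = min(remaining, key=lambda n: inter * remaining[n] * SEL)
        inter = inter * remaining.pop(nxt) * SEL
        order.append(nxt)
    return order

random_order = ["mongo", "cassandra", "jdbc", "parquet", "csv"]
print(greedy_order())                   # the 'optimized' order above
print(left_deep_cost(random_order))     # much larger than...
print(left_deep_cost(greedy_order()))   # ...the cost of the greedy order
```

Of course, Spark's optimizer is free to rewrite whatever join tree I hand it, which is exactly the behaviour I'm asking about below.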
The first follows a random order, whereas I chose the second based on some optimization techniques (I can provide details for interested readers, or here if needed). After evaluating on increasing data sizes, the optimization techniques I developed didn't improve performance very noticeably. I inspected the logical/physical plans of the final joined DataFrame (using `explain(true)`). The first order was respected; the second, it turned out, wasn't, and MongoDB was queried first. However, that is only how it seemed to me; I'm not very confident reading the plans. Could someone help explain the `explain(true)` output? (pasted in this gist: <https://gist.github.com/mnmami/387c24de5dca86c3b8efe170965e9dcf>). Is there a way to enforce the given join order? I'm using Spark 2.1, so I think it doesn't include the new cost-based optimizations introduced in Spark 2.2.

Regards,

Mohamed Nadjib Mami
Research Associate @ Fraunhofer IAIS - PhD Student @ Bonn University
About me! <http://mohamednadjibmami.com>
LinkedIn <http://fr.linkedin.com/in/mohamednadjibmami/>