Hi everyone,

Let's assume I have a complex workflow with more than 10 data sources as input and around 20 computations, some of which create intermediary datasets and some of which merge everything for the final computation. Some steps take about 1 minute on average to complete, and some take more than 30 minutes.
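Schematically, the current flow looks roughly like the sketch below. The file paths, function names, and columns are only placeholders to show the shape of the pipeline (a few fast steps, one heavy join/aggregation, and a final merge); the real flow is plain Python today, and I am using pandas here just to keep the example short:

import pandas as pd

# Placeholder names only; the real flow has 10+ sources and ~20 steps.
def load_sources():
    src_a = pd.read_csv("source_a.csv")
    src_b = pd.read_parquet("source_b.parquet")
    src_c = pd.read_parquet("source_c.parquet")
    return src_a, src_b, src_c

def fast_step(df):
    # One of the ~1 minute computations: light filtering and a derived column.
    return df[df["status"] == "ok"].assign(
        amount_eur=lambda d: d["amount"] * d["fx_rate"])

def heavy_step(left, right):
    # One of the >30 minute computations: a large join plus aggregation.
    merged = left.merge(right, on="customer_id", how="inner")
    return merged.groupby("customer_id", as_index=False).agg(
        total=("amount_eur", "sum"))

def final_merge(*intermediates):
    # Merges all intermediary datasets for the final computation.
    out = intermediates[0]
    for df in intermediates[1:]:
        out = out.merge(df, on="customer_id", how="outer")
    return out

if __name__ == "__main__":
    a, b, c = load_sources()
    inter_1 = fast_step(a)
    inter_2 = heavy_step(inter_1, b)   # the kind of step that takes 30+ minutes
    inter_3 = fast_step(c)
    final_merge(inter_2, inter_3).to_parquet("final_result.parquet")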
What would be, in your opinion, the best strategy to port this to Apache Spark?
- Transform the whole flow into a Spark job (PySpark or Scala)
- Transform only part of the flow (the heavy-lifting ~30 minute parts), staying in the same language (PySpark)
- Transform only part of the flow and pipe the rest from Scala to Python

A rough sketch of what I have in mind for the second option is below my signature.

Regards,
Olivier
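To make the second option concrete, here is a minimal sketch of how I imagine splicing Spark into just the heavy step. It assumes the surrounding fast steps keep exchanging data through Parquet files; the paths, column names, and app name are placeholders, not my actual code:

from pyspark.sql import SparkSession, functions as F

def heavy_step(spark):
    # The >30 minute join/aggregation, executed entirely by Spark.
    left = spark.read.parquet("intermediate_1.parquet")   # written by an upstream fast step
    right = spark.read.parquet("source_b.parquet")
    merged = left.join(right, on="customer_id", how="inner")
    return merged.groupBy("customer_id").agg(F.sum("amount_eur").alias("total"))

if __name__ == "__main__":
    spark = SparkSession.builder.appName("heavy-step-only").getOrCreate()
    # Only this step is ported; the surrounding 1-minute steps stay as
    # ordinary Python and simply read the Parquet output written here.
    heavy_step(spark).write.mode("overwrite").parquet("intermediate_2.parquet")
    spark.stop()

The attraction of this split is that only the expensive steps change, but I am unsure whether the Parquet round-trips between the existing code and Spark negate the benefit compared with rewriting the whole DAG as a single Spark job, hence the question.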