Hi everyone,
Let's assume I have a complex workflow with more than 10 data sources as input
and around 20 computations (some creating intermediate datasets, some merging
everything for the final computation), where some steps take about 1 minute to
complete and others take more than 30 minutes.

What would be, in your view, the best strategy to port this to Apache Spark?

   - Transform the whole flow into a Spark Job (PySpark or Scala)
   - Transform only part of the flow (the heavy-lifting ~30 min parts),
   using the same language (PySpark) - see the sketch below
   - Transform only part of the flow and pipe the rest from Scala to Python
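
To make the second option concrete, here is a rough sketch of what porting a
single heavy step to PySpark could look like, with the rest of the flow left
untouched and data exchanged through files. The paths, column names, and
aggregation are purely hypothetical placeholders, not my actual computations:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("heavy-step").getOrCreate()

    # Read only the inputs this heavy step needs (paths are hypothetical).
    orders = spark.read.parquet("/data/intermediate/orders")
    events = spark.read.parquet("/data/intermediate/events")

    # The heavy lifting: a large join followed by an aggregation.
    result = (
        orders.join(events, on="customer_id", how="inner")
              .groupBy("customer_id")
              .agg(F.sum("amount").alias("total_amount"),
                   F.count("*").alias("event_count"))
    )

    # Write an intermediate dataset that the rest of the existing
    # (non-Spark) flow reads back.
    result.write.mode("overwrite").parquet("/data/intermediate/heavy_step_output")

    spark.stop()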

Regards,

Olivier.
