Hi everyone,
Let's assume I have a complex workflow with more than 10 data sources as input
and around 20 computations (some creating intermediate datasets, some merging
everything for the final computation), where some steps take about 1 minute to
complete and others take more than 30 minutes.

What would be, in your view, the best strategy to port this to Apache Spark?

   - Transform the whole flow into a Spark Job (PySpark or Scala)
   - Transform only part of the flow (the heavy-lifting ~30 min parts),
   using the same language (PySpark) - see the sketch below
   - Transform only part of the flow and pipe the rest from Scala to Python
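
To make the second option concrete, here is a rough sketch of what porting a
single heavy step to PySpark could look like, with the rest of the flow left
untouched and data exchanged through files. The paths, column names, and
aggregation are purely hypothetical placeholders, not my actual computations:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("heavy-step").getOrCreate()

    # Read only the inputs this heavy step needs (paths are hypothetical).
    orders = spark.read.parquet("/data/intermediate/orders")
    events = spark.read.parquet("/data/intermediate/events")

    # The heavy lifting: a large join followed by an aggregation.
    result = (
        orders.join(events, on="customer_id", how="inner")
              .groupBy("customer_id")
              .agg(F.sum("amount").alias("total_amount"),
                   F.count("*").alias("event_count"))
    )

    # Write an intermediate dataset that the rest of the existing
    # (non-Spark) flow reads back.
    result.write.mode("overwrite").parquet("/data/intermediate/heavy_step_output")

    spark.stop()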

Regards,

Olivier.
