Hi, I have come across ways of building input/transform/output pipelines in Java (Google Dataflow, Spark, etc.). I also understand that Spark itself provides a way to create a pipeline within MLlib for ML transforms (primarily fit/transform). Both of the above are available in Java/Scala environments, and the latter is supported on Python as well.
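For context, my mental model of the MLlib-style pipeline is the fit/transform chaining pattern. A minimal plain-Python sketch of that pattern (the class and method names below are my own illustration, not the actual pyspark.ml API):

```python
# Sketch of the MLlib-style estimator/transformer pattern in plain Python.
# "Scaler" and "Pipeline" are illustrative names only, not pyspark.ml classes.

class Scaler:
    """Estimator: fit() learns a scale factor from the data."""
    def fit(self, data):
        self.scale = max(data) or 1
        return self

    def transform(self, data):
        return [x / self.scale for x in data]

class Pipeline:
    """Chains stages: fit each one, feed its output to the next."""
    def __init__(self, stages):
        self.stages = stages

    def fit_transform(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return data

pipe = Pipeline([Scaler()])
print(pipe.fit_transform([1, 2, 4]))  # scaled by the max: [0.25, 0.5, 1.0]
```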
However, if my understanding is correct, MLlib pipelines do not cover the complete dataflow for non-ML scenarios (e.g. I/O transforms, DataFrame/graph transforms). Correct me if otherwise. I would like to know the best way to create a Spark dataflow pipeline in a generic way.

My use case: I have input files in different formats that I would like to convert to RDDs, then build DataFrame transforms on them, and finally stream or store the results. I hope to avoid disk I/O between pipeline stages. I also came across Luigi (http://luigi.readthedocs.org/en/latest/) on Python, but I found that it stores intermediate contents on disk and reloads them for the next phase of the pipeline.

I would appreciate it if you could share your thoughts.

Regards,
Suraj
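To make the use case concrete, here is a minimal plain-Python sketch of the kind of pipeline I am after: format-specific readers normalize each input into a common in-memory record form, generator-based transforms are chained so each stage consumes the previous one's output directly, and nothing is spilled to disk between stages. All function names and the record schema here are my own illustration:

```python
import csv, io, json

# Hypothetical format-specific readers that normalize each input
# into a common record (dict) form entirely in memory.
def read_csv(text):
    return list(csv.DictReader(io.StringIO(text)))

def read_json_lines(text):
    return [json.loads(line) for line in text.splitlines() if line]

# Generator-based transforms: each stage consumes the previous one's
# output lazily, so no intermediate results are written to disk.
def parse_amount(records):
    for r in records:
        r["amount"] = float(r["amount"])
        yield r

def only_large(records, threshold=10.0):
    return (r for r in records if r["amount"] > threshold)

def pipeline(records, *stages):
    """Compose transforms left to right over one in-memory stream."""
    for stage in stages:
        records = stage(records)
    return records

# Two inputs in different formats, merged into one record stream.
csv_input = "id,amount\n1,5\n2,20\n"
json_input = '{"id": "3", "amount": "30"}\n'

records = read_csv(csv_input) + read_json_lines(json_input)
result = list(pipeline(records, parse_amount, only_large))
print([r["id"] for r in result])  # records with amount > 10: ['2', '3']
```

In Spark terms I imagine the readers would become RDD/DataFrame sources and the generator stages would become DataFrame transformations, with Spark's lazy evaluation playing the role of the generators.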