Hi, I have come across ways of building input/transform/output pipelines in Java (Google Dataflow, Spark, etc.). I also understand that Spark itself provides a way to create a pipeline within MLlib for ML transforms (primarily fit/transform). Both of the above are available in Java/Scala environments, and the latter is supported on Python as well.
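For context, my mental model of the MLlib-style pipeline is the fit/transform chaining pattern. A minimal plain-Python sketch of that pattern (the class and method names below are my own illustration, not the actual pyspark.ml API):

```python
# Sketch of the MLlib-style estimator/transformer pattern in plain Python.
# "Scaler" and "Pipeline" are illustrative names only, not pyspark.ml classes.

class Scaler:
    """Estimator: fit() learns a scale factor from the data."""
    def fit(self, data):
        self.scale = max(data) or 1
        return self

    def transform(self, data):
        return [x / self.scale for x in data]

class Pipeline:
    """Chains stages: fit each one, feed its output to the next."""
    def __init__(self, stages):
        self.stages = stages

    def fit_transform(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return data

pipe = Pipeline([Scaler()])
print(pipe.fit_transform([1, 2, 4]))  # scaled by the max: [0.25, 0.5, 1.0]
```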
However, if my understanding is correct, MLlib pipelines do not cover the complete dataflow for non-ML scenarios (e.g. I/O transforms, DataFrame/graph transforms). Correct me if otherwise. I would like to know the best way to create a Spark dataflow pipeline in a generic way.

My use case: I have input files in different formats that I would like to convert to RDDs, then build DataFrame transforms on them, and finally stream or store the results. I hope to avoid disk I/O between pipeline stages. I also came across Luigi (http://luigi.readthedocs.org/en/latest/) on Python, but I found that it stores intermediate contents on disk and reloads them for the next phase of the pipeline.

I would appreciate it if you could share your thoughts.

Regards,
Suraj
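To make the use case concrete, here is a minimal plain-Python sketch of the kind of pipeline I am after: format-specific readers normalize each input into a common in-memory record form, generator-based transforms are chained so each stage consumes the previous one's output directly, and nothing is spilled to disk between stages. All function names and the record schema here are my own illustration:

```python
import csv, io, json

# Hypothetical format-specific readers that normalize each input
# into a common record (dict) form entirely in memory.
def read_csv(text):
    return list(csv.DictReader(io.StringIO(text)))

def read_json_lines(text):
    return [json.loads(line) for line in text.splitlines() if line]

# Generator-based transforms: each stage consumes the previous one's
# output lazily, so no intermediate results are written to disk.
def parse_amount(records):
    for r in records:
        r["amount"] = float(r["amount"])
        yield r

def only_large(records, threshold=10.0):
    return (r for r in records if r["amount"] > threshold)

def pipeline(records, *stages):
    """Compose transforms left to right over one in-memory stream."""
    for stage in stages:
        records = stage(records)
    return records

# Two inputs in different formats, merged into one record stream.
csv_input = "id,amount\n1,5\n2,20\n"
json_input = '{"id": "3", "amount": "30"}\n'

records = read_csv(csv_input) + read_json_lines(json_input)
result = list(pipeline(records, parse_amount, only_large))
print([r["id"] for r in result])  # records with amount > 10: ['2', '3']
```

In Spark terms I imagine the readers would become RDD/DataFrame sources and the generator stages would become DataFrame transformations, with Spark's lazy evaluation playing the role of the generators.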