That's quite informative, Michal. Though I didn't read the first few slides, which are not in English.
On Sat, Mar 26, 2016 at 6:12 AM, Michał Zieliński <zielinski.mich...@gmail.com> wrote:

> Ted,
>
> Sure. This was presented by my colleague during the Data Science London
> meetup. The talk was about "Scalable Predictive Pipelines with Spark &
> Scala". Links to the meetup and slides below:
>
> http://www.meetup.com/Data-Science-London/events/229755935/
> http://files.meetup.com/3183732/Scalable%20Predictive%20Pipelines%20with%20Spark%20and%20Scala.pdf
>
> ---------- Forwarded message ----------
> From: Ted Yu <yuzhih...@gmail.com>
> Date: 26 March 2016 at 12:51
> Subject: Re: Any plans to migrate Transformer API to Spark SQL (closer to DataFrames)?
> To: Michał Zieliński <zielinski.mich...@gmail.com>
>
> Michal:
> Can you share the slide deck?
>
> Thanks
>
> On Mar 26, 2016, at 4:10 AM, Michał Zieliński <zielinski.mich...@gmail.com> wrote:
>
> The Spark ML Pipelines API (not just Transformers -- Estimators and custom
> Pipeline classes as well) is definitely not machine-learning specific.
>
> We use it heavily in our development. We're building machine-learning
> pipelines, *BUT* many steps involve joining, schema manipulation, and
> pre/post-processing of data for the actual statistical algorithm, with a
> monoidal architecture (I have a slide deck if you're interested).
>
> The Pipelines API is a powerful abstraction that makes things very easy
> for us. It isn't always perfect (imho transformSchema is a little bit of
> a mess; maybe the future Dataset API will help), but it makes our
> pipelines very customisable and pluggable (you can add/swap/remove any
> PipelineStage at any point).
>
> On 26 March 2016 at 09:26, Jacek Laskowski <ja...@japila.pl> wrote:
>
>> Hi Joseph,
>>
>> Thanks for the response. I'm one of those who don't yet understand all
>> the hype around/need for machine learning, and I'm looking at the ML
>> space through Spark ML(lib) glasses. In the meantime I've had a few
>> assignments (in a project with Spark and Scala) that required quite
>> extensive dataset manipulation.
>>
>> That was when I sank into using DataFrame/Dataset for data manipulation
>> rather than RDD (I remember talking to Brian about how RDD is an
>> "assembly" language compared to the higher-level concept of DataFrames
>> with Catalyst and other optimizations). After a few days with DataFrames
>> I learnt he was so right! (sorry Brian, it took me longer to understand
>> your point).
>>
>> I started using DataFrames in far more places than one could ever
>> accept :-) I was so...carried away with DataFrames (esp. show vs
>> foreach(println) and UDFs via the udf() function).
>>
>> And then I moved to the Pipeline API and discovered Transformers, and
>> PipelineStage, which lets you build pipelines of DataFrame manipulation.
>> They read so well that I'm pretty sure people would love using them more
>> often, but...they belong to MLlib, so they are part of the ML space
>> (which not many devs have tackled yet). I applied the approach of using
>> withColumn to have a better debugging experience (if I ever need it). I
>> learnt it after watching your presentation about the Pipeline API. It
>> was so helpful in my RDD/DataFrame space.
>>
>> So, to promote a more extensive use of Pipelines, PipelineStages, and
>> Transformers, I was thinking about moving that part to the SQL/DataFrame
>> API, where they really belong. If not, I think people might miss the
>> beauty of the very fine and so helpful Transformers.
>>
>> Transformers are *not* an ML thing -- they are a DataFrame thing and
>> should live where they really belong (for their greater adoption).
>>
>> What do you think?
>>
>> Regards,
>> Jacek Laskowski
>> ----
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
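To make the idea concrete, here is a minimal sketch of a custom Transformer that does nothing but DataFrame manipulation -- no ML involved. It assumes the Spark 1.6-era spark.ml API (transform still takes a DataFrame); the class and column names (TrimAndUpper, name, name_upper) are made up for illustration, and the columns are hard-coded rather than exposed as Params:

    import org.apache.spark.ml.Transformer
    import org.apache.spark.ml.param.ParamMap
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, udf}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Hypothetical stage: trims a string column and adds an upper-cased copy.
    // A real stage would expose inputCol/outputCol as Params.
    class TrimAndUpper(override val uid: String) extends Transformer {

      def this() = this(Identifiable.randomUID("trimAndUpper"))

      val inputCol = "name"
      val outputCol = "name_upper"

      override def transform(dataset: DataFrame): DataFrame = {
        val trimUpper = udf { s: String =>
          if (s == null) null else s.trim.toUpperCase
        }
        dataset.withColumn(outputCol, trimUpper(col(inputCol)))
      }

      // transformSchema only declares the schema change; no data is touched.
      override def transformSchema(schema: StructType): StructType =
        StructType(schema.fields :+ StructField(outputCol, StringType, nullable = true))

      override def copy(extra: ParamMap): TrimAndUpper = defaultCopy(extra)
    }

Nothing in this sketch is ML-specific: transform is plain withColumn/udf work, and transformSchema is only the schema bookkeeping that the thread calls a little messy.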
>> On Sat, Mar 26, 2016 at 3:23 AM, Joseph Bradley <jos...@databricks.com> wrote:
>> > There have been some comments about using Pipelines outside of ML, but I
>> > have not yet seen a real need for it. If a user does want to use
>> > Pipelines for non-ML tasks, they can still use Transformers +
>> > PipelineModels. Will that work?
>> >
>> > On Fri, Mar 25, 2016 at 8:05 AM, Jacek Laskowski <ja...@japila.pl> wrote:
>> >>
>> >> Hi,
>> >>
>> >> After a few weeks with spark.ml, I have come to the conclusion that the
>> >> Transformer concept from the Pipeline API (spark.ml/MLlib) should be
>> >> part of DataFrame (SQL), where it fits better. Are there any plans to
>> >> migrate the Transformer API (ML) to DataFrame (SQL)?
>> >>
>> >> Regards,
>> >> Jacek Laskowski
>> >> ----
>> >> https://medium.com/@jaceklaskowski/
>> >> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> >> Follow me at https://twitter.com/jaceklaskowski
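As a usage sketch of Joseph's point that Transformers + PipelineModels already cover non-ML tasks: a pipeline built only from Transformers can be fit and applied like any other. This reuses the hypothetical TrimAndUpper stage sketched earlier and assumes a Spark 1.6-style sqlContext (as in the spark-shell) with a registered "people" table that has a string column "name":

    import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
    import org.apache.spark.sql.DataFrame

    // Assumed: sqlContext and a hypothetical "people" table.
    val people: DataFrame = sqlContext.table("people")

    // A Pipeline made only of Transformers: fit() has nothing to estimate and
    // simply returns a PipelineModel that replays the same DataFrame steps.
    val stages: Array[PipelineStage] = Array(new TrimAndUpper())
    val pipeline = new Pipeline().setStages(stages)
    val model: PipelineModel = pipeline.fit(people)

    model.transform(people).show()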