That's quite informative, Michal. Though I didn't read the first few slides, which are not in English.
On Sat, Mar 26, 2016 at 6:12 AM, Michał Zieliński <zielinski.mich...@gmail.com> wrote:

> Ted,
>
> Sure. This was presented by my colleague during the Data Science London
> meetup. The talk was about "Scalable Predictive Pipelines with Spark &
> Scala". Links to the meetup and slides below:
>
> http://www.meetup.com/Data-Science-London/events/229755935/
> http://files.meetup.com/3183732/Scalable%20Predictive%20Pipelines%20with%20Spark%20and%20Scala.pdf
>
> ---------- Forwarded message ----------
> From: Ted Yu <yuzhih...@gmail.com>
> Date: 26 March 2016 at 12:51
> Subject: Re: Any plans to migrate Transformer API to Spark SQL (closer to DataFrames)?
> To: Michał Zieliński <zielinski.mich...@gmail.com>
>
> Michal:
> Can you share the slide deck?
>
> Thanks
>
> On Mar 26, 2016, at 4:10 AM, Michał Zieliński <zielinski.mich...@gmail.com> wrote:
>
> The Spark ML Pipelines API (not just Transformers -- Estimators and custom
> Pipeline classes as well) is definitely not machine-learning specific.
>
> We use it heavily in our development. We're building machine-learning
> pipelines, *BUT* many steps involve joining, schema manipulation, and
> pre/post-processing of data for the actual statistical algorithm, with a
> monoidal architecture (I have a slide deck if you're interested).
>
> The Pipelines API is a powerful abstraction that makes things very easy
> for us. It isn't always perfect (imho transformSchema is a little bit of
> a mess; maybe the future Dataset API will help), but it makes our
> pipelines very customisable and pluggable (you can add/swap/remove any
> PipelineStage at any point).
>
> On 26 March 2016 at 09:26, Jacek Laskowski <ja...@japila.pl> wrote:
>
>> Hi Joseph,
>>
>> Thanks for the response. I'm one of those who don't yet understand all
>> the hype around/need for machine learning, and I'm looking at the ML
>> space through Spark ML(lib) glasses. In the meantime I've had a few
>> assignments (in a project with Spark and Scala) that required quite
>> extensive dataset manipulation.
>>
>> That was when I sank into using DataFrame/Dataset for data manipulation
>> rather than RDD (I remember talking to Brian about how RDD is an
>> "assembly" language compared to the higher-level concept of DataFrames
>> with Catalyst and other optimizations). After a few days with DataFrames
>> I learnt he was so right! (sorry Brian, it took me longer to understand
>> your point).
>>
>> I started using DataFrames in far more places than one could ever
>> accept :-) I was so...carried away with DataFrames (esp. show vs
>> foreach(println) and UDFs via the udf() function).
>>
>> And then I moved to the Pipeline API and discovered Transformers, and
>> PipelineStage, which lets you build pipelines of DataFrame manipulation.
>> They read so well that I'm pretty sure people would love using them more
>> often, but...they belong to MLlib, so they are part of the ML space
>> (which not many devs have tackled yet). I applied the approach of using
>> withColumn to have a better debugging experience (if I ever need it). I
>> learnt it after watching your presentation about the Pipeline API. It
>> was so helpful in my RDD/DataFrame space.
>>
>> So, to promote a more extensive use of Pipelines, PipelineStages, and
>> Transformers, I was thinking about moving that part to the SQL/DataFrame
>> API, where they really belong. If not, I think people might miss the
>> beauty of the very fine and so helpful Transformers.
>>
>> Transformers are *not* an ML thing -- they are a DataFrame thing and
>> should live where they really belong (for their greater adoption).
>>
>> What do you think?
>>
>> Regards,
>> Jacek Laskowski
>> ----
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
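To make the idea concrete, here is a minimal sketch of a custom Transformer that does nothing but DataFrame manipulation -- no ML involved. It assumes the Spark 1.6-era spark.ml API (transform still takes a DataFrame); the class and column names (TrimAndUpper, name, name_upper) are made up for illustration, and the columns are hard-coded rather than exposed as Params:

    import org.apache.spark.ml.Transformer
    import org.apache.spark.ml.param.ParamMap
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, udf}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Hypothetical stage: trims a string column and adds an upper-cased copy.
    // A real stage would expose inputCol/outputCol as Params.
    class TrimAndUpper(override val uid: String) extends Transformer {

      def this() = this(Identifiable.randomUID("trimAndUpper"))

      val inputCol = "name"
      val outputCol = "name_upper"

      override def transform(dataset: DataFrame): DataFrame = {
        val trimUpper = udf { s: String =>
          if (s == null) null else s.trim.toUpperCase
        }
        dataset.withColumn(outputCol, trimUpper(col(inputCol)))
      }

      // transformSchema only declares the schema change; no data is touched.
      override def transformSchema(schema: StructType): StructType =
        StructType(schema.fields :+ StructField(outputCol, StringType, nullable = true))

      override def copy(extra: ParamMap): TrimAndUpper = defaultCopy(extra)
    }

Nothing in this sketch is ML-specific: transform is plain withColumn/udf work, and transformSchema is only the schema bookkeeping that the thread calls a little messy.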
>> On Sat, Mar 26, 2016 at 3:23 AM, Joseph Bradley <jos...@databricks.com> wrote:
>> > There have been some comments about using Pipelines outside of ML, but I
>> > have not yet seen a real need for it. If a user does want to use
>> > Pipelines for non-ML tasks, they can still use Transformers +
>> > PipelineModels. Will that work?
>> >
>> > On Fri, Mar 25, 2016 at 8:05 AM, Jacek Laskowski <ja...@japila.pl> wrote:
>> >>
>> >> Hi,
>> >>
>> >> After a few weeks with spark.ml, I have come to the conclusion that the
>> >> Transformer concept from the Pipeline API (spark.ml/MLlib) should be
>> >> part of DataFrame (SQL), where it fits better. Are there any plans to
>> >> migrate the Transformer API (ML) to DataFrame (SQL)?
>> >>
>> >> Regards,
>> >> Jacek Laskowski
>> >> ----
>> >> https://medium.com/@jaceklaskowski/
>> >> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> >> Follow me at https://twitter.com/jaceklaskowski
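As a usage sketch of Joseph's point that Transformers + PipelineModels already cover non-ML tasks: a pipeline built only from Transformers can be fit and applied like any other. This reuses the hypothetical TrimAndUpper stage sketched earlier and assumes a Spark 1.6-style sqlContext (as in the spark-shell) with a registered "people" table that has a string column "name":

    import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
    import org.apache.spark.sql.DataFrame

    // Assumed: sqlContext and a hypothetical "people" table.
    val people: DataFrame = sqlContext.table("people")

    // A Pipeline made only of Transformers: fit() has nothing to estimate and
    // simply returns a PipelineModel that replays the same DataFrame steps.
    val stages: Array[PipelineStage] = Array(new TrimAndUpper())
    val pipeline = new Pipeline().setStages(stages)
    val model: PipelineModel = pipeline.fit(people)

    model.transform(people).show()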