I'm very much in favor of this; the less porting work there is, the better :)

On Tue, Apr 5, 2016 at 5:32 PM, Joseph Bradley <jos...@databricks.com> wrote:

> +1
> By the way, the JIRA for tracking (Scala) API parity is:
> https://issues.apache.org/jira/browse/SPARK-4591
>
> On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>
>> This sounds good to me as well. The one thing we should pay attention to
>> is how we update the docs so that people know to start with the spark.ml
>> classes. Right now the docs list spark.mllib first and also seem more
>> comprehensive in that area than in spark.ml, so maybe people naturally
>> move towards that.
>>
>> Matei
>>
>> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng <m...@databricks.com> wrote:
>>
>> Yes, DB (cc'ed) is working on porting the local linear algebra library
>> over (SPARK-13944). There are also frequent pattern mining algorithms we
>> need to port over in order to reach feature parity. -Xiangrui
>>
>> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:
>>
>>> Overall this sounds good to me. One question I have is that in
>>> addition to the ML algorithms we have a number of linear algebra
>>> (various distributed matrices) and statistical methods in the
>>> spark.mllib package. Is the plan to port or move these to the spark.ml
>>> namespace in the 2.x series?
>>>
>>> Thanks
>>> Shivaram
>>>
>>> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen <so...@cloudera.com> wrote:
>>> > FWIW, all of that sounds like a good plan to me. Developing one API is
>>> > certainly better than two.
>>> >
>>> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>> >> Hi all,
>>> >>
>>> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API
>>> >> built on top of Spark SQL’s DataFrames. Since then the new
>>> >> DataFrame-based API has been developed under the spark.ml package,
>>> >> while the old RDD-based API has been developed in parallel under the
>>> >> spark.mllib package. While it was easier to implement and experiment
>>> >> with new APIs under a new package, it became harder and harder to
>>> >> maintain as both packages grew bigger and bigger. And new users are
>>> >> often confused by having two sets of APIs with overlapped functions.
>>> >>
>>> >> We started to recommend the DataFrame-based API over the RDD-based
>>> >> API in Spark 1.5 for its versatility and flexibility, and we saw the
>>> >> development and the usage gradually shifting to the DataFrame-based
>>> >> API. Just counting the lines of Scala code, from 1.5 to the current
>>> >> master we added ~10000 lines to the DataFrame-based API while ~700 to
>>> >> the RDD-based API. So, to gather more resources on the development of
>>> >> the DataFrame-based API and to help users migrate over sooner, I want
>>> >> to propose switching RDD-based MLlib APIs to maintenance mode in
>>> >> Spark 2.0. What does it mean exactly?
>>> >>
>>> >> * We do not accept new features in the RDD-based spark.mllib package,
>>> >>   unless they block implementing new features in the DataFrame-based
>>> >>   spark.ml package.
>>> >> * We still accept bug fixes in the RDD-based API.
>>> >> * We will add more features to the DataFrame-based API in the 2.x
>>> >>   series to reach feature parity with the RDD-based API.
>>> >> * Once we reach feature parity (possibly in Spark 2.2), we will
>>> >>   deprecate the RDD-based API.
>>> >> * We will remove the RDD-based API from the main Spark repo in Spark
>>> >>   3.0.
>>> >>
>>> >> Though the RDD-based API is already in de facto maintenance mode, this
>>> >> announcement will make it clear and hence important to both MLlib
>>> >> developers and users. So we’d greatly appreciate your feedback!
>>> >>
>>> >> (As a side note, people sometimes use “Spark ML” to refer to the
>>> >> DataFrame-based API or even the entire MLlib component. This also
>>> >> causes confusion. To be clear, “Spark ML” is not an official name and
>>> >> there are no plans to rename MLlib to “Spark ML” at this time.)
>>> >>
>>> >> Best,
>>> >> Xiangrui
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> > For additional commands, e-mail: user-h...@spark.apache.org
>>> >
>>>
>>
>

--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau