+1

Andy
From: darren <dar...@ontrenet.com>
Date: Thursday, November 12, 2015 at 12:34 PM
To: "Kothuvatiparambil, Viju" <viju.kothuvatiparam...@bankofamerica.com>, DB Tsai <dbt...@dbtsai.com>, Sean Owen <so...@cloudera.com>
Cc: Felix Cheung <felixcheun...@hotmail.com>, Nirmal Fernando <nir...@wso2.com>, Andrew Davidson <a...@santacruzintegration.com>, Adrian Tanase <atan...@adobe.com>, "user @spark" <user@spark.apache.org>, Xiangrui Meng <men...@gmail.com>, "hol...@pigscanfly.ca" <hol...@pigscanfly.ca>
Subject: RE: thought experiment: use spark ML to real time prediction

> I agree 100%. Making the model requires large data and many CPUs.
>
> Using it does not.
>
> This is a very useful side effect of ML models.
>
> If MLlib can't use models outside Spark, that's a real shame.
>
>
> Sent from my Verizon Wireless 4G LTE smartphone
>
>
> -------- Original message --------
> From: "Kothuvatiparambil, Viju" <viju.kothuvatiparam...@bankofamerica.com>
> Date: 11/12/2015 3:09 PM (GMT-05:00)
> To: DB Tsai <dbt...@dbtsai.com>, Sean Owen <so...@cloudera.com>
> Cc: Felix Cheung <felixcheun...@hotmail.com>, Nirmal Fernando
> <nir...@wso2.com>, Andy Davidson <a...@santacruzintegration.com>, Adrian
> Tanase <atan...@adobe.com>, "user @spark" <user@spark.apache.org>, Xiangrui
> Meng <men...@gmail.com>, hol...@pigscanfly.ca
> Subject: RE: thought experiment: use spark ML to real time prediction
>
> I am glad to see DB's comments; they make me feel I am not the only one
> facing these issues. If we were able to use MLlib to load the model in web
> applications (outside the Spark cluster), that would solve the issue. I
> understand Spark is mainly for processing big data in a distributed mode.
> But there is no purpose in training a model using MLlib if we are not able
> to use it in the applications that need to access the model.
>
> Thanks
> Viju
>
> From: DB Tsai [mailto:dbt...@dbtsai.com]
> Sent: Thursday, November 12, 2015 11:04 AM
> To: Sean Owen
> Cc: Felix Cheung; Nirmal Fernando; Andy Davidson; Adrian Tanase; user @spark;
> Xiangrui Meng; hol...@pigscanfly.ca
> Subject: Re: thought experiment: use spark ML to real time prediction
>
>
> I think the use case can be quite different from PMML.
>
> Having an ML jar that is independent of the Spark platform can empower
> users to do the following:
>
> 1) PMML doesn't contain all the models we have in MLlib. Also, for an ML
> pipeline trained by Spark, most of the time PMML is not expressive enough to
> do all the transformations we have in Spark ML. As a result, if we are able
> to serialize the entire Spark ML pipeline after training, and then load it
> back in an app without any Spark platform for production scoring, this will
> be very useful for production deployment of Spark ML models. The only issue
> is that if a transformer involves a shuffle, we need to figure out a way to
> handle it. When I chatted with Xiangrui about this, he suggested that we may
> tag whether a transformer is shuffle ready. Currently, at Netflix, we are not
> able to use ML pipelines because of those issues, and we have to write our
> own scorers in production, which duplicates a lot of work.
>
> 2) If users can use Spark's linear algebra code, like vector or matrix, in
> their application, this will be very useful. It can help share code between
> the Spark training pipeline and production deployment. Also, lots of good
> stuff in Spark's MLlib doesn't depend on the Spark platform, and people could
> use it in their applications without pulling in lots of dependencies. In
> fact, in my project, I have to copy and paste code from MLlib into my project
> to use those goodies in apps.
>
> 3) Currently, MLlib depends on GraphX, which means that in GraphX there is no
> way to use MLlib's vector or matrix. And