Re: thought experiment: use spark ML to real time prediction

Nick Pentreath Fri, 27 Nov 2015 00:45:55 -0800

Thanks for that link Vincenzo. PFA definitely seems interesting - though I
see it is quite wide in scope, almost like its own mini math/programming
language.


Do you know if there are any reference implementations in code? I don't see
any on the web site or the DMG github.

On Sun, Nov 22, 2015 at 2:24 PM, Vincenzo Selvaggio <vselvag...@gmail.com>
wrote:

> The Data Mining Group (http://dmg.org/) that created PMML are working on
> a new standard called PFA that indeed uses JSON documents, see
> http://dmg.org/pfa/docs/motivation/ for details.
>
> PFA could be the answer to your option c.
>
> Regards,
> Vincenzo
>
>
> On Wed, Nov 18, 2015 at 12:03 PM, Nick Pentreath <nick.pentre...@gmail.com
> > wrote:
>
>> One such "lightweight PMML in JSON" is here -
>> https://github.com/bigmlcom/json-pml. At least for the schema
>> definitions. But nothing available in terms of evaluation/scoring. Perhaps
>> this is something that can form a basis for such a new undertaking.
>>
>> I agree that distributed models are only really applicable in the case of
>> massive scale factor models - and then anyway for latency purposes one
>> needs to use LSH or something similar to achieve sufficiently real-time
>> performance. These days one can easily spin up a single very powerful
>> server to handle even very large models.
>>
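A minimal sketch of the LSH idea mentioned above, assuming random-hyperplane hashing over item factor vectors. The function names (`make_hyperplanes`, `signature`, `build_index`, `candidates`) are hypothetical and not taken from any library. Note that this variant approximates cosine similarity rather than the raw inner product, so it is only one possible reading of "LSH or something similar".

```python
import random


def make_hyperplanes(num_planes, dim, seed=0):
    # Each hyperplane is a random Gaussian normal vector of length `dim`.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]


def signature(vector, hyperplanes):
    # One bit per hyperplane: which side of the plane the vector falls on.
    return tuple(
        1 if sum(h * v for h, v in zip(plane, vector)) >= 0.0 else 0
        for plane in hyperplanes
    )


def build_index(item_factors, hyperplanes):
    # Group item ids by signature so that a lookup only touches one bucket.
    buckets = {}
    for item_id, factors in item_factors.items():
        buckets.setdefault(signature(factors, hyperplanes), []).append(item_id)
    return buckets


def candidates(user_factors, buckets, hyperplanes):
    # Items sharing the user's signature are the candidates to score exactly.
    return buckets.get(signature(user_factors, hyperplanes), [])
```

Only the returned candidates would then be scored exactly, which is where the latency saving comes from. In practice several hash tables are usually combined to improve recall.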
>> On Tue, Nov 17, 2015 at 11:34 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>>
>>> I was thinking about working on a better version of PMML, "JMML" in JSON,
>>> but as you said, this requires a dedicated team to define the standard,
>>> which would be a huge amount of work. However, options b) and c) still
>>> don't address the distributed models issue. In fact, most models in
>>> production have to be small enough to return a result to users within
>>> reasonable latency, so I doubt the usefulness of distributed models in
>>> real production use cases. For R and Python, we can build a wrapper on top
>>> of the lightweight "spark-ml-common" project.
>>>
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> ----------------------------------------------------------
>>> Web: https://www.dbtsai.com
>>> PGP Key ID: 0xAF08DF8D
>>>
>>> On Tue, Nov 17, 2015 at 2:29 AM, Nick Pentreath <
>>> nick.pentre...@gmail.com> wrote:
>>>
>>>> I think the issue with pulling in all of spark-core is often with
>>>> dependencies (and versions) conflicting with the web framework (or Akka in
>>>> many cases). Plus it really is quite heavy if you just want a fairly
>>>> lightweight model-serving app. For example we've built a fairly simple but
>>>> scalable ALS factor model server on Scalatra, Akka and Breeze. So all you
>>>> really need is the web framework and Breeze (or an alternative linear
>>>> algebra lib).
>>>>
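A minimal sketch of why a factor model server needs little more than linear algebra: scoring is a dot product between a user vector and each item vector, followed by a top-k selection. This is plain Python for illustration only and is not the Scalatra/Akka/Breeze server described above. The names `dot` and `top_k` are hypothetical.

```python
import heapq


def dot(u, v):
    # Inner product of two equal-length factor vectors.
    return sum(a * b for a, b in zip(u, v))


def top_k(user_vector, item_factors, k):
    # Score every item against the user vector and keep the k best item ids.
    scored = ((dot(user_vector, f), item_id) for item_id, f in item_factors.items())
    return [item_id for _, item_id in heapq.nlargest(k, scored)]
```

A web framework handler would look up the user's vector, call something like `top_k`, and return the ids. No cluster runtime is involved at serving time.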
>>>> I definitely hear the pain-point that PMML might not be able to handle
>>>> some types of transformations or models that exist in Spark. However,
>>>> here's an example from scikit-learn -> PMML that may be instructive (
>>>> https://github.com/scikit-learn/scikit-learn/issues/1596 and
>>>> https://github.com/jpmml/jpmml-sklearn), where a fairly impressive
>>>> list of estimators and transformers are supported (including e.g. scaling
>>>> and encoding, and PCA).
>>>>
>>>> I definitely think the current model I/O and "export" or "deploy to
>>>> production" situation needs to be improved substantially. However, you are
>>>> left with the following options:
>>>>
>>>> (a) build out a lightweight "spark-ml-common" project that brings in
>>>> the dependencies needed for production scoring / transformation in
>>>> independent apps. However, here you only support Scala/Java - what about R
>>>> and Python? Also, what about the distributed models? Perhaps "local"
>>>> wrappers can be created, though this may not work for very large factor or
>>>> LDA models. See also the H2O example:
>>>> http://docs.h2o.ai/h2oclassic/userguide/scorePOJO.html
>>>>
>>>> (b) build out Spark's PMML support, and add missing stuff to PMML where
>>>> possible. The benefit here is an existing standard with various tools for
>>>> scoring (via REST server, Java app, Pig, Hive, various language support).
>>>>
>>>> (c) build out a more comprehensive I/O, serialization and scoring
>>>> framework. Here you face the issue of supporting various predictors and
>>>> transformers generically, across platforms and versioning. i.e. you're
>>>> re-creating a new standard like PMML
>>>>
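A minimal sketch of what option (c) involves at its smallest scale, assuming a single logistic regression model exported as JSON with an explicit format version. The schema (`format_version`, `type`, `weights`, `intercept`) is invented for illustration and does not correspond to PMML, PFA, or any Spark format.

```python
import json
import math

FORMAT_VERSION = 1  # hypothetical schema version, not from any standard


def export_model(weights, intercept):
    # Serialize only what the scorer needs: coefficients plus a version tag.
    return json.dumps({
        "format_version": FORMAT_VERSION,
        "type": "logistic_regression",
        "weights": list(weights),
        "intercept": intercept,
    })


def load_model(text):
    # Reject documents this scorer does not know how to interpret.
    spec = json.loads(text)
    if spec.get("format_version") != FORMAT_VERSION:
        raise ValueError("unsupported format version")
    if spec.get("type") != "logistic_regression":
        raise ValueError("unsupported model type")
    return spec


def predict_probability(spec, features):
    # Logistic function applied to the linear margin.
    if len(features) != len(spec["weights"]):
        raise ValueError("feature length mismatch")
    margin = spec["intercept"] + sum(w * x for w, x in zip(spec["weights"], features))
    return 1.0 / (1.0 + math.exp(-margin))
```

Even this toy case needs a version check and a type check. Multiplying that across every predictor and transformer is the cost of re-creating a standard.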
>>>> Option (a) is do-able, but I'm a bit concerned that it may be too
>>>> "Spark specific", or even too "Scala / Java" specific. But it is still
>>>> potentially very useful to Spark users to build this out and have a
>>>> somewhat standard production serving framework and/or library (there are
>>>> obviously existing options like PredictionIO etc).
>>>>
>>>> Option (b) is really building out the existing PMML support within
>>>> Spark, so a lot of the initial work has already been done. I know some
>>>> folks had (or have) licensing issues with some components of JPMML (e.g.
>>>> the evaluator and REST server). But perhaps the solution here is to build
>>>> an Apache2-licensed evaluator framework.
>>>>
>>>> Option (c) is obviously interesting - "let's build a better PMML (that
>>>> uses JSON or whatever instead of XML!)". But it also seems like a huge
>>>> amount of reinventing the wheel, and like any new standard would take time
>>>> to garner wide support (if at all).
>>>>
>>>> It would be really useful to start to understand what the main missing
>>>> pieces are in PMML - perhaps the lowest-hanging fruit is simply to
>>>> contribute improvements or additions to PMML.
>>>>
>>>>
>>>>
>>>> On Fri, Nov 13, 2015 at 11:46 AM, Sabarish Sasidharan <
>>>> sabarish.sasidha...@manthan.com> wrote:
>>>>
>>>>> That may not be an issue if the app using the models runs by itself
>>>>> (not bundled into an existing app), which may actually be the right way to
>>>>> design it considering separation of concerns.
>>>>>
>>>>> Regards
>>>>> Sab
>>>>>
>>>>> On Fri, Nov 13, 2015 at 9:59 AM, DB Tsai <dbt...@dbtsai.com> wrote:
>>>>>
>>>>>> This will bring in all of Spark's dependencies, which may break the
>>>>>> web app.
>>>>>>
>>>>>>
>>>>>> Sincerely,
>>>>>>
>>>>>> DB Tsai
>>>>>> ----------------------------------------------------------
>>>>>> Web: https://www.dbtsai.com
>>>>>> PGP Key ID: 0xAF08DF8D
>>>>>>
>>>>>> On Thu, Nov 12, 2015 at 8:15 PM, Nirmal Fernando <nir...@wso2.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Nov 13, 2015 at 2:04 AM, darren <dar...@ontrenet.com> wrote:
>>>>>>>
>>>>>>>> I agree 100%. Making the model requires large data and many cpus.
>>>>>>>>
>>>>>>>> Using it does not.
>>>>>>>>
>>>>>>>> This is a very useful side effect of ML models.
>>>>>>>>
>>>>>>>> If MLlib models can't be used outside Spark, that's a real shame.
>>>>>>>>
>>>>>>>
>>>>>>> Well, you can, as mentioned earlier. You don't need the Spark runtime
>>>>>>> for predictions: save the serialized model and deserialize it to use
>>>>>>> it (you need the Spark jars on the classpath, though).
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> -------- Original message --------
>>>>>>>> From: "Kothuvatiparambil, Viju" <
>>>>>>>> viju.kothuvatiparam...@bankofamerica.com>
>>>>>>>> Date: 11/12/2015 3:09 PM (GMT-05:00)
>>>>>>>> To: DB Tsai <dbt...@dbtsai.com>, Sean Owen <so...@cloudera.com>
>>>>>>>> Cc: Felix Cheung <felixcheun...@hotmail.com>, Nirmal Fernando <
>>>>>>>> nir...@wso2.com>, Andy Davidson <a...@santacruzintegration.com>,
>>>>>>>> Adrian Tanase <atan...@adobe.com>, "user @spark" <
>>>>>>>> user@spark.apache.org>, Xiangrui Meng <men...@gmail.com>,
>>>>>>>> hol...@pigscanfly.ca
>>>>>>>> Subject: RE: thought experiment: use spark ML to real time
>>>>>>>> prediction
>>>>>>>>
>>>>>>>> I am glad to see DB's comments; they make me feel I am not the only one
>>>>>>>> facing these issues. If we were able to use MLlib to load the model in
>>>>>>>> web applications (outside the Spark cluster), that would solve the
>>>>>>>> issue. I understand Spark is mainly for processing big data in a
>>>>>>>> distributed mode. But there is no purpose in training a model using
>>>>>>>> MLlib if we are not able to use it in the applications that need to
>>>>>>>> access the model.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Viju
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *From:* DB Tsai [mailto:dbt...@dbtsai.com]
>>>>>>>> *Sent:* Thursday, November 12, 2015 11:04 AM
>>>>>>>> *To:* Sean Owen
>>>>>>>> *Cc:* Felix Cheung; Nirmal Fernando; Andy Davidson; Adrian Tanase;
>>>>>>>> user @spark; Xiangrui Meng; hol...@pigscanfly.ca
>>>>>>>> *Subject:* Re: thought experiment: use spark ML to real time
>>>>>>>> prediction
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I think the use-case can be quite different from PMML.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Having a Spark-platform-independent ML jar would empower users to do
>>>>>>>> the following:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 1) PMML doesn't contain all the models we have in MLlib. Also, for
>>>>>>>> an ML pipeline trained by Spark, most of the time PMML is not expressive
>>>>>>>> enough to do all the transformations we have in Spark ML. As a result,
>>>>>>>> if we are able to serialize the entire Spark ML pipeline after training,
>>>>>>>> and then load it back in an app without any Spark platform for
>>>>>>>> production scoring, this will be very useful for production deployment
>>>>>>>> of Spark ML models. The only issue is that if a transformer involves a
>>>>>>>> shuffle, we need to figure out a way to handle it. When I chatted with
>>>>>>>> Xiangrui about this, he suggested that we may tag whether a transformer
>>>>>>>> is shuffle ready. Currently, at Netflix, we are not able to use ML
>>>>>>>> pipelines because of those issues, and we have to write our own scorers
>>>>>>>> for production, which is quite a duplication of work.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2) If users can use Spark's linear algebra code, like vector or matrix,
>>>>>>>> in their applications, this will be very useful. It can help share
>>>>>>>> code between the Spark training pipeline and production deployment.
>>>>>>>> Also, lots of good stuff in Spark's MLlib doesn't depend on the Spark
>>>>>>>> platform, and people could use it in their applications without pulling
>>>>>>>> in lots of dependencies. In fact, in my project, I have to copy & paste
>>>>>>>> code from MLlib into my project to use those goodies in apps.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 3) Currently, MLlib depends on GraphX, which means that in GraphX there
>>>>>>>> is no way to use MLlib's vector or matrix. And
>>>>>>>>
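A minimal sketch of point 1) above, assuming a pipeline whose stages each transform one row at a time and carry a flag saying so. The `row_local` attribute is a hypothetical stand-in for the "shuffle ready" tag mentioned in the message. The class names here are illustrative and are not Spark ML classes.

```python
class StandardScaler:
    row_local = True  # hypothetical tag: needs no data beyond the current row

    def __init__(self, means, stds):
        self.means = means
        self.stds = stds

    def transform(self, row):
        # Center and scale each feature using statistics fixed at training time.
        return [(x - m) / s for x, m, s in zip(row, self.means, self.stds)]


class LinearModel:
    row_local = True

    def __init__(self, weights, intercept):
        self.weights = weights
        self.intercept = intercept

    def transform(self, row):
        # Emit a single-element row holding the linear prediction.
        return [self.intercept + sum(w * x for w, x in zip(self.weights, row))]


class LocalPipeline:
    def __init__(self, stages):
        # Refuse stages that are not tagged as safe to run on a single row.
        blocked = [type(s).__name__ for s in stages if not getattr(s, "row_local", False)]
        if blocked:
            raise ValueError("stages need a cluster: " + ", ".join(blocked))
        self.stages = list(stages)

    def score(self, row):
        # Apply each stage in order to the single input row.
        for stage in self.stages:
            row = stage.transform(row)
        return row
```

With such a tag, a loader can fail fast at load time if a pipeline contains a stage that cannot run outside the cluster, rather than failing on the first request.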
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Thanks & regards,
>>>>>>> Nirmal
>>>>>>>
>>>>>>> Team Lead - WSO2 Machine Learner
>>>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>>>> Mobile: +94715779733
>>>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Architect - Big Data
>>>>> Ph: +91 99805 99458
>>>>>
>>>>> Manthan Systems | *Company of the year - Analytics (2014 Frost and
>>>>> Sullivan India ICT)*
>>>>>
>>>>
>>>>
>>>
>>
>
