One such "lightweight PMML in JSON" is here - https://github.com/bigmlcom/json-pml. At least for the schema definitions. But nothing available in terms of evaluation/scoring. Perhaps this is something that can form a basis for such a new undertaking.
I agree that distributed models are only really applicable in the case of massive-scale factor models - and even then, for latency purposes one needs to use LSH or something similar to achieve sufficiently real-time performance. These days one can easily spin up a single very powerful server to handle even very large models.

On Tue, Nov 17, 2015 at 11:34 PM, DB Tsai <dbt...@dbtsai.com> wrote:

> I was thinking about working on a better version of PMML, JMML in JSON, but as you said, this requires a dedicated team to define the standard, which would be a huge amount of work. However, options b) and c) still don't address the distributed-models issue. In fact, most models in production have to be small enough to return results to users within reasonable latency, so I doubt the usefulness of distributed models in real production use cases. For R and Python, we can build a wrapper on top of the lightweight "spark-ml-common" project.
>
> Sincerely,
>
> DB Tsai
> ----------------------------------------------------------
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
> On Tue, Nov 17, 2015 at 2:29 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
>
>> I think the issue with pulling in all of spark-core is often with dependencies (and versions) conflicting with the web framework (or with Akka in many cases). Plus it really is quite heavy if you just want a fairly lightweight model-serving app. For example, we've built a fairly simple but scalable ALS factor model server on Scalatra, Akka and Breeze. So all you really need is the web framework and Breeze (or an alternative linear algebra lib).
>>
>> I definitely hear the pain point that PMML might not be able to handle some types of transformations or models that exist in Spark. However, here's an example from scikit-learn -> PMML that may be instructive (https://github.com/scikit-learn/scikit-learn/issues/1596 and https://github.com/jpmml/jpmml-sklearn), where a fairly impressive list of estimators and transformers is supported (including e.g. scaling, encoding, and PCA).
>>
>> I definitely think the current model I/O and "export" or "deploy to production" situation needs to be improved substantially. However, you are left with the following options:
>>
>> (a) build out a lightweight "spark-ml-common" project that brings in the dependencies needed for production scoring / transformation in independent apps. However, here you only support Scala/Java - what about R and Python? Also, what about the distributed models? Perhaps "local" wrappers can be created, though this may not work for very large factor or LDA models. See also the H2O example: http://docs.h2o.ai/h2oclassic/userguide/scorePOJO.html
>>
>> (b) build out Spark's PMML support, and add missing pieces to PMML where possible. The benefit here is an existing standard with various tools for scoring (via REST server, Java app, Pig, Hive, and various language support).
>>
>> (c) build out a more comprehensive I/O, serialization and scoring framework. Here you face the issue of supporting various predictors and transformers generically, across platforms and versions - i.e. you're re-creating a standard like PMML.
>>
>> Option (a) is do-able, but I'm a bit concerned that it may be too "Spark specific", or even too "Scala / Java" specific.
>> But it is still potentially very useful to Spark users to build this out and have a somewhat standard production serving framework and/or library (there are obviously existing options like PredictionIO etc.).
>>
>> Option (b) really amounts to building out the existing PMML support within Spark, so a lot of the initial work has already been done. I know some folks had (or have) licensing issues with some components of JPMML (e.g. the evaluator and REST server). But perhaps the solution here is to build an Apache2-licensed evaluator framework.
>>
>> Option (c) is obviously interesting - "let's build a better PMML (that uses JSON or whatever instead of XML)!". But it also seems like a huge amount of reinventing the wheel, and any new standard would take time to garner wide support (if at all).
>>
>> It would be really useful to start to understand what the main missing pieces are in PMML - perhaps the lowest-hanging fruit is simply to contribute improvements or additions to PMML.
>>
>> On Fri, Nov 13, 2015 at 11:46 AM, Sabarish Sasidharan <sabarish.sasidha...@manthan.com> wrote:
>>
>>> That may not be an issue if the app using the models runs by itself (not bundled into an existing app), which may actually be the right way to design it considering separation of concerns.
>>>
>>> Regards
>>> Sab
>>>
>>> On Fri, Nov 13, 2015 at 9:59 AM, DB Tsai <dbt...@dbtsai.com> wrote:
>>>
>>>> This will bring in the whole dependency tree of Spark, which may break the web app.
>>>>
>>>> Sincerely,
>>>>
>>>> DB Tsai
>>>> ----------------------------------------------------------
>>>> Web: https://www.dbtsai.com
>>>> PGP Key ID: 0xAF08DF8D
>>>>
>>>> On Thu, Nov 12, 2015 at 8:15 PM, Nirmal Fernando <nir...@wso2.com> wrote:
>>>>
>>>>> On Fri, Nov 13, 2015 at 2:04 AM, darren <dar...@ontrenet.com> wrote:
>>>>>
>>>>>> I agree 100%. Making the model requires large data and many CPUs.
>>>>>>
>>>>>> Using it does not.
>>>>>>
>>>>>> This is a very useful side effect of ML models.
>>>>>>
>>>>>> If MLlib can't use models outside Spark, that's a real shame.
>>>>>
>>>>> Well, you can, as mentioned earlier. You don't need the Spark runtime for predictions: save the serialized model and deserialize it to use (you do need the Spark jars on the classpath, though).
>>>>>
>>>>>> Sent from my Verizon Wireless 4G LTE smartphone
>>>>>>
>>>>>> -------- Original message --------
>>>>>> From: "Kothuvatiparambil, Viju" <viju.kothuvatiparam...@bankofamerica.com>
>>>>>> Date: 11/12/2015 3:09 PM (GMT-05:00)
>>>>>> To: DB Tsai <dbt...@dbtsai.com>, Sean Owen <so...@cloudera.com>
>>>>>> Cc: Felix Cheung <felixcheun...@hotmail.com>, Nirmal Fernando <nir...@wso2.com>, Andy Davidson <a...@santacruzintegration.com>, Adrian Tanase <atan...@adobe.com>, "user @spark" <user@spark.apache.org>, Xiangrui Meng <men...@gmail.com>, hol...@pigscanfly.ca
>>>>>> Subject: RE: thought experiment: use spark ML to real time prediction
>>>>>>
>>>>>> I am glad to see DB's comments; it makes me feel I am not the only one facing these issues. If we were able to use MLlib to load the model in web applications (outside the Spark cluster), that would have solved the issue. I understand Spark is mainly for processing big data in a distributed mode. But there is no point in training a model using MLlib if we are not able to use it in the applications that need to access the model.
>>>>>>
>>>>>> Thanks
>>>>>> Viju
>>>>>>
>>>>>> *From:* DB Tsai [mailto:dbt...@dbtsai.com]
>>>>>> *Sent:* Thursday, November 12, 2015 11:04 AM
>>>>>> *To:* Sean Owen
>>>>>> *Cc:* Felix Cheung; Nirmal Fernando; Andy Davidson; Adrian Tanase; user @spark; Xiangrui Meng; hol...@pigscanfly.ca
>>>>>> *Subject:* Re: thought experiment: use spark ML to real time prediction
>>>>>>
>>>>>> I think the use case can be quite different from PMML.
>>>>>>
>>>>>> Having a Spark-platform-independent ML jar would empower users to do the following:
>>>>>>
>>>>>> 1) PMML doesn't contain all the models we have in MLlib. Also, for an ML pipeline trained by Spark, most of the time PMML is not expressive enough to capture all the transformations we have in Spark ML. As a result, if we were able to serialize the entire Spark ML pipeline after training, and then load it back in an app without any Spark platform for production scoring, this would be very useful for production deployment of Spark ML models. The only issue is that if a transformer involves a shuffle, we need to figure out a way to handle it. When I chatted with Xiangrui about this, he suggested that we may tag whether a transformer is shuffle ready. Currently, at Netflix, we are not able to use ML pipeline because of those issues, and we have to write our own scorers for production, which is a lot of duplicated work.
>>>>>>
>>>>>> 2) If users could use Spark's linear algebra, like the vector or matrix code, in their applications, this would be very useful. It would help share code between the Spark training pipeline and production deployment. Also, lots of the good stuff in Spark's MLlib doesn't depend on the Spark platform, and people could use it in their applications without pulling in lots of dependencies. In fact, in my project, I have to copy & paste code from MLlib into my project to use those goodies in apps.
>>>>>>
>>>>>> 3) Currently, MLlib depends on GraphX, which means there is no way to use MLlib's vector or matrix inside GraphX. And
>>>>>
>>>>> --
>>>>> Thanks & regards,
>>>>> Nirmal
>>>>>
>>>>> Team Lead - WSO2 Machine Learner
>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>> Mobile: +94715779733
>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>
>>> --
>>> Architect - Big Data
>>> Ph: +91 99805 99458
>>>
>>> Manthan Systems | *Company of the year - Analytics (2014 Frost and Sullivan India ICT)*
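To make the "all you really need is Breeze" point above concrete, here is a hedged sketch of local top-N scoring over exported ALS item factors, with no Spark runtime on the serving side. The class name and toy factors are illustrative; it assumes the training job has already written the factor matrices out in some plain format, and for very large item sets the brute-force matrix-vector multiply would be replaced by LSH or another approximate nearest-neighbour index, as discussed earlier in the thread:

import breeze.linalg.{DenseMatrix, DenseVector}

// Serve recommendations from exported ALS factors without any Spark dependency.
class LocalFactorModelScorer(itemFactors: DenseMatrix[Double]) {

  // Score every item for one user vector and return the top-n (itemIndex, score) pairs.
  def recommend(userFactors: DenseVector[Double], n: Int): Seq[(Int, Double)] = {
    val scores = itemFactors * userFactors // one matrix-vector multiply
    scores.toArray.zipWithIndex
      .map { case (score, idx) => (idx, score) }
      .sortBy { case (_, score) => -score }
      .take(n)
  }
}

object LocalFactorModelScorer {
  def main(args: Array[String]): Unit = {
    // Toy factors: 4 items x 2 latent features, plus one user vector.
    val itemFactors = DenseMatrix(
      (0.9, 0.1),
      (0.2, 0.8),
      (0.5, 0.5),
      (0.1, 0.9))
    val user = DenseVector(1.0, 0.0)
    new LocalFactorModelScorer(itemFactors).recommend(user, 2).foreach(println)
  }
}

On the serialization point raised earlier: the same pattern applies to many MLlib models - if the training job exports just the learned arrays (factors, coefficients) rather than the serialized model object, the serving app does not need the Spark jars on its classpath at all.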