One such "lightweight PMML in JSON" is here - https://github.com/bigmlcom/json-pml. At least for the schema definitions. But nothing available in terms of evaluation/scoring. Perhaps this is something that can form a basis for such a new undertaking.
I agree that distributed models are only really applicable in the case of massive-scale factor models - and even then, for latency purposes one needs to use LSH or something similar to achieve sufficiently real-time performance. These days one can easily spin up a single very powerful server to handle even very large models.

On Tue, Nov 17, 2015 at 11:34 PM, DB Tsai <dbt...@dbtsai.com> wrote:

> I was thinking about working on a better version of PMML, JMML in JSON, but as you said, this requires a dedicated team to define the standard, which would be a huge amount of work. However, options b) and c) still don't address the distributed-models issue. In fact, most models in production have to be small enough to return results to users within reasonable latency, so I doubt the usefulness of distributed models in real production use cases. For R and Python, we can build a wrapper on top of the lightweight "spark-ml-common" project.
>
> Sincerely,
>
> DB Tsai
> ----------------------------------------------------------
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
> On Tue, Nov 17, 2015 at 2:29 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
>
>> I think the issue with pulling in all of spark-core is often with dependencies (and versions) conflicting with the web framework (or with Akka in many cases). Plus it really is quite heavy if you just want a fairly lightweight model-serving app. For example, we've built a fairly simple but scalable ALS factor model server on Scalatra, Akka and Breeze. So all you really need is the web framework and Breeze (or an alternative linear algebra lib).
>>
>> I definitely hear the pain point that PMML might not be able to handle some types of transformations or models that exist in Spark. However, here's an example from scikit-learn -> PMML that may be instructive (https://github.com/scikit-learn/scikit-learn/issues/1596 and https://github.com/jpmml/jpmml-sklearn), where a fairly impressive list of estimators and transformers is supported (including e.g. scaling, encoding, and PCA).
>>
>> I definitely think the current model I/O and "export" or "deploy to production" situation needs to be improved substantially. However, you are left with the following options:
>>
>> (a) build out a lightweight "spark-ml-common" project that brings in the dependencies needed for production scoring / transformation in independent apps. However, here you only support Scala/Java - what about R and Python? Also, what about the distributed models? Perhaps "local" wrappers can be created, though this may not work for very large factor or LDA models. See also the H2O example: http://docs.h2o.ai/h2oclassic/userguide/scorePOJO.html
>>
>> (b) build out Spark's PMML support, and add missing pieces to PMML where possible. The benefit here is an existing standard with various tools for scoring (via REST server, Java app, Pig, Hive, and various language support).
>>
>> (c) build out a more comprehensive I/O, serialization and scoring framework. Here you face the issue of supporting various predictors and transformers generically, across platforms and versions - i.e. you're re-creating a standard like PMML.
>>
>> Option (a) is do-able, but I'm a bit concerned that it may be too "Spark specific", or even too "Scala / Java" specific.
>> But it is still potentially very useful to Spark users to build this out and have a somewhat standard production serving framework and/or library (there are obviously existing options like PredictionIO etc.).
>>
>> Option (b) really amounts to building out the existing PMML support within Spark, so a lot of the initial work has already been done. I know some folks had (or have) licensing issues with some components of JPMML (e.g. the evaluator and REST server). But perhaps the solution here is to build an Apache2-licensed evaluator framework.
>>
>> Option (c) is obviously interesting - "let's build a better PMML (that uses JSON or whatever instead of XML)!". But it also seems like a huge amount of reinventing the wheel, and any new standard would take time to garner wide support (if at all).
>>
>> It would be really useful to start to understand what the main missing pieces are in PMML - perhaps the lowest-hanging fruit is simply to contribute improvements or additions to PMML.
>>
>> On Fri, Nov 13, 2015 at 11:46 AM, Sabarish Sasidharan <sabarish.sasidha...@manthan.com> wrote:
>>
>>> That may not be an issue if the app using the models runs by itself (not bundled into an existing app), which may actually be the right way to design it considering separation of concerns.
>>>
>>> Regards
>>> Sab
>>>
>>> On Fri, Nov 13, 2015 at 9:59 AM, DB Tsai <dbt...@dbtsai.com> wrote:
>>>
>>>> This will bring in the whole dependency tree of Spark, which may break the web app.
>>>>
>>>> Sincerely,
>>>>
>>>> DB Tsai
>>>> ----------------------------------------------------------
>>>> Web: https://www.dbtsai.com
>>>> PGP Key ID: 0xAF08DF8D
>>>>
>>>> On Thu, Nov 12, 2015 at 8:15 PM, Nirmal Fernando <nir...@wso2.com> wrote:
>>>>
>>>>> On Fri, Nov 13, 2015 at 2:04 AM, darren <dar...@ontrenet.com> wrote:
>>>>>
>>>>>> I agree 100%. Making the model requires large data and many CPUs.
>>>>>>
>>>>>> Using it does not.
>>>>>>
>>>>>> This is a very useful side effect of ML models.
>>>>>>
>>>>>> If MLlib can't use models outside Spark, that's a real shame.
>>>>>
>>>>> Well, you can, as mentioned earlier. You don't need the Spark runtime for predictions: save the serialized model and deserialize it to use (you do need the Spark jars on the classpath, though).
>>>>>
>>>>>> Sent from my Verizon Wireless 4G LTE smartphone
>>>>>>
>>>>>> -------- Original message --------
>>>>>> From: "Kothuvatiparambil, Viju" <viju.kothuvatiparam...@bankofamerica.com>
>>>>>> Date: 11/12/2015 3:09 PM (GMT-05:00)
>>>>>> To: DB Tsai <dbt...@dbtsai.com>, Sean Owen <so...@cloudera.com>
>>>>>> Cc: Felix Cheung <felixcheun...@hotmail.com>, Nirmal Fernando <nir...@wso2.com>, Andy Davidson <a...@santacruzintegration.com>, Adrian Tanase <atan...@adobe.com>, "user @spark" <user@spark.apache.org>, Xiangrui Meng <men...@gmail.com>, hol...@pigscanfly.ca
>>>>>> Subject: RE: thought experiment: use spark ML to real time prediction
>>>>>>
>>>>>> I am glad to see DB's comments; it makes me feel I am not the only one facing these issues. If we were able to use MLlib to load the model in web applications (outside the Spark cluster), that would have solved the issue. I understand Spark is mainly for processing big data in a distributed mode. But there is no point in training a model using MLlib if we are not able to use it in the applications that need to access the model.
>>>>>>
>>>>>> Thanks
>>>>>> Viju
>>>>>>
>>>>>> *From:* DB Tsai [mailto:dbt...@dbtsai.com]
>>>>>> *Sent:* Thursday, November 12, 2015 11:04 AM
>>>>>> *To:* Sean Owen
>>>>>> *Cc:* Felix Cheung; Nirmal Fernando; Andy Davidson; Adrian Tanase; user @spark; Xiangrui Meng; hol...@pigscanfly.ca
>>>>>> *Subject:* Re: thought experiment: use spark ML to real time prediction
>>>>>>
>>>>>> I think the use case can be quite different from PMML.
>>>>>>
>>>>>> Having a Spark-platform-independent ML jar would empower users to do the following:
>>>>>>
>>>>>> 1) PMML doesn't contain all the models we have in MLlib. Also, for an ML pipeline trained by Spark, most of the time PMML is not expressive enough to capture all the transformations we have in Spark ML. As a result, if we were able to serialize the entire Spark ML pipeline after training, and then load it back in an app without any Spark platform for production scoring, this would be very useful for production deployment of Spark ML models. The only issue is that if a transformer involves a shuffle, we need to figure out a way to handle it. When I chatted with Xiangrui about this, he suggested that we may tag whether a transformer is shuffle ready. Currently, at Netflix, we are not able to use ML pipeline because of those issues, and we have to write our own scorers for production, which is a lot of duplicated work.
>>>>>>
>>>>>> 2) If users could use Spark's linear algebra, like the vector or matrix code, in their applications, this would be very useful. It would help share code between the Spark training pipeline and production deployment. Also, lots of the good stuff in Spark's MLlib doesn't depend on the Spark platform, and people could use it in their applications without pulling in lots of dependencies. In fact, in my project, I have to copy & paste code from MLlib into my project to use those goodies in apps.
>>>>>>
>>>>>> 3) Currently, MLlib depends on GraphX, which means there is no way to use MLlib's vector or matrix inside GraphX. And
>>>>>
>>>>> --
>>>>> Thanks & regards,
>>>>> Nirmal
>>>>>
>>>>> Team Lead - WSO2 Machine Learner
>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>> Mobile: +94715779733
>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>
>>> --
>>> Architect - Big Data
>>> Ph: +91 99805 99458
>>>
>>> Manthan Systems | *Company of the year - Analytics (2014 Frost and Sullivan India ICT)*
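To make the "all you really need is Breeze" point above concrete, here is a hedged sketch of local top-N scoring over exported ALS item factors, with no Spark runtime on the serving side. The class name and toy factors are illustrative; it assumes the training job has already written the factor matrices out in some plain format, and for very large item sets the brute-force matrix-vector multiply would be replaced by LSH or another approximate nearest-neighbour index, as discussed earlier in the thread:

import breeze.linalg.{DenseMatrix, DenseVector}

// Serve recommendations from exported ALS factors without any Spark dependency.
class LocalFactorModelScorer(itemFactors: DenseMatrix[Double]) {

  // Score every item for one user vector and return the top-n (itemIndex, score) pairs.
  def recommend(userFactors: DenseVector[Double], n: Int): Seq[(Int, Double)] = {
    val scores = itemFactors * userFactors // one matrix-vector multiply
    scores.toArray.zipWithIndex
      .map { case (score, idx) => (idx, score) }
      .sortBy { case (_, score) => -score }
      .take(n)
  }
}

object LocalFactorModelScorer {
  def main(args: Array[String]): Unit = {
    // Toy factors: 4 items x 2 latent features, plus one user vector.
    val itemFactors = DenseMatrix(
      (0.9, 0.1),
      (0.2, 0.8),
      (0.5, 0.5),
      (0.1, 0.9))
    val user = DenseVector(1.0, 0.0)
    new LocalFactorModelScorer(itemFactors).recommend(user, 2).foreach(println)
  }
}

On the serialization point raised earlier: the same pattern applies to many MLlib models - if the training job exports just the learned arrays (factors, coefficients) rather than the serialized model object, the serving app does not need the Spark jars on its classpath at all.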