Thanks for that link, Vincenzo. PFA definitely seems interesting, though I see it is quite wide in scope, almost like its own mini math/programming language.
Do you know if there are any reference implementations in code? I don't see any on the web site or the DMG GitHub.

On Sun, Nov 22, 2015 at 2:24 PM, Vincenzo Selvaggio <vselvag...@gmail.com> wrote:

> The Data Mining Group (http://dmg.org/) that created PMML are working on
> a new standard called PFA that indeed uses JSON documents, see
> http://dmg.org/pfa/docs/motivation/ for details.
>
> PFA could be the answer to your option c.
>
> Regards,
> Vincenzo
>
> On Wed, Nov 18, 2015 at 12:03 PM, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>
>> One such "lightweight PMML in JSON" is here -
>> https://github.com/bigmlcom/json-pml. At least for the schema
>> definitions. But nothing is available in terms of evaluation/scoring.
>> Perhaps this is something that can form a basis for such a new undertaking.
>>
>> I agree that distributed models are only really applicable in the case of
>> massive scale factor models - and then anyway for latency purposes one
>> needs to use LSH or something similar to achieve sufficiently real-time
>> performance. These days one can easily spin up a single very powerful
>> server to handle even very large models.
>>
>> On Tue, Nov 17, 2015 at 11:34 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>>
>>> I was thinking about working on a better version of PMML, JMML in JSON,
>>> but as you said, this requires a dedicated team to define the standard,
>>> which will be a huge amount of work. However, options b) and c) still
>>> don't address the distributed models issue. In fact, most of the models
>>> in production have to be small enough to return the result to users
>>> within reasonable latency, so I doubt the usefulness of distributed
>>> models in real production use cases. For R and Python, we can build a
>>> wrapper on top of the lightweight "spark-ml-common" project.
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> ----------------------------------------------------------
>>> Web: https://www.dbtsai.com
>>> PGP Key ID: 0xAF08DF8D
>>>
>>> On Tue, Nov 17, 2015 at 2:29 AM, Nick Pentreath
>>> <nick.pentre...@gmail.com> wrote:
>>>
>>>> I think the issue with pulling in all of spark-core is often with
>>>> dependencies (and versions) conflicting with the web framework (or Akka
>>>> in many cases). Plus it really is quite heavy if you just want a fairly
>>>> lightweight model-serving app. For example we've built a fairly simple
>>>> but scalable ALS factor model server on Scalatra, Akka and Breeze. So
>>>> all you really need is the web framework and Breeze (or an alternative
>>>> linear algebra lib).
>>>>
>>>> I definitely hear the pain-point that PMML might not be able to handle
>>>> some types of transformations or models that exist in Spark. However,
>>>> here's an example from scikit-learn -> PMML that may be instructive
>>>> (https://github.com/scikit-learn/scikit-learn/issues/1596 and
>>>> https://github.com/jpmml/jpmml-sklearn), where a fairly impressive
>>>> list of estimators and transformers is supported (including e.g.
>>>> scaling and encoding, and PCA).
>>>>
>>>> I definitely think the current model I/O and "export" or "deploy to
>>>> production" situation needs to be improved substantially. However, you
>>>> are left with the following options:
>>>>
>>>> (a) build out a lightweight "spark-ml-common" project that brings in
>>>> the dependencies needed for production scoring / transformation in
>>>> independent apps. However, here you only support Scala/Java - what
>>>> about R and Python? Also, what about the distributed models? Perhaps
>>>> "local" wrappers can be created, though this may not work for very
>>>> large factor or LDA models. See also the H2O example
>>>> http://docs.h2o.ai/h2oclassic/userguide/scorePOJO.html
>>>>
>>>> (b) build out Spark's PMML support, and add missing stuff to PMML where
>>>> possible.
>>>> The benefit here is an existing standard with various tools for
>>>> scoring (via REST server, Java app, Pig, Hive, various language support).
>>>>
>>>> (c) build out a more comprehensive I/O, serialization and scoring
>>>> framework. Here you face the issue of supporting various predictors and
>>>> transformers generically, across platforms and versioning, i.e. you're
>>>> re-creating a new standard like PMML.
>>>>
>>>> Option (a) is do-able, but I'm a bit concerned that it may be too
>>>> "Spark specific", or even too "Scala / Java" specific. But it is still
>>>> potentially very useful to Spark users to build this out and have a
>>>> somewhat standard production serving framework and/or library (there are
>>>> obviously existing options like PredictionIO etc).
>>>>
>>>> Option (b) is really building out the existing PMML support within
>>>> Spark, so a lot of the initial work has already been done. I know some
>>>> folks had (or have) licensing issues with some components of JPMML (e.g.
>>>> the evaluator and REST server). But perhaps the solution here is to
>>>> build an Apache2-licensed evaluator framework.
>>>>
>>>> Option (c) is obviously interesting - "let's build a better PMML (that
>>>> uses JSON or whatever instead of XML!)". But it also seems like a huge
>>>> amount of reinventing the wheel, and like any new standard would take
>>>> time to garner wide support (if at all).
>>>>
>>>> It would be really useful to start to understand what the main missing
>>>> pieces are in PMML - perhaps the lowest-hanging fruit is simply to
>>>> contribute improvements or additions to PMML.
>>>>
>>>> On Fri, Nov 13, 2015 at 11:46 AM, Sabarish Sasidharan
>>>> <sabarish.sasidha...@manthan.com> wrote:
>>>>
>>>>> That may not be an issue if the app using the models runs by itself
>>>>> (not bundled into an existing app), which may actually be the right
>>>>> way to design it considering separation of concerns.
>>>>>
>>>>> Regards
>>>>> Sab
>>>>>
>>>>> On Fri, Nov 13, 2015 at 9:59 AM, DB Tsai <dbt...@dbtsai.com> wrote:
>>>>>
>>>>>> This will bring in all of Spark's dependencies, which may break the
>>>>>> web app.
>>>>>>
>>>>>> Sincerely,
>>>>>>
>>>>>> DB Tsai
>>>>>> ----------------------------------------------------------
>>>>>> Web: https://www.dbtsai.com
>>>>>> PGP Key ID: 0xAF08DF8D
>>>>>>
>>>>>> On Thu, Nov 12, 2015 at 8:15 PM, Nirmal Fernando <nir...@wso2.com>
>>>>>> wrote:
>>>>>>
>>>>>>> On Fri, Nov 13, 2015 at 2:04 AM, darren <dar...@ontrenet.com> wrote:
>>>>>>>
>>>>>>>> I agree 100%. Making the model requires large data and many CPUs.
>>>>>>>>
>>>>>>>> Using it does not.
>>>>>>>>
>>>>>>>> This is a very useful side effect of ML models.
>>>>>>>>
>>>>>>>> If MLlib can't use models outside Spark that's a real shame.
>>>>>>>>
>>>>>>>
>>>>>>> Well, you can, as mentioned earlier. You don't need the Spark runtime
>>>>>>> for predictions; save the serialized model and deserialize it to use
>>>>>>> it. (You need the Spark jars on the classpath, though.)
>>>>>>>
>>>>>>>>
>>>>>>>> Sent from my Verizon Wireless 4G LTE smartphone
>>>>>>>>
>>>>>>>> -------- Original message --------
>>>>>>>> From: "Kothuvatiparambil, Viju"
>>>>>>>> <viju.kothuvatiparam...@bankofamerica.com>
>>>>>>>> Date: 11/12/2015 3:09 PM (GMT-05:00)
>>>>>>>> To: DB Tsai <dbt...@dbtsai.com>, Sean Owen <so...@cloudera.com>
>>>>>>>> Cc: Felix Cheung <felixcheun...@hotmail.com>, Nirmal Fernando
>>>>>>>> <nir...@wso2.com>, Andy Davidson <a...@santacruzintegration.com>,
>>>>>>>> Adrian Tanase <atan...@adobe.com>, "user @spark"
>>>>>>>> <user@spark.apache.org>, Xiangrui Meng <men...@gmail.com>,
>>>>>>>> hol...@pigscanfly.ca
>>>>>>>> Subject: RE: thought experiment: use spark ML to real time
>>>>>>>> prediction
>>>>>>>>
>>>>>>>> I am glad to see DB’s comments; they make me feel I am not the only
>>>>>>>> one facing these issues.
>>>>>>>> If we were able to use MLlib to load the model in web applications
>>>>>>>> (outside the Spark cluster), that would solve the issue. I
>>>>>>>> understand Spark is mainly for processing big data in a distributed
>>>>>>>> mode. But there is no purpose in training a model using MLlib if we
>>>>>>>> are not able to use it in the applications that need to access the
>>>>>>>> model.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Viju
>>>>>>>>
>>>>>>>> *From:* DB Tsai [mailto:dbt...@dbtsai.com]
>>>>>>>> *Sent:* Thursday, November 12, 2015 11:04 AM
>>>>>>>> *To:* Sean Owen
>>>>>>>> *Cc:* Felix Cheung; Nirmal Fernando; Andy Davidson; Adrian Tanase;
>>>>>>>> user @spark; Xiangrui Meng; hol...@pigscanfly.ca
>>>>>>>> *Subject:* Re: thought experiment: use spark ML to real time
>>>>>>>> prediction
>>>>>>>>
>>>>>>>> I think the use-case can be quite different from PMML.
>>>>>>>>
>>>>>>>> Having a Spark-platform-independent ML jar can empower users to do
>>>>>>>> the following:
>>>>>>>>
>>>>>>>> 1) PMML doesn't contain all the models we have in mllib. Also, for
>>>>>>>> an ML pipeline trained by Spark, most of the time PMML is not
>>>>>>>> expressive enough to do all the transformations we have in Spark ML.
>>>>>>>> As a result, if we are able to serialize the entire Spark ML
>>>>>>>> pipeline after training, and then load it back in an app without any
>>>>>>>> Spark platform for production scoring, this will be very useful for
>>>>>>>> production deployment of Spark ML models. The only issue is that if
>>>>>>>> a transformer involves a shuffle, we need to figure out a way to
>>>>>>>> handle it. When I chatted with Xiangrui about this, he suggested
>>>>>>>> that we may tag whether a transformer is shuffle ready. Currently,
>>>>>>>> at Netflix, we are not able to use ML pipelines because of those
>>>>>>>> issues, and we have to write our own scorers in production, which
>>>>>>>> is quite a lot of duplicated work.
>>>>>>>>
>>>>>>>> 2) If users can use Spark's linear algebra code, like vector or
>>>>>>>> matrix, in their applications, this will be very useful. This can
>>>>>>>> help to share code between the Spark training pipeline and
>>>>>>>> production deployment. Also, lots of good stuff in Spark's mllib
>>>>>>>> doesn't depend on the Spark platform, and people could use it in
>>>>>>>> their applications without pulling in lots of dependencies. In
>>>>>>>> fact, in my project, I have to copy & paste code from mllib into my
>>>>>>>> project to use those goodies in apps.
>>>>>>>>
>>>>>>>> 3) Currently, mllib depends on graphx which means in graphx, there
>>>>>>>> is no way to use mllib's vector or matrix. And
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Thanks & regards,
>>>>>>> Nirmal
>>>>>>>
>>>>>>> Team Lead - WSO2 Machine Learner
>>>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>>>> Mobile: +94715779733
>>>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Architect - Big Data
>>>>> Ph: +91 99805 99458
>>>>>
>>>>> Manthan Systems | *Company of the year - Analytics (2014 Frost and
>>>>> Sullivan India ICT)*
>>>>> +++
>>>>>
>>>>
>>>
>>
>