[GitHub] flink pull request: [FLINK-1966][ml]Add support for Predictive Mod...
Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/1186#issuecomment-181772422

That said, just for comparison, Spark has its own model export and import feature, along with PMML export. Fully supporting PMML import in a framework like Flink or Spark is next to impossible; it would require changing the entire way our pipelines and datasets are represented.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
Github user chobeat commented on the pull request: https://github.com/apache/flink/pull/1186#issuecomment-181790876

I agree with @sachingoel0101 on the import complexity, but from our point of view Flink is the perfect platform for evaluating models in streaming, and we are using it that way in our architecture. Why do you think it wouldn't be suitable?
Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/1186#issuecomment-181771679

As the original author of this PR, I'd say this: I tried implementing the import features, but they aren't worth it. You have to discard most valid PMML models because they don't fit the Flink framework. Further, in my opinion, the purpose of Flink is to train the model. Once we export that model to PMML, you can use it pretty much anywhere, say R or MATLAB, which support complete PMML import and export functionality. The exported model will in most cases be used for testing, evaluation and prediction, for which Flink isn't a good platform anyway. This can be accomplished anywhere.
Github user chobeat commented on the pull request: https://github.com/apache/flink/pull/1186#issuecomment-181799783

@sachingoel0101 I agree. Nonetheless, an easy way to store a model generated in batch and move it to a streaming environment would be a really useful feature, which brings us back to what @chiwanpark was saying about a custom format internal to Flink.
Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/1186#issuecomment-181798643

That is a good point. In a streaming setting, it does indeed make sense for the model to be available. However, in my opinion, it would then make sense to just use JPMML to import the object and then extract the model parameters. Granted, it is added effort on the user's side, but I still think it beats the complexity introduced by supporting imports directly. Furthermore, it would be bad design to have to reject valid PMML models just because a minor thing isn't supported in Flink.
Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/1186#issuecomment-181803637

I'm all for that. Flink's models should be transferable at least across Flink. But that should be part of a separate PR and should not block this one, which has been blocked for far too long. It should be pretty easy to accomplish.
Github user chobeat commented on the pull request: https://github.com/apache/flink/pull/1186#issuecomment-181757426

Well, that wouldn't be a problem for export: you will create, and therefore export, only models that have `double` as the data type for parameters, so that's not an issue. It would be a problem for import, though, because PMML supports a wider set of data types and model types. You can't really achieve any satisfying degree of PMML support in a platform like Flink, and that's why everyone uses JPMML for evaluation. You would only be able to import compatible models with compatible data fields. This would require a simple runtime validation of the model type and of the fields' data types.
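A minimal sketch of what such a runtime validation could look like. `FieldSpec`, `ModelSpec`, `PmmlImportCheck` and the set of supported model types are hypothetical stand-ins invented for illustration; they are not JPMML or FlinkML classes, and a real importer would read these values from a parsed PMML document.

```scala
// Hypothetical, simplified view of a parsed PMML model (not a JPMML or FlinkML API).
final case class FieldSpec(name: String, dataType: String)
final case class ModelSpec(modelType: String, fields: Seq[FieldSpec])

object PmmlImportCheck {
  // Assumption: the model types an importer would accept.
  val supportedModelTypes: Set[String] =
    Set("RegressionModel", "SupportVectorMachineModel")

  /** Returns Right(spec) if the model looks importable, Left(reasons) otherwise. */
  def validate(spec: ModelSpec): Either[Seq[String], ModelSpec] = {
    val typeErrors =
      if (supportedModelTypes.contains(spec.modelType)) Seq.empty[String]
      else Seq(s"unsupported model type: ${spec.modelType}")

    // Only double-valued fields are accepted, mirroring the Double-only restriction.
    val fieldErrors = spec.fields.collect {
      case FieldSpec(name, dataType) if dataType != "double" =>
        s"field '$name' has unsupported data type: $dataType"
    }

    val errors = typeErrors ++ fieldErrors
    if (errors.isEmpty) Right(spec) else Left(errors)
  }
}
```

Returning every reason at once, rather than failing on the first, lets a user see why a model was rejected without repeated attempts.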
Github user chobeat commented on the pull request: https://github.com/apache/flink/pull/1186#issuecomment-181442715

Hello, any news on this PR? @smarthi PMML is actually an industry standard and is widely used to support model portability in complex infrastructures. Assuming that it is not adopted is wrong, according to my knowledge and experience. There are certainly a lot of data scientists who never come into contact with this standard, and I had never heard of it before my first job on an ML architecture, but it's the best (and only) tool for this kind of job.
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/1186#issuecomment-181739512

Hi @chobeat, thanks for leaving your comments. About compatibility with other systems (such as R or MLlib), I meant that we cannot achieve compatibility with those systems even if we use PMML, because there are differences between FlinkML and the other systems. For example, FlinkML supports only `Double` as a data type. So we can achieve only partial PMML support (especially when importing models from other systems). Is this sufficient for use in production? If yes, we would go for this.
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/1186#issuecomment-181509602

Hi @chobeat, thanks for pinging this issue. I forgot to send a discussion email to the mailing list. I think we have to discuss the following:

* What is the main purpose of supporting PMML? Is this feature only for model portability within FlinkML? If not, we have to support other systems such as R or Spark MLlib.
* What about a FlinkML-only format? I think that PMML's support for distributed systems is poor. An XML-based format is hard to parallelize.

I would like to create a general ML model import/export framework. Then we can easily add PMML support on top of that framework.
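One way to picture a general export framework with pluggable formats is sketched below. Everything here is hypothetical: `LinearModel`, `ModelFormat`, `ModelFormats` and the `flinkml-text` format are invented for illustration and do not reflect any existing FlinkML design. A PMML exporter would simply be another `ModelFormat` instance.

```scala
import java.io.{StringWriter, Writer}

// Hypothetical stand-in for a fitted linear model.
final case class LinearModel(weights: Vector[Double], intercept: Double)

// A format knows how to write one kind of model; PMML could be one such format.
trait ModelFormat[M] {
  def name: String
  def write(model: M, out: Writer): Unit
}

object ModelFormats {
  // A trivial FlinkML-only text format, used purely for illustration.
  val plainText: ModelFormat[LinearModel] = new ModelFormat[LinearModel] {
    val name = "flinkml-text"
    def write(model: LinearModel, out: Writer): Unit = {
      out.write(s"intercept=${model.intercept}\n")
      out.write(model.weights.mkString("weights=", ",", "\n"))
    }
  }

  // Serializes a model to a string using whichever format is supplied.
  def exportToString[M](model: M, format: ModelFormat[M]): String = {
    val writer = new StringWriter()
    format.write(model, writer)
    writer.toString
  }
}
```

Keeping the format behind a small trait means callers never depend on PMML classes directly, so adding or dropping a format does not touch the models themselves.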
Github user chobeat commented on the pull request: https://github.com/apache/flink/pull/1186#issuecomment-181578375

Hi @chiwanpark,

> What is main purpose to support PMML? Is this feature for only model portability in FlinkML?

I've used PMML extensively in a previous project and saw many use cases other than my own. PMML export is necessary for external portability: you may need to create a model in Flink and use it on local data with a data mining tool, for example, or you could deploy it in a production pipeline developed with a totally different technology stack. PMML import is optional, though: you can use JPMML (the reference implementation of PMML) to read a PMML file and evaluate the model locally on the node. Import from PMML into the native FlinkML implementation may be a plus in terms of usability and probably performance, but it's not really a blocking issue for a developer.

> If not, we have to support other systems such as R or Spark MLlib.

Support for R may be interesting in itself, but I can't understand what you mean. MLlib does support PMML export (even if it is somewhat buggy for a few models like Naive Bayes), so it is already possible to move models from MLlib to Flink.

> What about FlinkML only format? I think that support for distributed system in PMML is poor. XML-based format is hard to parallelize.

This could be interesting to guarantee the consistency of the models and to tune the format to our needs. The complexity of PMML is due to the need for generality and consistency, but it's often overkill for describing simple models. It also has only partial support for many models that we may want to implement, for example any of the online learning algorithms implemented in SAMOA or other online learning frameworks. I know we still miss a few pieces before reaching that point, but still...
Github user smarthi commented on the pull request: https://github.com/apache/flink/pull/1186#issuecomment-152265442

Suggest that you look at how PMML has been done in Oryx 2.0 (PMML in Spark followed Oryx 2.0). PMML support was discussed several times on the Mahout project and was never implemented, in large part due to the lack of actual PMML usage by machine learning practitioners and data scientists. See this Mahout thread from last year, and more specifically Ted Dunning's comment in the thread: http://mail-archives.apache.org/mod_mbox/mahout-dev/201503.mbox/%3CCAJwFCa1%3DAw%2B3G54FgkYdTH%3DoNQBRqfeU-SS19iCFKMWbAfWzOQ%40mail.gmail.com%3E

Given that PMML models could get really huge, it's good practice to persist them in a compressed format. It would also be good to be able to specify which features/fields are categorical or numeric (via a config file, maybe).
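As a rough illustration of persisting a model in compressed form, the sketch below gzips an XML string with the JDK's `java.util.zip` classes. The object name `CompressedModelIO` is hypothetical, and the input is assumed to be an already serialized PMML document held in memory; a real exporter would more likely stream to a file.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.nio.charset.StandardCharsets
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Hypothetical helper: gzip round trip for a serialized model document.
object CompressedModelIO {
  def compress(xml: String): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val gzip = new GZIPOutputStream(bytes)
    // Closing the gzip stream writes the trailer, so close before reading bytes.
    try gzip.write(xml.getBytes(StandardCharsets.UTF_8))
    finally gzip.close()
    bytes.toByteArray
  }

  def decompress(data: Array[Byte]): String = {
    val gzip = new GZIPInputStream(new ByteArrayInputStream(data))
    try {
      val out = new ByteArrayOutputStream()
      val buffer = new Array[Byte](4096)
      var read = gzip.read(buffer)
      while (read != -1) {
        out.write(buffer, 0, read)
        read = gzip.read(buffer)
      }
      new String(out.toByteArray, StandardCharsets.UTF_8)
    } finally gzip.close()
  }
}
```

XML with many repeated element names, such as a long list of predictors, tends to compress well, which is the case the comment is concerned with.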
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/1186#discussion_r41496701

    --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/classification/SVM.scala ---
    @@ -18,23 +18,20 @@
     package org.apache.flink.ml.classification
     
    -import org.apache.flink.api.common.typeinfo.TypeInformation
    -import org.apache.flink.ml.pipeline.{PredictOperation, FitOperation, PredictDataSetOperation,
    -Predictor}
    -
    -import scala.collection.mutable.ArrayBuffer
    -import scala.util.Random
    -
    +import breeze.linalg.{DenseVector => BreezeDenseVector, Vector => BreezeVector}
     import org.apache.flink.api.common.functions.RichMapFunction
     import org.apache.flink.api.scala._
     import org.apache.flink.configuration.Configuration
    -import org.apache.flink.ml._
     import org.apache.flink.ml.common.FlinkMLTools.ModuloKeyPartitioner
    -import org.apache.flink.ml.common._
    -import org.apache.flink.ml.math.{DenseVector, Vector}
    +import org.apache.flink.ml.common.{Parameter => FlinkParameter, _}
    --- End diff --

I would like to keep the name `Parameter` for Flink's class. How about renaming the classes from the PMML library instead?
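The suggestion can be illustrated with Scala's import renaming. In this sketch `flinkml` and `pmmllib` are hypothetical objects standing in for the two real packages, and it assumes the clash is with a class that is also named `Parameter` on the PMML side.

```scala
// Hypothetical stand-ins for the Flink ML package and the PMML library package.
object flinkml { final case class Parameter(name: String) }
object pmmllib { final case class Parameter(name: String, value: String) }

object RenameExample {
  // Flink's class keeps its plain name; the PMML class is renamed on import.
  import flinkml.Parameter
  import pmmllib.{Parameter => PMMLParameter}

  val regularization: Parameter = Parameter("regularization")
  val exported: PMMLParameter = PMMLParameter("regularization", "0.1")
}
```

Renaming the library's class confines the unusual name to the export code, while the rest of the algorithm keeps reading as ordinary FlinkML code.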
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/1186#discussion_r41497880

    --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/regression/MultipleLinearRegression.scala ---
    @@ -124,6 +121,52 @@ class MultipleLinearRegression extends Predictor[MultipleLinearRegression] {
         }
       }
    +
    +  override def toPMML(): PMML = {
    +    weightsOption match {
    +      case None => {
    +        throw new RuntimeException("The MultipleLinearRegression has not been fitted to the " +
    +          "data. This is necessary to learn the weight vector of the linear function.")
    +      }
    +      case Some(weights) => {
    +        val model = weights.collect().head
    +        val pmml = new PMML()
    +        pmml.setHeader(new Header().setDescription("Multiple Linear Regression"))
    +
    +        // define the fields
    +        val target = FieldName.create("prediction")
    +        val fields = scala.Array.ofDim[FieldName](model.weights.size)
    +        Range(0, model.weights.size).foreach(index =>
    +          fields(index) = FieldName.create("field_" + index)
    +        )
    --- End diff --

We can make this more scalaesque:

```scala
val fields = (0 until model.weights.size).map(i => FieldName.create("field_" + i.toString))
```
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/1186#discussion_r41497844

    --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/regression/MultipleLinearRegression.scala ---
    @@ -18,15 +18,12 @@
     package org.apache.flink.ml.regression
     
    -import org.apache.flink.api.scala.DataSet
    +import org.apache.flink.api.scala.{DataSet, _}
    --- End diff --

Unnecessary import statement change.
Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/1186#issuecomment-146501909

The PMML model is quite extensive, and there isn't enough support in the ML library for utilizing most of its features (like FieldUsageType, DataTypes, etc.). I had actually written the import functions for both SVM and MLR but decided to drop them. I mostly followed Spark's implementation for this, and import isn't supported there either.
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/1186#discussion_r41498327

    --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/regression/MultipleLinearRegression.scala ---
    @@ -124,6 +121,52 @@ class MultipleLinearRegression extends Predictor[MultipleLinearRegression] {
         }
       }
    +
    +  override def toPMML(): PMML = {
    +    weightsOption match {
    +      case None => {
    +        throw new RuntimeException("The MultipleLinearRegression has not been fitted to the " +
    +          "data. This is necessary to learn the weight vector of the linear function.")
    +      }
    +      case Some(weights) => {
    +        val model = weights.collect().head
    +        val pmml = new PMML()
    +        pmml.setHeader(new Header().setDescription("Multiple Linear Regression"))
    +
    +        // define the fields
    +        val target = FieldName.create("prediction")
    +        val fields = scala.Array.ofDim[FieldName](model.weights.size)
    +        Range(0, model.weights.size).foreach(index =>
    +          fields(index) = FieldName.create("field_" + index)
    +        )
    +
    +        // define the data dictionary, mining schema and regression table
    +        val dictionary = new DataDictionary()
    +        val miningSchema = new MiningSchema()
    +        val regressionTable = new RegressionTable().setIntercept(model.intercept)
    +        Range(0, model.weights.size).foreach(index => {
    +          miningSchema.addMiningFields(
    +            new MiningField(fields(index)).setUsageType(FieldUsageType.ACTIVE)
    +          )
    +          regressionTable.addNumericPredictors(
    +            new NumericPredictor(fields(index), model.weights(index))
    +          )
    +          dictionary.addDataFields(
    +            new DataField(fields(index), OpType.CONTINUOUS, DataType.DOUBLE)
    +          )
    +        })
    +        dictionary.addDataFields(new DataField(target, OpType.CONTINUOUS, DataType.DOUBLE))
    +        miningSchema.addMiningFields(new MiningField(target).setUsageType(FieldUsageType.PREDICTED))
    +
    +        // define the model
    +        val pmmlModel = new RegressionModel()
    +          .setFunctionName(MiningFunctionType.REGRESSION)
    --- End diff --

Maybe we should add `.setModelType(RegressionModel.ModelType.LINEAR_REGRESSION)` after this line, to prepare for other regression models in the future.
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/1186#discussion_r41497773

    --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/regression/MultipleLinearRegression.scala ---
    @@ -124,6 +121,52 @@ class MultipleLinearRegression extends Predictor[MultipleLinearRegression] {
         }
       }
    +
    +  override def toPMML(): PMML = {
    +    weightsOption match {
    +      case None => {
    +        throw new RuntimeException("The MultipleLinearRegression has not been fitted to the " +
    +          "data. This is necessary to learn the weight vector of the linear function.")
    +      }
    +      case Some(weights) => {
    +        val model = weights.collect().head
    +        val pmml = new PMML()
    +        pmml.setHeader(new Header().setDescription("Multiple Linear Regression"))
    +
    +        // define the fields
    +        val target = FieldName.create("prediction")
    +        val fields = scala.Array.ofDim[FieldName](model.weights.size)
    +        Range(0, model.weights.size).foreach(index =>
    +          fields(index) = FieldName.create("field_" + index)
    +        )
    +
    +        // define the data dictionary, mining schema and regression table
    +        val dictionary = new DataDictionary()
    +        val miningSchema = new MiningSchema()
    +        val regressionTable = new RegressionTable().setIntercept(model.intercept)
    +        Range(0, model.weights.size).foreach(index => {
    +          miningSchema.addMiningFields(
    +            new MiningField(fields(index)).setUsageType(FieldUsageType.ACTIVE)
    +          )
    +          regressionTable.addNumericPredictors(
    +            new NumericPredictor(fields(index), model.weights(index))
    +          )
    +          dictionary.addDataFields(
    +            new DataField(fields(index), OpType.CONTINUOUS, DataType.DOUBLE)
    +          )
    +        })
    --- End diff --

We can simplify this using the `zipWithIndex` method on `fields`.
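A small sketch of the `zipWithIndex` idea, reduced to the predictor-building part of the loop. `FieldName` and `NumericPredictor` here are hypothetical case classes standing in for the JPMML classes in the diff, so that the example runs on its own; the real code would call the library's constructors and builder methods instead.

```scala
// Hypothetical stand-ins for the JPMML classes used in the diff.
final case class FieldName(value: String)
final case class NumericPredictor(field: FieldName, coefficient: Double)

object ZipWithIndexSketch {
  def predictors(weights: Seq[Double]): Seq[NumericPredictor] = {
    val fields = weights.indices.map(i => FieldName("field_" + i))
    // zipWithIndex pairs each field with its position, so no manual Range is needed.
    fields.zipWithIndex.map { case (field, index) =>
      NumericPredictor(field, weights(index))
    }
  }
}
```

For example, `ZipWithIndexSketch.predictors(Seq(0.5, -1.0))` yields one predictor per weight, named `field_0` and `field_1`.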
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/1186#issuecomment-146570848

Okay. We need some discussion on the mailing list about the ML model import/export feature. I think PMML support is one of the sub-issues of the ML model import/export issue. I'll post the discussion thread in a few days.
GitHub user sachingoel0101 opened a pull request: https://github.com/apache/flink/pull/1186

[FLINK-1966][ml]Add support for Predictive Model Markup Language

1. Adds an interface to allow exporting of models to PMML format.
2. Implements export methods for the existing SVM and Regression algorithms.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sachingoel0101/flink pmml

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1186.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1186

commit a71640edd83b6fd1085935496c1dd2553bd42caa
Author: Sachin Goel
Date: 2015-09-27T13:04:17Z

    [FLINK-1966][ml]Add support for Predictive Model Markup Language