[GitHub] flink pull request: [FLINK-2050] Introduces new pipelining mechani...
Github user thvasilo commented on the pull request: https://github.com/apache/flink/pull/704#issuecomment-104552796 Oh that's true,yes. LGTM then! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-2050] Introduces new pipelining mechani...
Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/704 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-2050] Introduces new pipelining mechani...
Github user tillrohrmann commented on the pull request: https://github.com/apache/flink/pull/704#issuecomment-104546413 That's happening because the `Estimator`, `Transformer` and `Predictor` have a default value for the ParameterMap parameter. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-2050] Introduces new pipelining mechani...
Github user tillrohrmann commented on the pull request: https://github.com/apache/flink/pull/704#issuecomment-104546529 If there are no further objections, then I'll merge this PR today. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-2050] Introduces new pipelining mechani...
Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/704#discussion_r30783065 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/StandardScaler.scala --- @@ -22,38 +22,47 @@ import breeze.linalg import breeze.numerics.sqrt import breeze.numerics.sqrt._ import org.apache.flink.api.common.functions._ +import org.apache.flink.api.common.typeinfo.TypeInformation import org.apache.flink.api.scala._ import org.apache.flink.configuration.Configuration -import org.apache.flink.ml.common.{Parameter, ParameterMap, Transformer} +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} import org.apache.flink.ml.math.Breeze._ -import org.apache.flink.ml.math.Vector +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{TransformOperation, FitOperation, Transformer} import org.apache.flink.ml.preprocessing.StandardScaler.{Mean, Std} +import scala.reflect.ClassTag + /** Scales observations, so that all features have a user-specified mean and standard deviation. * By default for [[StandardScaler]] transformer mean=0.0 and std=1.0. * - * This transformer takes a [[Vector]] of values and maps it to a - * scaled [[Vector]] such that each feature has a user-specified mean and standard deviation. + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature has a user-specified mean and standard + * deviation. * * This transformer can be prepended to all [[Transformer]] and - * [[org.apache.flink.ml.common.Learner]] implementations which expect an input of - * [[Vector]]. + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. * * @example * {{{ *val trainingDS: DataSet[Vector] = env.fromCollection(data) *val transformer = StandardScaler().setMean(10.0).setStd(2.0) * - *transformer.transform(trainingDS) + *transformer.fit(trainingDS) + *val transformedDS = transformer.transform(trainingDS) * }}} * * =Parameters= * - * - [[StandardScaler.Mean]]: The mean value of transformed data set; by default equal to 0 - * - [[StandardScaler.Std]]: The standard deviation of the transformed data set; by default + * - [[Mean]]: The mean value of transformed data set; by default equal to 0 + * - [[Std]]: The standard deviation of the transformed data set; by default * equal to 1 --- End diff -- Why use just the top-level type here, but the fully qualified one in the ALS docstring? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-2050] Introduces new pipelining mechani...
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/704#discussion_r30783847 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/StandardScaler.scala --- @@ -22,38 +22,47 @@ import breeze.linalg import breeze.numerics.sqrt import breeze.numerics.sqrt._ import org.apache.flink.api.common.functions._ +import org.apache.flink.api.common.typeinfo.TypeInformation import org.apache.flink.api.scala._ import org.apache.flink.configuration.Configuration -import org.apache.flink.ml.common.{Parameter, ParameterMap, Transformer} +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} import org.apache.flink.ml.math.Breeze._ -import org.apache.flink.ml.math.Vector +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{TransformOperation, FitOperation, Transformer} import org.apache.flink.ml.preprocessing.StandardScaler.{Mean, Std} +import scala.reflect.ClassTag + /** Scales observations, so that all features have a user-specified mean and standard deviation. * By default for [[StandardScaler]] transformer mean=0.0 and std=1.0. * - * This transformer takes a [[Vector]] of values and maps it to a - * scaled [[Vector]] such that each feature has a user-specified mean and standard deviation. + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature has a user-specified mean and standard + * deviation. * * This transformer can be prepended to all [[Transformer]] and - * [[org.apache.flink.ml.common.Learner]] implementations which expect an input of - * [[Vector]]. + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. * * @example * {{{ *val trainingDS: DataSet[Vector] = env.fromCollection(data) *val transformer = StandardScaler().setMean(10.0).setStd(2.0) * - *transformer.transform(trainingDS) + *transformer.fit(trainingDS) + *val transformedDS = transformer.transform(trainingDS) * }}} * * =Parameters= * - * - [[StandardScaler.Mean]]: The mean value of transformed data set; by default equal to 0 - * - [[StandardScaler.Std]]: The standard deviation of the transformed data set; by default + * - [[Mean]]: The mean value of transformed data set; by default equal to 0 + * - [[Std]]: The standard deviation of the transformed data set; by default * equal to 1 --- End diff -- IntelliJ did it that way. I haven't figured out yet whether the full path is necessary or not. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-2050] Introduces new pipelining mechani...
Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/704#issuecomment-104186229 Documentation :white_check_mark: Scaladocs :white_check_mark: Tests :white_check_mark: No code outside of flink-ml touched :white_check_mark: :ship: it --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-2050] Introduces new pipelining mechani...
Github user thvasilo commented on the pull request: https://github.com/apache/flink/pull/704#issuecomment-104248691 LGTM as well. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-2050] Introduces new pipelining mechani...
GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/704 [FLINK-2050] Introduces new pipelining mechanism for FlinkML This PR introduces the new pipelining mechanism for FlinkML. In order to make pipeline applicable to different input types, the algorithm logic and the state of the pipeline operator have been separated. The logic is now kept in implicit values which are automatically selected by the Scala compiler based on the input and output types of the pipeline operators and the input data. The operator itself keeps now the model data which is trained in the fit phase. Thus, there is no longer a distinct model which is returned from the algorithm. The pipelining allows, for example, a pipeline which scales vectors to work on the `Vector` type as well as `LabeledVector` type even though both types are not related. The only requirement is that implicit values implementing the algorithm are available. This approach is similar to the mechanism which can be found in the Breeze library. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tillrohrmann/flink pipeline Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/704.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #704 commit 4e5118b10cb7525e19147d49a5fdc6da3aae639c Author: Till Rohrmann trohrm...@apache.org Date: 2015-05-05T13:04:32Z [FLINK-2050] [ml] Introduces new pipelining mechanism using implicit classes to wrap the algorithm logic commit da7d0bfe3a0780b386fcb9b0640513c32ee7bbab Author: Till Rohrmann trohrm...@apache.org Date: 2015-05-20T11:49:52Z [FLINK-2050] [ml] Ports existing ML algorithms to new pipeline mechanism Adds pipeline comments Adds pipeline IT case --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---