[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user thvasilo commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-132552269 Sounds good, just make sure to update the title and description of both PRs to reflect the current state --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-132512002 Okay. Done. I will modify the `createContinuousHistogram` to use `DataStats` in #1032, since that is going to be merged after this anyway. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37390659 --- Diff: docs/libs/ml/statistics.md --- @@ -0,0 +1,98 @@ +--- +mathjax: include +htmlTitle: FlinkML - Statistics +title: a href=../mlFlinkML/a - Statistics +--- +!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +License); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +-- + +* This will be replaced by the TOC +{:toc} + +## Description + + The statistics utility provides features such as building histograms over data, determining + mean, variance, gini impurity, entropy etc. of data. + +## Methods + + The Statistics utility provides two major functions: `createHistogram` and `dataStats`. + +### Creating a histogram + + There are two types of histograms: + 1. **Continuous Histograms**: These histograms are formed on a data set `X: DataSet[Double]` + when the values in `X` are from a continuous range. These histograms support + `quantile` and `sum` operations. Here `quantile(q)` refers to a value $x_q$ such that $|x: x + \leq x_q| = q * |X|$. Further, `sum(s)` refers to the number of elements $x \leq s$, which can +be construed as a cumulative probability value at $s$[Of course, *scaled* probability]. + 2. A continuous histogram can be formed by calling `X.createHistogram(b)` where `b` is the +number of bins. +**Discrete Histograms**: These histograms are formed on a data set `X:DataSet[Double]` +when the values in `X` are from a discrete distribution. These histograms +support `count(c)` operation which returns the number of elements associated with cateogry `c`. +br +A discrete histogram can be formed by calling `MLUtils.createDiscreteHistogram(X)`. + +### Data Statistics + + The `dataStats` function operates on a data set `X: DataSet[Vector]` and returns column-wise + statistics for `X`. Every field of `X` is allowed to be defined as either *discrete* or + *continuous*. + br + Statistics can be evaluated by calling `DataStats.dataStats(X)` or + `DataStats.dataStats(X, discreteFields`). The latter is used when some fields are needed to be --- End diff -- Wrong position of backtick symbol. \`DataStats.dataStats(X, discreteFields)\` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-132494738 Good! But because there is another issue (FLINK-2379) covered the column-wise statistics, I think that spliting them would be better. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-132128284 +1 for having two different methods by return type but we need more comments from @tillrohrmann, @thvasilo or other people because I'm not sure this is best approach. Would be okay the method names are `createContinuousHistogram` and `createCategoricalHistogram` if we decide create two methods? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user thvasilo commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-132146208 Sounds good. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-132170286 @thvasilo , @chiwanpark , I've made the required changes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-132118352 What about having two different functions? One for Discrete and one for continuous? Or perhaps just one for the Continuous as that is more likely to be used. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-132134747 Okay. That sounds good enough. Can we make a final decision then? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-132116059 I think that we can merge this PR after we decide the return type of `createHistogram` method. Any other points seem okay. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-132187018 Olrite. I'll be back in a few. :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-132183402 Is there the valid reason to have multiple names ('discrete histogram', 'categorical histogram')? I think that to avoid confusion, we need to unify the names. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-132188989 @chiwanpark , modified the names. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-132186504 I'm also inclined to use Discrete. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-132193111 They're not coupled at all. But both are related to statistics over data sets. This is why I combined them both in one. If you're wondering, there is a JIRA for the column wise statistics as well. [FLINK-2379]. Unless it is absolutely necessary, I'd like to keep them both in one. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-132439907 Including the collect step in the documentation seems weird, IMO. Since the return type is a `DataSet[Histogram]`, anyone familiar with flink will know to collect it first before operating on the element itself. Of course, one other option would be actually collect and return the histogram in the `create` functions, but we don't wanna do that in case it is used inside an iteration. Should I include the collect step in the documentation then? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-132430686 The only problem to merge this is invalid example in documentation. We need calling `collect()` method and `apply(0)` to execute statistics methods such as `quantile`, `sum`, ..., etc.. Other things seem okay. :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-132449023 Hi @chiwanpark , I have modified the documentation to include the collect step. I have also made a few modification to how the histogram is created by using a `MapPartition` function instead of `Map`. Also, for more consistency, changed the signature of `DataStats` creation to return `DataSet[DataStats]` instead of `DataStats`. @thvasilo , I have integrated the statistics part with this. I hope that is alright. Please have a look at `MLUtils.scala` to see the usage. If it's not, I'd be happy to remove it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-132209035 I have kept the documentation for statistics in this however. We can merge #1032 first and then this to maintain the illusion of cause-effect. :') --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-132208131 @thvasilo , @chiwanpark , I've split this into two parts. The statistics part is now in #1032 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-132264800 Haha. It's no problem. I've updated the documentation for this. :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user thvasilo commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-132261492 Sorry to be a nitpick, but if we are going to split the PR then the documentation should be split accordingly, we can merge the column-wise statistics once this one is merged. I know this is taking very long to merge, but it is better if we do things properly and not rush them. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37143932 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/FieldStats.scala --- @@ -0,0 +1,119 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.statistics + +import org.apache.flink.ml.statistics.FieldType._ + +import scala.collection.mutable + +/** Class to represent Field statistics. + * + * =Parameters= + * -[[fieldType]]: + * Type of this field, [[DISCRETE]] or [[CONTINUOUS]] + * + * For [[DISCRETE]] fields, [[entropy]], [[gini]] and [[categoryCounts]] are provided. + * For [[CONTINUOUS]] fields, [[min]], [[max]], [[mean]] and [[variance]] are provided. + * + */ +class FieldStats(val fieldType: FieldType) extends Serializable { + // field parameters + private [statistics] var _min: Double = _ + private [statistics] var _max: Double = _ + private [statistics] var _mean: Double = _ + private [statistics] var _variance: Double = _ + private [statistics] var _counts: mutable.HashMap[Double,Int] = _ + --- End diff -- I meant that the space between `private` and `[statistics]` should be removed. It is just cosmetic issue but to keep unified code style, we need fix this. :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-131512180 @sachingoel0101 There is no need to forgive. I just reviewed this PR and left my opinions. :) If you feel that my comments is aggressive, I'm sorry about that. About cosmetic issues, It seems okay except `MLUtils.createHistogram` method. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37144073 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/MLUtils.scala --- @@ -119,4 +120,31 @@ object MLUtils { stringRepresentation.writeAsText(filePath) } + + /** Create a [[DataSet]] of [[OnlineHistogram]] from the input data +* +* @param bins Number of bins required. Zero for [[CategoricalHistogram]] +* @param data input [[DataSet]] of [[Double]] +* @return [[DataSet]] of [[OnlineHistogram]] +*/ + private [ml] def createHistogram(data: DataSet[Double], bins: Int): DataSet[OnlineHistogram] = { +val min = data.reduce((x, y) = Math.min(x, y)) +val max = data.reduce((x, y) = Math.max(x, y)) + +val stats = min.mapWithBcVariable(max){ + (minimum,maximum) = (minimum - 2 * (maximum - minimum), maximum + 2 * (maximum - minimum)) +} +val ret: DataSet[OnlineHistogram] = data.mapWithBcVariable(stats){ (x, statistics) = + if(bins 0){ +val h = new ContinuousHistogram(1, statistics._1, statistics._2) +h.loadData(Array((x, 1))) +h + } else{ --- End diff -- Need a space between `else` and `{`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37144068 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/MLUtils.scala --- @@ -119,4 +120,31 @@ object MLUtils { stringRepresentation.writeAsText(filePath) } + + /** Create a [[DataSet]] of [[OnlineHistogram]] from the input data +* +* @param bins Number of bins required. Zero for [[CategoricalHistogram]] +* @param data input [[DataSet]] of [[Double]] +* @return [[DataSet]] of [[OnlineHistogram]] +*/ + private [ml] def createHistogram(data: DataSet[Double], bins: Int): DataSet[OnlineHistogram] = { +val min = data.reduce((x, y) = Math.min(x, y)) +val max = data.reduce((x, y) = Math.max(x, y)) + +val stats = min.mapWithBcVariable(max){ + (minimum,maximum) = (minimum - 2 * (maximum - minimum), maximum + 2 * (maximum - minimum)) +} +val ret: DataSet[OnlineHistogram] = data.mapWithBcVariable(stats){ (x, statistics) = + if(bins 0){ --- End diff -- `if (bins 0) {` would be better. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-131518423 The occasional space issues are because I stopped using Idea's reformat tool. It sometimes messes up the indentations. :') And no. Your comments weren't aggressive. I was apologizing for the trivial oversights in my code. Addressed the latest comments too. :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-131523559 Fixed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-131509499 Hi @chiwanpark , I've fixed the *cosmetic* issues. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37144130 --- Diff: docs/libs/ml/statistics.md --- @@ -0,0 +1,100 @@ +--- +mathjax: include +htmlTitle: FlinkML - Statistics +title: a href=../mlFlinkML/a - Statistics +--- +!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +License); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +-- + +* This will be replaced by the TOC +{:toc} + +## Description + + The statistics utility provides features such as building histograms over data, determining + mean, variance, gini impurity, entropy etc. of data. + +## Methods + + The Statistics utility provides two major functions: `createHistogram` and `dataStats`. + +### Creating a histogram + + There are two types of histograms: + 1. strongContinuous Histograms/strong: These histograms are formed on a data set `X: --- End diff -- `strong` tag can be replaced by `**`. `**Continuous Histograms**` is same as `strongContinuous Histograms/strong`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-131514958 I found some points to improve and added line notes. :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37144066 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/MLUtils.scala --- @@ -119,4 +120,31 @@ object MLUtils { stringRepresentation.writeAsText(filePath) } + + /** Create a [[DataSet]] of [[OnlineHistogram]] from the input data +* +* @param bins Number of bins required. Zero for [[CategoricalHistogram]] +* @param data input [[DataSet]] of [[Double]] +* @return [[DataSet]] of [[OnlineHistogram]] +*/ + private [ml] def createHistogram(data: DataSet[Double], bins: Int): DataSet[OnlineHistogram] = { --- End diff -- `private[ml]` would be better. :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37144175 --- Diff: docs/libs/ml/statistics.md --- @@ -0,0 +1,100 @@ +--- +mathjax: include +htmlTitle: FlinkML - Statistics +title: a href=../mlFlinkML/a - Statistics +--- +!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +License); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +-- + +* This will be replaced by the TOC +{:toc} + +## Description + + The statistics utility provides features such as building histograms over data, determining + mean, variance, gini impurity, entropy etc. of data. + +## Methods + + The Statistics utility provides two major functions: `createHistogram` and `dataStats`. + +### Creating a histogram + + There are two types of histograms: + 1. strongContinuous Histograms/strong: These histograms are formed on a data set `X: + DataSet[Double]` + when the values in `X` are from a continuous range. These histograms support + `quantile` and `sum` operations. Here `quantile(q)` refers to a value $x_q$ such that $|x: x + \leq x_q| = q * |X|$. Further, `sum(s)` refers to the number of elements $x \leq s$, which can +be construed as a cumulative probability value at $s$[Of course, iscaled/i probability]. + br + 2. A continuous histogram can be formed by calling `X.createHistogram(b)` where `b` is the +number of bins. +strongCategorical Histograms/strong: These histograms are formed on a data set `X:DataSet[Double]` +when the values in `X` are from a discrete distribution. These histograms +support `count(c)` operation which returns the number of elements associated with cateogry `c`. +br +A categorical histogram can be formed by calling `X.createHistogram(0)`. + +### Data Statistics + + The `dataStats` function operates on a data set `X: DataSet[Vector]` and returns column-wise + statistics for `X`. Every field of `X` is allowed to be defined as either idiscrete/i or + icontinuous/i. --- End diff -- `i` tag --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37144173 --- Diff: docs/libs/ml/statistics.md --- @@ -0,0 +1,100 @@ +--- +mathjax: include +htmlTitle: FlinkML - Statistics +title: a href=../mlFlinkML/a - Statistics +--- +!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +License); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +-- + +* This will be replaced by the TOC +{:toc} + +## Description + + The statistics utility provides features such as building histograms over data, determining + mean, variance, gini impurity, entropy etc. of data. + +## Methods + + The Statistics utility provides two major functions: `createHistogram` and `dataStats`. + +### Creating a histogram + + There are two types of histograms: + 1. strongContinuous Histograms/strong: These histograms are formed on a data set `X: + DataSet[Double]` + when the values in `X` are from a continuous range. These histograms support + `quantile` and `sum` operations. Here `quantile(q)` refers to a value $x_q$ such that $|x: x + \leq x_q| = q * |X|$. Further, `sum(s)` refers to the number of elements $x \leq s$, which can +be construed as a cumulative probability value at $s$[Of course, iscaled/i probability]. + br + 2. A continuous histogram can be formed by calling `X.createHistogram(b)` where `b` is the +number of bins. +strongCategorical Histograms/strong: These histograms are formed on a data set `X:DataSet[Double]` +when the values in `X` are from a discrete distribution. These histograms +support `count(c)` operation which returns the number of elements associated with cateogry `c`. +br +A categorical histogram can be formed by calling `X.createHistogram(0)`. + +### Data Statistics + + The `dataStats` function operates on a data set `X: DataSet[Vector]` and returns column-wise + statistics for `X`. Every field of `X` is allowed to be defined as either idiscrete/i or --- End diff -- `i` tag --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37144171 --- Diff: docs/libs/ml/statistics.md --- @@ -0,0 +1,100 @@ +--- +mathjax: include +htmlTitle: FlinkML - Statistics +title: a href=../mlFlinkML/a - Statistics +--- +!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +License); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +-- + +* This will be replaced by the TOC +{:toc} + +## Description + + The statistics utility provides features such as building histograms over data, determining + mean, variance, gini impurity, entropy etc. of data. + +## Methods + + The Statistics utility provides two major functions: `createHistogram` and `dataStats`. + +### Creating a histogram + + There are two types of histograms: + 1. strongContinuous Histograms/strong: These histograms are formed on a data set `X: + DataSet[Double]` + when the values in `X` are from a continuous range. These histograms support + `quantile` and `sum` operations. Here `quantile(q)` refers to a value $x_q$ such that $|x: x + \leq x_q| = q * |X|$. Further, `sum(s)` refers to the number of elements $x \leq s$, which can +be construed as a cumulative probability value at $s$[Of course, iscaled/i probability]. + br + 2. A continuous histogram can be formed by calling `X.createHistogram(b)` where `b` is the +number of bins. +strongCategorical Histograms/strong: These histograms are formed on a data set `X:DataSet[Double]` --- End diff -- `strong` tag --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37144183 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/CategoricalHistogram.scala --- @@ -0,0 +1,129 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.statistics + +import scala.collection.mutable + +/** Implementation of a discrete valued online histogram + * + * =Parameters= + * -[[numCategories]]: + * Number of categories in the histogram + */ +case class CategoricalHistogram(numCategories: Int) extends OnlineHistogram { + + require(numCategories 0, Capacity must be greater than zero) + val data = new mutable.HashMap[Double, Int]() + + /** Number of categories in the histogram +* +* @return number of categories +*/ + override def bins: Int = { +numCategories + } + + /** Increment count of category c +* +* @param c category whose count needs to be incremented +*/ + override def add(c: Double): Unit = { +data.get(c) match{ + case None = +require(data.size numCategories, Insufficient capacity. Failed to add.) +data.put(c, 1) + case Some(value) = +data.update(c, value + 1) +} + } + + /** Merges the histogram with h and returns a new histogram +* +* @param h histogram to be merged +* @param B number of categories in the resultant histogram. +* (Default: ```0```, number of categories will be the size of union of categories in +* both histograms) +* @return Merged histogram with capacity B +*/ + override def merge(h: OnlineHistogram, B: Int = 0): CategoricalHistogram = { +h match { + case h1: CategoricalHistogram = { +val finalMap = new mutable.HashMap[Double, Int]() +data.iterator.foreach(x = finalMap.put(x._1, x._2)) +h1.data.iterator.foreach(x = { + finalMap.get(x._1) match{ +case None = finalMap.put(x._1, x._2) +case Some(value) = finalMap.update(x._1, x._2 + value) + } +}) +require(B == 0 || finalMap.size = B, Insufficient capacity. Failed to merge) +var finalSize = finalMap.size +if (B 0) { + finalSize = B +} +val ret = new CategoricalHistogram(finalSize) +ret.loadData(finalMap.toArray) +ret + } + case default = +throw new RuntimeException(Only a categorical histogram is allowed to be merged with a + + categorical histogram) +} + } + + /** Number of elements in category c +* +* @return Number of points in category c +*/ + def count(c: Double): Int = { +data.get(c) match{ --- End diff -- We need a space between `match` and `{`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37144182 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/CategoricalHistogram.scala --- @@ -0,0 +1,129 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.statistics + +import scala.collection.mutable + +/** Implementation of a discrete valued online histogram + * + * =Parameters= + * -[[numCategories]]: + * Number of categories in the histogram + */ +case class CategoricalHistogram(numCategories: Int) extends OnlineHistogram { + + require(numCategories 0, Capacity must be greater than zero) + val data = new mutable.HashMap[Double, Int]() + + /** Number of categories in the histogram +* +* @return number of categories +*/ + override def bins: Int = { +numCategories + } + + /** Increment count of category c +* +* @param c category whose count needs to be incremented +*/ + override def add(c: Double): Unit = { +data.get(c) match{ + case None = +require(data.size numCategories, Insufficient capacity. Failed to add.) +data.put(c, 1) + case Some(value) = +data.update(c, value + 1) +} + } + + /** Merges the histogram with h and returns a new histogram +* +* @param h histogram to be merged +* @param B number of categories in the resultant histogram. +* (Default: ```0```, number of categories will be the size of union of categories in +* both histograms) +* @return Merged histogram with capacity B +*/ + override def merge(h: OnlineHistogram, B: Int = 0): CategoricalHistogram = { +h match { + case h1: CategoricalHistogram = { +val finalMap = new mutable.HashMap[Double, Int]() +data.iterator.foreach(x = finalMap.put(x._1, x._2)) +h1.data.iterator.foreach(x = { + finalMap.get(x._1) match{ +case None = finalMap.put(x._1, x._2) +case Some(value) = finalMap.update(x._1, x._2 + value) + } +}) +require(B == 0 || finalMap.size = B, Insufficient capacity. Failed to merge) +var finalSize = finalMap.size +if (B 0) { + finalSize = B +} --- End diff -- We can change `finalSize` from mutable to immutable. ```scala val finalSize = if (B 0) B else finalMap.size ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37144195 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/CategoricalHistogram.scala --- @@ -0,0 +1,129 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.statistics + +import scala.collection.mutable + +/** Implementation of a discrete valued online histogram + * + * =Parameters= + * -[[numCategories]]: + * Number of categories in the histogram + */ +case class CategoricalHistogram(numCategories: Int) extends OnlineHistogram { + + require(numCategories 0, Capacity must be greater than zero) + val data = new mutable.HashMap[Double, Int]() + + /** Number of categories in the histogram +* +* @return number of categories +*/ + override def bins: Int = { +numCategories + } + + /** Increment count of category c +* +* @param c category whose count needs to be incremented +*/ + override def add(c: Double): Unit = { +data.get(c) match{ + case None = +require(data.size numCategories, Insufficient capacity. Failed to add.) +data.put(c, 1) + case Some(value) = +data.update(c, value + 1) +} + } + + /** Merges the histogram with h and returns a new histogram +* +* @param h histogram to be merged +* @param B number of categories in the resultant histogram. +* (Default: ```0```, number of categories will be the size of union of categories in +* both histograms) +* @return Merged histogram with capacity B +*/ + override def merge(h: OnlineHistogram, B: Int = 0): CategoricalHistogram = { +h match { + case h1: CategoricalHistogram = { +val finalMap = new mutable.HashMap[Double, Int]() +data.iterator.foreach(x = finalMap.put(x._1, x._2)) +h1.data.iterator.foreach(x = { + finalMap.get(x._1) match{ --- End diff -- A space is needed between `match` and `{`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37144192 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/CategoricalHistogram.scala --- @@ -0,0 +1,129 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.statistics + +import scala.collection.mutable + +/** Implementation of a discrete valued online histogram + * + * =Parameters= + * -[[numCategories]]: + * Number of categories in the histogram + */ +case class CategoricalHistogram(numCategories: Int) extends OnlineHistogram { + + require(numCategories 0, Capacity must be greater than zero) + val data = new mutable.HashMap[Double, Int]() + + /** Number of categories in the histogram +* +* @return number of categories +*/ + override def bins: Int = { +numCategories + } + + /** Increment count of category c +* +* @param c category whose count needs to be incremented +*/ + override def add(c: Double): Unit = { +data.get(c) match{ --- End diff -- A space is needed between `match` and `{`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37144137 --- Diff: docs/libs/ml/statistics.md --- @@ -0,0 +1,100 @@ +--- +mathjax: include +htmlTitle: FlinkML - Statistics +title: a href=../mlFlinkML/a - Statistics +--- +!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +License); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +-- + +* This will be replaced by the TOC +{:toc} + +## Description + + The statistics utility provides features such as building histograms over data, determining + mean, variance, gini impurity, entropy etc. of data. + +## Methods + + The Statistics utility provides two major functions: `createHistogram` and `dataStats`. + +### Creating a histogram + + There are two types of histograms: + 1. strongContinuous Histograms/strong: These histograms are formed on a data set `X: + DataSet[Double]` + when the values in `X` are from a continuous range. These histograms support + `quantile` and `sum` operations. Here `quantile(q)` refers to a value $x_q$ such that $|x: x + \leq x_q| = q * |X|$. Further, `sum(s)` refers to the number of elements $x \leq s$, which can +be construed as a cumulative probability value at $s$[Of course, iscaled/i probability]. --- End diff -- `i` tag can be replace by `*`. `iscaled/i` can be represented as `*scaled*`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-131520764 I just found that `Predef.assume` could be replaced by `assume`. Sorry for wrong guidance. Could you change `Predef.assume` to `assume`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user sachingoel0101 commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37141049 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/FieldStats.scala --- @@ -0,0 +1,119 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.statistics + +import org.apache.flink.ml.statistics.FieldType._ + +import scala.collection.mutable + +/** Class to represent Field statistics. + * + * =Parameters= + * -[[fieldType]]: + * Type of this field, [[DISCRETE]] or [[CONTINUOUS]] + * + * For [[DISCRETE]] fields, [[entropy]], [[gini]] and [[categoryCounts]] are provided. + * For [[CONTINUOUS]] fields, [[min]], [[max]], [[mean]] and [[variance]] are provided. + * + */ +class FieldStats(val fieldType: FieldType) extends Serializable { + // field parameters + private [statistics] var _min: Double = _ + private [statistics] var _max: Double = _ + private [statistics] var _mean: Double = _ + private [statistics] var _variance: Double = _ + private [statistics] var _counts: mutable.HashMap[Double,Int] = _ + + // Access methods // + def min: Double = { +exception(DISCRETE) +_min + } + + def max: Double = { +exception(DISCRETE) +_max + } + + def mean: Double = { +exception(DISCRETE) +_mean + } + + def variance: Double = { +exception(DISCRETE) +_variance + } + + def categoryCounts: mutable.HashMap[Double,Int] = { +exception(CONTINUOUS) +_counts + } + + /** + * Returns the entropy value for this [[DISCRETE]] field. + */ + def entropy: Double = { +exception(CONTINUOUS) +val total: Double = _counts.valuesIterator.sum +3.322 * _counts.iterator.map(x = - (x._2 / total) * Math.log10(x._2 / total)).sum + } + + /** + * Returns the Gini impurity for this [[DISCRETE]] field. + */ + def gini: Double = { +exception(CONTINUOUS) +val total: Double = _counts.valuesIterator.sum +1 - _counts.iterator.map(x = (x._2 * x._2)).sum / (total * total) + } + + //-- Setter methods --// + private [statistics] def setContinuousParameters( --- End diff -- Gini impurity should be accessible to user even outside the package. It is a property of the field itself and we're not using it ourselves anyway. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-131474758 Hi @chiwanpark, thanks for reviewing this. This was my first time working in Scala, so I hope you'll forgive the slight mistakes (oversights, perhaps?). I've tried to address most of your comments and left notes where I was not sure. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user sachingoel0101 commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37141028 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/FieldStats.scala --- @@ -0,0 +1,119 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.statistics + +import org.apache.flink.ml.statistics.FieldType._ + +import scala.collection.mutable + +/** Class to represent Field statistics. + * + * =Parameters= + * -[[fieldType]]: + * Type of this field, [[DISCRETE]] or [[CONTINUOUS]] + * + * For [[DISCRETE]] fields, [[entropy]], [[gini]] and [[categoryCounts]] are provided. + * For [[CONTINUOUS]] fields, [[min]], [[max]], [[mean]] and [[variance]] are provided. + * + */ +class FieldStats(val fieldType: FieldType) extends Serializable { + // field parameters + private [statistics] var _min: Double = _ + private [statistics] var _max: Double = _ + private [statistics] var _mean: Double = _ + private [statistics] var _variance: Double = _ + private [statistics] var _counts: mutable.HashMap[Double,Int] = _ + --- End diff -- Could you clarify this point? They're already package private fields. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user sachingoel0101 commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37141076 --- Diff: docs/libs/ml/statistics.md --- @@ -0,0 +1,108 @@ +--- +mathjax: include +htmlTitle: FlinkML - Statistics +title: a href=../mlFlinkML/a - Statistics +--- +!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +License); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +-- + +* This will be replaced by the TOC +{:toc} + +## Description + + The statistics utility provides features such as building histograms over data, determining + mean, variance, gini impurity, entropy etc. of data. + +## Methods + + The Statistics utility provides two major functions: `createHistogram` and `dataStats`. + +### Creating a histogram + + There are two types of histograms: + ul + li + strongContinuous Histograms/strong: These histograms are formed on a data set `X: DataSet[Double]` + when the values in `X` are from a continuous range. These histograms support + `quantile` and `sum` operations. Here `quantile(q)` refers to a value $x_q$ such that $|x: x + \leq x_q| = q * |X|$. Further, `sum(s)` refers to the number of elements $x \leq s$, which can +be construed as a cumulative probability value at $s$[Of course, iscaled/i probability]. + br +A continuous histogram can be formed by calling `X.createHistogram(b)` where `b` is the +number of bins. + /li + li +strongCategorical Histograms/strong: These histograms are formed on a data set `X:DataSet[Double]` +when the values in `X` are from a discrete distribution. These histograms +support `count(c)` operation which returns the number of elements associated with cateogry `c`. +br +A categorical histogram can be formed by calling `X.createHistogram(0)`. + /li + /ul + +### Data Statistics + + The `dataStats` function operates on a data set `X: DataSet[Vector]` and returns column-wise + statistics for `X`. Every field of `X` is allowed to be defined as either idiscrete/i or + icontinuous/i. + br + Statistics can be evaluated by calling `DataStats.dataStats(X)` or + `DataStats.dataStats(X, discreteFields`). The latter is used when some fields are needed to be + declared discrete-valued, and is provided as an array of indices of fields which are discrete. + br + The following information is available as part of `DataStats`: + ul +liNumber of elements in `X`/li +liDimension of `X`/li +liColumn-wise statistics where for discrete fields, we report counts for each category, and + the Gini impurity and Entropy of the field, while for continuous fields, we report the + minimum, maximum, mean and variance. +/li + /ul + +## Examples + +{% highlight scala %} + +import org.apache.flink.ml.statistics._ +import org.apache.flink.ml._ + +val X: DataSet[Double] = ... +// Create continuous histogram +val histogram = X.createHistogram(5) // creates a histogram with five bins +histogram.quantile(0.3) // returns the 30th quantile +histogram.sum(4) // returns number of elements less than 4 --- End diff -- You're right. I'm wondering if we should provide two methods for this purpose. Otherwise user will need to cast into the appropriate class. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37137570 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/MLUtils.scala --- @@ -119,4 +120,30 @@ object MLUtils { stringRepresentation.writeAsText(filePath) } + + /** Create a [[DataSet]] of [[OnlineHistogram]] from the input data +* +* @param bins Number of bins required. Zero for [[CategoricalHistogram]] +* @param data input [[DataSet]] of [[Double]] +* @return [[DataSet]] of [[OnlineHistogram]] +*/ + private [ml] def createHistogram(data: DataSet[Double], bins: Int): DataSet[OnlineHistogram] = { +val min = data.map(x = x).reduce((x,y) = Math.min(x,y)) +val max = data.map(x = x).reduce((x,y) = Math.max(x,y)) --- End diff -- We don't need `map(x = x)` operation and need space between `x,` and `y`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37137605 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/CategoricalHistogram.scala --- @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.statistics + +import scala.collection.mutable + +/** Implementation of a discrete valued online histogram + * + * =Parameters= + * -[[numCategories]]: + * Number of categories in the histogram + */ +case class CategoricalHistogram(numCategories: Int) extends OnlineHistogram { + + require(numCategories 0, Capacity must be greater than zero) + val data = new mutable.HashMap[Double, Int]() + + /** Number of categories in the histogram +* +* @return number of categories +*/ + override def bins: Int = { +numCategories + } + + /** Increment count of category c +* +* @param c category whose count needs to be incremented +*/ + override def add(c: Double): Unit = { +if (data.get(c).isEmpty) { + require(data.size numCategories, Insufficient capacity. Failed to add.) + data.put(c, 1) +} else { + data.update(c, data.get(c).get + 1) +} + } + + /** Merges the histogram with h and returns a new histogram +* +* @param h histogram to be merged +* @param B number of categories in the resultant histogram. +* (Default: ```0```, number of categories will be the size of union of categories in +* both histograms) +* @return Merged histogram with capacity B +*/ + override def merge(h: OnlineHistogram, B: Int = 0): CategoricalHistogram = { +h match { + case h1: CategoricalHistogram = { +val finalMap = new mutable.HashMap[Double, Int]() +data.iterator.foreach(x = finalMap.put(x._1, x._2)) +h1.data.iterator.foreach(x = { + if (finalMap.get(x._1).isEmpty) { +finalMap.put(x._1, x._2) + } else { +finalMap.update(x._1, x._2 + finalMap.get(x._1).get) + } +}) --- End diff -- As I said above, we can reduce calling `get(x._1)` by using pattern matching for `finalMap.get(x._1)`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37137592 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/CategoricalHistogram.scala --- @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.statistics + +import scala.collection.mutable + +/** Implementation of a discrete valued online histogram + * + * =Parameters= + * -[[numCategories]]: + * Number of categories in the histogram + */ +case class CategoricalHistogram(numCategories: Int) extends OnlineHistogram { + + require(numCategories 0, Capacity must be greater than zero) + val data = new mutable.HashMap[Double, Int]() + + /** Number of categories in the histogram +* +* @return number of categories +*/ + override def bins: Int = { +numCategories + } + + /** Increment count of category c +* +* @param c category whose count needs to be incremented +*/ + override def add(c: Double): Unit = { +if (data.get(c).isEmpty) { + require(data.size numCategories, Insufficient capacity. Failed to add.) + data.put(c, 1) +} else { + data.update(c, data.get(c).get + 1) +} --- End diff -- If use pattern matching for `data.get(c)`, we can reduce calling `data.get(c)`. For example: ```scala data.get(c) match { case None = require(data.size numCategories, Insufficient capacity. Failed to add.) data.put(c, 1) case Some(v) = data.update(c, v + 1) } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37137631 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/CategoricalHistogram.scala --- @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.statistics + +import scala.collection.mutable + +/** Implementation of a discrete valued online histogram + * + * =Parameters= + * -[[numCategories]]: + * Number of categories in the histogram + */ +case class CategoricalHistogram(numCategories: Int) extends OnlineHistogram { + + require(numCategories 0, Capacity must be greater than zero) + val data = new mutable.HashMap[Double, Int]() + + /** Number of categories in the histogram +* +* @return number of categories +*/ + override def bins: Int = { +numCategories + } + + /** Increment count of category c +* +* @param c category whose count needs to be incremented +*/ + override def add(c: Double): Unit = { +if (data.get(c).isEmpty) { + require(data.size numCategories, Insufficient capacity. Failed to add.) + data.put(c, 1) +} else { + data.update(c, data.get(c).get + 1) +} + } + + /** Merges the histogram with h and returns a new histogram +* +* @param h histogram to be merged +* @param B number of categories in the resultant histogram. +* (Default: ```0```, number of categories will be the size of union of categories in +* both histograms) +* @return Merged histogram with capacity B +*/ + override def merge(h: OnlineHistogram, B: Int = 0): CategoricalHistogram = { +h match { + case h1: CategoricalHistogram = { +val finalMap = new mutable.HashMap[Double, Int]() +data.iterator.foreach(x = finalMap.put(x._1, x._2)) +h1.data.iterator.foreach(x = { + if (finalMap.get(x._1).isEmpty) { +finalMap.put(x._1, x._2) + } else { +finalMap.update(x._1, x._2 + finalMap.get(x._1).get) + } +}) +require(B == 0 || finalMap.size = B, Insufficient capacity. Failed to merge) +val countBuffer = new mutable.ArrayBuffer[(Double, Int)]() +finalMap.iterator.foreach(x = countBuffer += ((x._1, x._2))) +var finalSize = finalMap.size +if (B 0) { + finalSize = B +} +val ret = new CategoricalHistogram(finalSize) +ret.loadData(countBuffer.toArray) --- End diff -- Why the data in `finalMap` should be copied to `countBuffer`? ```scala val finalSize = if (B 0) B else finalMap.size val ret = new CategoricalHistogram(finalSize) ret.loadData(finalMap.toArray) ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37137647 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/CategoricalHistogram.scala --- @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.statistics + +import scala.collection.mutable + +/** Implementation of a discrete valued online histogram + * + * =Parameters= + * -[[numCategories]]: + * Number of categories in the histogram + */ +case class CategoricalHistogram(numCategories: Int) extends OnlineHistogram { + + require(numCategories 0, Capacity must be greater than zero) + val data = new mutable.HashMap[Double, Int]() + + /** Number of categories in the histogram +* +* @return number of categories +*/ + override def bins: Int = { +numCategories + } + + /** Increment count of category c +* +* @param c category whose count needs to be incremented +*/ + override def add(c: Double): Unit = { +if (data.get(c).isEmpty) { + require(data.size numCategories, Insufficient capacity. Failed to add.) + data.put(c, 1) +} else { + data.update(c, data.get(c).get + 1) +} + } + + /** Merges the histogram with h and returns a new histogram +* +* @param h histogram to be merged +* @param B number of categories in the resultant histogram. +* (Default: ```0```, number of categories will be the size of union of categories in +* both histograms) +* @return Merged histogram with capacity B +*/ + override def merge(h: OnlineHistogram, B: Int = 0): CategoricalHistogram = { +h match { + case h1: CategoricalHistogram = { +val finalMap = new mutable.HashMap[Double, Int]() +data.iterator.foreach(x = finalMap.put(x._1, x._2)) +h1.data.iterator.foreach(x = { + if (finalMap.get(x._1).isEmpty) { +finalMap.put(x._1, x._2) + } else { +finalMap.update(x._1, x._2 + finalMap.get(x._1).get) + } +}) +require(B == 0 || finalMap.size = B, Insufficient capacity. Failed to merge) +val countBuffer = new mutable.ArrayBuffer[(Double, Int)]() +finalMap.iterator.foreach(x = countBuffer += ((x._1, x._2))) +var finalSize = finalMap.size +if (B 0) { + finalSize = B +} +val ret = new CategoricalHistogram(finalSize) +ret.loadData(countBuffer.toArray) +ret + } + case default = +throw new RuntimeException(Only a categorical histogram is allowed to be merged with a + + categorical histogram) +} + } + + /** Number of elements in category c +* +* @return Number of points in category c +*/ + def count(c: Double): Int = { +if (data.get(c).isEmpty) { + return 0 +} +data.get(c).get + } --- End diff -- I think following is better than current implementation. ```scala def count(c: Double): Int = data.get(c) match { case None = 0 case Some(v) = v } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37137658 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/CategoricalHistogram.scala --- @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.statistics + +import scala.collection.mutable + +/** Implementation of a discrete valued online histogram + * + * =Parameters= + * -[[numCategories]]: + * Number of categories in the histogram + */ +case class CategoricalHistogram(numCategories: Int) extends OnlineHistogram { + + require(numCategories 0, Capacity must be greater than zero) + val data = new mutable.HashMap[Double, Int]() + + /** Number of categories in the histogram +* +* @return number of categories +*/ + override def bins: Int = { +numCategories + } + + /** Increment count of category c +* +* @param c category whose count needs to be incremented +*/ + override def add(c: Double): Unit = { +if (data.get(c).isEmpty) { + require(data.size numCategories, Insufficient capacity. Failed to add.) + data.put(c, 1) +} else { + data.update(c, data.get(c).get + 1) +} + } + + /** Merges the histogram with h and returns a new histogram +* +* @param h histogram to be merged +* @param B number of categories in the resultant histogram. +* (Default: ```0```, number of categories will be the size of union of categories in +* both histograms) +* @return Merged histogram with capacity B +*/ + override def merge(h: OnlineHistogram, B: Int = 0): CategoricalHistogram = { +h match { + case h1: CategoricalHistogram = { +val finalMap = new mutable.HashMap[Double, Int]() +data.iterator.foreach(x = finalMap.put(x._1, x._2)) +h1.data.iterator.foreach(x = { + if (finalMap.get(x._1).isEmpty) { +finalMap.put(x._1, x._2) + } else { +finalMap.update(x._1, x._2 + finalMap.get(x._1).get) + } +}) +require(B == 0 || finalMap.size = B, Insufficient capacity. Failed to merge) +val countBuffer = new mutable.ArrayBuffer[(Double, Int)]() +finalMap.iterator.foreach(x = countBuffer += ((x._1, x._2))) +var finalSize = finalMap.size +if (B 0) { + finalSize = B +} +val ret = new CategoricalHistogram(finalSize) +ret.loadData(countBuffer.toArray) +ret + } + case default = +throw new RuntimeException(Only a categorical histogram is allowed to be merged with a + + categorical histogram) +} + } + + /** Number of elements in category c +* +* @return Number of points in category c +*/ + def count(c: Double): Int = { +if (data.get(c).isEmpty) { + return 0 +} +data.get(c).get + } + + /** Returns the total number of elements in the histogram +* +* @return total number of elements added so far +*/ + override def total: Int = { +var result = 0 +data.valuesIterator.foreach(x = result += x) +result + } --- End diff -- We can implement `total` method without mutable variable. ```scala override def total: Int = data.valuesIterator.sum ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37137718 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala --- @@ -0,0 +1,343 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.statistics + +import scala.Double.MaxValue +import scala.collection.mutable + +/** Implementation of a continuous valued online histogram + * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm + * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf + * + * =Parameters= + * -[[capacity]]: + * Number of bins to be used in the histogram + * + * -[[min]]: + * Lower limit on the elements + * + * -[[max]]: + * Upper limit on the elements + */ +case class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram { + + require(capacity 0, Capacity should be a positive integer) + require(lower upper, Lower must be less than upper) + + val data = new mutable.ArrayBuffer[(Double, Int)]() + + /** Adds a new item to the histogram +* +* @param p value to be added +*/ + override def add(p: Double): Unit = { +require(p lower p upper, p + not in ( + lower + , + upper + )) +// search for the index where the value is just higher than p +val search = find(p) +// add the new value there, shifting everything to the right +data.insert(search, (p, 1)) +// If we're over capacity or any two elements are within 1e-9 of each other, merge. +// This will take care of the case if p was actually equal to some value in the histogram and +// just increment the value there +mergeElements() + } + + /** Merges the histogram with h and returns a histogram with capacity B +* +* @param h histogram to be merged +* @param B capacity of the resultant histogram +* @return Merged histogram with capacity B +*/ + override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = { +h match { + case temp: ContinuousHistogram = { +val m: Int = bins +val n: Int = temp.bins +var i, j: Int = 0 +val mergeList = new mutable.ArrayBuffer[(Double, Int)]() +while (i m || j n) { + if (i = m) { +mergeList += ((temp.getValue(j), temp.getCounter(j))) +j = j + 1 + } else if (j = n || getValue(i) = temp.getValue(j)) { +mergeList += (data.apply(i)) +i = i + 1 + } else { +mergeList += ((temp.getValue(j), temp.getCounter(j))) +j = j + 1 + } +} +// the size will be brought to capacity while constructing the new histogram itself +val finalLower = Math.min(lower, temp.lower) +val finalUpper = Math.max(upper, temp.upper) +val ret = new ContinuousHistogram(B, finalLower, finalUpper) +ret.loadData(mergeList.toArray) +ret + } + case default = +throw new RuntimeException(Only a continuous histogram is allowed to be merged with a + + continuous histogram) + +} + } + + /** Returns the qth quantile of the histogram +* +* @param q Quantile value in (0,1) +* @return Value at quantile q +*/ + def quantile(q: Double): Double = { +require(bins 0, Histogram is empty) +require(q 0 q 1, Quantile must be between 0 and 1) +var total = 0 +for (i - 0 to bins - 1) { + total = total + getCounter(i) +} +val wantedSum = (q * total).round.toInt +var currSum = count(getValue(0)) + +if (wantedSum currSum) { + require(lower -MaxValue, Set a lower bound before
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37137756 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala --- @@ -0,0 +1,343 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.statistics + +import scala.Double.MaxValue +import scala.collection.mutable + +/** Implementation of a continuous valued online histogram + * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm + * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf + * + * =Parameters= + * -[[capacity]]: + * Number of bins to be used in the histogram + * + * -[[min]]: + * Lower limit on the elements + * + * -[[max]]: + * Upper limit on the elements + */ +case class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram { + + require(capacity 0, Capacity should be a positive integer) + require(lower upper, Lower must be less than upper) + + val data = new mutable.ArrayBuffer[(Double, Int)]() + + /** Adds a new item to the histogram +* +* @param p value to be added +*/ + override def add(p: Double): Unit = { +require(p lower p upper, p + not in ( + lower + , + upper + )) +// search for the index where the value is just higher than p +val search = find(p) +// add the new value there, shifting everything to the right +data.insert(search, (p, 1)) +// If we're over capacity or any two elements are within 1e-9 of each other, merge. +// This will take care of the case if p was actually equal to some value in the histogram and +// just increment the value there +mergeElements() + } + + /** Merges the histogram with h and returns a histogram with capacity B +* +* @param h histogram to be merged +* @param B capacity of the resultant histogram +* @return Merged histogram with capacity B +*/ + override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = { +h match { + case temp: ContinuousHistogram = { +val m: Int = bins +val n: Int = temp.bins +var i, j: Int = 0 +val mergeList = new mutable.ArrayBuffer[(Double, Int)]() +while (i m || j n) { + if (i = m) { +mergeList += ((temp.getValue(j), temp.getCounter(j))) +j = j + 1 + } else if (j = n || getValue(i) = temp.getValue(j)) { +mergeList += (data.apply(i)) +i = i + 1 + } else { +mergeList += ((temp.getValue(j), temp.getCounter(j))) +j = j + 1 + } +} +// the size will be brought to capacity while constructing the new histogram itself +val finalLower = Math.min(lower, temp.lower) +val finalUpper = Math.max(upper, temp.upper) +val ret = new ContinuousHistogram(B, finalLower, finalUpper) +ret.loadData(mergeList.toArray) +ret + } + case default = +throw new RuntimeException(Only a continuous histogram is allowed to be merged with a + + continuous histogram) + +} + } + + /** Returns the qth quantile of the histogram +* +* @param q Quantile value in (0,1) +* @return Value at quantile q +*/ + def quantile(q: Double): Double = { +require(bins 0, Histogram is empty) +require(q 0 q 1, Quantile must be between 0 and 1) +var total = 0 +for (i - 0 to bins - 1) { + total = total + getCounter(i) +} +val wantedSum = (q * total).round.toInt +var currSum = count(getValue(0)) + +if (wantedSum currSum) { + require(lower -MaxValue, Set a lower bound before
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37137749 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala --- @@ -0,0 +1,343 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.statistics + +import scala.Double.MaxValue +import scala.collection.mutable + +/** Implementation of a continuous valued online histogram + * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm + * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf + * + * =Parameters= + * -[[capacity]]: + * Number of bins to be used in the histogram + * + * -[[min]]: + * Lower limit on the elements + * + * -[[max]]: + * Upper limit on the elements + */ +case class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram { + + require(capacity 0, Capacity should be a positive integer) + require(lower upper, Lower must be less than upper) + + val data = new mutable.ArrayBuffer[(Double, Int)]() + + /** Adds a new item to the histogram +* +* @param p value to be added +*/ + override def add(p: Double): Unit = { +require(p lower p upper, p + not in ( + lower + , + upper + )) +// search for the index where the value is just higher than p +val search = find(p) +// add the new value there, shifting everything to the right +data.insert(search, (p, 1)) +// If we're over capacity or any two elements are within 1e-9 of each other, merge. +// This will take care of the case if p was actually equal to some value in the histogram and +// just increment the value there +mergeElements() + } + + /** Merges the histogram with h and returns a histogram with capacity B +* +* @param h histogram to be merged +* @param B capacity of the resultant histogram +* @return Merged histogram with capacity B +*/ + override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = { +h match { + case temp: ContinuousHistogram = { +val m: Int = bins +val n: Int = temp.bins +var i, j: Int = 0 +val mergeList = new mutable.ArrayBuffer[(Double, Int)]() +while (i m || j n) { + if (i = m) { +mergeList += ((temp.getValue(j), temp.getCounter(j))) +j = j + 1 + } else if (j = n || getValue(i) = temp.getValue(j)) { +mergeList += (data.apply(i)) +i = i + 1 + } else { +mergeList += ((temp.getValue(j), temp.getCounter(j))) +j = j + 1 + } +} +// the size will be brought to capacity while constructing the new histogram itself +val finalLower = Math.min(lower, temp.lower) +val finalUpper = Math.max(upper, temp.upper) +val ret = new ContinuousHistogram(B, finalLower, finalUpper) +ret.loadData(mergeList.toArray) +ret + } + case default = +throw new RuntimeException(Only a continuous histogram is allowed to be merged with a + + continuous histogram) + +} + } + + /** Returns the qth quantile of the histogram +* +* @param q Quantile value in (0,1) +* @return Value at quantile q +*/ + def quantile(q: Double): Double = { +require(bins 0, Histogram is empty) +require(q 0 q 1, Quantile must be between 0 and 1) +var total = 0 +for (i - 0 to bins - 1) { + total = total + getCounter(i) +} +val wantedSum = (q * total).round.toInt +var currSum = count(getValue(0)) + +if (wantedSum currSum) { + require(lower -MaxValue, Set a lower bound before
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37137752 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala --- @@ -0,0 +1,343 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.statistics + +import scala.Double.MaxValue +import scala.collection.mutable + +/** Implementation of a continuous valued online histogram + * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm + * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf + * + * =Parameters= + * -[[capacity]]: + * Number of bins to be used in the histogram + * + * -[[min]]: + * Lower limit on the elements + * + * -[[max]]: + * Upper limit on the elements + */ +case class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram { + + require(capacity 0, Capacity should be a positive integer) + require(lower upper, Lower must be less than upper) + + val data = new mutable.ArrayBuffer[(Double, Int)]() + + /** Adds a new item to the histogram +* +* @param p value to be added +*/ + override def add(p: Double): Unit = { +require(p lower p upper, p + not in ( + lower + , + upper + )) +// search for the index where the value is just higher than p +val search = find(p) +// add the new value there, shifting everything to the right +data.insert(search, (p, 1)) +// If we're over capacity or any two elements are within 1e-9 of each other, merge. +// This will take care of the case if p was actually equal to some value in the histogram and +// just increment the value there +mergeElements() + } + + /** Merges the histogram with h and returns a histogram with capacity B +* +* @param h histogram to be merged +* @param B capacity of the resultant histogram +* @return Merged histogram with capacity B +*/ + override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = { +h match { + case temp: ContinuousHistogram = { +val m: Int = bins +val n: Int = temp.bins +var i, j: Int = 0 +val mergeList = new mutable.ArrayBuffer[(Double, Int)]() +while (i m || j n) { + if (i = m) { +mergeList += ((temp.getValue(j), temp.getCounter(j))) +j = j + 1 + } else if (j = n || getValue(i) = temp.getValue(j)) { +mergeList += (data.apply(i)) +i = i + 1 + } else { +mergeList += ((temp.getValue(j), temp.getCounter(j))) +j = j + 1 + } +} +// the size will be brought to capacity while constructing the new histogram itself +val finalLower = Math.min(lower, temp.lower) +val finalUpper = Math.max(upper, temp.upper) +val ret = new ContinuousHistogram(B, finalLower, finalUpper) +ret.loadData(mergeList.toArray) +ret + } + case default = +throw new RuntimeException(Only a continuous histogram is allowed to be merged with a + + continuous histogram) + +} + } + + /** Returns the qth quantile of the histogram +* +* @param q Quantile value in (0,1) +* @return Value at quantile q +*/ + def quantile(q: Double): Double = { +require(bins 0, Histogram is empty) +require(q 0 q 1, Quantile must be between 0 and 1) +var total = 0 +for (i - 0 to bins - 1) { + total = total + getCounter(i) +} +val wantedSum = (q * total).round.toInt +var currSum = count(getValue(0)) + +if (wantedSum currSum) { + require(lower -MaxValue, Set a lower bound before
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37137762 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala --- @@ -0,0 +1,343 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.statistics + +import scala.Double.MaxValue +import scala.collection.mutable + +/** Implementation of a continuous valued online histogram + * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm + * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf + * + * =Parameters= + * -[[capacity]]: + * Number of bins to be used in the histogram + * + * -[[min]]: + * Lower limit on the elements + * + * -[[max]]: + * Upper limit on the elements + */ +case class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram { + + require(capacity 0, Capacity should be a positive integer) + require(lower upper, Lower must be less than upper) + + val data = new mutable.ArrayBuffer[(Double, Int)]() + + /** Adds a new item to the histogram +* +* @param p value to be added +*/ + override def add(p: Double): Unit = { +require(p lower p upper, p + not in ( + lower + , + upper + )) +// search for the index where the value is just higher than p +val search = find(p) +// add the new value there, shifting everything to the right +data.insert(search, (p, 1)) +// If we're over capacity or any two elements are within 1e-9 of each other, merge. +// This will take care of the case if p was actually equal to some value in the histogram and +// just increment the value there +mergeElements() + } + + /** Merges the histogram with h and returns a histogram with capacity B +* +* @param h histogram to be merged +* @param B capacity of the resultant histogram +* @return Merged histogram with capacity B +*/ + override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = { +h match { + case temp: ContinuousHistogram = { +val m: Int = bins +val n: Int = temp.bins +var i, j: Int = 0 +val mergeList = new mutable.ArrayBuffer[(Double, Int)]() +while (i m || j n) { + if (i = m) { +mergeList += ((temp.getValue(j), temp.getCounter(j))) +j = j + 1 + } else if (j = n || getValue(i) = temp.getValue(j)) { +mergeList += (data.apply(i)) +i = i + 1 + } else { +mergeList += ((temp.getValue(j), temp.getCounter(j))) +j = j + 1 + } +} +// the size will be brought to capacity while constructing the new histogram itself +val finalLower = Math.min(lower, temp.lower) +val finalUpper = Math.max(upper, temp.upper) +val ret = new ContinuousHistogram(B, finalLower, finalUpper) +ret.loadData(mergeList.toArray) +ret + } + case default = +throw new RuntimeException(Only a continuous histogram is allowed to be merged with a + + continuous histogram) + +} + } + + /** Returns the qth quantile of the histogram +* +* @param q Quantile value in (0,1) +* @return Value at quantile q +*/ + def quantile(q: Double): Double = { +require(bins 0, Histogram is empty) +require(q 0 q 1, Quantile must be between 0 and 1) +var total = 0 +for (i - 0 to bins - 1) { + total = total + getCounter(i) +} +val wantedSum = (q * total).round.toInt +var currSum = count(getValue(0)) + +if (wantedSum currSum) { + require(lower -MaxValue, Set a lower bound before
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37137774 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/FieldStats.scala --- @@ -0,0 +1,119 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.statistics + +import org.apache.flink.ml.statistics.FieldType._ + +import scala.collection.mutable + +/** Class to represent Field statistics. + * + * =Parameters= + * -[[fieldType]]: + * Type of this field, [[DISCRETE]] or [[CONTINUOUS]] + * + * For [[DISCRETE]] fields, [[entropy]], [[gini]] and [[categoryCounts]] are provided. + * For [[CONTINUOUS]] fields, [[min]], [[max]], [[mean]] and [[variance]] are provided. + * + */ +class FieldStats(val fieldType: FieldType) extends Serializable { + // field parameters + private [statistics] var _min: Double = _ + private [statistics] var _max: Double = _ + private [statistics] var _mean: Double = _ + private [statistics] var _variance: Double = _ + private [statistics] var _counts: mutable.HashMap[Double,Int] = _ + --- End diff -- Just for cosmetic, `private[statistics]` would be better to match current coding style. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37137811 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/FieldStats.scala --- @@ -0,0 +1,119 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.statistics + +import org.apache.flink.ml.statistics.FieldType._ + +import scala.collection.mutable + +/** Class to represent Field statistics. + * + * =Parameters= + * -[[fieldType]]: + * Type of this field, [[DISCRETE]] or [[CONTINUOUS]] + * + * For [[DISCRETE]] fields, [[entropy]], [[gini]] and [[categoryCounts]] are provided. + * For [[CONTINUOUS]] fields, [[min]], [[max]], [[mean]] and [[variance]] are provided. + * + */ +class FieldStats(val fieldType: FieldType) extends Serializable { + // field parameters + private [statistics] var _min: Double = _ + private [statistics] var _max: Double = _ + private [statistics] var _mean: Double = _ + private [statistics] var _variance: Double = _ + private [statistics] var _counts: mutable.HashMap[Double,Int] = _ + + // Access methods // + def min: Double = { +exception(DISCRETE) +_min + } + + def max: Double = { +exception(DISCRETE) +_max + } + + def mean: Double = { +exception(DISCRETE) +_mean + } + + def variance: Double = { +exception(DISCRETE) +_variance + } + + def categoryCounts: mutable.HashMap[Double,Int] = { +exception(CONTINUOUS) +_counts + } + + /** + * Returns the entropy value for this [[DISCRETE]] field. + */ + def entropy: Double = { +exception(CONTINUOUS) +val total: Double = _counts.valuesIterator.sum +3.322 * _counts.iterator.map(x = - (x._2 / total) * Math.log10(x._2 / total)).sum + } + + /** + * Returns the Gini impurity for this [[DISCRETE]] field. + */ + def gini: Double = { +exception(CONTINUOUS) +val total: Double = _counts.valuesIterator.sum +1 - _counts.iterator.map(x = (x._2 * x._2)).sum / (total * total) + } + + //-- Setter methods --// + private [statistics] def setContinuousParameters( + min: Double, + max: Double, + mean: Double, + variance: Double) +: Unit = { +exception(DISCRETE) +_min = min +_max = max +_mean = mean +_variance = variance + } + + private [statistics] def setDiscreteParameters(counts: mutable.HashMap[Double,Int]): Unit = { +exception(CONTINUOUS) +_counts = counts + } + + private def exception(checkFor: FieldType) = { +if(fieldType == checkFor){ + throw new RuntimeException(Invalid access of data. Check field types.) +} + } --- End diff -- Maybe If we check `FieldType` in each method (`min`, `max`, `mean`, ..., `gini`) with `Predef.assume` method, the user can understand what is the problem more properly. For example, when the user call `gini` method for continuous histogram, we can throw exception with detailed message such as Gini impurity for continuous histogram is not supported.. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37137819 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/FieldStats.scala --- @@ -0,0 +1,119 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.statistics + +import org.apache.flink.ml.statistics.FieldType._ + +import scala.collection.mutable + +/** Class to represent Field statistics. + * + * =Parameters= + * -[[fieldType]]: + * Type of this field, [[DISCRETE]] or [[CONTINUOUS]] + * + * For [[DISCRETE]] fields, [[entropy]], [[gini]] and [[categoryCounts]] are provided. + * For [[CONTINUOUS]] fields, [[min]], [[max]], [[mean]] and [[variance]] are provided. + * + */ +class FieldStats(val fieldType: FieldType) extends Serializable { + // field parameters + private [statistics] var _min: Double = _ + private [statistics] var _max: Double = _ + private [statistics] var _mean: Double = _ + private [statistics] var _variance: Double = _ + private [statistics] var _counts: mutable.HashMap[Double,Int] = _ + + // Access methods // + def min: Double = { +exception(DISCRETE) +_min + } + + def max: Double = { +exception(DISCRETE) +_max + } + + def mean: Double = { +exception(DISCRETE) +_mean + } + + def variance: Double = { +exception(DISCRETE) +_variance + } + + def categoryCounts: mutable.HashMap[Double,Int] = { +exception(CONTINUOUS) +_counts + } + + /** + * Returns the entropy value for this [[DISCRETE]] field. + */ + def entropy: Double = { +exception(CONTINUOUS) +val total: Double = _counts.valuesIterator.sum +3.322 * _counts.iterator.map(x = - (x._2 / total) * Math.log10(x._2 / total)).sum + } + + /** + * Returns the Gini impurity for this [[DISCRETE]] field. + */ + def gini: Double = { +exception(CONTINUOUS) +val total: Double = _counts.valuesIterator.sum +1 - _counts.iterator.map(x = (x._2 * x._2)).sum / (total * total) + } + + //-- Setter methods --// + private [statistics] def setContinuousParameters( --- End diff -- Also `private[statistics]` would be better in here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37137821 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/FieldStats.scala --- @@ -0,0 +1,119 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.statistics + +import org.apache.flink.ml.statistics.FieldType._ + +import scala.collection.mutable + +/** Class to represent Field statistics. + * + * =Parameters= + * -[[fieldType]]: + * Type of this field, [[DISCRETE]] or [[CONTINUOUS]] + * + * For [[DISCRETE]] fields, [[entropy]], [[gini]] and [[categoryCounts]] are provided. + * For [[CONTINUOUS]] fields, [[min]], [[max]], [[mean]] and [[variance]] are provided. + * + */ +class FieldStats(val fieldType: FieldType) extends Serializable { + // field parameters + private [statistics] var _min: Double = _ + private [statistics] var _max: Double = _ + private [statistics] var _mean: Double = _ + private [statistics] var _variance: Double = _ + private [statistics] var _counts: mutable.HashMap[Double,Int] = _ + + // Access methods // + def min: Double = { +exception(DISCRETE) +_min + } + + def max: Double = { +exception(DISCRETE) +_max + } + + def mean: Double = { +exception(DISCRETE) +_mean + } + + def variance: Double = { +exception(DISCRETE) +_variance + } + + def categoryCounts: mutable.HashMap[Double,Int] = { +exception(CONTINUOUS) +_counts + } + + /** + * Returns the entropy value for this [[DISCRETE]] field. + */ + def entropy: Double = { +exception(CONTINUOUS) +val total: Double = _counts.valuesIterator.sum +3.322 * _counts.iterator.map(x = - (x._2 / total) * Math.log10(x._2 / total)).sum + } + + /** + * Returns the Gini impurity for this [[DISCRETE]] field. + */ + def gini: Double = { +exception(CONTINUOUS) +val total: Double = _counts.valuesIterator.sum +1 - _counts.iterator.map(x = (x._2 * x._2)).sum / (total * total) + } + + //-- Setter methods --// + private [statistics] def setContinuousParameters( + min: Double, + max: Double, + mean: Double, + variance: Double) +: Unit = { +exception(DISCRETE) +_min = min +_max = max +_mean = mean +_variance = variance + } + + private [statistics] def setDiscreteParameters(counts: mutable.HashMap[Double,Int]): Unit = { --- End diff -- Also `private[statistics]` would be better in here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37137856 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala --- @@ -0,0 +1,343 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.statistics + +import scala.Double.MaxValue +import scala.collection.mutable + +/** Implementation of a continuous valued online histogram + * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm + * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf + * + * =Parameters= + * -[[capacity]]: + * Number of bins to be used in the histogram + * + * -[[min]]: + * Lower limit on the elements + * + * -[[max]]: + * Upper limit on the elements + */ +case class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram { + + require(capacity 0, Capacity should be a positive integer) + require(lower upper, Lower must be less than upper) + + val data = new mutable.ArrayBuffer[(Double, Int)]() + + /** Adds a new item to the histogram +* +* @param p value to be added +*/ + override def add(p: Double): Unit = { +require(p lower p upper, p + not in ( + lower + , + upper + )) +// search for the index where the value is just higher than p +val search = find(p) +// add the new value there, shifting everything to the right +data.insert(search, (p, 1)) +// If we're over capacity or any two elements are within 1e-9 of each other, merge. +// This will take care of the case if p was actually equal to some value in the histogram and +// just increment the value there +mergeElements() + } + + /** Merges the histogram with h and returns a histogram with capacity B +* +* @param h histogram to be merged +* @param B capacity of the resultant histogram +* @return Merged histogram with capacity B +*/ + override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = { +h match { + case temp: ContinuousHistogram = { +val m: Int = bins +val n: Int = temp.bins +var i, j: Int = 0 +val mergeList = new mutable.ArrayBuffer[(Double, Int)]() +while (i m || j n) { + if (i = m) { +mergeList += ((temp.getValue(j), temp.getCounter(j))) +j = j + 1 + } else if (j = n || getValue(i) = temp.getValue(j)) { +mergeList += (data.apply(i)) +i = i + 1 + } else { +mergeList += ((temp.getValue(j), temp.getCounter(j))) +j = j + 1 + } +} +// the size will be brought to capacity while constructing the new histogram itself +val finalLower = Math.min(lower, temp.lower) +val finalUpper = Math.max(upper, temp.upper) +val ret = new ContinuousHistogram(B, finalLower, finalUpper) +ret.loadData(mergeList.toArray) +ret + } + case default = +throw new RuntimeException(Only a continuous histogram is allowed to be merged with a + + continuous histogram) + +} + } + + /** Returns the qth quantile of the histogram +* +* @param q Quantile value in (0,1) +* @return Value at quantile q +*/ + def quantile(q: Double): Double = { +require(bins 0, Histogram is empty) +require(q 0 q 1, Quantile must be between 0 and 1) +var total = 0 +for (i - 0 to bins - 1) { + total = total + getCounter(i) +} +val wantedSum = (q * total).round.toInt +var currSum = count(getValue(0)) + +if (wantedSum currSum) { + require(lower -MaxValue, Set a lower bound before
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37137866 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/FieldStats.scala --- @@ -0,0 +1,119 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.statistics + +import org.apache.flink.ml.statistics.FieldType._ + +import scala.collection.mutable + +/** Class to represent Field statistics. + * + * =Parameters= + * -[[fieldType]]: + * Type of this field, [[DISCRETE]] or [[CONTINUOUS]] + * + * For [[DISCRETE]] fields, [[entropy]], [[gini]] and [[categoryCounts]] are provided. + * For [[CONTINUOUS]] fields, [[min]], [[max]], [[mean]] and [[variance]] are provided. + * + */ +class FieldStats(val fieldType: FieldType) extends Serializable { + // field parameters + private [statistics] var _min: Double = _ + private [statistics] var _max: Double = _ + private [statistics] var _mean: Double = _ + private [statistics] var _variance: Double = _ + private [statistics] var _counts: mutable.HashMap[Double,Int] = _ + + // Access methods // + def min: Double = { +exception(DISCRETE) +_min + } + + def max: Double = { +exception(DISCRETE) +_max + } + + def mean: Double = { +exception(DISCRETE) +_mean + } + + def variance: Double = { +exception(DISCRETE) +_variance + } + + def categoryCounts: mutable.HashMap[Double,Int] = { +exception(CONTINUOUS) +_counts + } + + /** + * Returns the entropy value for this [[DISCRETE]] field. + */ + def entropy: Double = { +exception(CONTINUOUS) +val total: Double = _counts.valuesIterator.sum +3.322 * _counts.iterator.map(x = - (x._2 / total) * Math.log10(x._2 / total)).sum + } + + /** + * Returns the Gini impurity for this [[DISCRETE]] field. + */ + def gini: Double = { +exception(CONTINUOUS) +val total: Double = _counts.valuesIterator.sum +1 - _counts.iterator.map(x = (x._2 * x._2)).sum / (total * total) --- End diff -- Unnecessary parentheses in `(x._2 * x._2)`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37137965 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/FieldStats.scala --- @@ -0,0 +1,119 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.statistics + +import org.apache.flink.ml.statistics.FieldType._ + +import scala.collection.mutable + +/** Class to represent Field statistics. + * + * =Parameters= + * -[[fieldType]]: + * Type of this field, [[DISCRETE]] or [[CONTINUOUS]] + * + * For [[DISCRETE]] fields, [[entropy]], [[gini]] and [[categoryCounts]] are provided. + * For [[CONTINUOUS]] fields, [[min]], [[max]], [[mean]] and [[variance]] are provided. + * + */ +class FieldStats(val fieldType: FieldType) extends Serializable { + // field parameters + private [statistics] var _min: Double = _ + private [statistics] var _max: Double = _ + private [statistics] var _mean: Double = _ + private [statistics] var _variance: Double = _ + private [statistics] var _counts: mutable.HashMap[Double,Int] = _ + + // Access methods // + def min: Double = { +exception(DISCRETE) +_min + } + + def max: Double = { +exception(DISCRETE) +_max + } + + def mean: Double = { +exception(DISCRETE) +_mean + } + + def variance: Double = { +exception(DISCRETE) +_variance + } + + def categoryCounts: mutable.HashMap[Double,Int] = { +exception(CONTINUOUS) +_counts + } + + /** + * Returns the entropy value for this [[DISCRETE]] field. + */ + def entropy: Double = { +exception(CONTINUOUS) +val total: Double = _counts.valuesIterator.sum +3.322 * _counts.iterator.map(x = - (x._2 / total) * Math.log10(x._2 / total)).sum --- End diff -- Is 3.322 used for changing base of log from 10 to 2? Why don't you use `Math.log(~~~) / Math.log(2)`? Saving `Math.log(2)` into an immutable class member is good for performance. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37138026 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala --- @@ -0,0 +1,343 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.statistics + +import scala.Double.MaxValue +import scala.collection.mutable + +/** Implementation of a continuous valued online histogram + * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm + * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf + * + * =Parameters= + * -[[capacity]]: + * Number of bins to be used in the histogram + * + * -[[min]]: + * Lower limit on the elements + * + * -[[max]]: + * Upper limit on the elements + */ +case class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram { + + require(capacity 0, Capacity should be a positive integer) + require(lower upper, Lower must be less than upper) + + val data = new mutable.ArrayBuffer[(Double, Int)]() + + /** Adds a new item to the histogram +* +* @param p value to be added +*/ + override def add(p: Double): Unit = { +require(p lower p upper, p + not in ( + lower + , + upper + )) +// search for the index where the value is just higher than p +val search = find(p) +// add the new value there, shifting everything to the right +data.insert(search, (p, 1)) +// If we're over capacity or any two elements are within 1e-9 of each other, merge. +// This will take care of the case if p was actually equal to some value in the histogram and +// just increment the value there +mergeElements() + } + + /** Merges the histogram with h and returns a histogram with capacity B +* +* @param h histogram to be merged +* @param B capacity of the resultant histogram +* @return Merged histogram with capacity B +*/ + override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = { +h match { + case temp: ContinuousHistogram = { +val m: Int = bins +val n: Int = temp.bins +var i, j: Int = 0 +val mergeList = new mutable.ArrayBuffer[(Double, Int)]() +while (i m || j n) { + if (i = m) { +mergeList += ((temp.getValue(j), temp.getCounter(j))) +j = j + 1 + } else if (j = n || getValue(i) = temp.getValue(j)) { +mergeList += (data.apply(i)) --- End diff -- Unnecessary parenthesis --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37138063 --- Diff: docs/libs/ml/statistics.md --- @@ -0,0 +1,108 @@ +--- +mathjax: include +htmlTitle: FlinkML - Statistics +title: a href=../mlFlinkML/a - Statistics +--- +!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +License); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +-- + +* This will be replaced by the TOC +{:toc} + +## Description + + The statistics utility provides features such as building histograms over data, determining + mean, variance, gini impurity, entropy etc. of data. + +## Methods + + The Statistics utility provides two major functions: `createHistogram` and `dataStats`. + +### Creating a histogram + + There are two types of histograms: + ul + li + strongContinuous Histograms/strong: These histograms are formed on a data set `X: DataSet[Double]` + when the values in `X` are from a continuous range. These histograms support + `quantile` and `sum` operations. Here `quantile(q)` refers to a value $x_q$ such that $|x: x + \leq x_q| = q * |X|$. Further, `sum(s)` refers to the number of elements $x \leq s$, which can +be construed as a cumulative probability value at $s$[Of course, iscaled/i probability]. + br +A continuous histogram can be formed by calling `X.createHistogram(b)` where `b` is the +number of bins. + /li + li +strongCategorical Histograms/strong: These histograms are formed on a data set `X:DataSet[Double]` +when the values in `X` are from a discrete distribution. These histograms +support `count(c)` operation which returns the number of elements associated with cateogry `c`. +br +A categorical histogram can be formed by calling `X.createHistogram(0)`. + /li + /ul + +### Data Statistics + + The `dataStats` function operates on a data set `X: DataSet[Vector]` and returns column-wise + statistics for `X`. Every field of `X` is allowed to be defined as either idiscrete/i or + icontinuous/i. + br + Statistics can be evaluated by calling `DataStats.dataStats(X)` or + `DataStats.dataStats(X, discreteFields`). The latter is used when some fields are needed to be + declared discrete-valued, and is provided as an array of indices of fields which are discrete. + br + The following information is available as part of `DataStats`: + ul +liNumber of elements in `X`/li +liDimension of `X`/li +liColumn-wise statistics where for discrete fields, we report counts for each category, and + the Gini impurity and Entropy of the field, while for continuous fields, we report the + minimum, maximum, mean and variance. +/li + /ul + +## Examples + +{% highlight scala %} + +import org.apache.flink.ml.statistics._ +import org.apache.flink.ml._ + +val X: DataSet[Double] = ... +// Create continuous histogram +val histogram = X.createHistogram(5) // creates a histogram with five bins +histogram.quantile(0.3) // returns the 30th quantile +histogram.sum(4) // returns number of elements less than 4 + +// Create categorical histogram +val histogram = X.createHistogram(0) // creates a categorical histogram +histogram.count(3) // number of elements with cateogory value 3 --- End diff -- cateogory - category --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37138093 --- Diff: docs/libs/ml/statistics.md --- @@ -0,0 +1,108 @@ +--- +mathjax: include +htmlTitle: FlinkML - Statistics +title: a href=../mlFlinkML/a - Statistics +--- +!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +License); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +-- + +* This will be replaced by the TOC +{:toc} + +## Description + + The statistics utility provides features such as building histograms over data, determining + mean, variance, gini impurity, entropy etc. of data. + +## Methods + + The Statistics utility provides two major functions: `createHistogram` and `dataStats`. + +### Creating a histogram + + There are two types of histograms: + ul + li + strongContinuous Histograms/strong: These histograms are formed on a data set `X: DataSet[Double]` + when the values in `X` are from a continuous range. These histograms support + `quantile` and `sum` operations. Here `quantile(q)` refers to a value $x_q$ such that $|x: x + \leq x_q| = q * |X|$. Further, `sum(s)` refers to the number of elements $x \leq s$, which can +be construed as a cumulative probability value at $s$[Of course, iscaled/i probability]. + br +A continuous histogram can be formed by calling `X.createHistogram(b)` where `b` is the +number of bins. + /li + li +strongCategorical Histograms/strong: These histograms are formed on a data set `X:DataSet[Double]` +when the values in `X` are from a discrete distribution. These histograms +support `count(c)` operation which returns the number of elements associated with cateogry `c`. +br +A categorical histogram can be formed by calling `X.createHistogram(0)`. + /li + /ul + +### Data Statistics + + The `dataStats` function operates on a data set `X: DataSet[Vector]` and returns column-wise + statistics for `X`. Every field of `X` is allowed to be defined as either idiscrete/i or + icontinuous/i. + br + Statistics can be evaluated by calling `DataStats.dataStats(X)` or + `DataStats.dataStats(X, discreteFields`). The latter is used when some fields are needed to be + declared discrete-valued, and is provided as an array of indices of fields which are discrete. + br + The following information is available as part of `DataStats`: + ul +liNumber of elements in `X`/li +liDimension of `X`/li +liColumn-wise statistics where for discrete fields, we report counts for each category, and + the Gini impurity and Entropy of the field, while for continuous fields, we report the + minimum, maximum, mean and variance. +/li + /ul + +## Examples + +{% highlight scala %} + +import org.apache.flink.ml.statistics._ +import org.apache.flink.ml._ + +val X: DataSet[Double] = ... +// Create continuous histogram +val histogram = X.createHistogram(5) // creates a histogram with five bins +histogram.quantile(0.3) // returns the 30th quantile +histogram.sum(4) // returns number of elements less than 4 --- End diff -- Is this example valid? `RichDoubleDataSet.createHistogram` returns `DataSet[OnlineHistogram]` and it doesn't have methods such as `quantile`, `sum`, ..., etc.. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-131425060 Hi @sachingoel0101, Thanks for your contribution. I reviewed this PR and commented the source code. There are some problems which aren't commented. In documentation, there are many lines including `a`, `ul`, `li`, 'br', 'strong' and 'i' tag. These can be replaced with markdown syntax. And some codes (`MLUtils.createHistogram`, `ColumnStatistics`) are not formatted. I didn't finish reviewing process of building histogram. After reading full source code and the given paper, I'll add comment about the process. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-129336972 Hi, I just discovered the review request. I'll review this PR soon. Because I'm busy in working for my graduation essay, maybe I can start reviewing on weekend. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---