[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-19 Thread thvasilo
Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-132552269
  
Sounds good, just make sure to update the title and description of both PRs 
to reflect the current state 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-19 Thread sachingoel0101
Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-132512002
  
Okay. Done. 
I will modify the `createContinuousHistogram` to use `DataStats` in #1032, 
since that is going to be merged after this anyway. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-19 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37390659
  
--- Diff: docs/libs/ml/statistics.md ---
@@ -0,0 +1,98 @@
+---
+mathjax: include
+htmlTitle: FlinkML - Statistics
+title: a href=../mlFlinkML/a - Statistics
+---
+!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+License); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+--
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+ The statistics utility provides features such as building histograms over 
data, determining
+ mean, variance, gini impurity, entropy etc. of data.
+
+## Methods
+
+ The Statistics utility provides two major functions: `createHistogram` 
and `dataStats`.
+
+### Creating a histogram
+
+ There are two types of histograms:
+   1. **Continuous Histograms**: These histograms are formed on a data set 
`X: DataSet[Double]`
+   when the values in `X` are from a continuous range. These histograms 
support
+   `quantile` and `sum`  operations. Here `quantile(q)` refers to a value 
$x_q$ such that $|x: x
+   \leq x_q| = q * |X|$. Further, `sum(s)` refers to the number of 
elements $x \leq s$, which can
+be construed as a cumulative probability value at $s$[Of course, 
*scaled* probability].
+   2. A continuous histogram can be formed by calling 
`X.createHistogram(b)` where `b` is the
+number of bins.
+**Discrete Histograms**: These histograms are formed on a data set 
`X:DataSet[Double]`
+when the values in `X` are from a discrete distribution. These 
histograms
+support `count(c)` operation which returns the number of elements 
associated with cateogry `c`.
+br
+A discrete histogram can be formed by calling 
`MLUtils.createDiscreteHistogram(X)`.
+
+### Data Statistics
+
+ The `dataStats` function operates on a data set `X: DataSet[Vector]` and 
returns column-wise
+ statistics for `X`. Every field of `X` is allowed to be defined as either 
*discrete* or
+ *continuous*.
+ br
+ Statistics can be evaluated by calling `DataStats.dataStats(X)` or 
+ `DataStats.dataStats(X, discreteFields`). The latter is used when some 
fields are needed to be 
--- End diff --

Wrong position of backtick symbol. \`DataStats.dataStats(X, 
discreteFields)\`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-19 Thread chiwanpark
Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-132494738
  
Good! But because there is another issue (FLINK-2379) covered the 
column-wise statistics, I think that spliting them would be better.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-18 Thread chiwanpark
Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-132128284
  
+1 for having two different methods by return type but we need more 
comments from @tillrohrmann, @thvasilo or other people because I'm not sure 
this is best approach.

Would be okay the method names are `createContinuousHistogram` and 
`createCategoricalHistogram` if we decide create two methods?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-18 Thread thvasilo
Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-132146208
  
Sounds good. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-18 Thread sachingoel0101
Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-132170286
  
@thvasilo , @chiwanpark , I've made the required changes. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-18 Thread sachingoel0101
Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-132118352
  
What about having two different functions? One for Discrete and one for 
continuous? 
Or perhaps just one for the Continuous as that is more likely to be used.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-18 Thread sachingoel0101
Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-132134747
  
Okay. That sounds good enough. Can we make a final decision then?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-18 Thread chiwanpark
Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-132116059
  
I think that we can merge this PR after we decide the return type of 
`createHistogram` method. Any other points seem okay.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-18 Thread sachingoel0101
Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-132187018
  
Olrite. I'll be back in a few. :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-18 Thread chiwanpark
Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-132183402
  
Is there the valid reason to have multiple names ('discrete histogram', 
'categorical histogram')? I think that to avoid confusion, we need to unify the 
names.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-18 Thread sachingoel0101
Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-132188989
  
@chiwanpark , modified the names.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-18 Thread chiwanpark
Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-132186504
  
I'm also inclined to use Discrete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-18 Thread sachingoel0101
Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-132193111
  
They're not coupled at all. But both are related to statistics over data 
sets. This is why I combined them both in one. If you're wondering, there is a 
JIRA for the column wise statistics as well. [FLINK-2379].
Unless it is absolutely necessary, I'd like to keep them both in one. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-18 Thread sachingoel0101
Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-132439907
  
Including the collect step in the documentation seems weird, IMO. Since the 
return type is a `DataSet[Histogram]`, anyone familiar with flink will know to 
collect it first before operating on the element itself. 

Of course, one other option would be actually collect and return the 
histogram in the `create` functions, but we don't wanna do that in case it is 
used inside an iteration.
Should I include the collect step in the documentation then?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-18 Thread chiwanpark
Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-132430686
  
The only problem to merge this is invalid example in documentation. We need 
calling `collect()` method and `apply(0)` to execute statistics methods such as 
`quantile`, `sum`, ..., etc..

Other things seem okay. :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-18 Thread sachingoel0101
Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-132449023
  
Hi @chiwanpark , I have modified the documentation to include the collect 
step. 
I have also made a few modification to how the histogram is created by 
using a `MapPartition` function instead of `Map`. Also, for more consistency, 
changed the signature of `DataStats` creation to return `DataSet[DataStats]` 
instead of `DataStats`.
@thvasilo , I have integrated the statistics part with this. I hope that is 
alright. Please have a look at `MLUtils.scala` to see the usage. If it's not, 
I'd be happy to remove it.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-18 Thread sachingoel0101
Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-132209035
  
I have kept the documentation for statistics in this however. We can merge 
#1032 first and then this to maintain the illusion of cause-effect. :')


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-18 Thread sachingoel0101
Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-132208131
  
@thvasilo , @chiwanpark , I've split this into two parts. The statistics 
part is now in #1032 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-18 Thread sachingoel0101
Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-132264800
  
Haha. It's no problem. I've updated the documentation for this. :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-18 Thread thvasilo
Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-132261492
  
Sorry to be a nitpick, but if we are going to split the PR then the 
documentation should be split accordingly, we can merge the column-wise 
statistics once this one is merged. I know this is taking very long to merge, 
but it is better if we do things properly and not rush them.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-16 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37143932
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/FieldStats.scala
 ---
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import org.apache.flink.ml.statistics.FieldType._
+
+import scala.collection.mutable
+
+/** Class to represent Field statistics.
+  *
+  * =Parameters=
+  * -[[fieldType]]:
+  *   Type of this field, [[DISCRETE]] or [[CONTINUOUS]]
+  *
+  * For [[DISCRETE]] fields, [[entropy]], [[gini]] and [[categoryCounts]] 
are provided.
+  * For [[CONTINUOUS]] fields, [[min]], [[max]], [[mean]] and [[variance]] 
are provided.
+  *
+  */
+class FieldStats(val fieldType: FieldType) extends Serializable {
+  // field parameters
+  private [statistics] var _min: Double = _
+  private [statistics] var _max: Double = _
+  private [statistics] var _mean: Double = _
+  private [statistics] var _variance: Double = _
+  private [statistics] var _counts: mutable.HashMap[Double,Int] = _
+
--- End diff --

I meant that the space between `private` and `[statistics]` should be 
removed. It is just cosmetic issue but to keep unified code style, we need fix 
this. :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-16 Thread chiwanpark
Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-131512180
  
@sachingoel0101 There is no need to forgive. I just reviewed this PR and 
left my opinions. :) If you feel that my comments is aggressive, I'm sorry 
about that.

About cosmetic issues, It seems okay except `MLUtils.createHistogram` 
method.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-16 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37144073
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/MLUtils.scala ---
@@ -119,4 +120,31 @@ object MLUtils {
 
 stringRepresentation.writeAsText(filePath)
   }
+
+  /** Create a [[DataSet]] of [[OnlineHistogram]] from the input data
+*
+* @param bins Number of bins required. Zero for 
[[CategoricalHistogram]]
+* @param data input [[DataSet]] of [[Double]]
+* @return [[DataSet]] of [[OnlineHistogram]]
+*/
+  private [ml] def createHistogram(data: DataSet[Double], bins: Int): 
DataSet[OnlineHistogram] = {
+val min = data.reduce((x, y) = Math.min(x, y))
+val max = data.reduce((x, y) = Math.max(x, y))
+
+val stats = min.mapWithBcVariable(max){
+  (minimum,maximum) = (minimum - 2 * (maximum - minimum), maximum + 2 
* (maximum - minimum))
+}
+val ret: DataSet[OnlineHistogram] = data.mapWithBcVariable(stats){ (x, 
statistics) =
+  if(bins  0){
+val h = new ContinuousHistogram(1, statistics._1, statistics._2)
+h.loadData(Array((x, 1)))
+h
+  } else{
--- End diff --

Need a space between `else` and `{`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-16 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37144068
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/MLUtils.scala ---
@@ -119,4 +120,31 @@ object MLUtils {
 
 stringRepresentation.writeAsText(filePath)
   }
+
+  /** Create a [[DataSet]] of [[OnlineHistogram]] from the input data
+*
+* @param bins Number of bins required. Zero for 
[[CategoricalHistogram]]
+* @param data input [[DataSet]] of [[Double]]
+* @return [[DataSet]] of [[OnlineHistogram]]
+*/
+  private [ml] def createHistogram(data: DataSet[Double], bins: Int): 
DataSet[OnlineHistogram] = {
+val min = data.reduce((x, y) = Math.min(x, y))
+val max = data.reduce((x, y) = Math.max(x, y))
+
+val stats = min.mapWithBcVariable(max){
+  (minimum,maximum) = (minimum - 2 * (maximum - minimum), maximum + 2 
* (maximum - minimum))
+}
+val ret: DataSet[OnlineHistogram] = data.mapWithBcVariable(stats){ (x, 
statistics) =
+  if(bins  0){
--- End diff --

`if (bins  0) {` would be better.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-16 Thread sachingoel0101
Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-131518423
  
The occasional space issues are because I stopped using Idea's reformat 
tool. It sometimes messes up the indentations. :') 
And no. Your comments weren't aggressive. I was apologizing for the trivial 
oversights in my code.
Addressed the latest comments too. :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-16 Thread sachingoel0101
Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-131523559
  
Fixed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-16 Thread sachingoel0101
Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-131509499
  
Hi @chiwanpark , I've fixed the *cosmetic* issues. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-16 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37144130
  
--- Diff: docs/libs/ml/statistics.md ---
@@ -0,0 +1,100 @@
+---
+mathjax: include
+htmlTitle: FlinkML - Statistics
+title: a href=../mlFlinkML/a - Statistics
+---
+!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+License); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+--
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+ The statistics utility provides features such as building histograms over 
data, determining
+ mean, variance, gini impurity, entropy etc. of data.
+
+## Methods
+
+ The Statistics utility provides two major functions: `createHistogram` 
and `dataStats`.
+
+### Creating a histogram
+
+ There are two types of histograms:
+   1. strongContinuous Histograms/strong: These histograms are formed 
on a data set `X:
--- End diff --

`strong` tag can be replaced by `**`. `**Continuous Histograms**` is same 
as `strongContinuous Histograms/strong`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-16 Thread chiwanpark
Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-131514958
  
I found some points to improve and added line notes. :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-16 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37144066
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/MLUtils.scala ---
@@ -119,4 +120,31 @@ object MLUtils {
 
 stringRepresentation.writeAsText(filePath)
   }
+
+  /** Create a [[DataSet]] of [[OnlineHistogram]] from the input data
+*
+* @param bins Number of bins required. Zero for 
[[CategoricalHistogram]]
+* @param data input [[DataSet]] of [[Double]]
+* @return [[DataSet]] of [[OnlineHistogram]]
+*/
+  private [ml] def createHistogram(data: DataSet[Double], bins: Int): 
DataSet[OnlineHistogram] = {
--- End diff --

`private[ml]` would be better. :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-16 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37144175
  
--- Diff: docs/libs/ml/statistics.md ---
@@ -0,0 +1,100 @@
+---
+mathjax: include
+htmlTitle: FlinkML - Statistics
+title: a href=../mlFlinkML/a - Statistics
+---
+!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+License); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+--
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+ The statistics utility provides features such as building histograms over 
data, determining
+ mean, variance, gini impurity, entropy etc. of data.
+
+## Methods
+
+ The Statistics utility provides two major functions: `createHistogram` 
and `dataStats`.
+
+### Creating a histogram
+
+ There are two types of histograms:
+   1. strongContinuous Histograms/strong: These histograms are formed 
on a data set `X:
+   DataSet[Double]`
+   when the values in `X` are from a continuous range. These histograms 
support
+   `quantile` and `sum`  operations. Here `quantile(q)` refers to a value 
$x_q$ such that $|x: x
+   \leq x_q| = q * |X|$. Further, `sum(s)` refers to the number of 
elements $x \leq s$, which can
+be construed as a cumulative probability value at $s$[Of course, 
iscaled/i probability].
+   br
+   2. A continuous histogram can be formed by calling 
`X.createHistogram(b)` where `b` is the
+number of bins.
+strongCategorical Histograms/strong: These histograms are formed 
on a data set `X:DataSet[Double]` 
+when the values in `X` are from a discrete distribution. These 
histograms
+support `count(c)` operation which returns the number of elements 
associated with cateogry `c`.
+br
+A categorical histogram can be formed by calling 
`X.createHistogram(0)`.
+
+### Data Statistics
+
+ The `dataStats` function operates on a data set `X: DataSet[Vector]` and 
returns column-wise
+ statistics for `X`. Every field of `X` is allowed to be defined as either 
idiscrete/i or
+ icontinuous/i.
--- End diff --

`i` tag


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-16 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37144173
  
--- Diff: docs/libs/ml/statistics.md ---
@@ -0,0 +1,100 @@
+---
+mathjax: include
+htmlTitle: FlinkML - Statistics
+title: a href=../mlFlinkML/a - Statistics
+---
+!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+License); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+--
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+ The statistics utility provides features such as building histograms over 
data, determining
+ mean, variance, gini impurity, entropy etc. of data.
+
+## Methods
+
+ The Statistics utility provides two major functions: `createHistogram` 
and `dataStats`.
+
+### Creating a histogram
+
+ There are two types of histograms:
+   1. strongContinuous Histograms/strong: These histograms are formed 
on a data set `X:
+   DataSet[Double]`
+   when the values in `X` are from a continuous range. These histograms 
support
+   `quantile` and `sum`  operations. Here `quantile(q)` refers to a value 
$x_q$ such that $|x: x
+   \leq x_q| = q * |X|$. Further, `sum(s)` refers to the number of 
elements $x \leq s$, which can
+be construed as a cumulative probability value at $s$[Of course, 
iscaled/i probability].
+   br
+   2. A continuous histogram can be formed by calling 
`X.createHistogram(b)` where `b` is the
+number of bins.
+strongCategorical Histograms/strong: These histograms are formed 
on a data set `X:DataSet[Double]` 
+when the values in `X` are from a discrete distribution. These 
histograms
+support `count(c)` operation which returns the number of elements 
associated with cateogry `c`.
+br
+A categorical histogram can be formed by calling 
`X.createHistogram(0)`.
+
+### Data Statistics
+
+ The `dataStats` function operates on a data set `X: DataSet[Vector]` and 
returns column-wise
+ statistics for `X`. Every field of `X` is allowed to be defined as either 
idiscrete/i or
--- End diff --

`i` tag


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-16 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37144171
  
--- Diff: docs/libs/ml/statistics.md ---
@@ -0,0 +1,100 @@
+---
+mathjax: include
+htmlTitle: FlinkML - Statistics
+title: a href=../mlFlinkML/a - Statistics
+---
+!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+License); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+--
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+ The statistics utility provides features such as building histograms over 
data, determining
+ mean, variance, gini impurity, entropy etc. of data.
+
+## Methods
+
+ The Statistics utility provides two major functions: `createHistogram` 
and `dataStats`.
+
+### Creating a histogram
+
+ There are two types of histograms:
+   1. strongContinuous Histograms/strong: These histograms are formed 
on a data set `X:
+   DataSet[Double]`
+   when the values in `X` are from a continuous range. These histograms 
support
+   `quantile` and `sum`  operations. Here `quantile(q)` refers to a value 
$x_q$ such that $|x: x
+   \leq x_q| = q * |X|$. Further, `sum(s)` refers to the number of 
elements $x \leq s$, which can
+be construed as a cumulative probability value at $s$[Of course, 
iscaled/i probability].
+   br
+   2. A continuous histogram can be formed by calling 
`X.createHistogram(b)` where `b` is the
+number of bins.
+strongCategorical Histograms/strong: These histograms are formed 
on a data set `X:DataSet[Double]` 
--- End diff --

`strong` tag


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-16 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37144183
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/CategoricalHistogram.scala
 ---
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.collection.mutable
+
+/** Implementation of a discrete valued online histogram
+  *
+  * =Parameters=
+  * -[[numCategories]]:
+  *   Number of categories in the histogram
+  */
+case class CategoricalHistogram(numCategories: Int) extends 
OnlineHistogram {
+
+  require(numCategories  0, Capacity must be greater than zero)
+  val data = new mutable.HashMap[Double, Int]()
+
+  /** Number of categories in the histogram
+*
+* @return number of categories
+*/
+  override def bins: Int = {
+numCategories
+  }
+
+  /** Increment count of category c
+*
+* @param c category whose count needs to be incremented
+*/
+  override def add(c: Double): Unit = {
+data.get(c) match{
+  case None =
+require(data.size  numCategories, Insufficient capacity. Failed 
to add.)
+data.put(c, 1)
+  case Some(value) =
+data.update(c, value + 1)
+}
+  }
+
+  /** Merges the histogram with h and returns a new histogram
+*
+* @param h histogram to be merged
+* @param B number of categories in the resultant histogram.
+*  (Default: ```0```, number of categories will be the size of 
union of categories in
+*  both histograms)
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int = 0): CategoricalHistogram 
= {
+h match {
+  case h1: CategoricalHistogram = {
+val finalMap = new mutable.HashMap[Double, Int]()
+data.iterator.foreach(x = finalMap.put(x._1, x._2))
+h1.data.iterator.foreach(x = {
+  finalMap.get(x._1) match{
+case None = finalMap.put(x._1, x._2)
+case Some(value) = finalMap.update(x._1, x._2 + value)
+  }
+})
+require(B == 0 || finalMap.size = B, Insufficient capacity. 
Failed to merge)
+var finalSize = finalMap.size
+if (B  0) {
+  finalSize = B
+}
+val ret = new CategoricalHistogram(finalSize)
+ret.loadData(finalMap.toArray)
+ret
+  }
+  case default =
+throw new RuntimeException(Only a categorical histogram is 
allowed to be merged with a  +
+  categorical histogram)
+}
+  }
+
+  /** Number of elements in category c
+*
+* @return Number of points in category c
+*/
+  def count(c: Double): Int = {
+data.get(c) match{
--- End diff --

We need a space between `match` and `{`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-16 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37144182
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/CategoricalHistogram.scala
 ---
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.collection.mutable
+
+/** Implementation of a discrete valued online histogram
+  *
+  * =Parameters=
+  * -[[numCategories]]:
+  *   Number of categories in the histogram
+  */
+case class CategoricalHistogram(numCategories: Int) extends 
OnlineHistogram {
+
+  require(numCategories  0, Capacity must be greater than zero)
+  val data = new mutable.HashMap[Double, Int]()
+
+  /** Number of categories in the histogram
+*
+* @return number of categories
+*/
+  override def bins: Int = {
+numCategories
+  }
+
+  /** Increment count of category c
+*
+* @param c category whose count needs to be incremented
+*/
+  override def add(c: Double): Unit = {
+data.get(c) match{
+  case None =
+require(data.size  numCategories, Insufficient capacity. Failed 
to add.)
+data.put(c, 1)
+  case Some(value) =
+data.update(c, value + 1)
+}
+  }
+
+  /** Merges the histogram with h and returns a new histogram
+*
+* @param h histogram to be merged
+* @param B number of categories in the resultant histogram.
+*  (Default: ```0```, number of categories will be the size of 
union of categories in
+*  both histograms)
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int = 0): CategoricalHistogram 
= {
+h match {
+  case h1: CategoricalHistogram = {
+val finalMap = new mutable.HashMap[Double, Int]()
+data.iterator.foreach(x = finalMap.put(x._1, x._2))
+h1.data.iterator.foreach(x = {
+  finalMap.get(x._1) match{
+case None = finalMap.put(x._1, x._2)
+case Some(value) = finalMap.update(x._1, x._2 + value)
+  }
+})
+require(B == 0 || finalMap.size = B, Insufficient capacity. 
Failed to merge)
+var finalSize = finalMap.size
+if (B  0) {
+  finalSize = B
+}
--- End diff --

We can change `finalSize` from mutable to immutable.

```scala
val finalSize = if (B  0) B else finalMap.size
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-16 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37144195
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/CategoricalHistogram.scala
 ---
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.collection.mutable
+
+/** Implementation of a discrete valued online histogram
+  *
+  * =Parameters=
+  * -[[numCategories]]:
+  *   Number of categories in the histogram
+  */
+case class CategoricalHistogram(numCategories: Int) extends 
OnlineHistogram {
+
+  require(numCategories  0, Capacity must be greater than zero)
+  val data = new mutable.HashMap[Double, Int]()
+
+  /** Number of categories in the histogram
+*
+* @return number of categories
+*/
+  override def bins: Int = {
+numCategories
+  }
+
+  /** Increment count of category c
+*
+* @param c category whose count needs to be incremented
+*/
+  override def add(c: Double): Unit = {
+data.get(c) match{
+  case None =
+require(data.size  numCategories, Insufficient capacity. Failed 
to add.)
+data.put(c, 1)
+  case Some(value) =
+data.update(c, value + 1)
+}
+  }
+
+  /** Merges the histogram with h and returns a new histogram
+*
+* @param h histogram to be merged
+* @param B number of categories in the resultant histogram.
+*  (Default: ```0```, number of categories will be the size of 
union of categories in
+*  both histograms)
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int = 0): CategoricalHistogram 
= {
+h match {
+  case h1: CategoricalHistogram = {
+val finalMap = new mutable.HashMap[Double, Int]()
+data.iterator.foreach(x = finalMap.put(x._1, x._2))
+h1.data.iterator.foreach(x = {
+  finalMap.get(x._1) match{
--- End diff --

A space is needed between `match` and `{`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-16 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37144192
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/CategoricalHistogram.scala
 ---
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.collection.mutable
+
+/** Implementation of a discrete valued online histogram
+  *
+  * =Parameters=
+  * -[[numCategories]]:
+  *   Number of categories in the histogram
+  */
+case class CategoricalHistogram(numCategories: Int) extends 
OnlineHistogram {
+
+  require(numCategories  0, Capacity must be greater than zero)
+  val data = new mutable.HashMap[Double, Int]()
+
+  /** Number of categories in the histogram
+*
+* @return number of categories
+*/
+  override def bins: Int = {
+numCategories
+  }
+
+  /** Increment count of category c
+*
+* @param c category whose count needs to be incremented
+*/
+  override def add(c: Double): Unit = {
+data.get(c) match{
--- End diff --

A space is needed between `match` and `{`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-16 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37144137
  
--- Diff: docs/libs/ml/statistics.md ---
@@ -0,0 +1,100 @@
+---
+mathjax: include
+htmlTitle: FlinkML - Statistics
+title: a href=../mlFlinkML/a - Statistics
+---
+!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+License); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+--
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+ The statistics utility provides features such as building histograms over 
data, determining
+ mean, variance, gini impurity, entropy etc. of data.
+
+## Methods
+
+ The Statistics utility provides two major functions: `createHistogram` 
and `dataStats`.
+
+### Creating a histogram
+
+ There are two types of histograms:
+   1. strongContinuous Histograms/strong: These histograms are formed 
on a data set `X:
+   DataSet[Double]`
+   when the values in `X` are from a continuous range. These histograms 
support
+   `quantile` and `sum`  operations. Here `quantile(q)` refers to a value 
$x_q$ such that $|x: x
+   \leq x_q| = q * |X|$. Further, `sum(s)` refers to the number of 
elements $x \leq s$, which can
+be construed as a cumulative probability value at $s$[Of course, 
iscaled/i probability].
--- End diff --

`i` tag can be replace by `*`. `iscaled/i` can be represented as 
`*scaled*`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-16 Thread chiwanpark
Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-131520764
  
I just found that `Predef.assume` could be replaced by `assume`. Sorry for 
wrong guidance. Could you change `Predef.assume` to `assume`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-15 Thread sachingoel0101
Github user sachingoel0101 commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37141049
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/FieldStats.scala
 ---
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import org.apache.flink.ml.statistics.FieldType._
+
+import scala.collection.mutable
+
+/** Class to represent Field statistics.
+  *
+  * =Parameters=
+  * -[[fieldType]]:
+  *   Type of this field, [[DISCRETE]] or [[CONTINUOUS]]
+  *
+  * For [[DISCRETE]] fields, [[entropy]], [[gini]] and [[categoryCounts]] 
are provided.
+  * For [[CONTINUOUS]] fields, [[min]], [[max]], [[mean]] and [[variance]] 
are provided.
+  *
+  */
+class FieldStats(val fieldType: FieldType) extends Serializable {
+  // field parameters
+  private [statistics] var _min: Double = _
+  private [statistics] var _max: Double = _
+  private [statistics] var _mean: Double = _
+  private [statistics] var _variance: Double = _
+  private [statistics] var _counts: mutable.HashMap[Double,Int] = _
+
+  // Access methods //
+  def min: Double = {
+exception(DISCRETE)
+_min
+  }
+
+  def max: Double = {
+exception(DISCRETE)
+_max
+  }
+
+  def mean: Double = {
+exception(DISCRETE)
+_mean
+  }
+
+  def variance: Double = {
+exception(DISCRETE)
+_variance
+  }
+
+  def categoryCounts: mutable.HashMap[Double,Int] = {
+exception(CONTINUOUS)
+_counts
+  }
+
+  /**
+   * Returns the entropy value for this [[DISCRETE]] field.
+   */
+  def entropy: Double = {
+exception(CONTINUOUS)
+val total: Double = _counts.valuesIterator.sum
+3.322 * _counts.iterator.map(x = - (x._2 / total) * Math.log10(x._2 / 
total)).sum
+  }
+
+  /**
+   * Returns the Gini impurity for this [[DISCRETE]] field.
+   */
+  def gini: Double = {
+exception(CONTINUOUS)
+val total: Double = _counts.valuesIterator.sum
+1 - _counts.iterator.map(x = (x._2 * x._2)).sum / (total * total)
+  }
+
+  //-- Setter methods --//
+  private [statistics] def setContinuousParameters(
--- End diff --

Gini impurity should be accessible to user even outside the package. It is 
a property of the field itself and we're not using it ourselves anyway.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-15 Thread sachingoel0101
Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-131474758
  
Hi @chiwanpark, thanks for reviewing this. This was my first time working 
in Scala, so I hope you'll forgive the slight mistakes (oversights, perhaps?). 
I've tried to address most of your comments and left notes where I was not 
sure.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-15 Thread sachingoel0101
Github user sachingoel0101 commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37141028
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/FieldStats.scala
 ---
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import org.apache.flink.ml.statistics.FieldType._
+
+import scala.collection.mutable
+
+/** Class to represent Field statistics.
+  *
+  * =Parameters=
+  * -[[fieldType]]:
+  *   Type of this field, [[DISCRETE]] or [[CONTINUOUS]]
+  *
+  * For [[DISCRETE]] fields, [[entropy]], [[gini]] and [[categoryCounts]] 
are provided.
+  * For [[CONTINUOUS]] fields, [[min]], [[max]], [[mean]] and [[variance]] 
are provided.
+  *
+  */
+class FieldStats(val fieldType: FieldType) extends Serializable {
+  // field parameters
+  private [statistics] var _min: Double = _
+  private [statistics] var _max: Double = _
+  private [statistics] var _mean: Double = _
+  private [statistics] var _variance: Double = _
+  private [statistics] var _counts: mutable.HashMap[Double,Int] = _
+
--- End diff --

Could you clarify this point? They're already package private fields.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-15 Thread sachingoel0101
Github user sachingoel0101 commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37141076
  
--- Diff: docs/libs/ml/statistics.md ---
@@ -0,0 +1,108 @@
+---
+mathjax: include
+htmlTitle: FlinkML - Statistics
+title: a href=../mlFlinkML/a - Statistics
+---
+!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+License); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+--
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+ The statistics utility provides features such as building histograms over 
data, determining
+ mean, variance, gini impurity, entropy etc. of data.
+
+## Methods
+
+ The Statistics utility provides two major functions: `createHistogram` 
and `dataStats`.
+
+### Creating a histogram
+
+ There are two types of histograms:
+ ul
+  li
+   strongContinuous Histograms/strong: These histograms are formed on 
a data set `X: DataSet[Double]` 
+   when the values in `X` are from a continuous range. These histograms 
support
+   `quantile` and `sum`  operations. Here `quantile(q)` refers to a value 
$x_q$ such that $|x: x
+   \leq x_q| = q * |X|$. Further, `sum(s)` refers to the number of 
elements $x \leq s$, which can
+be construed as a cumulative probability value at $s$[Of course, 
iscaled/i probability].
+   br
+A continuous histogram can be formed by calling `X.createHistogram(b)` 
where `b` is the
+number of bins.
+  /li
+  li
+strongCategorical Histograms/strong: These histograms are formed 
on a data set `X:DataSet[Double]` 
+when the values in `X` are from a discrete distribution. These 
histograms
+support `count(c)` operation which returns the number of elements 
associated with cateogry `c`.
+br
+A categorical histogram can be formed by calling 
`X.createHistogram(0)`.
+  /li
+ /ul
+
+### Data Statistics
+
+ The `dataStats` function operates on a data set `X: DataSet[Vector]` and 
returns column-wise
+ statistics for `X`. Every field of `X` is allowed to be defined as either 
idiscrete/i or
+ icontinuous/i.
+ br
+ Statistics can be evaluated by calling `DataStats.dataStats(X)` or 
+ `DataStats.dataStats(X, discreteFields`). The latter is used when some 
fields are needed to be 
+ declared discrete-valued, and is provided as an array of indices of 
fields which are discrete.
+ br
+ The following information is available as part of `DataStats`:
+ ul
+liNumber of elements in `X`/li
+liDimension of `X`/li
+liColumn-wise statistics where for discrete fields, we report counts 
for each category, and
+ the Gini impurity and Entropy of the field, while for continuous 
fields, we report the
+ minimum, maximum, mean and variance.
+/li
+ /ul
+
+## Examples
+
+{% highlight scala %}
+
+import org.apache.flink.ml.statistics._
+import org.apache.flink.ml._
+
+val X: DataSet[Double] = ...
+// Create continuous histogram
+val histogram = X.createHistogram(5) // creates a histogram with five 
bins
+histogram.quantile(0.3)  // returns the 30th quantile
+histogram.sum(4) // returns number of elements 
less than 4
--- End diff --

You're right. I'm wondering if we should provide two methods for this 
purpose. Otherwise user will need to cast into the appropriate class.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-15 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37137570
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/MLUtils.scala ---
@@ -119,4 +120,30 @@ object MLUtils {
 
 stringRepresentation.writeAsText(filePath)
   }
+
+  /** Create a [[DataSet]] of [[OnlineHistogram]] from the input data
+*
+* @param bins Number of bins required. Zero for 
[[CategoricalHistogram]]
+* @param data input [[DataSet]] of [[Double]]
+* @return [[DataSet]] of [[OnlineHistogram]]
+*/
+  private [ml] def createHistogram(data: DataSet[Double], bins: Int): 
DataSet[OnlineHistogram] = {
+val min = data.map(x = x).reduce((x,y) = Math.min(x,y))
+val max = data.map(x = x).reduce((x,y) = Math.max(x,y))
--- End diff --

We don't need `map(x = x)` operation and need space between `x,` and `y`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-15 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37137605
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/CategoricalHistogram.scala
 ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.collection.mutable
+
+/** Implementation of a discrete valued online histogram
+  *
+  * =Parameters=
+  * -[[numCategories]]:
+  *   Number of categories in the histogram
+  */
+case class CategoricalHistogram(numCategories: Int) extends 
OnlineHistogram {
+
+  require(numCategories  0, Capacity must be greater than zero)
+  val data = new mutable.HashMap[Double, Int]()
+
+  /** Number of categories in the histogram
+*
+* @return number of categories
+*/
+  override def bins: Int = {
+numCategories
+  }
+
+  /** Increment count of category c
+*
+* @param c category whose count needs to be incremented
+*/
+  override def add(c: Double): Unit = {
+if (data.get(c).isEmpty) {
+  require(data.size  numCategories, Insufficient capacity. Failed to 
add.)
+  data.put(c, 1)
+} else {
+  data.update(c, data.get(c).get + 1)
+}
+  }
+
+  /** Merges the histogram with h and returns a new histogram
+*
+* @param h histogram to be merged
+* @param B number of categories in the resultant histogram.
+*  (Default: ```0```, number of categories will be the size of 
union of categories in
+*  both histograms)
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int = 0): CategoricalHistogram 
= {
+h match {
+  case h1: CategoricalHistogram = {
+val finalMap = new mutable.HashMap[Double, Int]()
+data.iterator.foreach(x = finalMap.put(x._1, x._2))
+h1.data.iterator.foreach(x = {
+  if (finalMap.get(x._1).isEmpty) {
+finalMap.put(x._1, x._2)
+  } else {
+finalMap.update(x._1, x._2 + finalMap.get(x._1).get)
+  }
+})
--- End diff --

As I said above, we can reduce calling `get(x._1)` by using pattern 
matching for `finalMap.get(x._1)`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-15 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37137592
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/CategoricalHistogram.scala
 ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.collection.mutable
+
+/** Implementation of a discrete valued online histogram
+  *
+  * =Parameters=
+  * -[[numCategories]]:
+  *   Number of categories in the histogram
+  */
+case class CategoricalHistogram(numCategories: Int) extends 
OnlineHistogram {
+
+  require(numCategories  0, Capacity must be greater than zero)
+  val data = new mutable.HashMap[Double, Int]()
+
+  /** Number of categories in the histogram
+*
+* @return number of categories
+*/
+  override def bins: Int = {
+numCategories
+  }
+
+  /** Increment count of category c
+*
+* @param c category whose count needs to be incremented
+*/
+  override def add(c: Double): Unit = {
+if (data.get(c).isEmpty) {
+  require(data.size  numCategories, Insufficient capacity. Failed to 
add.)
+  data.put(c, 1)
+} else {
+  data.update(c, data.get(c).get + 1)
+}
--- End diff --

If use pattern matching for `data.get(c)`, we can reduce calling 
`data.get(c)`.

For example:

```scala
data.get(c) match {
  case None =
require(data.size  numCategories, Insufficient capacity. Failed to 
add.)
data.put(c, 1)
  case Some(v) = data.update(c, v + 1)
}
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-15 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37137631
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/CategoricalHistogram.scala
 ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.collection.mutable
+
+/** Implementation of a discrete valued online histogram
+  *
+  * =Parameters=
+  * -[[numCategories]]:
+  *   Number of categories in the histogram
+  */
+case class CategoricalHistogram(numCategories: Int) extends 
OnlineHistogram {
+
+  require(numCategories  0, Capacity must be greater than zero)
+  val data = new mutable.HashMap[Double, Int]()
+
+  /** Number of categories in the histogram
+*
+* @return number of categories
+*/
+  override def bins: Int = {
+numCategories
+  }
+
+  /** Increment count of category c
+*
+* @param c category whose count needs to be incremented
+*/
+  override def add(c: Double): Unit = {
+if (data.get(c).isEmpty) {
+  require(data.size  numCategories, Insufficient capacity. Failed to 
add.)
+  data.put(c, 1)
+} else {
+  data.update(c, data.get(c).get + 1)
+}
+  }
+
+  /** Merges the histogram with h and returns a new histogram
+*
+* @param h histogram to be merged
+* @param B number of categories in the resultant histogram.
+*  (Default: ```0```, number of categories will be the size of 
union of categories in
+*  both histograms)
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int = 0): CategoricalHistogram 
= {
+h match {
+  case h1: CategoricalHistogram = {
+val finalMap = new mutable.HashMap[Double, Int]()
+data.iterator.foreach(x = finalMap.put(x._1, x._2))
+h1.data.iterator.foreach(x = {
+  if (finalMap.get(x._1).isEmpty) {
+finalMap.put(x._1, x._2)
+  } else {
+finalMap.update(x._1, x._2 + finalMap.get(x._1).get)
+  }
+})
+require(B == 0 || finalMap.size = B, Insufficient capacity. 
Failed to merge)
+val countBuffer = new mutable.ArrayBuffer[(Double, Int)]()
+finalMap.iterator.foreach(x = countBuffer += ((x._1, x._2)))
+var finalSize = finalMap.size
+if (B  0) {
+  finalSize = B
+}
+val ret = new CategoricalHistogram(finalSize)
+ret.loadData(countBuffer.toArray)
--- End diff --

Why the data in `finalMap` should be copied to `countBuffer`?

```scala
val finalSize = if (B  0) B else finalMap.size
val ret = new CategoricalHistogram(finalSize)
ret.loadData(finalMap.toArray)
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-15 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37137647
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/CategoricalHistogram.scala
 ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.collection.mutable
+
+/** Implementation of a discrete valued online histogram
+  *
+  * =Parameters=
+  * -[[numCategories]]:
+  *   Number of categories in the histogram
+  */
+case class CategoricalHistogram(numCategories: Int) extends 
OnlineHistogram {
+
+  require(numCategories  0, Capacity must be greater than zero)
+  val data = new mutable.HashMap[Double, Int]()
+
+  /** Number of categories in the histogram
+*
+* @return number of categories
+*/
+  override def bins: Int = {
+numCategories
+  }
+
+  /** Increment count of category c
+*
+* @param c category whose count needs to be incremented
+*/
+  override def add(c: Double): Unit = {
+if (data.get(c).isEmpty) {
+  require(data.size  numCategories, Insufficient capacity. Failed to 
add.)
+  data.put(c, 1)
+} else {
+  data.update(c, data.get(c).get + 1)
+}
+  }
+
+  /** Merges the histogram with h and returns a new histogram
+*
+* @param h histogram to be merged
+* @param B number of categories in the resultant histogram.
+*  (Default: ```0```, number of categories will be the size of 
union of categories in
+*  both histograms)
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int = 0): CategoricalHistogram 
= {
+h match {
+  case h1: CategoricalHistogram = {
+val finalMap = new mutable.HashMap[Double, Int]()
+data.iterator.foreach(x = finalMap.put(x._1, x._2))
+h1.data.iterator.foreach(x = {
+  if (finalMap.get(x._1).isEmpty) {
+finalMap.put(x._1, x._2)
+  } else {
+finalMap.update(x._1, x._2 + finalMap.get(x._1).get)
+  }
+})
+require(B == 0 || finalMap.size = B, Insufficient capacity. 
Failed to merge)
+val countBuffer = new mutable.ArrayBuffer[(Double, Int)]()
+finalMap.iterator.foreach(x = countBuffer += ((x._1, x._2)))
+var finalSize = finalMap.size
+if (B  0) {
+  finalSize = B
+}
+val ret = new CategoricalHistogram(finalSize)
+ret.loadData(countBuffer.toArray)
+ret
+  }
+  case default =
+throw new RuntimeException(Only a categorical histogram is 
allowed to be merged with a  +
+  categorical histogram)
+}
+  }
+
+  /** Number of elements in category c
+*
+* @return Number of points in category c
+*/
+  def count(c: Double): Int = {
+if (data.get(c).isEmpty) {
+  return 0
+}
+data.get(c).get
+  }
--- End diff --

I think following is better than current implementation.

```scala
def count(c: Double): Int = data.get(c) match {
  case None = 0
  case Some(v) = v
}
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-15 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37137658
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/CategoricalHistogram.scala
 ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.collection.mutable
+
+/** Implementation of a discrete valued online histogram
+  *
+  * =Parameters=
+  * -[[numCategories]]:
+  *   Number of categories in the histogram
+  */
+case class CategoricalHistogram(numCategories: Int) extends 
OnlineHistogram {
+
+  require(numCategories  0, Capacity must be greater than zero)
+  val data = new mutable.HashMap[Double, Int]()
+
+  /** Number of categories in the histogram
+*
+* @return number of categories
+*/
+  override def bins: Int = {
+numCategories
+  }
+
+  /** Increment count of category c
+*
+* @param c category whose count needs to be incremented
+*/
+  override def add(c: Double): Unit = {
+if (data.get(c).isEmpty) {
+  require(data.size  numCategories, Insufficient capacity. Failed to 
add.)
+  data.put(c, 1)
+} else {
+  data.update(c, data.get(c).get + 1)
+}
+  }
+
+  /** Merges the histogram with h and returns a new histogram
+*
+* @param h histogram to be merged
+* @param B number of categories in the resultant histogram.
+*  (Default: ```0```, number of categories will be the size of 
union of categories in
+*  both histograms)
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int = 0): CategoricalHistogram 
= {
+h match {
+  case h1: CategoricalHistogram = {
+val finalMap = new mutable.HashMap[Double, Int]()
+data.iterator.foreach(x = finalMap.put(x._1, x._2))
+h1.data.iterator.foreach(x = {
+  if (finalMap.get(x._1).isEmpty) {
+finalMap.put(x._1, x._2)
+  } else {
+finalMap.update(x._1, x._2 + finalMap.get(x._1).get)
+  }
+})
+require(B == 0 || finalMap.size = B, Insufficient capacity. 
Failed to merge)
+val countBuffer = new mutable.ArrayBuffer[(Double, Int)]()
+finalMap.iterator.foreach(x = countBuffer += ((x._1, x._2)))
+var finalSize = finalMap.size
+if (B  0) {
+  finalSize = B
+}
+val ret = new CategoricalHistogram(finalSize)
+ret.loadData(countBuffer.toArray)
+ret
+  }
+  case default =
+throw new RuntimeException(Only a categorical histogram is 
allowed to be merged with a  +
+  categorical histogram)
+}
+  }
+
+  /** Number of elements in category c
+*
+* @return Number of points in category c
+*/
+  def count(c: Double): Int = {
+if (data.get(c).isEmpty) {
+  return 0
+}
+data.get(c).get
+  }
+
+  /** Returns the total number of elements in the histogram
+*
+* @return total number of elements added so far
+*/
+  override def total: Int = {
+var result = 0
+data.valuesIterator.foreach(x = result += x)
+result
+  }
--- End diff --

We can implement `total` method without mutable variable.

```scala
override def total: Int = data.valuesIterator.sum
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-15 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37137718
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,343 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  *   Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  *   Lower limit on the elements
+  *
+  * -[[max]]:
+  *   Upper limit on the elements
+  */
+case class ContinuousHistogram(capacity: Int, min: Double, max: Double) 
extends OnlineHistogram {
+
+  require(capacity  0, Capacity should be a positive integer)
+  require(lower  upper, Lower must be less than upper)
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+*
+* @param p value to be added
+*/
+  override def add(p: Double): Unit = {
+require(p  lower  p  upper, p +  not in ( + lower + , + upper 
+ ))
+// search for the index where the value is just higher than p
+val search = find(p)
+// add the new value there, shifting everything to the right
+data.insert(search, (p, 1))
+// If we're over capacity or any two elements are within 1e-9 of each 
other, merge.
+// This will take care of the case if p was actually equal to some 
value in the histogram and
+// just increment the value there
+mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+*
+* @param h histogram to be merged
+* @param B capacity of the resultant histogram
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+h match {
+  case temp: ContinuousHistogram = {
+val m: Int = bins
+val n: Int = temp.bins
+var i, j: Int = 0
+val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+while (i  m || j  n) {
+  if (i = m) {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  } else if (j = n || getValue(i) = temp.getValue(j)) {
+mergeList += (data.apply(i))
+i = i + 1
+  } else {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  }
+}
+// the size will be brought to capacity while constructing the new 
histogram itself
+val finalLower = Math.min(lower, temp.lower)
+val finalUpper = Math.max(upper, temp.upper)
+val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+ret.loadData(mergeList.toArray)
+ret
+  }
+  case default =
+throw new RuntimeException(Only a continuous histogram is allowed 
to be merged with a  +
+  continuous histogram)
+
+}
+  }
+
+  /** Returns the qth quantile of the histogram
+*
+* @param q Quantile value in (0,1)
+* @return Value at quantile q
+*/
+  def quantile(q: Double): Double = {
+require(bins  0, Histogram is empty)
+require(q  0  q  1, Quantile must be between 0 and 1)
+var total = 0
+for (i - 0 to bins - 1) {
+  total = total + getCounter(i)
+}
+val wantedSum = (q * total).round.toInt
+var currSum = count(getValue(0))
+
+if (wantedSum  currSum) {
+  require(lower  -MaxValue, Set a lower bound before 

[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-15 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37137756
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,343 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  *   Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  *   Lower limit on the elements
+  *
+  * -[[max]]:
+  *   Upper limit on the elements
+  */
+case class ContinuousHistogram(capacity: Int, min: Double, max: Double) 
extends OnlineHistogram {
+
+  require(capacity  0, Capacity should be a positive integer)
+  require(lower  upper, Lower must be less than upper)
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+*
+* @param p value to be added
+*/
+  override def add(p: Double): Unit = {
+require(p  lower  p  upper, p +  not in ( + lower + , + upper 
+ ))
+// search for the index where the value is just higher than p
+val search = find(p)
+// add the new value there, shifting everything to the right
+data.insert(search, (p, 1))
+// If we're over capacity or any two elements are within 1e-9 of each 
other, merge.
+// This will take care of the case if p was actually equal to some 
value in the histogram and
+// just increment the value there
+mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+*
+* @param h histogram to be merged
+* @param B capacity of the resultant histogram
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+h match {
+  case temp: ContinuousHistogram = {
+val m: Int = bins
+val n: Int = temp.bins
+var i, j: Int = 0
+val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+while (i  m || j  n) {
+  if (i = m) {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  } else if (j = n || getValue(i) = temp.getValue(j)) {
+mergeList += (data.apply(i))
+i = i + 1
+  } else {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  }
+}
+// the size will be brought to capacity while constructing the new 
histogram itself
+val finalLower = Math.min(lower, temp.lower)
+val finalUpper = Math.max(upper, temp.upper)
+val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+ret.loadData(mergeList.toArray)
+ret
+  }
+  case default =
+throw new RuntimeException(Only a continuous histogram is allowed 
to be merged with a  +
+  continuous histogram)
+
+}
+  }
+
+  /** Returns the qth quantile of the histogram
+*
+* @param q Quantile value in (0,1)
+* @return Value at quantile q
+*/
+  def quantile(q: Double): Double = {
+require(bins  0, Histogram is empty)
+require(q  0  q  1, Quantile must be between 0 and 1)
+var total = 0
+for (i - 0 to bins - 1) {
+  total = total + getCounter(i)
+}
+val wantedSum = (q * total).round.toInt
+var currSum = count(getValue(0))
+
+if (wantedSum  currSum) {
+  require(lower  -MaxValue, Set a lower bound before 

[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-15 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37137749
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,343 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  *   Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  *   Lower limit on the elements
+  *
+  * -[[max]]:
+  *   Upper limit on the elements
+  */
+case class ContinuousHistogram(capacity: Int, min: Double, max: Double) 
extends OnlineHistogram {
+
+  require(capacity  0, Capacity should be a positive integer)
+  require(lower  upper, Lower must be less than upper)
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+*
+* @param p value to be added
+*/
+  override def add(p: Double): Unit = {
+require(p  lower  p  upper, p +  not in ( + lower + , + upper 
+ ))
+// search for the index where the value is just higher than p
+val search = find(p)
+// add the new value there, shifting everything to the right
+data.insert(search, (p, 1))
+// If we're over capacity or any two elements are within 1e-9 of each 
other, merge.
+// This will take care of the case if p was actually equal to some 
value in the histogram and
+// just increment the value there
+mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+*
+* @param h histogram to be merged
+* @param B capacity of the resultant histogram
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+h match {
+  case temp: ContinuousHistogram = {
+val m: Int = bins
+val n: Int = temp.bins
+var i, j: Int = 0
+val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+while (i  m || j  n) {
+  if (i = m) {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  } else if (j = n || getValue(i) = temp.getValue(j)) {
+mergeList += (data.apply(i))
+i = i + 1
+  } else {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  }
+}
+// the size will be brought to capacity while constructing the new 
histogram itself
+val finalLower = Math.min(lower, temp.lower)
+val finalUpper = Math.max(upper, temp.upper)
+val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+ret.loadData(mergeList.toArray)
+ret
+  }
+  case default =
+throw new RuntimeException(Only a continuous histogram is allowed 
to be merged with a  +
+  continuous histogram)
+
+}
+  }
+
+  /** Returns the qth quantile of the histogram
+*
+* @param q Quantile value in (0,1)
+* @return Value at quantile q
+*/
+  def quantile(q: Double): Double = {
+require(bins  0, Histogram is empty)
+require(q  0  q  1, Quantile must be between 0 and 1)
+var total = 0
+for (i - 0 to bins - 1) {
+  total = total + getCounter(i)
+}
+val wantedSum = (q * total).round.toInt
+var currSum = count(getValue(0))
+
+if (wantedSum  currSum) {
+  require(lower  -MaxValue, Set a lower bound before 

[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-15 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37137752
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,343 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  *   Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  *   Lower limit on the elements
+  *
+  * -[[max]]:
+  *   Upper limit on the elements
+  */
+case class ContinuousHistogram(capacity: Int, min: Double, max: Double) 
extends OnlineHistogram {
+
+  require(capacity  0, Capacity should be a positive integer)
+  require(lower  upper, Lower must be less than upper)
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+*
+* @param p value to be added
+*/
+  override def add(p: Double): Unit = {
+require(p  lower  p  upper, p +  not in ( + lower + , + upper 
+ ))
+// search for the index where the value is just higher than p
+val search = find(p)
+// add the new value there, shifting everything to the right
+data.insert(search, (p, 1))
+// If we're over capacity or any two elements are within 1e-9 of each 
other, merge.
+// This will take care of the case if p was actually equal to some 
value in the histogram and
+// just increment the value there
+mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+*
+* @param h histogram to be merged
+* @param B capacity of the resultant histogram
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+h match {
+  case temp: ContinuousHistogram = {
+val m: Int = bins
+val n: Int = temp.bins
+var i, j: Int = 0
+val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+while (i  m || j  n) {
+  if (i = m) {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  } else if (j = n || getValue(i) = temp.getValue(j)) {
+mergeList += (data.apply(i))
+i = i + 1
+  } else {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  }
+}
+// the size will be brought to capacity while constructing the new 
histogram itself
+val finalLower = Math.min(lower, temp.lower)
+val finalUpper = Math.max(upper, temp.upper)
+val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+ret.loadData(mergeList.toArray)
+ret
+  }
+  case default =
+throw new RuntimeException(Only a continuous histogram is allowed 
to be merged with a  +
+  continuous histogram)
+
+}
+  }
+
+  /** Returns the qth quantile of the histogram
+*
+* @param q Quantile value in (0,1)
+* @return Value at quantile q
+*/
+  def quantile(q: Double): Double = {
+require(bins  0, Histogram is empty)
+require(q  0  q  1, Quantile must be between 0 and 1)
+var total = 0
+for (i - 0 to bins - 1) {
+  total = total + getCounter(i)
+}
+val wantedSum = (q * total).round.toInt
+var currSum = count(getValue(0))
+
+if (wantedSum  currSum) {
+  require(lower  -MaxValue, Set a lower bound before 

[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-15 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37137762
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,343 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  *   Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  *   Lower limit on the elements
+  *
+  * -[[max]]:
+  *   Upper limit on the elements
+  */
+case class ContinuousHistogram(capacity: Int, min: Double, max: Double) 
extends OnlineHistogram {
+
+  require(capacity  0, Capacity should be a positive integer)
+  require(lower  upper, Lower must be less than upper)
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+*
+* @param p value to be added
+*/
+  override def add(p: Double): Unit = {
+require(p  lower  p  upper, p +  not in ( + lower + , + upper 
+ ))
+// search for the index where the value is just higher than p
+val search = find(p)
+// add the new value there, shifting everything to the right
+data.insert(search, (p, 1))
+// If we're over capacity or any two elements are within 1e-9 of each 
other, merge.
+// This will take care of the case if p was actually equal to some 
value in the histogram and
+// just increment the value there
+mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+*
+* @param h histogram to be merged
+* @param B capacity of the resultant histogram
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+h match {
+  case temp: ContinuousHistogram = {
+val m: Int = bins
+val n: Int = temp.bins
+var i, j: Int = 0
+val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+while (i  m || j  n) {
+  if (i = m) {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  } else if (j = n || getValue(i) = temp.getValue(j)) {
+mergeList += (data.apply(i))
+i = i + 1
+  } else {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  }
+}
+// the size will be brought to capacity while constructing the new 
histogram itself
+val finalLower = Math.min(lower, temp.lower)
+val finalUpper = Math.max(upper, temp.upper)
+val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+ret.loadData(mergeList.toArray)
+ret
+  }
+  case default =
+throw new RuntimeException(Only a continuous histogram is allowed 
to be merged with a  +
+  continuous histogram)
+
+}
+  }
+
+  /** Returns the qth quantile of the histogram
+*
+* @param q Quantile value in (0,1)
+* @return Value at quantile q
+*/
+  def quantile(q: Double): Double = {
+require(bins  0, Histogram is empty)
+require(q  0  q  1, Quantile must be between 0 and 1)
+var total = 0
+for (i - 0 to bins - 1) {
+  total = total + getCounter(i)
+}
+val wantedSum = (q * total).round.toInt
+var currSum = count(getValue(0))
+
+if (wantedSum  currSum) {
+  require(lower  -MaxValue, Set a lower bound before 

[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-15 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37137774
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/FieldStats.scala
 ---
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import org.apache.flink.ml.statistics.FieldType._
+
+import scala.collection.mutable
+
+/** Class to represent Field statistics.
+  *
+  * =Parameters=
+  * -[[fieldType]]:
+  *   Type of this field, [[DISCRETE]] or [[CONTINUOUS]]
+  *
+  * For [[DISCRETE]] fields, [[entropy]], [[gini]] and [[categoryCounts]] 
are provided.
+  * For [[CONTINUOUS]] fields, [[min]], [[max]], [[mean]] and [[variance]] 
are provided.
+  *
+  */
+class FieldStats(val fieldType: FieldType) extends Serializable {
+  // field parameters
+  private [statistics] var _min: Double = _
+  private [statistics] var _max: Double = _
+  private [statistics] var _mean: Double = _
+  private [statistics] var _variance: Double = _
+  private [statistics] var _counts: mutable.HashMap[Double,Int] = _
+
--- End diff --

Just for cosmetic, `private[statistics]` would be better to match current 
coding style.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-15 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37137811
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/FieldStats.scala
 ---
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import org.apache.flink.ml.statistics.FieldType._
+
+import scala.collection.mutable
+
+/** Class to represent Field statistics.
+  *
+  * =Parameters=
+  * -[[fieldType]]:
+  *   Type of this field, [[DISCRETE]] or [[CONTINUOUS]]
+  *
+  * For [[DISCRETE]] fields, [[entropy]], [[gini]] and [[categoryCounts]] 
are provided.
+  * For [[CONTINUOUS]] fields, [[min]], [[max]], [[mean]] and [[variance]] 
are provided.
+  *
+  */
+class FieldStats(val fieldType: FieldType) extends Serializable {
+  // field parameters
+  private [statistics] var _min: Double = _
+  private [statistics] var _max: Double = _
+  private [statistics] var _mean: Double = _
+  private [statistics] var _variance: Double = _
+  private [statistics] var _counts: mutable.HashMap[Double,Int] = _
+
+  // Access methods //
+  def min: Double = {
+exception(DISCRETE)
+_min
+  }
+
+  def max: Double = {
+exception(DISCRETE)
+_max
+  }
+
+  def mean: Double = {
+exception(DISCRETE)
+_mean
+  }
+
+  def variance: Double = {
+exception(DISCRETE)
+_variance
+  }
+
+  def categoryCounts: mutable.HashMap[Double,Int] = {
+exception(CONTINUOUS)
+_counts
+  }
+
+  /**
+   * Returns the entropy value for this [[DISCRETE]] field.
+   */
+  def entropy: Double = {
+exception(CONTINUOUS)
+val total: Double = _counts.valuesIterator.sum
+3.322 * _counts.iterator.map(x = - (x._2 / total) * Math.log10(x._2 / 
total)).sum
+  }
+
+  /**
+   * Returns the Gini impurity for this [[DISCRETE]] field.
+   */
+  def gini: Double = {
+exception(CONTINUOUS)
+val total: Double = _counts.valuesIterator.sum
+1 - _counts.iterator.map(x = (x._2 * x._2)).sum / (total * total)
+  }
+
+  //-- Setter methods --//
+  private [statistics] def setContinuousParameters(
+  min: Double,
+  max: Double,
+  mean: Double,
+  variance: Double)
+: Unit = {
+exception(DISCRETE)
+_min = min
+_max = max
+_mean = mean
+_variance = variance
+  }
+
+  private [statistics] def setDiscreteParameters(counts: 
mutable.HashMap[Double,Int]): Unit = {
+exception(CONTINUOUS)
+_counts = counts
+  }
+
+  private def exception(checkFor: FieldType) = {
+if(fieldType == checkFor){
+  throw new RuntimeException(Invalid access of data. Check field 
types.)
+}
+  }
--- End diff --

Maybe If we check `FieldType` in each method (`min`, `max`, `mean`, ..., 
`gini`) with `Predef.assume` method, the user can understand what is the 
problem more properly. For example, when the user call `gini` method for 
continuous histogram, we can throw exception with detailed message such as 
Gini impurity for continuous histogram is not supported..


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-15 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37137819
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/FieldStats.scala
 ---
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import org.apache.flink.ml.statistics.FieldType._
+
+import scala.collection.mutable
+
+/** Class to represent Field statistics.
+  *
+  * =Parameters=
+  * -[[fieldType]]:
+  *   Type of this field, [[DISCRETE]] or [[CONTINUOUS]]
+  *
+  * For [[DISCRETE]] fields, [[entropy]], [[gini]] and [[categoryCounts]] 
are provided.
+  * For [[CONTINUOUS]] fields, [[min]], [[max]], [[mean]] and [[variance]] 
are provided.
+  *
+  */
+class FieldStats(val fieldType: FieldType) extends Serializable {
+  // field parameters
+  private [statistics] var _min: Double = _
+  private [statistics] var _max: Double = _
+  private [statistics] var _mean: Double = _
+  private [statistics] var _variance: Double = _
+  private [statistics] var _counts: mutable.HashMap[Double,Int] = _
+
+  // Access methods //
+  def min: Double = {
+exception(DISCRETE)
+_min
+  }
+
+  def max: Double = {
+exception(DISCRETE)
+_max
+  }
+
+  def mean: Double = {
+exception(DISCRETE)
+_mean
+  }
+
+  def variance: Double = {
+exception(DISCRETE)
+_variance
+  }
+
+  def categoryCounts: mutable.HashMap[Double,Int] = {
+exception(CONTINUOUS)
+_counts
+  }
+
+  /**
+   * Returns the entropy value for this [[DISCRETE]] field.
+   */
+  def entropy: Double = {
+exception(CONTINUOUS)
+val total: Double = _counts.valuesIterator.sum
+3.322 * _counts.iterator.map(x = - (x._2 / total) * Math.log10(x._2 / 
total)).sum
+  }
+
+  /**
+   * Returns the Gini impurity for this [[DISCRETE]] field.
+   */
+  def gini: Double = {
+exception(CONTINUOUS)
+val total: Double = _counts.valuesIterator.sum
+1 - _counts.iterator.map(x = (x._2 * x._2)).sum / (total * total)
+  }
+
+  //-- Setter methods --//
+  private [statistics] def setContinuousParameters(
--- End diff --

Also `private[statistics]` would be better in here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-15 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37137821
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/FieldStats.scala
 ---
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import org.apache.flink.ml.statistics.FieldType._
+
+import scala.collection.mutable
+
+/** Class to represent Field statistics.
+  *
+  * =Parameters=
+  * -[[fieldType]]:
+  *   Type of this field, [[DISCRETE]] or [[CONTINUOUS]]
+  *
+  * For [[DISCRETE]] fields, [[entropy]], [[gini]] and [[categoryCounts]] 
are provided.
+  * For [[CONTINUOUS]] fields, [[min]], [[max]], [[mean]] and [[variance]] 
are provided.
+  *
+  */
+class FieldStats(val fieldType: FieldType) extends Serializable {
+  // field parameters
+  private [statistics] var _min: Double = _
+  private [statistics] var _max: Double = _
+  private [statistics] var _mean: Double = _
+  private [statistics] var _variance: Double = _
+  private [statistics] var _counts: mutable.HashMap[Double,Int] = _
+
+  // Access methods //
+  def min: Double = {
+exception(DISCRETE)
+_min
+  }
+
+  def max: Double = {
+exception(DISCRETE)
+_max
+  }
+
+  def mean: Double = {
+exception(DISCRETE)
+_mean
+  }
+
+  def variance: Double = {
+exception(DISCRETE)
+_variance
+  }
+
+  def categoryCounts: mutable.HashMap[Double,Int] = {
+exception(CONTINUOUS)
+_counts
+  }
+
+  /**
+   * Returns the entropy value for this [[DISCRETE]] field.
+   */
+  def entropy: Double = {
+exception(CONTINUOUS)
+val total: Double = _counts.valuesIterator.sum
+3.322 * _counts.iterator.map(x = - (x._2 / total) * Math.log10(x._2 / 
total)).sum
+  }
+
+  /**
+   * Returns the Gini impurity for this [[DISCRETE]] field.
+   */
+  def gini: Double = {
+exception(CONTINUOUS)
+val total: Double = _counts.valuesIterator.sum
+1 - _counts.iterator.map(x = (x._2 * x._2)).sum / (total * total)
+  }
+
+  //-- Setter methods --//
+  private [statistics] def setContinuousParameters(
+  min: Double,
+  max: Double,
+  mean: Double,
+  variance: Double)
+: Unit = {
+exception(DISCRETE)
+_min = min
+_max = max
+_mean = mean
+_variance = variance
+  }
+
+  private [statistics] def setDiscreteParameters(counts: 
mutable.HashMap[Double,Int]): Unit = {
--- End diff --

Also `private[statistics]` would be better in here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-15 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37137856
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,343 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  *   Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  *   Lower limit on the elements
+  *
+  * -[[max]]:
+  *   Upper limit on the elements
+  */
+case class ContinuousHistogram(capacity: Int, min: Double, max: Double) 
extends OnlineHistogram {
+
+  require(capacity  0, Capacity should be a positive integer)
+  require(lower  upper, Lower must be less than upper)
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+*
+* @param p value to be added
+*/
+  override def add(p: Double): Unit = {
+require(p  lower  p  upper, p +  not in ( + lower + , + upper 
+ ))
+// search for the index where the value is just higher than p
+val search = find(p)
+// add the new value there, shifting everything to the right
+data.insert(search, (p, 1))
+// If we're over capacity or any two elements are within 1e-9 of each 
other, merge.
+// This will take care of the case if p was actually equal to some 
value in the histogram and
+// just increment the value there
+mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+*
+* @param h histogram to be merged
+* @param B capacity of the resultant histogram
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+h match {
+  case temp: ContinuousHistogram = {
+val m: Int = bins
+val n: Int = temp.bins
+var i, j: Int = 0
+val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+while (i  m || j  n) {
+  if (i = m) {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  } else if (j = n || getValue(i) = temp.getValue(j)) {
+mergeList += (data.apply(i))
+i = i + 1
+  } else {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  }
+}
+// the size will be brought to capacity while constructing the new 
histogram itself
+val finalLower = Math.min(lower, temp.lower)
+val finalUpper = Math.max(upper, temp.upper)
+val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+ret.loadData(mergeList.toArray)
+ret
+  }
+  case default =
+throw new RuntimeException(Only a continuous histogram is allowed 
to be merged with a  +
+  continuous histogram)
+
+}
+  }
+
+  /** Returns the qth quantile of the histogram
+*
+* @param q Quantile value in (0,1)
+* @return Value at quantile q
+*/
+  def quantile(q: Double): Double = {
+require(bins  0, Histogram is empty)
+require(q  0  q  1, Quantile must be between 0 and 1)
+var total = 0
+for (i - 0 to bins - 1) {
+  total = total + getCounter(i)
+}
+val wantedSum = (q * total).round.toInt
+var currSum = count(getValue(0))
+
+if (wantedSum  currSum) {
+  require(lower  -MaxValue, Set a lower bound before 

[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-15 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37137866
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/FieldStats.scala
 ---
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import org.apache.flink.ml.statistics.FieldType._
+
+import scala.collection.mutable
+
+/** Class to represent Field statistics.
+  *
+  * =Parameters=
+  * -[[fieldType]]:
+  *   Type of this field, [[DISCRETE]] or [[CONTINUOUS]]
+  *
+  * For [[DISCRETE]] fields, [[entropy]], [[gini]] and [[categoryCounts]] 
are provided.
+  * For [[CONTINUOUS]] fields, [[min]], [[max]], [[mean]] and [[variance]] 
are provided.
+  *
+  */
+class FieldStats(val fieldType: FieldType) extends Serializable {
+  // field parameters
+  private [statistics] var _min: Double = _
+  private [statistics] var _max: Double = _
+  private [statistics] var _mean: Double = _
+  private [statistics] var _variance: Double = _
+  private [statistics] var _counts: mutable.HashMap[Double,Int] = _
+
+  // Access methods //
+  def min: Double = {
+exception(DISCRETE)
+_min
+  }
+
+  def max: Double = {
+exception(DISCRETE)
+_max
+  }
+
+  def mean: Double = {
+exception(DISCRETE)
+_mean
+  }
+
+  def variance: Double = {
+exception(DISCRETE)
+_variance
+  }
+
+  def categoryCounts: mutable.HashMap[Double,Int] = {
+exception(CONTINUOUS)
+_counts
+  }
+
+  /**
+   * Returns the entropy value for this [[DISCRETE]] field.
+   */
+  def entropy: Double = {
+exception(CONTINUOUS)
+val total: Double = _counts.valuesIterator.sum
+3.322 * _counts.iterator.map(x = - (x._2 / total) * Math.log10(x._2 / 
total)).sum
+  }
+
+  /**
+   * Returns the Gini impurity for this [[DISCRETE]] field.
+   */
+  def gini: Double = {
+exception(CONTINUOUS)
+val total: Double = _counts.valuesIterator.sum
+1 - _counts.iterator.map(x = (x._2 * x._2)).sum / (total * total)
--- End diff --

Unnecessary parentheses in `(x._2 * x._2)`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-15 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37137965
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/FieldStats.scala
 ---
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import org.apache.flink.ml.statistics.FieldType._
+
+import scala.collection.mutable
+
+/** Class to represent Field statistics.
+  *
+  * =Parameters=
+  * -[[fieldType]]:
+  *   Type of this field, [[DISCRETE]] or [[CONTINUOUS]]
+  *
+  * For [[DISCRETE]] fields, [[entropy]], [[gini]] and [[categoryCounts]] 
are provided.
+  * For [[CONTINUOUS]] fields, [[min]], [[max]], [[mean]] and [[variance]] 
are provided.
+  *
+  */
+class FieldStats(val fieldType: FieldType) extends Serializable {
+  // field parameters
+  private [statistics] var _min: Double = _
+  private [statistics] var _max: Double = _
+  private [statistics] var _mean: Double = _
+  private [statistics] var _variance: Double = _
+  private [statistics] var _counts: mutable.HashMap[Double,Int] = _
+
+  // Access methods //
+  def min: Double = {
+exception(DISCRETE)
+_min
+  }
+
+  def max: Double = {
+exception(DISCRETE)
+_max
+  }
+
+  def mean: Double = {
+exception(DISCRETE)
+_mean
+  }
+
+  def variance: Double = {
+exception(DISCRETE)
+_variance
+  }
+
+  def categoryCounts: mutable.HashMap[Double,Int] = {
+exception(CONTINUOUS)
+_counts
+  }
+
+  /**
+   * Returns the entropy value for this [[DISCRETE]] field.
+   */
+  def entropy: Double = {
+exception(CONTINUOUS)
+val total: Double = _counts.valuesIterator.sum
+3.322 * _counts.iterator.map(x = - (x._2 / total) * Math.log10(x._2 / 
total)).sum
--- End diff --

Is 3.322 used for changing base of log from 10 to 2? Why don't you use 
`Math.log(~~~) / Math.log(2)`? Saving `Math.log(2)` into an immutable class 
member is good for performance.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-15 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37138026
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,343 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  *   Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  *   Lower limit on the elements
+  *
+  * -[[max]]:
+  *   Upper limit on the elements
+  */
+case class ContinuousHistogram(capacity: Int, min: Double, max: Double) 
extends OnlineHistogram {
+
+  require(capacity  0, Capacity should be a positive integer)
+  require(lower  upper, Lower must be less than upper)
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+*
+* @param p value to be added
+*/
+  override def add(p: Double): Unit = {
+require(p  lower  p  upper, p +  not in ( + lower + , + upper 
+ ))
+// search for the index where the value is just higher than p
+val search = find(p)
+// add the new value there, shifting everything to the right
+data.insert(search, (p, 1))
+// If we're over capacity or any two elements are within 1e-9 of each 
other, merge.
+// This will take care of the case if p was actually equal to some 
value in the histogram and
+// just increment the value there
+mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+*
+* @param h histogram to be merged
+* @param B capacity of the resultant histogram
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+h match {
+  case temp: ContinuousHistogram = {
+val m: Int = bins
+val n: Int = temp.bins
+var i, j: Int = 0
+val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+while (i  m || j  n) {
+  if (i = m) {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  } else if (j = n || getValue(i) = temp.getValue(j)) {
+mergeList += (data.apply(i))
--- End diff --

Unnecessary parenthesis


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-15 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37138063
  
--- Diff: docs/libs/ml/statistics.md ---
@@ -0,0 +1,108 @@
+---
+mathjax: include
+htmlTitle: FlinkML - Statistics
+title: a href=../mlFlinkML/a - Statistics
+---
+!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+License); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+--
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+ The statistics utility provides features such as building histograms over 
data, determining
+ mean, variance, gini impurity, entropy etc. of data.
+
+## Methods
+
+ The Statistics utility provides two major functions: `createHistogram` 
and `dataStats`.
+
+### Creating a histogram
+
+ There are two types of histograms:
+ ul
+  li
+   strongContinuous Histograms/strong: These histograms are formed on 
a data set `X: DataSet[Double]` 
+   when the values in `X` are from a continuous range. These histograms 
support
+   `quantile` and `sum`  operations. Here `quantile(q)` refers to a value 
$x_q$ such that $|x: x
+   \leq x_q| = q * |X|$. Further, `sum(s)` refers to the number of 
elements $x \leq s$, which can
+be construed as a cumulative probability value at $s$[Of course, 
iscaled/i probability].
+   br
+A continuous histogram can be formed by calling `X.createHistogram(b)` 
where `b` is the
+number of bins.
+  /li
+  li
+strongCategorical Histograms/strong: These histograms are formed 
on a data set `X:DataSet[Double]` 
+when the values in `X` are from a discrete distribution. These 
histograms
+support `count(c)` operation which returns the number of elements 
associated with cateogry `c`.
+br
+A categorical histogram can be formed by calling 
`X.createHistogram(0)`.
+  /li
+ /ul
+
+### Data Statistics
+
+ The `dataStats` function operates on a data set `X: DataSet[Vector]` and 
returns column-wise
+ statistics for `X`. Every field of `X` is allowed to be defined as either 
idiscrete/i or
+ icontinuous/i.
+ br
+ Statistics can be evaluated by calling `DataStats.dataStats(X)` or 
+ `DataStats.dataStats(X, discreteFields`). The latter is used when some 
fields are needed to be 
+ declared discrete-valued, and is provided as an array of indices of 
fields which are discrete.
+ br
+ The following information is available as part of `DataStats`:
+ ul
+liNumber of elements in `X`/li
+liDimension of `X`/li
+liColumn-wise statistics where for discrete fields, we report counts 
for each category, and
+ the Gini impurity and Entropy of the field, while for continuous 
fields, we report the
+ minimum, maximum, mean and variance.
+/li
+ /ul
+
+## Examples
+
+{% highlight scala %}
+
+import org.apache.flink.ml.statistics._
+import org.apache.flink.ml._
+
+val X: DataSet[Double] = ...
+// Create continuous histogram
+val histogram = X.createHistogram(5) // creates a histogram with five 
bins
+histogram.quantile(0.3)  // returns the 30th quantile
+histogram.sum(4) // returns number of elements 
less than 4
+
+// Create categorical histogram
+val histogram = X.createHistogram(0) // creates a categorical histogram
+histogram.count(3)   // number of elements with 
cateogory value 3
--- End diff --

cateogory - category


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-15 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37138093
  
--- Diff: docs/libs/ml/statistics.md ---
@@ -0,0 +1,108 @@
+---
+mathjax: include
+htmlTitle: FlinkML - Statistics
+title: a href=../mlFlinkML/a - Statistics
+---
+!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+License); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+--
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+ The statistics utility provides features such as building histograms over 
data, determining
+ mean, variance, gini impurity, entropy etc. of data.
+
+## Methods
+
+ The Statistics utility provides two major functions: `createHistogram` 
and `dataStats`.
+
+### Creating a histogram
+
+ There are two types of histograms:
+ ul
+  li
+   strongContinuous Histograms/strong: These histograms are formed on 
a data set `X: DataSet[Double]` 
+   when the values in `X` are from a continuous range. These histograms 
support
+   `quantile` and `sum`  operations. Here `quantile(q)` refers to a value 
$x_q$ such that $|x: x
+   \leq x_q| = q * |X|$. Further, `sum(s)` refers to the number of 
elements $x \leq s$, which can
+be construed as a cumulative probability value at $s$[Of course, 
iscaled/i probability].
+   br
+A continuous histogram can be formed by calling `X.createHistogram(b)` 
where `b` is the
+number of bins.
+  /li
+  li
+strongCategorical Histograms/strong: These histograms are formed 
on a data set `X:DataSet[Double]` 
+when the values in `X` are from a discrete distribution. These 
histograms
+support `count(c)` operation which returns the number of elements 
associated with cateogry `c`.
+br
+A categorical histogram can be formed by calling 
`X.createHistogram(0)`.
+  /li
+ /ul
+
+### Data Statistics
+
+ The `dataStats` function operates on a data set `X: DataSet[Vector]` and 
returns column-wise
+ statistics for `X`. Every field of `X` is allowed to be defined as either 
idiscrete/i or
+ icontinuous/i.
+ br
+ Statistics can be evaluated by calling `DataStats.dataStats(X)` or 
+ `DataStats.dataStats(X, discreteFields`). The latter is used when some 
fields are needed to be 
+ declared discrete-valued, and is provided as an array of indices of 
fields which are discrete.
+ br
+ The following information is available as part of `DataStats`:
+ ul
+liNumber of elements in `X`/li
+liDimension of `X`/li
+liColumn-wise statistics where for discrete fields, we report counts 
for each category, and
+ the Gini impurity and Entropy of the field, while for continuous 
fields, we report the
+ minimum, maximum, mean and variance.
+/li
+ /ul
+
+## Examples
+
+{% highlight scala %}
+
+import org.apache.flink.ml.statistics._
+import org.apache.flink.ml._
+
+val X: DataSet[Double] = ...
+// Create continuous histogram
+val histogram = X.createHistogram(5) // creates a histogram with five 
bins
+histogram.quantile(0.3)  // returns the 30th quantile
+histogram.sum(4) // returns number of elements 
less than 4
--- End diff --

Is this example valid? `RichDoubleDataSet.createHistogram` returns 
`DataSet[OnlineHistogram]` and it doesn't have methods such as `quantile`, 
`sum`, ..., etc..


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-15 Thread chiwanpark
Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-131425060
  
Hi @sachingoel0101, Thanks for your contribution. I reviewed this PR and 
commented the source code.

There are some problems which aren't commented. In documentation, there are 
many lines including `a`, `ul`, `li`, 'br', 'strong' and 'i' tag. 
These can be replaced with markdown syntax. And some codes 
(`MLUtils.createHistogram`, `ColumnStatistics`) are not formatted.

I didn't finish reviewing process of building histogram. After reading full 
source code and the given paper, I'll add comment about the process.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...

2015-08-10 Thread chiwanpark
Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-129336972
  
Hi, I just discovered the review request. I'll review this PR soon. Because 
I'm busy in working for my graduation essay, maybe I can start reviewing on 
weekend.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---