[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-09-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730562#comment-14730562
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-137690196
  
Looks good to merge except for some minor issues. If there are no objections to 
this PR, I'll merge it tomorrow.


> Implement an online histogram with Merging and equalization features
> 
>
> Key: FLINK-2030
> URL: https://issues.apache.org/jira/browse/FLINK-2030
> Project: Flink
>  Issue Type: Sub-task
>  Components: Machine Learning Library
>Reporter: Sachin Goel
>Assignee: Sachin Goel
>Priority: Minor
>  Labels: ML
>
> For the implementation of the decision tree in 
> https://issues.apache.org/jira/browse/FLINK-1727, we need to implement a 
> histogram with online updates, merging and equalization features. A reference 
> implementation is provided in [1].
> [1] http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
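
As a rough illustration of the "online updates" and "merging" features named in the description, the sketch below follows the update and merge procedures of the referenced Ben-Haim and Yom-Tov paper: each new value becomes its own bin, and whenever the bin budget is exceeded the two closest bins are replaced by their count-weighted mean. The class name `StreamingHistogramSketch` and all of its methods are hypothetical and are not taken from the pull request; equalization is left out.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a streaming histogram; not the code from the pull request. */
final class StreamingHistogramSketch {

    private final int maxBins;
    /** Bin value -> number of items the bin represents, kept sorted by value. */
    private final TreeMap<Double, Long> bins = new TreeMap<>();

    StreamingHistogramSketch(int maxBins) {
        if (maxBins <= 0) {
            throw new IllegalArgumentException("Number of bins must be greater than zero");
        }
        this.maxBins = maxBins;
    }

    /** Online update: the value becomes its own bin, then the bin budget is restored. */
    void add(double value) {
        bins.merge(value, 1L, Long::sum);
        shrink();
    }

    /** Merging: take the union of both bin sets, then restore the bin budget. */
    void merge(StreamingHistogramSketch other) {
        for (Map.Entry<Double, Long> bin : other.bins.entrySet()) {
            bins.merge(bin.getKey(), bin.getValue(), Long::sum);
        }
        shrink();
    }

    /** Replaces the two closest bins by their count-weighted mean until maxBins remain. */
    private void shrink() {
        while (bins.size() > maxBins) {
            Double left = null;
            Double previous = null;
            double smallestGap = Double.POSITIVE_INFINITY;
            for (Double value : bins.keySet()) {
                if (previous != null && (left == null || value - previous < smallestGap)) {
                    smallestGap = value - previous;
                    left = previous;
                }
                previous = value;
            }
            Double right = bins.higherKey(left);
            long leftCount = bins.remove(left);
            long rightCount = bins.remove(right);
            double merged = (left * leftCount + right * rightCount) / (leftCount + rightCount);
            bins.merge(merged, leftCount + rightCount, Long::sum);
        }
    }

    /** Total number of items added so far. */
    long total() {
        long sum = 0;
        for (long count : bins.values()) {
            sum += count;
        }
        return sum;
    }

    TreeMap<Double, Long> bins() {
        return bins;
    }
}
```

The linear scan for the closest pair keeps the sketch short. The diff quoted later in this thread declares a `PriorityQueue` field named `diffQueue`, which suggests the pull request tracks the gaps more efficiently; that is an inference from the field name only.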



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-09-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730560#comment-14730560
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r38735587
  
--- Diff: flink-core/src/main/java/org/apache/flink/api/common/accumulators/ContinuousHistogram.java ---
@@ -0,0 +1,534 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.api.common.accumulators;
+
+import java.util.AbstractMap;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.Map;
+import java.util.PriorityQueue;
+import java.util.Set;
+import java.util.TreeMap;
+
+import static java.lang.Double.MAX_VALUE;
+
+/**
+ * A Histogram accumulator designed for Continuous valued data.
+ * It supports:
+ * -- {@link #quantile(double)}
+ * Computes a quantile of the data
+ * -- {@link #count(double)}
+ * Computes number of items less than the given value in the data
+ * 
+ * A continuous histogram stores values in bins in sorted order and keeps their associated
+ * number of items. It is assumed that the items associated with every bin are scattered around
+ * it, half to the right and half to the left.
+ * 
+ * bin counters:  m_1    m_2    m_3    m_4    m_5    m_6
+ *                10     12     5      10     4      6
+ *                |  5   |  6   |  2.5 |  5   |  2   |
+ *             5  |  +   |  +   |   +  |  +   |  +   |  3
+ *                |  6   |  2.5 |   5  |  2   |  3   |
+ * - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
+ * bin index:     1      2      3      4      5      6
+ * bin values:    v_1 <  v_2 <  v_3 <  v_4 <  v_5 <  v_6
+ * 
+ * The number of items between v_i and v_(i+1) is directly proportional to the area of
+ * trapezoid (v_i, v_(i+1), m_(i+1), m_i)
+ * 
+ * Adapted from Ben-Haim and Yom-Tov's
+ * <a href = "http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf">Streaming Decision Tree Algorithm's histogram</a>
+ */
+public class ContinuousHistogram implements Accumulator> {
+
+   protected TreeMap treeMap = new TreeMap();
+
+   protected long counter = 0;
+
+   private int bin;
+
+   private double lower;
+
+   private double upper;
+
+   private PriorityQueue diffQueue;
+
+   private HashMap keyUpdateTimes;
+
+   private long timestamp;
+
+   /**
+* Creates a new Continuous histogram with the given number of bins
+* Bins represents the number of values the histogram stores to approximate the continuous
+* data set. The higher this value, the more we move towards an exact representation of data.
+*
+* @param numBins Number of bins in the histogram
+*/
+   public ContinuousHistogram(int numBins) {
+   if (numBins <= 0) {
+   throw new IllegalArgumentException("Number of bins must be greater than zero");
+   }
+   bin = numBins;
+   lower = MAX_VALUE;
+   upper = -MAX_VALUE;
+   diffQueue = new PriorityQueue<>();
+   keyUpdateTimes = new HashMap<>();
+   timestamp = 0;
+   }
+
+   /**
+* Consider using {@link #add(double)} for primitive double values to get better performance.
+*/
+   @Override
+   public void add(Double value) {
+   add(value, 1);
+   }
+
+   public void add(double value) {
+   add(value, 1);
+   }
+
+   @Override
+   public TreeMap getLocalValue() {
+   return this.treeMap;
+   }
+
+   /**
+* Get the total number of items added to this histogram.
+
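
The class comment in the diff above assumes that each bin's items lie half to its left and half to its right, and that the items between neighbouring bins follow a trapezoid. A minimal sketch of the resulting count estimate, following the sum procedure of the Ben-Haim and Yom-Tov paper, might look like this. `TrapezoidCountSketch`, the `TreeMap<Double, Long>` layout, and the handling of values outside the first and last bin (0 below, the full total at or above) are assumptions of this sketch, not the pull request's `count(double)`.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of the trapezoid-based count estimate; not the PR's implementation. */
final class TrapezoidCountSketch {

    /** Estimates how many items are at most b, given bins as (bin value -> item count). */
    static double count(TreeMap<Double, Long> bins, double b) {
        if (bins.isEmpty() || b < bins.firstKey()) {
            return 0.0; // simplification: nothing is counted below the first bin
        }
        double total = 0.0;
        for (long m : bins.values()) {
            total += m;
        }
        if (b >= bins.lastKey()) {
            return total; // simplification: everything is counted from the last bin on
        }

        Map.Entry<Double, Long> left = bins.floorEntry(b);   // bin i with v_i <= b
        Map.Entry<Double, Long> right = bins.higherEntry(b); // bin i+1 with b < v_(i+1)
        double vLeft = left.getKey();
        double mLeft = left.getValue();
        double vRight = right.getKey();
        double mRight = right.getValue();

        // Height of the trapezoid's slanted edge at b, by linear interpolation.
        double fraction = (b - vLeft) / (vRight - vLeft);
        double mAtB = mLeft + (mRight - mLeft) * fraction;

        // Area of the partial trapezoid between v_i and b.
        double estimate = (mLeft + mAtB) / 2.0 * fraction;

        // All bins strictly left of bin i count fully, bin i contributes its left half.
        for (long m : bins.headMap(vLeft, false).values()) {
            estimate += m;
        }
        return estimate + mLeft / 2.0;
    }
}
```

With the counters from the class comment (10, 12, 5, 10, 4, 6) placed at the hypothetical bin values 1 to 6, the estimate at the second bin is 5 + (5 + 6) = 16, which matches the half-counts drawn in the diagram.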

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-09-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730581#comment-14730581
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r38736955
  

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-09-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730578#comment-14730578
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r38736695
  

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-09-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730561#comment-14730561
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r38735611
  

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-09-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730559#comment-14730559
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r38735464
  
--- Diff: flink-core/src/main/java/org/apache/flink/api/common/accumulators/ContinuousHistogram.java ---
@@ -0,0 +1,534 @@
+ * Adapted from Ben-Haim and Yom-Tov's
+ * <a href = "http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf">Streaming Decision Tree Algorithm's histogram</a>
--- End diff --

Please remove spaces around `=`




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-09-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728571#comment-14728571
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r38616759
  
--- Diff: flink-core/src/main/java/org/apache/flink/api/common/accumulators/ContinuousHistogram.java ---
@@ -0,0 +1,490 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.api.common.accumulators;
+
+import java.util.AbstractMap;
+import java.util.Iterator;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.PriorityQueue;
+import java.util.Set;
+import java.util.TreeMap;
+
+import static java.lang.Double.MAX_VALUE;
+
+/**
+ * A Histogram accumulator designed for Continuous valued data.
+ * It supports:
+ * -- {@link #quantile(double)}
+ * Computes a quantile of the data
+ * -- {@link #count(double)}
+ * Computes number of items less than the given value in the data
+ * 
+ * A continuous histogram stores values in bins in sorted order and keeps their associated
+ * number of items. It is assumed that the items associated with every bin are scattered around
+ * it, half to the right and half to the left.
+ * 
+ * bin counters:  m_1    m_2    m_3    m_4    m_5    m_6
+ *                10     12     5      10     4      6
+ *                |  5   |  6   |  2.5 |  5   |  2   |
+ *             5  |  +   |  +   |   +  |  +   |  +   |  3
+ *                |  6   |  2.5 |   5  |  2   |  3   |
+ * - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
+ * bin index:     1      2      3      4      5      6
+ * bin values:    v_1 <  v_2 <  v_3 <  v_4 <  v_5 <  v_6
+ * 
+ * The number of items between v_i and v_(i+1) is directly proportional to the area of
+ * trapezoid (v_i, v_(i+1), m_(i+1), m_i)
+ * 
+ * Adapted from Ben-Haim and Yom-Tov's
+ * <a href = "http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf">Streaming Decision Tree Algorithm's histogram</a>
+ */
+public class ContinuousHistogram extends Histogram {
+
+   private int bin;
+   private double lower;
+   private double upper;
+   private PriorityQueue diffQueue;
+   private HashMap keyUpdateTimes;
+   private long timestamp;
+
+   /**
+* Creates a new Continuous histogram with the given number of bins
+* Bins represents the number of values the histogram stores to approximate the continuous
+* data set. The higher this value, the more we move towards an exact representation of data.
+*
+* @param bin Number of bins in the histogram
+*/
+   public ContinuousHistogram(int bin) {
+   if (bin <= 0) {
+   throw new IllegalArgumentException("Number of bins must be greater than zero");
+   }
+   this.bin = bin;
+   lower = MAX_VALUE;
+   upper = -MAX_VALUE;
+   diffQueue = new PriorityQueue<>();
+   keyUpdateTimes = new HashMap<>();
+   timestamp = 0;
+   }
+
+   @Override
+   public void resetLocal() {
+   super.resetLocal();
+   this.lower = MAX_VALUE;
+   this.upper = -MAX_VALUE;
+   this.diffQueue.clear();
+   this.keyUpdateTimes.clear();
+   }
+
+   /**
+* Merges the given other histogram into this histogram, with the number of bins in the
+* merged histogram being {@code numBins}.
+*
+* @param other   Histogram to be merged
+* @param numBins Bins in the merged histogram
+*/
+   public void merge(Accumulator> other, int numBins) {
+   bin = numBins;
+   super.merge(other);
+   }
+
+   

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-09-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728701#comment-14728701
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-137379709
  
+1 for adding a class for `Double` values.

Out of curiosity, could `DiscreteHistogram` be used for the decision tree? In 
the given paper, they use histograms built from continuous data.




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-09-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728709#comment-14728709
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-137380847
  
Yes. For discrete fields, quantiles do not make sense. The paper covers only 
continuous fields, since discrete fields are more or less trivial to handle 
(unless there are too many categories).
However, if we separate the two histogram types, there is no need to 
implement a base class. The only shared functionality is the basic 
infrastructure and fields, and, as you pointed out, the two are used 
differently. Or should I do that? I really can't settle on this.
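
To make the "more or less trivial" discrete case concrete: a histogram over discrete fields only needs an exact count per category, and merging two of them is a per-category sum. The sketch below assumes this reading of the comment; `DiscreteCountsSketch` and its methods are hypothetical names and do not reproduce the pull request's `DiscreteHistogram`.

```java
import java.util.TreeMap;

/** Hypothetical sketch of exact per-category counting; not the PR's DiscreteHistogram. */
final class DiscreteCountsSketch {

    /** Category value -> exact number of occurrences. */
    private final TreeMap<Double, Integer> counts = new TreeMap<>();

    void add(double category) {
        counts.merge(category, 1, Integer::sum);
    }

    /** Merging is lossless: counts of equal categories are simply added. */
    void merge(DiscreteCountsSketch other) {
        other.counts.forEach((category, count) -> counts.merge(category, count, Integer::sum));
    }

    /** Exact number of occurrences of one category; there is no interpolation between categories. */
    int count(double category) {
        return counts.getOrDefault(category, 0);
    }

    int categories() {
        return counts.size();
    }
}
```

Because no bins are ever merged, the map grows with the number of distinct categories, which is the "too many categories" caveat raised above.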




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-09-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728718#comment-14728718
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-137382016
  
Okay. Please take a look now. 
[The IDE introduced an unintentional formatting change in Histogram.java. 
I'll remove it while squashing.]




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-09-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728625#comment-14728625
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r38619843
  
--- Diff: flink-java/src/main/java/org/apache/flink/api/java/utils/DataSetUtils.java ---
@@ -248,6 +251,58 @@ public void mapPartition(Iterable values, Collector> out) thr
input.getType(), sampleInCoordinator, callLocation);
}
 
+   /**
+* Creates a {@link org.apache.flink.api.common.accumulators.DiscreteHistogram} from the data set
+*
+* @param data Discrete valued data set
+* @return A histogram over data
+*/
+   public static DataSet createDiscreteHistogram(DataSet data) {
+   return data.mapPartition(new RichMapPartitionFunction() {
+   @Override
+   public void mapPartition(Iterable values, Collector out)
+   throws Exception {
+   DiscreteHistogram histogram = new DiscreteHistogram();
+   for (double value : values) {
+   histogram.add(value);
+   }
+   out.collect(histogram);
+   }
+   }).reduce(new ReduceFunction() {
+   @Override
+   public DiscreteHistogram reduce(DiscreteHistogram value1, DiscreteHistogram value2) throws Exception {
+   value1.merge(value2);
+   return value1;
+   }
+   });
+   }
+
+   /**
+* Creates a {@link org.apache.flink.api.common.accumulators.ContinuousHistogram} from the data set
+*
+* @param data Continuous valued data set
+* @param bins Number of bins in the histogram
+* @return A histogram over data
+*/
+   public static DataSet createContinuousHistogram(DataSet data, final int bins) {
+   return data.mapPartition(new RichMapPartitionFunction() {
+   @Override
+   public void mapPartition(Iterable values, Collector out)
+   throws Exception {
--- End diff --

Same here (unnecessary new line)




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-09-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728624#comment-14728624
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r38619828
  
--- Diff: 
flink-java/src/main/java/org/apache/flink/api/java/utils/DataSetUtils.java ---
@@ -248,6 +251,58 @@ public void mapPartition(Iterable values, 
Collector> out) thr
input.getType(), sampleInCoordinator, callLocation);
}
 
+   /**
+    * Creates a {@link org.apache.flink.api.common.accumulators.DiscreteHistogram} from the data set
+    *
+    * @param data Discrete valued data set
+    * @return A histogram over data
+    */
+   public static DataSet<DiscreteHistogram> createDiscreteHistogram(DataSet<Double> data) {
+       return data.mapPartition(new RichMapPartitionFunction<Double, DiscreteHistogram>() {
+           @Override
+           public void mapPartition(Iterable<Double> values, Collector<DiscreteHistogram> out)
+                   throws Exception {
--- End diff --

Unnecessary new line




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-09-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728712#comment-14728712
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-137381508
  
I'm inclined to avoid a base class for the histograms. The Decision Tree will 
be implemented in Scala, and we can use pattern matching (a `match`/`case` 
expression) to handle this.




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-09-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728685#comment-14728685
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-137376194
  
Hi, I just reviewed the updated PR. Sorry for the delay.

Your implementation is nice, and it seems to behave as I expected. But I have to 
raise the following:

I'm concerned about changing the `Histogram` class, because it breaks the API. I'm 
not sure this break is necessary. Reverting the changes to `Histogram` 
would be better. Because there are many differences between 
`ContinuousHistogram` and `DiscreteHistogram`, we don't need to create a base 
class for them.

I'm sorry that merging this PR is taking so long. It has almost reached 
the goal. Cheer up!




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-09-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728687#comment-14728687
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-137378160
  
Yes. I was concerned about the API-breaking part too. 
What about keeping the original histogram as is, and adding something like 
`DoubleHistogram`, which basically works on `Double` values? 
As I look at this more and more, I'm also arriving at the conclusion that 
we need not have a base class for the two histogram types. Let me change this 
and I'll push a patch in a while.




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-09-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728611#comment-14728611
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-137358805
  
Fixed.




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-09-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728812#comment-14728812
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-137399688
  
Hey @chiwanpark , travis passes. Let me know if I should squash the 
commits. I'm only keeping them in case we need to go back to the Scala 
implementation. [which I really don't see the point of though.]




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-09-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14725673#comment-14725673
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-136790993
  
Ah, sorry for the delay. I'm busy at the moment and may need 1-2 days to review. 
I'll try to review ASAP.




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-09-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14725674#comment-14725674
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-136791499
  
Okay. Sure. :) No problem.




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14708418#comment-14708418
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-133870941
  
+1 for moving the histogram functions into `DataSetUtils`. It would be helpful 
for range partitioning. I'll review this in the next few days.




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14708043#comment-14708043
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-133707118
  
@tillrohrmann, thanks for the brilliant suggestions. Using a `TreeMap` and a 
`PriorityQueue` with invalidation, I've managed to bring down the complexity of 
the `add` and `merge` operations to logarithmic time. Further, `quantile` and 
`count` are linear time only, as they should be.
I've also decided to put both histograms in the `accumulators` 
package, since they're supposed to work like accumulators anyway. There was already a 
*discrete* histogram in that package; the *continuous* one now 
resides in the same place.
Also, the `DataSetUtils` class now contains functions to create histograms, 
providing access to these classes from the Java API itself instead of the ML 
library. That needed to be done sooner or later; FLINK-2274 actually asks 
for it. 
@thvasilo @chiwanpark  
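
To make the `TreeMap` idea above concrete, here is a minimal sketch of a bounded streaming histogram in the style of the Ben-Haim and Yom-Tov paper. It is not the PR's `ContinuousHistogram`: the class name `StreamingHistogramSketch` and its methods are hypothetical, and the closest-pair search is a plain linear scan rather than the `PriorityQueue` with invalidation mentioned in the comment, so only the bin lookup and insertion are logarithmic here.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a bounded streaming histogram; not the PR's implementation. */
final class StreamingHistogramSketch {
    private final int capacity;
    // Bin centroid -> number of points merged into that bin, kept sorted by centroid.
    private final TreeMap<Double, Integer> bins = new TreeMap<>();

    StreamingHistogramSketch(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("Capacity should be a positive integer");
        }
        this.capacity = capacity;
    }

    /** Adds one value: an O(log n) TreeMap update, followed by at most one bin merge. */
    void add(double value) {
        bins.merge(value, 1, Integer::sum);
        if (bins.size() > capacity) {
            mergeClosestPair();
        }
    }

    /** Folds another histogram into this one, then shrinks back to capacity. */
    void merge(StreamingHistogramSketch other) {
        for (Map.Entry<Double, Integer> e : other.bins.entrySet()) {
            bins.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        while (bins.size() > capacity) {
            mergeClosestPair();
        }
    }

    // Replaces the two adjacent bins with the smallest gap by their weighted centroid.
    // This scan is linear in the number of bins; a priority queue of gaps would avoid it.
    private void mergeClosestPair() {
        Double prev = null;
        Double left = null;
        Double right = null;
        double bestGap = Double.POSITIVE_INFINITY;
        for (Double key : bins.keySet()) {
            if (prev != null && key - prev < bestGap) {
                bestGap = key - prev;
                left = prev;
                right = key;
            }
            prev = key;
        }
        int leftCount = bins.remove(left);
        int rightCount = bins.remove(right);
        double centroid = (left * leftCount + right * rightCount) / (leftCount + rightCount);
        bins.merge(centroid, leftCount + rightCount, Integer::sum);
    }

    int binCount() {
        return bins.size();
    }

    long total() {
        long sum = 0;
        for (int count : bins.values()) {
            sum += count;
        }
        return sum;
    }

    int countAt(double centroid) {
        return bins.getOrDefault(centroid, 0);
    }
}
```

With a capacity of 2, adding 1.0, 2.0 and 10.0 leaves two bins: a centroid at 1.5 holding two points, and 10.0 holding one.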




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706689#comment-14706689
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-133410381
  
Hi @tillrohrmann , I have addressed most of the comments.
1. Improved the documentation
2. Ported the continuous histogram implementation to `TreeMap`
3. Minimized copying of data




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705286#comment-14705286
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-133077364
  
I fear that the PR cannot be merged in this state. It contains some serious 
performance issues which first have to be addressed. Mainly they originate from 
performing linear time complexity operations for each element. This will result 
in a quadratic runtime complexity. Once the problems have been addressed, I'll 
review the PR again.
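
The quadratic behaviour described above can be illustrated with a small sketch. It is an assumption-labelled example rather than code from the PR: the hypothetical helper `shiftsForSortedInserts` keeps an array-backed list sorted and counts how many existing elements each insertion has to move, which is the per-element linear cost.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper; counts element moves when keeping an array-backed list sorted. */
final class SortedInsertCost {
    static long shiftsForSortedInserts(double[] values) {
        List<Double> sorted = new ArrayList<>();
        long shifts = 0;
        for (double v : values) {
            // Finding the slot is O(log n) ...
            int pos = Collections.binarySearch(sorted, v);
            int index = pos >= 0 ? pos : -(pos + 1);
            // ... but inserting moves every element from index onwards one slot right.
            shifts += sorted.size() - index;
            sorted.add(index, v);
        }
        return shifts;
    }
}
```

For n values arriving in descending order, every insert lands at index 0, so the total number of moves is 0 + 1 + ... + (n - 1) = n(n - 1)/2. A balanced tree such as `TreeMap` avoids the shifting and keeps each insert logarithmic.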




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705294#comment-14705294
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37554555
  
--- Diff: docs/libs/ml/statistics.md ---
@@ -0,0 +1,69 @@
+---
+mathjax: include
+htmlTitle: FlinkML - Statistics
+title: <a href="../ml">FlinkML</a> - Statistics
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+ The statistics utility provides features such as building histograms over data.
+
+## Methods
+
+ The Statistics utility provides two major functions: `createHistogram` and
+ `createDiscreteHistogram`.
+
+### Creating a histogram
+
+ There are two types of histograms:
+   1. **Continuous Histograms**: These histograms are formed on a data set `X: DataSet[Double]`
+   when the values in `X` are from a continuous range. These histograms support
+   `quantile` and `sum`  operations. Here `quantile(q)` refers to a value $x_q$ such that $|x: x
+   \leq x_q| = q * |X|$. Further, `sum(s)` refers to the number of elements $x \leq s$, which can
+be construed as a cumulative probability value at $s$[Of course, *scaled* probability].
--- End diff --

I understand what you want to say, but I think it's not well formulated. 
IMO it's better to clearly define what `sum(s)` or better what `count(s)` 
means. E.g. The value sum(s) represents the number of elements in X whose 
value is less than s as you've said. But the rest is not necessary.
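
As a reference point for such a definition, the following sketch computes the exact quantities that the histogram only approximates. The class `ExactOrderStatistics` and its method names are hypothetical and not part of the PR, and the sketch follows the `$x \leq s$` wording of the documentation under review, so it counts elements less than or equal to `s`.

```java
import java.util.Arrays;

/** Hypothetical reference definitions; the histogram returns approximations of these. */
final class ExactOrderStatistics {
    /** count(s): the number of elements x in data with x <= s. */
    static int count(double[] data, double s) {
        int n = 0;
        for (double x : data) {
            if (x <= s) {
                n++;
            }
        }
        return n;
    }

    /** quantile(q): the smallest element x_q such that at least q * |X| elements are <= x_q. */
    static double quantile(double[] data, double q) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(q * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

For the data set {1, 2, 3, 4}, `quantile(0.5)` is 2 and `count(2)` is 2, which equals `0.5 * |X|`, matching the relation between the two operations stated in the documentation.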




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705295#comment-14705295
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37554624
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+    *
+    * @param p value to be added
+    */
+  override def add(p: Double): Unit = {
+    require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+    // search for the index where the value is just higher than p
+    val search = find(p)
+    // add the new value there, shifting everything to the right
+    data.insert(search, (p, 1))
+    // If we're over capacity or any two elements are within 1e-9 of each other, merge.
+    // This will take care of the case if p was actually equal to some value in the histogram and
+    // just increment the value there
+    mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+    *
+    * @param h histogram to be merged
+    * @param B capacity of the resultant histogram
+    * @return Merged histogram with capacity B
+    */
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+    h match {
+      case temp: ContinuousHistogram => {
+        val m: Int = bins
+        val n: Int = temp.bins
+        var i, j: Int = 0
+        val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+        while (i < m || j < n) {
+          if (i >= m) {
+            mergeList += ((temp.getValue(j), temp.getCounter(j)))
+            j = j + 1
+          } else if (j >= n || getValue(i) <= temp.getValue(j)) {
+            mergeList += data.apply(i)
+            i = i + 1
+          } else {
+            mergeList += ((temp.getValue(j), temp.getCounter(j)))
+            j = j + 1
+          }
+        }
+        // the size will be brought to capacity while constructing the new histogram itself
+        val finalLower = Math.min(lower, temp.lower)
+        val finalUpper = Math.max(upper, temp.upper)
+        val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+        ret.loadData(mergeList.toArray)
+        ret
+      }
+      case default =>
+        throw new RuntimeException("Only a continuous histogram is allowed to be merged with a " +
+          "continuous histogram")
+
+    }
+  }
+
+  /** Returns the qth quantile of the histogram
+    *
+    * @param q Quantile value in (0,1)
+    * @return Value at quantile q
+    */
+  def quantile(q: Double): Double = {
+    require(bins > 0, "Histogram is empty")
+    require(q > 0 && q < 1, "Quantile must be between 0 and 1")

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705299#comment-14705299
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37554683
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
--- End diff --

A TreeSet won't suffice here since we need the values to be sorted.




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705302#comment-14705302
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37554750
  
--- Diff: docs/libs/ml/statistics.md ---
@@ -0,0 +1,69 @@
+---
+mathjax: include
+htmlTitle: FlinkML - Statistics
+title: <a href="../ml">FlinkML</a> - Statistics
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+ The statistics utility provides features such as building histograms over data.
+
+## Methods
+
+ The Statistics utility provides two major functions: `createHistogram` and
+ `createDiscreteHistogram`.
+
+### Creating a histogram
+
+ There are two types of histograms:
+   1. **Continuous Histograms**: These histograms are formed on a data set `X: DataSet[Double]`
+   when the values in `X` are from a continuous range. These histograms support
+   `quantile` and `sum`  operations. Here `quantile(q)` refers to a value $x_q$ such that $|x: x
+   \leq x_q| = q * |X|$. Further, `sum(s)` refers to the number of elements $x \leq s$, which can
+be construed as a cumulative probability value at $s$[Of course, *scaled* probability].
+   2. A continuous histogram can be formed by calling `X.createHistogram(b)` where `b` is the
--- End diff --

Then this should be written there. Some of my remarks are rhetorical in 
nature, meant to point you to what I think is missing, especially 
when it concerns the documentation.




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705305#comment-14705305
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37554978
  
--- Diff: docs/libs/ml/statistics.md ---
@@ -0,0 +1,69 @@
+---
+mathjax: include
+htmlTitle: FlinkML - Statistics
+title: <a href="../ml">FlinkML</a> - Statistics
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+ The statistics utility provides features such as building histograms over data.
+
+## Methods
+
+ The Statistics utility provides two major functions: `createHistogram` and
+ `createDiscreteHistogram`.
+
+### Creating a histogram
+
+ There are two types of histograms:
+   1. **Continuous Histograms**: These histograms are formed on a data set `X: DataSet[Double]`
+   when the values in `X` are from a continuous range. These histograms support
+   `quantile` and `sum` operations. Here `quantile(q)` refers to a value $x_q$ such that $|x: x
+   \leq x_q| = q * |X|$. Further, `sum(s)` refers to the number of elements $x \leq s$, which can
+be construed as a cumulative probability value at $s$[Of course, *scaled* probability].
+   2. A continuous histogram can be formed by calling `X.createHistogram(b)` where `b` is the
+number of bins.
+**Discrete Histograms**: These histograms are formed on a data set `X:DataSet[Double]`
+when the values in `X` are from a discrete distribution. These histograms
+support `count(c)` operation which returns the number of elements associated with category `c`.
+<br>
+A discrete histogram can be formed by calling `MLUtils.createDiscreteHistogram(X)`.
--- End diff --

Hmm, why do we have discrete histograms? Are they necessary?
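
For context on the question above: the `count(c)` operation described in the quoted documentation amounts to counting occurrences per distinct value. A minimal plain-Scala sketch of that idea follows, with no Flink dependency. `DiscreteCountSketch`, `counts`, and `count` are hypothetical names used only for illustration; they are not part of the pull request.

```scala
// Hypothetical illustration only; not the PR's DiscreteHistogram.
object DiscreteCountSketch {
  /** Number of occurrences of each distinct value in xs. */
  def counts(xs: Seq[Double]): Map[Double, Int] =
    xs.groupBy(identity).map { case (value, group) => value -> group.size }

  /** Number of elements equal to category c (0 if c never occurs). */
  def count(xs: Seq[Double], c: Double): Int =
    counts(xs).getOrElse(c, 0)
}
```

Seen this way, a discrete histogram is essentially a grouped count, which may be why its necessity as a separate class is being questioned.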




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705306#comment-14705306
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37555025
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/MLUtils.scala ---
@@ -119,4 +123,65 @@ object MLUtils {
 
     stringRepresentation.writeAsText(filePath)
   }
+
+  /** Create a [[ContinuousHistogram]] from the input data
+    *
+    * @param bins Number of bins required
+    * @param data input [[DataSet]] of [[Double]]
+    * @return [[ContinuousHistogram]] over the data
+    */
+  def createContinuousHistogram(data: DataSet[Double], bins: Int): DataSet[ContinuousHistogram] = {
+    val min = data.reduce((x, y) => Math.min(x, y))
+    val max = data.reduce((x, y) => Math.max(x, y))
+
+    val stats = min.mapWithBcVariable(max) {
+      (minimum, maximum) => (minimum - 2 * (maximum - minimum), maximum + 2 * (maximum - minimum))
+    }
+
+    data.mapPartition(new RichMapPartitionFunction[Double, ContinuousHistogram] {
--- End diff --

Yes go ahead.




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705314#comment-14705314
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37555313
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
--- End diff --

Then you're inconsistent with your usage of terms with respect to here and 
the documentation.




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705309#comment-14705309
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37555199
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/MLUtils.scala ---
@@ -119,4 +123,65 @@ object MLUtils {
 
     stringRepresentation.writeAsText(filePath)
   }
+
+  /** Create a [[ContinuousHistogram]] from the input data
+    *
+    * @param bins Number of bins required
+    * @param data input [[DataSet]] of [[Double]]
+    * @return [[ContinuousHistogram]] over the data
+    */
+  def createContinuousHistogram(data: DataSet[Double], bins: Int): DataSet[ContinuousHistogram] = {
+    val min = data.reduce((x, y) => Math.min(x, y))
+    val max = data.reduce((x, y) => Math.max(x, y))
+
+    val stats = min.mapWithBcVariable(max) {
+      (minimum, maximum) => (minimum - 2 * (maximum - minimum), maximum + 2 * (maximum - minimum))
+    }
+
+    data.mapPartition(new RichMapPartitionFunction[Double, ContinuousHistogram] {
+      var statistics: (Double, Double) = _
+
+      override def open(configuration: Configuration): Unit = {
+        statistics = getRuntimeContext.getBroadcastVariable(HISTOGRAM_STATS).get(0)
+        val minimum = statistics._1
+        val maximum = statistics._2
+        statistics = (minimum - 2 * (maximum - minimum), maximum + 2 * (maximum - minimum))
+      }
+
+      override def mapPartition(
+          values: java.lang.Iterable[Double],
+          out: Collector[ContinuousHistogram])
+        : Unit = {
+        val localHistogram = new ContinuousHistogram(bins, statistics._1, statistics._2)
+        val iterator = values.iterator()
+        while (iterator.hasNext) {
+          localHistogram.add(iterator.next())
+        }
+        out.collect(localHistogram)
+      }
+    })
+      .withBroadcastSet(stats, HISTOGRAM_STATS)
+      .reduce((x, y) => x.merge(y, bins))
+  }
+
+  /** Create a [[DiscreteHistogram]] from the input data
+    *
+    * @param data input [[DataSet]] of [[Double]]
+    * @return [[DiscreteHistogram]] over the data
+    */
+  def createDiscreteHistogram(data: DataSet[Double]): DataSet[DiscreteHistogram] = {
+    data.mapPartition(new RichMapPartitionFunction[Double, DiscreteHistogram] {
--- End diff --

I don't understand. You can write something like `data.mapPartition{ 
iterator => val myHistogram = ... ; do something with myHistogram; 
Seq(myHistogram) }`. This also only creates a single histogram.
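
The shape of that suggestion can be sketched in plain Scala without any Flink dependency: each partition's iterator is folded into one histogram, emitted as a single-element `Seq`, and the partial results are then reduced. This is only an illustration under stated assumptions: `SimpleHistogram`, `histogramPerPartition`, and the simulated `partitions` are hypothetical stand-ins, not the PR's `ContinuousHistogram` or Flink's `DataSet` API.

```scala
// Hypothetical stand-in for the function-literal style suggested above.
object PartitionHistogramSketch {
  /** Toy histogram: exact counts per value (not the PR's ContinuousHistogram). */
  final case class SimpleHistogram(counts: Map[Double, Int]) {
    def add(x: Double): SimpleHistogram =
      SimpleHistogram(counts.updated(x, counts.getOrElse(x, 0) + 1))

    def merge(other: SimpleHistogram): SimpleHistogram =
      SimpleHistogram(other.counts.foldLeft(counts) { case (acc, (value, n)) =>
        acc.updated(value, acc.getOrElse(value, 0) + n)
      })

    def total: Int = counts.values.sum
  }

  /** Consumes one partition's iterator and emits exactly one histogram. */
  def histogramPerPartition(values: Iterator[Double]): Seq[SimpleHistogram] = {
    val hist = values.foldLeft(SimpleHistogram(Map.empty[Double, Int])) {
      (acc, x) => acc.add(x)
    }
    Seq(hist)
  }

  // Two simulated partitions; flatMap plays the role of mapPartition here.
  val partitions: Seq[Seq[Double]] = Seq(Seq(1.0, 2.0, 2.0), Seq(2.0, 3.0))

  val merged: SimpleHistogram =
    partitions.flatMap(p => histogramPerPartition(p.iterator)).reduce(_ merge _)
}
```

The point of the comparison is that a function literal over the iterator already yields one histogram per partition, so an anonymous `RichMapPartitionFunction` is only needed when the function must access the runtime context (for example, broadcast variables).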




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705320#comment-14705320
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37555463
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+    *
+    * @param p value to be added
+    */
+  override def add(p: Double): Unit = {
+    require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+    // search for the index where the value is just higher than p
+    val search = find(p)
+    // add the new value there, shifting everything to the right
+    data.insert(search, (p, 1))
+    // If we're over capacity or any two elements are within 1e-9 of each other, merge.
+    // This will take care of the case if p was actually equal to some value in the histogram and
+    // just increment the value there
+    mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+    *
+    * @param h histogram to be merged
+    * @param B capacity of the resultant histogram
+    * @return Merged histogram with capacity B
+    */
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+    h match {
+      case temp: ContinuousHistogram => {
+        val m: Int = bins
+        val n: Int = temp.bins
+        var i, j: Int = 0
+        val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+        while (i < m || j < n) {
+          if (i >= m) {
+            mergeList += ((temp.getValue(j), temp.getCounter(j)))
+            j = j + 1
+          } else if (j >= n || getValue(i) <= temp.getValue(j)) {
+            mergeList += data.apply(i)
+            i = i + 1
+          } else {
+            mergeList += ((temp.getValue(j), temp.getCounter(j)))
+            j = j + 1
+          }
+        }
+        // the size will be brought to capacity while constructing the new histogram itself
+        val finalLower = Math.min(lower, temp.lower)
+        val finalUpper = Math.max(upper, temp.upper)
+        val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+        ret.loadData(mergeList.toArray)
+        ret
+      }
+      case default =>
+        throw new RuntimeException("Only a continuous histogram is allowed to be merged with a " +
+          "continuous histogram")
+
+    }
+  }
+
+  /** Returns the qth quantile of the histogram
+    *
+    * @param q Quantile value in (0,1)
+    * @return Value at quantile q
+    */
+  def quantile(q: Double): Double = {
--- End diff --

What about `Option[Double]`?
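
One way to read this suggestion: instead of failing a `require` on an empty histogram, `quantile` could return `None`. The sketch below illustrates that signature on a plain sequence of `(value, count)` bins. It is an assumption-laden simplification: `QuantileOptionSketch` is a hypothetical name, and the lookup picks the first bin whose cumulative count reaches `q * total` rather than interpolating as the Ben-Haim and Yom-Tov procedure does.

```scala
// Hypothetical illustration of an Option-returning quantile; not the PR's code.
object QuantileOptionSketch {
  /** Returns the first bin value whose cumulative count reaches q * total,
    * or None when there are no bins. No interpolation between bins.
    */
  def quantile(bins: Seq[(Double, Int)], q: Double): Option[Double] = {
    require(q > 0 && q < 1, "Quantile must be between 0 and 1")
    if (bins.isEmpty) {
      None
    } else {
      val sorted = bins.sortBy(_._1)
      val target = q * sorted.map(_._2).sum
      // Running totals of the counts, aligned with `sorted`.
      val cumulative = sorted.scanLeft(0)((acc, bin) => acc + bin._2).tail
      sorted.zip(cumulative).collectFirst {
        case ((value, _), seen) if seen >= target => value
      }
    }
  }
}
```

With this shape the caller decides how to handle the empty case (for example with `getOrElse` or pattern matching) instead of catching an `IllegalArgumentException`.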


 

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705323#comment-14705323
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r3737
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+    require(q > 0 && q < 1, "Quantile must be between 0 and 1")
  

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705318#comment-14705318
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37555399
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/OnlineHistogram.scala
 ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+/** Base trait for an Online Histogram
--- End diff --

Then you should maybe document this in the code.




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705329#comment-14705329
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37555777
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+    require(q > 0 && q < 1, "Quantile must be between 0 and 1")
  

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705337#comment-14705337
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37556169
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+    require(q > 0 && q < 1, "Quantile must be between 0 and 1")
  

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704739#comment-14704739
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-132987549
  
Looks good to merge. If there is no opposition in a few hours, I'll merge 
this to master.




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705342#comment-14705342
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37556403
  
--- Diff: docs/libs/ml/statistics.md ---
@@ -0,0 +1,69 @@
+---
+mathjax: include
+htmlTitle: FlinkML - Statistics
+title: <a href="../ml">FlinkML</a> - Statistics
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+ The statistics utility provides features such as building histograms over data.
+
+## Methods
+
+ The Statistics utility provides two major functions: `createHistogram` and
+ `createDiscreteHistogram`.
+
+### Creating a histogram
+
+ There are two types of histograms:
+   1. **Continuous Histograms**: These histograms are formed on a data set `X: DataSet[Double]`
+   when the values in `X` are from a continuous range. These histograms support
+   `quantile` and `sum` operations. Here `quantile(q)` refers to a value $x_q$ such that $|x: x
+   \leq x_q| = q * |X|$. Further, `sum(s)` refers to the number of elements $x \leq s$, which can
+be construed as a cumulative probability value at $s$ [Of course, *scaled* probability].
+<br>
+A continuous histogram can be formed by calling `X.createHistogram(b)` where `b` is the
+number of bins.
+   2. **Discrete Histograms**: These histograms are formed on a data set `X: DataSet[Double]`
+when the values in `X` are from a discrete distribution. These histograms
+support the `count(c)` operation, which returns the number of elements associated with category `c`.
+<br>
+A discrete histogram can be formed by calling `MLUtils.createDiscreteHistogram(X)`.
--- End diff --

Well, they provide a nice way to represent discrete data, with fast access to 
the elements belonging to any class. Plus, they're essential to the 
decision tree implementation.
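
To make the "fast access" point concrete, here is a minimal sketch of a per-category counter, assuming a plain hash map is enough. `DiscreteCounts` and its methods are hypothetical names for illustration only and are not the `DiscreteHistogram` class from this pull request.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a discrete histogram: one counter per category.
class DiscreteCounts {
  private val counts = mutable.HashMap.empty[Double, Int]

  // Record one occurrence of category c.
  def add(c: Double): Unit = {
    counts.update(c, counts.getOrElse(c, 0) + 1)
  }

  // Number of elements seen for category c (0 if it was never seen).
  def count(c: Double): Int = counts.getOrElse(c, 0)

  // Combine two histograms by summing the per-category counters.
  def merge(other: DiscreteCounts): DiscreteCounts = {
    val result = new DiscreteCounts
    for ((k, v) <- counts) result.counts.update(k, v)
    for ((k, v) <- other.counts) result.counts.update(k, result.count(k) + v)
    result
  }
}
```

Lookups and updates are expected constant time per category, which is the kind of access a decision tree needs when it tallies class labels.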




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705347#comment-14705347
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37556566
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/MLUtils.scala ---
@@ -119,4 +123,65 @@ object MLUtils {
 
 stringRepresentation.writeAsText(filePath)
   }
+
+  /** Create a [[ContinuousHistogram]] from the input data
+*
+* @param bins Number of bins required
+* @param data input [[DataSet]] of [[Double]]
+* @return [[ContinuousHistogram]] over the data
+*/
+  def createContinuousHistogram(data: DataSet[Double], bins: Int): DataSet[ContinuousHistogram] = {
+val min = data.reduce((x, y) => Math.min(x, y))
+val max = data.reduce((x, y) => Math.max(x, y))
+
+val stats = min.mapWithBcVariable(max) {
+  (minimum, maximum) => (minimum - 2 * (maximum - minimum), maximum + 2 * (maximum - minimum))
+}
+
+data.mapPartition(new RichMapPartitionFunction[Double, ContinuousHistogram] {
+  var statistics: (Double, Double) = _
+
+  override def open(configuration: Configuration): Unit = {
+statistics = getRuntimeContext.getBroadcastVariable(HISTOGRAM_STATS).get(0)
+val minimum = statistics._1
+val maximum = statistics._2
+statistics = (minimum - 2 * (maximum - minimum), maximum + 2 * (maximum - minimum))
+  }
+
+  override def mapPartition(
+  values: java.lang.Iterable[Double],
+  out: Collector[ContinuousHistogram])
+: Unit = {
+val localHistogram = new ContinuousHistogram(bins, statistics._1, statistics._2)
+val iterator = values.iterator()
+while (iterator.hasNext) {
+  localHistogram.add(iterator.next())
+}
+out.collect(localHistogram)
+  }
+})
+  .withBroadcastSet(stats, HISTOGRAM_STATS)
+  .reduce((x, y) => x.merge(y, bins))
+  }
+
+  /** Create a [[DiscreteHistogram]] from the input data
+*
+* @param data input [[DataSet]] of [[Double]]
+* @return [[DiscreteHistogram]] over the data
+*/
+  def createDiscreteHistogram(data: DataSet[Double]): DataSet[DiscreteHistogram] = {
+data.mapPartition(new RichMapPartitionFunction[Double, DiscreteHistogram] {
--- End diff --

Yeah. Already done this. Will push with the rest of the changes.




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705377#comment-14705377
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37557797
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+*
+* @param p value to be added
+*/
+  override def add(p: Double): Unit = {
+require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+// search for the index where the value is just higher than p
+val search = find(p)
+// add the new value there, shifting everything to the right
+data.insert(search, (p, 1))
+// If we're over capacity or any two elements are within 1e-9 of each other, merge.
+// This will take care of the case if p was actually equal to some value in the histogram and
+// just increment the value there
+mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+*
+* @param h histogram to be merged
+* @param B capacity of the resultant histogram
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+h match {
+  case temp: ContinuousHistogram => {
+val m: Int = bins
+val n: Int = temp.bins
+var i, j: Int = 0
+val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+while (i < m || j < n) {
+  if (i >= m) {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  } else if (j >= n || getValue(i) <= temp.getValue(j)) {
+mergeList += data.apply(i)
+i = i + 1
+  } else {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  }
+}
+// the size will be brought to capacity while constructing the new histogram itself
+val finalLower = Math.min(lower, temp.lower)
+val finalUpper = Math.max(upper, temp.upper)
+val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+ret.loadData(mergeList.toArray)
--- End diff --

Yes. I'm working on this. The best way to optimize this would be to change 
the signature and have `merge` actually modify the current histogram. This 
would be minimally expensive in terms of copying the data, but it comes at the 
cost of having to duplicate the quantities `min`, `max` and `capacity`.
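
As a rough illustration of that trade-off, here is a minimal sketch of an in-place merge on a simplified histogram. `SketchHistogram`, `mergeInPlace` and `shrinkToCapacity` are hypothetical names and not the code in this pull request; the sketch assumes bins are `(value, count)` pairs kept sorted by value, and it still uses one temporary buffer for the merged bins.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical, simplified histogram used only for this sketch.
class SketchHistogram(var capacity: Int, var lower: Double, var upper: Double) {
  // Bins as (value, count) pairs, kept sorted by value.
  val data: ArrayBuffer[(Double, Int)] = ArrayBuffer.empty[(Double, Int)]

  /** Merges `other` into this histogram and resizes it to `newCapacity` bins. */
  def mergeInPlace(other: SketchHistogram, newCapacity: Int): Unit = {
    require(newCapacity > 0, "Capacity should be a positive integer")
    // Two-pointer merge of the two sorted bin lists into one temporary buffer.
    val merged = ArrayBuffer.empty[(Double, Int)]
    var i = 0
    var j = 0
    while (i < data.length || j < other.data.length) {
      val takeOwn = j >= other.data.length ||
        (i < data.length && data(i)._1 <= other.data(j)._1)
      if (takeOwn) {
        merged += data(i)
        i += 1
      } else {
        merged += other.data(j)
        j += 1
      }
    }
    data.clear()
    data ++= merged
    // The bounds and capacity change with the merge, so they are mutable state here.
    capacity = newCapacity
    lower = math.min(lower, other.lower)
    upper = math.max(upper, other.upper)
    shrinkToCapacity()
  }

  // Fuses the two closest adjacent bins (weighted mean, summed count) until the size fits.
  private def shrinkToCapacity(): Unit = {
    while (data.length > capacity) {
      var best = 0
      for (k <- 1 until data.length - 1) {
        if (data(k + 1)._1 - data(k)._1 < data(best + 1)._1 - data(best)._1) best = k
      }
      val (v1, c1) = data(best)
      val (v2, c2) = data(best + 1)
      val fused = ((v1 * c1 + v2 * c2) / (c1 + c2), c1 + c2)
      data(best) = fused
      data.remove(best + 1)
    }
  }
}
```

The `var` constructor parameters are where the duplication cost shows up: the bounds and capacity can no longer be immutable constructor arguments once `merge` mutates the receiver.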



[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705348#comment-14705348
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37556594
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+*
+* @param p value to be added
+*/
+  override def add(p: Double): Unit = {
+require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+// search for the index where the value is just higher than p
+val search = find(p)
+// add the new value there, shifting everything to the right
+data.insert(search, (p, 1))
+// If we're over capacity or any two elements are within 1e-9 of each other, merge.
+// This will take care of the case if p was actually equal to some value in the histogram and
+// just increment the value there
+mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+*
+* @param h histogram to be merged
+* @param B capacity of the resultant histogram
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+h match {
+  case temp: ContinuousHistogram => {
+val m: Int = bins
+val n: Int = temp.bins
+var i, j: Int = 0
+val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+while (i < m || j < n) {
+  if (i >= m) {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  } else if (j >= n || getValue(i) <= temp.getValue(j)) {
+mergeList += data.apply(i)
+i = i + 1
+  } else {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  }
+}
+// the size will be brought to capacity while constructing the new histogram itself
+val finalLower = Math.min(lower, temp.lower)
+val finalUpper = Math.max(upper, temp.upper)
+val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+ret.loadData(mergeList.toArray)
+ret
+  }
+  case default =>
+throw new RuntimeException("Only a continuous histogram is allowed to be merged with a " +
+  "continuous histogram")
+
+}
+  }
+
+  /** Returns the qth quantile of the histogram
+*
+* @param q Quantile value in (0,1)
+* @return Value at quantile q
+*/
+  def quantile(q: Double): Double = {
+require(bins > 0, "Histogram is empty")
--- End diff --

Yeah. That should work.


 

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705362#comment-14705362
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37557222
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+*
+* @param p value to be added
+*/
+  override def add(p: Double): Unit = {
+require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+// search for the index where the value is just higher than p
+val search = find(p)
+// add the new value there, shifting everything to the right
+data.insert(search, (p, 1))
+// If we're over capacity or any two elements are within 1e-9 of each other, merge.
+// This will take care of the case if p was actually equal to some value in the histogram and
+// just increment the value there
+mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+*
+* @param h histogram to be merged
+* @param B capacity of the resultant histogram
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+h match {
+  case temp: ContinuousHistogram => {
+val m: Int = bins
+val n: Int = temp.bins
+var i, j: Int = 0
+val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+while (i < m || j < n) {
+  if (i >= m) {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  } else if (j >= n || getValue(i) <= temp.getValue(j)) {
+mergeList += data.apply(i)
+i = i + 1
+  } else {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  }
+}
+// the size will be brought to capacity while constructing the new histogram itself
+val finalLower = Math.min(lower, temp.lower)
+val finalUpper = Math.max(upper, temp.upper)
+val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+ret.loadData(mergeList.toArray)
--- End diff --

??? Why don't you simply create the new ContinuousHistogram first, operate 
directly on its data field, and then ensure that everything is merged correctly?



[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705383#comment-14705383
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37558043
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
--- End diff --

Have you read the JavaDocs of a `TreeSet`? In the first line it's written:

> The elements are ordered using their natural ordering, or by a Comparator 
> provided at set creation time, depending on which constructor is used
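
For reference, a minimal sketch of that constructor in use from Scala, assuming bins are `(value, count)` tuples ordered by value only. `TreeSetSketch`, `byValue`, `bins` and `added` are hypothetical names for illustration; `java.util.TreeSet` and `java.util.Comparator` are the standard JDK classes.

```scala
import java.util.{Comparator, TreeSet}

object TreeSetSketch {
  // Order (value, count) bins by value only; the count does not affect ordering.
  val byValue: Comparator[(Double, Int)] = new Comparator[(Double, Int)] {
    override def compare(a: (Double, Int), b: (Double, Int)): Int =
      java.lang.Double.compare(a._1, b._1)
  }

  // The comparator is supplied at set creation time.
  val bins = new TreeSet[(Double, Int)](byValue)
  bins.add((3.0, 1))
  bins.add((1.0, 2))
  bins.add((2.0, 1))

  // A set keeps one element per comparator key, so a second bin at 1.0 is rejected.
  val added: Boolean = bins.add((1.0, 5))
}
```

One thing to keep in mind with this approach: because the comparator treats equal values as the same element, incrementing a counter means removing the old tuple and inserting an updated one.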




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705385#comment-14705385
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37558122
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+*
+* @param p value to be added
+*/
+  override def add(p: Double): Unit = {
+require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+// search for the index where the value is just higher than p
+val search = find(p)
+// add the new value there, shifting everything to the right
+data.insert(search, (p, 1))
+// If we're over capacity or any two elements are within 1e-9 of each other, merge.
+// This will take care of the case if p was actually equal to some value in the histogram and
+// just increment the value there
+mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+*
+* @param h histogram to be merged
+* @param B capacity of the resultant histogram
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+h match {
+  case temp: ContinuousHistogram => {
+val m: Int = bins
+val n: Int = temp.bins
+var i, j: Int = 0
+val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+while (i < m || j < n) {
+  if (i >= m) {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  } else if (j >= n || getValue(i) <= temp.getValue(j)) {
+mergeList += data.apply(i)
+i = i + 1
+  } else {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  }
+}
+// the size will be brought to capacity while constructing the new histogram itself
+val finalLower = Math.min(lower, temp.lower)
+val finalUpper = Math.max(upper, temp.upper)
+val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+ret.loadData(mergeList.toArray)
+ret
+  }
+  case default =>
+throw new RuntimeException("Only a continuous histogram is allowed to be merged with a " +
+  "continuous histogram")
+
+}
+  }
+
+  /** Returns the qth quantile of the histogram
+*
+* @param q Quantile value in (0,1)
+* @return Value at quantile q
+*/
+  def quantile(q: Double): Double = {
+require(bins > 0, "Histogram is empty")
+require(q > 0 && q < 1, "Quantile must be between 0 and 1")

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705354#comment-14705354
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37556791
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+*
+* @param p value to be added
+*/
+  override def add(p: Double): Unit = {
+require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+// search for the index where the value is just higher than p
+val search = find(p)
+// add the new value there, shifting everything to the right
+data.insert(search, (p, 1))
+// If we're over capacity or any two elements are within 1e-9 of each other, merge.
+// This will take care of the case if p was actually equal to some value in the histogram and
+// just increment the value there
+mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+*
+* @param h histogram to be merged
+* @param B capacity of the resultant histogram
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+h match {
+  case temp: ContinuousHistogram => {
+val m: Int = bins
+val n: Int = temp.bins
+var i, j: Int = 0
+val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+while (i < m || j < n) {
+  if (i >= m) {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  } else if (j >= n || getValue(i) <= temp.getValue(j)) {
+mergeList += data.apply(i)
+i = i + 1
+  } else {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  }
+}
+// the size will be brought to capacity while constructing the new histogram itself
+val finalLower = Math.min(lower, temp.lower)
+val finalUpper = Math.max(upper, temp.upper)
+val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+ret.loadData(mergeList.toArray)
+ret
+  }
+  case default =>
+throw new RuntimeException("Only a continuous histogram is allowed to be merged with a " +
+  "continuous histogram")
+
+}
+  }
+
+  /** Returns the qth quantile of the histogram
+*
+* @param q Quantile value in (0,1)
+* @return Value at quantile q
+*/
+  def quantile(q: Double): Double = {
+require(bins > 0, "Histogram is empty")
+require(q > 0 && q < 1, "Quantile must be between 0 and 1")
  

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705353#comment-14705353
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37556789
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+    *
+    * @param p value to be added
+    */
+  override def add(p: Double): Unit = {
+    require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+    // search for the index where the value is just higher than p
+    val search = find(p)
+    // add the new value there, shifting everything to the right
+    data.insert(search, (p, 1))
+    // If we're over capacity or any two elements are within 1e-9 of each other, merge.
+    // This will take care of the case if p was actually equal to some value in the histogram and
+    // just increment the value there
+    mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+    *
+    * @param h histogram to be merged
+    * @param B capacity of the resultant histogram
+    * @return Merged histogram with capacity B
+    */
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+    h match {
+      case temp: ContinuousHistogram => {
+        val m: Int = bins
+        val n: Int = temp.bins
+        var i, j: Int = 0
+        val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+        while (i < m || j < n) {
+          if (i >= m) {
+            mergeList += ((temp.getValue(j), temp.getCounter(j)))
+            j = j + 1
+          } else if (j >= n || getValue(i) <= temp.getValue(j)) {
+            mergeList += data.apply(i)
+            i = i + 1
+          } else {
+            mergeList += ((temp.getValue(j), temp.getCounter(j)))
+            j = j + 1
+          }
+        }
+        // the size will be brought to capacity while constructing the new histogram itself
+        val finalLower = Math.min(lower, temp.lower)
+        val finalUpper = Math.max(upper, temp.upper)
+        val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+        ret.loadData(mergeList.toArray)
+        ret
+      }
+      case default =>
+        throw new RuntimeException("Only a continuous histogram is allowed to be merged with a " +
+          "continuous histogram")
+
+    }
+  }
+
+  /** Returns the qth quantile of the histogram
+    *
+    * @param q Quantile value in (0,1)
+    * @return Value at quantile q
+    */
+  def quantile(q: Double): Double = {
+    require(bins > 0, "Histogram is empty")
+    require(q > 0 && q < 1, "Quantile must be between 0 and 1")

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705352#comment-14705352
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37556758
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+    *
+    * @param p value to be added
+    */
+  override def add(p: Double): Unit = {
+    require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+    // search for the index where the value is just higher than p
+    val search = find(p)
+    // add the new value there, shifting everything to the right
+    data.insert(search, (p, 1))
+    // If we're over capacity or any two elements are within 1e-9 of each other, merge.
+    // This will take care of the case if p was actually equal to some value in the histogram and
+    // just increment the value there
+    mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+    *
+    * @param h histogram to be merged
+    * @param B capacity of the resultant histogram
+    * @return Merged histogram with capacity B
+    */
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+    h match {
+      case temp: ContinuousHistogram => {
+        val m: Int = bins
+        val n: Int = temp.bins
+        var i, j: Int = 0
+        val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+        while (i < m || j < n) {
+          if (i >= m) {
+            mergeList += ((temp.getValue(j), temp.getCounter(j)))
+            j = j + 1
+          } else if (j >= n || getValue(i) <= temp.getValue(j)) {
+            mergeList += data.apply(i)
+            i = i + 1
+          } else {
+            mergeList += ((temp.getValue(j), temp.getCounter(j)))
+            j = j + 1
+          }
+        }
+        // the size will be brought to capacity while constructing the new histogram itself
+        val finalLower = Math.min(lower, temp.lower)
+        val finalUpper = Math.max(upper, temp.upper)
+        val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+        ret.loadData(mergeList.toArray)
+        ret
+      }
+      case default =>
+        throw new RuntimeException("Only a continuous histogram is allowed to be merged with a " +
+          "continuous histogram")
+
+    }
+  }
+
+  /** Returns the qth quantile of the histogram
+    *
+    * @param q Quantile value in (0,1)
+    * @return Value at quantile q
+    */
+  def quantile(q: Double): Double = {
+    require(bins > 0, "Histogram is empty")
+    require(q > 0 && q < 1, "Quantile must be between 0 and 1")

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705387#comment-14705387
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37558286
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+    *
+    * @param p value to be added
+    */
+  override def add(p: Double): Unit = {
+    require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+    // search for the index where the value is just higher than p
+    val search = find(p)
+    // add the new value there, shifting everything to the right
+    data.insert(search, (p, 1))
+    // If we're over capacity or any two elements are within 1e-9 of each other, merge.
+    // This will take care of the case if p was actually equal to some value in the histogram and
+    // just increment the value there
+    mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+    *
+    * @param h histogram to be merged
+    * @param B capacity of the resultant histogram
+    * @return Merged histogram with capacity B
+    */
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+    h match {
+      case temp: ContinuousHistogram => {
+        val m: Int = bins
+        val n: Int = temp.bins
+        var i, j: Int = 0
+        val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+        while (i < m || j < n) {
+          if (i >= m) {
+            mergeList += ((temp.getValue(j), temp.getCounter(j)))
+            j = j + 1
+          } else if (j >= n || getValue(i) <= temp.getValue(j)) {
+            mergeList += data.apply(i)
+            i = i + 1
+          } else {
+            mergeList += ((temp.getValue(j), temp.getCounter(j)))
+            j = j + 1
+          }
+        }
+        // the size will be brought to capacity while constructing the new histogram itself
+        val finalLower = Math.min(lower, temp.lower)
+        val finalUpper = Math.max(upper, temp.upper)
+        val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+        ret.loadData(mergeList.toArray)
+        ret
+      }
+      case default =>
+        throw new RuntimeException("Only a continuous histogram is allowed to be merged with a " +
+          "continuous histogram")
+
+    }
+  }
+
+  /** Returns the qth quantile of the histogram
+    *
+    * @param q Quantile value in (0,1)
+    * @return Value at quantile q
+    */
+  def quantile(q: Double): Double = {
+    require(bins > 0, "Histogram is empty")
+    require(q > 0 && q < 1, "Quantile must be between 0 and 1")
  

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705350#comment-14705350
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37556660
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+    *
+    * @param p value to be added
+    */
+  override def add(p: Double): Unit = {
+    require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+    // search for the index where the value is just higher than p
+    val search = find(p)
+    // add the new value there, shifting everything to the right
+    data.insert(search, (p, 1))
+    // If we're over capacity or any two elements are within 1e-9 of each other, merge.
+    // This will take care of the case if p was actually equal to some value in the histogram and
+    // just increment the value there
+    mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+    *
+    * @param h histogram to be merged
+    * @param B capacity of the resultant histogram
+    * @return Merged histogram with capacity B
+    */
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+    h match {
+      case temp: ContinuousHistogram => {
+        val m: Int = bins
+        val n: Int = temp.bins
+        var i, j: Int = 0
+        val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+        while (i < m || j < n) {
+          if (i >= m) {
+            mergeList += ((temp.getValue(j), temp.getCounter(j)))
+            j = j + 1
+          } else if (j >= n || getValue(i) <= temp.getValue(j)) {
+            mergeList += data.apply(i)
+            i = i + 1
+          } else {
+            mergeList += ((temp.getValue(j), temp.getCounter(j)))
+            j = j + 1
+          }
+        }
+        // the size will be brought to capacity while constructing the new histogram itself
+        val finalLower = Math.min(lower, temp.lower)
+        val finalUpper = Math.max(upper, temp.upper)
+        val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+        ret.loadData(mergeList.toArray)
+        ret
+      }
+      case default =>
+        throw new RuntimeException("Only a continuous histogram is allowed to be merged with a " +
+          "continuous histogram")
+
+    }
+  }
+
+  /** Returns the qth quantile of the histogram
+    *
+    * @param q Quantile value in (0,1)
+    * @return Value at quantile q
+    */
+  def quantile(q: Double): Double = {
+    require(bins > 0, "Histogram is empty")
+    require(q > 0 && q < 1, "Quantile must be between 0 and 1")

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705369#comment-14705369
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37557444
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+    *
+    * @param p value to be added
+    */
+  override def add(p: Double): Unit = {
+    require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+    // search for the index where the value is just higher than p
+    val search = find(p)
+    // add the new value there, shifting everything to the right
+    data.insert(search, (p, 1))
+    // If we're over capacity or any two elements are within 1e-9 of each other, merge.
+    // This will take care of the case if p was actually equal to some value in the histogram and
+    // just increment the value there
+    mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+    *
+    * @param h histogram to be merged
+    * @param B capacity of the resultant histogram
+    * @return Merged histogram with capacity B
+    */
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+    h match {
+      case temp: ContinuousHistogram => {
+        val m: Int = bins
+        val n: Int = temp.bins
+        var i, j: Int = 0
+        val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+        while (i < m || j < n) {
+          if (i >= m) {
+            mergeList += ((temp.getValue(j), temp.getCounter(j)))
+            j = j + 1
+          } else if (j >= n || getValue(i) <= temp.getValue(j)) {
+            mergeList += data.apply(i)
+            i = i + 1
+          } else {
+            mergeList += ((temp.getValue(j), temp.getCounter(j)))
+            j = j + 1
+          }
+        }
+        // the size will be brought to capacity while constructing the new histogram itself
+        val finalLower = Math.min(lower, temp.lower)
+        val finalUpper = Math.max(upper, temp.upper)
+        val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+        ret.loadData(mergeList.toArray)
+        ret
+      }
+      case default =>
+        throw new RuntimeException("Only a continuous histogram is allowed to be merged with a " +
+          "continuous histogram")
+
+    }
+  }
+
+  /** Returns the qth quantile of the histogram
+    *
+    * @param q Quantile value in (0,1)
+    * @return Value at quantile q
+    */
+  def quantile(q: Double): Double = {
+    require(bins > 0, "Histogram is empty")
+    require(q > 0 && q < 1, "Quantile must be between 0 and 1")
  

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705367#comment-14705367
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37557389
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/DiscreteHistogram.scala
 ---
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.collection.mutable
+
+/** Implementation of a discrete valued online histogram
+  *
+  * =Parameters=
+  * -[[numCategories]]:
+  * Number of categories in the histogram
+  */
+class DiscreteHistogram(numCategories: Int) extends OnlineHistogram {
--- End diff --

Again, this is important for the decision tree work.


> Implement an online histogram with Merging and equalization features
>
> Key: FLINK-2030
> URL: https://issues.apache.org/jira/browse/FLINK-2030
> Project: Flink
> Issue Type: Sub-task
> Components: Machine Learning Library
> Reporter: Sachin Goel
> Assignee: Sachin Goel
> Priority: Minor
> Labels: ML
>
> For the implementation of the decision tree in
> https://issues.apache.org/jira/browse/FLINK-1727, we need to implement a
> histogram with online updates, merging and equalization features. A reference
> implementation is provided in [1].
> [1] http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
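
For readers skimming the thread, here is a minimal sketch of the online-update and merge steps the issue asks for, in the spirit of the referenced paper. It is not the code from the pull request: SketchHistogram, shrink and total are hypothetical names chosen for illustration, and the sketch assumes the common rule of replacing the two closest adjacent bins by their weighted mean whenever the bin count exceeds the capacity. Range checks and the equalization step are left out.

```scala
// Illustrative sketch only. SketchHistogram and its members are hypothetical
// names; this is not the ContinuousHistogram API under review in the PR.
final case class SketchHistogram(capacity: Int, bins: Vector[(Double, Int)] = Vector.empty) {
  require(capacity > 0, "Capacity should be a positive integer")

  /** Online update: add p as a bin with count 1, then shrink back to capacity. */
  def add(p: Double): SketchHistogram =
    copy(bins = SketchHistogram.shrink((bins :+ ((p, 1))).sortBy(_._1), capacity))

  /** Merge: pool the bins of both histograms, then shrink to the requested capacity. */
  def merge(other: SketchHistogram, newCapacity: Int): SketchHistogram =
    SketchHistogram(
      newCapacity,
      SketchHistogram.shrink((bins ++ other.bins).sortBy(_._1), newCapacity))

  /** Total number of points represented by the histogram. */
  def total: Int = bins.map(_._2).sum
}

object SketchHistogram {
  /** Repeatedly replaces the two closest adjacent bins by their weighted mean. */
  @scala.annotation.tailrec
  def shrink(sorted: Vector[(Double, Int)], capacity: Int): Vector[(Double, Int)] =
    if (sorted.size <= capacity) {
      sorted
    } else {
      // index i of the adjacent pair (i, i + 1) with the smallest gap
      val i = (0 until sorted.size - 1).minBy(k => sorted(k + 1)._1 - sorted(k)._1)
      val (v1, c1) = sorted(i)
      val (v2, c2) = sorted(i + 1)
      val merged = ((v1 * c1 + v2 * c2) / (c1 + c2), c1 + c2)
      shrink(sorted.patch(i, Seq(merged), 2), capacity)
    }
}
```

The sketch uses an immutable Vector for brevity, so every update re-sorts the bins; the diff under review instead keeps a mutable ArrayBuffer and inserts at the position returned by find.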



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705371#comment-14705371
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37557490
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+    *
+    * @param p value to be added
+    */
+  override def add(p: Double): Unit = {
+    require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+    // search for the index where the value is just higher than p
+    val search = find(p)
+    // add the new value there, shifting everything to the right
+    data.insert(search, (p, 1))
+    // If we're over capacity or any two elements are within 1e-9 of each other, merge.
+    // This will take care of the case if p was actually equal to some value in the histogram and
+    // just increment the value there
+    mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+    *
+    * @param h histogram to be merged
+    * @param B capacity of the resultant histogram
+    * @return Merged histogram with capacity B
+    */
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+    h match {
+      case temp: ContinuousHistogram => {
+        val m: Int = bins
+        val n: Int = temp.bins
+        var i, j: Int = 0
+        val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+        while (i < m || j < n) {
+          if (i >= m) {
+            mergeList += ((temp.getValue(j), temp.getCounter(j)))
+            j = j + 1
+          } else if (j >= n || getValue(i) <= temp.getValue(j)) {
+            mergeList += data.apply(i)
+            i = i + 1
+          } else {
+            mergeList += ((temp.getValue(j), temp.getCounter(j)))
+            j = j + 1
+          }
+        }
+        // the size will be brought to capacity while constructing the new histogram itself
+        val finalLower = Math.min(lower, temp.lower)
+        val finalUpper = Math.max(upper, temp.upper)
+        val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+        ret.loadData(mergeList.toArray)
+        ret
+      }
+      case default =>
+        throw new RuntimeException("Only a continuous histogram is allowed to be merged with a " +
+          "continuous histogram")
+
+    }
+  }
+
+  /** Returns the qth quantile of the histogram
+    *
+    * @param q Quantile value in (0,1)
+    * @return Value at quantile q
+    */
+  def quantile(q: Double): Double = {
+    require(bins > 0, "Histogram is empty")
+    require(q > 0 && q < 1, "Quantile must be between 0 and 1")
  

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705423#comment-14705423
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37560144
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+    *
+    * @param p value to be added
+    */
+  override def add(p: Double): Unit = {
+    require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+    // search for the index where the value is just higher than p
+    val search = find(p)
+    // add the new value there, shifting everything to the right
+    data.insert(search, (p, 1))
+    // If we're over capacity or any two elements are within 1e-9 of each other, merge.
+    // This will take care of the case if p was actually equal to some value in the histogram and
+    // just increment the value there
+    mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+    *
+    * @param h histogram to be merged
+    * @param B capacity of the resultant histogram
+    * @return Merged histogram with capacity B
+    */
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+    h match {
+      case temp: ContinuousHistogram => {
+        val m: Int = bins
+        val n: Int = temp.bins
+        var i, j: Int = 0
+        val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+        while (i < m || j < n) {
+          if (i >= m) {
+            mergeList += ((temp.getValue(j), temp.getCounter(j)))
+            j = j + 1
+          } else if (j >= n || getValue(i) <= temp.getValue(j)) {
+            mergeList += data.apply(i)
+            i = i + 1
+          } else {
+            mergeList += ((temp.getValue(j), temp.getCounter(j)))
+            j = j + 1
+          }
+        }
+        // the size will be brought to capacity while constructing the new histogram itself
+        val finalLower = Math.min(lower, temp.lower)
+        val finalUpper = Math.max(upper, temp.upper)
+        val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+        ret.loadData(mergeList.toArray)
+        ret
+      }
+      case default =>
+        throw new RuntimeException("Only a continuous histogram is allowed to be merged with a " +
+          "continuous histogram")
+
+    }
+  }
+
+  /** Returns the qth quantile of the histogram
+    *
+    * @param q Quantile value in (0,1)
+    * @return Value at quantile q
+    */
+  def quantile(q: Double): Double = {
+    require(bins > 0, "Histogram is empty")
+    require(q > 0 && q < 1, "Quantile must be between 0 and 1")

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705360#comment-14705360
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37557013
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/DiscreteHistogram.scala
 ---
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.collection.mutable
+
+/** Implementation of a discrete valued online histogram
+  *
+  * =Parameters=
+  * -[[numCategories]]:
+  * Number of categories in the histogram
+  */
+class DiscreteHistogram(numCategories: Int) extends OnlineHistogram {
+
+  require(numCategories > 0, "Capacity must be greater than zero")
+  val data = new mutable.HashMap[Double, Int]()
+
+  /** Number of categories in the histogram
+    *
+    * @return number of categories
+    */
+  override def bins: Int = {
+    numCategories
+  }
+
+  /** Increment count of category c
+    *
+    * @param c category whose count needs to be incremented
+    */
+  override def add(c: Double): Unit = {
+    data.get(c) match {
+      case None =>
+        require(data.size < numCategories, "Insufficient capacity. Failed to add.")
+        data.put(c, 1)
+      case Some(value) =>
+        data.update(c, value + 1)
+    }
+  }
+
+  /** Merges the histogram with h and returns a new histogram
+    *
+    * @param h histogram to be merged
+    * @param B number of categories in the resultant histogram.
+    *          (Default: ```0```, number of categories will be the size of union of categories in
+    *          both histograms)
+    * @return Merged histogram with capacity B
+    */
+  override def merge(h: OnlineHistogram, B: Int = 0): DiscreteHistogram = {
+    h match {
+      case h1: DiscreteHistogram => {
+        val finalMap = new mutable.HashMap[Double, Int]()
+        data.iterator.foreach(x => finalMap.put(x._1, x._2))
+        h1.data.iterator.foreach(x => {
+          finalMap.get(x._1) match {
+            case None => finalMap.put(x._1, x._2)
+            case Some(value) => finalMap.update(x._1, x._2 + value)
+          }
+        })
+        require(B == 0 || finalMap.size <= B, "Insufficient capacity. Failed to merge")
--- End diff --

This should maybe be documented somewhere.




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705358#comment-14705358
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37556917
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/DiscreteHistogram.scala
 ---
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.collection.mutable
+
+/** Implementation of a discrete valued online histogram
+  *
+  * =Parameters=
+  * -[[numCategories]]:
+  * Number of categories in the histogram
+  */
+class DiscreteHistogram(numCategories: Int) extends OnlineHistogram {
--- End diff --

But they only share the concept of an online histogram. In practice, why do 
they have to inherit from the same trait?




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705364#comment-14705364
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37557297
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/DiscreteHistogram.scala
 ---
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.collection.mutable
+
+/** Implementation of a discrete valued online histogram
+  *
+  * =Parameters=
+  * -[[numCategories]]:
+  * Number of categories in the histogram
+  */
+class DiscreteHistogram(numCategories: Int) extends OnlineHistogram {
--- End diff --

For a data set of vectors, we need to construct histograms over all 
dimensions. If the fields are not all continuous or all discrete, we need a 
common parent class.
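
To make the argument concrete, here is a minimal sketch of keeping one histogram per dimension in a single collection. All names below (`FieldHistogram`, `CategoryCounts`, `SortedValues`, `MixedFields`) are hypothetical stand-ins invented for illustration, not the types in this pull request, and the "continuous" class only keeps values sorted instead of binning them.

```scala
import scala.collection.mutable

// Hypothetical common parent; stands in for a shared histogram trait.
trait FieldHistogram {
  def add(value: Double): Unit
  def total: Int
}

// Stand-in for a discrete histogram: one counter per category.
final class CategoryCounts extends FieldHistogram {
  private val counts = mutable.HashMap.empty[Double, Int]
  def add(value: Double): Unit = counts.update(value, counts.getOrElse(value, 0) + 1)
  def total: Int = counts.values.sum
}

// Stand-in for a continuous histogram: it only keeps the observed values sorted.
final class SortedValues extends FieldHistogram {
  private val values = mutable.ArrayBuffer.empty[Double]
  def add(value: Double): Unit = {
    val idx = values.indexWhere(_ >= value)
    values.insert(if (idx < 0) values.length else idx, value)
  }
  def total: Int = values.length
}

object MixedFields {
  // Assumption for this sketch: dimension 0 is categorical, dimension 1 is continuous.
  def build(rows: Seq[Array[Double]]): Vector[FieldHistogram] = {
    val hists: Vector[FieldHistogram] = Vector(new CategoryCounts, new SortedValues)
    for (row <- rows; (h, x) <- hists.zip(row)) h.add(x)
    hists
  }
}
```

Without the shared parent, the `Vector` above could not hold both kinds of histogram under one element type.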




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705438#comment-14705438
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-133099053
  
Lemme optimize things and re-write the docs. I'll push a patch tomorrow. 
Sorry about this.




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704791#comment-14704791
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-132996034
  
I'm currently reviewing it as well @chiwanpark. Please give me some more 
minutes :-)




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704839#comment-14704839
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/861#issuecomment-133002882
  
@tillrohrmann Sure. :)




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704863#comment-14704863
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37527518
  
--- Diff: docs/libs/ml/statistics.md ---
@@ -0,0 +1,69 @@
+---
+mathjax: include
+htmlTitle: FlinkML - Statistics
+title: <a href="../ml">FlinkML</a> - Statistics
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+ The statistics utility provides features such as building histograms over 
data.
+
+## Methods
+
+ The Statistics utility provides two major functions: `createHistogram` and
+ `createDiscreteHistogram`.
+
+### Creating a histogram
+
+ There are two types of histograms:
+   1. **Continuous Histograms**: These histograms are formed on a data set 
`X: DataSet[Double]`
+   when the values in `X` are from a continuous range. These histograms 
support
+   `quantile` and `sum`  operations. Here `quantile(q)` refers to a value 
$x_q$ such that $|x: x
+   \leq x_q| = q * |X|$. Further, `sum(s)` refers to the number of 
elements $x \leq s$, which can
+be construed as a cumulative probability value at $s$[Of course, 
*scaled* probability].
+   2. A continuous histogram can be formed by calling 
`X.createHistogram(b)` where `b` is the
+number of bins.
+**Discrete Histograms**: These histograms are formed on a data set 
`X:DataSet[Double]`
+when the values in `X` are from a discrete distribution. These 
histograms
+support `count(c)` operation which returns the number of elements 
associated with cateogry `c`.
+<br>
--- End diff --

HTML tags should be replaced by Markdown syntax.




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704879#comment-14704879
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37528494
  
--- Diff: docs/libs/ml/statistics.md ---
@@ -0,0 +1,69 @@
+---
+mathjax: include
+htmlTitle: FlinkML - Statistics
+title: <a href="../ml">FlinkML</a> - Statistics
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+ The statistics utility provides features such as building histograms over 
data.
+
+## Methods
+
+ The Statistics utility provides two major functions: `createHistogram` and
+ `createDiscreteHistogram`.
+
+### Creating a histogram
+
+ There are two types of histograms:
+   1. **Continuous Histograms**: These histograms are formed on a data set 
`X: DataSet[Double]`
+   when the values in `X` are from a continuous range. These histograms 
support
+   `quantile` and `sum`  operations. Here `quantile(q)` refers to a value 
$x_q$ such that $|x: x
+   \leq x_q| = q * |X|$. Further, `sum(s)` refers to the number of 
elements $x \leq s$, which can
+be construed as a cumulative probability value at $s$[Of course, 
*scaled* probability].
+   2. A continuous histogram can be formed by calling 
`X.createHistogram(b)` where `b` is the
--- End diff --

The number of bins is the maximum number of values we are allowed to store. 
Since we are approximating a continuous distribution, we cannot store all the 
numbers. So every number, when it arrives, updates the bin values so that the 
histogram approximates the data, including the new number, as well as it can.
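
As a rough illustration of that update, here is a simplified sketch in the spirit of the Ben-Haim and Yom-Tov histogram referenced in the issue. It is not the code in this pull request: `BoundedHistogram`, `mergeClosest` and `snapshot` are hypothetical names, and range checks and the near-duplicate merge are left out.

```scala
import scala.collection.mutable

// Hypothetical, simplified sketch; names are not from the pull request.
final class BoundedHistogram(capacity: Int) {
  require(capacity > 0, "capacity must be positive")

  // (centroid, count) pairs, kept sorted by centroid
  private val bins = mutable.ArrayBuffer.empty[(Double, Int)]

  def add(p: Double): Unit = {
    // index of the first centroid that is not smaller than p
    val idx = bins.indexWhere(_._1 >= p)
    bins.insert(if (idx < 0) bins.length else idx, (p, 1))
    // only `capacity` values may be stored, so fold two neighbours together
    if (bins.length > capacity) mergeClosest()
  }

  // Replace the two adjacent bins with the smallest gap by their weighted mean.
  private def mergeClosest(): Unit = {
    var best = 0
    for (k <- 1 until bins.length - 1) {
      if (bins(k + 1)._1 - bins(k)._1 < bins(best + 1)._1 - bins(best)._1) best = k
    }
    val (v1, c1) = bins(best)
    val (v2, c2) = bins(best + 1)
    bins(best) = ((v1 * c1 + v2 * c2) / (c1 + c2), c1 + c2)
    bins.remove(best + 1)
  }

  def snapshot: List[(Double, Int)] = bins.toList
}
```

With a capacity of 2, adding 1.0, 2.0 and 10.0 leaves the bins `(1.5, 2)` and `(10.0, 1)`: the two closest values are replaced by their weighted mean.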




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704898#comment-14704898
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37529318
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
--- End diff --

Sure. :)




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704917#comment-14704917
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37530184
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+    *
+    * @param p value to be added
+    */
+  override def add(p: Double): Unit = {
+    require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+    // search for the index where the value is just higher than p
+    val search = find(p)
+    // add the new value there, shifting everything to the right
+    data.insert(search, (p, 1))
+    // If we're over capacity or any two elements are within 1e-9 of each other, merge.
+    // This will take care of the case if p was actually equal to some value in the histogram and
+    // just increment the value there
+    mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+    *
+    * @param h histogram to be merged
+    * @param B capacity of the resultant histogram
+    * @return Merged histogram with capacity B
+    */
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+    h match {
+      case temp: ContinuousHistogram => {
--- End diff --

Could we give the other histogram a more meaningful name than `temp`?




[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704927#comment-14704927
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37531273
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+    *
+    * @param p value to be added
+    */
+  override def add(p: Double): Unit = {
+    require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+    // search for the index where the value is just higher than p
+    val search = find(p)
+    // add the new value there, shifting everything to the right
+    data.insert(search, (p, 1))
+    // If we're over capacity or any two elements are within 1e-9 of each other, merge.
+    // This will take care of the case if p was actually equal to some value in the histogram and
+    // just increment the value there
+    mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+    *
+    * @param h histogram to be merged
+    * @param B capacity of the resultant histogram
+    * @return Merged histogram with capacity B
+    */
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+    h match {
+      case temp: ContinuousHistogram => {
+        val m: Int = bins
+        val n: Int = temp.bins
+        var i, j: Int = 0
+        val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+        while (i < m || j < n) {
+          if (i >= m) {
+            mergeList += ((temp.getValue(j), temp.getCounter(j)))
+            j = j + 1
+          } else if (j >= n || getValue(i) <= temp.getValue(j)) {
+            mergeList += data.apply(i)
+            i = i + 1
+          } else {
+            mergeList += ((temp.getValue(j), temp.getCounter(j)))
+            j = j + 1
+          }
+        }
+        // the size will be brought to capacity while constructing the new histogram itself
+        val finalLower = Math.min(lower, temp.lower)
+        val finalUpper = Math.max(upper, temp.upper)
+        val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+        ret.loadData(mergeList.toArray)
+        ret
+      }
+      case default =>
+        throw new RuntimeException("Only a continuous histogram is allowed to be merged with a " +
+          "continuous histogram")
+
+    }
+  }
+
+  /** Returns the qth quantile of the histogram
+    *
+    * @param q Quantile value in (0,1)
+    * @return Value at quantile q
+    */
+  def quantile(q: Double): Double = {
+    require(bins > 0, "Histogram is empty")
--- End diff --

This will throw an 

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704929#comment-14704929
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37531538
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/OnlineHistogram.scala
 ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+/** Base trait for an Online Histogram
--- End diff --

An online histogram approximates the distribution of a data set. For 
discrete-valued data, we store a counter for every class. For continuous 
data, we learn the distribution as more and more elements arrive.

It is online in the sense that we don't need the whole data set to build 
it. It is built incrementally, and the histograms of two parts of the data 
set can be merged to provide statistics for the combined set.
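
To make the discrete case concrete, here is a minimal sketch of per-class counters that are built incrementally and then merged. The names `CategoryCounts` and `OnlineHistogramSketch` are hypothetical and chosen only for illustration; they are not the API proposed in the pull request.

```scala
// Hypothetical sketch: these names are illustrative, not the PR's API.
final class CategoryCounts(private val counts: Map[Double, Int] = Map.empty) {

  /** Returns a new histogram with the counter of category c incremented. */
  def add(c: Double): CategoryCounts =
    new CategoryCounts(counts.updated(c, counts.getOrElse(c, 0) + 1))

  /** Combines two histograms by summing the counters of every category. */
  def merge(other: CategoryCounts): CategoryCounts = {
    val keys = counts.keySet ++ other.counts.keySet
    new CategoryCounts(keys.map(k => k -> (count(k) + other.count(k))).toMap)
  }

  /** Counter of category c, or 0 if c was never seen. */
  def count(c: Double): Int = counts.getOrElse(c, 0)
}

object OnlineHistogramSketch {
  /** Builds a histogram one element at a time, as an online update would. */
  def build(xs: Seq[Double]): CategoryCounts =
    xs.foldLeft(new CategoryCounts())((h, x) => h.add(x))
}
```

Building one histogram per part of the data and merging them should give the same counters as building a single histogram over the whole set.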


> Implement an online histogram with Merging and equalization features
>
> Key: FLINK-2030
> URL: https://issues.apache.org/jira/browse/FLINK-2030
> Project: Flink
> Issue Type: Sub-task
> Components: Machine Learning Library
> Reporter: Sachin Goel
> Assignee: Sachin Goel
> Priority: Minor
> Labels: ML
>
> For the implementation of the decision tree in 
> https://issues.apache.org/jira/browse/FLINK-1727, we need to implement a 
> histogram with online updates, merging and equalization features. A reference 
> implementation is provided in [1].
> [1] http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704943#comment-14704943
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37532474
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+    *
+    * @param p value to be added
+    */
+  override def add(p: Double): Unit = {
+    require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+    // search for the index where the value is just higher than p
+    val search = find(p)
+    // add the new value there, shifting everything to the right
+    data.insert(search, (p, 1))
+    // If we're over capacity or any two elements are within 1e-9 of each other, merge.
+    // This will take care of the case if p was actually equal to some value in the histogram and
+    // just increment the value there
+    mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+    *
+    * @param h histogram to be merged
+    * @param B capacity of the resultant histogram
+    * @return Merged histogram with capacity B
+    */
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+    h match {
+      case temp: ContinuousHistogram => {
+        val m: Int = bins
+        val n: Int = temp.bins
+        var i, j: Int = 0
+        val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+        while (i < m || j < n) {
+          if (i >= m) {
+            mergeList += ((temp.getValue(j), temp.getCounter(j)))
+            j = j + 1
+          } else if (j >= n || getValue(i) <= temp.getValue(j)) {
+            mergeList += data.apply(i)
+            i = i + 1
+          } else {
+            mergeList += ((temp.getValue(j), temp.getCounter(j)))
+            j = j + 1
+          }
+        }
+        // the size will be brought to capacity while constructing the new histogram itself
+        val finalLower = Math.min(lower, temp.lower)
+        val finalUpper = Math.max(upper, temp.upper)
+        val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+        ret.loadData(mergeList.toArray)
+        ret
+      }
+      case default =>
+        throw new RuntimeException("Only a continuous histogram is allowed to be merged with a " +
+          "continuous histogram")
+
+    }
+  }
+
+  /** Returns the qth quantile of the histogram
+    *
+    * @param q Quantile value in (0,1)
+    * @return Value at quantile q
+    */
+  def quantile(q: Double): Double = {
+    require(bins > 0, "Histogram is empty")
+    require(q > 0 && q < 1, "Quantile must be between 0 and 1")
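
The merge step in the diff above can be sketched on its own, outside the class. This is a sketch under stated assumptions: `BinMergeSketch`, `mergeSorted` and `shrink` are hypothetical names, and the shrink rule (fuse the two closest adjacent bins into their count-weighted mean) follows the Ben-Haim and Yom-Tov paper cited in the Scaladoc. The PR's own `mergeElements` and `loadData` are not shown in this thread and may differ.

```scala
// Hypothetical sketch; not the code under review.
object BinMergeSketch {
  type Bin = (Double, Int) // (bin value, number of points in the bin)

  /** Two-pointer merge of two bin lists that are each sorted by value. */
  def mergeSorted(a: Vector[Bin], b: Vector[Bin]): Vector[Bin] = {
    val out = Vector.newBuilder[Bin]
    var i = 0
    var j = 0
    while (i < a.length || j < b.length) {
      // take from a when b is exhausted, or when a's next value is not larger
      if (j >= b.length || (i < a.length && a(i)._1 <= b(j)._1)) {
        out += a(i)
        i += 1
      } else {
        out += b(j)
        j += 1
      }
    }
    out.result()
  }

  /** Fuses the two closest adjacent bins until at most `capacity` bins remain. */
  def shrink(bins: Vector[Bin], capacity: Int): Vector[Bin] = {
    require(capacity > 0, "capacity must be positive")
    var cur = bins
    while (cur.length > capacity) {
      // index of the left bin of the closest adjacent pair
      val k = (0 until cur.length - 1).minBy(idx => cur(idx + 1)._1 - cur(idx)._1)
      val (v1, c1) = cur(k)
      val (v2, c2) = cur(k + 1)
      // the fused bin sits at the count-weighted mean and carries both counts
      val fused: Bin = ((v1 * c1 + v2 * c2) / (c1 + c2), c1 + c2)
      cur = (cur.take(k) :+ fused) ++ cur.drop(k + 2)
    }
    cur
  }
}
```

Merging first and shrinking afterwards keeps the total count unchanged while bounding the number of bins by the requested capacity.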

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704947#comment-14704947
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37532638
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704961#comment-14704961
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37532792
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704968#comment-14704968
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37533038
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704970#comment-14704970
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37533127
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704977#comment-14704977
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37533528
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704986#comment-14704986
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37534077
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705018#comment-14705018
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37536461
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/DiscreteHistogram.scala
 ---
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.collection.mutable
+
+/** Implementation of a discrete valued online histogram
+  *
+  * =Parameters=
+  * -[[numCategories]]:
+  * Number of categories in the histogram
+  */
+class DiscreteHistogram(numCategories: Int) extends OnlineHistogram {
+
+  require(numCategories > 0, "Capacity must be greater than zero")
+  val data = new mutable.HashMap[Double, Int]()
+
+  /** Number of categories in the histogram
+*
+* @return number of categories
+*/
+  override def bins: Int = {
+numCategories
+  }
+
+  /** Increment count of category c
+*
+* @param c category whose count needs to be incremented
+*/
+  override def add(c: Double): Unit = {
+data.get(c) match {
+  case None =>
+require(data.size < numCategories, "Insufficient capacity. Failed to add.")
+data.put(c, 1)
+  case Some(value) =>
+data.update(c, value + 1)
+}
+  }
+
+  /** Merges the histogram with h and returns a new histogram
+*
+* @param h histogram to be merged
+* @param B number of categories in the resultant histogram.
+*  (Default: ```0```, number of categories will be the size of union of categories in
--- End diff --

Backticks are not valid Scaladoc syntax, as far as I know.



[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705023#comment-14705023
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37536712
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/DiscreteHistogram.scala
 ---
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.collection.mutable
+
+/** Implementation of a discrete valued online histogram
+  *
+  * =Parameters=
+  * -[[numCategories]]:
+  * Number of categories in the histogram
+  */
+class DiscreteHistogram(numCategories: Int) extends OnlineHistogram {
+
+  require(numCategories > 0, "Capacity must be greater than zero")
+  val data = new mutable.HashMap[Double, Int]()
+
+  /** Number of categories in the histogram
+*
+* @return number of categories
+*/
+  override def bins: Int = {
+numCategories
+  }
+
+  /** Increment count of category c
+*
+* @param c category whose count needs to be incremented
+*/
+  override def add(c: Double): Unit = {
+data.get(c) match {
+  case None =>
+require(data.size < numCategories, "Insufficient capacity. Failed to add.")
+data.put(c, 1)
+  case Some(value) =>
+data.update(c, value + 1)
+}
+  }
+
+  /** Merges the histogram with h and returns a new histogram
--- End diff --

What is *h*? It would be easier to understand if you wrote *Merges this 
histogram with the given histogram h. The result is a new histogram.*



[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705034#comment-14705034
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37537319
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/DiscreteHistogram.scala
 ---
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.collection.mutable
+
+/** Implementation of a discrete valued online histogram
+  *
+  * =Parameters=
+  * -[[numCategories]]:
+  * Number of categories in the histogram
+  */
+class DiscreteHistogram(numCategories: Int) extends OnlineHistogram {
+
+  require(numCategories > 0, "Capacity must be greater than zero")
+  val data = new mutable.HashMap[Double, Int]()
+
+  /** Number of categories in the histogram
+*
+* @return number of categories
+*/
+  override def bins: Int = {
+numCategories
+  }
+
+  /** Increment count of category c
+*
+* @param c category whose count needs to be incremented
+*/
+  override def add(c: Double): Unit = {
+data.get(c) match {
+  case None =>
+require(data.size < numCategories, "Insufficient capacity. Failed to add.")
+data.put(c, 1)
+  case Some(value) =>
+data.update(c, value + 1)
+}
+  }
+
+  /** Merges the histogram with h and returns a new histogram
+*
+* @param h histogram to be merged
+* @param B number of categories in the resultant histogram.
+*  (Default: ```0```, number of categories will be the size of union of categories in
+*  both histograms)
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int = 0): DiscreteHistogram = {
+h match {
+  case h1: DiscreteHistogram => {
+val finalMap = new mutable.HashMap[Double, Int]()
+data.iterator.foreach(x => finalMap.put(x._1, x._2))
--- End diff --

Moreover, you can simply do `finalMap ++= data`
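
A minimal sketch of that suggestion, assuming two `mutable.HashMap[Double, Int]` count maps like the `data` field in the diff. The object `MergeCountsSketch` and the helper `mergeCounts` are hypothetical names used only for illustration; they are not part of the pull request.

```scala
import scala.collection.mutable

object MergeCountsSketch {
  // Hypothetical helper: combine two category-count maps into a new map,
  // leaving both inputs unchanged.
  def mergeCounts(
      left: mutable.HashMap[Double, Int],
      right: mutable.HashMap[Double, Int]): mutable.HashMap[Double, Int] = {
    val finalMap = new mutable.HashMap[Double, Int]()
    // Bulk-copy every entry of `left`; this replaces the element-wise foreach/put loop.
    finalMap ++= left
    // Add the counts from `right`, summing where the category is already present.
    right.foreach { case (category, count) =>
      finalMap.update(category, finalMap.getOrElse(category, 0) + count)
    }
    finalMap
  }
}
```

Using `getOrElse` with a default of 0 also removes the need for the `None`/`Some` match in the second loop.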



[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705037#comment-14705037
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37537538
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/DiscreteHistogram.scala
 ---
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.collection.mutable
+
+/** Implementation of a discrete valued online histogram
+  *
+  * =Parameters=
+  * -[[numCategories]]:
+  * Number of categories in the histogram
+  */
+class DiscreteHistogram(numCategories: Int) extends OnlineHistogram {
+
+  require(numCategories > 0, "Capacity must be greater than zero")
+  val data = new mutable.HashMap[Double, Int]()
+
+  /** Number of categories in the histogram
+*
+* @return number of categories
+*/
+  override def bins: Int = {
+numCategories
+  }
+
+  /** Increment count of category c
+*
+* @param c category whose count needs to be incremented
+*/
+  override def add(c: Double): Unit = {
+data.get(c) match {
+  case None =>
+require(data.size < numCategories, "Insufficient capacity. Failed to add.")
+data.put(c, 1)
+  case Some(value) =>
+data.update(c, value + 1)
+}
+  }
+
+  /** Merges the histogram with h and returns a new histogram
+*
+* @param h histogram to be merged
+* @param B number of categories in the resultant histogram.
+*  (Default: ```0```, number of categories will be the size of union of categories in
+*  both histograms)
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int = 0): DiscreteHistogram = {
+h match {
+  case h1: DiscreteHistogram => {
+val finalMap = new mutable.HashMap[Double, Int]()
+data.iterator.foreach(x => finalMap.put(x._1, x._2))
+h1.data.iterator.foreach(x => {
+  finalMap.get(x._1) match {
+case None => finalMap.put(x._1, x._2)
+case Some(value) => finalMap.update(x._1, x._2 + value)
+  }
+})
+require(B == 0 || finalMap.size <= B, "Insufficient capacity. Failed to merge")
+val finalSize = if (B > 0) B else finalMap.size
+val ret = new DiscreteHistogram(finalSize)
+ret.loadData(finalMap.toArray)
+ret
+  }
+  case default =>
+throw new RuntimeException("Only a discrete histogram is allowed to be merged with a " +
+  "discrete histogram")
+}
+  }
+
+  /** Number of elements in category c
+*
+* @return Number of points in category c
+*/
+  def count(c: Double): Int = {
+data.get(c) match {
+  case None => 0
+  case Some(value) => value
+}
+  }
+
+  /** Returns the total number of elements in the histogram
+*
+* @return total number of elements added so far
+*/
+  override def total: Int = data.values.sum
+
+  /** Returns the string representation of the histogram.
+*
+*/
+  override def toString: String = {
+s"Size:" + bins + " " + data.toString
+  }
+
+  /** Loads values and counters into the histogram.
+* This action can only be performed when there the histogram is empty
--- End diff --

Remove "there".



[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705035#comment-14705035
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37537431
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/DiscreteHistogram.scala
 ---
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.collection.mutable
+
+/** Implementation of a discrete valued online histogram
+  *
+  * =Parameters=
+  * -[[numCategories]]:
+  * Number of categories in the histogram
+  */
+class DiscreteHistogram(numCategories: Int) extends OnlineHistogram {
+
+  require(numCategories > 0, "Capacity must be greater than zero")
+  val data = new mutable.HashMap[Double, Int]()
+
+  /** Number of categories in the histogram
+*
+* @return number of categories
+*/
+  override def bins: Int = {
+numCategories
+  }
+
+  /** Increment count of category c
+*
+* @param c category whose count needs to be incremented
+*/
+  override def add(c: Double): Unit = {
+data.get(c) match {
+  case None =>
+require(data.size < numCategories, "Insufficient capacity. Failed to add.")
+data.put(c, 1)
+  case Some(value) =>
+data.update(c, value + 1)
+}
+  }
+
+  /** Merges the histogram with h and returns a new histogram
+*
+* @param h histogram to be merged
+* @param B number of categories in the resultant histogram.
+*  (Default: ```0```, number of categories will be the size of union of categories in
+*  both histograms)
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int = 0): DiscreteHistogram = {
+h match {
+  case h1: DiscreteHistogram => {
+val finalMap = new mutable.HashMap[Double, Int]()
+data.iterator.foreach(x => finalMap.put(x._1, x._2))
+h1.data.iterator.foreach(x => {
+  finalMap.get(x._1) match {
+case None => finalMap.put(x._1, x._2)
+case Some(value) => finalMap.update(x._1, x._2 + value)
+  }
+})
+require(B == 0 || finalMap.size <= B, "Insufficient capacity. Failed to merge")
+val finalSize = if (B > 0) B else finalMap.size
+val ret = new DiscreteHistogram(finalSize)
+ret.loadData(finalMap.toArray)
--- End diff --

here you basically copy the data again. This is inefficient.
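
One possible way to avoid the extra copy, sketched under assumptions: the merged map is handed straight to the result through a private constructor instead of going through `toArray` and `loadData`. The class `CountHistogram` below is a hypothetical stand-in, not the `DiscreteHistogram` from the pull request, and it leaves out the capacity checks.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the histogram under review; names are illustrative only.
class CountHistogram private (val capacity: Int, val data: mutable.HashMap[Double, Int]) {

  // Public constructor: starts with an empty count map.
  def this(capacity: Int) = this(capacity, new mutable.HashMap[Double, Int]())

  // Increment the count of category c.
  def add(c: Double): Unit = data.update(c, data.getOrElse(c, 0) + 1)

  // Merge with another histogram and return a new one; neither input is modified.
  def merge(other: CountHistogram): CountHistogram = {
    val merged = new mutable.HashMap[Double, Int]()
    merged ++= data // single copy of this histogram's entries
    other.data.foreach { case (c, n) =>
      merged.update(c, merged.getOrElse(c, 0) + n)
    }
    // The merged map becomes the result's backing map directly,
    // so there is no intermediate array and no second load.
    new CountHistogram(merged.size, merged)
  }
}
```

The trade-off is that the result shares no state with either input, at the cost of one copy of `data`, which is the same as the original code minus the `toArray`/`loadData` round trip.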



[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704930#comment-14704930
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37531603
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+*
+* @param p value to be added
+*/
+  override def add(p: Double): Unit = {
+require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+// search for the index where the value is just higher than p
+val search = find(p)
+// add the new value there, shifting everything to the right
+data.insert(search, (p, 1))
+// If we're over capacity or any two elements are within 1e-9 of each other, merge.
+// This will take care of the case if p was actually equal to some value in the histogram and
+// just increment the value there
+mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+*
+* @param h histogram to be merged
+* @param B capacity of the resultant histogram
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+h match {
+  case temp: ContinuousHistogram => {
+val m: Int = bins
+val n: Int = temp.bins
+var i, j: Int = 0
+val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+while (i < m || j < n) {
+  if (i >= m) {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  } else if (j >= n || getValue(i) <= temp.getValue(j)) {
+mergeList += data.apply(i)
+i = i + 1
+  } else {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  }
+}
+// the size will be brought to capacity while constructing the new histogram itself
+val finalLower = Math.min(lower, temp.lower)
+val finalUpper = Math.max(upper, temp.upper)
+val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+ret.loadData(mergeList.toArray)
+ret
+  }
+  case default =>
+throw new RuntimeException("Only a continuous histogram is allowed to be merged with a " +
+  "continuous histogram")
+
+}
+  }
+
+  /** Returns the qth quantile of the histogram
+*
+* @param q Quantile value in (0,1)
+* @return Value at quantile q
+*/
+  def quantile(q: Double): Double = {
+require(bins > 0, "Histogram is empty")
+require(q > 0 && q < 1, "Quantile must be between 0 and 1")
  

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704933#comment-14704933
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37531931
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+*
+* @param p value to be added
+*/
+  override def add(p: Double): Unit = {
+require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+// search for the index where the value is just higher than p
+val search = find(p)
+// add the new value there, shifting everything to the right
+data.insert(search, (p, 1))
+// If we're over capacity or any two elements are within 1e-9 of each other, merge.
+// This will take care of the case if p was actually equal to some value in the histogram and
+// just increment the value there
+mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+*
+* @param h histogram to be merged
+* @param B capacity of the resultant histogram
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+h match {
+  case temp: ContinuousHistogram => {
+val m: Int = bins
+val n: Int = temp.bins
+var i, j: Int = 0
+val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+while (i < m || j < n) {
+  if (i >= m) {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  } else if (j >= n || getValue(i) <= temp.getValue(j)) {
+mergeList += data.apply(i)
+i = i + 1
+  } else {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  }
+}
+// the size will be brought to capacity while constructing the new histogram itself
+val finalLower = Math.min(lower, temp.lower)
+val finalUpper = Math.max(upper, temp.upper)
+val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+ret.loadData(mergeList.toArray)
+ret
+  }
+  case default =>
+throw new RuntimeException("Only a continuous histogram is allowed to be merged with a " +
+  "continuous histogram")
+
+}
+  }
+
+  /** Returns the qth quantile of the histogram
+*
+* @param q Quantile value in (0,1)
+* @return Value at quantile q
+*/
+  def quantile(q: Double): Double = {
+require(bins > 0, "Histogram is empty")
+require(q > 0 && q < 1, "Quantile must be between 0 and 1")
  

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704962#comment-14704962
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37532870
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+*
+* @param p value to be added
+*/
+  override def add(p: Double): Unit = {
+require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+// search for the index where the value is just higher than p
+val search = find(p)
+// add the new value there, shifting everything to the right
+data.insert(search, (p, 1))
+// If we're over capacity or any two elements are within 1e-9 of each other, merge.
+// This will take care of the case if p was actually equal to some value in the histogram and
+// just increment the value there
+mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+*
+* @param h histogram to be merged
+* @param B capacity of the resultant histogram
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+h match {
+  case temp: ContinuousHistogram => {
+val m: Int = bins
+val n: Int = temp.bins
+var i, j: Int = 0
+val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+while (i < m || j < n) {
+  if (i >= m) {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  } else if (j >= n || getValue(i) <= temp.getValue(j)) {
+mergeList += data.apply(i)
+i = i + 1
+  } else {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  }
+}
+// the size will be brought to capacity while constructing the new histogram itself
+val finalLower = Math.min(lower, temp.lower)
+val finalUpper = Math.max(upper, temp.upper)
+val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+ret.loadData(mergeList.toArray)
+ret
+  }
+  case default =>
+throw new RuntimeException("Only a continuous histogram is allowed to be merged with a " +
+  "continuous histogram")
+
+}
+  }
+
+  /** Returns the qth quantile of the histogram
+*
+* @param q Quantile value in (0,1)
+* @return Value at quantile q
+*/
+  def quantile(q: Double): Double = {
+require(bins > 0, "Histogram is empty")
+require(q > 0 && q < 1, "Quantile must be between 0 and 1")
  

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704974#comment-14704974
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37533250
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends 
OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+*
+* @param p value to be added
+*/
+  override def add(p: Double): Unit = {
+require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+// search for the index where the value is just higher than p
+val search = find(p)
+// add the new value there, shifting everything to the right
+data.insert(search, (p, 1))
+// If we're over capacity or any two elements are within 1e-9 of each 
other, merge.
+// This will take care of the case if p was actually equal to some 
value in the histogram and
+// just increment the value there
+mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+*
+* @param h histogram to be merged
+* @param B capacity of the resultant histogram
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+h match {
+  case temp: ContinuousHistogram => {
+val m: Int = bins
+val n: Int = temp.bins
+var i, j: Int = 0
+val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+while (i < m || j < n) {
+  if (i >= m) {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  } else if (j >= n || getValue(i) <= temp.getValue(j)) {
+mergeList += data.apply(i)
+i = i + 1
+  } else {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  }
+}
+// the size will be brought to capacity while constructing the new 
histogram itself
+val finalLower = Math.min(lower, temp.lower)
+val finalUpper = Math.max(upper, temp.upper)
+val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+ret.loadData(mergeList.toArray)
+ret
+  }
+  case default =>
+throw new RuntimeException("Only a continuous histogram is allowed to be merged with a " +
+  "continuous histogram")
+
+}
+  }
+
+  /** Returns the qth quantile of the histogram
+*
+* @param q Quantile value in (0,1)
+* @return Value at quantile q
+*/
+  def quantile(q: Double): Double = {
+require(bins > 0, "Histogram is empty")
+require(q > 0 && q < 1, "Quantile must be between 0 and 1")

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704988#comment-14704988
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37534088
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends 
OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+*
+* @param p value to be added
+*/
+  override def add(p: Double): Unit = {
+require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+// search for the index where the value is just higher than p
+val search = find(p)
+// add the new value there, shifting everything to the right
+data.insert(search, (p, 1))
+// If we're over capacity or any two elements are within 1e-9 of each 
other, merge.
+// This will take care of the case if p was actually equal to some 
value in the histogram and
+// just increment the value there
+mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+*
+* @param h histogram to be merged
+* @param B capacity of the resultant histogram
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+h match {
+  case temp: ContinuousHistogram => {
+val m: Int = bins
+val n: Int = temp.bins
+var i, j: Int = 0
+val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+while (i < m || j < n) {
+  if (i >= m) {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  } else if (j >= n || getValue(i) <= temp.getValue(j)) {
+mergeList += data.apply(i)
+i = i + 1
+  } else {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  }
+}
+// the size will be brought to capacity while constructing the new 
histogram itself
+val finalLower = Math.min(lower, temp.lower)
+val finalUpper = Math.max(upper, temp.upper)
+val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+ret.loadData(mergeList.toArray)
+ret
+  }
+  case default =>
+throw new RuntimeException("Only a continuous histogram is allowed to be merged with a " +
+  "continuous histogram")
+
+}
+  }
+
+  /** Returns the qth quantile of the histogram
+*
+* @param q Quantile value in (0,1)
+* @return Value at quantile q
+*/
+  def quantile(q: Double): Double = {
+require(bins > 0, "Histogram is empty")
+require(q > 0 && q < 1, "Quantile must be between 0 and 1")
  

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704982#comment-14704982
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37533933
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends 
OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+*
+* @param p value to be added
+*/
+  override def add(p: Double): Unit = {
+require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+// search for the index where the value is just higher than p
+val search = find(p)
+// add the new value there, shifting everything to the right
+data.insert(search, (p, 1))
+// If we're over capacity or any two elements are within 1e-9 of each 
other, merge.
+// This will take care of the case if p was actually equal to some 
value in the histogram and
+// just increment the value there
+mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+*
+* @param h histogram to be merged
+* @param B capacity of the resultant histogram
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+h match {
+  case temp: ContinuousHistogram => {
+val m: Int = bins
+val n: Int = temp.bins
+var i, j: Int = 0
+val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+while (i < m || j < n) {
+  if (i >= m) {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  } else if (j >= n || getValue(i) <= temp.getValue(j)) {
+mergeList += data.apply(i)
+i = i + 1
+  } else {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  }
+}
+// the size will be brought to capacity while constructing the new 
histogram itself
+val finalLower = Math.min(lower, temp.lower)
+val finalUpper = Math.max(upper, temp.upper)
+val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+ret.loadData(mergeList.toArray)
+ret
+  }
+  case default =>
+throw new RuntimeException("Only a continuous histogram is allowed to be merged with a " +
+  "continuous histogram")
+
+}
+  }
+
+  /** Returns the qth quantile of the histogram
+*
+* @param q Quantile value in (0,1)
+* @return Value at quantile q
+*/
+  def quantile(q: Double): Double = {
+require(bins > 0, "Histogram is empty")
+require(q > 0 && q < 1, "Quantile must be between 0 and 1")
  

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704993#comment-14704993
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37534398
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends 
OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+*
+* @param p value to be added
+*/
+  override def add(p: Double): Unit = {
+require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+// search for the index where the value is just higher than p
+val search = find(p)
+// add the new value there, shifting everything to the right
+data.insert(search, (p, 1))
+// If we're over capacity or any two elements are within 1e-9 of each 
other, merge.
+// This will take care of the case if p was actually equal to some 
value in the histogram and
+// just increment the value there
+mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+*
+* @param h histogram to be merged
+* @param B capacity of the resultant histogram
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+h match {
+  case temp: ContinuousHistogram => {
+val m: Int = bins
+val n: Int = temp.bins
+var i, j: Int = 0
+val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+while (i < m || j < n) {
+  if (i >= m) {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  } else if (j >= n || getValue(i) <= temp.getValue(j)) {
+mergeList += data.apply(i)
+i = i + 1
+  } else {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  }
+}
+// the size will be brought to capacity while constructing the new 
histogram itself
+val finalLower = Math.min(lower, temp.lower)
+val finalUpper = Math.max(upper, temp.upper)
+val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+ret.loadData(mergeList.toArray)
+ret
+  }
+  case default =>
+throw new RuntimeException("Only a continuous histogram is allowed to be merged with a " +
+  "continuous histogram")
+
+}
+  }
+
+  /** Returns the qth quantile of the histogram
+*
+* @param q Quantile value in (0,1)
+* @return Value at quantile q
+*/
+  def quantile(q: Double): Double = {
+require(bins > 0, "Histogram is empty")
+require(q > 0 && q < 1, "Quantile must be between 0 and 1")
  

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704991#comment-14704991
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37534224
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends 
OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+*
+* @param p value to be added
+*/
+  override def add(p: Double): Unit = {
+require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+// search for the index where the value is just higher than p
+val search = find(p)
+// add the new value there, shifting everything to the right
+data.insert(search, (p, 1))
+// If we're over capacity or any two elements are within 1e-9 of each 
other, merge.
+// This will take care of the case if p was actually equal to some 
value in the histogram and
+// just increment the value there
+mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+*
+* @param h histogram to be merged
+* @param B capacity of the resultant histogram
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+h match {
+  case temp: ContinuousHistogram => {
+val m: Int = bins
+val n: Int = temp.bins
+var i, j: Int = 0
+val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+while (i < m || j < n) {
+  if (i >= m) {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  } else if (j >= n || getValue(i) <= temp.getValue(j)) {
+mergeList += data.apply(i)
+i = i + 1
+  } else {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  }
+}
+// the size will be brought to capacity while constructing the new 
histogram itself
+val finalLower = Math.min(lower, temp.lower)
+val finalUpper = Math.max(upper, temp.upper)
+val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+ret.loadData(mergeList.toArray)
+ret
+  }
+  case default =>
+throw new RuntimeException("Only a continuous histogram is allowed to be merged with a " +
+  "continuous histogram")
+
+}
+  }
+
+  /** Returns the qth quantile of the histogram
+*
+* @param q Quantile value in (0,1)
+* @return Value at quantile q
+*/
+  def quantile(q: Double): Double = {
+require(bins > 0, "Histogram is empty")
+require(q > 0 && q < 1, "Quantile must be between 0 and 1")
  

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705021#comment-14705021
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37536565
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends 
OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+*
+* @param p value to be added
+*/
+  override def add(p: Double): Unit = {
+require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+// search for the index where the value is just higher than p
+val search = find(p)
+// add the new value there, shifting everything to the right
+data.insert(search, (p, 1))
+// If we're over capacity or any two elements are within 1e-9 of each 
other, merge.
+// This will take care of the case if p was actually equal to some 
value in the histogram and
+// just increment the value there
+mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+*
+* @param h histogram to be merged
+* @param B capacity of the resultant histogram
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+h match {
+  case temp: ContinuousHistogram => {
+val m: Int = bins
+val n: Int = temp.bins
+var i, j: Int = 0
+val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+while (i < m || j < n) {
+  if (i >= m) {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  } else if (j >= n || getValue(i) <= temp.getValue(j)) {
+mergeList += data.apply(i)
+i = i + 1
+  } else {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  }
+}
+// the size will be brought to capacity while constructing the new 
histogram itself
+val finalLower = Math.min(lower, temp.lower)
+val finalUpper = Math.max(upper, temp.upper)
+val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+ret.loadData(mergeList.toArray)
+ret
+  }
+  case default =>
+throw new RuntimeException("Only a continuous histogram is allowed to be merged with a " +
+  "continuous histogram")
+
+}
+  }
+
+  /** Returns the qth quantile of the histogram
+*
+* @param q Quantile value in (0,1)
+* @return Value at quantile q
+*/
+  def quantile(q: Double): Double = {
+require(bins > 0, "Histogram is empty")
+require(q > 0 && q < 1, "Quantile must be between 0 and 1")
  

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705028#comment-14705028
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37537083
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends 
OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+*
+* @param p value to be added
+*/
+  override def add(p: Double): Unit = {
+require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+// search for the index where the value is just higher than p
+val search = find(p)
+// add the new value there, shifting everything to the right
+data.insert(search, (p, 1))
+// If we're over capacity or any two elements are within 1e-9 of each 
other, merge.
+// This will take care of the case if p was actually equal to some 
value in the histogram and
+// just increment the value there
+mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+*
+* @param h histogram to be merged
+* @param B capacity of the resultant histogram
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+h match {
+  case temp: ContinuousHistogram => {
+val m: Int = bins
+val n: Int = temp.bins
+var i, j: Int = 0
+val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+while (i < m || j < n) {
+  if (i >= m) {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  } else if (j >= n || getValue(i) <= temp.getValue(j)) {
+mergeList += data.apply(i)
+i = i + 1
+  } else {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  }
+}
+// the size will be brought to capacity while constructing the new 
histogram itself
+val finalLower = Math.min(lower, temp.lower)
+val finalUpper = Math.max(upper, temp.upper)
+val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+ret.loadData(mergeList.toArray)
+ret
+  }
+  case default =>
+throw new RuntimeException("Only a continuous histogram is allowed to be merged with a " +
+  "continuous histogram")
+
+}
+  }
+
+  /** Returns the qth quantile of the histogram
+*
+* @param q Quantile value in (0,1)
+* @return Value at quantile q
+*/
+  def quantile(q: Double): Double = {
+require(bins > 0, "Histogram is empty")
+require(q > 0 && q < 1, "Quantile must be between 0 and 1")

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704931#comment-14704931
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37531763
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends 
OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+*
+* @param p value to be added
+*/
+  override def add(p: Double): Unit = {
+require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+// search for the index where the value is just higher than p
+val search = find(p)
+// add the new value there, shifting everything to the right
+data.insert(search, (p, 1))
+// If we're over capacity or any two elements are within 1e-9 of each 
other, merge.
+// This will take care of the case if p was actually equal to some 
value in the histogram and
+// just increment the value there
+mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+*
+* @param h histogram to be merged
+* @param B capacity of the resultant histogram
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+h match {
+  case temp: ContinuousHistogram => {
+val m: Int = bins
+val n: Int = temp.bins
+var i, j: Int = 0
+val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+while (i < m || j < n) {
+  if (i >= m) {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  } else if (j >= n || getValue(i) <= temp.getValue(j)) {
+mergeList += data.apply(i)
+i = i + 1
+  } else {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  }
+}
+// the size will be brought to capacity while constructing the new 
histogram itself
+val finalLower = Math.min(lower, temp.lower)
+val finalUpper = Math.max(upper, temp.upper)
+val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+ret.loadData(mergeList.toArray)
+ret
+  }
+  case default =
+throw new RuntimeException(Only a continuous histogram is allowed 
to be merged with a  +
+  continuous histogram)
+
+}
+  }
+
+  /** Returns the qth quantile of the histogram
+    *
+    * @param q Quantile value in (0,1)
+    * @return Value at quantile q
+    */
+  def quantile(q: Double): Double = {
+    require(bins > 0, "Histogram is empty")
--- End diff --

There is no valid return value 
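
For readers following the thread: the merge quoted in the diff above walks two sorted bin lists in step and leaves the reduction to capacity to the newly constructed histogram. A minimal Scala sketch of that idea follows. It is not the code from the pull request; `HistogramMergeSketch`, `mergeSorted` and `shrink` are hypothetical names, and the shrink step assumes the closest-pair rule from the Ben-Haim and Yom-Tov paper referenced in the diff (fuse the two nearest bins into their count-weighted mean).

```scala
object HistogramMergeSketch {
  // A bin is (value, count); every list of bins is assumed sorted by value.
  type Bin = (Double, Int)

  /** Two-pointer merge of two sorted bin lists into one sorted list. */
  def mergeSorted(a: Vector[Bin], b: Vector[Bin]): Vector[Bin] = {
    val out = Vector.newBuilder[Bin]
    var i = 0
    var j = 0
    while (i < a.length || j < b.length) {
      if (i >= a.length) {
        out += b(j); j += 1
      } else if (j >= b.length || a(i)._1 <= b(j)._1) {
        out += a(i); i += 1
      } else {
        out += b(j); j += 1
      }
    }
    out.result()
  }

  /** Assumed shrink rule: fuse the two closest adjacent bins until at most
    * `capacity` bins remain. Counts are summed, values are averaged by count.
    */
  def shrink(bins: Vector[Bin], capacity: Int): Vector[Bin] = {
    require(capacity > 0, "Capacity should be a positive integer")
    var cur = bins
    while (cur.length > capacity) {
      // index of the adjacent pair with the smallest gap (first one on ties)
      val k = (0 until cur.length - 1).minBy(idx => cur(idx + 1)._1 - cur(idx)._1)
      val (v1, c1) = cur(k)
      val (v2, c2) = cur(k + 1)
      val fused: Bin = ((v1 * c1 + v2 * c2) / (c1 + c2), c1 + c2)
      cur = cur.patch(k, Seq(fused), 2)
    }
    cur
  }
}
```

With bins (1.0, 1), (3.0, 2) on one side and (2.0, 1), (10.0, 1) on the other, merging and then shrinking to capacity 2 gives (2.25, 4), (10.0, 1); the total count of 5 is preserved.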

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704941#comment-14704941
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37532289
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+    *
+    * @param p value to be added
+    */
+  override def add(p: Double): Unit = {
+    require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+    // search for the index where the value is just higher than p
+    val search = find(p)
+    // add the new value there, shifting everything to the right
+    data.insert(search, (p, 1))
+    // If we're over capacity or any two elements are within 1e-9 of each other, merge.
+    // This will take care of the case if p was actually equal to some value in the histogram and
+    // just increment the value there
+    mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+    *
+    * @param h histogram to be merged
+    * @param B capacity of the resultant histogram
+    * @return Merged histogram with capacity B
+    */
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+    h match {
+      case temp: ContinuousHistogram => {
+        val m: Int = bins
+        val n: Int = temp.bins
+        var i, j: Int = 0
+        val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+        while (i < m || j < n) {
+          if (i >= m) {
+            mergeList += ((temp.getValue(j), temp.getCounter(j)))
+            j = j + 1
+          } else if (j >= n || getValue(i) <= temp.getValue(j)) {
+            mergeList += data.apply(i)
+            i = i + 1
+          } else {
+            mergeList += ((temp.getValue(j), temp.getCounter(j)))
+            j = j + 1
+          }
+        }
+        // the size will be brought to capacity while constructing the new histogram itself
+        val finalLower = Math.min(lower, temp.lower)
+        val finalUpper = Math.max(upper, temp.upper)
+        val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+        ret.loadData(mergeList.toArray)
+        ret
+      }
+      case default =>
+        throw new RuntimeException("Only a continuous histogram is allowed to be merged with a " +
+          "continuous histogram")
+
+    }
+  }
+
+  /** Returns the qth quantile of the histogram
+    *
+    * @param q Quantile value in (0,1)
+    * @return Value at quantile q
+    */
+  def quantile(q: Double): Double = {
+    require(bins > 0, "Histogram is empty")
+    require(q > 0 && q < 1, "Quantile must be between 0 and 1")
  

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704946#comment-14704946
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user sachingoel0101 commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37532616
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704964#comment-14704964
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37532900
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
  

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704975#comment-14704975
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37533263
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
  

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704979#comment-14704979
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37533713
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
  

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704990#comment-14704990
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37534170
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
  

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704997#comment-14704997
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37534582
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
  

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704994#comment-14704994
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37534462
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.Double.MaxValue
+import scala.collection.mutable
+
+/** Implementation of a continuous valued online histogram
+  * Adapted from Ben-Haim and Yom-Tov's Streaming Decision Tree Algorithm
+  * Refer http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
+  *
+  * =Parameters=
+  * -[[capacity]]:
+  * Number of bins to be used in the histogram
+  *
+  * -[[min]]:
+  * Lower limit on the elements
+  *
+  * -[[max]]:
+  * Upper limit on the elements
+  */
+class ContinuousHistogram(capacity: Int, min: Double, max: Double) extends OnlineHistogram {
+
+  private val lower = min
+  private val upper = max
+
+  require(capacity > 0, "Capacity should be a positive integer")
+  require(lower < upper, "Lower must be less than upper")
+
+  val data = new mutable.ArrayBuffer[(Double, Int)]()
+
+  /** Adds a new item to the histogram
+*
+* @param p value to be added
+*/
+  override def add(p: Double): Unit = {
+require(p > lower && p < upper, p + " not in (" + lower + "," + upper + ")")
+// search for the index where the value is just higher than p
+val search = find(p)
+// add the new value there, shifting everything to the right
+data.insert(search, (p, 1))
+// If we're over capacity or any two elements are within 1e-9 of each other, merge.
+// This will take care of the case if p was actually equal to some value in the histogram and
+// just increment the value there
+mergeElements()
+  }
+
+  /** Merges the histogram with h and returns a histogram with capacity B
+*
+* @param h histogram to be merged
+* @param B capacity of the resultant histogram
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int): ContinuousHistogram = {
+h match {
+  case temp: ContinuousHistogram => {
+val m: Int = bins
+val n: Int = temp.bins
+var i, j: Int = 0
+val mergeList = new mutable.ArrayBuffer[(Double, Int)]()
+while (i < m || j < n) {
+  if (i >= m) {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  } else if (j >= n || getValue(i) <= temp.getValue(j)) {
+mergeList += data.apply(i)
+i = i + 1
+  } else {
+mergeList += ((temp.getValue(j), temp.getCounter(j)))
+j = j + 1
+  }
+}
+// the size will be brought to capacity while constructing the new histogram itself
+val finalLower = Math.min(lower, temp.lower)
+val finalUpper = Math.max(upper, temp.upper)
+val ret = new ContinuousHistogram(B, finalLower, finalUpper)
+ret.loadData(mergeList.toArray)
+ret
+  }
+  case default =>
+throw new RuntimeException("Only a continuous histogram is allowed to be merged with a " +
+  "continuous histogram")
+
+}
+  }
+
+  /** Returns the qth quantile of the histogram
+*
+* @param q Quantile value in (0,1)
+* @return Value at quantile q
+*/
+  def quantile(q: Double): Double = {
+require(bins > 0, "Histogram is empty")
+require(q > 0 && q < 1, "Quantile must be between 0 and 1")
  

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705000#comment-14705000
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37534762
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
  

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705009#comment-14705009
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37535831
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/DiscreteHistogram.scala
 ---
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.collection.mutable
+
+/** Implementation of a discrete valued online histogram
--- End diff --

A little bit more details would be helpful here.
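
As a rough illustration of what a more detailed class comment could look like: the wording below is inferred only from the code in this diff (one counter per category, a fixed number of categories). The class name `DiscreteHistogramDocSketch` is hypothetical, and the class is a stripped-down stand-in for showing the comment in context, not the PR's implementation.

```scala
import scala.collection.mutable

/** Sketch of a more detailed class comment (hypothetical stand-in class).
  *
  * A discrete valued online histogram keeps one exact counter per distinct
  * category value. It never merges or approximates bins: adding a value that
  * is already present increments its counter, and adding a new value fails
  * once `numCategories` distinct values have been seen.
  *
  * @param numCategories maximum number of distinct categories
  */
class DiscreteHistogramDocSketch(numCategories: Int) {
  require(numCategories > 0, "Capacity must be greater than zero")

  private val data = new mutable.HashMap[Double, Int]()

  /** Increments the count of category c, failing if capacity is exhausted. */
  def add(c: Double): Unit = {
    if (!data.contains(c)) {
      require(data.size < numCategories, "Insufficient capacity. Failed to add.")
    }
    data.update(c, data.getOrElse(c, 0) + 1)
  }

  /** Number of elements seen so far in category c (0 if never seen). */
  def count(c: Double): Int = data.getOrElse(c, 0)
}
```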


> Implement an online histogram with Merging and equalization features
> --------------------------------------------------------------------
>
>                 Key: FLINK-2030
>                 URL: https://issues.apache.org/jira/browse/FLINK-2030
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Machine Learning Library
>            Reporter: Sachin Goel
>            Assignee: Sachin Goel
>            Priority: Minor
>              Labels: ML
>
> For the implementation of the decision tree in 
> https://issues.apache.org/jira/browse/FLINK-1727, we need to implement a 
> histogram with online updates, merging and equalization features. A reference 
> implementation is provided in [1].
> [1] http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
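
For readers unfamiliar with the referenced paper, here is a minimal sketch of the online update step it describes: insert the new point as a bin of count 1, then, while the histogram is over capacity, merge the two closest neighbouring bins into their count-weighted centroid. The names `OnlineHistogramSketch`, `Bin` and `update` are assumptions made for this illustration; it is not the code under review and omits the 1e-9 proximity merge, merging of two histograms, and equalization.

```scala
object OnlineHistogramSketch {
  // A bin is (centroid, count); bins are kept sorted by centroid.
  type Bin = (Double, Int)

  /** Inserts p as a unit bin, then merges closest neighbours until the
    * histogram fits into `capacity` bins.
    */
  def update(bins: Vector[Bin], p: Double, capacity: Int): Vector[Bin] = {
    require(capacity > 0, "Capacity should be a positive integer")
    // position of the first centroid strictly greater than p
    val idx = bins.indexWhere(_._1 > p) match {
      case -1 => bins.length
      case i => i
    }
    var result = (bins.take(idx) :+ ((p, 1))) ++ bins.drop(idx)
    while (result.length > capacity) {
      // index i minimising the gap between bin i and bin i + 1
      val i = (0 until result.length - 1).minBy(k => result(k + 1)._1 - result(k)._1)
      val (v1, c1) = result(i)
      val (v2, c2) = result(i + 1)
      // count-weighted centroid keeps the total count and the mean unchanged
      val merged = ((v1 * c1 + v2 * c2) / (c1 + c2), c1 + c2)
      result = (result.take(i) :+ merged) ++ result.drop(i + 2)
    }
    result
  }
}
```

Folding the points 1.0, 2.0 and 10.0 into an empty histogram of capacity 2 leaves two bins: the first two points collapse into one bin at 1.5 with count 2, and 10.0 keeps its own bin.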



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705008#comment-14705008
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37535755
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/ContinuousHistogram.scala
 ---
  

[jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-08-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705040#comment-14705040
 ] 

ASF GitHub Bot commented on FLINK-2030:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37537658
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/statistics/DiscreteHistogram.scala
 ---
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.statistics
+
+import scala.collection.mutable
+
+/** Implementation of a discrete valued online histogram
+  *
+  * =Parameters=
+  * -[[numCategories]]:
+  * Number of categories in the histogram
+  */
+class DiscreteHistogram(numCategories: Int) extends OnlineHistogram {
+
+  require(numCategories > 0, "Capacity must be greater than zero")
+  val data = new mutable.HashMap[Double, Int]()
+
+  /** Number of categories in the histogram
+*
+* @return number of categories
+*/
+  override def bins: Int = {
+numCategories
+  }
+
+  /** Increment count of category c
+*
+* @param c category whose count needs to be incremented
+*/
+  override def add(c: Double): Unit = {
+data.get(c) match {
+  case None =>
+require(data.size < numCategories, "Insufficient capacity. Failed to add.")
+data.put(c, 1)
+  case Some(value) =>
+data.update(c, value + 1)
+}
+  }
+
+  /** Merges the histogram with h and returns a new histogram
+*
+* @param h histogram to be merged
+* @param B number of categories in the resultant histogram.
+*  (Default: ```0```, number of categories will be the size of union of categories in
+*  both histograms)
+* @return Merged histogram with capacity B
+*/
+  override def merge(h: OnlineHistogram, B: Int = 0): DiscreteHistogram = {
+h match {
+  case h1: DiscreteHistogram => {
+val finalMap = new mutable.HashMap[Double, Int]()
+data.iterator.foreach(x => finalMap.put(x._1, x._2))
+h1.data.iterator.foreach(x => {
+  finalMap.get(x._1) match {
+case None => finalMap.put(x._1, x._2)
+case Some(value) => finalMap.update(x._1, x._2 + value)
+  }
+})
+require(B == 0 || finalMap.size <= B, "Insufficient capacity. Failed to merge")
+val finalSize = if (B > 0) B else finalMap.size
+val ret = new DiscreteHistogram(finalSize)
+ret.loadData(finalMap.toArray)
+ret
+  }
+  case default =>
+throw new RuntimeException("Only a discrete histogram is allowed to be merged with a " +
+  "discrete histogram")
+}
+  }
+
+  /** Number of elements in category c
+*
+* @return Number of points in category c
+*/
+  def count(c: Double): Int = {
+data.get(c) match {
+  case None => 0
+  case Some(value) => value
+}
+  }
+
+  /** Returns the total number of elements in the histogram
+*
+* @return total number of elements added so far
+*/
+  override def total: Int = data.values.sum
+
+  /** Returns the string representation of the histogram.
+*
+*/
+  override def toString: String = {
+s"Size:" + bins + " " + data.toString
--- End diff --

Knowing that this is the discrete histogram would be nice.
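
One possible shape for this, as a sketch only: prefix the output with the histogram kind so that a logged value is self-describing. `NamedDiscreteHistogram` is a hypothetical stand-in reduced to the members `toString` needs; the exact output format is an assumption, not something settled in this review.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the PR's DiscreteHistogram, reduced to what toString needs.
class NamedDiscreteHistogram(numCategories: Int) {
  val data = new mutable.HashMap[Double, Int]()

  def bins: Int = numCategories

  /** Names the histogram kind first, then its size and per-category counts. */
  override def toString: String = {
    // sort entries by category so the output is deterministic
    val entries = data.toSeq.sortBy(_._1).map { case (k, v) => s"$k -> $v" }
    val counts = entries.mkString("{", ", ", "}")
    s"DiscreteHistogram(size: $bins, counts: $counts)"
  }
}
```

Sorting the entries is a design choice for readable, stable log lines; `HashMap.toString` alone gives no ordering guarantee.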


