[GitHub] flink issue #5769: [FLINK-9069] Add checkstyle rule to detect multiple conse...

2018-03-30 Thread jparkie
Github user jparkie commented on the issue:

https://github.com/apache/flink/pull/5769
  
@zentol Bump. Just wondering if this got lost. :)


---


[GitHub] flink pull request #5769: [FLINK-9069] Add checkstyle rule to detect multipl...

2018-03-26 Thread jparkie
GitHub user jparkie opened a pull request:

https://github.com/apache/flink/pull/5769

[FLINK-9069] Add checkstyle rule to detect multiple consecutive semicolons

## What is the purpose of the change

This pull request introduces a new checkstyle rule that detects multiple
consecutive semicolons, such as those at the end of a Java statement.
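
For reference, a rule of this kind can be expressed as a single-line regexp
check in checkstyle. A minimal sketch (the exact module, pattern, and message
used by this PR may differ):

    <module name="RegexpSingleline">
      <!-- Flag two or more consecutive semicolons, e.g. "return 0;;" -->
      <property name="format" value=";;"/>
      <property name="message" value="Multiple consecutive semicolons are not allowed."/>
    </module>

Note that a naive `;;` pattern would also flag infinite `for (;;)` loops, so a
real rule needs a pattern or suppression that excludes them.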

## Brief change log

  - Add new checkstyle rule.
  - Fix violations of new checkstyle rule.

## Verifying this change

This change is a trivial rework / code cleanup without any test coverage.

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): (no)
  - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (no)
  - The serializers: (no)
  - The runtime per-record code paths (performance sensitive): (no)
  - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
  - The S3 file system connector: (no)

## Documentation

  - Does this pull request introduce a new feature? (no)
  - If yes, how is the feature documented? (not applicable)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jparkie/flink FLINK-9069_jparkie_fix_double_semicolons

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/5769.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5769


commit 61d82b0f51b7f6506b2802c30c56105ae6cc
Author: jparkie <park.jacob.96@...>
Date:   2018-03-26T01:09:25Z

[FLINK-9069] Add checkstyle rule to detect multiple consecutive semicolons




---


[GitHub] flink pull request #4652: [FLINK-7465][table]Add cardinality count for table...

2017-09-25 Thread jparkie
Github user jparkie commented on a diff in the pull request:

https://github.com/apache/flink/pull/4652#discussion_r140833539
  
--- Diff: 
flink-libraries/flink-table/src/main/java/org/apache/flink/table/runtime/functions/aggfunctions/cardinality/HyperLogLog.java
 ---
@@ -0,0 +1,333 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.table.runtime.functions.aggfunctions.cardinality;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.DataInput;
+import java.io.DataInputStream;
+import java.io.DataOutput;
+import java.io.DataOutputStream;
+import java.io.Externalizable;
+import java.io.IOException;
+import java.io.ObjectInput;
+import java.io.ObjectInputStream;
+import java.io.ObjectOutput;
+import java.io.Serializable;
+
+/**
+ * Java implementation of HyperLogLog (HLL) algorithm from this paper:
+ * 
+ * http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
+ * 
+ * HLL is an improved version of LogLog that is capable of estimating
+ * the cardinality of a set with accuracy = 1.04/sqrt(m) where
+ * m = 2^b.  So we can control accuracy vs space usage by increasing
+ * or decreasing b.
+ * 
+ * The main benefit of using HLL over LL is that it only requires 64%
+ * of the space that LL does to get the same accuracy.
+ * 
+ * 
+ * Note that this implementation does not include the long range correction function
+ * defined in the original paper.  Empirical evidence shows that the correction
+ * function causes more harm than good.
+ * 
+ */
+public class HyperLogLog implements ICardinality, Serializable {
+
+   private final RegisterSet registerSet;
+   private final int log2m;
+   private final double alphaMM;
+
+
+   /**
+* Create a new HyperLogLog instance using the specified standard deviation.
+*
+* @param rsd - the relative standard deviation for the counter.
+*smaller values create counters that require more space.
+*/
+   public HyperLogLog(double rsd) {
+   this(log2m(rsd));
+   }
+
+   private static int log2m(double rsd) {
+   return (int) (Math.log((1.106 / rsd) * (1.106 / rsd)) / Math.log(2));
+   }
+
+   private static double rsd(int log2m) {
+   return 1.106 / Math.sqrt(Math.exp(log2m * Math.log(2)));
+   }
+
+   private static void validateLog2m(int log2m) {
+   if (log2m < 0 || log2m > 30) {
+   throw new IllegalArgumentException("log2m argument is "
+   + log2m + " and is outside the range [0, 30]");
+   }
+   }
+
+   private static double linearCounting(int m, double v) {
+   return m * Math.log(m / v);
+   }
+
+   /**
+* Create a new HyperLogLog instance.  The log2m parameter defines the accuracy of
+* the counter.  The larger the log2m the better the accuracy.
+* 
+* accuracy = 1.04/sqrt(2^log2m)
+*
+* @param log2m - the number of bits to use as the basis for the HLL instance
+*/
+   public HyperLogLog(int log2m) {
+   this(log2m, new RegisterSet(1 << log2m));
+   }
+
+   /**
+* Creates a new HyperLogLog instance using the given registers.  Used for unmarshalling a serialized
+* instance and for merging multiple counters together.
+*
+* @param registerSet - the initial values for the register set
+*/
+   public HyperLogLog(int log2m, RegisterSet registerSet) {
+   validateLog2m(log2m);
+   this.registerSet = registerSet;
+   this.log2m = log2m;
+   int m = 1 << this.log2m;
+
+   alphaMM = getAlphaMM(log2m, m);
+   }
+
+   @Override
+   public boolean offerHashed(long hashedValue) {
+  


---

[GitHub] flink pull request #4652: [FLINK-7465][table]Add cardinality count for table...

2017-09-25 Thread jparkie
Github user jparkie commented on a diff in the pull request:

https://github.com/apache/flink/pull/4652#discussion_r140832701
  
--- Diff: 
flink-libraries/flink-table/src/main/java/org/apache/flink/table/runtime/functions/aggfunctions/cardinality/HyperLogLog.java
 ---
@@ -0,0 +1,333 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.table.runtime.functions.aggfunctions.cardinality;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.DataInput;
+import java.io.DataInputStream;
+import java.io.DataOutput;
+import java.io.DataOutputStream;
+import java.io.Externalizable;
+import java.io.IOException;
+import java.io.ObjectInput;
+import java.io.ObjectInputStream;
+import java.io.ObjectOutput;
+import java.io.Serializable;
+
+/**
+ * Java implementation of HyperLogLog (HLL) algorithm from this paper:
+ * 
+ * http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
+ * 
+ * HLL is an improved version of LogLog that is capable of estimating
+ * the cardinality of a set with accuracy = 1.04/sqrt(m) where
+ * m = 2^b.  So we can control accuracy vs space usage by increasing
+ * or decreasing b.
+ * 
+ * The main benefit of using HLL over LL is that it only requires 64%
+ * of the space that LL does to get the same accuracy.
+ * 
+ * 
+ * Note that this implementation does not include the long range correction function
+ * defined in the original paper.  Empirical evidence shows that the correction
+ * function causes more harm than good.
+ * 
+ */
+public class HyperLogLog implements ICardinality, Serializable {
--- End diff --

I see this class is adapted from https://github.com/addthis/stream-lib. I
think you should add a comment noting that this class was adapted from that
project, so people can track the differences.
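
As a quick sanity check of the accuracy formula in the quoted javadoc
(illustrative arithmetic only, not part of the patch), the error/space
trade-off works out as follows:

    // Illustrative sketch: relative standard error of HLL for a given log2m,
    // per the accuracy = 1.04/sqrt(2^log2m) formula in the javadoc above.
    public class HllAccuracy {
        public static void main(String[] args) {
            for (int log2m = 10; log2m <= 16; log2m += 2) {
                int m = 1 << log2m;               // number of registers
                double rse = 1.04 / Math.sqrt(m); // relative standard error
                System.out.printf("log2m=%d m=%d rse=%.2f%%%n", log2m, m, rse * 100);
            }
        }
    }

For example, log2m = 14 gives m = 16384 registers and roughly 0.81% error.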


---


[GitHub] flink pull request #4652: [FLINK-7465][table]Add cardinality count for table...

2017-09-25 Thread jparkie
Github user jparkie commented on a diff in the pull request:

https://github.com/apache/flink/pull/4652#discussion_r140830364
  
--- Diff: 
flink-libraries/flink-table/src/main/java/org/apache/flink/table/runtime/functions/aggfunctions/cardinality/ICardinality.java
 ---
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.table.runtime.functions.aggfunctions.cardinality;
+
+import java.io.IOException;
+
+/**
+ * An interface definition for cardinality estimation implementations.
+ */
+public interface ICardinality {
--- End diff --

Why is the interface named `ICardinality`? Is this common in the Flink
codebase? I know prefixing "I" is common in C#.


---


[GitHub] flink pull request #4652: [FLINK-7465][table]Add cardinality count for table...

2017-09-25 Thread jparkie
Github user jparkie commented on a diff in the pull request:

https://github.com/apache/flink/pull/4652#discussion_r140828795
  
--- Diff: 
flink-libraries/flink-table/src/main/java/org/apache/flink/table/runtime/functions/aggfunctions/cardinality/ICardinality.java
 ---
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.table.runtime.functions.aggfunctions.cardinality;
+
+import java.io.IOException;
+
+/**
+ * An interface definition for cardinality estimation implementations.
+ */
+public interface ICardinality {
+
+   /**
+* Check whether the element impacts the estimate.
+*
+* @param o stream element
+* @return false if the value returned by cardinality() is unaffected by the appearance of o in the stream.
+*/
+   boolean offer(Object o);
+
+   /**
+* Offer the value as a hashed long value.
+*
+* @param hashedLong - the hash of the item to offer to the estimator
+* @return false if the value returned by cardinality() is unaffected by the appearance of hashedLong in the stream
+*/
+   boolean offerHashed(long hashedLong);
+
+   /**
+* Offer the value as a hashed long value.
+*
+* @param hashedInt - the hash of the item to offer to the estimator
+* @return false if the value returned by cardinality() is unaffected by the appearance of hashedInt in the stream
+*/
+   boolean offerHashed(int hashedInt);
+
+   /**
+* @return the number of unique elements in the stream or an estimate thereof.
+*/
+   long cardinality();
+
+   /**
+* @return size in bytes needed for serialization.
+*/
+   int sizeof();
+
+   /**
+* Get the byte array used for the calculation.
+*
+* @return The byte array used for the calculation
+* @throws IOException
+*/
+   byte[] getBytes() throws IOException;
+
+   /**
+* Merges estimators to produce a new estimator for the combined streams
+* of this estimator and those passed as arguments.
+* 
+* Neither this estimator nor those passed as arguments are modified.
+*
+* @param estimators Zero or more compatible estimators
+* @throws Exception If at least one of the estimators is not compatible with this one
+*/
+   ICardinality merge(ICardinality... estimators) throws Exception;
--- End diff --

Wouldn't it be nicer to have a more specific Exception?
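
For what it is worth, stream-lib (which this code appears to be adapted from)
narrows this contract with a dedicated checked exception; a minimal sketch of
that pattern (class name illustrative):

    /** Thrown when estimators that cannot be merged are combined. */
    public class CardinalityMergeException extends Exception {
        public CardinalityMergeException(String message) {
            super(message);
        }
    }

The merge signature could then read:
`ICardinality merge(ICardinality... estimators) throws CardinalityMergeException;`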


---


[GitHub] flink pull request #4652: [FLINK-7465][table]Add cardinality count for table...

2017-09-25 Thread jparkie
Github user jparkie commented on a diff in the pull request:

https://github.com/apache/flink/pull/4652#discussion_r140826158
  
--- Diff: 
flink-libraries/flink-table/src/main/java/org/apache/flink/table/runtime/functions/aggfunctions/cardinality/MurmurHash.java
 ---
@@ -0,0 +1,247 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.table.runtime.functions.aggfunctions.cardinality;
+
+/**
+ * This is a very fast, non-cryptographic hash suitable for general hash-based
+ * lookup. See http://murmurhash.googlepages.com/ for more details.
+ * 
+ */
+public class MurmurHash {
--- End diff --

What was the decision in choosing MurmurHash2 (from
https://github.com/tnm/murmurhash-java) over MurmurHash3 (as in
https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/Murmur3_32HashFunction.java)?

If I recall correctly, MurmurHash3 is more collision-resistant and slightly
faster than MurmurHash2.
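
For illustration, here is how an element could be hashed with Guava's
MurmurHash3 before being offered to the estimator via `offerHashed(long)` (a
sketch, not part of the patch; assumes Guava is on the classpath):

    import com.google.common.hash.HashFunction;
    import com.google.common.hash.Hashing;

    import java.nio.charset.StandardCharsets;

    public class Murmur3Example {
        // 128-bit MurmurHash3; asLong() takes the first 8 bytes of the hash.
        private static final HashFunction MURMUR3 = Hashing.murmur3_128();

        public static void main(String[] args) {
            long hashed = MURMUR3.hashString("some-element", StandardCharsets.UTF_8).asLong();
            System.out.println(hashed); // feed this into ICardinality#offerHashed(long)
        }
    }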


---


[GitHub] flink pull request #4652: [FLINK-7465][table]Add cardinality count for table...

2017-09-25 Thread jparkie
Github user jparkie commented on a diff in the pull request:

https://github.com/apache/flink/pull/4652#discussion_r140823213
  
--- Diff: docs/dev/table/sql.md ---
@@ -2020,7 +2020,16 @@ COUNT(*)
 Returns the number of input rows.
   
 
-
+
+  
+{% highlight text %}
+CARDINALITY_COUNT(rsd, value)
--- End diff --

Would it be clearer to the user to include the word "approximate" in the
function name, so that the user understands the count is an estimate? I see
Apache Spark calls it `approx_count_distinct`
(https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/sql/functions.html#approx_count_distinct-org.apache.spark.sql.Column-double-)
and Redshift has `APPROXIMATE COUNT(DISTINCT column)`
(http://docs.aws.amazon.com/redshift/latest/dg/r_COUNT.html).


---