[GitHub] flink issue #5769: [FLINK-9069] Add checkstyle rule to detect multiple conse...
Github user jparkie commented on the issue: https://github.com/apache/flink/pull/5769 @zentol Bump. Just wondering if this got lost. :) ---
[GitHub] flink pull request #5769: [FLINK-9069] Add checkstyle rule to detect multipl...
GitHub user jparkie opened a pull request: https://github.com/apache/flink/pull/5769 [FLINK-9069] Add checkstyle rule to detect multiple consecutive semicolons ## What is the purpose of the change This pull request introduces a new checkstyle rule that detects multiple multiple consecutive semicolons like those at the end of a Java statement. ## Brief change log - Add new checkstyle rule. - Fix violations of new checkstyle rule. ## Verifying this change This change is a trivial rework / code cleanup without any test coverage. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (no) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no) - The serializers: (no) - The runtime per-record code paths (performance sensitive): (no) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no) - The S3 file system connector: (no) ## Documentation - Does this pull request introduce a new feature? (no) - If yes, how is the feature documented? (not applicable) You can merge this pull request into a Git repository by running: $ git pull https://github.com/jparkie/flink FLINK-9069_jparkie_fix_double_semicolons Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/5769.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5769 commit 61d82b0f51b7f6506b2802c30c56105ae6cc Author: jparkie <park.jacob.96@...> Date: 2018-03-26T01:09:25Z [FLINK-9069] Add checkstyle rule to detect multiple consecutive semicolons ---
[GitHub] flink pull request #4652: [FLINK-7465][table]Add cardinality count for table...
Github user jparkie commented on a diff in the pull request: https://github.com/apache/flink/pull/4652#discussion_r140833539 --- Diff: flink-libraries/flink-table/src/main/java/org/apache/flink/table/runtime/functions/aggfunctions/cardinality/HyperLogLog.java --- @@ -0,0 +1,333 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.table.runtime.functions.aggfunctions.cardinality; + +import java.io.ByteArrayInputStream; +import java.io.ByteArrayOutputStream; +import java.io.DataInput; +import java.io.DataInputStream; +import java.io.DataOutput; +import java.io.DataOutputStream; +import java.io.Externalizable; +import java.io.IOException; +import java.io.ObjectInput; +import java.io.ObjectInputStream; +import java.io.ObjectOutput; +import java.io.Serializable; + +/** + * Java implementation of HyperLogLog (HLL) algorithm from this paper: + * + * http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf + * + * HLL is an improved version of LogLog that is capable of estimating + * the cardinality of a set with accuracy = 1.04/sqrt(m) where + * m = 2^b. So we can control accuracy vs space usage by increasing + * or decreasing b. + * + * The main benefit of using HLL over LL is that it only requires 64% + * of the space that LL does to get the same accuracy. + * + * + * Note that this implementation does not include the long range correction function + * defined in the original paper. Empirical evidence shows that the correction + * function causes more harm than good. + * + */ +public class HyperLogLog implements ICardinality, Serializable { + + private final RegisterSet registerSet; + private final int log2m; + private final double alphaMM; + + + /** +* Create a new HyperLogLog instance using the specified standard deviation. +* +* @param rsd - the relative standard deviation for the counter. +*smaller values create counters that require more space. +*/ + public HyperLogLog(double rsd) { + this(log2m(rsd)); + } + + private static int log2m(double rsd) { + return (int) (Math.log((1.106 / rsd) * (1.106 / rsd)) / Math.log(2)); + } + + private static double rsd(int log2m) { + return 1.106 / Math.sqrt(Math.exp(log2m * Math.log(2))); + } + + private static void validateLog2m(int log2m) { + if (log2m < 0 || log2m > 30) { + throw new IllegalArgumentException("log2m argument is " + + log2m + " and is outside the range [0, 30]"); + } + } + + private static double linearCounting(int m, double v) { + return m * Math.log(m / v); + } + + /** +* Create a new HyperLogLog instance. The log2m parameter defines the accuracy of +* the counter. The larger the log2m the better the accuracy. +* +* accuracy = 1.04/sqrt(2^log2m) +* +* @param log2m - the number of bits to use as the basis for the HLL instance +*/ + public HyperLogLog(int log2m) { + this(log2m, new RegisterSet(1 << log2m)); + } + + /** +* Creates a new HyperLogLog instance using the given registers. Used for unmarshalling a serialized +* instance and for merging multiple counters together. +* +* @param registerSet - the initial values for the register set +*/ + public HyperLogLog(int log2m, RegisterSet registerSet) { + validateLog2m(log2m); + this.registerSet = registerSet; + this.log2m = log2m; + int m = 1 << this.log2m; + + alphaMM = getAlphaMM(log2m, m); + } + + @Override + public boolean offerHashed(long hashedValue) { +
[GitHub] flink pull request #4652: [FLINK-7465][table]Add cardinality count for table...
Github user jparkie commented on a diff in the pull request: https://github.com/apache/flink/pull/4652#discussion_r140832701 --- Diff: flink-libraries/flink-table/src/main/java/org/apache/flink/table/runtime/functions/aggfunctions/cardinality/HyperLogLog.java --- @@ -0,0 +1,333 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.table.runtime.functions.aggfunctions.cardinality; + +import java.io.ByteArrayInputStream; +import java.io.ByteArrayOutputStream; +import java.io.DataInput; +import java.io.DataInputStream; +import java.io.DataOutput; +import java.io.DataOutputStream; +import java.io.Externalizable; +import java.io.IOException; +import java.io.ObjectInput; +import java.io.ObjectInputStream; +import java.io.ObjectOutput; +import java.io.Serializable; + +/** + * Java implementation of HyperLogLog (HLL) algorithm from this paper: + * + * http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf + * + * HLL is an improved version of LogLog that is capable of estimating + * the cardinality of a set with accuracy = 1.04/sqrt(m) where + * m = 2^b. So we can control accuracy vs space usage by increasing + * or decreasing b. + * + * The main benefit of using HLL over LL is that it only requires 64% + * of the space that LL does to get the same accuracy. + * + * + * Note that this implementation does not include the long range correction function + * defined in the original paper. Empirical evidence shows that the correction + * function causes more harm than good. + * + */ +public class HyperLogLog implements ICardinality, Serializable { --- End diff -- I see this class is adapted from https://github.com/addthis/stream-lib. I think you should comment that this class was adapted from the link, so people can track differences. ---
[GitHub] flink pull request #4652: [FLINK-7465][table]Add cardinality count for table...
Github user jparkie commented on a diff in the pull request: https://github.com/apache/flink/pull/4652#discussion_r140830364 --- Diff: flink-libraries/flink-table/src/main/java/org/apache/flink/table/runtime/functions/aggfunctions/cardinality/ICardinality.java --- @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.table.runtime.functions.aggfunctions.cardinality; + +import java.io.IOException; + +/** + * An interface definition for implementation of cardinality. + */ +public interface ICardinality { --- End diff -- Why is the interface named `ICardinlaity`? Is this common in the Flink codebase? I know prefixing "I" is common in C#. ---
[GitHub] flink pull request #4652: [FLINK-7465][table]Add cardinality count for table...
Github user jparkie commented on a diff in the pull request: https://github.com/apache/flink/pull/4652#discussion_r140828795 --- Diff: flink-libraries/flink-table/src/main/java/org/apache/flink/table/runtime/functions/aggfunctions/cardinality/ICardinality.java --- @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.table.runtime.functions.aggfunctions.cardinality; + +import java.io.IOException; + +/** + * An interface definition for implementation of cardinality. + */ +public interface ICardinality { + + /** +* Check whether the element is impact estimate. +* +* @param o stream element +* @return false if the value returned by cardinality() is unaffected by the appearance of o in the stream. +*/ + boolean offer(Object o); + + /** +* Offer the value as a hashed long value. +* +* @param hashedLong - the hash of the item to offer to the estimator +* @return false if the value returned by cardinality() is unaffected by the appearance of hashedLong in the stream +*/ + boolean offerHashed(long hashedLong); + + /** +* Offer the value as a hashed long value. +* +* @param hashedInt - the hash of the item to offer to the estimator +* @return false if the value returned by cardinality() is unaffected by the appearance of hashedInt in the stream +*/ + boolean offerHashed(int hashedInt); + + /** +* @return the number of unique elements in the stream or an estimate thereof. +*/ + long cardinality(); + + /** +* @return size in bytes needed for serialization. +*/ + int sizeof(); + + /** +* Get the byte array used for the calculation. +* +* @return The byte array used for the calculation +* @throws IOException +*/ + byte[] getBytes() throws IOException; + + /** +* Merges estimators to produce a new estimator for the combined streams +* of this estimator and those passed as arguments. +* +* Nor this estimator nor the one passed as parameters are modified. +* +* @param estimators Zero or more compatible estimators +* @throws Exception If at least one of the estimators is not compatible with this one +*/ + ICardinality merge(ICardinality... estimators) throws Exception; --- End diff -- Wouldn't it be nicer to have a more specific Exception? ---
[GitHub] flink pull request #4652: [FLINK-7465][table]Add cardinality count for table...
Github user jparkie commented on a diff in the pull request: https://github.com/apache/flink/pull/4652#discussion_r140826158 --- Diff: flink-libraries/flink-table/src/main/java/org/apache/flink/table/runtime/functions/aggfunctions/cardinality/MurmurHash.java --- @@ -0,0 +1,247 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.table.runtime.functions.aggfunctions.cardinality; + +/** + * This is a very fast, non-cryptographic hash suitable for general hash-based + * lookup. See http://murmurhash.googlepages.com/ for more details. + * + */ +public class MurmurHash { --- End diff -- What was the decision in choosing MurmurHash2 (in https://github.com/tnm/murmurhash-java) versus MurmurHash3 (in https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/Murmur3_32HashFunction.java)? If I recall correctly, MurmurHash3 is more collision resistant and slightly faster than MurmurHash2. ---
[GitHub] flink pull request #4652: [FLINK-7465][table]Add cardinality count for table...
Github user jparkie commented on a diff in the pull request: https://github.com/apache/flink/pull/4652#discussion_r140823213 --- Diff: docs/dev/table/sql.md --- @@ -2020,7 +2020,16 @@ COUNT(*) Returns the number of input rows. - + + +{% highlight text %} +CARDINALITY_COUNT(rsd, value) --- End diff -- Would it be clearer to the user to have the function have the word "approximate" in it such that the user understands the count is an estimate? I see Apache Spark calls it `approx_count_distinct`(https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/sql/functions.html#approx_count_distinct-org.apache.spark.sql.Column-double-) and Redshift has it as `APPROXIMATE COUNT(DISTINCT column)` (http://docs.aws.amazon.com/redshift/latest/dg/r_COUNT.html). ---