[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15917 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...
Github user ponkin commented on a diff in the pull request: https://github.com/apache/spark/pull/15917#discussion_r88461735 --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/RoaringBitmapArray.java --- @@ -0,0 +1,136 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.sketch; + +import org.roaringbitmap.RoaringBitmap; + +import java.io.DataInputStream; +import java.io.DataOutputStream; +import java.io.IOException; +import java.util.Arrays; + +class RoaringBitmapArray { + +private final RoaringBitmap[] data; +private long bitCount; // number of 1`s in bitset --- End diff -- it is only need to calculate expectedFpp (false positive rate) which is not very frequent operation, I agree. will remove --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...
Github user ponkin commented on a diff in the pull request: https://github.com/apache/spark/pull/15917#discussion_r88460248 --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/Murmur3_128.java --- @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.sketch; + +/** + * 128-bit Murmur3 hasher. + * Best perfomance is on x86_64 platform + * Based on implementation https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp + */ +final class Murmur3_128 { --- End diff -- In fact the is Murmur hash version which produces 64 - it is called Murmur 2 [Wiki](https://en.wikipedia.org/wiki/MurmurHash), but it is previous version of Murmur, current is Murmur 3 which is faster and give better distribution(Murmur2 has some flaws). Since in bloom filter we always need 1 or more hash functions, creating 128 bit hash(two longs) is pretty good I suppose. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15917#discussion_r88456816 --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilter.java --- @@ -233,4 +233,5 @@ public static BloomFilter create(long expectedNumItems, long numBits) { return new BloomFilterImpl(optimalNumOfHashFunctions(expectedNumItems, numBits), numBits); } + --- End diff -- Go ahead and back out this spurious change; there's another newline at the end of the next file --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15917#discussion_r88455345 --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/RoaringBitmapArray.java --- @@ -0,0 +1,136 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.sketch; + +import org.roaringbitmap.RoaringBitmap; + +import java.io.DataInputStream; +import java.io.DataOutputStream; +import java.io.IOException; +import java.util.Arrays; + +class RoaringBitmapArray { --- End diff -- It's certainly worth documenting this in the scaladoc BTW, why the class has to exist. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15917#discussion_r88457983 --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/Murmur3_128.java --- @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.sketch; + +/** + * 128-bit Murmur3 hasher. + * Best perfomance is on x86_64 platform + * Based on implementation https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp + */ +final class Murmur3_128 { --- End diff -- OK, maybe that makes sense. If bothering with a custom impl, why not 64 bits instead of 128? it would avoid needing a long[2] to take the result? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15917#discussion_r88455455 --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/RoaringBitmapArray.java --- @@ -0,0 +1,136 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.sketch; + +import org.roaringbitmap.RoaringBitmap; + +import java.io.DataInputStream; +import java.io.DataOutputStream; +import java.io.IOException; +import java.util.Arrays; + +class RoaringBitmapArray { + +private final RoaringBitmap[] data; +private long bitCount; // number of 1`s in bitset +private final long numBits; // total number of available bits + +private static int numOfBuckets(long numBits) { +if (numBits <= 0) { +throw new IllegalArgumentException("numBits must be positive, but got " + numBits); +} +return (int) Math.ceil(numBits / (double)Integer.MAX_VALUE); +} + +private static RoaringBitmap[] initialVector(int numOfBuckets){ --- End diff -- Small style things: space before braces, and after casts, as in a few lines above. Spaces around elements of a for loop, as below. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15917#discussion_r88455531 --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/RoaringBitmapArray.java --- @@ -0,0 +1,136 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.sketch; + +import org.roaringbitmap.RoaringBitmap; + +import java.io.DataInputStream; +import java.io.DataOutputStream; +import java.io.IOException; +import java.util.Arrays; + +class RoaringBitmapArray { + +private final RoaringBitmap[] data; +private long bitCount; // number of 1`s in bitset +private final long numBits; // total number of available bits + +private static int numOfBuckets(long numBits) { +if (numBits <= 0) { +throw new IllegalArgumentException("numBits must be positive, but got " + numBits); +} +return (int) Math.ceil(numBits / (double)Integer.MAX_VALUE); +} + +private static RoaringBitmap[] initialVector(int numOfBuckets){ +RoaringBitmap[] vector = new RoaringBitmap[numOfBuckets]; +for(int i = 0;i
[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15917#discussion_r88456006 --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/RoaringBitmapArray.java --- @@ -0,0 +1,136 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.sketch; + +import org.roaringbitmap.RoaringBitmap; + +import java.io.DataInputStream; +import java.io.DataOutputStream; +import java.io.IOException; +import java.util.Arrays; + +class RoaringBitmapArray { + +private final RoaringBitmap[] data; +private long bitCount; // number of 1`s in bitset --- End diff -- Is it worth maintaining this? it slows down paths like set; is it called frequently enough to cache? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15917#discussion_r88455844 --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/RoaringBitmapArray.java --- @@ -0,0 +1,136 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.sketch; + +import org.roaringbitmap.RoaringBitmap; + +import java.io.DataInputStream; +import java.io.DataOutputStream; +import java.io.IOException; +import java.util.Arrays; + +class RoaringBitmapArray { + +private final RoaringBitmap[] data; +private long bitCount; // number of 1`s in bitset +private final long numBits; // total number of available bits + +private static int numOfBuckets(long numBits) { +if (numBits <= 0) { +throw new IllegalArgumentException("numBits must be positive, but got " + numBits); +} +return (int) Math.ceil(numBits / (double)Integer.MAX_VALUE); +} + +private static RoaringBitmap[] initialVector(int numOfBuckets){ +RoaringBitmap[] vector = new RoaringBitmap[numOfBuckets]; +for(int i = 0;i
[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15917#discussion_r88455796 --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/RoaringBitmapArray.java --- @@ -0,0 +1,136 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.sketch; + +import org.roaringbitmap.RoaringBitmap; + +import java.io.DataInputStream; +import java.io.DataOutputStream; +import java.io.IOException; +import java.util.Arrays; + +class RoaringBitmapArray { + +private final RoaringBitmap[] data; +private long bitCount; // number of 1`s in bitset +private final long numBits; // total number of available bits + +private static int numOfBuckets(long numBits) { +if (numBits <= 0) { +throw new IllegalArgumentException("numBits must be positive, but got " + numBits); +} +return (int) Math.ceil(numBits / (double)Integer.MAX_VALUE); +} + +private static RoaringBitmap[] initialVector(int numOfBuckets){ +RoaringBitmap[] vector = new RoaringBitmap[numOfBuckets]; +for(int i = 0;i
[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15917#discussion_r88455809 --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/RoaringBitmapArray.java --- @@ -0,0 +1,136 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.sketch; + +import org.roaringbitmap.RoaringBitmap; + +import java.io.DataInputStream; +import java.io.DataOutputStream; +import java.io.IOException; +import java.util.Arrays; + +class RoaringBitmapArray { + +private final RoaringBitmap[] data; +private long bitCount; // number of 1`s in bitset +private final long numBits; // total number of available bits + +private static int numOfBuckets(long numBits) { +if (numBits <= 0) { +throw new IllegalArgumentException("numBits must be positive, but got " + numBits); +} +return (int) Math.ceil(numBits / (double)Integer.MAX_VALUE); +} + +private static RoaringBitmap[] initialVector(int numOfBuckets){ +RoaringBitmap[] vector = new RoaringBitmap[numOfBuckets]; +for(int i = 0;i
[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...
Github user ponkin commented on a diff in the pull request: https://github.com/apache/spark/pull/15917#discussion_r88441473 --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/Murmur3_128.java --- @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.sketch; + +/** + * 128-bit Murmur3 hasher. + * Best perfomance is on x86_64 platform + * Based on implementation https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp + */ +final class Murmur3_128 { --- End diff -- Well Guava`s implementation is allocating a lot of temp objects(works only with ByteBuffer [source](https://github.com/google/guava/blob/d330fc7f880b49e9d45a44e7650d95a6f7afbe23/guava/src/com/google/common/hash/Murmur3_128HashFunction.java)). So I decided to do standalone implementation - all tests for Hasher are taken from Guava lib. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...
Github user ponkin commented on a diff in the pull request: https://github.com/apache/spark/pull/15917#discussion_r88441043 --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/RoaringBitmapArray.java --- @@ -0,0 +1,136 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.sketch; + +import org.roaringbitmap.RoaringBitmap; + +import java.io.DataInputStream; +import java.io.DataOutputStream; +import java.io.IOException; +import java.util.Arrays; + +class RoaringBitmapArray { --- End diff -- well, the main point of this improvement is to save memory - now even filters with few elements acquire space needed for full bloom filter. With this improvement bloom filter size will increase as we insert values in it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15917#discussion_r88440867 --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/Murmur3_128.java --- @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.sketch; + +/** + * 128-bit Murmur3 hasher. + * Best perfomance is on x86_64 platform + * Based on implementation https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp + */ +final class Murmur3_128 { --- End diff -- Rather than make a new implementation, use Guava's Hashing? Scala and Spark itself already have murmur hash impls but they seem 32-bit only --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...
Github user ponkin commented on a diff in the pull request: https://github.com/apache/spark/pull/15917#discussion_r88440366 --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/Murmur3_128.java --- @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.sketch; + +/** + * 128-bit Murmur3 hasher. + * Best perfomance is on x86_64 platform + * Based on implementation https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp + */ +final class Murmur3_128 { --- End diff -- This decision come from fact that RoaringBitmap supports only int indexes(and can have only Int.MaxValue bits), since we use RoaringBitmap[] to represent vector way larger than Int.MaxValue, we need good hash function to generate long hash values. This implementation is based on [smhasher](https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp) and [solr implementation](https://github.com/yonik/java_util/blob/master/src/util/hash/MurmurHash3.java) which are both freely available I suppose. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15917#discussion_r88440268 --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/RoaringBitmapArray.java --- @@ -0,0 +1,136 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.sketch; + +import org.roaringbitmap.RoaringBitmap; + +import java.io.DataInputStream; +import java.io.DataOutputStream; +import java.io.IOException; +import java.util.Arrays; + +class RoaringBitmapArray { --- End diff -- Hm, I see: https://github.com/RoaringBitmap/RoaringBitmap/issues/109 I wonder if it's worth using RoaringBitmap then, but, of course making the improvements we want from RB would take work too, and, removing code helps. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...
Github user ponkin commented on a diff in the pull request: https://github.com/apache/spark/pull/15917#discussion_r88439826 --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/RoaringBitmapArray.java --- @@ -0,0 +1,136 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.sketch; + +import org.roaringbitmap.RoaringBitmap; + +import java.io.DataInputStream; +import java.io.DataOutputStream; +import java.io.IOException; +import java.util.Arrays; + +class RoaringBitmapArray { --- End diff -- Unfortunately, RoaringBitmap supports only int indexes [javadoc](http://www.javadoc.io/doc/org.roaringbitmap/RoaringBitmap/0.6.27). Since we need to support bloom filters with size more than Int.MaxValue(BitArray supported 64*IntMaxValue) I decided to partition bloom filter with RoaringBitmap[]. Now Bloom filter can be Long.MaxValue large in size(number of bits). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15917#discussion_r88434278 --- Diff: common/sketch/src/test/java/org/apache/spark/util/sketch/Murmur3_128Suite.java --- @@ -0,0 +1,30 @@ +package org.apache.spark.util.sketch; --- End diff -- Missing a license --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15917#discussion_r88434152 --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/Murmur3_128.java --- @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.sketch; + +/** + * 128-bit Murmur3 hasher. + * Best perfomance is on x86_64 platform + * Based on implementation https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp + */ +final class Murmur3_128 { --- End diff -- Could you explain why we need to add a custom murmurhash implementation? it seems like conceptually we're just swapping bitmaps for bitmapss. What's the license of this code? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15917#discussion_r88434426 --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/RoaringBitmapArray.java --- @@ -0,0 +1,136 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.sketch; + +import org.roaringbitmap.RoaringBitmap; + +import java.io.DataInputStream; +import java.io.DataOutputStream; +import java.io.IOException; +import java.util.Arrays; + +class RoaringBitmapArray { --- End diff -- Probably a dumb question but why do we need an array of bitmaps? I thought RoaringBitmap was a BitArray counterpart? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...
GitHub user ponkin opened a pull request: https://github.com/apache/spark/pull/15917 SPARK-18252: Using RoaringBitmap for bloom filters ## What changes were proposed in this pull request? + Bloom filter is now using RoaringBitmap as bit vector - to save memory usage + Bloom filter is using Murmur 128 bit hash ## How was this patch tested? All previous tests are valid Added more accurate test for Byte type Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ponkin/spark SPARK-18252 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15917.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15917 commit 2cd92c344aabd6cb93a2c14122149f9e2ad1884c Author: Alexey PonkinDate: 2016-11-14T08:06:08Z SPARK-18252: Using RoaringBitmap for bloom filters --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org