[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...

2016-12-07 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15917


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...

2016-11-17 Thread ponkin
Github user ponkin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15917#discussion_r88461735
  
--- Diff: 
common/sketch/src/main/java/org/apache/spark/util/sketch/RoaringBitmapArray.java
 ---
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.sketch;
+
+import org.roaringbitmap.RoaringBitmap;
+
+import java.io.DataInputStream;
+import java.io.DataOutputStream;
+import java.io.IOException;
+import java.util.Arrays;
+
+class RoaringBitmapArray {
+
+private final RoaringBitmap[] data;
+private long bitCount; // number of 1`s in bitset
--- End diff --

it is only need to calculate expectedFpp (false positive rate) which is not 
very frequent operation, I agree. will remove


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...

2016-11-17 Thread ponkin
Github user ponkin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15917#discussion_r88460248
  
--- Diff: 
common/sketch/src/main/java/org/apache/spark/util/sketch/Murmur3_128.java ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.sketch;
+
+/**
+ * 128-bit Murmur3 hasher.
+ * Best perfomance is on x86_64 platform
+ * Based on implementation 
https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp
+ */
+final class Murmur3_128 {
--- End diff --

In fact the is Murmur hash version which produces 64 - it is called Murmur 
2 [Wiki](https://en.wikipedia.org/wiki/MurmurHash), but it is previous version 
of Murmur, current is Murmur 3 which is faster and give better 
distribution(Murmur2 has some flaws). Since in bloom filter we always need 1 or 
more hash functions, creating 128 bit hash(two longs) is pretty good I suppose.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...

2016-11-17 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15917#discussion_r88456816
  
--- Diff: 
common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilter.java ---
@@ -233,4 +233,5 @@ public static BloomFilter create(long expectedNumItems, 
long numBits) {
 
 return new BloomFilterImpl(optimalNumOfHashFunctions(expectedNumItems, 
numBits), numBits);
   }
+
--- End diff --

Go ahead and back out this spurious change; there's another newline at the 
end of the next file


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...

2016-11-17 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15917#discussion_r88455345
  
--- Diff: 
common/sketch/src/main/java/org/apache/spark/util/sketch/RoaringBitmapArray.java
 ---
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.sketch;
+
+import org.roaringbitmap.RoaringBitmap;
+
+import java.io.DataInputStream;
+import java.io.DataOutputStream;
+import java.io.IOException;
+import java.util.Arrays;
+
+class RoaringBitmapArray {
--- End diff --

It's certainly worth documenting this in the scaladoc BTW, why the class 
has to exist.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...

2016-11-17 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15917#discussion_r88457983
  
--- Diff: 
common/sketch/src/main/java/org/apache/spark/util/sketch/Murmur3_128.java ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.sketch;
+
+/**
+ * 128-bit Murmur3 hasher.
+ * Best perfomance is on x86_64 platform
+ * Based on implementation 
https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp
+ */
+final class Murmur3_128 {
--- End diff --

OK, maybe that makes sense. If bothering with a custom impl, why not 64 
bits instead of 128? it would avoid needing a long[2] to take the result?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...

2016-11-17 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15917#discussion_r88455455
  
--- Diff: 
common/sketch/src/main/java/org/apache/spark/util/sketch/RoaringBitmapArray.java
 ---
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.sketch;
+
+import org.roaringbitmap.RoaringBitmap;
+
+import java.io.DataInputStream;
+import java.io.DataOutputStream;
+import java.io.IOException;
+import java.util.Arrays;
+
+class RoaringBitmapArray {
+
+private final RoaringBitmap[] data;
+private long bitCount; // number of 1`s in bitset
+private final long numBits; // total number of available bits
+
+private static int numOfBuckets(long numBits) {
+if (numBits <= 0) {
+throw new IllegalArgumentException("numBits must be positive, 
but got " + numBits);
+}
+return (int) Math.ceil(numBits / (double)Integer.MAX_VALUE);
+}
+
+private static RoaringBitmap[] initialVector(int numOfBuckets){
--- End diff --

Small style things: space before braces, and after casts, as in a few lines 
above. Spaces around elements of a for loop, as below.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...

2016-11-17 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15917#discussion_r88455531
  
--- Diff: 
common/sketch/src/main/java/org/apache/spark/util/sketch/RoaringBitmapArray.java
 ---
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.sketch;
+
+import org.roaringbitmap.RoaringBitmap;
+
+import java.io.DataInputStream;
+import java.io.DataOutputStream;
+import java.io.IOException;
+import java.util.Arrays;
+
+class RoaringBitmapArray {
+
+private final RoaringBitmap[] data;
+private long bitCount; // number of 1`s in bitset
+private final long numBits; // total number of available bits
+
+private static int numOfBuckets(long numBits) {
+if (numBits <= 0) {
+throw new IllegalArgumentException("numBits must be positive, 
but got " + numBits);
+}
+return (int) Math.ceil(numBits / (double)Integer.MAX_VALUE);
+}
+
+private static RoaringBitmap[] initialVector(int numOfBuckets){
+RoaringBitmap[] vector = new RoaringBitmap[numOfBuckets];
+for(int i = 0;i

[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...

2016-11-17 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15917#discussion_r88456006
  
--- Diff: 
common/sketch/src/main/java/org/apache/spark/util/sketch/RoaringBitmapArray.java
 ---
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.sketch;
+
+import org.roaringbitmap.RoaringBitmap;
+
+import java.io.DataInputStream;
+import java.io.DataOutputStream;
+import java.io.IOException;
+import java.util.Arrays;
+
+class RoaringBitmapArray {
+
+private final RoaringBitmap[] data;
+private long bitCount; // number of 1`s in bitset
--- End diff --

Is it worth maintaining this? it slows down paths like set; is it called 
frequently enough to cache?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...

2016-11-17 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15917#discussion_r88455844
  
--- Diff: 
common/sketch/src/main/java/org/apache/spark/util/sketch/RoaringBitmapArray.java
 ---
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.sketch;
+
+import org.roaringbitmap.RoaringBitmap;
+
+import java.io.DataInputStream;
+import java.io.DataOutputStream;
+import java.io.IOException;
+import java.util.Arrays;
+
+class RoaringBitmapArray {
+
+private final RoaringBitmap[] data;
+private long bitCount; // number of 1`s in bitset
+private final long numBits; // total number of available bits
+
+private static int numOfBuckets(long numBits) {
+if (numBits <= 0) {
+throw new IllegalArgumentException("numBits must be positive, 
but got " + numBits);
+}
+return (int) Math.ceil(numBits / (double)Integer.MAX_VALUE);
+}
+
+private static RoaringBitmap[] initialVector(int numOfBuckets){
+RoaringBitmap[] vector = new RoaringBitmap[numOfBuckets];
+for(int i = 0;i

[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...

2016-11-17 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15917#discussion_r88455796
  
--- Diff: 
common/sketch/src/main/java/org/apache/spark/util/sketch/RoaringBitmapArray.java
 ---
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.sketch;
+
+import org.roaringbitmap.RoaringBitmap;
+
+import java.io.DataInputStream;
+import java.io.DataOutputStream;
+import java.io.IOException;
+import java.util.Arrays;
+
+class RoaringBitmapArray {
+
+private final RoaringBitmap[] data;
+private long bitCount; // number of 1`s in bitset
+private final long numBits; // total number of available bits
+
+private static int numOfBuckets(long numBits) {
+if (numBits <= 0) {
+throw new IllegalArgumentException("numBits must be positive, 
but got " + numBits);
+}
+return (int) Math.ceil(numBits / (double)Integer.MAX_VALUE);
+}
+
+private static RoaringBitmap[] initialVector(int numOfBuckets){
+RoaringBitmap[] vector = new RoaringBitmap[numOfBuckets];
+for(int i = 0;i

[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...

2016-11-17 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15917#discussion_r88455809
  
--- Diff: 
common/sketch/src/main/java/org/apache/spark/util/sketch/RoaringBitmapArray.java
 ---
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.sketch;
+
+import org.roaringbitmap.RoaringBitmap;
+
+import java.io.DataInputStream;
+import java.io.DataOutputStream;
+import java.io.IOException;
+import java.util.Arrays;
+
+class RoaringBitmapArray {
+
+private final RoaringBitmap[] data;
+private long bitCount; // number of 1`s in bitset
+private final long numBits; // total number of available bits
+
+private static int numOfBuckets(long numBits) {
+if (numBits <= 0) {
+throw new IllegalArgumentException("numBits must be positive, 
but got " + numBits);
+}
+return (int) Math.ceil(numBits / (double)Integer.MAX_VALUE);
+}
+
+private static RoaringBitmap[] initialVector(int numOfBuckets){
+RoaringBitmap[] vector = new RoaringBitmap[numOfBuckets];
+for(int i = 0;i

[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...

2016-11-17 Thread ponkin
Github user ponkin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15917#discussion_r88441473
  
--- Diff: 
common/sketch/src/main/java/org/apache/spark/util/sketch/Murmur3_128.java ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.sketch;
+
+/**
+ * 128-bit Murmur3 hasher.
+ * Best perfomance is on x86_64 platform
+ * Based on implementation 
https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp
+ */
+final class Murmur3_128 {
--- End diff --

Well Guava`s implementation is allocating a lot of temp objects(works only 
with ByteBuffer 
[source](https://github.com/google/guava/blob/d330fc7f880b49e9d45a44e7650d95a6f7afbe23/guava/src/com/google/common/hash/Murmur3_128HashFunction.java)).
 So I decided to do standalone implementation - all tests for Hasher are taken 
from Guava lib.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...

2016-11-17 Thread ponkin
Github user ponkin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15917#discussion_r88441043
  
--- Diff: 
common/sketch/src/main/java/org/apache/spark/util/sketch/RoaringBitmapArray.java
 ---
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.sketch;
+
+import org.roaringbitmap.RoaringBitmap;
+
+import java.io.DataInputStream;
+import java.io.DataOutputStream;
+import java.io.IOException;
+import java.util.Arrays;
+
+class RoaringBitmapArray {
--- End diff --

well, the main point of this improvement is to save memory - now even 
filters with few elements acquire space needed for full bloom filter. With this 
improvement bloom filter size will increase as we insert values in it. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...

2016-11-17 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15917#discussion_r88440867
  
--- Diff: 
common/sketch/src/main/java/org/apache/spark/util/sketch/Murmur3_128.java ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.sketch;
+
+/**
+ * 128-bit Murmur3 hasher.
+ * Best perfomance is on x86_64 platform
+ * Based on implementation 
https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp
+ */
+final class Murmur3_128 {
--- End diff --

Rather than make a new implementation, use Guava's Hashing? Scala and Spark 
itself already have murmur hash impls but they seem 32-bit only


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...

2016-11-17 Thread ponkin
Github user ponkin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15917#discussion_r88440366
  
--- Diff: 
common/sketch/src/main/java/org/apache/spark/util/sketch/Murmur3_128.java ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.sketch;
+
+/**
+ * 128-bit Murmur3 hasher.
+ * Best perfomance is on x86_64 platform
+ * Based on implementation 
https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp
+ */
+final class Murmur3_128 {
--- End diff --

This decision come from fact that RoaringBitmap supports only int 
indexes(and can have only Int.MaxValue bits), since we use RoaringBitmap[] to 
represent vector way larger than Int.MaxValue, we need good hash function to 
generate long hash values. This implementation is based on 
[smhasher](https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp)
  and [solr 
implementation](https://github.com/yonik/java_util/blob/master/src/util/hash/MurmurHash3.java)
 which are both freely available I suppose.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...

2016-11-17 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15917#discussion_r88440268
  
--- Diff: 
common/sketch/src/main/java/org/apache/spark/util/sketch/RoaringBitmapArray.java
 ---
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.sketch;
+
+import org.roaringbitmap.RoaringBitmap;
+
+import java.io.DataInputStream;
+import java.io.DataOutputStream;
+import java.io.IOException;
+import java.util.Arrays;
+
+class RoaringBitmapArray {
--- End diff --

Hm, I see: https://github.com/RoaringBitmap/RoaringBitmap/issues/109
I wonder if it's worth using RoaringBitmap then, but, of course making the 
improvements we want from RB would take work too, and, removing code helps.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...

2016-11-17 Thread ponkin
Github user ponkin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15917#discussion_r88439826
  
--- Diff: 
common/sketch/src/main/java/org/apache/spark/util/sketch/RoaringBitmapArray.java
 ---
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.sketch;
+
+import org.roaringbitmap.RoaringBitmap;
+
+import java.io.DataInputStream;
+import java.io.DataOutputStream;
+import java.io.IOException;
+import java.util.Arrays;
+
+class RoaringBitmapArray {
--- End diff --

Unfortunately, RoaringBitmap supports only int indexes 
[javadoc](http://www.javadoc.io/doc/org.roaringbitmap/RoaringBitmap/0.6.27). 
Since we need to support bloom filters with size more than 
Int.MaxValue(BitArray supported 64*IntMaxValue) I decided to partition bloom 
filter with RoaringBitmap[]. Now Bloom filter can be Long.MaxValue large in 
size(number of bits).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...

2016-11-17 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15917#discussion_r88434278
  
--- Diff: 
common/sketch/src/test/java/org/apache/spark/util/sketch/Murmur3_128Suite.java 
---
@@ -0,0 +1,30 @@
+package org.apache.spark.util.sketch;
--- End diff --

Missing a license


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...

2016-11-17 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15917#discussion_r88434152
  
--- Diff: 
common/sketch/src/main/java/org/apache/spark/util/sketch/Murmur3_128.java ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.sketch;
+
+/**
+ * 128-bit Murmur3 hasher.
+ * Best perfomance is on x86_64 platform
+ * Based on implementation 
https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp
+ */
+final class Murmur3_128 {
--- End diff --

Could you explain why we need to add a custom murmurhash implementation? it 
seems like conceptually we're just swapping bitmaps for bitmapss. What's the 
license of this code?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...

2016-11-17 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15917#discussion_r88434426
  
--- Diff: 
common/sketch/src/main/java/org/apache/spark/util/sketch/RoaringBitmapArray.java
 ---
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.sketch;
+
+import org.roaringbitmap.RoaringBitmap;
+
+import java.io.DataInputStream;
+import java.io.DataOutputStream;
+import java.io.IOException;
+import java.util.Arrays;
+
+class RoaringBitmapArray {
--- End diff --

Probably a dumb question but why do we need an array of bitmaps? I thought 
RoaringBitmap was a BitArray counterpart?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15917: SPARK-18252: Using RoaringBitmap for bloom filter...

2016-11-17 Thread ponkin
GitHub user ponkin opened a pull request:

https://github.com/apache/spark/pull/15917

SPARK-18252: Using RoaringBitmap for bloom filters

## What changes were proposed in this pull request?

+ Bloom filter is now using RoaringBitmap as bit vector - to save memory 
usage
+ Bloom filter is using Murmur 128 bit hash

## How was this patch tested?

All previous tests are valid
Added more accurate test for Byte type

Please review 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before 
opening a pull request.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ponkin/spark SPARK-18252

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15917.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15917


commit 2cd92c344aabd6cb93a2c14122149f9e2ad1884c
Author: Alexey Ponkin 
Date:   2016-11-14T08:06:08Z

SPARK-18252: Using RoaringBitmap for bloom filters




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org