[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-15 Thread GitBox


yabola commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070907144


##
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java:
##
@@ -394,4 +395,24 @@ public long hash(float value) {
   public long hash(Binary value) {
     return hashFunction.hashBytes(value.getBytes());
   }
+
+  @Override
+  public void merge(BloomFilter otherBloomFilter) throws IOException {

Review Comment:
   I have added a `canMergeFrom` method; please take a look and let me know whether it is suitable.
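
A compatibility check of the kind described above might look roughly like the sketch below. It is a standalone illustration and not the code in this PR: `SimpleFilter`, its fields, and the second enum constants (`CLASSIC`, `MURMUR3`) are hypothetical, invented only so that a mismatch can be shown. The three properties it compares (algorithm, hash strategy, bitset size) are the ones checked in the `merge` hunk quoted in this thread.

```java
import java.util.Objects;

// Hypothetical stand-in for a Bloom filter; this is NOT the parquet-mr class.
final class SimpleFilter {
  // The second constant in each enum is invented purely for illustration.
  enum Algorithm { BLOCK, CLASSIC }
  enum HashStrategy { XXH64, MURMUR3 }

  final Algorithm algorithm;
  final HashStrategy hashStrategy;
  final byte[] bitset;

  SimpleFilter(Algorithm algorithm, HashStrategy hashStrategy, int numBytes) {
    this.algorithm = Objects.requireNonNull(algorithm);
    this.hashStrategy = Objects.requireNonNull(hashStrategy);
    this.bitset = new byte[numBytes];
  }

  // Returns true only when a bitwise-OR merge of the two bitsets is meaningful:
  // same algorithm, same hash strategy, and same bitset size.
  boolean canMergeFrom(SimpleFilter other) {
    return other != null
        && algorithm == other.algorithm
        && hashStrategy == other.hashStrategy
        && bitset.length == other.bitset.length;
  }
}
```

A caller could test `canMergeFrom` first and fall back to dropping the filter when it returns false, instead of relying on an exception from `merge`.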



##
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java:
##
@@ -394,4 +395,24 @@ public long hash(float value) {
   public long hash(Binary value) {
     return hashFunction.hashBytes(value.getBytes());
   }
+
+  @Override
+  public void merge(BloomFilter otherBloomFilter) throws IOException {
+    Preconditions.checkArgument(otherBloomFilter != null, "Cannot merge a null BloomFilter");
+    Preconditions.checkArgument((getAlgorithm() == otherBloomFilter.getAlgorithm()),
+      "BloomFilters must have the same algorithm (%s != %s)",
+        getAlgorithm(), otherBloomFilter.getAlgorithm());
+    Preconditions.checkArgument((getHashStrategy() == otherBloomFilter.getHashStrategy()),
+      "BloomFilters must have the same hashStrategy (%s != %s)",
+        getHashStrategy(), otherBloomFilter.getHashStrategy());
+    Preconditions.checkArgument((getBitsetSize() == otherBloomFilter.getBitsetSize()),

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-15 Thread GitBox


yabola commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070638035


##
parquet-column/src/test/java/org/apache/parquet/column/values/bloomfilter/TestBlockSplitBloomFilter.java:
##
@@ -181,6 +182,83 @@ public void testBloomFilterNDVs(){
     assertTrue(bytes < 5 * 1024 * 1024);
   }
 
+  @Test
+  public void testMergeEmptyBloomFilter() throws IOException {

Review Comment:
   I added a test for the case where two BFs are not compatible.






[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-15 Thread GitBox


yabola commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070637510


##
parquet-column/src/test/java/org/apache/parquet/column/values/bloomfilter/TestBlockSplitBloomFilter.java:
##
@@ -181,6 +182,60 @@ public void testBloomFilterNDVs(){
     assertTrue(bytes < 5 * 1024 * 1024);
   }
 
+  @Test
+  public void testMergeBloomFilter() throws IOException {
+    Random random = new Random();
+    int numBytes = BlockSplitBloomFilter.optimalNumOfBits(1024 * 1024, 0.01) / 8;
+    BloomFilter otherBloomFilter = new BlockSplitBloomFilter(numBytes);
+    BloomFilter mergedBloomFilter = new BlockSplitBloomFilter(numBytes);
+
+    Set<String> originStrings = new HashSet<>();
+    Set<String> testStrings = new HashSet<>();
+    Set<Integer> testInts = new HashSet<>();
+    Set<Double> testDoubles = new HashSet<>();
+    Set<Float> testFloats = new HashSet<>();
+    for (int i = 0; i < 1024; i++) {
+
+      String originStrValue = RandomStringUtils.randomAlphabetic(1, 64);
+      originStrings.add(originStrValue);
+      mergedBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(originStrValue)));
+
+      String testString = RandomStringUtils.randomAlphabetic(1, 64);
+      testStrings.add(testString);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(testString)));
+
+      int testInt = random.nextInt();
+      testInts.add(testInt);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testInt));
+
+      double testDouble = random.nextDouble();
+      testDoubles.add(testDouble);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testDouble));
+
+      float testFloat = random.nextFloat();
+      testFloats.add(testFloat);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testFloat));
+    }
+    mergedBloomFilter.merge(otherBloomFilter);
+    for (String testString : originStrings) {
+      assertTrue(mergedBloomFilter.findHash(mergedBloomFilter.hash(Binary.fromString(testString))));

Review Comment:
   > Actually it does not matter what the data type is. We can simplify the 
test by writing two lists of hash values to create two BFs. Then we are pretty 
sure the test result of each value. What do you think?
   
   Good idea. I removed the random number logic.
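
The reviewer's suggestion (build two filters from two fixed lists of hash values, merge, then check every value) can be sketched as follows. This is only an illustration under stated assumptions: `ToyBloomFilter` is a hypothetical, much simplified filter that sets two bits per hash, not `BlockSplitBloomFilter`, and `MergeTestSketch` is an invented helper. The point is that with fixed inputs the expected result of each lookup is known in advance.

```java
import java.util.BitSet;

// Toy filter used only to illustrate the test shape; NOT BlockSplitBloomFilter.
final class ToyBloomFilter {
  private final int numBits;
  private final BitSet bits;

  ToyBloomFilter(int numBits) {
    this.numBits = numBits;
    this.bits = new BitSet(numBits);
  }

  // Sets two bit positions derived from the lower and upper 32 bits of the hash.
  void insertHash(long hash) {
    bits.set(Math.floorMod((int) hash, numBits));
    bits.set(Math.floorMod((int) (hash >>> 32), numBits));
  }

  // Reports true when both positions for this hash are set.
  boolean findHash(long hash) {
    return bits.get(Math.floorMod((int) hash, numBits))
        && bits.get(Math.floorMod((int) (hash >>> 32), numBits));
  }

  // Bitwise OR of the other filter's bits into this one.
  void merge(ToyBloomFilter other) {
    if (other == null || other.numBits != numBits) {
      throw new IllegalArgumentException("Incompatible filters");
    }
    bits.or(other.bits);
  }
}

final class MergeTestSketch {
  // Returns true when every hash from both lists is found after the merge.
  static boolean mergedContainsAll(long[] first, long[] second) {
    ToyBloomFilter merged = new ToyBloomFilter(1024);
    ToyBloomFilter other = new ToyBloomFilter(1024);
    for (long h : first) {
      merged.insertHash(h);
    }
    for (long h : second) {
      other.insertHash(h);
    }
    merged.merge(other);
    for (long h : first) {
      if (!merged.findHash(h)) {
        return false;
      }
    }
    for (long h : second) {
      if (!merged.findHash(h)) {
        return false;
      }
    }
    return true;
  }
}
```

Because an OR merge only ever sets bits, a lookup for any inserted hash cannot fail after the merge, so the check is deterministic and needs no random data.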






[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-15 Thread GitBox


yabola commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070637029


##
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java:
##
@@ -394,4 +395,24 @@ public long hash(float value) {
   public long hash(Binary value) {
     return hashFunction.hashBytes(value.getBytes());
   }
+
+  @Override
+  public void merge(BloomFilter otherBloomFilter) throws IOException {
+    Preconditions.checkArgument(otherBloomFilter != null, "Cannot merge a null BloomFilter");
+    Preconditions.checkArgument((getAlgorithm() == otherBloomFilter.getAlgorithm()),
+      String.format("BloomFilters must have the same algorithm (%s != %s)",
+        getAlgorithm(), otherBloomFilter.getAlgorithm()));
+    Preconditions.checkArgument((getHashStrategy() == otherBloomFilter.getHashStrategy()),
+      String.format("BloomFilters must have the same hashStrategy (%s != %s)",
+        getHashStrategy(), otherBloomFilter.getHashStrategy()));
+    Preconditions.checkArgument((getBitsetSize() == otherBloomFilter.getBitsetSize()),
+      String.format("BloomFilters must have the same size of bitsets (%s != %s)",
+        getBitsetSize(), otherBloomFilter.getBitsetSize()));
+    ByteArrayOutputStream otherOutputStream = new ByteArrayOutputStream();
+    otherBloomFilter.writeTo(otherOutputStream);
+    byte[] otherBits = otherOutputStream.toByteArray();

Review Comment:
   I already check `getBitsetSize` above, so we may not need to check `bitset.length` as well?






[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-15 Thread GitBox


yabola commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070636680


##
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java:
##
@@ -398,18 +398,21 @@ public long hash(Binary value) {
 
   @Override
   public void merge(BloomFilter otherBloomFilter) throws IOException {
-    Preconditions.checkArgument((otherBloomFilter.getAlgorithm() == getAlgorithm()),
-      "BloomFilter algorithm should be same");
-    Preconditions.checkArgument((otherBloomFilter.getHashStrategy() == getHashStrategy()),
-      "BloomFilter hashStrategy should be same");
-    Preconditions.checkArgument((otherBloomFilter.getBitsetSize() == getBitsetSize()),
-      "BloomFilter bitset size should be same");
+    Preconditions.checkArgument(otherBloomFilter != null, "Cannot merge a null BloomFilter");
+    Preconditions.checkArgument((getAlgorithm() == otherBloomFilter.getAlgorithm()),

Review Comment:
   Thanks, done






[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-15 Thread GitBox


yabola commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070591022


##
parquet-column/src/test/java/org/apache/parquet/column/values/bloomfilter/TestBlockSplitBloomFilter.java:
##
@@ -181,6 +182,60 @@ public void testBloomFilterNDVs(){
     assertTrue(bytes < 5 * 1024 * 1024);
   }
 
+  @Test
+  public void testMergeBloomFilter() throws IOException {
+    Random random = new Random();
+    int numBytes = BlockSplitBloomFilter.optimalNumOfBits(1024 * 1024, 0.01) / 8;
+    BloomFilter otherBloomFilter = new BlockSplitBloomFilter(numBytes);
+    BloomFilter mergedBloomFilter = new BlockSplitBloomFilter(numBytes);
+
+    Set<String> originStrings = new HashSet<>();
+    Set<String> testStrings = new HashSet<>();
+    Set<Integer> testInts = new HashSet<>();
+    Set<Double> testDoubles = new HashSet<>();
+    Set<Float> testFloats = new HashSet<>();
+    for (int i = 0; i < 1024; i++) {
+
+      String originStrValue = RandomStringUtils.randomAlphabetic(1, 64);
+      originStrings.add(originStrValue);
+      mergedBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(originStrValue)));
+
+      String testString = RandomStringUtils.randomAlphabetic(1, 64);
+      testStrings.add(testString);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(testString)));
+
+      int testInt = random.nextInt();
+      testInts.add(testInt);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testInt));
+
+      double testDouble = random.nextDouble();
+      testDoubles.add(testDouble);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testDouble));
+
+      float testFloat = random.nextFloat();
+      testFloats.add(testFloat);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testFloat));
+    }
+    mergedBloomFilter.merge(otherBloomFilter);
+    for (String testString : originStrings) {
+      assertTrue(mergedBloomFilter.findHash(mergedBloomFilter.hash(Binary.fromString(testString))));

Review Comment:
   If the BloomFilter to be merged is not empty, there is a small probability 
that the two BloomFilters will give inconsistent answers when checking whether a 
hash value is present (I added random values).
   
   But if the BloomFilter to be merged is empty at the beginning, the results 
from the two BloomFilters should always be the same.
   
   I added two different test cases; I am not sure whether I need to add more.
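
The difference between the two cases can be shown at the bit level. The sketch below assumes one reading of the comment, namely that "the BloomFilter to be merged" is the filter that receives the merge. `MergeConsistencySketch`, `or`, and `matches` are hypothetical helpers written for illustration, and a lookup is modelled simply as "all bits of a probe mask are set".

```java
// Hypothetical helpers for illustration; not part of parquet-mr.
final class MergeConsistencySketch {
  // Bitwise OR of two equally sized bitsets, returned as a new array.
  static byte[] or(byte[] target, byte[] source) {
    if (target.length != source.length) {
      throw new IllegalArgumentException("Bitsets must have the same size");
    }
    byte[] out = target.clone();
    for (int i = 0; i < out.length; i++) {
      out[i] |= source[i];
    }
    return out;
  }

  // A probe "matches" when every bit set in the mask is also set in the bitset.
  static boolean matches(byte[] bitset, byte[] mask) {
    for (int i = 0; i < mask.length; i++) {
      if ((bitset[i] & mask[i]) != mask[i]) {
        return false;
      }
    }
    return true;
  }
}
```

With an all-zero target, `or(target, source)` is bit-for-bit equal to `source`, so both filters answer every lookup identically. With a non-empty target, the merged bitset can contain a combination of bits that neither input had on its own, so a probe that `source` rejects may be accepted after the merge.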



[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-15 Thread GitBox


yabola commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070590290


##
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java:
##
@@ -394,4 +395,21 @@ public long hash(float value) {
   public long hash(Binary value) {
     return hashFunction.hashBytes(value.getBytes());
   }
+
+  @Override
+  public void merge(BloomFilter otherBloomFilter) throws IOException {
+    Preconditions.checkArgument((otherBloomFilter.getAlgorithm() == getAlgorithm()),

Review Comment:
   done






[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-15 Thread GitBox


yabola commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070590238


##
parquet-column/src/test/java/org/apache/parquet/column/values/bloomfilter/TestBlockSplitBloomFilter.java:
##
@@ -181,6 +182,60 @@ public void testBloomFilterNDVs(){
     assertTrue(bytes < 5 * 1024 * 1024);
   }
 
+  @Test
+  public void testMergeBloomFilter() throws IOException {
+    Random random = new Random();
+    int numBytes = BlockSplitBloomFilter.optimalNumOfBits(1024 * 1024, 0.01) / 8;
+    BloomFilter otherBloomFilter = new BlockSplitBloomFilter(numBytes);
+    BloomFilter mergedBloomFilter = new BlockSplitBloomFilter(numBytes);
+
+    Set<String> originStrings = new HashSet<>();
Review Comment:
   I have split the different types into different parameters; I am not sure 
whether this is suitable.



##
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BloomFilter.java:
##
@@ -176,4 +176,13 @@ public String toString() {
    * @return compress algorithm that the bloom filter apply
    */
   Compression getCompression();
+
+  /**
+   * Combines this Bloom filter with another Bloom filter by performing a bitwise OR of the underlying bits
+   *
+   * @param otherBloomFilter The Bloom filter to combine this Bloom filter with.
+   */
+  default void merge(BloomFilter otherBloomFilter) throws IOException {
+    throw new UnsupportedOperationException("Not supported merge operation.");

Review Comment:
   done
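
   For readers following the thread, here is a minimal sketch of what "a bitwise OR of the underlying bits" amounts to, shown on plain byte arrays rather than on the actual `BlockSplitBloomFilter` internals. The class `BitsetMergeSketch` and the method `mergeInto` are hypothetical names used only for illustration; they are not part of parquet-mr.

```java
import java.util.Arrays;

// Illustrative only: a stand-in for the bitset held by a Bloom filter.
final class BitsetMergeSketch {

  // Hypothetical helper: ORs every byte of `other` into `target`, in place.
  // Both bitsets must have the same length, otherwise bit positions would not line up.
  public static void mergeInto(byte[] target, byte[] other) {
    if (target.length != other.length) {
      throw new IllegalArgumentException(
          "Bitsets must have the same size (" + target.length + " != " + other.length + ")");
    }
    for (int i = 0; i < target.length; i++) {
      target[i] |= other[i];
    }
  }

  public static void main(String[] args) {
    byte[] a = {0b0101, 0b0000};
    byte[] b = {0b0011, 0b1000};
    mergeInto(a, b);
    // Every bit set in either input is set in the result.
    System.out.println(Arrays.toString(a)); // prints [7, 8]
  }
}
```

   Because a bit is only ever set, never cleared, any value that was reported as present by either input is still reported as present after the merge.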






[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-14 Thread GitBox


yabola commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070267028


##
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BloomFilter.java:
##
@@ -176,4 +176,10 @@ public String toString() {
* @return compress algorithm that the bloom filter apply
*/
   Compression getCompression();
+
+  /**
+   * Combines this Bloom filter with another Bloom filter by performing a bitwise OR of the underlying data
+   * @param otherBloomFilter The Bloom filter to combine this Bloom filter with.
+   */
+  void putAll(BloomFilter otherBloomFilter) throws IOException;

Review Comment:
   > IMHO, user needs to know if two BFs are compatible to merge before calling this function. So an utility function to test compatibility of two BFs is also required. WDYT?
   
   Thanks for your review. I have added some checks before merging the Bloom filters. As far as I can see, Guava does not seem to have an additional utility function for this. If necessary, I can add a utility function.
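
   A rough sketch of what such a compatibility check could look like, assuming the properties to compare are the algorithm, the hash strategy, and the bitset size (the ones checked elsewhere in this PR). `FilterInfo` and `MergeCompatibilitySketch` are hypothetical stand-ins, not parquet-mr classes, and the strings are placeholder values.

```java
// Illustrative only: compares the properties that must match before two filters are merged.
final class MergeCompatibilitySketch {

  // Hypothetical holder for the metadata discussed in this thread.
  public static final class FilterInfo {
    final String algorithm;
    final String hashStrategy;
    final int bitsetSize;

    public FilterInfo(String algorithm, String hashStrategy, int bitsetSize) {
      this.algorithm = algorithm;
      this.hashStrategy = hashStrategy;
      this.bitsetSize = bitsetSize;
    }
  }

  // Returns true only when `other` is non-null and all three properties match.
  public static boolean canMergeFrom(FilterInfo self, FilterInfo other) {
    return other != null
        && self.algorithm.equals(other.algorithm)
        && self.hashStrategy.equals(other.hashStrategy)
        && self.bitsetSize == other.bitsetSize;
  }

  public static void main(String[] args) {
    FilterInfo a = new FilterInfo("BLOCK", "XXH64", 1024);
    FilterInfo b = new FilterInfo("BLOCK", "XXH64", 1024);
    FilterInfo c = new FilterInfo("BLOCK", "XXH64", 2048);
    System.out.println(canMergeFrom(a, b)); // prints true
    System.out.println(canMergeFrom(a, c)); // prints false
  }
}
```

   Returning a boolean lets the caller decide how to handle incompatible filters instead of having to catch an exception from the merge itself.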






[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-14 Thread GitBox


yabola commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070267080


##
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java:
##
@@ -394,4 +395,21 @@ public long hash(float value) {
   public long hash(Binary value) {
 return hashFunction.hashBytes(value.getBytes());
   }
+
+  @Override
+  public void putAll(BloomFilter otherBloomFilter) throws IOException {

Review Comment:
   > Could you add some tests to verify if the merged BF is as expected?
   
   Sure, I will add some unit tests later.
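
   One property such a test could verify is that the merged filter still reports every value inserted into either input. The sketch below shows the idea on a toy Bloom filter built on `java.util.BitSet`; `ToyBloomFilter` and its hashing are assumptions made for illustration and are unrelated to the `BlockSplitBloomFilter` implementation.

```java
import java.util.BitSet;

// Illustrative only: checks that merging two toy Bloom filters introduces no false negatives.
final class MergedFilterCheckSketch {

  // Toy filter with two simple hash probes; NOT the parquet-mr algorithm.
  public static final class ToyBloomFilter {
    private final BitSet bits;
    private final int numBits;

    public ToyBloomFilter(int numBits) {
      this.numBits = numBits;
      this.bits = new BitSet(numBits);
    }

    // Derives a bit position from the value's hash code and a per-probe seed.
    private int index(String value, int seed) {
      int h = value.hashCode() * 31 + seed * 0x9E3779B9;
      return Math.floorMod(h, numBits);
    }

    public void insert(String value) {
      bits.set(index(value, 1));
      bits.set(index(value, 2));
    }

    public boolean mightContain(String value) {
      return bits.get(index(value, 1)) && bits.get(index(value, 2));
    }

    // Merges by OR-ing the other filter's bits into this one.
    public void merge(ToyBloomFilter other) {
      if (numBits != other.numBits) {
        throw new IllegalArgumentException("Filters must have the same size");
      }
      bits.or(other.bits);
    }
  }

  // Inserts disjoint values into two filters, merges them, and checks every value is still found.
  public static boolean mergedContainsAll() {
    ToyBloomFilter left = new ToyBloomFilter(1024);
    ToyBloomFilter right = new ToyBloomFilter(1024);
    for (int i = 0; i < 50; i++) {
      left.insert("left-" + i);
      right.insert("right-" + i);
    }
    left.merge(right);
    for (int i = 0; i < 50; i++) {
      if (!left.mightContain("left-" + i) || !left.mightContain("right-" + i)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(mergedContainsAll()); // prints true
  }
}
```

   A fuller test would also compare the merged filter's false-positive rate against the configured target, which this sketch does not attempt.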





