[jira] [Commented] (PARQUET-2292) Improve default SpecificRecord model selection for Avro{Write,Read}Support

2023-04-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17715901#comment-17715901
 ] 

ASF GitHub Bot commented on PARQUET-2292:
-

clairemcginty commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1175528495


##
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##
@@ -169,6 +172,46 @@ public void add(Object value) {
 }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by reflecting the underlying
+   * Avro class's `MODEL$` field, or null if the class is not on the classpath or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {
+    final Class<?> clazz;
+
+    if (schema != null && (schema.getType() == Schema.Type.RECORD || schema.getType() == Schema.Type.UNION)) {
+      clazz = SpecificData.get().getClass(schema);
+    } else {
+      return null;
+    }
+
+    final SpecificData model;
+    try {
+      final Field modelField = clazz.getDeclaredField("MODEL$");
+      modelField.setAccessible(true);
+
+      model = (SpecificData) modelField.get(null);

Review Comment:
   sorry, can you expand on this a bit?
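For illustration, a minimal self-contained sketch of the reflective `MODEL$` lookup being discussed in the diff above. The wrapper class name and the null-returning fallback on reflection failure are assumptions for the example, not the PR's exact behavior:

```java
import java.lang.reflect.Field;

import org.apache.avro.Schema;
import org.apache.avro.specific.SpecificData;

final class SpecificModelLookupSketch {
  /**
   * Resolve the generated SpecificRecord class for the schema and read its static
   * MODEL$ field, which already carries the logical-type conversions.
   * Returns null when the class is absent or reflection fails (fallback is an assumption).
   */
  static SpecificData getModelForSchema(Schema schema) {
    if (schema == null
        || (schema.getType() != Schema.Type.RECORD && schema.getType() != Schema.Type.UNION)) {
      return null;
    }
    final Class<?> clazz = SpecificData.get().getClass(schema);
    if (clazz == null) {
      return null; // generated class not on the classpath
    }
    try {
      final Field modelField = clazz.getDeclaredField("MODEL$");
      modelField.setAccessible(true);
      return (SpecificData) modelField.get(null); // static field on the generated class
    } catch (ReflectiveOperationException e) {
      return null; // older generated classes may not declare MODEL$
    }
  }
}
```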





> Improve default SpecificRecord model selection for Avro{Write,Read}Support
> --
>
> Key: PARQUET-2292
> URL: https://issues.apache.org/jira/browse/PARQUET-2292
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
>
> AvroWriteSupport/AvroReadSupport can improve the precision of their default 
> `model` selection. Currently they default to new 
> SpecificDataSupplier().get()[0]. This means that SpecificRecord classes that 
> contain logical types will fail out-of-the-box unless a specific 
> DATA_SUPPLIER is configured that contains logical type conversions.
> I think we can improve this and make logical types work by default by 
> defaulting to the value of the `MODEL$` field that every SpecificRecordBase 
> implementation contains, which already contains all the logical conversions 
> for that Avro type. It would require reflection, but that's what the Avro 
> library is already doing to fetch models for Specific types[1].
>  
> [0] 
> [https://github.com/apache/parquet-mr/blob/d38044f5395494e1543581a4b763f624305d3022/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L403-L407]
> [1] 
> https://github.com/apache/avro/blob/release-1.11.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L76-L86



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2290) Add CI for Hadoop 2

2023-04-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716116#comment-17716116
 ] 

ASF GitHub Bot commented on PARQUET-2290:
-

Fokko opened a new pull request, #1083:
URL: https://github.com/apache/parquet-mr/pull/1083

   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
 - https://issues.apache.org/jira/browse/PARQUET-XXX
 - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes how to use it.
 - All the public functions and the classes in the PR contain Javadoc that explain what it does
   




> Add CI for Hadoop 2
> ---
>
> Key: PARQUET-2290
> URL: https://issues.apache.org/jira/browse/PARQUET-2290
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2290) Add CI for Hadoop 2

2023-04-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716298#comment-17716298
 ] 

ASF GitHub Bot commented on PARQUET-2290:
-

Fokko merged PR #1082:
URL: https://github.com/apache/parquet-mr/pull/1082




> Add CI for Hadoop 2
> ---
>
> Key: PARQUET-2290
> URL: https://issues.apache.org/jira/browse/PARQUET-2290
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2294) Bump fastutil from 8.4.2 to 8.5.12

2023-04-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716813#comment-17716813
 ] 

ASF GitHub Bot commented on PARQUET-2294:
-

Fokko merged PR #1080:
URL: https://github.com/apache/parquet-mr/pull/1080




> Bump fastutil from 8.4.2 to 8.5.12
> --
>
> Key: PARQUET-2294
> URL: https://issues.apache.org/jira/browse/PARQUET-2294
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2290) Add CI for Hadoop 2

2023-04-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716818#comment-17716818
 ] 

ASF GitHub Bot commented on PARQUET-2290:
-

Fokko merged PR #1083:
URL: https://github.com/apache/parquet-mr/pull/1083




> Add CI for Hadoop 2
> ---
>
> Key: PARQUET-2290
> URL: https://issues.apache.org/jira/browse/PARQUET-2290
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-04-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716867#comment-17716867
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-

emkornfield commented on PR #197:
URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1523900222

   > We usually expect implementations of new features in the format before the 
release to prove the related change. However I am not sure if the current 
unreleased changes need proof if they make sense as is.
   
   Thanks @gszadovszky I think maybe we should do a release before this change 
goes in, as it seems the most likely to run into implementation challenges?  
@wgtmac thoughts?




> [Format] Add statistics that reflect decoded size to metadata
> -
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-04-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716946#comment-17716946
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-

wgtmac commented on PR #197:
URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1524415917

   > Thanks @gszadovszky I think maybe we should do a release before this 
change goes in, as it seems the most likely to run into implementation 
challenges? @wgtmac thoughts?
   
   That sounds good. Then what should we do once the POC implementation has 
been accepted? Release a new format version and then merge the implementation? 
IMHO, the bottom line is that release of new format should not be later than 
release of the associated implementation. Otherwise the release of parquet 
implementation may carry an unreleased `parquet.thrift` file.
   




> [Format] Add statistics that reflect decoded size to metadata
> -
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-04-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717524#comment-17717524
 ] 

ASF GitHub Bot commented on PARQUET-2254:
-

yabola commented on code in PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#discussion_r1180097716


##
parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java:
##
@@ -97,7 +97,13 @@ abstract class ColumnWriterBase implements ColumnWriter {
   int optimalNumOfBits = BlockSplitBloomFilter.optimalNumOfBits(ndv.getAsLong(), fpp.getAsDouble());
   this.bloomFilter = new BlockSplitBloomFilter(optimalNumOfBits / 8, maxBloomFilterSize);
 } else {
-  this.bloomFilter = new BlockSplitBloomFilter(maxBloomFilterSize, maxBloomFilterSize);
+  boolean useDynamicBloomFilter = props.getDynamicBloomFilterEnabled(path);
+  if (useDynamicBloomFilter) {

Review Comment:
   done~





> Build a BloomFilter with a more precise size
> 
>
> Key: PARQUET-2254
> URL: https://issues.apache.org/jira/browse/PARQUET-2254
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Assignee: Mars
>Priority: Major
>
> h3. Why are the changes needed?
> Now the usage of bloom filter is to specify the NDV(number of distinct 
> values), and then build BloomFilter. In general scenarios, it is actually not 
> sure how much the distinct value is.
> If BloomFilter can be automatically generated according to the data, the file 
> size can be reduced and the reading efficiency can also be improved.
> h3. What changes were proposed in this pull request?
> {{DynamicBlockBloomFilter}} contains multiple {{BlockSplitBloomFilter}} as 
> candidates and inserts values in the candidates at the same time. Use the 
> largest bloom filter as an approximate deduplication counter, and then remove 
> incapable bloom filter candidates during data insertion.
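For illustration only, a minimal sketch of the candidate mechanism described above: insert into all candidates at once, use the largest candidate as an approximate distinct-value counter, drop candidates whose expected NDV is exceeded, and write out the smallest survivor. All names are hypothetical and the candidate filters are trivial bit sets, not real block-split bloom filters; the actual PR class (later renamed `AdaptiveBlockSplitBloomFilter`) differs in detail.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;

final class AdaptiveBloomSketch {

  static final class Candidate {
    final long expectedNdv;   // NDV this candidate was sized for
    final int numBits;
    final BitSet bits;

    Candidate(long expectedNdv, int numBits) {
      this.expectedNdv = expectedNdv;
      this.numBits = numBits;
      this.bits = new BitSet(numBits);
    }

    boolean mightContain(long hash) {
      return bits.get(Math.floorMod((int) (hash ^ (hash >>> 32)), numBits));
    }

    void insert(long hash) {
      bits.set(Math.floorMod((int) (hash ^ (hash >>> 32)), numBits));
    }
  }

  private final List<Candidate> candidates = new ArrayList<>();
  private long distinctValueCounter = 0;

  AdaptiveBloomSketch() {
    // Candidates sized for increasing NDVs; the last (largest) one is never removed.
    candidates.add(new Candidate(1_000, 8 * 1024));
    candidates.add(new Candidate(100_000, 8 * 128 * 1024));
    candidates.add(new Candidate(10_000_000, 8 * 1024 * 1024));
  }

  void insertHash(long hash) {
    // The largest candidate doubles as an approximate distinct-value counter.
    Candidate largest = candidates.get(candidates.size() - 1);
    if (!largest.mightContain(hash)) {
      distinctValueCounter++;
    }
    // Remove candidates whose expected NDV has been exceeded, always keeping the largest.
    candidates.removeIf(c -> c != largest && distinctValueCounter > c.expectedNdv);
    // Insert into every remaining candidate.
    for (Candidate c : candidates) {
      c.insert(hash);
    }
  }

  Candidate chooseForWrite() {
    // The smallest surviving candidate is the one written to the file.
    return candidates.stream()
        .min(Comparator.comparingLong((Candidate c) -> c.expectedNdv))
        .orElseThrow(IllegalStateException::new);
  }
}
```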



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-04-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717527#comment-17717527
 ] 

ASF GitHub Bot commented on PARQUET-2254:
-

yabola commented on code in PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#discussion_r1180099251


##
parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java:
##
@@ -503,6 +523,30 @@ public Builder withBloomFilterEnabled(boolean enabled) {
   return this;
 }
 
+/**
+ * Whether to use dynamic bloom filter to automatically adjust the bloom filter size according to
+ * `parquet.bloom.filter.max.bytes`.
+ * If NDV (number of distinct values) for a specified column is set, it will be ignored
+ *
+ * @param enabled whether to use dynamic bloom filter
+ */
+public Builder withDynamicBloomFilterEnabled(boolean enabled) {
+  this.dynamicBloomFilterEnabled.withDefaultValue(enabled);
+  return this;
+}
+
+/**
+ * When `DynamicBloomFilter` is enabled, set how many bloomFilters to split as candidates.

Review Comment:
   done
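For illustration, a hedged sketch of how the draft builder method quoted above might be used. `withDynamicBloomFilterEnabled` is the draft name from this PR (later renamed together with the filter class), so the final API may differ; `withBloomFilterEnabled` is the existing switch shown in the diff context.

```java
import org.apache.parquet.column.ParquetProperties;

public class DynamicBloomFilterBuilderExample {
  public static void main(String[] args) {
    // Sketch under the assumption that the PR's draft builder method is available.
    ParquetProperties props = ParquetProperties.builder()
        .withBloomFilterEnabled(true)          // existing switch (see the diff above)
        .withDynamicBloomFilterEnabled(true)   // draft method added in this PR
        .build();
    System.out.println(props);
  }
}
```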





> Build a BloomFilter with a more precise size
> 
>
> Key: PARQUET-2254
> URL: https://issues.apache.org/jira/browse/PARQUET-2254
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Assignee: Mars
>Priority: Major
>
> h3. Why are the changes needed?
> Now the usage of bloom filter is to specify the NDV(number of distinct 
> values), and then build BloomFilter. In general scenarios, it is actually not 
> sure how much the distinct value is.
> If BloomFilter can be automatically generated according to the data, the file 
> size can be reduced and the reading efficiency can also be improved.
> h3. What changes were proposed in this pull request?
> {{DynamicBlockBloomFilter}} contains multiple {{BlockSplitBloomFilter}} as 
> candidates and inserts values in the candidates at the same time. Use the 
> largest bloom filter as an approximate deduplication counter, and then remove 
> incapable bloom filter candidates during data insertion.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-04-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717528#comment-17717528
 ] 

ASF GitHub Bot commented on PARQUET-2254:
-

yabola commented on code in PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#discussion_r1180102329


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java:
##
@@ -152,6 +153,8 @@ public static enum JobSummaryLevel {
   public static final String BLOOM_FILTER_EXPECTED_NDV = "parquet.bloom.filter.expected.ndv";
   public static final String BLOOM_FILTER_MAX_BYTES = "parquet.bloom.filter.max.bytes";
   public static final String BLOOM_FILTER_FPP = "parquet.bloom.filter.fpp";
+  public static final String DYNAMIC_BLOOM_FILTER_ENABLED = "parquet.bloom.filter.dynamic.enabled";

Review Comment:
   @wgtmac In the end, at least one BloomFilter will be retained, because I 
won't remove the `largestCandidate` (please see 
`AdaptiveBlockSplitBloomFilter#insertHash`)





> Build a BloomFilter with a more precise size
> 
>
> Key: PARQUET-2254
> URL: https://issues.apache.org/jira/browse/PARQUET-2254
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Assignee: Mars
>Priority: Major
>
> h3. Why are the changes needed?
> Now the usage of bloom filter is to specify the NDV(number of distinct 
> values), and then build BloomFilter. In general scenarios, it is actually not 
> sure how much the distinct value is.
> If BloomFilter can be automatically generated according to the data, the file 
> size can be reduced and the reading efficiency can also be improved.
> h3. What changes were proposed in this pull request?
> {{DynamicBlockBloomFilter}} contains multiple {{BlockSplitBloomFilter}} as 
> candidates and inserts values in the candidates at the same time. Use the 
> largest bloom filter as an approximate deduplication counter, and then remove 
> incapable bloom filter candidates during data insertion.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-04-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717533#comment-17717533
 ] 

ASF GitHub Bot commented on PARQUET-2254:
-

yabola commented on code in PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#discussion_r1180103459


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java:
##
@@ -152,6 +153,8 @@ public static enum JobSummaryLevel {
   public static final String BLOOM_FILTER_EXPECTED_NDV = "parquet.bloom.filter.expected.ndv";
   public static final String BLOOM_FILTER_MAX_BYTES = "parquet.bloom.filter.max.bytes";
   public static final String BLOOM_FILTER_FPP = "parquet.bloom.filter.fpp";
+  public static final String DYNAMIC_BLOOM_FILTER_ENABLED = "parquet.bloom.filter.dynamic.enabled";

Review Comment:
   I changed the name to `AdaptiveBlockSplitBloomFilter`, how do you feel?~





> Build a BloomFilter with a more precise size
> 
>
> Key: PARQUET-2254
> URL: https://issues.apache.org/jira/browse/PARQUET-2254
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Assignee: Mars
>Priority: Major
>
> h3. Why are the changes needed?
> Now the usage of bloom filter is to specify the NDV(number of distinct 
> values), and then build BloomFilter. In general scenarios, it is actually not 
> sure how much the distinct value is.
> If BloomFilter can be automatically generated according to the data, the file 
> size can be reduced and the reading efficiency can also be improved.
> h3. What changes were proposed in this pull request?
> {{DynamicBlockBloomFilter}} contains multiple {{BlockSplitBloomFilter}} as 
> candidates and inserts values in the candidates at the same time. Use the 
> largest bloom filter as an approximate deduplication counter, and then remove 
> incapable bloom filter candidates during data insertion.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-04-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717535#comment-17717535
 ] 

ASF GitHub Bot commented on PARQUET-2254:
-

yabola commented on code in PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#discussion_r1180108242


##
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java:
##
@@ -19,6 +19,11 @@
 package org.apache.parquet.hadoop;
 
 import static java.util.Arrays.asList;
+import static org.apache.parquet.schema.LogicalTypeAnnotation.stringType;

Review Comment:
   sorry, revert these changes





> Build a BloomFilter with a more precise size
> 
>
> Key: PARQUET-2254
> URL: https://issues.apache.org/jira/browse/PARQUET-2254
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Assignee: Mars
>Priority: Major
>
> h3. Why are the changes needed?
> Now the usage of bloom filter is to specify the NDV(number of distinct 
> values), and then build BloomFilter. In general scenarios, it is actually not 
> sure how much the distinct value is.
> If BloomFilter can be automatically generated according to the data, the file 
> size can be reduced and the reading efficiency can also be improved.
> h3. What changes were proposed in this pull request?
> {{DynamicBlockBloomFilter}} contains multiple {{BlockSplitBloomFilter}} as 
> candidates and inserts values in the candidates at the same time. Use the 
> largest bloom filter as an approximate deduplication counter, and then remove 
> incapable bloom filter candidates during data insertion.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-04-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717537#comment-17717537
 ] 

ASF GitHub Bot commented on PARQUET-2254:
-

yabola commented on code in PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#discussion_r1180109885


##
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/DynamicBlockBloomFilter.java:
##
@@ -0,0 +1,314 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.column.values.bloomfilter;
+
+import static org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.LOWER_BOUND_BYTES;
+import static org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.UPPER_BOUND_BYTES;
+
+import java.io.IOException;
+import java.io.OutputStream;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Objects;
+import java.util.Optional;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.parquet.Preconditions;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.io.api.Binary;
+
+/**
+ * `DynamicBlockBloomFilter` contains multiple `BlockSplitBloomFilter` as candidates and inserts values in
+ * the candidates at the same time.
+ * The purpose of this is to finally generate a bloom filter with the optimal bit size according to the number
+ * of real data distinct values. Use the largest bloom filter as an approximate deduplication counter, and then
+ * remove incapable bloom filter candidate during data insertion.
+ */
+public class DynamicBlockBloomFilter implements BloomFilter {

Review Comment:
   done





> Build a BloomFilter with a more precise size
> 
>
> Key: PARQUET-2254
> URL: https://issues.apache.org/jira/browse/PARQUET-2254
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Assignee: Mars
>Priority: Major
>
> h3. Why are the changes needed?
> Now the usage of bloom filter is to specify the NDV(number of distinct 
> values), and then build BloomFilter. In general scenarios, it is actually not 
> sure how much the distinct value is.
> If BloomFilter can be automatically generated according to the data, the file 
> size can be reduced and the reading efficiency can also be improved.
> h3. What changes were proposed in this pull request?
> {{DynamicBlockBloomFilter}} contains multiple {{BlockSplitBloomFilter}} as 
> candidates and inserts values in the candidates at the same time. Use the 
> largest bloom filter as an approximate deduplication counter, and then remove 
> incapable bloom filter candidates during data insertion.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-04-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717536#comment-17717536
 ] 

ASF GitHub Bot commented on PARQUET-2254:
-

yabola commented on code in PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#discussion_r1180109383


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java:
##
@@ -152,6 +153,8 @@ public static enum JobSummaryLevel {
   public static final String BLOOM_FILTER_EXPECTED_NDV = "parquet.bloom.filter.expected.ndv";
   public static final String BLOOM_FILTER_MAX_BYTES = "parquet.bloom.filter.max.bytes";
   public static final String BLOOM_FILTER_FPP = "parquet.bloom.filter.fpp";
+  public static final String DYNAMIC_BLOOM_FILTER_ENABLED = "parquet.bloom.filter.dynamic.enabled";
+  public static final String BLOOM_FILTER_CANDIDATE_SIZE = "parquet.bloom.filter.candidate.size";

Review Comment:
   emmm, I want to align BloomFilter's configurations format 
`parquet.bloom.filter.xxx`
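For illustration, a hedged example of the key format under discussion. The `dynamic`/`candidate` keys are the draft names proposed in this PR and were later renamed along with the class; the other keys already exist in ParquetOutputFormat.

```java
import org.apache.hadoop.conf.Configuration;

public class BloomFilterConfigExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setBoolean("parquet.bloom.filter.enabled", true);          // existing: enable bloom filters
    conf.setInt("parquet.bloom.filter.max.bytes", 1024 * 1024);     // existing: upper bound on filter size
    conf.setBoolean("parquet.bloom.filter.dynamic.enabled", true);  // proposed in this PR (draft name)
    conf.setInt("parquet.bloom.filter.candidate.size", 5);          // proposed in this PR (draft name)
    System.out.println(conf.get("parquet.bloom.filter.dynamic.enabled"));
  }
}
```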





> Build a BloomFilter with a more precise size
> 
>
> Key: PARQUET-2254
> URL: https://issues.apache.org/jira/browse/PARQUET-2254
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Assignee: Mars
>Priority: Major
>
> h3. Why are the changes needed?
> Now the usage of bloom filter is to specify the NDV(number of distinct 
> values), and then build BloomFilter. In general scenarios, it is actually not 
> sure how much the distinct value is.
> If BloomFilter can be automatically generated according to the data, the file 
> size can be reduced and the reading efficiency can also be improved.
> h3. What changes were proposed in this pull request?
> {{DynamicBlockBloomFilter}} contains multiple {{BlockSplitBloomFilter}} as 
> candidates and inserts values in the candidates at the same time. Use the 
> largest bloom filter as an approximate deduplication counter, and then remove 
> incapable bloom filter candidates during data insertion.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-04-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717541#comment-17717541
 ] 

ASF GitHub Bot commented on PARQUET-2254:
-

yabola commented on code in PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#discussion_r1180114278


##
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/DynamicBlockBloomFilter.java:
##
@@ -0,0 +1,314 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.column.values.bloomfilter;
+
+import static org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.LOWER_BOUND_BYTES;
+import static org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.UPPER_BOUND_BYTES;
+
+import java.io.IOException;
+import java.io.OutputStream;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Objects;
+import java.util.Optional;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.parquet.Preconditions;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.io.api.Binary;
+
+/**
+ * `DynamicBlockBloomFilter` contains multiple `BlockSplitBloomFilter` as candidates and inserts values in
+ * the candidates at the same time.
+ * The purpose of this is to finally generate a bloom filter with the optimal bit size according to the number
+ * of real data distinct values. Use the largest bloom filter as an approximate deduplication counter, and then
+ * remove incapable bloom filter candidate during data insertion.
+ */
+public class DynamicBlockBloomFilter implements BloomFilter {
+
+  private static final Logger LOG = LoggerFactory.getLogger(DynamicBlockBloomFilter.class);
+
+  // multiple candidates, inserting data at the same time. If the distinct values are greater than the
+  // expected NDV of candidates, it will be removed. Finally we will choose the smallest candidate to write out.
+  private final List<BloomFilterCandidate> candidates = new ArrayList<>();
+
+  // the largest among candidates and used as an approximate deduplication counter
+  private BloomFilterCandidate maxCandidate;
+
+  // the accumulator of the number of distinct values that have been inserted so far
+  private int distinctValueCounter = 0;
+
+  // indicates that the bloom filter candidate has been written out and new data should be no longer allowed to be inserted
+  private boolean finalized = false;
+
+  private int maximumBytes = UPPER_BOUND_BYTES;
+  private int minimumBytes = LOWER_BOUND_BYTES;
+  // the hash strategy used in this bloom filter.
+  private final HashStrategy hashStrategy;
+  // the column to build bloom filter
+  private ColumnDescriptor column;
+
+  public DynamicBlockBloomFilter(int numBytes, int candidatesNum, double fpp, ColumnDescriptor column) {
+    this(numBytes, LOWER_BOUND_BYTES, UPPER_BOUND_BYTES, HashStrategy.XXH64, fpp, candidatesNum, column);
+  }
+
+  public DynamicBlockBloomFilter(int numBytes, int maximumBytes, int candidatesNum, double fpp, ColumnDescriptor column) {

Review Comment:
   I deleted some unimportant configurations and  I changed to name `numBytes` 
as `maximumBytes` (means  the maximum bit size of candidate)



##
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/DynamicBlockBloomFilter.java:
##
@@ -0,0 +1,314 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific lan

[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-04-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717544#comment-17717544
 ] 

ASF GitHub Bot commented on PARQUET-2254:
-

yabola commented on code in PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#discussion_r1180115504


##
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/DynamicBlockBloomFilter.java:
##
@@ -0,0 +1,314 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.column.values.bloomfilter;
+
+import static org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.LOWER_BOUND_BYTES;
+import static org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.UPPER_BOUND_BYTES;
+
+import java.io.IOException;
+import java.io.OutputStream;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Objects;
+import java.util.Optional;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.parquet.Preconditions;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.io.api.Binary;
+
+/**
+ * `DynamicBlockBloomFilter` contains multiple `BlockSplitBloomFilter` as 
candidates and inserts values in
+ * the candidates at the same time.
+ * The purpose of this is to finally generate a bloom filter with the optimal 
bit size according to the number
+ * of real data distinct values. Use the largest bloom filter as an 
approximate deduplication counter, and then
+ * remove incapable bloom filter candidate during data insertion.
+ */
+public class DynamicBlockBloomFilter implements BloomFilter {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(DynamicBlockBloomFilter.class);
+
+  // multiple candidates, inserting data at the same time. If the distinct 
values are greater than the
+  // expected NDV of candidates, it will be removed. Finally we will choose 
the smallest candidate to write out.
+  private final List<BloomFilterCandidate> candidates = new ArrayList<>();
+
+  // the largest among candidates and used as an approximate deduplication 
counter
+  private BloomFilterCandidate maxCandidate;
+
+  // the accumulator of the number of distinct values that have been inserted 
so far
+  private int distinctValueCounter = 0;

Review Comment:
   I changed to use `long`





> Build a BloomFilter with a more precise size
> 
>
> Key: PARQUET-2254
> URL: https://issues.apache.org/jira/browse/PARQUET-2254
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Assignee: Mars
>Priority: Major
>
> h3. Why are the changes needed?
> Now the usage of bloom filter is to specify the NDV(number of distinct 
> values), and then build BloomFilter. In general scenarios, it is actually not 
> sure how much the distinct value is.
> If BloomFilter can be automatically generated according to the data, the file 
> size can be reduced and the reading efficiency can also be improved.
> h3. What changes were proposed in this pull request?
> {{DynamicBlockBloomFilter}} contains multiple {{BlockSplitBloomFilter}} as 
> candidates and inserts values in the candidates at the same time. Use the 
> largest bloom filter as an approximate deduplication counter, and then remove 
> incapable bloom filter candidates during data insertion.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-04-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717543#comment-17717543
 ] 

ASF GitHub Bot commented on PARQUET-2254:
-

yabola commented on code in PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#discussion_r1180115266


##
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/DynamicBlockBloomFilter.java:
##
@@ -0,0 +1,314 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.column.values.bloomfilter;
+
+import static org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.LOWER_BOUND_BYTES;
+import static org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.UPPER_BOUND_BYTES;
+
+import java.io.IOException;
+import java.io.OutputStream;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Objects;
+import java.util.Optional;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.parquet.Preconditions;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.io.api.Binary;
+
+/**
+ * `DynamicBlockBloomFilter` contains multiple `BlockSplitBloomFilter` as 
candidates and inserts values in
+ * the candidates at the same time.
+ * The purpose of this is to finally generate a bloom filter with the optimal 
bit size according to the number
+ * of real data distinct values. Use the largest bloom filter as an 
approximate deduplication counter, and then
+ * remove incapable bloom filter candidate during data insertion.
+ */
+public class DynamicBlockBloomFilter implements BloomFilter {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(DynamicBlockBloomFilter.class);
+
+  // multiple candidates, inserting data at the same time. If the distinct 
values are greater than the
+  // expected NDV of candidates, it will be removed. Finally we will choose 
the smallest candidate to write out.
+  private final List<BloomFilterCandidate> candidates = new ArrayList<>();
+
+  // the largest among candidates and used as an approximate deduplication 
counter
+  private BloomFilterCandidate maxCandidate;
+
+  // the accumulator of the number of distinct values that have been inserted 
so far
+  private int distinctValueCounter = 0;
+
+  // indicates that the bloom filter candidate has been written out and new 
data should be no longer allowed to be inserted
+  private boolean finalized = false;
+
+  private int maximumBytes = UPPER_BOUND_BYTES;
+  private int minimumBytes = LOWER_BOUND_BYTES;
+  // the hash strategy used in this bloom filter.
+  private final HashStrategy hashStrategy;
+  // the column to build bloom filter
+  private ColumnDescriptor column;
+
+  public DynamicBlockBloomFilter(int numBytes, int candidatesNum, double fpp, 
ColumnDescriptor column) {
+this(numBytes, LOWER_BOUND_BYTES, UPPER_BOUND_BYTES, HashStrategy.XXH64, 
fpp, candidatesNum, column);
+  }
+
+  public DynamicBlockBloomFilter(int numBytes, int maximumBytes, int 
candidatesNum, double fpp, ColumnDescriptor column) {
+this(numBytes, LOWER_BOUND_BYTES, maximumBytes, HashStrategy.XXH64, fpp, 
candidatesNum, column);
+  }
+
+  public DynamicBlockBloomFilter(int numBytes, int minimumBytes, int 
maximumBytes, HashStrategy hashStrategy,
+double fpp, int candidatesNum, ColumnDescriptor column) {
+if (minimumBytes > maximumBytes) {
+  throw new IllegalArgumentException("the minimum bytes should be less or 
equal than maximum bytes");
+}
+
+if (minimumBytes > LOWER_BOUND_BYTES && minimumBytes < UPPER_BOUND_BYTES) {
+  this.minimumBytes = minimumBytes;
+}
+
+if (maximumBytes > LOWER_BOUND_BYTES && maximumBytes < UPPER_BOUND_BYTES) {
+  this.maximumBytes = maximumBytes;
+}
+this.column = column;
+switch (hashStrategy) {
+  case XXH64:
+this.hashStrategy = hashStrategy;
+break;
+  default:
+throw new RuntimeException("Unsupported hash strategy");
+}
+initCandidates(numBytes, candidatesNum, fpp);
+  }
+
+  /**
+   * Given the maximum acceptable bytes size of bloom filter, genera

[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-04-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717545#comment-17717545
 ] 

ASF GitHub Bot commented on PARQUET-2254:
-

yabola commented on code in PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#discussion_r1180117673


##
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/DynamicBlockBloomFilter.java:
##
@@ -0,0 +1,314 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.column.values.bloomfilter;
+
+import static org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.LOWER_BOUND_BYTES;
+import static org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.UPPER_BOUND_BYTES;
+
+import java.io.IOException;
+import java.io.OutputStream;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Objects;
+import java.util.Optional;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.parquet.Preconditions;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.io.api.Binary;
+
+/**
+ * `DynamicBlockBloomFilter` contains multiple `BlockSplitBloomFilter` as 
candidates and inserts values in
+ * the candidates at the same time.
+ * The purpose of this is to finally generate a bloom filter with the optimal 
bit size according to the number
+ * of real data distinct values. Use the largest bloom filter as an 
approximate deduplication counter, and then
+ * remove incapable bloom filter candidate during data insertion.
+ */
+public class DynamicBlockBloomFilter implements BloomFilter {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(DynamicBlockBloomFilter.class);
+
+  // multiple candidates, inserting data at the same time. If the distinct 
values are greater than the
+  // expected NDV of candidates, it will be removed. Finally we will choose 
the smallest candidate to write out.
+  private final List<BloomFilterCandidate> candidates = new ArrayList<>();
+
+  // the largest among candidates and used as an approximate deduplication 
counter
+  private BloomFilterCandidate maxCandidate;
+
+  // the accumulator of the number of distinct values that have been inserted 
so far
+  private int distinctValueCounter = 0;
+
+  // indicates that the bloom filter candidate has been written out and new 
data should be no longer allowed to be inserted
+  private boolean finalized = false;
+
+  private int maximumBytes = UPPER_BOUND_BYTES;
+  private int minimumBytes = LOWER_BOUND_BYTES;
+  // the hash strategy used in this bloom filter.
+  private final HashStrategy hashStrategy;
+  // the column to build bloom filter
+  private ColumnDescriptor column;
+
+  public DynamicBlockBloomFilter(int numBytes, int candidatesNum, double fpp, 
ColumnDescriptor column) {
+this(numBytes, LOWER_BOUND_BYTES, UPPER_BOUND_BYTES, HashStrategy.XXH64, 
fpp, candidatesNum, column);
+  }
+
+  public DynamicBlockBloomFilter(int numBytes, int maximumBytes, int 
candidatesNum, double fpp, ColumnDescriptor column) {
+this(numBytes, LOWER_BOUND_BYTES, maximumBytes, HashStrategy.XXH64, fpp, 
candidatesNum, column);
+  }
+
+  public DynamicBlockBloomFilter(int numBytes, int minimumBytes, int 
maximumBytes, HashStrategy hashStrategy,
+double fpp, int candidatesNum, ColumnDescriptor column) {
+if (minimumBytes > maximumBytes) {
+  throw new IllegalArgumentException("the minimum bytes should be less or 
equal than maximum bytes");
+}
+
+if (minimumBytes > LOWER_BOUND_BYTES && minimumBytes < UPPER_BOUND_BYTES) {
+  this.minimumBytes = minimumBytes;
+}
+
+if (maximumBytes > LOWER_BOUND_BYTES && maximumBytes < UPPER_BOUND_BYTES) {
+  this.maximumBytes = maximumBytes;
+}
+this.column = column;
+switch (hashStrategy) {
+  case XXH64:
+this.hashStrategy = hashStrategy;
+break;
+  default:
+throw new RuntimeException("Unsupported hash strategy");
+}
+initCandidates(numBytes, candidatesNum, fpp);
+  }
+
+  /**
+   * Given the maximum acceptable bytes size of bloom filter, genera

[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-04-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717547#comment-17717547
 ] 

ASF GitHub Bot commented on PARQUET-2254:
-

yabola commented on code in PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#discussion_r1180119535


##
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/DynamicBlockBloomFilter.java:
##
@@ -0,0 +1,314 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.column.values.bloomfilter;
+
+import static org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.LOWER_BOUND_BYTES;
+import static org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.UPPER_BOUND_BYTES;
+
+import java.io.IOException;
+import java.io.OutputStream;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Objects;
+import java.util.Optional;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.parquet.Preconditions;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.io.api.Binary;
+
+/**
+ * `DynamicBlockBloomFilter` contains multiple `BlockSplitBloomFilter` as 
candidates and inserts values in
+ * the candidates at the same time.
+ * The purpose of this is to finally generate a bloom filter with the optimal 
bit size according to the number
+ * of real data distinct values. Use the largest bloom filter as an 
approximate deduplication counter, and then
+ * remove incapable bloom filter candidate during data insertion.
+ */
+public class DynamicBlockBloomFilter implements BloomFilter {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(DynamicBlockBloomFilter.class);
+
+  // multiple candidates, inserting data at the same time. If the distinct 
values are greater than the
+  // expected NDV of candidates, it will be removed. Finally we will choose 
the smallest candidate to write out.
+  private final List<BloomFilterCandidate> candidates = new ArrayList<>();
+
+  // the largest among candidates and used as an approximate deduplication 
counter
+  private BloomFilterCandidate maxCandidate;
+
+  // the accumulator of the number of distinct values that have been inserted 
so far
+  private int distinctValueCounter = 0;
+
+  // indicates that the bloom filter candidate has been written out and new 
data should be no longer allowed to be inserted
+  private boolean finalized = false;
+
+  private int maximumBytes = UPPER_BOUND_BYTES;
+  private int minimumBytes = LOWER_BOUND_BYTES;
+  // the hash strategy used in this bloom filter.
+  private final HashStrategy hashStrategy;
+  // the column to build bloom filter
+  private ColumnDescriptor column;
+
+  public DynamicBlockBloomFilter(int numBytes, int candidatesNum, double fpp, 
ColumnDescriptor column) {
+this(numBytes, LOWER_BOUND_BYTES, UPPER_BOUND_BYTES, HashStrategy.XXH64, 
fpp, candidatesNum, column);
+  }
+
+  public DynamicBlockBloomFilter(int numBytes, int maximumBytes, int 
candidatesNum, double fpp, ColumnDescriptor column) {
+this(numBytes, LOWER_BOUND_BYTES, maximumBytes, HashStrategy.XXH64, fpp, 
candidatesNum, column);
+  }
+
+  public DynamicBlockBloomFilter(int numBytes, int minimumBytes, int 
maximumBytes, HashStrategy hashStrategy,
+double fpp, int candidatesNum, ColumnDescriptor column) {
+if (minimumBytes > maximumBytes) {
+  throw new IllegalArgumentException("the minimum bytes should be less or 
equal than maximum bytes");
+}
+
+if (minimumBytes > LOWER_BOUND_BYTES && minimumBytes < UPPER_BOUND_BYTES) {
+  this.minimumBytes = minimumBytes;
+}
+
+if (maximumBytes > LOWER_BOUND_BYTES && maximumBytes < UPPER_BOUND_BYTES) {
+  this.maximumBytes = maximumBytes;
+}
+this.column = column;
+switch (hashStrategy) {
+  case XXH64:
+this.hashStrategy = hashStrategy;
+break;
+  default:
+throw new RuntimeException("Unsupported hash strategy");
+}
+initCandidates(numBytes, candidatesNum, fpp);
+  }
+
+  /**
+   * Given the maximum acceptable bytes size of bloom filter, genera

[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-04-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717549#comment-17717549
 ] 

ASF GitHub Bot commented on PARQUET-2254:
-

yabola commented on code in PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#discussion_r1180122796


##
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/DynamicBlockBloomFilter.java:
##
@@ -0,0 +1,314 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.column.values.bloomfilter;
+
+import static org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.LOWER_BOUND_BYTES;
+import static org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.UPPER_BOUND_BYTES;
+
+import java.io.IOException;
+import java.io.OutputStream;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Objects;
+import java.util.Optional;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.parquet.Preconditions;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.io.api.Binary;
+
+/**
+ * `DynamicBlockBloomFilter` contains multiple `BlockSplitBloomFilter` as 
candidates and inserts values in
+ * the candidates at the same time.
+ * The purpose of this is to finally generate a bloom filter with the optimal 
bit size according to the number
+ * of real data distinct values. Use the largest bloom filter as an 
approximate deduplication counter, and then
+ * remove incapable bloom filter candidate during data insertion.
+ */
+public class DynamicBlockBloomFilter implements BloomFilter {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(DynamicBlockBloomFilter.class);
+
+  // multiple candidates, inserting data at the same time. If the distinct 
values are greater than the
+  // expected NDV of candidates, it will be removed. Finally we will choose 
the smallest candidate to write out.
+  private final List<BloomFilterCandidate> candidates = new ArrayList<>();
+
+  // the largest among candidates and used as an approximate deduplication 
counter
+  private BloomFilterCandidate maxCandidate;
+
+  // the accumulator of the number of distinct values that have been inserted 
so far
+  private int distinctValueCounter = 0;
+
+  // indicates that the bloom filter candidate has been written out and new 
data should be no longer allowed to be inserted
+  private boolean finalized = false;
+
+  private int maximumBytes = UPPER_BOUND_BYTES;
+  private int minimumBytes = LOWER_BOUND_BYTES;
+  // the hash strategy used in this bloom filter.
+  private final HashStrategy hashStrategy;
+  // the column to build bloom filter
+  private ColumnDescriptor column;
+
+  public DynamicBlockBloomFilter(int numBytes, int candidatesNum, double fpp, 
ColumnDescriptor column) {
+this(numBytes, LOWER_BOUND_BYTES, UPPER_BOUND_BYTES, HashStrategy.XXH64, 
fpp, candidatesNum, column);
+  }
+
+  public DynamicBlockBloomFilter(int numBytes, int maximumBytes, int 
candidatesNum, double fpp, ColumnDescriptor column) {
+this(numBytes, LOWER_BOUND_BYTES, maximumBytes, HashStrategy.XXH64, fpp, 
candidatesNum, column);
+  }
+
+  public DynamicBlockBloomFilter(int numBytes, int minimumBytes, int 
maximumBytes, HashStrategy hashStrategy,
+double fpp, int candidatesNum, ColumnDescriptor column) {
+if (minimumBytes > maximumBytes) {
+  throw new IllegalArgumentException("the minimum bytes should be less or 
equal than maximum bytes");
+}
+
+if (minimumBytes > LOWER_BOUND_BYTES && minimumBytes < UPPER_BOUND_BYTES) {
+  this.minimumBytes = minimumBytes;
+}
+
+if (maximumBytes > LOWER_BOUND_BYTES && maximumBytes < UPPER_BOUND_BYTES) {
+  this.maximumBytes = maximumBytes;
+}
+this.column = column;
+switch (hashStrategy) {
+  case XXH64:
+this.hashStrategy = hashStrategy;
+break;
+  default:
+throw new RuntimeException("Unsupported hash strategy");
+}
+initCandidates(numBytes, candidatesNum, fpp);
+  }
+
+  /**
+   * Given the maximum acceptable bytes size of bloom filter, genera

[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-04-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717561#comment-17717561
 ] 

ASF GitHub Bot commented on PARQUET-2254:
-

yabola commented on code in PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#discussion_r1180144336


##
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/DynamicBlockBloomFilter.java:
##
@@ -0,0 +1,314 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.column.values.bloomfilter;
+
+import static org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.LOWER_BOUND_BYTES;
+import static org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.UPPER_BOUND_BYTES;
+
+import java.io.IOException;
+import java.io.OutputStream;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Objects;
+import java.util.Optional;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.parquet.Preconditions;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.io.api.Binary;
+
+/**
+ * `DynamicBlockBloomFilter` contains multiple `BlockSplitBloomFilter` as 
candidates and inserts values in
+ * the candidates at the same time.
+ * The purpose of this is to finally generate a bloom filter with the optimal 
bit size according to the number
+ * of real data distinct values. Use the largest bloom filter as an 
approximate deduplication counter, and then
+ * remove incapable bloom filter candidate during data insertion.
+ */
+public class DynamicBlockBloomFilter implements BloomFilter {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(DynamicBlockBloomFilter.class);
+
+  // multiple candidates, inserting data at the same time. If the distinct 
values are greater than the
+  // expected NDV of candidates, it will be removed. Finally we will choose 
the smallest candidate to write out.
+  private final List<BloomFilterCandidate> candidates = new ArrayList<>();
+
+  // the largest among candidates and used as an approximate deduplication 
counter
+  private BloomFilterCandidate maxCandidate;
+
+  // the accumulator of the number of distinct values that have been inserted 
so far
+  private int distinctValueCounter = 0;
+
+  // indicates that the bloom filter candidate has been written out and new 
data should be no longer allowed to be inserted
+  private boolean finalized = false;
+
+  private int maximumBytes = UPPER_BOUND_BYTES;
+  private int minimumBytes = LOWER_BOUND_BYTES;
+  // the hash strategy used in this bloom filter.
+  private final HashStrategy hashStrategy;
+  // the column to build bloom filter
+  private ColumnDescriptor column;
+
+  public DynamicBlockBloomFilter(int numBytes, int candidatesNum, double fpp, 
ColumnDescriptor column) {
+this(numBytes, LOWER_BOUND_BYTES, UPPER_BOUND_BYTES, HashStrategy.XXH64, 
fpp, candidatesNum, column);
+  }
+
+  public DynamicBlockBloomFilter(int numBytes, int maximumBytes, int 
candidatesNum, double fpp, ColumnDescriptor column) {
+this(numBytes, LOWER_BOUND_BYTES, maximumBytes, HashStrategy.XXH64, fpp, 
candidatesNum, column);
+  }
+
+  public DynamicBlockBloomFilter(int numBytes, int minimumBytes, int 
maximumBytes, HashStrategy hashStrategy,
+double fpp, int candidatesNum, ColumnDescriptor column) {
+if (minimumBytes > maximumBytes) {
+  throw new IllegalArgumentException("the minimum bytes should be less or 
equal than maximum bytes");
+}
+
+if (minimumBytes > LOWER_BOUND_BYTES && minimumBytes < UPPER_BOUND_BYTES) {
+  this.minimumBytes = minimumBytes;
+}
+
+if (maximumBytes > LOWER_BOUND_BYTES && maximumBytes < UPPER_BOUND_BYTES) {
+  this.maximumBytes = maximumBytes;
+}
+this.column = column;
+switch (hashStrategy) {
+  case XXH64:
+this.hashStrategy = hashStrategy;
+break;
+  default:
+throw new RuntimeException("Unsupported hash strategy");
+}
+initCandidates(numBytes, candidatesNum, fpp);
+  }
+
+  /**
+   * Given the maximum acceptable bytes size of bloom filter, genera

[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-04-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717579#comment-17717579
 ] 

ASF GitHub Bot commented on PARQUET-2254:
-

yabola commented on code in PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#discussion_r1180178705


##
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/DynamicBlockBloomFilter.java:
##
@@ -0,0 +1,314 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.column.values.bloomfilter;
+
+import static 
org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.LOWER_BOUND_BYTES;
+import static 
org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.UPPER_BOUND_BYTES;
+
+import java.io.IOException;
+import java.io.OutputStream;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Objects;
+import java.util.Optional;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.parquet.Preconditions;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.io.api.Binary;
+
+/**
+ * `DynamicBlockBloomFilter` contains multiple `BlockSplitBloomFilter` as 
candidates and inserts values in
+ * the candidates at the same time.
+ * The purpose of this is to finally generate a bloom filter with the optimal 
bit size according to the number
+ * of real data distinct values. Use the largest bloom filter as an 
approximate deduplication counter, and then
+ * remove incapable bloom filter candidate during data insertion.
+ */
+public class DynamicBlockBloomFilter implements BloomFilter {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(DynamicBlockBloomFilter.class);
+
+  // multiple candidates, inserting data at the same time. If the distinct 
values are greater than the
+  // expected NDV of candidates, it will be removed. Finally we will choose 
the smallest candidate to write out.
+  private final List<BloomFilterCandidate> candidates = new ArrayList<>();
+
+  // the largest among candidates and used as an approximate deduplication 
counter
+  private BloomFilterCandidate maxCandidate;
+
+  // the accumulator of the number of distinct values that have been inserted 
so far
+  private int distinctValueCounter = 0;
+
+  // indicates that the bloom filter candidate has been written out and new 
data should be no longer allowed to be inserted
+  private boolean finalized = false;
+
+  private int maximumBytes = UPPER_BOUND_BYTES;
+  private int minimumBytes = LOWER_BOUND_BYTES;
+  // the hash strategy used in this bloom filter.
+  private final HashStrategy hashStrategy;
+  // the column to build bloom filter
+  private ColumnDescriptor column;
+
+  public DynamicBlockBloomFilter(int numBytes, int candidatesNum, double fpp, 
ColumnDescriptor column) {
+this(numBytes, LOWER_BOUND_BYTES, UPPER_BOUND_BYTES, HashStrategy.XXH64, 
fpp, candidatesNum, column);
+  }
+
+  public DynamicBlockBloomFilter(int numBytes, int maximumBytes, int 
candidatesNum, double fpp, ColumnDescriptor column) {
+this(numBytes, LOWER_BOUND_BYTES, maximumBytes, HashStrategy.XXH64, fpp, 
candidatesNum, column);
+  }
+
+  public DynamicBlockBloomFilter(int numBytes, int minimumBytes, int 
maximumBytes, HashStrategy hashStrategy,
+double fpp, int candidatesNum, ColumnDescriptor column) {
+if (minimumBytes > maximumBytes) {
+  throw new IllegalArgumentException("the minimum bytes should be less or 
equal than maximum bytes");
+}
+
+if (minimumBytes > LOWER_BOUND_BYTES && minimumBytes < UPPER_BOUND_BYTES) {
+  this.minimumBytes = minimumBytes;
+}
+
+if (maximumBytes > LOWER_BOUND_BYTES && maximumBytes < UPPER_BOUND_BYTES) {
+  this.maximumBytes = maximumBytes;
+}
+this.column = column;
+switch (hashStrategy) {
+  case XXH64:
+this.hashStrategy = hashStrategy;
+break;
+  default:
+throw new RuntimeException("Unsupported hash strategy");
+}
+initCandidates(numBytes, candidatesNum, fpp);
+  }
+
+  /**
+   * Given the maximum acceptable bytes size of bloom filter, genera

[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-04-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717583#comment-17717583
 ] 

ASF GitHub Bot commented on PARQUET-2254:
-

yabola commented on code in PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#discussion_r1180179878


##
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/DynamicBlockBloomFilter.java:
##
@@ -0,0 +1,314 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.column.values.bloomfilter;
+
+import static 
org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.LOWER_BOUND_BYTES;
+import static 
org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.UPPER_BOUND_BYTES;
+
+import java.io.IOException;
+import java.io.OutputStream;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Objects;
+import java.util.Optional;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.parquet.Preconditions;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.io.api.Binary;
+
+/**
+ * `DynamicBlockBloomFilter` contains multiple `BlockSplitBloomFilter` as 
candidates and inserts values in
+ * the candidates at the same time.
+ * The purpose of this is to finally generate a bloom filter with the optimal 
bit size according to the number
+ * of real data distinct values. Use the largest bloom filter as an 
approximate deduplication counter, and then
+ * remove incapable bloom filter candidate during data insertion.
+ */
+public class DynamicBlockBloomFilter implements BloomFilter {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(DynamicBlockBloomFilter.class);
+
+  // multiple candidates, inserting data at the same time. If the distinct 
values are greater than the
+  // expected NDV of candidates, it will be removed. Finally we will choose 
the smallest candidate to write out.
+  private final List<BloomFilterCandidate> candidates = new ArrayList<>();
+
+  // the largest among candidates and used as an approximate deduplication 
counter
+  private BloomFilterCandidate maxCandidate;
+
+  // the accumulator of the number of distinct values that have been inserted 
so far
+  private int distinctValueCounter = 0;
+
+  // indicates that the bloom filter candidate has been written out and new 
data should be no longer allowed to be inserted
+  private boolean finalized = false;
+
+  private int maximumBytes = UPPER_BOUND_BYTES;
+  private int minimumBytes = LOWER_BOUND_BYTES;
+  // the hash strategy used in this bloom filter.
+  private final HashStrategy hashStrategy;
+  // the column to build bloom filter
+  private ColumnDescriptor column;
+
+  public DynamicBlockBloomFilter(int numBytes, int candidatesNum, double fpp, 
ColumnDescriptor column) {
+this(numBytes, LOWER_BOUND_BYTES, UPPER_BOUND_BYTES, HashStrategy.XXH64, 
fpp, candidatesNum, column);
+  }
+
+  public DynamicBlockBloomFilter(int numBytes, int maximumBytes, int 
candidatesNum, double fpp, ColumnDescriptor column) {
+this(numBytes, LOWER_BOUND_BYTES, maximumBytes, HashStrategy.XXH64, fpp, 
candidatesNum, column);
+  }
+
+  public DynamicBlockBloomFilter(int numBytes, int minimumBytes, int 
maximumBytes, HashStrategy hashStrategy,
+double fpp, int candidatesNum, ColumnDescriptor column) {
+if (minimumBytes > maximumBytes) {
+  throw new IllegalArgumentException("the minimum bytes should be less or 
equal than maximum bytes");
+}
+
+if (minimumBytes > LOWER_BOUND_BYTES && minimumBytes < UPPER_BOUND_BYTES) {
+  this.minimumBytes = minimumBytes;
+}
+
+if (maximumBytes > LOWER_BOUND_BYTES && maximumBytes < UPPER_BOUND_BYTES) {
+  this.maximumBytes = maximumBytes;
+}
+this.column = column;
+switch (hashStrategy) {
+  case XXH64:
+this.hashStrategy = hashStrategy;
+break;
+  default:
+throw new RuntimeException("Unsupported hash strategy");
+}
+initCandidates(numBytes, candidatesNum, fpp);
+  }
+
+  /**
+   * Given the maximum acceptable bytes size of bloom filter, genera

[jira] [Commented] (PARQUET-2292) Improve default SpecificRecord model selection for Avro{Write,Read}Support

2023-05-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718305#comment-17718305
 ] 

ASF GitHub Bot commented on PARQUET-2292:
-

clairemcginty commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1181768008


##
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##
@@ -169,6 +172,46 @@ public void add(Object value) {
 }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by 
reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath 
or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {

Review Comment:
   hi @gszadovszky ! Is there anything I can do to improve this PR?





> Improve default SpecificRecord model selection for Avro{Write,Read}Support
> --
>
> Key: PARQUET-2292
> URL: https://issues.apache.org/jira/browse/PARQUET-2292
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
>
> AvroWriteSupport/AvroReadSupport can improve the precision of their default 
> `model` selection. Currently they default to new 
> SpecificDataSupplier().get()[0]. This means that SpecificRecord classes that 
> contain logical types will fail out-of-the-box unless a specific 
> DATA_SUPPLIER is configured that contains logical type conversions.
> I think we can improve this and make logical types work by default by 
> defaulting to the value of the `MODEL$` field that every SpecificRecordBase 
> implementation contains, which already contains all the logical conversions 
> for that Avro type. It would require reflection, but that's what the Avro 
> library is already doing to fetch models for Specific types[1].
>  
> [0] 
> [https://github.com/apache/parquet-mr/blob/d38044f5395494e1543581a4b763f624305d3022/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L403-L407]
> [1] 
> https://github.com/apache/avro/blob/release-1.11.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L76-L86



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2281) Bump thrift to 0.18.1

2023-05-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718398#comment-17718398
 ] 

ASF GitHub Bot commented on PARQUET-2281:
-

Jimexist closed pull request #1012: PARQUET-2281 Bump thrift to 0.18.1
URL: https://github.com/apache/parquet-mr/pull/1012




> Bump thrift to 0.18.1
> -
>
> Key: PARQUET-2281
> URL: https://issues.apache.org/jira/browse/PARQUET-2281
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-thrift
>Reporter: Liu Jiayu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2292) Improve default SpecificRecord model selection for Avro{Write,Read}Support

2023-05-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718485#comment-17718485
 ] 

ASF GitHub Bot commented on PARQUET-2292:
-

gszadovszky commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1182232703


##
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##
@@ -169,6 +172,46 @@ public void add(Object value) {
 }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by 
reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath 
or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {
+final Class<?> clazz;
+
+if (schema != null && (schema.getType() == Schema.Type.RECORD || 
schema.getType() == Schema.Type.UNION)) {
+  clazz = SpecificData.get().getClass(schema);
+} else {
+  return null;
+}
+
+final SpecificData model;
+try {
+  final Field modelField = clazz.getDeclaredField("MODEL$");
+  modelField.setAccessible(true);
+
+  model = (SpecificData) modelField.get(null);
+} catch (Exception e) {
+  return null;
+}
+
+try {
+  final String avroVersion = 
Schema.Parser.class.getPackage().getImplementationVersion();
+  // Avro 1.8 doesn't include conversions in the MODEL$ field
+  if (avroVersion.startsWith("1.8.")) {
+final Field conversionsField = clazz.getDeclaredField("conversions");
+conversionsField.setAccessible(true);
+
+final Conversion<?>[] conversions = (Conversion<?>[]) 
conversionsField.get(null);
+
Arrays.stream(conversions).filter(Objects::nonNull).forEach(model::addLogicalTypeConversion);
+  }
+} catch (Exception e) {

Review Comment:
   Similar to the previous comment about logging and catching specific 
exceptions.



##
parquet-avro/src/test/java/org/apache/parquet/avro/TestSpecificReadWrite.java:
##
@@ -237,6 +242,43 @@ public void testAvroReadSchema() throws IOException {
 }
   }
 
+  @Test

Review Comment:
   We need to test both branches of `getModelForSchema` related to the Avro version. If 
the version check gets more complicated, more versions may need to be covered.



##
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##
@@ -169,6 +172,46 @@ public void add(Object value) {
 }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by 
reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath 
or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {

Review Comment:
   TBH I do not have a strong opinion on any. I am fine with the current one if 
it works.



##
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##
@@ -169,6 +172,46 @@ public void add(Object value) {
 }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by 
reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath 
or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {
+final Class<?> clazz;
+
+if (schema != null && (schema.getType() == Schema.Type.RECORD || 
schema.getType() == Schema.Type.UNION)) {
+  clazz = SpecificData.get().getClass(schema);
+} else {
+  return null;
+}
+
+final SpecificData model;
+try {
+  final Field modelField = clazz.getDeclaredField("MODEL$");
+  modelField.setAccessible(true);
+
+  model = (SpecificData) modelField.get(null);
+} catch (Exception e) {
+  return null;
+}
+
+try {
+  final String avroVersion = 
Schema.Parser.class.getPackage().getImplementationVersion();
+  // Avro 1.8 doesn't include conversions in the MODEL$ field
+  if (avroVersion.startsWith("1.8.")) {

Review Comment:
   Since we are using reflection on private members, there are no compatibility 
guarantees. We shall be very careful here. What about Avro versions prior to 
1.8? Also, what if it breaks in the future? Will the related unit test fail for 
future Avro releases (in case of upgrading the Avro version in the pom)?



##
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##
@@ -169,6 +172,46 @@ public void add(Object value) {
 }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by 
reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath 
or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {
+final Class<?> clazz;
+
+if (schema != null && (schema.getType() == Schema.Type.RECORD || 
schema.getType() == Schema.Type.UNION)) {
+

[jira] [Commented] (PARQUET-2292) Improve default SpecificRecord model selection for Avro{Write,Read}Support

2023-05-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718570#comment-17718570
 ] 

ASF GitHub Bot commented on PARQUET-2292:
-

clairemcginty commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1182538420


##
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##
@@ -169,6 +172,46 @@ public void add(Object value) {
 }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by 
reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath 
or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {
+final Class<?> clazz;
+
+if (schema != null && (schema.getType() == Schema.Type.RECORD || 
schema.getType() == Schema.Type.UNION)) {
+  clazz = SpecificData.get().getClass(schema);
+} else {
+  return null;
+}
+
+final SpecificData model;
+try {
+  final Field modelField = clazz.getDeclaredField("MODEL$");
+  modelField.setAccessible(true);
+
+  model = (SpecificData) modelField.get(null);
+} catch (Exception e) {

Review Comment:
   that sounds good! I had planned to add logging originally, but I saw that 
there were 0 log statements in parquet-avro (and the slf4j dependencies for the 
module are 
[scoped](https://github.com/apache/parquet-mr/blob/master/parquet-avro/pom.xml#L76-L80)
 to `test`). So I'd have to import slf4j into the compile scope as well.
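
For reference, the direction being discussed above (catching the specific reflection exceptions rather than `Exception`, plus an slf4j logger once the dependency is moved to compile scope) might look roughly like the sketch below. The class name, method name, log message, and exception list are illustrative assumptions, not the PR's final code.

```java
import java.lang.reflect.Field;

import org.apache.avro.Schema;
import org.apache.avro.specific.SpecificData;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class ModelReflectionSketch {
  private static final Logger LOG = LoggerFactory.getLogger(ModelReflectionSketch.class);

  // Hypothetical variant of the reflective MODEL$ lookup with narrower catches
  // and a debug-level log; a ClassCastException on a malformed MODEL$ would still escape.
  static SpecificData readModelField(Schema schema) {
    final Class<?> clazz = SpecificData.get().getClass(schema);
    if (clazz == null) {
      return null; // generated class not on the classpath
    }
    try {
      final Field modelField = clazz.getDeclaredField("MODEL$");
      modelField.setAccessible(true);
      return (SpecificData) modelField.get(null);
    } catch (ReflectiveOperationException | SecurityException e) {
      LOG.debug("Could not read MODEL$ from {}; falling back to the default model", clazz, e);
      return null;
    }
  }
}
```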





> Improve default SpecificRecord model selection for Avro{Write,Read}Support
> --
>
> Key: PARQUET-2292
> URL: https://issues.apache.org/jira/browse/PARQUET-2292
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
>
> AvroWriteSupport/AvroReadSupport can improve the precision of their default 
> `model` selection. Currently they default to new 
> SpecificDataSupplier().get()[0]. This means that SpecificRecord classes that 
> contain logical types will fail out-of-the-box unless a specific 
> DATA_SUPPLIER is configured that contains logical type conversions.
> I think we can improve this and make logical types work by default by 
> defaulting to the value of the `MODEL$` field that every SpecificRecordBase 
> implementation contains, which already contains all the logical conversions 
> for that Avro type. It would require reflection, but that's what the Avro 
> library is already doing to fetch models for Specific types[1].
>  
> [0] 
> [https://github.com/apache/parquet-mr/blob/d38044f5395494e1543581a4b763f624305d3022/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L403-L407]
> [1] 
> https://github.com/apache/avro/blob/release-1.11.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L76-L86



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2292) Improve default SpecificRecord model selection for Avro{Write,Read}Support

2023-05-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718571#comment-17718571
 ] 

ASF GitHub Bot commented on PARQUET-2292:
-

clairemcginty commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1182544574


##
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##
@@ -169,6 +172,46 @@ public void add(Object value) {
 }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by 
reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath 
or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {
+final Class<?> clazz;
+
+if (schema != null && (schema.getType() == Schema.Type.RECORD || 
schema.getType() == Schema.Type.UNION)) {
+  clazz = SpecificData.get().getClass(schema);
+} else {
+  return null;
+}
+
+final SpecificData model;
+try {
+  final Field modelField = clazz.getDeclaredField("MODEL$");
+  modelField.setAccessible(true);
+
+  model = (SpecificData) modelField.get(null);
+} catch (Exception e) {
+  return null;
+}
+
+try {
+  final String avroVersion = 
Schema.Parser.class.getPackage().getImplementationVersion();
+  // Avro 1.8 doesn't include conversions in the MODEL$ field
+  if (avroVersion.startsWith("1.8.")) {

Review Comment:
   yeah - it's unfortunate that this isn't solvable without reflection. I've 
manually tested this with earlier versions of Avro... let me see if I can 
reshape them into automated tests for `parquet-avro` 👀 





> Improve default SpecificRecord model selection for Avro{Write,Read}Support
> --
>
> Key: PARQUET-2292
> URL: https://issues.apache.org/jira/browse/PARQUET-2292
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
>
> AvroWriteSupport/AvroReadSupport can improve the precision of their default 
> `model` selection. Currently they default to new 
> SpecificDataSupplier().get()[0]. This means that SpecificRecord classes that 
> contain logical types will fail out-of-the-box unless a specific 
> DATA_SUPPLIER is configured that contains logical type conversions.
> I think we can improve this and make logical types work by default by 
> defaulting to the value of the `MODEL$` field that every SpecificRecordBase 
> implementation contains, which already contains all the logical conversions 
> for that Avro type. It would require reflection, but that's what the Avro 
> library is already doing to fetch models for Specific types[1].
>  
> [0] 
> [https://github.com/apache/parquet-mr/blob/d38044f5395494e1543581a4b763f624305d3022/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L403-L407]
> [1] 
> https://github.com/apache/avro/blob/release-1.11.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L76-L86



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-05-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718672#comment-17718672
 ] 

ASF GitHub Bot commented on PARQUET-2276:
-

rdblue commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1182862929


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##
@@ -37,6 +41,39 @@ public class HadoopStreams {
 
   private static final Logger LOG = 
LoggerFactory.getLogger(HadoopStreams.class);
 
+  private static final Class<?> byteBufferReadableClass = getReadableClass();
+  static final Constructor<SeekableInputStream> h2SeekableConstructor = 
getH2SeekableConstructor();

Review Comment:
   Rather than using two static methods, you can use `DynConstructors` instead 
to make this one expression and reduce error handling code:
   
   ```java
   private static final DynConstructors.Ctor<SeekableInputStream> h2streamCtor =
   DynConstructors.Builder(SeekableInputStream.class)
   .impl("org.apache.parquet.hadoop.util.H2SeekableInputStream", 
FSDataInputStream.class)
   .orNull()
   .build()
   
   ...
   if (h2streamCtor != null) {
 return h2streamCtor.newInstance(stream);
   }
   ```





> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Atul Mohan
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> {{ParquetReader.read() fails with the following exception on parquet-mr 
> version 1.13.0 when using hadoop version 2.8.5:}}
> {code:java}
>  java.lang.NoSuchMethodError: 'boolean 
> org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)' 
> at 
> org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74)
>  
> at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49) 
> at 
> org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:787)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) 
> at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162) 
> org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> {code}
>  
>  
>  
> From an initial investigation, it looks like HadoopStreams has started using 
> [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74]
>  but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop 
> 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2292) Improve default SpecificRecord model selection for Avro{Write,Read}Support

2023-05-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718676#comment-17718676
 ] 

ASF GitHub Bot commented on PARQUET-2292:
-

clairemcginty commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1182866610


##
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##
@@ -169,6 +172,46 @@ public void add(Object value) {
 }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by 
reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath 
or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {
+final Class<?> clazz;
+
+if (schema != null && (schema.getType() == Schema.Type.RECORD || 
schema.getType() == Schema.Type.UNION)) {
+  clazz = SpecificData.get().getClass(schema);
+} else {
+  return null;
+}
+
+final SpecificData model;
+try {
+  final Field modelField = clazz.getDeclaredField("MODEL$");
+  modelField.setAccessible(true);
+
+  model = (SpecificData) modelField.get(null);
+} catch (Exception e) {

Review Comment:
   Added clearer logging & null-/error-handling.





> Improve default SpecificRecord model selection for Avro{Write,Read}Support
> --
>
> Key: PARQUET-2292
> URL: https://issues.apache.org/jira/browse/PARQUET-2292
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
>
> AvroWriteSupport/AvroReadSupport can improve the precision of their default 
> `model` selection. Currently they default to new 
> SpecificDataSupplier().get()[0]. This means that SpecificRecord classes that 
> contain logical types will fail out-of-the-box unless a specific 
> DATA_SUPPLIER is configured that contains logical type conversions.
> I think we can improve this and make logical types work by default by 
> defaulting to the value of the `MODEL$` field that every SpecificRecordBase 
> implementation contains, which already contains all the logical conversions 
> for that Avro type. It would require reflection, but that's what the Avro 
> library is already doing to fetch models for Specific types[1].
>  
> [0] 
> [https://github.com/apache/parquet-mr/blob/d38044f5395494e1543581a4b763f624305d3022/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L403-L407]
> [1] 
> https://github.com/apache/avro/blob/release-1.11.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L76-L86



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2292) Improve default SpecificRecord model selection for Avro{Write,Read}Support

2023-05-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718675#comment-17718675
 ] 

ASF GitHub Bot commented on PARQUET-2292:
-

clairemcginty commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1182865579


##
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##
@@ -169,6 +172,46 @@ public void add(Object value) {
 }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by 
reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath 
or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {
+final Class<?> clazz;
+
+if (schema != null && (schema.getType() == Schema.Type.RECORD || 
schema.getType() == Schema.Type.UNION)) {
+  clazz = SpecificData.get().getClass(schema);
+} else {
+  return null;
+}
+
+final SpecificData model;
+try {
+  final Field modelField = clazz.getDeclaredField("MODEL$");
+  modelField.setAccessible(true);
+
+  model = (SpecificData) modelField.get(null);
+} catch (Exception e) {
+  return null;
+}
+
+try {
+  final String avroVersion = 
Schema.Parser.class.getPackage().getImplementationVersion();
+  // Avro 1.8 doesn't include conversions in the MODEL$ field
+  if (avroVersion.startsWith("1.8.")) {

Review Comment:
   ok, I've added a new test suite, `TestAvroRecordConverter`, that thoroughly 
tests `getModelForSchema` with a variety of schema variants and with Avro 
versions 1.{7,8,9,10,11}. To do this, I used Powermock to mock the static 
invocation used to get Avro runtime version, then I ad-hoc compiled Avro 
generated classes using compiler versions 1.{7,8,9,10,11}. Happy to take a 
different approach with testing if you prefer 
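
Independent of the PowerMock setup described above, the simplest shape of such a test (exercising whichever Avro version is on the build classpath) might be a sketch like the following, where `LogicalTypesRecord` is a placeholder for any Avro-generated SpecificRecord with logical-type fields on the test classpath, and the test class name is hypothetical.

```java
package org.apache.parquet.avro; // same-package sketch to reach the package-private method

import static org.junit.Assert.assertNotNull;

import org.apache.avro.specific.SpecificData;
import org.junit.Test;

public class GetModelForSchemaSketchTest {

  // `LogicalTypesRecord` stands in for any generated record with logical types.
  @Test
  public void returnsModelForGeneratedRecord() {
    SpecificData model = AvroRecordConverter.getModelForSchema(LogicalTypesRecord.getClassSchema());
    assertNotNull("expected a reflected MODEL$ for a generated class", model);
  }
}
```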

> Improve default SpecificRecord model selection for Avro{Write,Read}Support
> --
>
> Key: PARQUET-2292
> URL: https://issues.apache.org/jira/browse/PARQUET-2292
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
>
> AvroWriteSupport/AvroReadSupport can improve the precision of their default 
> `model` selection. Currently they default to new 
> SpecificDataSupplier().get()[0]. This means that SpecificRecord classes that 
> contain logical types will fail out-of-the-box unless a specific 
> DATA_SUPPLIER is configured that contains logical type conversions.
> I think we can improve this and make logical types work by default by 
> defaulting to the value of the `MODEL$` field that every SpecificRecordBase 
> implementation contains, which already contains all the logical conversions 
> for that Avro type. It would require reflection, but that's what the Avro 
> library is already doing to fetch models for Specific types[1].
>  
> [0] 
> [https://github.com/apache/parquet-mr/blob/d38044f5395494e1543581a4b763f624305d3022/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L403-L407]
> [1] 
> https://github.com/apache/avro/blob/release-1.11.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L76-L86



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2292) Improve default SpecificRecord model selection for Avro{Write,Read}Support

2023-05-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718678#comment-17718678
 ] 

ASF GitHub Bot commented on PARQUET-2292:
-

clairemcginty commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1182871206


##
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##
@@ -169,6 +172,46 @@ public void add(Object value) {
 }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by 
reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath 
or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {
+final Class<?> clazz;
+
+if (schema != null && (schema.getType() == Schema.Type.RECORD || 
schema.getType() == Schema.Type.UNION)) {
+  clazz = SpecificData.get().getClass(schema);
+} else {
+  return null;
+}
+
+final SpecificData model;
+try {
+  final Field modelField = clazz.getDeclaredField("MODEL$");
+  modelField.setAccessible(true);
+
+  model = (SpecificData) modelField.get(null);
+} catch (Exception e) {
+  return null;
+}
+
+try {
+  final String avroVersion = 
Schema.Parser.class.getPackage().getImplementationVersion();
+  // Avro 1.8 doesn't include conversions in the MODEL$ field
+  if (avroVersion.startsWith("1.8.")) {

Review Comment:
   > Since we are using reflection on private members, there are no 
compatibility guarantees. We shall be very careful here. What about Avro 
versions prior to 1.8? Also, what if it breaks in the future? Will the related 
unit test fail for future Avro releases (in case of upgrading the Avro 
version in the pom)?
   
   so I've tested 1.7 and 1.8; since 1.9 Avro has stably used the `MODEL$` 
field to hold all conversions, so I feel reasonably confident about relying on 
this. If that changes, we'll catch it in the new unit tests 👍 
   
   If you want, I can surround invocations of `getModelForSchema` in a 
try/catch (in `AvroReadSupport`/`AvroWriteSupport`), and just use the default 
SpecificDataSupplier if they throw anything. That way any unexpected behavior 
would just result in logical types not being used.
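
The call-site fallback proposed here would be small. A rough sketch, assuming a hypothetical helper placed in the `org.apache.parquet.avro` package (the reflected method is package-private) and treating any runtime failure as "no model", could look like this; it is not the PR's code.

```java
package org.apache.parquet.avro; // sketch placed here only to reach the package-private method

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;

final class ModelFallbackSketch {
  private ModelFallbackSketch() {}

  // Prefer the reflected MODEL$-based model; on null or any unexpected failure,
  // fall back to the plain supplier, which just means logical-type conversions
  // are not pre-registered (the pre-existing behaviour).
  static GenericData resolveModel(Schema schema) {
    GenericData model = null;
    try {
      model = AvroRecordConverter.getModelForSchema(schema);
    } catch (RuntimeException e) {
      // ignore and fall through to the default supplier
    }
    return model != null ? model : new SpecificDataSupplier().get();
  }
}
```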





> Improve default SpecificRecord model selection for Avro{Write,Read}Support
> --
>
> Key: PARQUET-2292
> URL: https://issues.apache.org/jira/browse/PARQUET-2292
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
>
> AvroWriteSupport/AvroReadSupport can improve the precision of their default 
> `model` selection. Currently they default to new 
> SpecificDataSupplier().get()[0]. This means that SpecificRecord classes that 
> contain logical types will fail out-of-the-box unless a specific 
> DATA_SUPPLIER is configured that contains logical type conversions.
> I think we can improve this and make logical types work by default by 
> defaulting to the value of the `MODEL$` field that every SpecificRecordBase 
> implementation contains, which already contains all the logical conversions 
> for that Avro type. It would require reflection, but that's what the Avro 
> library is already doing to fetch models for Specific types[1].
>  
> [0] 
> [https://github.com/apache/parquet-mr/blob/d38044f5395494e1543581a4b763f624305d3022/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L403-L407]
> [1] 
> https://github.com/apache/avro/blob/release-1.11.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L76-L86



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-05-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718679#comment-17718679
 ] 

ASF GitHub Bot commented on PARQUET-2276:
-

rdblue commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1182872918


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##
@@ -37,6 +41,39 @@ public class HadoopStreams {
 
   private static final Logger LOG = 
LoggerFactory.getLogger(HadoopStreams.class);
 
+  private static final Class<?> byteBufferReadableClass = getReadableClass();
+  static final Constructor<SeekableInputStream> h2SeekableConstructor = 
getH2SeekableConstructor();

Review Comment:
   Looks like you'd need to add the `orNull` handling. I don't see it here: 
https://github.com/apache/parquet-mr/blob/master/parquet-common/src/main/java/org/apache/parquet/util/DynConstructors.java





> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Atul Mohan
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> {{ParquetReader.read() fails with the following exception on parquet-mr 
> version 1.13.0 when using hadoop version 2.8.5:}}
> {code:java}
>  java.lang.NoSuchMethodError: 'boolean 
> org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)' 
> at 
> org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74)
>  
> at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49) 
> at 
> org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:787)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) 
> at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162) 
> org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> {code}
>  
>  
>  
> From an initial investigation, it looks like HadoopStreams has started using 
> [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74]
>  but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop 
> 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-05-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718680#comment-17718680
 ] 

ASF GitHub Bot commented on PARQUET-2276:
-

rdblue commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1182873931


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##
@@ -46,41 +83,91 @@ public class HadoopStreams {
*/
   public static SeekableInputStream wrap(FSDataInputStream stream) {
 Objects.requireNonNull(stream, "Cannot wrap a null input stream");
-if (isWrappedStreamByteBufferReadable(stream)) {
-  return new H2SeekableInputStream(stream);
-} else {
-  return new H1SeekableInputStream(stream);
+
+// Try to check using hasCapabilities(str)
+Boolean hasCapabilitiesResult = 
isWrappedStreamByteBufferReadableHasCapabilities(stream);
+
+// If it is null, then fall back to the old method
+if (hasCapabilitiesResult != null) {
+  if (hasCapabilitiesResult) {
+return new H2SeekableInputStream(stream);
+  } else {
+return new H1SeekableInputStream(stream);
+  }
+}
+
+return isWrappedStreamByteBufferReadableLegacy(stream);
+  }
+
+  /**
+   * Is the inner stream byte buffer readable?
+   * The test is 'the stream is not FSDataInputStream
+   * and implements ByteBufferReadable'
+   *
+   * This logic is only used for Hadoop <2.9.x, and <3.x.x
+   *
+   * @param stream stream to probe
+   * @return A H2SeekableInputStream to access, or H1SeekableInputStream if 
the stream is not seekable
+   */
+  private static SeekableInputStream 
isWrappedStreamByteBufferReadableLegacy(FSDataInputStream stream) {
+InputStream wrapped = stream.getWrappedStream();
+if (wrapped instanceof FSDataInputStream) {
+  LOG.debug("Checking on wrapped stream {} of {} whether is 
ByteBufferReadable", wrapped, stream);
+  return isWrappedStreamByteBufferReadableLegacy(((FSDataInputStream) 
wrapped));
+}
+if (byteBufferReadableClass != null && h2SeekableConstructor != null &&
+  byteBufferReadableClass.isInstance(stream.getWrappedStream())) {
+  try {
+return h2SeekableConstructor.newInstance(stream);
+  } catch (InstantiationException | IllegalAccessException e) {
+LOG.warn("Could not instantiate H2SeekableInputStream, falling back to 
byte array reads", e);
+  } catch (InvocationTargetException e) {
+throw new ParquetDecodingException(
+  "Could not instantiate H2SeekableInputStream", 
e.getTargetException());
+  }
 }
+return new H1SeekableInputStream(stream);
   }
 
   /**
* Is the inner stream byte buffer readable?
-   * The test is "the stream is not FSDataInputStream
+   * The test is 'the stream is not FSDataInputStream
* and implements ByteBufferReadable'
*
* That is: all streams which implement ByteBufferReadable
-   * other than FSDataInputStream successfuly support read(ByteBuffer).
-   * This is true for all filesytem clients the hadoop codebase.
+   * other than FSDataInputStream successfully support read(ByteBuffer).
+   * This is true for all filesystem clients the hadoop codebase.
*
* In hadoop 3.3.0+, the StreamCapabilities probe can be used to
* check this: only those streams which provide the read(ByteBuffer)
* semantics MAY return true for the probe "in:readbytebuffer";
* FSDataInputStream will pass the probe down to the underlying stream.
*
* @param stream stream to probe
-   * @return true if it is safe to a H2SeekableInputStream to access the data
+   * @return true if it is safe to a H2SeekableInputStream to access
+   * the data, null when it cannot be determined
*/
-  private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream 
stream) {
-if (stream.hasCapability("in:readbytebuffer")) {
-  // stream is issuing the guarantee that it implements the
-  // API. Holds for all implementations in hadoop-*
-  // since Hadoop 3.3.0 (HDFS-14111).
-  return true;
+  private static Boolean 
isWrappedStreamByteBufferReadableHasCapabilities(FSDataInputStream stream) {
+Method methodHasCapabilities;
+try {
+  methodHasCapabilities = stream.getClass().getMethod("hasCapability", 
String.class);

Review Comment:
   You can use DynMethods to get an unbound `hasCapability` method. That can be 
done statically, so all you need to do is check whether it is present and call 
it.
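
The "resolve once, statically, then just null-check and call" idea can be shown with plain reflection as well (DynMethods would wrap the same steps). The sketch below is a hypothetical helper, not the PR's code; the class and method names are illustrative.

```java
import java.lang.reflect.Method;

import org.apache.hadoop.fs.FSDataInputStream;

final class HasCapabilityProbeSketch {
  // Resolved once at class-load time: on Hadoop < 2.9.x the method does not exist.
  private static final Method HAS_CAPABILITY = findHasCapability();

  private static Method findHasCapability() {
    try {
      return FSDataInputStream.class.getMethod("hasCapability", String.class);
    } catch (NoSuchMethodException e) {
      return null;
    }
  }

  /** Returns the capability probe result, or null when it cannot be determined. */
  static Boolean probeReadByteBuffer(FSDataInputStream stream) {
    if (HAS_CAPABILITY == null) {
      return null; // caller falls back to the legacy wrapped-stream check
    }
    try {
      return (Boolean) HAS_CAPABILITY.invoke(stream, "in:readbytebuffer");
    } catch (ReflectiveOperationException e) {
      return null;
    }
  }
}
```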





> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Atul Mohan
> 

[jira] [Commented] (PARQUET-2292) Improve default SpecificRecord model selection for Avro{Write,Read}Support

2023-05-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718694#comment-17718694
 ] 

ASF GitHub Bot commented on PARQUET-2292:
-

clairemcginty commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1182943149


##
parquet-avro/pom.xml:
##
@@ -105,6 +110,30 @@
   test-jar
   test
 
+
+  org.mockito
+  mockito-core
+  2.23.0

Review Comment:
   the declared `mockito.version` in the root `pom.xml`, `1.10.19`, is 
incompatible with Powermock 2.0.x but I can't downgrade Powermock to 1.x 
without the test throwing some scary objenesis errors, so I think Powermock 2.x 
is the lowest we can go.





> Improve default SpecificRecord model selection for Avro{Write,Read}Support
> --
>
> Key: PARQUET-2292
> URL: https://issues.apache.org/jira/browse/PARQUET-2292
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
>
> AvroWriteSupport/AvroReadSupport can improve the precision of their default 
> `model` selection. Currently they default to new 
> SpecificDataSupplier().get()[0]. This means that SpecificRecord classes that 
> contain logical types will fail out-of-the-box unless a specific 
> DATA_SUPPLIER is configured that contains logical type conversions.
> I think we can improve this and make logical types work by default by 
> defaulting to the value of the `MODEL$` field that every SpecificRecordBase 
> implementation contains, which already contains all the logical conversions 
> for that Avro type. It would require reflection, but that's what the Avro 
> library is already doing to fetch models for Specific types[1].
>  
> [0] 
> [https://github.com/apache/parquet-mr/blob/d38044f5395494e1543581a4b763f624305d3022/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L403-L407]
> [1] 
> https://github.com/apache/avro/blob/release-1.11.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L76-L86



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-05-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718726#comment-17718726
 ] 

ASF GitHub Bot commented on PARQUET-2276:
-

rdblue commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1183057530


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##
@@ -37,6 +41,39 @@ public class HadoopStreams {
 
   private static final Logger LOG = 
LoggerFactory.getLogger(HadoopStreams.class);
 
+  private static final Class<?> byteBufferReadableClass = getReadableClass();
+  static final Constructor<SeekableInputStream> h2SeekableConstructor = 
getH2SeekableConstructor();

Review Comment:
   Huh, I guess this was how it was before? Nevermind on the refactoring then.





> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Atul Mohan
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> {{ParquetReader.read() fails with the following exception on parquet-mr 
> version 1.13.0 when using hadoop version 2.8.5:}}
> {code:java}
>  java.lang.NoSuchMethodError: 'boolean 
> org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)' 
> at 
> org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74)
>  
> at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49) 
> at 
> org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:787)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) 
> at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162) 
> org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> {code}
>  
>  
>  
> From an initial investigation, it looks like HadoopStreams has started using 
> [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74]
>  but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop 
> 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-05-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718727#comment-17718727
 ] 

ASF GitHub Bot commented on PARQUET-2276:
-

rdblue commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1183059127


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##
@@ -46,41 +83,91 @@ public class HadoopStreams {
*/
   public static SeekableInputStream wrap(FSDataInputStream stream) {
 Objects.requireNonNull(stream, "Cannot wrap a null input stream");
-if (isWrappedStreamByteBufferReadable(stream)) {
-  return new H2SeekableInputStream(stream);
-} else {
-  return new H1SeekableInputStream(stream);
+
+// Try to check using hasCapabilities(str)
+Boolean hasCapabilitiesResult = 
isWrappedStreamByteBufferReadableHasCapabilities(stream);
+
+// If it is null, then fall back to the old method
+if (hasCapabilitiesResult != null) {
+  if (hasCapabilitiesResult) {
+return new H2SeekableInputStream(stream);
+  } else {
+return new H1SeekableInputStream(stream);
+  }
+}
+
+return isWrappedStreamByteBufferReadableLegacy(stream);
+  }
+
+  /**
+   * Is the inner stream byte buffer readable?
+   * The test is 'the stream is not FSDataInputStream
+   * and implements ByteBufferReadable'
+   *
+   * This logic is only used for Hadoop <2.9.x, and <3.x.x
+   *
+   * @param stream stream to probe
+   * @return A H2SeekableInputStream to access, or H1SeekableInputStream if 
the stream is not seekable
+   */
+  private static SeekableInputStream 
isWrappedStreamByteBufferReadableLegacy(FSDataInputStream stream) {
+InputStream wrapped = stream.getWrappedStream();
+if (wrapped instanceof FSDataInputStream) {
+  LOG.debug("Checking on wrapped stream {} of {} whether is 
ByteBufferReadable", wrapped, stream);
+  return isWrappedStreamByteBufferReadableLegacy(((FSDataInputStream) 
wrapped));

Review Comment:
   Why would a FSDataInputStream have another one inside?



##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##
@@ -46,41 +83,91 @@ public class HadoopStreams {
*/
   public static SeekableInputStream wrap(FSDataInputStream stream) {
 Objects.requireNonNull(stream, "Cannot wrap a null input stream");
-if (isWrappedStreamByteBufferReadable(stream)) {
-  return new H2SeekableInputStream(stream);
-} else {
-  return new H1SeekableInputStream(stream);
+
+// Try to check using hasCapabilities(str)
+Boolean hasCapabilitiesResult = 
isWrappedStreamByteBufferReadableHasCapabilities(stream);
+
+// If it is null, then fall back to the old method
+if (hasCapabilitiesResult != null) {
+  if (hasCapabilitiesResult) {
+return new H2SeekableInputStream(stream);
+  } else {
+return new H1SeekableInputStream(stream);
+  }
+}
+
+return isWrappedStreamByteBufferReadableLegacy(stream);
+  }
+
+  /**
+   * Is the inner stream byte buffer readable?
+   * The test is 'the stream is not FSDataInputStream
+   * and implements ByteBufferReadable'
+   *
+   * This logic is only used for Hadoop <2.9.x, and <3.x.x
+   *
+   * @param stream stream to probe
+   * @return A H2SeekableInputStream to access, or H1SeekableInputStream if 
the stream is not seekable
+   */
+  private static SeekableInputStream 
isWrappedStreamByteBufferReadableLegacy(FSDataInputStream stream) {
+InputStream wrapped = stream.getWrappedStream();
+if (wrapped instanceof FSDataInputStream) {
+  LOG.debug("Checking on wrapped stream {} of {} whether is 
ByteBufferReadable", wrapped, stream);
+  return isWrappedStreamByteBufferReadableLegacy(((FSDataInputStream) 
wrapped));
+}
+if (byteBufferReadableClass != null && h2SeekableConstructor != null &&
+  byteBufferReadableClass.isInstance(stream.getWrappedStream())) {
+  try {
+return h2SeekableConstructor.newInstance(stream);
+  } catch (InstantiationException | IllegalAccessException e) {
+LOG.warn("Could not instantiate H2SeekableInputStream, falling back to 
byte array reads", e);
+  } catch (InvocationTargetException e) {
+throw new ParquetDecodingException(
+  "Could not instantiate H2SeekableInputStream", 
e.getTargetException());
+  }
 }
+return new H1SeekableInputStream(stream);
   }
 
   /**
* Is the inner stream byte buffer readable?
-   * The test is "the stream is not FSDataInputStream
+   * The test is 'the stream is not FSDataInputStream
* and implements ByteBufferReadable'
*
* That is: all streams which implement ByteBufferReadable
-   * other than FSDataInputStream successfuly support read(ByteBuffer).
-   * This is true for all filesytem clients the hadoop codebase.
-   * other than FSDataInputStream successfuly support read(ByteBuffer).

[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-05-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718728#comment-17718728
 ] 

ASF GitHub Bot commented on PARQUET-2276:
-

Fokko commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1183066231


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##
@@ -46,41 +83,91 @@ public class HadoopStreams {
*/
   public static SeekableInputStream wrap(FSDataInputStream stream) {
 Objects.requireNonNull(stream, "Cannot wrap a null input stream");
-if (isWrappedStreamByteBufferReadable(stream)) {
-  return new H2SeekableInputStream(stream);
-} else {
-  return new H1SeekableInputStream(stream);
+
+// Try to check using hasCapabilities(str)
+Boolean hasCapabilitiesResult = 
isWrappedStreamByteBufferReadableHasCapabilities(stream);
+
+// If it is null, then fall back to the old method
+if (hasCapabilitiesResult != null) {
+  if (hasCapabilitiesResult) {
+return new H2SeekableInputStream(stream);
+  } else {
+return new H1SeekableInputStream(stream);
+  }
+}
+
+return isWrappedStreamByteBufferReadableLegacy(stream);
+  }
+
+  /**
+   * Is the inner stream byte buffer readable?
+   * The test is 'the stream is not FSDataInputStream
+   * and implements ByteBufferReadable'
+   *
+   * This logic is only used for Hadoop <2.9.x, and <3.x.x
+   *
+   * @param stream stream to probe
+   * @return A H2SeekableInputStream to access, or H1SeekableInputStream if 
the stream is not seekable
+   */
+  private static SeekableInputStream 
isWrappedStreamByteBufferReadableLegacy(FSDataInputStream stream) {
+InputStream wrapped = stream.getWrappedStream();
+if (wrapped instanceof FSDataInputStream) {
+  LOG.debug("Checking on wrapped stream {} of {} whether is 
ByteBufferReadable", wrapped, stream);
+  return isWrappedStreamByteBufferReadableLegacy(((FSDataInputStream) 
wrapped));

Review Comment:
   This came from the issue from Presto: 
https://github.com/prestodb/presto/pull/17435





> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Atul Mohan
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> {{ParquetReader.read() fails with the following exception on parquet-mr 
> version 1.13.0 when using hadoop version 2.8.5:}}
> {code:java}
>  java.lang.NoSuchMethodError: 'boolean 
> org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)' 
> at 
> org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74)
>  
> at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49) 
> at 
> org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:787)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) 
> at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162) 
> org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> {code}
>  
>  
>  
> From an initial investigation, it looks like HadoopStreams has started using 
> [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74]
>  but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop 
> 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-05-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718735#comment-17718735
 ] 

ASF GitHub Bot commented on PARQUET-2276:
-

Fokko commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1183079903


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##
@@ -37,6 +41,39 @@ public class HadoopStreams {
 
   private static final Logger LOG = 
LoggerFactory.getLogger(HadoopStreams.class);
 
+  private static final Class byteBufferReadableClass = getReadableClass();
+  static final Constructor h2SeekableConstructor = 
getH2SeekableConstructor();

Review Comment:
   Yes, I copied most from the old code to avoid refactoring. I think we can 
greatly simplify it because it was still taking Hadoop1 into account. We still 
have to check if the wrapped stream is ByteBufferReadable: 
https://github.com/apache/hadoop/blob/release-2.4.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FSDataInputStream.java#L142-L148
   
   The `hasCapabilities` check does the same but in a more elegant way.
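   To make the two probes concrete, here is a minimal, self-contained sketch of the 
order discussed above: first ask the stream itself via a reflective 
`hasCapability("in:readbytebuffer")` call (the method does not exist before Hadoop 
2.9), and only when that API is missing fall back to an instanceof-style check. 
Class and method names other than `hasCapability` and the capability string are 
illustrative, not the parquet-mr implementation.

{code:java}
import java.io.InputStream;
import java.lang.reflect.Method;

public class ByteBufferReadProbe {

  /** Returns the hasCapability answer, or null when the API does not exist (Hadoop < 2.9). */
  static Boolean probeCapability(InputStream stream) {
    try {
      Method hasCapability = stream.getClass().getMethod("hasCapability", String.class);
      return (Boolean) hasCapability.invoke(stream, "in:readbytebuffer");
    } catch (ReflectiveOperationException e) {
      return null; // no hasCapability: the caller must use the legacy check
    }
  }

  /** byteBufferReadable stands in for org.apache.hadoop.fs.ByteBufferReadable, loaded reflectively. */
  static boolean supportsByteBufferRead(InputStream stream, Class<?> byteBufferReadable) {
    Boolean capability = probeCapability(stream);
    if (capability != null) {
      return capability; // authoritative answer from the stream itself
    }
    // Legacy fallback: rely on the marker interface of the stream.
    return byteBufferReadable != null && byteBufferReadable.isInstance(stream);
  }
}
{code}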





> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Atul Mohan
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> {{ParquetReader.read() fails with the following exception on parquet-mr 
> version 1.13.0 when using hadoop version 2.8.5:}}
> {code:java}
>  java.lang.NoSuchMethodError: 'boolean 
> org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)' 
> at 
> org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74)
>  
> at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49) 
> at 
> org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:787)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) 
> at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162) 
> org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> {code}
>  
>  
>  
> From an initial investigation, it looks like HadoopStreams has started using 
> [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74]
>  but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop 
> 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-05-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718736#comment-17718736
 ] 

ASF GitHub Bot commented on PARQUET-2276:
-

Fokko commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1183080135


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##
@@ -46,41 +83,91 @@ public class HadoopStreams {
*/
   public static SeekableInputStream wrap(FSDataInputStream stream) {
 Objects.requireNonNull(stream, "Cannot wrap a null input stream");
-if (isWrappedStreamByteBufferReadable(stream)) {
-  return new H2SeekableInputStream(stream);
-} else {
-  return new H1SeekableInputStream(stream);
+
+// Try to check using hasCapabilities(str)
+Boolean hasCapabilitiesResult = 
isWrappedStreamByteBufferReadableHasCapabilities(stream);
+
+// If it is null, then fall back to the old method
+if (hasCapabilitiesResult != null) {
+  if (hasCapabilitiesResult) {
+return new H2SeekableInputStream(stream);
+  } else {
+return new H1SeekableInputStream(stream);
+  }
+}
+
+return isWrappedStreamByteBufferReadableLegacy(stream);
+  }
+
+  /**
+   * Is the inner stream byte buffer readable?
+   * The test is 'the stream is not FSDataInputStream
+   * and implements ByteBufferReadable'
+   *
+   * This logic is only used for Hadoop <2.9.x, and <3.x.x
+   *
+   * @param stream stream to probe
+   * @return A H2SeekableInputStream to access, or H1SeekableInputStream if 
the stream is not seekable
+   */
+  private static SeekableInputStream 
isWrappedStreamByteBufferReadableLegacy(FSDataInputStream stream) {
+InputStream wrapped = stream.getWrappedStream();
+if (wrapped instanceof FSDataInputStream) {
+  LOG.debug("Checking on wrapped stream {} of {} whether is 
ByteBufferReadable", wrapped, stream);
+  return isWrappedStreamByteBufferReadableLegacy(((FSDataInputStream) 
wrapped));
+}
+if (byteBufferReadableClass != null && h2SeekableConstructor != null &&
+  byteBufferReadableClass.isInstance(stream.getWrappedStream())) {
+  try {
+return h2SeekableConstructor.newInstance(stream);
+  } catch (InstantiationException | IllegalAccessException e) {
+LOG.warn("Could not instantiate H2SeekableInputStream, falling back to 
byte array reads", e);
+  } catch (InvocationTargetException e) {
+throw new ParquetDecodingException(
+  "Could not instantiate H2SeekableInputStream", 
e.getTargetException());
+  }
 }
+return new H1SeekableInputStream(stream);
   }
 
   /**
* Is the inner stream byte buffer readable?
-   * The test is "the stream is not FSDataInputStream
+   * The test is 'the stream is not FSDataInputStream
* and implements ByteBufferReadable'
*
* That is: all streams which implement ByteBufferReadable
-   * other than FSDataInputStream successfuly support read(ByteBuffer).
-   * This is true for all filesytem clients the hadoop codebase.
+   * other than FSDataInputStream successfully support read(ByteBuffer).
+   * This is true for all filesystem clients the hadoop codebase.
*
* In hadoop 3.3.0+, the StreamCapabilities probe can be used to
* check this: only those streams which provide the read(ByteBuffer)
* semantics MAY return true for the probe "in:readbytebuffer";
* FSDataInputStream will pass the probe down to the underlying stream.
*
* @param stream stream to probe
-   * @return true if it is safe to a H2SeekableInputStream to access the data
+   * @return true if it is safe to a H2SeekableInputStream to access
+   * the data, null when it cannot be determined
*/
-  private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream 
stream) {
-if (stream.hasCapability("in:readbytebuffer")) {
-  // stream is issuing the guarantee that it implements the
-  // API. Holds for all implementations in hadoop-*
-  // since Hadoop 3.3.0 (HDFS-14111).
-  return true;
+  private static Boolean 
isWrappedStreamByteBufferReadableHasCapabilities(FSDataInputStream stream) {
+Method methodHasCapabilities;
+try {
+  methodHasCapabilities = stream.getClass().getMethod("hasCapability", 
String.class);

Review Comment:
   Added it





> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Atul Mohan
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> {{ParquetReader.read() fails with the following

[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-05-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718738#comment-17718738
 ] 

ASF GitHub Bot commented on PARQUET-2276:
-

Fokko commented on PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#issuecomment-1532200533

   > I think this should work after comparing it with older code, but it seems 
like there are some easy improvements to me.
   
   @rdblue I agree, I did some cleaning up. Let me know what you think.




> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Atul Mohan
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> {{ParquetReader.read() fails with the following exception on parquet-mr 
> version 1.13.0 when using hadoop version 2.8.5:}}
> {code:java}
>  java.lang.NoSuchMethodError: 'boolean 
> org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)' 
> at 
> org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74)
>  
> at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49) 
> at 
> org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:787)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) 
> at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162) 
> org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> {code}
>  
>  
>  
> From an initial investigation, it looks like HadoopStreams has started using 
> [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74]
>  but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop 
> 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-05-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718985#comment-17718985
 ] 

ASF GitHub Bot commented on PARQUET-2276:
-

Fokko commented on PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#issuecomment-1533328928

   FWIW, I also ran the Iceberg tests and it ran fine (except the bloom filter 
ones, more details 
[here](https://github.com/apache/iceberg/pull/7301#issuecomment-1528874719)).




> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Atul Mohan
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> {{ParquetReader.read() fails with the following exception on parquet-mr 
> version 1.13.0 when using hadoop version 2.8.5:}}
> {code:java}
>  java.lang.NoSuchMethodError: 'boolean 
> org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)' 
> at 
> org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74)
>  
> at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49) 
> at 
> org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:787)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) 
> at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162) 
> org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> {code}
>  
>  
>  
> From an initial investigation, it looks like HadoopStreams has started using 
> [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74]
>  but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop 
> 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2292) Improve default SpecificRecord model selection for Avro{Write,Read}Support

2023-05-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719113#comment-17719113
 ] 

ASF GitHub Bot commented on PARQUET-2292:
-

wgtmac commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1184468696


##
parquet-avro/pom.xml:
##
@@ -105,6 +110,30 @@
   test-jar
   test
 
+
+  org.mockito
+  mockito-core
+  2.23.0

Review Comment:
   I think this is fine. BTW, would it be compatible to upgrade this in the root 
pom as well?





> Improve default SpecificRecord model selection for Avro{Write,Read}Support
> --
>
> Key: PARQUET-2292
> URL: https://issues.apache.org/jira/browse/PARQUET-2292
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
>
> AvroWriteSupport/AvroReadSupport can improve the precision of their default 
> `model` selection. Currently they default to new 
> SpecificDataSupplier().get()[0]. This means that SpecificRecord classes that 
> contain logical types will fail out-of-the-box unless a specific 
> DATA_SUPPLIER is configured that contains logical type conversions.
> I think we can improve this and make logical types work by default by 
> defaulting to the value of the `MODEL$` field that every SpecificRecordBase 
> implementation contains, which already contains all the logical conversions 
> for that Avro type. It would require reflection, but that's what the Avro 
> library is already doing to fetch models for Specific types[1].
>  
> [0] 
> [https://github.com/apache/parquet-mr/blob/d38044f5395494e1543581a4b763f624305d3022/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L403-L407]
> [1] 
> https://github.com/apache/avro/blob/release-1.11.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L76-L86



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2292) Improve default SpecificRecord model selection for Avro{Write,Read}Support

2023-05-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719149#comment-17719149
 ] 

ASF GitHub Bot commented on PARQUET-2292:
-

wgtmac commented on PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#issuecomment-1534100022

   @clairemcginty Could you make the CIs happy?




> Improve default SpecificRecord model selection for Avro{Write,Read}Support
> --
>
> Key: PARQUET-2292
> URL: https://issues.apache.org/jira/browse/PARQUET-2292
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
>
> AvroWriteSupport/AvroReadSupport can improve the precision of their default 
> `model` selection. Currently they default to new 
> SpecificDataSupplier().get()[0]. This means that SpecificRecord classes that 
> contain logical types will fail out-of-the-box unless a specific 
> DATA_SUPPLIER is configured that contains logical type conversions.
> I think we can improve this and make logical types work by default by 
> defaulting to the value of the `MODEL$` field that every SpecificRecordBase 
> implementation contains, which already contains all the logical conversions 
> for that Avro type. It would require reflection, but that's what the Avro 
> library is already doing to fetch models for Specific types[1].
>  
> [0] 
> [https://github.com/apache/parquet-mr/blob/d38044f5395494e1543581a4b763f624305d3022/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L403-L407]
> [1] 
> https://github.com/apache/avro/blob/release-1.11.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L76-L86



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2297) Encrypted files should not be checked for delta encoding problem

2023-05-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719170#comment-17719170
 ] 

ASF GitHub Bot commented on PARQUET-2297:
-

ggershinsky opened a new pull request, #1089:
URL: https://github.com/apache/parquet-mr/pull/1089

   https://issues.apache.org/jira/browse/PARQUET-2297




> Encrypted files should not be checked for delta encoding problem
> 
>
> Key: PARQUET-2297
> URL: https://issues.apache.org/jira/browse/PARQUET-2297
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> Delta encoding problem (https://issues.apache.org/jira/browse/PARQUET-246) 
> was fixed in writers since parquet-mr-1.8. This fix also added a 
> `checkDeltaByteArrayProblem` method in readers, that runs over all columns 
> and checks for this problem in older files. 
> This now triggers an unrelated exception when reading encrypted files, in the 
> following situation: trying to read an unencrypted column, without having 
> keys for encrypted columns (see 
> https://issues.apache.org/jira/browse/PARQUET-2193). This happens in Spark, 
> with nested columns (files with regular columns are ok).
> Possible solution: don't call the `checkDeltaByteArrayProblem` method for 
> encrypted files - because these files can be written only with 
> parquet-mr-1.12 and newer, where the delta encoding problem is already fixed.
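As a rough sketch of the proposed guard (the type and method names below are 
illustrative stand-ins, not the actual parquet-mr signatures): the legacy 
PARQUET-246 column scan is simply skipped whenever the footer reports an 
encrypted file, since such a file can only have been written by parquet-mr 1.12 
or newer.

{code:java}
class DeltaCheckGuard {

  /** Stand-in for the real file metadata object; only the flag needed here. */
  interface FileMeta {
    boolean isEncrypted();
  }

  /** Run the legacy delta-encoding (PARQUET-246) scan only for unencrypted files. */
  static boolean shouldCheckDeltaByteArrayProblem(FileMeta meta) {
    return !meta.isEncrypted();
  }
}
{code}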



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-05-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719298#comment-17719298
 ] 

ASF GitHub Bot commented on PARQUET-2276:
-

steveloughran commented on PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#issuecomment-1534680365

   I repeat my stance on this: to claim hadoop 2.7 runtime compatibility you 
should be building against java7. If you don't, make it clear that it is fairly 
qualified support ("hadoop 2.7.3 on java8 only") and don't worry about the bits 
of hadoop which break when they try to do that (kerberos, s3a, anything with 
joda-time, ...)




> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Atul Mohan
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> {{ParquetReader.read() fails with the following exception on parquet-mr 
> version 1.13.0 when using hadoop version 2.8.5:}}
> {code:java}
>  java.lang.NoSuchMethodError: 'boolean 
> org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)' 
> at 
> org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74)
>  
> at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49) 
> at 
> org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:787)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) 
> at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162) 
> org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> {code}
>  
>  
>  
> From an initial investigation, it looks like HadoopStreams has started using 
> [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74]
>  but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop 
> 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-05-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719302#comment-17719302
 ] 

ASF GitHub Bot commented on PARQUET-2276:
-

steveloughran commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1184944471


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##
@@ -46,41 +83,91 @@ public class HadoopStreams {
*/
   public static SeekableInputStream wrap(FSDataInputStream stream) {
 Objects.requireNonNull(stream, "Cannot wrap a null input stream");
-if (isWrappedStreamByteBufferReadable(stream)) {
-  return new H2SeekableInputStream(stream);
-} else {
-  return new H1SeekableInputStream(stream);
+
+// Try to check using hasCapabilities(str)
+Boolean hasCapabilitiesResult = 
isWrappedStreamByteBufferReadableHasCapabilities(stream);
+
+// If it is null, then fall back to the old method
+if (hasCapabilitiesResult != null) {
+  if (hasCapabilitiesResult) {
+return new H2SeekableInputStream(stream);
+  } else {
+return new H1SeekableInputStream(stream);
+  }
+}
+
+return isWrappedStreamByteBufferReadableLegacy(stream);
+  }
+
+  /**
+   * Is the inner stream byte buffer readable?
+   * The test is 'the stream is not FSDataInputStream
+   * and implements ByteBufferReadable'
+   *
+   * This logic is only used for Hadoop <2.9.x, and <3.x.x
+   *
+   * @param stream stream to probe
+   * @return A H2SeekableInputStream to access, or H1SeekableInputStream if 
the stream is not seekable
+   */
+  private static SeekableInputStream 
isWrappedStreamByteBufferReadableLegacy(FSDataInputStream stream) {
+InputStream wrapped = stream.getWrappedStream();
+if (wrapped instanceof FSDataInputStream) {
+  LOG.debug("Checking on wrapped stream {} of {} whether is 
ByteBufferReadable", wrapped, stream);
+  return isWrappedStreamByteBufferReadableLegacy(((FSDataInputStream) 
wrapped));

Review Comment:
   you can if you try hard, it's just really unusual
   
   you can never wrap an instance by itself.
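   A hedged illustration of that case, assuming only the public FSDataInputStream 
API: a decorating FileSystem can hand back a stream that wraps another 
FSDataInputStream, and unwrapping one layer at a time always terminates precisely 
because a stream can never wrap itself.

{code:java}
import java.io.InputStream;
import org.apache.hadoop.fs.FSDataInputStream;

class UnwrapExample {

  /** Walks down nested FSDataInputStream wrappers until the raw stream is reached. */
  static InputStream innermost(FSDataInputStream stream) {
    InputStream current = stream.getWrappedStream();
    while (current instanceof FSDataInputStream) {
      // Each iteration strips exactly one wrapper, so this loop cannot cycle.
      current = ((FSDataInputStream) current).getWrappedStream();
    }
    return current;
  }
}
{code}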





> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Atul Mohan
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> {{ParquetReader.read() fails with the following exception on parquet-mr 
> version 1.13.0 when using hadoop version 2.8.5:}}
> {code:java}
>  java.lang.NoSuchMethodError: 'boolean 
> org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)' 
> at 
> org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74)
>  
> at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49) 
> at 
> org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:787)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) 
> at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162) 
> org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> {code}
>  
>  
>  
> From an initial investigation, it looks like HadoopStreams has started using 
> [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74]
>  but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop 
> 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2292) Improve default SpecificRecord model selection for Avro{Write,Read}Support

2023-05-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719317#comment-17719317
 ] 

ASF GitHub Bot commented on PARQUET-2292:
-

clairemcginty commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1184991179


##
parquet-avro/pom.xml:
##
@@ -105,6 +110,30 @@
   test-jar
   test
 
+
+  org.mockito
+  mockito-core
+  2.23.0

Review Comment:
   It should be, but it does require an artifact migration in a few modules 
(parquet-column and parquet-hadoop depend on mockito-all, which has moved to 
mockito-core in 2.x). I would be happy to do it, just not sure if it's in scope 
for this PR :) 





> Improve default SpecificRecord model selection for Avro{Write,Read}Support
> --
>
> Key: PARQUET-2292
> URL: https://issues.apache.org/jira/browse/PARQUET-2292
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
>
> AvroWriteSupport/AvroReadSupport can improve the precision of their default 
> `model` selection. Currently they default to new 
> SpecificDataSupplier().get()[0]. This means that SpecificRecord classes that 
> contain logical types will fail out-of-the-box unless a specific 
> DATA_SUPPLIER is configured that contains logical type conversions.
> I think we can improve this and make logical types work by default by 
> defaulting to the value of the `MODEL$` field that every SpecificRecordBase 
> implementation contains, which already contains all the logical conversions 
> for that Avro type. It would require reflection, but that's what the Avro 
> library is already doing to fetch models for Specific types[1].
>  
> [0] 
> [https://github.com/apache/parquet-mr/blob/d38044f5395494e1543581a4b763f624305d3022/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L403-L407]
> [1] 
> https://github.com/apache/avro/blob/release-1.11.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L76-L86



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2292) Improve default SpecificRecord model selection for Avro{Write,Read}Support

2023-05-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719335#comment-17719335
 ] 

ASF GitHub Bot commented on PARQUET-2292:
-

clairemcginty commented on PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#issuecomment-1534819205

   @wgtmac, sorry about that 

> Improve default SpecificRecord model selection for Avro{Write,Read}Support
> --
>
> Key: PARQUET-2292
> URL: https://issues.apache.org/jira/browse/PARQUET-2292
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
>
> AvroWriteSupport/AvroReadSupport can improve the precision of their default 
> `model` selection. Currently they default to new 
> SpecificDataSupplier().get()[0]. This means that SpecificRecord classes that 
> contain logical types will fail out-of-the-box unless a specific 
> DATA_SUPPLIER is configured that contains logical type conversions.
> I think we can improve this and make logical types work by default by 
> defaulting to the value of the `MODEL$` field that every SpecificRecordBase 
> implementation contains, which already contains all the logical conversions 
> for that Avro type. It would require reflection, but that's what the Avro 
> library is already doing to fetch models for Specific types[1].
>  
> [0] 
> [https://github.com/apache/parquet-mr/blob/d38044f5395494e1543581a4b763f624305d3022/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L403-L407]
> [1] 
> https://github.com/apache/avro/blob/release-1.11.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L76-L86



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2292) Improve default SpecificRecord model selection for Avro{Write,Read}Support

2023-05-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719366#comment-17719366
 ] 

ASF GitHub Bot commented on PARQUET-2292:
-

gszadovszky commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1185124781


##
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##
@@ -169,6 +172,46 @@ public void add(Object value) {
 }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by 
reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath 
or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {
+final Class clazz;
+
+if (schema != null && (schema.getType() == Schema.Type.RECORD || 
schema.getType() == Schema.Type.UNION)) {
+  clazz = SpecificData.get().getClass(schema);
+} else {
+  return null;
+}
+
+final SpecificData model;
+try {
+  final Field modelField = clazz.getDeclaredField("MODEL$");
+  modelField.setAccessible(true);
+
+  model = (SpecificData) modelField.get(null);
+} catch (Exception e) {
+  return null;
+}
+
+try {
+  final String avroVersion = 
Schema.Parser.class.getPackage().getImplementationVersion();
+  // Avro 1.8 doesn't include conversions in the MODEL$ field
+  if (avroVersion.startsWith("1.8.")) {

Review Comment:
   > so I've tested 1.7 and 1.8; since 1.9 Avro has stably used the `MODEL$` 
field to hold all conversions, so I feel reasonably confident about relying on 
this. If that changes, we'll catch it in the new unit tests +1
   
   This sounds perfect to me. Thanks a lot for the additional work!
   
   > If you want, I can surround invocations of `getModelForSchema` in a 
try/catch (in `AvroReadSupport`/`AvroWriteSupport`), and just use the default 
SpecificDataSupplier if they throw anything. That way any unexpected behavior 
would just result in logical types not being used.
   
   Yes, I think this fallback mechanism sounds reasonable to me.
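   A minimal sketch of that fallback, assuming nothing beyond Avro's public 
SpecificData API: the Supplier stands in for the PR's getModelForSchema(schema) 
helper, and any failure or null result degrades to the previous default model, so 
the worst case is simply that logical-type conversions are not applied.

{code:java}
import java.util.function.Supplier;
import org.apache.avro.specific.SpecificData;

final class ModelFallback {

  /** modelForSchema stands in for the reflective MODEL$ lookup added in this PR. */
  static SpecificData resolveModel(Supplier<SpecificData> modelForSchema) {
    try {
      SpecificData reflected = modelForSchema.get();
      if (reflected != null) {
        return reflected;
      }
    } catch (RuntimeException e) {
      // Unexpected reflection/classpath issue: behave exactly as before the change.
    }
    return SpecificData.get(); // default model without extra logical-type conversions
  }
}
{code}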
   
   





> Improve default SpecificRecord model selection for Avro{Write,Read}Support
> --
>
> Key: PARQUET-2292
> URL: https://issues.apache.org/jira/browse/PARQUET-2292
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
>
> AvroWriteSupport/AvroReadSupport can improve the precision of their default 
> `model` selection. Currently they default to new 
> SpecificDataSupplier().get()[0]. This means that SpecificRecord classes that 
> contain logical types will fail out-of-the-box unless a specific 
> DATA_SUPPLIER is configured that contains logical type conversions.
> I think we can improve this and make logical types work by default by 
> defaulting to the value of the `MODEL$` field that every SpecificRecordBase 
> implementation contains, which already contains all the logical conversions 
> for that Avro type. It would require reflection, but that's what the Avro 
> library is already doing to fetch models for Specific types[1].
>  
> [0] 
> [https://github.com/apache/parquet-mr/blob/d38044f5395494e1543581a4b763f624305d3022/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L403-L407]
> [1] 
> https://github.com/apache/avro/blob/release-1.11.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L76-L86



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2292) Improve default SpecificRecord model selection for Avro{Write,Read}Support

2023-05-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719365#comment-17719365
 ] 

ASF GitHub Bot commented on PARQUET-2292:
-

gszadovszky commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1185124781


##
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##
@@ -169,6 +172,46 @@ public void add(Object value) {
 }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by 
reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath 
or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {
+final Class clazz;
+
+if (schema != null && (schema.getType() == Schema.Type.RECORD || 
schema.getType() == Schema.Type.UNION)) {
+  clazz = SpecificData.get().getClass(schema);
+} else {
+  return null;
+}
+
+final SpecificData model;
+try {
+  final Field modelField = clazz.getDeclaredField("MODEL$");
+  modelField.setAccessible(true);
+
+  model = (SpecificData) modelField.get(null);
+} catch (Exception e) {
+  return null;
+}
+
+try {
+  final String avroVersion = 
Schema.Parser.class.getPackage().getImplementationVersion();
+  // Avro 1.8 doesn't include conversions in the MODEL$ field
+  if (avroVersion.startsWith("1.8.")) {

Review Comment:
   > so I've tested 1.7 and 1.8; since 1.9 Avro has stably used the `MODEL$` 
field to hold all conversions, so I feel reasonably confident about relying on 
this. If that changes, we'll catch it in the new unit tests +1
   This sounds perfect to me. Thanks a lot for the additional work!
   
   > If you want, I can surround invocations of `getModelForSchema` in a 
try/catch (in `AvroReadSupport`/`AvroWriteSupport`), and just use the default 
SpecificDataSupplier if they throw anything. That way any unexpected behavior 
would just result in logical types not being used.
   Yes, I think this fallback mechanism sounds reasonable to me.
   
   





> Improve default SpecificRecord model selection for Avro{Write,Read}Support
> --
>
> Key: PARQUET-2292
> URL: https://issues.apache.org/jira/browse/PARQUET-2292
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
>
> AvroWriteSupport/AvroReadSupport can improve the precision of their default 
> `model` selection. Currently they default to new 
> SpecificDataSupplier().get()[0]. This means that SpecificRecord classes that 
> contain logical types will fail out-of-the-box unless a specific 
> DATA_SUPPLIER is configured that contains logical type conversions.
> I think we can improve this and make logical types work by default by 
> defaulting to the value of the `MODEL$` field that every SpecificRecordBase 
> implementation contains, which already contains all the logical conversions 
> for that Avro type. It would require reflection, but that's what the Avro 
> library is already doing to fetch models for Specific types[1].
>  
> [0] 
> [https://github.com/apache/parquet-mr/blob/d38044f5395494e1543581a4b763f624305d3022/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L403-L407]
> [1] 
> https://github.com/apache/avro/blob/release-1.11.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L76-L86



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-05-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719387#comment-17719387
 ] 

ASF GitHub Bot commented on PARQUET-2276:
-

shangxinli commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1185179904


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##
@@ -46,7 +54,39 @@ public class HadoopStreams {
*/
   public static SeekableInputStream wrap(FSDataInputStream stream) {
 Objects.requireNonNull(stream, "Cannot wrap a null input stream");
-if (isWrappedStreamByteBufferReadable(stream)) {
+
+// Try to check using hasCapabilities(str)
+Boolean hasCapabilitiesResult = 
isWrappedStreamByteBufferReadableHasCapabilities(stream);
+
+// If it is null, then fall back to the old method
+if (hasCapabilitiesResult != null) {
+  if (hasCapabilitiesResult) {
+return new H2SeekableInputStream(stream);
+  } else {
+return new H1SeekableInputStream(stream);
+  }
+}
+
+return isWrappedStreamByteBufferReadableLegacy(stream);
+  }
+
+  /**
+   * Is the inner stream byte buffer readable?
+   * The test is 'the stream is not FSDataInputStream
+   * and implements ByteBufferReadable'
+   *
+   * This logic is only used for Hadoop <2.9.x, and <3.x.x
+   *
+   * @param stream stream to probe
+   * @return A H2SeekableInputStream to access, or H1SeekableInputStream if 
the stream is not seekable
+   */
+  private static SeekableInputStream 
isWrappedStreamByteBufferReadableLegacy(FSDataInputStream stream) {
+InputStream wrapped = stream.getWrappedStream();
+if (wrapped instanceof FSDataInputStream) {
+  LOG.debug("Checking on wrapped stream {} of {} whether is 
ByteBufferReadable", wrapped, stream);

Review Comment:
   It is good to add this debug log: if the recursive method ever causes an 
infinite loop, we can enable this logging and debug it. 





> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Atul Mohan
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> {{ParquetReader.read() fails with the following exception on parquet-mr 
> version 1.13.0 when using hadoop version 2.8.5:}}
> {code:java}
>  java.lang.NoSuchMethodError: 'boolean 
> org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)' 
> at 
> org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74)
>  
> at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49) 
> at 
> org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:787)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) 
> at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162) 
> org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> {code}
>  
>  
>  
> From an initial investigation, it looks like HadoopStreams has started using 
> [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74]
>  but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop 
> 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-05-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719388#comment-17719388
 ] 

ASF GitHub Bot commented on PARQUET-2276:
-

shangxinli commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1185181599


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##
@@ -55,23 +95,31 @@ public static SeekableInputStream wrap(FSDataInputStream 
stream) {
 
   /**
* Is the inner stream byte buffer readable?
-   * The test is "the stream is not FSDataInputStream
+   * The test is 'the stream is not FSDataInputStream
* and implements ByteBufferReadable'
*
* That is: all streams which implement ByteBufferReadable
-   * other than FSDataInputStream successfuly support read(ByteBuffer).
-   * This is true for all filesytem clients the hadoop codebase.
+   * other than FSDataInputStream successfully support read(ByteBuffer).
+   * This is true for all filesystem clients the hadoop codebase.
*
* In hadoop 3.3.0+, the StreamCapabilities probe can be used to
* check this: only those streams which provide the read(ByteBuffer)
* semantics MAY return true for the probe "in:readbytebuffer";
* FSDataInputStream will pass the probe down to the underlying stream.
*
* @param stream stream to probe
-   * @return true if it is safe to a H2SeekableInputStream to access the data
+   * @return true if it is safe to a H2SeekableInputStream to access
+   * the data, null when it cannot be determined because of missing 
hasCapabilities
*/
-  private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream 
stream) {
-if (stream.hasCapability("in:readbytebuffer")) {
+  private static Boolean 
isWrappedStreamByteBufferReadableHasCapabilities(FSDataInputStream stream) {

Review Comment:
   The longer method name is more meaningful, but it seems too long. This is a 
minor comment, though, and it doesn't need to be changed if there is no better name. 





> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Atul Mohan
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> {{ParquetReader.read() fails with the following exception on parquet-mr 
> version 1.13.0 when using hadoop version 2.8.5:}}
> {code:java}
>  java.lang.NoSuchMethodError: 'boolean 
> org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)' 
> at 
> org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74)
>  
> at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49) 
> at 
> org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:787)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) 
> at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162) 
> org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> {code}
>  
>  
>  
> From an initial investigation, it looks like HadoopStreams has started using 
> [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74]
>  but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop 
> 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2292) Improve default SpecificRecord model selection for Avro{Write,Read}Support

2023-05-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719404#comment-17719404
 ] 

ASF GitHub Bot commented on PARQUET-2292:
-

clairemcginty commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1185224751


##
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##
@@ -169,6 +172,46 @@ public void add(Object value) {
 }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by 
reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath 
or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {
+final Class clazz;
+
+if (schema != null && (schema.getType() == Schema.Type.RECORD || 
schema.getType() == Schema.Type.UNION)) {
+  clazz = SpecificData.get().getClass(schema);
+} else {
+  return null;
+}
+
+final SpecificData model;
+try {
+  final Field modelField = clazz.getDeclaredField("MODEL$");
+  modelField.setAccessible(true);
+
+  model = (SpecificData) modelField.get(null);
+} catch (Exception e) {
+  return null;
+}
+
+try {
+  final String avroVersion = 
Schema.Parser.class.getPackage().getImplementationVersion();
+  // Avro 1.8 doesn't include conversions in the MODEL$ field
+  if (avroVersion.startsWith("1.8.")) {

Review Comment:
   Great! I added code wrapping `getModelForSchema` in a try/catch at each call 
site.





> Improve default SpecificRecord model selection for Avro{Write,Read}Support
> --
>
> Key: PARQUET-2292
> URL: https://issues.apache.org/jira/browse/PARQUET-2292
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
>
> AvroWriteSupport/AvroReadSupport can improve the precision of their default 
> `model` selection. Currently they default to new 
> SpecificDataSupplier().get()[0]. This means that SpecificRecord classes that 
> contain logical types will fail out-of-the-box unless a specific 
> DATA_SUPPLIER is configured that contains logical type conversions.
> I think we can improve this and make logical types work by default by 
> defaulting to the value of the `MODEL$` field that every SpecificRecordBase 
> implementation contains, which already contains all the logical conversions 
> for that Avro type. It would require reflection, but that's what the Avro 
> library is already doing to fetch models for Specific types[1].
>  
> [0] 
> [https://github.com/apache/parquet-mr/blob/d38044f5395494e1543581a4b763f624305d3022/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L403-L407]
> [1] 
> https://github.com/apache/avro/blob/release-1.11.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L76-L86



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-05-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719464#comment-17719464
 ] 

ASF GitHub Bot commented on PARQUET-2276:
-

Fokko commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1185348113


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##
@@ -46,7 +54,39 @@ public class HadoopStreams {
*/
   public static SeekableInputStream wrap(FSDataInputStream stream) {
 Objects.requireNonNull(stream, "Cannot wrap a null input stream");
-if (isWrappedStreamByteBufferReadable(stream)) {
+
+// Try to check using hasCapabilities(str)
+Boolean hasCapabilitiesResult = 
isWrappedStreamByteBufferReadableHasCapabilities(stream);
+
+// If it is null, then fall back to the old method
+if (hasCapabilitiesResult != null) {
+  if (hasCapabilitiesResult) {
+return new H2SeekableInputStream(stream);
+  } else {
+return new H1SeekableInputStream(stream);
+  }
+}
+
+return isWrappedStreamByteBufferReadableLegacy(stream);
+  }
+
+  /**
+   * Is the inner stream byte buffer readable?
+   * The test is 'the stream is not FSDataInputStream
+   * and implements ByteBufferReadable'
+   *
+   * This logic is only used for Hadoop <2.9.x, and <3.x.x
+   *
+   * @param stream stream to probe
+   * @return A H2SeekableInputStream to access, or H1SeekableInputStream if 
the stream is not seekable
+   */
+  private static SeekableInputStream 
isWrappedStreamByteBufferReadableLegacy(FSDataInputStream stream) {
+InputStream wrapped = stream.getWrappedStream();
+if (wrapped instanceof FSDataInputStream) {
+  LOG.debug("Checking on wrapped stream {} of {} whether is 
ByteBufferReadable", wrapped, stream);

Review Comment:
   Yes, I kept it in there.





> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Atul Mohan
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> {{ParquetReader.read() fails with the following exception on parquet-mr 
> version 1.13.0 when using hadoop version 2.8.5:}}
> {code:java}
>  java.lang.NoSuchMethodError: 'boolean 
> org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)' 
> at 
> org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74)
>  
> at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49) 
> at 
> org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:787)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) 
> at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162) 
> org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> {code}
>  
>  
>  
> From an initial investigation, it looks like HadoopStreams has started using 
> [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74]
>  but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop 
> 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-05-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719465#comment-17719465
 ] 

ASF GitHub Bot commented on PARQUET-2276:
-

Fokko commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1185348684


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##
@@ -55,23 +95,31 @@ public static SeekableInputStream wrap(FSDataInputStream 
stream) {
 
   /**
* Is the inner stream byte buffer readable?
-   * The test is "the stream is not FSDataInputStream
+   * The test is 'the stream is not FSDataInputStream
* and implements ByteBufferReadable'
*
* That is: all streams which implement ByteBufferReadable
-   * other than FSDataInputStream successfuly support read(ByteBuffer).
-   * This is true for all filesytem clients the hadoop codebase.
+   * other than FSDataInputStream successfully support read(ByteBuffer).
+   * This is true for all filesystem clients the hadoop codebase.
*
* In hadoop 3.3.0+, the StreamCapabilities probe can be used to
* check this: only those streams which provide the read(ByteBuffer)
* semantics MAY return true for the probe "in:readbytebuffer";
* FSDataInputStream will pass the probe down to the underlying stream.
*
* @param stream stream to probe
-   * @return true if it is safe to a H2SeekableInputStream to access the data
+   * @return true if it is safe to a H2SeekableInputStream to access
+   * the data, null when it cannot be determined because of missing 
hasCapabilities
*/
-  private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream 
stream) {
-if (stream.hasCapability("in:readbytebuffer")) {
+  private static Boolean 
isWrappedStreamByteBufferReadableHasCapabilities(FSDataInputStream stream) {

Review Comment:
   I agree, I've changed it to `isWrappedStreamByteBufferReadable`





> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Atul Mohan
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> {{ParquetReader.read() fails with the following exception on parquet-mr 
> version 1.13.0 when using hadoop version 2.8.5:}}
> {code:java}
>  java.lang.NoSuchMethodError: 'boolean 
> org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)' 
> at 
> org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74)
>  
> at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49) 
> at 
> org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:787)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) 
> at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162) 
> org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> {code}
>  
>  
>  
> From an initial investigation, it looks like HadoopStreams has started using 
> [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74]
>  but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop 
> 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2292) Improve default SpecificRecord model selection for Avro{Write,Read}Support

2023-05-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719517#comment-17719517
 ] 

ASF GitHub Bot commented on PARQUET-2292:
-

clairemcginty commented on PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#issuecomment-1535331312

   thanks for the review @gszadovszky! very excited to start using this in 
Parquet :) 




> Improve default SpecificRecord model selection for Avro{Write,Read}Support
> --
>
> Key: PARQUET-2292
> URL: https://issues.apache.org/jira/browse/PARQUET-2292
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
>
> AvroWriteSupport/AvroReadSupport can improve the precision of their default 
> `model` selection. Currently they default to new 
> SpecificDataSupplier().get()[0]. This means that SpecificRecord classes that 
> contain logical types will fail out-of-the-box unless a specific 
> DATA_SUPPLIER is configured that contains logical type conversions.
> I think we can improve this and make logical types work by default by 
> defaulting to the value of the `MODEL$` field that every SpecificRecordBase 
> implementation contains, which already contains all the logical conversions 
> for that Avro type. It would require reflection, but that's what the Avro 
> library is already doing to fetch models for Specific types[1].
>  
> [0] 
> [https://github.com/apache/parquet-mr/blob/d38044f5395494e1543581a4b763f624305d3022/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L403-L407]
> [1] 
> https://github.com/apache/avro/blob/release-1.11.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L76-L86
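A minimal sketch of the reflective `MODEL$` lookup described above (this is not the merged patch; it is simplified to the RECORD case and the class and method names are placeholders):

```java
import java.lang.reflect.Field;
import org.apache.avro.Schema;
import org.apache.avro.specific.SpecificData;

final class SpecificModelLookup {
  private SpecificModelLookup() {}

  /**
   * Tries to return the SpecificData model declared as the static MODEL$ field of the
   * generated SpecificRecord class for the given schema. Returns null when the class is
   * not on the classpath or reflection fails, so callers can fall back to a plain
   * SpecificDataSupplier as before.
   */
  static SpecificData getModelForSchema(Schema schema) {
    if (schema == null || schema.getType() != Schema.Type.RECORD) {
      return null;
    }
    Class<?> clazz = SpecificData.get().getClass(schema);
    if (clazz == null) {
      return null;  // generated class not on the classpath
    }
    try {
      Field modelField = clazz.getDeclaredField("MODEL$");
      modelField.setAccessible(true);
      // MODEL$ is static, so the instance argument to get() is null.
      return (SpecificData) modelField.get(null);
    } catch (ReflectiveOperationException e) {
      return null;
    }
  }
}
```

Because the generated SpecificRecordBase classes register their logical-type conversions on `MODEL$` at code-generation time (as the description notes), defaulting to that model is what makes logical types work without configuring a custom DATA_SUPPLIER.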



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2292) Improve default SpecificRecord model selection for Avro{Write,Read}Support

2023-05-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719520#comment-17719520
 ] 

ASF GitHub Bot commented on PARQUET-2292:
-

gszadovszky commented on PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#issuecomment-1535348289

   I wouldn't like to make you sad, @clairemcginty, but we just released `1.13.0` last month, and the previous one was almost a year ago. That said, nothing says we cannot do releases for tiny features; it is more about the effort and who has the time to do it. I usually don't, unfortunately.
   Maybe if you ask nicely, @wgtmac would do another one... :wink:




> Improve default SpecificRecord model selection for Avro{Write,Read}Support
> --
>
> Key: PARQUET-2292
> URL: https://issues.apache.org/jira/browse/PARQUET-2292
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
>
> AvroWriteSupport/AvroReadSupport can improve the precision of their default 
> `model` selection. Currently they default to new 
> SpecificDataSupplier().get()[0]. This means that SpecificRecord classes that 
> contain logical types will fail out-of-the-box unless a specific 
> DATA_SUPPLIER is configured that contains logical type conversions.
> I think we can improve this and make logical types work by default by 
> defaulting to the value of the `MODEL$` field that every SpecificRecordBase 
> implementation contains, which already contains all the logical conversions 
> for that Avro type. It would require reflection, but that's what the Avro 
> library is already doing to fetch models for Specific types[1].
>  
> [0] 
> [https://github.com/apache/parquet-mr/blob/d38044f5395494e1543581a4b763f624305d3022/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L403-L407]
> [1] 
> https://github.com/apache/avro/blob/release-1.11.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L76-L86



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2292) Improve default SpecificRecord model selection for Avro{Write,Read}Support

2023-05-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719526#comment-17719526
 ] 

ASF GitHub Bot commented on PARQUET-2292:
-

clairemcginty commented on PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#issuecomment-1535392207

   Oh no, I'm sorry I missed the boat! 😭 Is there anything that I (or my 
organization) could do to help out with the release, @wgtmac? If it helps at 
all, having this fix released would drastically help us drive Parquet adoption, 
and lead to more OSS contributions down the line :) 




> Improve default SpecificRecord model selection for Avro{Write,Read}Support
> --
>
> Key: PARQUET-2292
> URL: https://issues.apache.org/jira/browse/PARQUET-2292
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
>
> AvroWriteSupport/AvroReadSupport can improve the precision of their default 
> `model` selection. Currently they default to new 
> SpecificDataSupplier().get()[0]. This means that SpecificRecord classes that 
> contain logical types will fail out-of-the-box unless a specific 
> DATA_SUPPLIER is configured that contains logical type conversions.
> I think we can improve this and make logical types work by default by 
> defaulting to the value of the `MODEL$` field that every SpecificRecordBase 
> implementation contains, which already contains all the logical conversions 
> for that Avro type. It would require reflection, but that's what the Avro 
> library is already doing to fetch models for Specific types[1].
>  
> [0] 
> [https://github.com/apache/parquet-mr/blob/d38044f5395494e1543581a4b763f624305d3022/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L403-L407]
> [1] 
> https://github.com/apache/avro/blob/release-1.11.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L76-L86



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2292) Improve default SpecificRecord model selection for Avro{Write,Read}Support

2023-05-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719583#comment-17719583
 ] 

ASF GitHub Bot commented on PARQUET-2292:
-

wgtmac commented on PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#issuecomment-1535595073

   The parquet community is discussing a new `1.13.1` release to address some 
issues introduced by `1.13.0`: 
https://lists.apache.org/thread/1mjvdcmwqjcblmfkfgpd9ob2yodx7tom
   
   As the release manager, what do you think about including this PR in the `1.13.1` release? @Fokko 
   




> Improve default SpecificRecord model selection for Avro{Write,Read}Support
> --
>
> Key: PARQUET-2292
> URL: https://issues.apache.org/jira/browse/PARQUET-2292
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
>
> AvroWriteSupport/AvroReadSupport can improve the precision of their default 
> `model` selection. Currently they default to new 
> SpecificDataSupplier().get()[0]. This means that SpecificRecord classes that 
> contain logical types will fail out-of-the-box unless a specific 
> DATA_SUPPLIER is configured that contains logical type conversions.
> I think we can improve this and make logical types work by default by 
> defaulting to the value of the `MODEL$` field that every SpecificRecordBase 
> implementation contains, which already contains all the logical conversions 
> for that Avro type. It would require reflection, but that's what the Avro 
> library is already doing to fetch models for Specific types[1].
>  
> [0] 
> [https://github.com/apache/parquet-mr/blob/d38044f5395494e1543581a4b763f624305d3022/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L403-L407]
> [1] 
> https://github.com/apache/avro/blob/release-1.11.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L76-L86



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-05-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719591#comment-17719591
 ] 

ASF GitHub Bot commented on PARQUET-2254:
-

wgtmac commented on code in PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#discussion_r1185642897


##
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/AdaptiveBlockSplitBloomFilter.java:
##
@@ -38,57 +38,51 @@
 import org.apache.parquet.io.api.Binary;
 
 /**
- * `DynamicBlockBloomFilter` contains multiple `BlockSplitBloomFilter` as 
candidates and inserts values in
+ * `AdaptiveBlockSplitBloomFilter` contains multiple `BlockSplitBloomFilter` 
as candidates and inserts values in
  * the candidates at the same time.
  * The purpose of this is to finally generate a bloom filter with the optimal 
bit size according to the number
  * of real data distinct values. Use the largest bloom filter as an 
approximate deduplication counter, and then
  * remove incapable bloom filter candidate during data insertion.
  */
-public class DynamicBlockBloomFilter implements BloomFilter {
+public class AdaptiveBlockSplitBloomFilter implements BloomFilter {
 
-  private static final Logger LOG = 
LoggerFactory.getLogger(DynamicBlockBloomFilter.class);
+  private static final Logger LOG = 
LoggerFactory.getLogger(AdaptiveBlockSplitBloomFilter.class);
 
   // multiple candidates, inserting data at the same time. If the distinct 
values are greater than the
   // expected NDV of candidates, it will be removed. Finally we will choose 
the smallest candidate to write out.
  private final List<BloomFilterCandidate> candidates = new ArrayList<>();
 
   // the largest among candidates and used as an approximate deduplication 
counter
-  private BloomFilterCandidate maxCandidate;
+  private BloomFilterCandidate largestCandidate;
 
   // the accumulator of the number of distinct values that have been inserted 
so far
-  private int distinctValueCounter = 0;
+  private long distinctValueCounter = 0;
 
   // indicates that the bloom filter candidate has been written out and new 
data should be no longer allowed to be inserted
   private boolean finalized = false;
 
+  // indicates the step size to find the NDV value corresponding to numBytes
+  private static final int NDV_STEP = 500;
   private int maximumBytes = UPPER_BOUND_BYTES;
   private int minimumBytes = LOWER_BOUND_BYTES;
   // the hash strategy used in this bloom filter.
   private final HashStrategy hashStrategy;
   // the column to build bloom filter
   private ColumnDescriptor column;
 
-  public DynamicBlockBloomFilter(int numBytes, int candidatesNum, double fpp, 
ColumnDescriptor column) {
-this(numBytes, LOWER_BOUND_BYTES, UPPER_BOUND_BYTES, HashStrategy.XXH64, 
fpp, candidatesNum, column);
-  }
-
-  public DynamicBlockBloomFilter(int numBytes, int maximumBytes, int 
candidatesNum, double fpp, ColumnDescriptor column) {
-this(numBytes, LOWER_BOUND_BYTES, maximumBytes, HashStrategy.XXH64, fpp, 
candidatesNum, column);
+  /**
+   * Given the maximum acceptable bytes size of bloom filter, generate 
candidates according it.

Review Comment:
   ```suggestion
  * Generate bloom filter candidates according to the maximum acceptable 
byte size.
   ```



##
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/AdaptiveBlockSplitBloomFilter.java:
##
@@ -38,57 +38,51 @@
 import org.apache.parquet.io.api.Binary;
 
 /**
- * `DynamicBlockBloomFilter` contains multiple `BlockSplitBloomFilter` as 
candidates and inserts values in
+ * `AdaptiveBlockSplitBloomFilter` contains multiple `BlockSplitBloomFilter` 
as candidates and inserts values in
  * the candidates at the same time.
  * The purpose of this is to finally generate a bloom filter with the optimal 
bit size according to the number
  * of real data distinct values. Use the largest bloom filter as an 
approximate deduplication counter, and then
  * remove incapable bloom filter candidate during data insertion.
  */
-public class DynamicBlockBloomFilter implements BloomFilter {
+public class AdaptiveBlockSplitBloomFilter implements BloomFilter {
 
-  private static final Logger LOG = 
LoggerFactory.getLogger(DynamicBlockBloomFilter.class);
+  private static final Logger LOG = 
LoggerFactory.getLogger(AdaptiveBlockSplitBloomFilter.class);
 
   // multiple candidates, inserting data at the same time. If the distinct 
values are greater than the
   // expected NDV of candidates, it will be removed. Finally we will choose 
the smallest candidate to write out.
  private final List<BloomFilterCandidate> candidates = new ArrayList<>();
 
   // the largest among candidates and used as an approximate deduplication 
counter
-  private BloomFilterCandidate maxCandidate;
+  private BloomFilterCandidate largestCandidate;
 
   // the accumulator of the number of distinct values that have been inserted 
so far
-  private int d
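The hunk above is cut off, but the class it touches is the rename of `DynamicBlockBloomFilter` to `AdaptiveBlockSplitBloomFilter`. The core mechanism, several differently sized candidates fed in parallel, undersized candidates dropped, and the smallest survivor written out, can be sketched with a toy model (this is not the parquet-mr class; the names and the simplified distinct counting are illustrative only):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy stand-in for one fixed-size bloom filter candidate. */
final class Candidate {
  final long expectedNdv;   // the NDV this candidate's byte size was chosen for
  long insertedDistinct;    // approximate count of distinct values inserted so far

  Candidate(long expectedNdv) {
    this.expectedNdv = expectedNdv;
  }
}

/** Keeps several candidates alive and prunes the ones that turn out to be too small. */
final class AdaptiveSketch {
  private final List<Candidate> candidates = new ArrayList<>();

  AdaptiveSketch(long... expectedNdvs) {
    for (long ndv : expectedNdvs) {
      candidates.add(new Candidate(ndv));
    }
  }

  /** Record one new distinct value in every surviving candidate, then prune. */
  void insertDistinct() {
    for (Candidate c : candidates) {
      c.insertedDistinct++;
    }
    long largest = largestExpectedNdv();
    // Drop candidates whose expected NDV has been exceeded, but always keep the largest
    // one so it can continue to act as the approximate deduplication counter.
    candidates.removeIf(c -> c.insertedDistinct > c.expectedNdv && c.expectedNdv != largest);
  }

  /** The smallest surviving candidate is the one that would be written out. */
  Candidate chosen() {
    return candidates.stream()
        .min(Comparator.comparingLong(c -> c.expectedNdv))
        .orElseThrow(IllegalStateException::new);
  }

  private long largestExpectedNdv() {
    return candidates.stream().mapToLong(c -> c.expectedNdv).max().orElse(0L);
  }
}
```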

[jira] [Commented] (PARQUET-2297) Encrypted files should not be checked for delta encoding problem

2023-05-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719593#comment-17719593
 ] 

ASF GitHub Bot commented on PARQUET-2297:
-

wgtmac commented on PR #1089:
URL: https://github.com/apache/parquet-mr/pull/1089#issuecomment-1535610371

   Should we include this fix in the next 1.13.1 release: 
https://lists.apache.org/thread/1mjvdcmwqjcblmfkfgpd9ob2yodx7tom ? @ggershinsky 




> Encrypted files should not be checked for delta encoding problem
> 
>
> Key: PARQUET-2297
> URL: https://issues.apache.org/jira/browse/PARQUET-2297
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> Delta encoding problem (https://issues.apache.org/jira/browse/PARQUET-246) 
> was fixed in writers since parquet-mr-1.8. This fix also added a 
> `checkDeltaByteArrayProblem` method in readers, that runs over all columns 
> and checks for this problem in older files. 
> This now triggers an unrelated exception when reading encrypted files, in the 
> following situation: trying to read an unencrypted column, without having 
> keys for encrypted columns (see 
> https://issues.apache.org/jira/browse/PARQUET-2193). This happens in Spark, 
> with nested columns (files with regular columns are ok).
> Possible solution: don't call the `checkDeltaByteArrayProblem` method for 
> encrypted files - because these files can be written only with 
> parquet-mr-1.12 and newer, where the delta encoding problem is already fixed.
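The proposed fix is essentially a guard in front of the existing probe. A minimal sketch of its shape, with the encryption test reduced to a boolean flag for illustration (the real change would consult the file's encryption metadata rather than take a flag):

```java
final class DeltaCheckGuard {
  private DeltaCheckGuard() {}

  /**
   * Decides whether the legacy checkDeltaByteArrayProblem probe should run at all.
   * Encrypted files can only have been written by parquet-mr 1.12 or newer, where the
   * writer side of PARQUET-246 is already fixed, so the probe is skipped for them; that
   * also sidesteps the PARQUET-2193 failure when only some column keys are available.
   */
  static boolean shouldRunDeltaByteArrayCheck(boolean fileIsEncrypted) {
    return !fileIsEncrypted;
  }
}
```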



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2297) Encrypted files should not be checked for delta encoding problem

2023-05-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719638#comment-17719638
 ] 

ASF GitHub Bot commented on PARQUET-2297:
-

ggershinsky commented on PR #1089:
URL: https://github.com/apache/parquet-mr/pull/1089#issuecomment-1535733579

   SGTM, I'll send a PR to the parquet-1.13.x branch too




> Encrypted files should not be checked for delta encoding problem
> 
>
> Key: PARQUET-2297
> URL: https://issues.apache.org/jira/browse/PARQUET-2297
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> Delta encoding problem (https://issues.apache.org/jira/browse/PARQUET-246) 
> was fixed in writers since parquet-mr-1.8. This fix also added a 
> `checkDeltaByteArrayProblem` method in readers, that runs over all columns 
> and checks for this problem in older files. 
> This now triggers an unrelated exception when reading encrypted files, in the 
> following situation: trying to read an unencrypted column, without having 
> keys for encrypted columns (see 
> https://issues.apache.org/jira/browse/PARQUET-2193). This happens in Spark, 
> with nested columns (files with regular columns are ok).
> Possible solution: don't call the `checkDeltaByteArrayProblem` method for 
> encrypted files - because these files can be written only with 
> parquet-mr-1.12 and newer, where the delta encoding problem is already fixed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2292) Improve default SpecificRecord model selection for Avro{Write,Read}Support

2023-05-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719708#comment-17719708
 ] 

ASF GitHub Bot commented on PARQUET-2292:
-

gszadovszky commented on PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#issuecomment-1535844366

   @clairemcginty, could you reply on the thread saying that you would like to have this one included? It is better to have a broader audience for discussing the release content. (Note: we should usually not include features in patch releases, to lower the potential risks, but it is up to the community to decide.)




> Improve default SpecificRecord model selection for Avro{Write,Read}Support
> --
>
> Key: PARQUET-2292
> URL: https://issues.apache.org/jira/browse/PARQUET-2292
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
>
> AvroWriteSupport/AvroReadSupport can improve the precision of their default 
> `model` selection. Currently they default to new 
> SpecificDataSupplier().get()[0]. This means that SpecificRecord classes that 
> contain logical types will fail out-of-the-box unless a specific 
> DATA_SUPPLIER is configured that contains logical type conversions.
> I think we can improve this and make logical types work by default by 
> defaulting to the value of the `MODEL$` field that every SpecificRecordBase 
> implementation contains, which already contains all the logical conversions 
> for that Avro type. It would require reflection, but that's what the Avro 
> library is already doing to fetch models for Specific types[1].
>  
> [0] 
> [https://github.com/apache/parquet-mr/blob/d38044f5395494e1543581a4b763f624305d3022/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L403-L407]
> [1] 
> https://github.com/apache/avro/blob/release-1.11.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L76-L86



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2292) Improve default SpecificRecord model selection for Avro{Write,Read}Support

2023-05-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719709#comment-17719709
 ] 

ASF GitHub Bot commented on PARQUET-2292:
-

gszadovszky merged PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078




> Improve default SpecificRecord model selection for Avro{Write,Read}Support
> --
>
> Key: PARQUET-2292
> URL: https://issues.apache.org/jira/browse/PARQUET-2292
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
>
> AvroWriteSupport/AvroReadSupport can improve the precision of their default 
> `model` selection. Currently they default to new 
> SpecificDataSupplier().get()[0]. This means that SpecificRecord classes that 
> contain logical types will fail out-of-the-box unless a specific 
> DATA_SUPPLIER is configured that contains logical type conversions.
> I think we can improve this and make logical types work by default by 
> defaulting to the value of the `MODEL$` field that every SpecificRecordBase 
> implementation contains, which already contains all the logical conversions 
> for that Avro type. It would require reflection, but that's what the Avro 
> library is already doing to fetch models for Specific types[1].
>  
> [0] 
> [https://github.com/apache/parquet-mr/blob/d38044f5395494e1543581a4b763f624305d3022/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L403-L407]
> [1] 
> https://github.com/apache/avro/blob/release-1.11.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L76-L86



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2256) Adding Compression for BloomFilter

2023-05-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719713#comment-17719713
 ] 

ASF GitHub Bot commented on PARQUET-2256:
-

mapleFU closed pull request #195: PARQUET-2256: Add BloomFilter Compression
URL: https://github.com/apache/parquet-format/pull/195




> Adding Compression for BloomFilter
> --
>
> Key: PARQUET-2256
> URL: https://issues.apache.org/jira/browse/PARQUET-2256
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Affects Versions: format-2.9.0
>Reporter: Xuwei Fu
>Assignee: Xuwei Fu
>Priority: Major
>
> In current Parquet implementations, if the BloomFilter ndv is not set, most 
> implementations will guess 1M as the ndv and use it together with the fpp. So, if 
> fpp is 0.01, the BloomFilter size may grow to 2M for each column, which is really 
> large. Should we support compression for the BloomFilter, like:
>  
> ```
>  /**
>  * The compression used in the Bloom filter.
>  **/
> struct Uncompressed {}
> union BloomFilterCompression {
>   1: Uncompressed UNCOMPRESSED;
> +2: CompressionCodec COMPRESSION;
> }
> ```
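For a sense of where the "2M for each column" figure comes from: with the standard bloom-filter sizing formula, a guessed NDV of 1M at fpp 0.01 needs roughly 1.2 MB of bits, and if the implementation rounds the byte size up to a power of two (as block-split filters commonly do; treat that rounding as an assumption here) the result is 2 MiB. A quick back-of-the-envelope check:

```java
public class BloomSizeEstimate {
  public static void main(String[] args) {
    long ndv = 1_000_000L;  // guessed number of distinct values
    double fpp = 0.01;      // target false-positive probability

    // Optimal number of bits: m = -n * ln(p) / (ln 2)^2
    double bits = -ndv * Math.log(fpp) / (Math.log(2) * Math.log(2));
    long bytes = (long) Math.ceil(bits / 8);                  // ~1.2 MB
    long roundedBytes = Long.highestOneBit(bytes - 1) << 1;   // next power of two: 2 MiB

    System.out.printf("raw: %d bytes, rounded: %d bytes%n", bytes, roundedBytes);
  }
}
```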



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2256) Adding Compression for BloomFilter

2023-05-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719711#comment-17719711
 ] 

ASF GitHub Bot commented on PARQUET-2256:
-

mapleFU commented on PR #195:
URL: https://github.com/apache/parquet-format/pull/195#issuecomment-1535846816

   I've also tested with a Roaring Bitmap; when the bitmap is dense, it also compresses poorly. So I'm closing this patch for now. If anyone wants it, they can reopen it.




> Adding Compression for BloomFilter
> --
>
> Key: PARQUET-2256
> URL: https://issues.apache.org/jira/browse/PARQUET-2256
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Affects Versions: format-2.9.0
>Reporter: Xuwei Fu
>Assignee: Xuwei Fu
>Priority: Major
>
> In current Parquet implementations, if the BloomFilter ndv is not set, most 
> implementations will guess 1M as the ndv and use it together with the fpp. So, if 
> fpp is 0.01, the BloomFilter size may grow to 2M for each column, which is really 
> large. Should we support compression for the BloomFilter, like:
>  
> ```
>  /**
>  * The compression used in the Bloom filter.
>  **/
> struct Uncompressed {}
> union BloomFilterCompression {
>   1: Uncompressed UNCOMPRESSED;
> +2: CompressionCodec COMPRESSION;
> }
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2292) Improve default SpecificRecord model selection for Avro{Write,Read}Support

2023-05-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719828#comment-17719828
 ] 

ASF GitHub Bot commented on PARQUET-2292:
-

clairemcginty commented on PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#issuecomment-1536169659

   @gszadovszky I wasn't previously subscribed to the dev mailing list, so I 
just subscribed 

> Improve default SpecificRecord model selection for Avro{Write,Read}Support
> --
>
> Key: PARQUET-2292
> URL: https://issues.apache.org/jira/browse/PARQUET-2292
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
>
> AvroWriteSupport/AvroReadSupport can improve the precision of their default 
> `model` selection. Currently they default to new 
> SpecificDataSupplier().get()[0]. This means that SpecificRecord classes that 
> contain logical types will fail out-of-the-box unless a specific 
> DATA_SUPPLIER is configured that contains logical type conversions.
> I think we can improve this and make logical types work by default by 
> defaulting to the value of the `MODEL$` field that every SpecificRecordBase 
> implementation contains, which already contains all the logical conversions 
> for that Avro type. It would require reflection, but that's what the Avro 
> library is already doing to fetch models for Specific types[1].
>  
> [0] 
> [https://github.com/apache/parquet-mr/blob/d38044f5395494e1543581a4b763f624305d3022/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L403-L407]
> [1] 
> https://github.com/apache/avro/blob/release-1.11.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L76-L86



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2292) Improve default SpecificRecord model selection for Avro{Write,Read}Support

2023-05-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719839#comment-17719839
 ] 

ASF GitHub Bot commented on PARQUET-2292:
-

gszadovszky commented on PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#issuecomment-1536206412

   @clairemcginty, I think there is an option to reply from your own mail client in Ponymail (the thread link above).




> Improve default SpecificRecord model selection for Avro{Write,Read}Support
> --
>
> Key: PARQUET-2292
> URL: https://issues.apache.org/jira/browse/PARQUET-2292
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
>
> AvroWriteSupport/AvroReadSupport can improve the precision of their default 
> `model` selection. Currently they default to new 
> SpecificDataSupplier().get()[0]. This means that SpecificRecord classes that 
> contain logical types will fail out-of-the-box unless a specific 
> DATA_SUPPLIER is configured that contains logical type conversions.
> I think we can improve this and make logical types work by default by 
> defaulting to the value of the `MODEL$` field that every SpecificRecordBase 
> implementation contains, which already contains all the logical conversions 
> for that Avro type. It would require reflection, but that's what the Avro 
> library is already doing to fetch models for Specific types[1].
>  
> [0] 
> [https://github.com/apache/parquet-mr/blob/d38044f5395494e1543581a4b763f624305d3022/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L403-L407]
> [1] 
> https://github.com/apache/avro/blob/release-1.11.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L76-L86



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-05-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719926#comment-17719926
 ] 

ASF GitHub Bot commented on PARQUET-2276:
-

rdblue commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1186277590


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##
@@ -46,7 +54,39 @@ public class HadoopStreams {
*/
   public static SeekableInputStream wrap(FSDataInputStream stream) {
 Objects.requireNonNull(stream, "Cannot wrap a null input stream");
-if (isWrappedStreamByteBufferReadable(stream)) {
+
+// Try to check using hasCapabilities(str)
+Boolean hasCapabilitiesResult = isWrappedStreamByteBufferReadable(stream);
+
+// If it is null, then fall back to the old method
+if (hasCapabilitiesResult != null) {
+  if (hasCapabilitiesResult) {
+return new H2SeekableInputStream(stream);
+  } else {
+return new H1SeekableInputStream(stream);
+  }
+}
+
+return isWrappedStreamByteBufferReadableLegacy(stream);
+  }
+
+  /**
+   * Is the inner stream byte buffer readable?
+   * The test is 'the stream is not FSDataInputStream
+   * and implements ByteBufferReadable'
+   *
+   * This logic is only used for Hadoop <2.9.x, and <3.x.x
+   *
+   * @param stream stream to probe
+   * @return A H2SeekableInputStream to access, or H1SeekableInputStream if 
the stream is not seekable
+   */
+  private static SeekableInputStream 
isWrappedStreamByteBufferReadableLegacy(FSDataInputStream stream) {

Review Comment:
   Usually, `isSomething` methods return a boolean. This is unwrapping, so I'd 
prefer naming it `unwrapByteBufferReadableLegacy` or something to be more clear.
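For readers without the full patch: the probe returns a nullable `Boolean` because on Hadoop 2.8.x `FSDataInputStream.hasCapability` does not exist at all, so it has to be resolved dynamically and report "unknown" when absent, letting the caller fall back to the legacy instanceof-based unwrapping. The PR does this through the `hasCapabilitiesMethod` helper seen in the hunks; the sketch below shows the same pattern with plain `java.lang.reflect`, so the names here are illustrative only:

```java
import java.lang.reflect.Method;
import org.apache.hadoop.fs.FSDataInputStream;

final class ByteBufferReadableProbe {
  // Resolved once; stays null when running against Hadoop < 2.9, where the method is missing.
  private static final Method HAS_CAPABILITY = lookupHasCapability();

  private static Method lookupHasCapability() {
    try {
      return FSDataInputStream.class.getMethod("hasCapability", String.class);
    } catch (NoSuchMethodException e) {
      return null;
    }
  }

  /**
   * Returns TRUE/FALSE when the StreamCapabilities probe is available, or null when it
   * cannot be determined (old Hadoop), so the caller can fall back to the legacy
   * instanceof-based unwrapping.
   */
  static Boolean isByteBufferReadable(FSDataInputStream stream) {
    if (HAS_CAPABILITY == null) {
      return null;
    }
    try {
      return (Boolean) HAS_CAPABILITY.invoke(stream, "in:readbytebuffer");
    } catch (ReflectiveOperationException e) {
      return null;
    }
  }
}
```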





> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Atul Mohan
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> {{ParquetReader.read() fails with the following exception on parquet-mr 
> version 1.13.0 when using hadoop version 2.8.5:}}
> {code:java}
>  java.lang.NoSuchMethodError: 'boolean 
> org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)' 
> at 
> org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74)
>  
> at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49) 
> at 
> org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:787)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) 
> at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162) 
> org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> {code}
>  
>  
>  
> From an initial investigation, it looks like HadoopStreams has started using 
> [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74]
>  but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop 
> 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-05-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719927#comment-17719927
 ] 

ASF GitHub Bot commented on PARQUET-2276:
-

rdblue commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1186278151


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##
@@ -46,7 +54,39 @@ public class HadoopStreams {
*/
   public static SeekableInputStream wrap(FSDataInputStream stream) {
 Objects.requireNonNull(stream, "Cannot wrap a null input stream");
-if (isWrappedStreamByteBufferReadable(stream)) {
+
+// Try to check using hasCapabilities(str)
+Boolean hasCapabilitiesResult = isWrappedStreamByteBufferReadable(stream);
+
+// If it is null, then fall back to the old method
+if (hasCapabilitiesResult != null) {
+  if (hasCapabilitiesResult) {
+return new H2SeekableInputStream(stream);
+  } else {
+return new H1SeekableInputStream(stream);
+  }
+}
+
+return isWrappedStreamByteBufferReadableLegacy(stream);
+  }
+
+  /**
+   * Is the inner stream byte buffer readable?
+   * The test is 'the stream is not FSDataInputStream
+   * and implements ByteBufferReadable'
+   *
+   * This logic is only used for Hadoop <2.9.x, and <3.x.x
+   *
+   * @param stream stream to probe
+   * @return A H2SeekableInputStream to access, or H1SeekableInputStream if 
the stream is not seekable
+   */
+  private static SeekableInputStream 
isWrappedStreamByteBufferReadableLegacy(FSDataInputStream stream) {
+InputStream wrapped = stream.getWrappedStream();
+if (wrapped instanceof FSDataInputStream) {
+  LOG.debug("Checking on wrapped stream {} of {} whether is 
ByteBufferReadable", wrapped, stream);
+  return isWrappedStreamByteBufferReadableLegacy(((FSDataInputStream) 
wrapped));
+}
+if (stream.getWrappedStream() instanceof ByteBufferReadable) {

Review Comment:
   I prefer using the same whitespace conventions as in Iceberg, although 
that's a bit more relaxed over here.





> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Atul Mohan
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> {{ParquetReader.read() fails with the following exception on parquet-mr 
> version 1.13.0 when using hadoop version 2.8.5:}}
> {code:java}
>  java.lang.NoSuchMethodError: 'boolean 
> org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)' 
> at 
> org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74)
>  
> at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49) 
> at 
> org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:787)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) 
> at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162) 
> org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> {code}
>  
>  
>  
> From an initial investigation, it looks like HadoopStreams has started using 
> [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74]
>  but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop 
> 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-05-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719929#comment-17719929
 ] 

ASF GitHub Bot commented on PARQUET-2276:
-

rdblue commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1186279482


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##
@@ -55,23 +95,31 @@ public static SeekableInputStream wrap(FSDataInputStream 
stream) {
 
   /**
* Is the inner stream byte buffer readable?
-   * The test is "the stream is not FSDataInputStream
+   * The test is 'the stream is not FSDataInputStream
* and implements ByteBufferReadable'
*
* That is: all streams which implement ByteBufferReadable
-   * other than FSDataInputStream successfuly support read(ByteBuffer).
-   * This is true for all filesytem clients the hadoop codebase.
+   * other than FSDataInputStream successfully support read(ByteBuffer).
+   * This is true for all filesystem clients the hadoop codebase.
*
* In hadoop 3.3.0+, the StreamCapabilities probe can be used to
* check this: only those streams which provide the read(ByteBuffer)
* semantics MAY return true for the probe "in:readbytebuffer";
* FSDataInputStream will pass the probe down to the underlying stream.
*
* @param stream stream to probe
-   * @return true if it is safe to a H2SeekableInputStream to access the data
+   * @return true if it is safe to a H2SeekableInputStream to access
+   * the data, null when it cannot be determined because of missing 
hasCapabilities
*/
-  private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream 
stream) {
-if (stream.hasCapability("in:readbytebuffer")) {
+  private static Boolean isWrappedStreamByteBufferReadable(FSDataInputStream 
stream) {
+if (hasCapabilitiesMethod.isNoop()) {
+  // When the method is not available, just return a null
+  return null;
+}
+
+Boolean hasCapabilities = hasCapabilitiesMethod.invoke(stream, 
"in:readbytebuffer");
+
+if (hasCapabilities) {

Review Comment:
   This variable would be more clear if it were called `isByteBufferReadable` 
since that's the capability we are checking for.





> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Atul Mohan
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> {{ParquetReader.read() fails with the following exception on parquet-mr 
> version 1.13.0 when using hadoop version 2.8.5:}}
> {code:java}
>  java.lang.NoSuchMethodError: 'boolean 
> org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)' 
> at 
> org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74)
>  
> at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49) 
> at 
> org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:787)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) 
> at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162) 
> org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> {code}
>  
>  
>  
> From an initial investigation, it looks like HadoopStreams has started using 
> [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74]
>  but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop 
> 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-05-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719928#comment-17719928
 ] 

ASF GitHub Bot commented on PARQUET-2276:
-

rdblue commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1186279000


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##
@@ -55,23 +95,31 @@ public static SeekableInputStream wrap(FSDataInputStream 
stream) {
 
   /**
* Is the inner stream byte buffer readable?
-   * The test is "the stream is not FSDataInputStream
+   * The test is 'the stream is not FSDataInputStream
* and implements ByteBufferReadable'
*
* That is: all streams which implement ByteBufferReadable
-   * other than FSDataInputStream successfuly support read(ByteBuffer).
-   * This is true for all filesytem clients the hadoop codebase.
+   * other than FSDataInputStream successfully support read(ByteBuffer).
+   * This is true for all filesystem clients the hadoop codebase.
*
* In hadoop 3.3.0+, the StreamCapabilities probe can be used to
* check this: only those streams which provide the read(ByteBuffer)
* semantics MAY return true for the probe "in:readbytebuffer";
* FSDataInputStream will pass the probe down to the underlying stream.
*
* @param stream stream to probe
-   * @return true if it is safe to a H2SeekableInputStream to access the data
+   * @return true if it is safe to a H2SeekableInputStream to access
+   * the data, null when it cannot be determined because of missing 
hasCapabilities
*/
-  private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream 
stream) {
-if (stream.hasCapability("in:readbytebuffer")) {
+  private static Boolean isWrappedStreamByteBufferReadable(FSDataInputStream 
stream) {
+if (hasCapabilitiesMethod.isNoop()) {
+  // When the method is not available, just return a null
+  return null;
+}
+
+Boolean hasCapabilities = hasCapabilitiesMethod.invoke(stream, 
"in:readbytebuffer");

Review Comment:
   Is this a boxed boolean? If so, should we update the check to handle null?
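The concern is the implicit unboxing: `if (hasCapabilities)` throws a `NullPointerException` when the boxed value is null. A tiny self-contained illustration of the null-safe alternatives (assuming a nullable `Boolean` result, which the final patch avoided by switching to the primitive type):

```java
public class BoxedBooleanCheck {
  public static void main(String[] args) {
    Boolean hasCapabilities = null;  // e.g. the probe result could not be determined

    // Unsafe: `if (hasCapabilities)` would auto-unbox null and throw NullPointerException.

    // Null-safe alternatives:
    boolean viaEquals = Boolean.TRUE.equals(hasCapabilities);
    boolean viaNullCheck = hasCapabilities != null && hasCapabilities;

    System.out.println(viaEquals + " " + viaNullCheck);  // prints "false false"
  }
}
```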





> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Atul Mohan
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> {{ParquetReader.read() fails with the following exception on parquet-mr 
> version 1.13.0 when using hadoop version 2.8.5:}}
> {code:java}
>  java.lang.NoSuchMethodError: 'boolean 
> org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)' 
> at 
> org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74)
>  
> at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49) 
> at 
> org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:787)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) 
> at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162) 
> org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> {code}
>  
>  
>  
> From an initial investigation, it looks like HadoopStreams has started using 
> [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74]
>  but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop 
> 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2292) Improve default SpecificRecord model selection for Avro{Write,Read}Support

2023-05-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719955#comment-17719955
 ] 

ASF GitHub Bot commented on PARQUET-2292:
-

clairemcginty commented on PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#issuecomment-1536628552

   thanks @gszadovszky! I used the Ponymail link from the thread, but I fear 
it's just created a new email thread - apologies 




> Improve default SpecificRecord model selection for Avro{Write,Read}Support
> --
>
> Key: PARQUET-2292
> URL: https://issues.apache.org/jira/browse/PARQUET-2292
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
>
> AvroWriteSupport/AvroReadSupport can improve the precision of their default 
> `model` selection. Currently they default to new 
> SpecificDataSupplier().get()[0]. This means that SpecificRecord classes that 
> contain logical types will fail out-of-the-box unless a specific 
> DATA_SUPPLIER is configured that contains logical type conversions.
> I think we can improve this and make logical types work by default by 
> defaulting to the value of the `MODEL$` field that every SpecificRecordBase 
> implementation contains, which already contains all the logical conversions 
> for that Avro type. It would require reflection, but that's what the Avro 
> library is already doing to fetch models for Specific types[1].
>  
> [0] 
> [https://github.com/apache/parquet-mr/blob/d38044f5395494e1543581a4b763f624305d3022/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L403-L407]
> [1] 
> https://github.com/apache/avro/blob/release-1.11.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L76-L86



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-05-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720009#comment-17720009
 ] 

ASF GitHub Bot commented on PARQUET-2276:
-

Fokko commented on PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#issuecomment-1536815143

   Thanks @shangxinli, @wgtmac, @rdblue, and @steveloughran for the review.
   
   Steve, I'm aware of your stance and I also respect it. Unfortunately, a lot of companies are still using (internally heavily patched) versions of Hadoop, and to get traction in downstream projects we still have to maintain compatibility.
   




> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Atul Mohan
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> {{ParquetReader.read() fails with the following exception on parquet-mr 
> version 1.13.0 when using hadoop version 2.8.5:}}
> {code:java}
>  java.lang.NoSuchMethodError: 'boolean 
> org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)' 
> at 
> org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74)
>  
> at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49) 
> at 
> org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:787)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) 
> at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162) 
> org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> {code}
>  
>  
>  
> From an initial investigation, it looks like HadoopStreams has started using 
> [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74]
>  but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop 
> 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-05-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720010#comment-17720010
 ] 

ASF GitHub Bot commented on PARQUET-2276:
-

Fokko commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1186521595


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##
@@ -55,23 +95,31 @@ public static SeekableInputStream wrap(FSDataInputStream 
stream) {
 
   /**
* Is the inner stream byte buffer readable?
-   * The test is "the stream is not FSDataInputStream
+   * The test is 'the stream is not FSDataInputStream
* and implements ByteBufferReadable'
*
* That is: all streams which implement ByteBufferReadable
-   * other than FSDataInputStream successfuly support read(ByteBuffer).
-   * This is true for all filesytem clients the hadoop codebase.
+   * other than FSDataInputStream successfully support read(ByteBuffer).
+   * This is true for all filesystem clients the hadoop codebase.
*
* In hadoop 3.3.0+, the StreamCapabilities probe can be used to
* check this: only those streams which provide the read(ByteBuffer)
* semantics MAY return true for the probe "in:readbytebuffer";
* FSDataInputStream will pass the probe down to the underlying stream.
*
* @param stream stream to probe
-   * @return true if it is safe to a H2SeekableInputStream to access the data
+   * @return true if it is safe to a H2SeekableInputStream to access
+   * the data, null when it cannot be determined because of missing 
hasCapabilities
*/
-  private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream 
stream) {
-if (stream.hasCapability("in:readbytebuffer")) {
+  private static Boolean isWrappedStreamByteBufferReadable(FSDataInputStream 
stream) {
+if (hasCapabilitiesMethod.isNoop()) {
+  // When the method is not available, just return a null
+  return null;
+}
+
+Boolean hasCapabilities = hasCapabilitiesMethod.invoke(stream, 
"in:readbytebuffer");

Review Comment:
   I was assuming that it needed to be an object, but a primitive works as well, so I changed that. Thanks for catching this!





> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Atul Mohan
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> {{ParquetReader.read() fails with the following exception on parquet-mr 
> version 1.13.0 when using hadoop version 2.8.5:}}
> {code:java}
>  java.lang.NoSuchMethodError: 'boolean 
> org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)' 
> at 
> org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74)
>  
> at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49) 
> at 
> org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:787)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) 
> at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162) 
> org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> {code}
>  
>  
>  
> From an initial investigation, it looks like HadoopStreams has started using 
> [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74]
>  but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop 
> 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-05-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720023#comment-17720023
 ] 

ASF GitHub Bot commented on PARQUET-2276:
-

Fokko merged PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084




> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Atul Mohan
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> {{ParquetReader.read() fails with the following exception on parquet-mr 
> version 1.13.0 when using hadoop version 2.8.5:}}
> {code:java}
>  java.lang.NoSuchMethodError: 'boolean 
> org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)' 
> at 
> org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74)
>  
> at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49) 
> at 
> org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:787)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) 
> at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162) 
> org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> {code}
>  
>  
>  
> From an initial investigation, it looks like HadoopStreams has started using 
> [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74]
>  but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop 
> 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-05-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720024#comment-17720024
 ] 

ASF GitHub Bot commented on PARQUET-2276:
-

Fokko opened a new pull request, #1090:
URL: https://github.com/apache/parquet-mr/pull/1090

   Make sure you have checked _all_ steps below.
   
   Backport to 1.13.1
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
 - https://issues.apache.org/jira/browse/PARQUET-2276
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain Javadoc that 
explain what it does
   




> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Atul Mohan
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> {{ParquetReader.read() fails with the following exception on parquet-mr 
> version 1.13.0 when using hadoop version 2.8.5:}}
> {code:java}
>  java.lang.NoSuchMethodError: 'boolean 
> org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)' 
> at 
> org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74)
>  
> at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49) 
> at 
> org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:787)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) 
> at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162) 
> org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> {code}
>  
>  
>  
> From an initial investigation, it looks like HadoopStreams has started using 
> [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74]
>  but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop 
> 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2292) Improve default SpecificRecord model selection for Avro{Write,Read}Support

2023-05-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720026#comment-17720026
 ] 

ASF GitHub Bot commented on PARQUET-2292:
-

Fokko commented on PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#issuecomment-1536851987

   Thanks for pinging me here @wgtmac. I just replied on the dev-list. I think we 
can include this one.




> Improve default SpecificRecord model selection for Avro{Write,Read}Support
> --
>
> Key: PARQUET-2292
> URL: https://issues.apache.org/jira/browse/PARQUET-2292
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
>
> AvroWriteSupport/AvroReadSupport can improve the precision of their default 
> `model` selection. Currently they default to new 
> SpecificDataSupplier().get()[0]. This means that SpecificRecord classes that 
> contain logical types will fail out-of-the-box unless a specific 
> DATA_SUPPLIER is configured that contains logical type conversions.
> I think we can improve this and make logical types work by default by 
> defaulting to the value of the `MODEL$` field that every SpecificRecordBase 
> implementation contains, which already contains all the logical conversions 
> for that Avro type. It would require reflection, but that's what the Avro 
> library is already doing to fetch models for Specific types[1].
>  
> [0] 
> [https://github.com/apache/parquet-mr/blob/d38044f5395494e1543581a4b763f624305d3022/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L403-L407]
> [1] 
> https://github.com/apache/avro/blob/release-1.11.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L76-L86
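
For reference, the workaround mentioned above (configuring a DATA_SUPPLIER that carries the logical-type conversions) would look roughly like the sketch below; the chosen conversions are illustrative and the class name is made up for this example.

```java
import org.apache.avro.Conversions;
import org.apache.avro.data.TimeConversions;
import org.apache.avro.generic.GenericData;
import org.apache.avro.specific.SpecificData;
import org.apache.parquet.avro.AvroDataSupplier;

// Illustrative only: a SpecificData model with logical-type conversions registered,
// supplied via parquet-avro's AvroDataSupplier hook instead of the built-in default.
public class LogicalTypeSpecificDataSupplier implements AvroDataSupplier {
  @Override
  public GenericData get() {
    SpecificData model = new SpecificData();
    model.addLogicalTypeConversion(new Conversions.DecimalConversion());
    model.addLogicalTypeConversion(new TimeConversions.TimestampMillisConversion());
    return model;
  }
}
```

Wiring such a supplier in through the AVRO_DATA_SUPPLIER configuration key is exactly the per-job setup that defaulting to MODEL$ would make unnecessary.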



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2265) AvroParquetWriter should default to data supplier model from Configuration

2023-05-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720053#comment-17720053
 ] 

ASF GitHub Bot commented on PARQUET-2265:
-

clairemcginty opened a new pull request, #1091:
URL: https://github.com/apache/parquet-mr/pull/1091

   Per discussion here 
https://lists.apache.org/thread/0qmwj33vohxymsfc1tw1f3141zrn62yp
   
   ### Jira
   
   - [x] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
 - https://issues.apache.org/jira/browse/PARQUET-XXX
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [x] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain Javadoc that 
explain what it does
   




> AvroParquetWriter should default to data supplier model from Configuration
> --
>
> Key: PARQUET-2265
> URL: https://issues.apache.org/jira/browse/PARQUET-2265
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
> Fix For: 1.14.0
>
>
> I recently ran into a bug where the AvroDataSupplier I specified in my 
> Configuration wasn't respected when creating an AvroParquetWriter:
>  
> ```
> Configuration configuration = new Configuration();
> configuration.put(AvroWriteSupport.AVRO_DATA_SUPPLIER, myCustomDataSupplier)
> AvroParquetWriter writer =
>   AvroParquetWriter.builder(...)
>     .withSchema(...)
>     .withConf(configuration)
>     .build();
> ```
> In this instance, the writer's attached AvroWriteSupport uses a SpecificData 
> model, rather than the value of `myCustomDataSupplier.get()`. This is due to 
> AvroParquetWriter defaulting to SpecificData model[0] if it's not supplied in 
> the AvroParquetWriter.Builder.
> I see that AvroParquetWriter.Builder has a `.withDataModel` method, but IMO 
> this creates confusion/redundancy, since I end up supplying the data model 
> twice; also, I can't create any abstractions around this (i.e. a 
> `createWriterForConfiguration(Configuration conf)` type of method) without 
> having to use reflection to invoke a dataModel for the value of 
> `conf.getClass(AvroWriteSupport.AVRO_DATA_SUPPLIER)`.
> I think it would be simplest if AvroParquetWriter just defaulted to `model = 
> null` and let AvroWriteSupport initialize it based on the Configuration[1]. 
> What do you think? That seems to be what AvroParquetReader is currently 
> doing[2].
>  
> [0][https://github.com/apache/parquet-mr/blob/59e9f78b8b3a30073db202eb6432071ff71df0ec/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java#L163]
> [1][https://github.com/apache/parquet-mr/blob/59e9f78b8b3a30073db202eb6432071ff71df0ec/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L134]
>  
> [2]https://github.com/apache/parquet-mr/blob/9a1fbc4ee3f63284a675eeac6c62e96ffc973575/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetReader.java#L133
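
To make the intent concrete, here is a sketch of the usage pattern the report is asking to work out of the box, with the snippet above translated to Hadoop's actual Configuration API (setClass rather than put); `MyCustomDataSupplier`, the path, and the schema are placeholders from the report, not real classes.

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroDataSupplier;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.avro.AvroWriteSupport;
import org.apache.parquet.hadoop.ParquetWriter;

// Sketch: the supplier is registered only on the Configuration, and the writer is
// expected to pick it up without a separate withDataModel(...) call.
public class ConfiguredSupplierWriterExample {
  public static ParquetWriter<GenericRecord> create(Path path, Schema schema) throws IOException {
    Configuration conf = new Configuration();
    conf.setClass(AvroWriteSupport.AVRO_DATA_SUPPLIER,
        MyCustomDataSupplier.class,   // placeholder supplier from the report
        AvroDataSupplier.class);
    return AvroParquetWriter.<GenericRecord>builder(path)
        .withSchema(schema)
        .withConf(conf)
        .build();
  }
}
```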



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2265) AvroParquetWriter should default to data supplier model from Configuration

2023-05-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720057#comment-17720057
 ] 

ASF GitHub Bot commented on PARQUET-2265:
-

wgtmac commented on PR #1091:
URL: https://github.com/apache/parquet-mr/pull/1091#issuecomment-1536986378

   Thanks @clairemcginty for porting these commits to 1.13.x. Let's wait for 
green CIs.




> AvroParquetWriter should default to data supplier model from Configuration
> --
>
> Key: PARQUET-2265
> URL: https://issues.apache.org/jira/browse/PARQUET-2265
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> I recently ran into a bug where the AvroDataSupplier I specified in my 
> Configuration wasn't respected when creating an AvroParquetWriter:
>  
> ```
> Configuration configuration = new Configuration();
> configuration.put(AvroWriteSupport.AVRO_DATA_SUPPLIER, myCustomDataSupplier)
> AvroParquetWriter writer =
>   AvroParquetWriter.builder(...)
>     .withSchema(...)
>     .withConf(configuration)
>     .build();
> ```
> In this instance, the writer's attached AvroWriteSupport uses a SpecificData 
> model, rather than the value of `myCustomDataSupplier.get()`. This is due to 
> AvroParquetWriter defaulting to SpecificData model[0] if it's not supplied in 
> the AvroParquetWriter.Builder.
> I see that AvroParquetWriter.Builder has a `.withDataModel` method, but IMO 
> this creates confusion/redundancy, since I end up supplying the data model 
> twice; also, I can't create any abstractions around this (i.e. a 
> `createWriterForConfiguration(Configuration conf)` type of method) without 
> having to use reflection to invoke a dataModel for the value of 
> `conf.getClass(AvroWriteSupport.AVRO_DATA_SUPPLIER)`.
> I think it would be simplest if AvroParquetWriter just defaulted to `model = 
> null` and let AvroWriteSupport initialize it based on the Configuration[1]. 
> What do you think? That seems to be what AvroParquetReader is currently 
> doing[2].
>  
> [0][https://github.com/apache/parquet-mr/blob/59e9f78b8b3a30073db202eb6432071ff71df0ec/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java#L163]
> [1][https://github.com/apache/parquet-mr/blob/59e9f78b8b3a30073db202eb6432071ff71df0ec/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L134]
>  
> [2]https://github.com/apache/parquet-mr/blob/9a1fbc4ee3f63284a675eeac6c62e96ffc973575/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetReader.java#L133



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2292) Improve default SpecificRecord model selection for Avro{Write,Read}Support

2023-05-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720067#comment-17720067
 ] 

ASF GitHub Bot commented on PARQUET-2292:
-

wgtmac merged PR #1091:
URL: https://github.com/apache/parquet-mr/pull/1091




> Improve default SpecificRecord model selection for Avro{Write,Read}Support
> --
>
> Key: PARQUET-2292
> URL: https://issues.apache.org/jira/browse/PARQUET-2292
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-avro
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> AvroWriteSupport/AvroReadSupport can improve the precision of their default 
> `model` selection. Currently they default to new 
> SpecificDataSupplier().get()[0]. This means that SpecificRecord classes that 
> contain logical types will fail out-of-the-box unless a specific 
> DATA_SUPPLIER is configured that contains logical type conversions.
> I think we can improve this and make logical types work by default by 
> defaulting to the value of the `MODEL$` field that every SpecificRecordBase 
> implementation contains, which already contains all the logical conversions 
> for that Avro type. It would require reflection, but that's what the Avro 
> library is already doing to fetch models for Specific types[1].
>  
> [0] 
> [https://github.com/apache/parquet-mr/blob/d38044f5395494e1543581a4b763f624305d3022/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L403-L407]
> [1] 
> https://github.com/apache/avro/blob/release-1.11.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L76-L86



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2292) Improve default SpecificRecord model selection for Avro{Write,Read}Support

2023-05-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720068#comment-17720068
 ] 

ASF GitHub Bot commented on PARQUET-2292:
-

wgtmac commented on PR #1091:
URL: https://github.com/apache/parquet-mr/pull/1091#issuecomment-1537021927

   I just merged it. Thanks @clairemcginty!




> Improve default SpecificRecord model selection for Avro{Write,Read}Support
> --
>
> Key: PARQUET-2292
> URL: https://issues.apache.org/jira/browse/PARQUET-2292
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-avro
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> AvroWriteSupport/AvroReadSupport can improve the precision of their default 
> `model` selection. Currently they default to new 
> SpecificDataSupplier().get()[0]. This means that SpecificRecord classes that 
> contain logical types will fail out-of-the-box unless a specific 
> DATA_SUPPLIER is configured that contains logical type conversions.
> I think we can improve this and make logical types work by default by 
> defaulting to the value of the `MODEL$` field that every SpecificRecordBase 
> implementation contains, which already contains all the logical conversions 
> for that Avro type. It would require reflection, but that's what the Avro 
> library is already doing to fetch models for Specific types[1].
>  
> [0] 
> [https://github.com/apache/parquet-mr/blob/d38044f5395494e1543581a4b763f624305d3022/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L403-L407]
> [1] 
> https://github.com/apache/avro/blob/release-1.11.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L76-L86



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2297) Encrypted files should not be checked for delta encoding problem

2023-05-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720143#comment-17720143
 ] 

ASF GitHub Bot commented on PARQUET-2297:
-

Fokko commented on code in PR #1089:
URL: https://github.com/apache/parquet-mr/pull/1089#discussion_r1186655384


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetRecordReader.java:
##
@@ -173,7 +173,10 @@ private void initializeInternalReader(ParquetInputSplit 
split, Configuration con
   }
 }
 
-if (!reader.getRowGroups().isEmpty()) {
+if (!reader.getRowGroups().isEmpty() &&
+  // Encrypted files (parquet-mr 1.12+) can't have the delta encoding 
problem (resolved in parquet-mr 1.8)

Review Comment:
   Could we add a test for this?





> Encrypted files should not be checked for delta encoding problem
> 
>
> Key: PARQUET-2297
> URL: https://issues.apache.org/jira/browse/PARQUET-2297
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> Delta encoding problem (https://issues.apache.org/jira/browse/PARQUET-246) 
> was fixed in writers since parquet-mr-1.8. This fix also added a 
> `checkDeltaByteArrayProblem` method in readers, that runs over all columns 
> and checks for this problem in older files. 
> This now triggers an unrelated exception when reading encrypted files, in the 
> following situation: trying to read an unencrypted column, without having 
> keys for encrypted columns (see 
> https://issues.apache.org/jira/browse/PARQUET-2193). This happens in Spark, 
> with nested columns (files with regular columns are ok).
> Possible solution: don't call the `checkDeltaByteArrayProblem` method for 
> encrypted files - because these files can be written only with 
> parquet-mr-1.12 and newer, where the delta encoding problem is already fixed.
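
As a schematic illustration of the proposed solution, the guard boils down to the sketch below; the reader internals are reduced to placeholder booleans and a Runnable, since the exact way the reader exposes its encryption status is not shown in this thread.

```java
// Schematic only: skip the legacy PARQUET-246 check when the file is encrypted,
// because encrypted files can only have been written by parquet-mr 1.12+, which
// already contains the writer-side fix. The parameters are placeholders for the
// real reader state and the real checkDeltaByteArrayProblem call.
final class DeltaCheckGuard {
  private DeltaCheckGuard() {}

  static void maybeCheckDeltaByteArrayProblem(
      boolean hasRowGroups, boolean fileIsEncrypted, Runnable checkDeltaByteArrayProblem) {
    if (hasRowGroups && !fileIsEncrypted) {
      checkDeltaByteArrayProblem.run();
    }
  }
}
```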



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2297) Encrypted files should not be checked for delta encoding problem

2023-05-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720293#comment-17720293
 ] 

ASF GitHub Bot commented on PARQUET-2297:
-

ggershinsky commented on code in PR #1089:
URL: https://github.com/apache/parquet-mr/pull/1089#discussion_r1186829851


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetRecordReader.java:
##
@@ -173,7 +173,10 @@ private void initializeInternalReader(ParquetInputSplit 
split, Configuration con
   }
 }
 
-if (!reader.getRowGroups().isEmpty()) {
+if (!reader.getRowGroups().isEmpty() &&
+  // Encrypted files (parquet-mr 1.12+) can't have the delta encoding 
problem (resolved in parquet-mr 1.8)

Review Comment:
   - with the delta encoding problem: basically impossible to reproduce :), it was 
resolved in 1.8
   - without this problem: I've had a look at the existing unit tests; 
unfortunately none can be used as a basis for adding a test for this 
particular situation, so it would require building a new unit test from scratch. 
However, given that a) the patch is small and straightforward and b) Spark stopped 
using this parquet read path, building a full unit test seems like overkill. But 
if you have a different opinion, please let me know.





> Encrypted files should not be checked for delta encoding problem
> 
>
> Key: PARQUET-2297
> URL: https://issues.apache.org/jira/browse/PARQUET-2297
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> Delta encoding problem (https://issues.apache.org/jira/browse/PARQUET-246) 
> was fixed in writers since parquet-mr-1.8. This fix also added a 
> `checkDeltaByteArrayProblem` method in readers, that runs over all columns 
> and checks for this problem in older files. 
> This now triggers an unrelated exception when reading encrypted files, in the 
> following situation: trying to read an unencrypted column, without having 
> keys for encrypted columns (see 
> https://issues.apache.org/jira/browse/PARQUET-2193). This happens in Spark, 
> with nested columns (files with regular columns are ok).
> Possible solution: don't call the `checkDeltaByteArrayProblem` method for 
> encrypted files - because these files can be written only with 
> parquet-mr-1.12 and newer, where the delta encoding problem is already fixed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2297) Encrypted files should not be checked for delta encoding problem

2023-05-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720392#comment-17720392
 ] 

ASF GitHub Bot commented on PARQUET-2297:
-

ggershinsky opened a new pull request, #1092:
URL: https://github.com/apache/parquet-mr/pull/1092

   https://issues.apache.org/jira/browse/PARQUET-2297
   
   For branch 1.13.x




> Encrypted files should not be checked for delta encoding problem
> 
>
> Key: PARQUET-2297
> URL: https://issues.apache.org/jira/browse/PARQUET-2297
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> Delta encoding problem (https://issues.apache.org/jira/browse/PARQUET-246) 
> was fixed in writers since parquet-mr-1.8. This fix also added a 
> `checkDeltaByteArrayProblem` method in readers, that runs over all columns 
> and checks for this problem in older files. 
> This now triggers an unrelated exception when reading encrypted files, in the 
> following situation: trying to read an unencrypted column, without having 
> keys for encrypted columns (see 
> https://issues.apache.org/jira/browse/PARQUET-2193). This happens in Spark, 
> with nested columns (files with regular columns are ok).
> Possible solution: don't call the `checkDeltaByteArrayProblem` method for 
> encrypted files - because these files can be written only with 
> parquet-mr-1.12 and newer, where the delta encoding problem is already fixed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2297) Encrypted files should not be checked for delta encoding problem

2023-05-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720775#comment-17720775
 ] 

ASF GitHub Bot commented on PARQUET-2297:
-

ggershinsky merged PR #1089:
URL: https://github.com/apache/parquet-mr/pull/1089




> Encrypted files should not be checked for delta encoding problem
> 
>
> Key: PARQUET-2297
> URL: https://issues.apache.org/jira/browse/PARQUET-2297
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> Delta encoding problem (https://issues.apache.org/jira/browse/PARQUET-246) 
> was fixed in writers since parquet-mr-1.8. This fix also added a 
> `checkDeltaByteArrayProblem` method in readers, that runs over all columns 
> and checks for this problem in older files. 
> This now triggers an unrelated exception when reading encrypted files, in the 
> following situation: trying to read an unencrypted column, without having 
> keys for encrypted columns (see 
> https://issues.apache.org/jira/browse/PARQUET-2193). This happens in Spark, 
> with nested columns (files with regular columns are ok).
> Possible solution: don't call the `checkDeltaByteArrayProblem` method for 
> encrypted files - because these files can be written only with 
> parquet-mr-1.12 and newer, where the delta encoding problem is already fixed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720817#comment-17720817
 ] 

ASF GitHub Bot commented on PARQUET-2276:
-

Fokko merged PR #1090:
URL: https://github.com/apache/parquet-mr/pull/1090




> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Atul Mohan
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> {{ParquetReader.read() fails with the following exception on parquet-mr 
> version 1.13.0 when using hadoop version 2.8.5:}}
> {code:java}
>  java.lang.NoSuchMethodError: 'boolean 
> org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)' 
> at 
> org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74)
>  
> at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49) 
> at 
> org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:787)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) 
> at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162) 
> org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> {code}
>  
>  
>  
> From an initial investigation, it looks like HadoopStreams has started using 
> [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74]
>  but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop 
> 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2297) Encrypted files should not be checked for delta encoding problem

2023-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720816#comment-17720816
 ] 

ASF GitHub Bot commented on PARQUET-2297:
-

Fokko merged PR #1092:
URL: https://github.com/apache/parquet-mr/pull/1092




> Encrypted files should not be checked for delta encoding problem
> 
>
> Key: PARQUET-2297
> URL: https://issues.apache.org/jira/browse/PARQUET-2297
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> Delta encoding problem (https://issues.apache.org/jira/browse/PARQUET-246) 
> was fixed in writers since parquet-mr-1.8. This fix also added a 
> `checkDeltaByteArrayProblem` method in readers, that runs over all columns 
> and checks for this problem in older files. 
> This now triggers an unrelated exception when reading encrypted files, in the 
> following situation: trying to read an unencrypted column, without having 
> keys for encrypted columns (see 
> https://issues.apache.org/jira/browse/PARQUET-2193). This happens in Spark, 
> with nested columns (files with regular columns are ok).
> Possible solution: don't call the `checkDeltaByteArrayProblem` method for 
> encrypted files - because these files can be written only with 
> parquet-mr-1.12 and newer, where the delta encoding problem is already fixed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2297) Encrypted files should not be checked for delta encoding problem

2023-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720815#comment-17720815
 ] 

ASF GitHub Bot commented on PARQUET-2297:
-

Fokko commented on code in PR #1089:
URL: https://github.com/apache/parquet-mr/pull/1089#discussion_r1188231603


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetRecordReader.java:
##
@@ -173,7 +173,10 @@ private void initializeInternalReader(ParquetInputSplit 
split, Configuration con
   }
 }
 
-if (!reader.getRowGroups().isEmpty()) {
+if (!reader.getRowGroups().isEmpty() &&
+  // Encrypted files (parquet-mr 1.12+) can't have the delta encoding 
problem (resolved in parquet-mr 1.8)

Review Comment:
   Thanks for the explanation, I'm fine with leaving out a unit test. Just 
curious whether it would be easy to modify the existing tests to make sure we hit 
this code path.





> Encrypted files should not be checked for delta encoding problem
> 
>
> Key: PARQUET-2297
> URL: https://issues.apache.org/jira/browse/PARQUET-2297
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> Delta encoding problem (https://issues.apache.org/jira/browse/PARQUET-246) 
> was fixed in writers since parquet-mr-1.8. This fix also added a 
> `checkDeltaByteArrayProblem` method in readers, that runs over all columns 
> and checks for this problem in older files. 
> This now triggers an unrelated exception when reading encrypted files, in the 
> following situation: trying to read an unencrypted column, without having 
> keys for encrypted columns (see 
> https://issues.apache.org/jira/browse/PARQUET-2193). This happens in Spark, 
> with nested columns (files with regular columns are ok).
> Possible solution: don't call the `checkDeltaByteArrayProblem` method for 
> encrypted files - because these files can be written only with 
> parquet-mr-1.12 and newer, where the delta encoding problem is already fixed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2296) Bump easymock from 3.4 to 5.1.0

2023-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720818#comment-17720818
 ] 

ASF GitHub Bot commented on PARQUET-2296:
-

Fokko merged PR #1088:
URL: https://github.com/apache/parquet-mr/pull/1088




> Bump easymock from 3.4 to 5.1.0
> ---
>
> Key: PARQUET-2296
> URL: https://issues.apache.org/jira/browse/PARQUET-2296
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2295) Bump truth-proto-extension from 1.0 to 1.1.3

2023-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720819#comment-17720819
 ] 

ASF GitHub Bot commented on PARQUET-2295:
-

Fokko merged PR #1087:
URL: https://github.com/apache/parquet-mr/pull/1087




> Bump truth-proto-extension from 1.0 to 1.1.3
> 
>
> Key: PARQUET-2295
> URL: https://issues.apache.org/jira/browse/PARQUET-2295
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720917#comment-17720917
 ] 

ASF GitHub Bot commented on PARQUET-2254:
-

yabola commented on code in PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#discussion_r1188555345


##
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/AdaptiveBlockSplitBloomFilter.java:
##
@@ -38,57 +38,51 @@
 import org.apache.parquet.io.api.Binary;
 
 /**
- * `DynamicBlockBloomFilter` contains multiple `BlockSplitBloomFilter` as 
candidates and inserts values in
+ * `AdaptiveBlockSplitBloomFilter` contains multiple `BlockSplitBloomFilter` 
as candidates and inserts values in
  * the candidates at the same time.
  * The purpose of this is to finally generate a bloom filter with the optimal 
bit size according to the number
  * of real data distinct values. Use the largest bloom filter as an 
approximate deduplication counter, and then
  * remove incapable bloom filter candidate during data insertion.
  */
-public class DynamicBlockBloomFilter implements BloomFilter {
+public class AdaptiveBlockSplitBloomFilter implements BloomFilter {
 
-  private static final Logger LOG = 
LoggerFactory.getLogger(DynamicBlockBloomFilter.class);
+  private static final Logger LOG = 
LoggerFactory.getLogger(AdaptiveBlockSplitBloomFilter.class);
 
   // multiple candidates, inserting data at the same time. If the distinct 
values are greater than the
   // expected NDV of candidates, it will be removed. Finally we will choose 
the smallest candidate to write out.
   private final List<BloomFilterCandidate> candidates = new ArrayList<>();
 
   // the largest among candidates and used as an approximate deduplication 
counter
-  private BloomFilterCandidate maxCandidate;
+  private BloomFilterCandidate largestCandidate;
 
   // the accumulator of the number of distinct values that have been inserted 
so far
-  private int distinctValueCounter = 0;
+  private long distinctValueCounter = 0;
 
   // indicates that the bloom filter candidate has been written out and new 
data should be no longer allowed to be inserted
   private boolean finalized = false;
 
+  // indicates the step size to find the NDV value corresponding to numBytes
+  private static final int NDV_STEP = 500;
   private int maximumBytes = UPPER_BOUND_BYTES;
   private int minimumBytes = LOWER_BOUND_BYTES;
   // the hash strategy used in this bloom filter.
   private final HashStrategy hashStrategy;
   // the column to build bloom filter
   private ColumnDescriptor column;
 
-  public DynamicBlockBloomFilter(int numBytes, int candidatesNum, double fpp, 
ColumnDescriptor column) {
-this(numBytes, LOWER_BOUND_BYTES, UPPER_BOUND_BYTES, HashStrategy.XXH64, 
fpp, candidatesNum, column);
-  }
-
-  public DynamicBlockBloomFilter(int numBytes, int maximumBytes, int 
candidatesNum, double fpp, ColumnDescriptor column) {
-this(numBytes, LOWER_BOUND_BYTES, maximumBytes, HashStrategy.XXH64, fpp, 
candidatesNum, column);
+  /**
+   * Given the maximum acceptable bytes size of bloom filter, generate 
candidates according it.
+   *
+   * @param maximumBytes  the maximum bit size of candidate

Review Comment:
   It means the byte size, and I have changed the variable name. Thank you!





> Build a BloomFilter with a more precise size
> 
>
> Key: PARQUET-2254
> URL: https://issues.apache.org/jira/browse/PARQUET-2254
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Assignee: Mars
>Priority: Major
>
> h3. Why are the changes needed?
> Currently, using a bloom filter requires specifying the NDV (number of distinct 
> values) up front and then building the BloomFilter from it. In many scenarios, the 
> number of distinct values is not actually known in advance.
> If the BloomFilter can be sized automatically from the data instead, the file 
> size can be reduced and reading efficiency can also be improved.
> h3. What changes were proposed in this pull request?
> {{DynamicBlockBloomFilter}} contains multiple {{BlockSplitBloomFilter}} as 
> candidates and inserts values into all candidates at the same time. It uses the 
> largest bloom filter as an approximate deduplication counter, and removes 
> candidates whose supported NDV has been exceeded as data is inserted.
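
To make the selection mechanism concrete, here is a deliberately simplified, self-contained model of the candidate bookkeeping; it uses a HashSet as the distinct-value counter and plain size records instead of real BlockSplitBloomFilter instances, so it illustrates the logic only and is not the parquet-mr implementation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model of the adaptive idea: keep candidates sized maxBytes, maxBytes/2,
// maxBytes/4, ..., insert every value into all of them, drop the candidates whose
// supported NDV is exceeded, and finally keep the smallest survivor.
final class AdaptiveCandidateSketch {

  /** One candidate: a byte budget plus the max NDV that budget supports at the target FPP. */
  static final class Candidate {
    final int numBytes;
    final long maxNdv;

    Candidate(int numBytes, long maxNdv) {
      this.numBytes = numBytes;
      this.maxNdv = maxNdv;
    }
  }

  // Ordered largest to smallest, so the last surviving element is the smallest candidate.
  private final List<Candidate> candidates = new ArrayList<>();
  // Stand-in for "use the largest bloom filter as an approximate deduplication counter".
  private final Set<Object> approxDistinct = new HashSet<>();

  AdaptiveCandidateSketch(int maxBytes, int numCandidates, double fpp) {
    int bytes = maxBytes;
    for (int i = 0; i < numCandidates && bytes > 0; i++) {
      candidates.add(new Candidate(bytes, supportedNdv(bytes, fpp)));
      bytes /= 2; // candidate sizes follow maxBytes, maxBytes/2, maxBytes/4, ...
    }
  }

  /** Classic bound: m bits support roughly n = -m * (ln 2)^2 / ln(fpp) distinct values. */
  static long supportedNdv(int numBytes, double fpp) {
    double bits = numBytes * 8.0;
    return (long) (-bits * Math.log(2) * Math.log(2) / Math.log(fpp));
  }

  void insert(Object value) {
    approxDistinct.add(value);
    long distinct = approxDistinct.size();
    // Drop candidates that can no longer hold the observed distinct count,
    // but always keep at least one so something can be written out.
    while (candidates.size() > 1 && distinct > candidates.get(candidates.size() - 1).maxNdv) {
      candidates.remove(candidates.size() - 1);
    }
  }

  /** The candidate that would be written out: the smallest one still standing. */
  Candidate chosen() {
    return candidates.get(candidates.size() - 1);
  }
}
```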



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720922#comment-17720922
 ] 

ASF GitHub Bot commented on PARQUET-2254:
-

yabola commented on code in PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#discussion_r1188564979


##
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/AdaptiveBlockSplitBloomFilter.java:
##
@@ -0,0 +1,307 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.column.values.bloomfilter;
+
+import static 
org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.LOWER_BOUND_BYTES;
+import static 
org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.UPPER_BOUND_BYTES;
+
+import java.io.IOException;
+import java.io.OutputStream;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Objects;
+import java.util.Optional;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.parquet.Preconditions;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.io.api.Binary;
+
+/**
+ * `AdaptiveBlockSplitBloomFilter` contains multiple `BlockSplitBloomFilter` 
as candidates and inserts values in
+ * the candidates at the same time.
+ * The purpose of this is to finally generate a bloom filter with the optimal 
bit size according to the number
+ * of real data distinct values. Use the largest bloom filter as an 
approximate deduplication counter, and then
+ * remove incapable bloom filter candidate during data insertion.
+ */
+public class AdaptiveBlockSplitBloomFilter implements BloomFilter {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(AdaptiveBlockSplitBloomFilter.class);
+
+  // multiple candidates, inserting data at the same time. If the distinct 
values are greater than the
+  // expected NDV of candidates, it will be removed. Finally we will choose 
the smallest candidate to write out.
+  private final List<BloomFilterCandidate> candidates = new ArrayList<>();
+
+  // the largest among candidates and used as an approximate deduplication 
counter
+  private BloomFilterCandidate largestCandidate;
+
+  // the accumulator of the number of distinct values that have been inserted 
so far
+  private long distinctValueCounter = 0;
+
+  // indicates that the bloom filter candidate has been written out and new 
data should be no longer allowed to be inserted
+  private boolean finalized = false;
+
+  // indicates the step size to find the NDV value corresponding to numBytes
+  private static final int NDV_STEP = 500;
+  private int maximumBytes = UPPER_BOUND_BYTES;
+  private int minimumBytes = LOWER_BOUND_BYTES;
+  // the hash strategy used in this bloom filter.
+  private final HashStrategy hashStrategy;
+  // the column to build bloom filter
+  private ColumnDescriptor column;
+
+  /**
+   * Given the maximum acceptable bytes size of bloom filter, generate 
candidates according it.
+   *
+   * @param maximumBytes  the maximum bit size of candidate
+   * @param numCandidates the number of candidates
+   * @param fpp   the false positive probability
+   */
+  public AdaptiveBlockSplitBloomFilter(int maximumBytes, int numCandidates, 
double fpp, ColumnDescriptor column) {
+this(maximumBytes, HashStrategy.XXH64, fpp, numCandidates, column);
+  }
+
+  public AdaptiveBlockSplitBloomFilter(int maximumBytes, HashStrategy 
hashStrategy, double fpp,
+int numCandidates, ColumnDescriptor column) {
+this.column = column;
+switch (hashStrategy) {
+  case XXH64:
+this.hashStrategy = hashStrategy;
+break;
+  default:
+throw new RuntimeException("Unsupported hash strategy");
+}
+initCandidates(maximumBytes, numCandidates, fpp);
+  }
+
+  /**
+   * Given the maximum acceptable bytes size of bloom filter, generate 
candidates according
+   * to the bytes size. Because the bytes size of the candidate need to be a
+   * power of 2, we setting the candidate size according to `maxBytes` of 
`1/2`, `1/4`, `1/8`, etc.
+   *
+   * @param maxBytes  the maximum bit size 

[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720923#comment-17720923
 ] 

ASF GitHub Bot commented on PARQUET-2254:
-

yabola commented on code in PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#discussion_r1188564979


##
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/AdaptiveBlockSplitBloomFilter.java:
##
@@ -0,0 +1,307 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.column.values.bloomfilter;
+
+import static 
org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.LOWER_BOUND_BYTES;
+import static 
org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.UPPER_BOUND_BYTES;
+
+import java.io.IOException;
+import java.io.OutputStream;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Objects;
+import java.util.Optional;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.parquet.Preconditions;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.io.api.Binary;
+
+/**
+ * `AdaptiveBlockSplitBloomFilter` contains multiple `BlockSplitBloomFilter` 
as candidates and inserts values in
+ * the candidates at the same time.
+ * The purpose of this is to finally generate a bloom filter with the optimal 
bit size according to the number
+ * of real data distinct values. Use the largest bloom filter as an 
approximate deduplication counter, and then
+ * remove incapable bloom filter candidate during data insertion.
+ */
+public class AdaptiveBlockSplitBloomFilter implements BloomFilter {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(AdaptiveBlockSplitBloomFilter.class);
+
+  // multiple candidates, inserting data at the same time. If the distinct 
values are greater than the
+  // expected NDV of candidates, it will be removed. Finally we will choose 
the smallest candidate to write out.
+  private final List<BloomFilterCandidate> candidates = new ArrayList<>();
+
+  // the largest among candidates and used as an approximate deduplication 
counter
+  private BloomFilterCandidate largestCandidate;
+
+  // the accumulator of the number of distinct values that have been inserted 
so far
+  private long distinctValueCounter = 0;
+
+  // indicates that the bloom filter candidate has been written out and new 
data should be no longer allowed to be inserted
+  private boolean finalized = false;
+
+  // indicates the step size to find the NDV value corresponding to numBytes
+  private static final int NDV_STEP = 500;
+  private int maximumBytes = UPPER_BOUND_BYTES;
+  private int minimumBytes = LOWER_BOUND_BYTES;
+  // the hash strategy used in this bloom filter.
+  private final HashStrategy hashStrategy;
+  // the column to build bloom filter
+  private ColumnDescriptor column;
+
+  /**
+   * Given the maximum acceptable bytes size of bloom filter, generate 
candidates according it.
+   *
+   * @param maximumBytes  the maximum bit size of candidate
+   * @param numCandidates the number of candidates
+   * @param fpp   the false positive probability
+   */
+  public AdaptiveBlockSplitBloomFilter(int maximumBytes, int numCandidates, 
double fpp, ColumnDescriptor column) {
+this(maximumBytes, HashStrategy.XXH64, fpp, numCandidates, column);
+  }
+
+  public AdaptiveBlockSplitBloomFilter(int maximumBytes, HashStrategy 
hashStrategy, double fpp,
+int numCandidates, ColumnDescriptor column) {
+this.column = column;
+switch (hashStrategy) {
+  case XXH64:
+this.hashStrategy = hashStrategy;
+break;
+  default:
+throw new RuntimeException("Unsupported hash strategy");
+}
+initCandidates(maximumBytes, numCandidates, fpp);
+  }
+
+  /**
+   * Given the maximum acceptable bytes size of bloom filter, generate 
candidates according
+   * to the bytes size. Because the bytes size of the candidate need to be a
+   * power of 2, we setting the candidate size according to `maxBytes` of 
`1/2`, `1/4`, `1/8`, etc.
+   *
+   * @param maxBytes  the maximum bit size 

[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720924#comment-17720924
 ] 

ASF GitHub Bot commented on PARQUET-2254:
-

yabola commented on code in PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#discussion_r1188567859


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java:
##
@@ -152,6 +153,8 @@ public static enum JobSummaryLevel {
   public static final String BLOOM_FILTER_EXPECTED_NDV = 
"parquet.bloom.filter.expected.ndv";
   public static final String BLOOM_FILTER_MAX_BYTES = 
"parquet.bloom.filter.max.bytes";
   public static final String BLOOM_FILTER_FPP = "parquet.bloom.filter.fpp";
+  public static final String ADAPTIVE_BLOOM_FILTER_ENABLED = 
"parquet.bloom.filter.adaptive.enabled";

Review Comment:
   done, and I changed `parquet.bloom.filter.candidate.size` to 
`parquet.bloom.filter.candidates.number`.
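
For readers following the config discussion, usage would presumably look like the sketch below once the PR is merged; the key names come from the diff and the comment above, so treat the exact names and defaults as subject to the final merged version of PR #1042.

```java
import org.apache.hadoop.conf.Configuration;

// Hypothetical usage of the keys discussed in this thread; exact names and defaults
// depend on the version of PR #1042 that actually lands.
final class AdaptiveBloomFilterConfigExample {
  static Configuration enableAdaptiveBloomFilter(Configuration conf) {
    conf.setBoolean("parquet.bloom.filter.adaptive.enabled", true);
    conf.setInt("parquet.bloom.filter.candidates.number", 5); // number of candidate filters
    return conf;
  }
}
```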





> Build a BloomFilter with a more precise size
> 
>
> Key: PARQUET-2254
> URL: https://issues.apache.org/jira/browse/PARQUET-2254
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Assignee: Mars
>Priority: Major
>
> h3. Why are the changes needed?
> Currently, using a bloom filter requires specifying the NDV (number of distinct 
> values) up front and then building the BloomFilter from it. In many scenarios, the 
> number of distinct values is not actually known in advance.
> If the BloomFilter can be sized automatically from the data instead, the file 
> size can be reduced and reading efficiency can also be improved.
> h3. What changes were proposed in this pull request?
> {{DynamicBlockBloomFilter}} contains multiple {{BlockSplitBloomFilter}} as 
> candidates and inserts values into all candidates at the same time. It uses the 
> largest bloom filter as an approximate deduplication counter, and removes 
> candidates whose supported NDV has been exceeded as data is inserted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720926#comment-17720926
 ] 

ASF GitHub Bot commented on PARQUET-2254:
-

yabola commented on code in PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#discussion_r1188577599


##
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/AdaptiveBlockSplitBloomFilter.java:
##
@@ -0,0 +1,307 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.column.values.bloomfilter;
+
+import static 
org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.LOWER_BOUND_BYTES;
+import static 
org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.UPPER_BOUND_BYTES;
+
+import java.io.IOException;
+import java.io.OutputStream;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Objects;
+import java.util.Optional;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.parquet.Preconditions;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.io.api.Binary;
+
+/**
+ * `AdaptiveBlockSplitBloomFilter` contains multiple `BlockSplitBloomFilter` 
as candidates and inserts values in
+ * the candidates at the same time.
+ * The purpose of this is to finally generate a bloom filter with the optimal 
bit size according to the number
+ * of real data distinct values. Use the largest bloom filter as an 
approximate deduplication counter, and then
+ * remove incapable bloom filter candidate during data insertion.
+ */
+public class AdaptiveBlockSplitBloomFilter implements BloomFilter {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(AdaptiveBlockSplitBloomFilter.class);
+
+  // multiple candidates, inserting data at the same time. If the distinct 
values are greater than the
+  // expected NDV of candidates, it will be removed. Finally we will choose 
the smallest candidate to write out.
+  private final List<BloomFilterCandidate> candidates = new ArrayList<>();
+
+  // the largest among candidates and used as an approximate deduplication 
counter
+  private BloomFilterCandidate largestCandidate;
+
+  // the accumulator of the number of distinct values that have been inserted 
so far
+  private long distinctValueCounter = 0;
+
+  // indicates that the bloom filter candidate has been written out and new 
data should be no longer allowed to be inserted
+  private boolean finalized = false;
+
+  // indicates the step size to find the NDV value corresponding to numBytes
+  private static final int NDV_STEP = 500;
+  private int maximumBytes = UPPER_BOUND_BYTES;
+  private int minimumBytes = LOWER_BOUND_BYTES;
+  // the hash strategy used in this bloom filter.
+  private final HashStrategy hashStrategy;
+  // the column to build bloom filter
+  private ColumnDescriptor column;
+
+  /**
+   * Given the maximum acceptable bytes size of bloom filter, generate 
candidates according it.
+   *
+   * @param maximumBytes  the maximum bit size of candidate
+   * @param numCandidates the number of candidates
+   * @param fpp   the false positive probability
+   */
+  public AdaptiveBlockSplitBloomFilter(int maximumBytes, int numCandidates, 
double fpp, ColumnDescriptor column) {
+this(maximumBytes, HashStrategy.XXH64, fpp, numCandidates, column);
+  }
+
+  public AdaptiveBlockSplitBloomFilter(int maximumBytes, HashStrategy 
hashStrategy, double fpp,
+int numCandidates, ColumnDescriptor column) {
+this.column = column;
+switch (hashStrategy) {
+  case XXH64:
+this.hashStrategy = hashStrategy;
+break;
+  default:
+throw new RuntimeException("Unsupported hash strategy");
+}
+initCandidates(maximumBytes, numCandidates, fpp);
+  }
+
+  /**
+   * Given the maximum acceptable bytes size of bloom filter, generate 
candidates according
+   * to the bytes size. Because the bytes size of the candidate need to be a
+   * power of 2, we setting the candidate size according to `maxBytes` of 
`1/2`, `1/4`, `1/8`, etc.
+   *
+   * @param maxBytes  the maximum bit size 

[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720927#comment-17720927
 ] 

ASF GitHub Bot commented on PARQUET-2254:
-

yabola commented on code in PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#discussion_r1188579419


##
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/AdaptiveBlockSplitBloomFilter.java:
##
@@ -38,57 +38,51 @@
 import org.apache.parquet.io.api.Binary;
 
 /**
- * `DynamicBlockBloomFilter` contains multiple `BlockSplitBloomFilter` as 
candidates and inserts values in
+ * `AdaptiveBlockSplitBloomFilter` contains multiple `BlockSplitBloomFilter` 
as candidates and inserts values in
  * the candidates at the same time.
  * The purpose of this is to finally generate a bloom filter with the optimal 
bit size according to the number
  * of real data distinct values. Use the largest bloom filter as an 
approximate deduplication counter, and then
  * remove incapable bloom filter candidate during data insertion.
  */
-public class DynamicBlockBloomFilter implements BloomFilter {
+public class AdaptiveBlockSplitBloomFilter implements BloomFilter {
 
-  private static final Logger LOG = 
LoggerFactory.getLogger(DynamicBlockBloomFilter.class);
+  private static final Logger LOG = 
LoggerFactory.getLogger(AdaptiveBlockSplitBloomFilter.class);
 
   // multiple candidates, inserting data at the same time. If the distinct 
values are greater than the
   // expected NDV of candidates, it will be removed. Finally we will choose 
the smallest candidate to write out.
   private final List<BloomFilterCandidate> candidates = new ArrayList<>();
 
   // the largest among candidates and used as an approximate deduplication 
counter
-  private BloomFilterCandidate maxCandidate;
+  private BloomFilterCandidate largestCandidate;
 
   // the accumulator of the number of distinct values that have been inserted 
so far
-  private int distinctValueCounter = 0;
+  private long distinctValueCounter = 0;
 
   // indicates that the bloom filter candidate has been written out and new 
data should be no longer allowed to be inserted
   private boolean finalized = false;
 
+  // indicates the step size to find the NDV value corresponding to numBytes
+  private static final int NDV_STEP = 500;
   private int maximumBytes = UPPER_BOUND_BYTES;
   private int minimumBytes = LOWER_BOUND_BYTES;
   // the hash strategy used in this bloom filter.
   private final HashStrategy hashStrategy;
   // the column to build bloom filter
   private ColumnDescriptor column;
 
-  public DynamicBlockBloomFilter(int numBytes, int candidatesNum, double fpp, 
ColumnDescriptor column) {
-this(numBytes, LOWER_BOUND_BYTES, UPPER_BOUND_BYTES, HashStrategy.XXH64, 
fpp, candidatesNum, column);
-  }
-
-  public DynamicBlockBloomFilter(int numBytes, int maximumBytes, int 
candidatesNum, double fpp, ColumnDescriptor column) {
-this(numBytes, LOWER_BOUND_BYTES, maximumBytes, HashStrategy.XXH64, fpp, 
candidatesNum, column);
+  /**
+   * Given the maximum acceptable bytes size of bloom filter, generate 
candidates according it.
+   *
+   * @param maximumBytes  the maximum bit size of candidate

Review Comment:
   @wgtmac sorry for the late reply, I have been a little busy these days.
   It means the byte size, and I have changed the variable name. Thank you!



##
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/AdaptiveBlockSplitBloomFilter.java:
##
@@ -38,57 +38,51 @@
 import org.apache.parquet.io.api.Binary;
 
 /**
- * `DynamicBlockBloomFilter` contains multiple `BlockSplitBloomFilter` as 
candidates and inserts values in
+ * `AdaptiveBlockSplitBloomFilter` contains multiple `BlockSplitBloomFilter` 
as candidates and inserts values in
  * the candidates at the same time.
  * The purpose of this is to finally generate a bloom filter with the optimal 
bit size according to the number
  * of real data distinct values. Use the largest bloom filter as an 
approximate deduplication counter, and then
  * remove incapable bloom filter candidate during data insertion.
  */
-public class DynamicBlockBloomFilter implements BloomFilter {
+public class AdaptiveBlockSplitBloomFilter implements BloomFilter {
 
-  private static final Logger LOG = 
LoggerFactory.getLogger(DynamicBlockBloomFilter.class);
+  private static final Logger LOG = 
LoggerFactory.getLogger(AdaptiveBlockSplitBloomFilter.class);
 
   // multiple candidates, inserting data at the same time. If the distinct 
values are greater than the
   // expected NDV of candidates, it will be removed. Finally we will choose 
the smallest candidate to write out.
   private final List<BloomFilterCandidate> candidates = new ArrayList<>();
 
   // the largest among candidates and used as an approximate deduplication 
counter
-  private BloomFilterCandidate maxCandidate;
+  private BloomFilterCandidate largestCandidate;
 
   // the accumulator of the number of distinct values that have been inserted so far
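
The class description quoted above is easier to follow with a small, self-contained sketch of the adaptive scheme it outlines: insert every value into all surviving candidates, use the largest candidate as an approximate distinct counter, drop candidates whose expected NDV has been exceeded, and finally keep the smallest survivor. This is not the PR's code; the toy Candidate below stands in for BlockSplitBloomFilter, and the sizes, hashing, and bits-per-NDV ratio are invented for illustration.

import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;

public class AdaptiveFilterSketch {

  // Toy stand-in for BlockSplitBloomFilter: a BitSet with three probe positions per value.
  static final class Candidate {
    final long expectedNdv; // distinct values this candidate is sized for
    final int numBits;
    final BitSet bits;

    Candidate(long expectedNdv, int numBits) {
      this.expectedNdv = expectedNdv;
      this.numBits = numBits;
      this.bits = new BitSet(numBits);
    }

    boolean mightContain(long hash) {
      for (int i = 0; i < 3; i++) {
        if (!bits.get(probe(hash, i))) {
          return false;
        }
      }
      return true;
    }

    void insert(long hash) {
      for (int i = 0; i < 3; i++) {
        bits.set(probe(hash, i));
      }
    }

    private int probe(long hash, int i) {
      // Not a real bloom-filter hash; just enough to spread bits for the example.
      return (int) Math.floorMod(hash * 31L + i * 0x9E3779B97F4A7C15L, (long) numBits);
    }
  }

  private final List<Candidate> candidates = new ArrayList<>();
  private final Candidate largestCandidate; // doubles as an approximate deduplication counter
  private long distinctValueCounter = 0;

  public AdaptiveFilterSketch() {
    // Candidate sizes shrink by powers of two (1/2, 1/4, ...), largest first.
    for (int bits = 1 << 20; bits >= 1 << 17; bits >>= 1) {
      candidates.add(new Candidate(bits / 16L, bits)); // rough bits-per-NDV ratio, illustrative only
    }
    largestCandidate = candidates.get(0);
  }

  public void insert(long hash) {
    // Count a new distinct value only if the largest candidate has not seen it yet.
    if (!largestCandidate.mightContain(hash)) {
      distinctValueCounter++;
      // Remove candidates that can no longer hold the observed number of distinct values.
      candidates.removeIf(c -> c != largestCandidate && c.expectedNdv < distinctValueCounter);
    }
    for (Candidate c : candidates) {
      c.insert(hash);
    }
  }

  // The smallest surviving candidate is the one that would be written out.
  public Candidate optimalCandidate() {
    return candidates.stream()
        .min(Comparator.comparingInt((Candidate c) -> c.numBits))
        .orElse(largestCandidate);
  }
}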

[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17721141#comment-17721141
 ] 

ASF GitHub Bot commented on PARQUET-2254:
-

wgtmac commented on code in PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#discussion_r1189285380


##
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/AdaptiveBlockSplitBloomFilter.java:
##
@@ -0,0 +1,307 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.column.values.bloomfilter;
+
+import static 
org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.LOWER_BOUND_BYTES;
+import static 
org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.UPPER_BOUND_BYTES;
+
+import java.io.IOException;
+import java.io.OutputStream;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Objects;
+import java.util.Optional;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.parquet.Preconditions;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.io.api.Binary;
+
+/**
+ * The purpose of this is to finally generate a bloom filter with the optimal 
bit size according to the number
+ * of real data distinct values.
+ * `AdaptiveBlockSplitBloomFilter` contains multiple `BlockSplitBloomFilter` 
as candidates and inserts values in
+ * the candidates at the same time. Finally we will choose the smallest 
candidate to write out.
+ */
+public class AdaptiveBlockSplitBloomFilter implements BloomFilter {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(AdaptiveBlockSplitBloomFilter.class);
+
+  // multiple candidates, inserting data at the same time. If the distinct 
values are greater than the
+  // expected NDV of one candidate, it will be removed. Finally we will choose 
the smallest candidate to write out.
+  private final List<BloomFilterCandidate> candidates = new ArrayList<>();
+
+  // the largest among candidates and also used as an approximate 
deduplication counter
+  private BloomFilterCandidate largestCandidate;
+
+  // the accumulator of the number of distinct values that have been inserted 
so far
+  private long distinctValueCounter = 0;
+
+  // indicates that the bloom filter candidate has been written out and new 
data should be no longer allowed to be inserted
+  private boolean finalized = false;
+
+  // indicates the step size to find the NDV value corresponding to numBytes
+  private static final int NDV_STEP = 500;
+  private int maximumBytes = UPPER_BOUND_BYTES;
+  private int minimumBytes = LOWER_BOUND_BYTES;
+  private int minimumCandidateNdv = 16;
+  // the hash strategy used in this bloom filter.
+  private final HashStrategy hashStrategy;
+  // the column to build bloom filter
+  private ColumnDescriptor column;
+
+  /**
+   * Given the maximum acceptable bytes size of bloom filter.
+   *
+   * @param maximumBytes  the maximum bytes size of candidate
+   * @param numCandidates the number of candidates
+   * @param fpp   the false positive probability
+   */
+  public AdaptiveBlockSplitBloomFilter(int maximumBytes, int numCandidates, 
double fpp, ColumnDescriptor column) {
+this(maximumBytes, HashStrategy.XXH64, fpp, numCandidates, column);
+  }
+
+  public AdaptiveBlockSplitBloomFilter(int maximumBytes, HashStrategy 
hashStrategy, double fpp,
+int numCandidates, ColumnDescriptor column) {
+this.column = column;
+switch (hashStrategy) {
+  case XXH64:
+this.hashStrategy = hashStrategy;
+break;
+  default:
+throw new RuntimeException("Unsupported hash strategy");
+}
+initCandidates(maximumBytes, numCandidates, fpp);
+  }
+
+  /**
+   * This method will generate candidates according to the maximum acceptable 
bytes size of bloom filter.
+   * Because the bytes size of the candidate needs to be a power of 2, here we set the candidate size to be
+   * a proportion of `maxBytes` like `1/2`, `1/4`, `1/8`, etc.
+   *
+   * @param maxBytes  the maximum bytes size of candidate
+   * @param numCandidates the number of candidates
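
The javadoc above describes two steps: pick candidate byte sizes as power-of-two fractions of the maximum acceptable size, and for each size find the NDV it can support by stepping in NDV_STEP increments. The following is a rough standalone sketch of that sizing under the textbook Bloom-filter formula bits = -n*ln(p)/ln(2)^2; the helper names (generateCandidates, expectedNdvForBytes, requiredBits) are invented here and are not the PR's actual initCandidates code.

import java.util.ArrayList;
import java.util.List;

final class CandidateSizingSketch {

  // Each entry: {candidateBytes, expectedNdv}. Candidate sizes are power-of-two
  // fractions of maxBytes, stopping at minBytes or after numCandidates entries.
  static List<long[]> generateCandidates(int maxBytes, int minBytes, int numCandidates,
                                         double fpp, int ndvStep) {
    List<long[]> candidates = new ArrayList<>();
    int candidateBytes = Integer.highestOneBit(maxBytes); // round down to a power of two
    for (int i = 0; i < numCandidates && candidateBytes >= minBytes; i++, candidateBytes >>= 1) {
      candidates.add(new long[] {candidateBytes, expectedNdvForBytes(candidateBytes, fpp, ndvStep)});
    }
    return candidates;
  }

  // Walk NDV upwards in ndvStep increments until the required bits exceed the candidate size.
  static long expectedNdvForBytes(int numBytes, double fpp, int ndvStep) {
    long bitsAvailable = 8L * numBytes;
    long ndv = 0;
    while (requiredBits(ndv + ndvStep, fpp) <= bitsAvailable) {
      ndv += ndvStep;
    }
    return ndv;
  }

  // Textbook sizing: m = -n * ln(p) / (ln 2)^2 bits for n distinct values at false-positive rate p.
  static long requiredBits(long ndv, double fpp) {
    return (long) Math.ceil(-ndv * Math.log(fpp) / (Math.log(2) * Math.log(2)));
  }
}

For example, with fpp = 0.01 and ndvStep = 500, a 1 MiB candidate (8,388,608 bits) works out to roughly 875,000 distinct values at about 9.6 bits per value.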

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-05-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17721586#comment-17721586
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

wgtmac commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1543226234

   @JFinis Do you have a plan to revive this?




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member state the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3 is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
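
If solution 3 from the description above were adopted, the reader-side check would be straightforward. The sketch below only illustrates that proposal: plain arrays stand in for the Thrift ColumnIndex lists, and none of this is current parquet-format or parquet-mr behaviour.

final class NaNOnlyPageCheck {

  // Under "solution 3": a page whose null_pages entry is false but whose min and max
  // are both NaN can only contain NaN values (possibly mixed with nulls).
  static boolean isOnlyNaNPage(boolean[] nullPages, double[] minValues, double[] maxValues, int page) {
    return !nullPages[page]
        && Double.isNaN(minValues[page])
        && Double.isNaN(maxValues[page]);
  }
}

A filter evaluator could then skip such a page for ordinary range predicates, since NaN never satisfies value comparisons, while still scanning it for predicates that explicitly look for NaN.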

