[jira] [Commented] (PARQUET-1889) Register a MIME type for the Parquet format.

2023-03-08 Thread Bryce Mecum (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698080#comment-17698080
 ] 

Bryce Mecum commented on PARQUET-1889:
--

It looks like a request to register application/vnd.apache.parquet with IANA 
was submitted in early 2023, as evidenced by this thread on the Parquet dev 
ML: https://lists.apache.org/thread/lrfsjhzoq20o95z5zn9zyrb8rdolqzz7. IANA has 
since requested changes to the initial application, so I'll keep an eye on the 
ML and update here when we can close this.

> Register a MIME type for the Parquet format.
> 
>
> Key: PARQUET-1889
> URL: https://issues.apache.org/jira/browse/PARQUET-1889
> Project: Parquet
>  Issue Type: Wish
>  Components: parquet-format
>Affects Versions: format-2.7.0
>Reporter: Mark Wood
>Priority: Major
>
> There is currently no MIME type registered for Parquet. Perhaps this is 
> intentional.
> If it is not, I suggest steps be taken to register a MIME type with IANA.
>  
> [https://www.iana.org/assignments/media-types/media-types.xhtml]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-03-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698051#comment-17698051
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

vectorijk commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1129917664


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +189,69 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+            HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            LOG.error("Input files have different schemas, expect: {}, input: {}, current file: {}",
+                this.schema, inputFileSchema, inputFile);
+            throw new InvalidSchemaException("Input files have different schemas, current file: " + inputFile);
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);

Review Comment:
   sounds good 





> ParquetRewriter supports more than one input file
> -
>
> Key: PARQUET-2228
> URL: https://issues.apache.org/jira/browse/PARQUET-2228
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> ParquetRewriter currently supports only one input file. The scope of this 
> task is to support multiple input files and have the rewriter merge them 
> into a single file, with or without rewrite options specified.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] vectorijk commented on a diff in pull request #1026: PARQUET-2228: ParquetRewriter supports more than one input file

2023-03-08 Thread via GitHub


vectorijk commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1129917664


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +189,69 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+            HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            LOG.error("Input files have different schemas, expect: {}, input: {}, current file: {}",
+                this.schema, inputFileSchema, inputFile);
+            throw new InvalidSchemaException("Input files have different schemas, current file: " + inputFile);
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);

Review Comment:
   sounds good 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (PARQUET-2237) Improve performance when filters in RowGroupFilter can match exactly

2023-03-08 Thread Mars (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mars reassigned PARQUET-2237:
-

Assignee: Mars

> Improve performance when filters in RowGroupFilter can match exactly
> 
>
> Key: PARQUET-2237
> URL: https://issues.apache.org/jira/browse/PARQUET-2237
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Assignee: Mars
>Priority: Major
>
> If the min/max statistics can answer a filter exactly, we don't need to load 
> the dictionary from the filesystem and compare values one by one anymore.
> Similarly, the BloomFilter has to be loaded from the filesystem, which costs 
> time and memory. If we can determine the existence or nonexistence of a 
> value exactly from the min/max statistics or the dictionary, we can skip the 
> BloomFilter entirely to improve performance.
> For example:
>  # When reading data greater than {{x1}} in a block, if the min/max 
> statistics show that all values are greater than {{x1}}, we don't need to 
> read the dictionary and compare one by one.
>  # If we already have page dictionaries and have compared value by value, we 
> don't need to read the BloomFilter and compare again.
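
A minimal sketch of the short-circuit order described above, with hypothetical
types (this is not the parquet-mr RowGroupFilter API): consult the cheapest
structure first and stop at the first exact answer, so the dictionary and the
bloom filter are only loaded when a cheaper structure could not decide.

```java
// Hypothetical sketch of the proposed short-circuit evaluation order.
public class RowGroupPruning {

  enum Match { DROP, MUST_READ, MIGHT_MATCH }   // DROP/MUST_READ are exact answers

  interface RowGroup {
    Match checkStatistics();   // cheap: min/max already sits in the footer
    Match checkDictionary();   // loads the dictionary page from the filesystem
    Match checkBloomFilter();  // loads the bloom filter from the filesystem
  }

  static boolean mustReadBlock(RowGroup g) {
    Match m = g.checkStatistics();
    if (m != Match.MIGHT_MATCH) return m == Match.MUST_READ; // exact: skip dictionary and bloom filter
    m = g.checkDictionary();
    if (m != Match.MIGHT_MATCH) return m == Match.MUST_READ; // exact: skip the bloom filter
    return g.checkBloomFilter() != Match.DROP;               // pay bloom filter I/O only as a last resort
  }
}
```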



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-03-08 Thread Mars (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697987#comment-17697987
 ] 

Mars commented on PARQUET-2254:
---

[~wgtmac] [~gszadovszky] 
1) This Jira tracks building a BloomFilter with a more precise size. As the 
description says, this is my general idea; I will complete a first version.
2) PARQUET-2237 is about optimizing RowGroupFilter.
As [~gszadovszky] said, I am also unsure whether it is better to check the 
dictionary first or the bloom filter first.
But one thing is sure: if there is a dictionary, skipping the BloomFilter 
comparison will definitely be better than the previous implementation.
Going one step further, the order of checking the bloom filter and the 
dictionary needs more consideration.

3) In PARQUET-2237 I now have another idea: if there are multiple filter 
predicates, e.g. connected by `OR`, we can optimize them one by one (see the 
sketch below).
For example, A > 3 OR B = 1:
if the statistics filter determines that A > 3 is impossible (drop) while 
B = 1 might match, the overall statistics-filter result is "might match";
then in the dictionary filter we only need to compare B, avoiding the 
comparison on A.
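
A small sketch of that per-predicate pruning (hypothetical types, not the
parquet-mr filter API): each cheaper stage removes the `OR` branches it can
already drop, so the more expensive stages only re-evaluate the survivors.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: prune the branches of an OR stage by stage.
class OrBranchPruning {
  // Keep only the branches a stage could NOT drop; later (more expensive)
  // stages such as the dictionary or bloom filter only see the survivors.
  static <P> List<P> surviving(List<P> branches, Predicate<P> stageDrops) {
    List<P> remaining = new ArrayList<>();
    for (P branch : branches) {
      if (!stageDrops.test(branch)) {
        remaining.add(branch);   // still "might match"
      }
    }
    return remaining;            // empty => the whole OR drops for this row group
  }
}
```

For `A > 3 OR B = 1`: if the statistics stage drops `A > 3`, only `B = 1` is
passed on to the dictionary stage.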

> Build a BloomFilter with a more precise size
> 
>
> Key: PARQUET-2254
> URL: https://issues.apache.org/jira/browse/PARQUET-2254
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Assignee: Mars
>Priority: Major
>
> Today the user specifies a size up front and the BloomFilter is built to 
> that size. In general scenarios the number of distinct values is not known 
> in advance.
> If the BloomFilter could be sized automatically from the data, the file 
> size could be reduced and reading efficiency improved.
> My idea is that the user specifies a maximum BloomFilter size, and we build 
> multiple BloomFilters of different sizes at the same time, using the largest 
> one as a counting tool (if a value does not hit it when inserted, a counter 
> is incremented; this may be imprecise, but it is good enough).
> Then at the end of the write, when the file is finally written, we choose a 
> BloomFilter of a more appropriate size.
> I want to implement this feature and hope to get your opinions, thank you.
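
A rough sketch of that idea, using Guava's BloomFilter as a stand-in
(illustrative only; parquet-mr has its own bloom filter implementation, and
the miss-based counting is approximate by design):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: build several candidate filters, keep the smallest
// one whose capacity still covers the observed distinct-value count.
// Assumes maxNdv >= 1024 so at least one candidate exists.
class AdaptiveBloomWriter {
  private final List<BloomFilter<Long>> candidates = new ArrayList<>();
  private final List<Long> capacities = new ArrayList<>();   // descending
  private long approxDistinct = 0;

  AdaptiveBloomWriter(long maxNdv, double fpp) {
    for (long ndv = maxNdv; ndv >= 1024; ndv /= 2) {
      candidates.add(BloomFilter.create(Funnels.longFunnel(), (int) ndv, fpp));
      capacities.add(ndv);
    }
  }

  void add(long value) {
    // The largest candidate doubles as the counting tool: a miss means the
    // value is (almost certainly) new, so bump the distinct-value counter.
    if (!candidates.get(0).mightContain(value)) {
      approxDistinct++;
    }
    for (BloomFilter<Long> bf : candidates) {
      bf.put(value);
    }
  }

  // Called when the file is finally written: pick the smallest adequate filter.
  BloomFilter<Long> finish() {
    BloomFilter<Long> best = candidates.get(0);
    for (int i = 0; i < candidates.size(); i++) {
      if (capacities.get(i) >= approxDistinct) {
        best = candidates.get(i);    // capacities descend, so the last fit is smallest
      }
    }
    return best;
  }
}
```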



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2202) Redundant String allocation on the hot path in CapacityByteArrayOutputStream.setByte

2023-03-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697862#comment-17697862
 ] 

ASF GitHub Bot commented on PARQUET-2202:
-

wgtmac commented on code in PR #1035:
URL: https://github.com/apache/parquet-mr/pull/1035#discussion_r1129335329


##
parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java:
##
@@ -477,15 +477,15 @@ public Builder withMaxBloomFilterBytes(int maxBloomFilterBytes) {
      * @return this builder for method chaining
      */
     public Builder withBloomFilterNDV(String columnPath, long ndv) {
-      Preconditions.checkArgument(ndv > 0, "Invalid NDV for column \"%s\": %d", columnPath, ndv);
+      Preconditions.checkArgument(ndv > 0, "Invalid NDV for column \"%s\": %s", columnPath, ndv);
       this.bloomFilterNDVs.withValue(columnPath, ndv);
       // Setting an NDV for a column implies writing a bloom filter
       this.bloomFilterEnabled.withValue(columnPath, true);
       return this;
     }
 
     public Builder withBloomFilterFPP(String columnPath, double fpp) {
-      Preconditions.checkArgument(fpp > 0.0 && fpp < 1.0, "Invalid FPP for column \"%s\": %d", columnPath, fpp);
+      Preconditions.checkArgument(fpp > 0.0 && fpp < 1.0, "Invalid FPP for column \"%s\": %s", columnPath, fpp);

Review Comment:
   Thanks for the explanation. SGTM.





> Redundant String allocation on the hot path in 
> CapacityByteArrayOutputStream.setByte
> 
>
> Key: PARQUET-2202
> URL: https://issues.apache.org/jira/browse/PARQUET-2202
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.3
>Reporter: Andrei Pangin
>Priority: Major
>  Labels: performance
> Attachments: profile-alloc.png, profile-cpu.png
>
>
> Profiling of a Spark application revealed a performance issue in production:
> {{CapacityByteArrayOutputStream.setByte}} consumed 2.2% of total CPU time and 
> made up 4.6% of total allocations. However, in the normal case, this method 
> should allocate nothing at all.
> Here is an excerpt from the async-profiler report.
> CPU profile:
> !profile-cpu.png|width=560!
> Allocation profile:
> !profile-alloc.png|width=560!
> The reason is a {{checkArgument()}} call with an unconditionally constructed 
> dynamic String:
> [https://github.com/apache/parquet-mr/blob/62b774cd0f0c60cfbe540bbfa60bee15929af5d4/parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java#L303]
> The suggested fix is to move String construction under the condition:
> {code:java}
> if (index >= bytesUsed) {
>   throw new IllegalArgumentException("Index: " + index +
>   " is >= the current size of: " + bytesUsed);
> }{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1035: PARQUET-2202: Review usage and implementation of Preconditions.checkargument method

2023-03-08 Thread via GitHub


wgtmac commented on code in PR #1035:
URL: https://github.com/apache/parquet-mr/pull/1035#discussion_r1129335329


##
parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java:
##
@@ -477,15 +477,15 @@ public Builder withMaxBloomFilterBytes(int maxBloomFilterBytes) {
      * @return this builder for method chaining
      */
     public Builder withBloomFilterNDV(String columnPath, long ndv) {
-      Preconditions.checkArgument(ndv > 0, "Invalid NDV for column \"%s\": %d", columnPath, ndv);
+      Preconditions.checkArgument(ndv > 0, "Invalid NDV for column \"%s\": %s", columnPath, ndv);
       this.bloomFilterNDVs.withValue(columnPath, ndv);
       // Setting an NDV for a column implies writing a bloom filter
       this.bloomFilterEnabled.withValue(columnPath, true);
       return this;
     }
 
     public Builder withBloomFilterFPP(String columnPath, double fpp) {
-      Preconditions.checkArgument(fpp > 0.0 && fpp < 1.0, "Invalid FPP for column \"%s\": %d", columnPath, fpp);
+      Preconditions.checkArgument(fpp > 0.0 && fpp < 1.0, "Invalid FPP for column \"%s\": %s", columnPath, fpp);

Review Comment:
   Thanks for the explanation. SGTM.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2202) Redundant String allocation on the hot path in CapacityByteArrayOutputStream.setByte

2023-03-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697843#comment-17697843
 ] 

ASF GitHub Bot commented on PARQUET-2202:
-

jerolba commented on PR #1035:
URL: https://github.com/apache/parquet-mr/pull/1035#issuecomment-1460008854

   > Is the change exhaustive?
   
   Yes, I reviewed all calls to `checkArgument` to fix the incorrect usage of %f and %d.




> Redundant String allocation on the hot path in 
> CapacityByteArrayOutputStream.setByte
> 
>
> Key: PARQUET-2202
> URL: https://issues.apache.org/jira/browse/PARQUET-2202
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.3
>Reporter: Andrei Pangin
>Priority: Major
>  Labels: performance
> Attachments: profile-alloc.png, profile-cpu.png
>
>
> Profiling of a Spark application revealed a performance issue in production:
> {{CapacityByteArrayOutputStream.setByte}} consumed 2.2% of total CPU time and 
> made up 4.6% of total allocations. However, in the normal case, this method 
> should allocate nothing at all.
> Here is an excerpt from the async-profiler report.
> CPU profile:
> !profile-cpu.png|width=560!
> Allocation profile:
> !profile-alloc.png|width=560!
> The reason is a {{checkArgument()}} call with an unconditionally constructed 
> dynamic String:
> [https://github.com/apache/parquet-mr/blob/62b774cd0f0c60cfbe540bbfa60bee15929af5d4/parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java#L303]
> The suggested fix is to move String construction under the condition:
> {code:java}
> if (index >= bytesUsed) {
>   throw new IllegalArgumentException("Index: " + index +
>   " is >= the current size of: " + bytesUsed);
> }{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] jerolba commented on pull request #1035: PARQUET-2202: Review usage and implementation of Preconditions.checkargument method

2023-03-08 Thread via GitHub


jerolba commented on PR #1035:
URL: https://github.com/apache/parquet-mr/pull/1035#issuecomment-1460008854

   > Is the change exhaustive?
   
   Yes, I reviewed all calls to `checkArgument` to fix the incorrect usage of %f and %d.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2202) Redundant String allocation on the hot path in CapacityByteArrayOutputStream.setByte

2023-03-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697841#comment-17697841
 ] 

ASF GitHub Bot commented on PARQUET-2202:
-

jerolba commented on code in PR #1035:
URL: https://github.com/apache/parquet-mr/pull/1035#discussion_r1129277490


##
parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java:
##
@@ -477,15 +477,15 @@ public Builder withMaxBloomFilterBytes(int maxBloomFilterBytes) {
      * @return this builder for method chaining
      */
     public Builder withBloomFilterNDV(String columnPath, long ndv) {
-      Preconditions.checkArgument(ndv > 0, "Invalid NDV for column \"%s\": %d", columnPath, ndv);
+      Preconditions.checkArgument(ndv > 0, "Invalid NDV for column \"%s\": %s", columnPath, ndv);

Review Comment:
   Same logic as the previous comment: it fails when the check fails, since that is when the message gets formatted.





> Redundant String allocation on the hot path in 
> CapacityByteArrayOutputStream.setByte
> 
>
> Key: PARQUET-2202
> URL: https://issues.apache.org/jira/browse/PARQUET-2202
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.3
>Reporter: Andrei Pangin
>Priority: Major
>  Labels: performance
> Attachments: profile-alloc.png, profile-cpu.png
>
>
> Profiling of a Spark application revealed a performance issue in production:
> {{CapacityByteArrayOutputStream.setByte}} consumed 2.2% of total CPU time and 
> made up 4.6% of total allocations. However, in the normal case, this method 
> should allocate nothing at all.
> Here is an excerpt from the async-profiler report.
> CPU profile:
> !profile-cpu.png|width=560!
> Allocation profile:
> !profile-alloc.png|width=560!
> The reason is a {{checkArgument()}} call with an unconditionally constructed 
> dynamic String:
> [https://github.com/apache/parquet-mr/blob/62b774cd0f0c60cfbe540bbfa60bee15929af5d4/parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java#L303]
> The suggested fix is to move String construction under the condition:
> {code:java}
> if (index >= bytesUsed) {
>   throw new IllegalArgumentException("Index: " + index +
>   " is >= the current size of: " + bytesUsed);
> }{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] jerolba commented on a diff in pull request #1035: PARQUET-2202: Review usage and implementation of Preconditions.checkargument method

2023-03-08 Thread via GitHub


jerolba commented on code in PR #1035:
URL: https://github.com/apache/parquet-mr/pull/1035#discussion_r1129277490


##
parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java:
##
@@ -477,15 +477,15 @@ public Builder withMaxBloomFilterBytes(int maxBloomFilterBytes) {
      * @return this builder for method chaining
      */
     public Builder withBloomFilterNDV(String columnPath, long ndv) {
-      Preconditions.checkArgument(ndv > 0, "Invalid NDV for column \"%s\": %d", columnPath, ndv);
+      Preconditions.checkArgument(ndv > 0, "Invalid NDV for column \"%s\": %s", columnPath, ndv);

Review Comment:
   Same logic as the previous comment: it fails when the check fails, since that is when the message gets formatted.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2202) Redundant String allocation on the hot path in CapacityByteArrayOutputStream.setByte

2023-03-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697840#comment-17697840
 ] 

ASF GitHub Bot commented on PARQUET-2202:
-

jerolba commented on code in PR #1035:
URL: https://github.com/apache/parquet-mr/pull/1035#discussion_r1129276294


##
parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java:
##
@@ -477,15 +477,15 @@ public Builder withMaxBloomFilterBytes(int maxBloomFilterBytes) {
      * @return this builder for method chaining
      */
     public Builder withBloomFilterNDV(String columnPath, long ndv) {
-      Preconditions.checkArgument(ndv > 0, "Invalid NDV for column \"%s\": %d", columnPath, ndv);
+      Preconditions.checkArgument(ndv > 0, "Invalid NDV for column \"%s\": %s", columnPath, ndv);
       this.bloomFilterNDVs.withValue(columnPath, ndv);
       // Setting an NDV for a column implies writing a bloom filter
       this.bloomFilterEnabled.withValue(columnPath, true);
       return this;
     }
 
     public Builder withBloomFilterFPP(String columnPath, double fpp) {
-      Preconditions.checkArgument(fpp > 0.0 && fpp < 1.0, "Invalid FPP for column \"%s\": %d", columnPath, fpp);
+      Preconditions.checkArgument(fpp > 0.0 && fpp < 1.0, "Invalid FPP for column \"%s\": %s", columnPath, fpp);

Review Comment:
   The changed code was incorrect: when the check failed, it threw a `java.util.IllegalFormatConversionException` instead of the intended `IllegalArgumentException`.
   
   The existing [code behind the checkArgument method](https://github.com/jerolba/parquet-mr/blob/5078cc629074046b5699544c33256ca980ab2761/parquet-common/src/main/java/org/apache/parquet/Preconditions.java#L243) converts all parameters to String before applying String.format.
   
   If String.format encounters a value of the wrong type, it throws `IllegalFormatConversionException`. For example, try this:
   ```java
   String output = String.format("Integer value: %d", "10");
   ```
   
   So the existing Preconditions implementation forces message templates to use %s.
   
   I reviewed all calls to `checkArgument` to fix the incorrect format specifiers.
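
For context, the fix suggested in the Jira description amounts to building the
error message only on the failure path. A minimal sketch of that pattern
(illustrative only, not the actual parquet-mr Preconditions code):

```java
// Sketch of "construct the message only on failure"; the fixed-arity
// overload also avoids a varargs array allocation on the hot path.
final class LazyChecks {
  private LazyChecks() {}

  static void checkArgument(boolean valid, String template, Object arg1, Object arg2) {
    if (!valid) {
      // The message String is built only when the check fails;
      // a passing check performs no String allocation at all.
      throw new IllegalArgumentException(String.format(template, arg1, arg2));
    }
  }
}
```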





> Redundant String allocation on the hot path in 
> CapacityByteArrayOutputStream.setByte
> 
>
> Key: PARQUET-2202
> URL: https://issues.apache.org/jira/browse/PARQUET-2202
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.3
>Reporter: Andrei Pangin
>Priority: Major
>  Labels: performance
> Attachments: profile-alloc.png, profile-cpu.png
>
>
> Profiling of a Spark application revealed a performance issue in production:
> {{CapacityByteArrayOutputStream.setByte}} consumed 2.2% of total CPU time and 
> made up 4.6% of total allocations. However, in the normal case, this method 
> should allocate nothing at all.
> Here is an excerpt from the async-profiler report.
> CPU profile:
> !profile-cpu.png|width=560!
> Allocation profile:
> !profile-alloc.png|width=560!
> The reason is a {{checkArgument()}} call with an unconditionally constructed 
> dynamic String:
> [https://github.com/apache/parquet-mr/blob/62b774cd0f0c60cfbe540bbfa60bee15929af5d4/parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java#L303]
> The suggested fix is to move String construction under the condition:
> {code:java}
> if (index >= bytesUsed) {
>   throw new IllegalArgumentException("Index: " + index +
>   " is >= the current size of: " + bytesUsed);
> }{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] jerolba commented on a diff in pull request #1035: PARQUET-2202: Review usage and implementation of Preconditions.checkargument method

2023-03-08 Thread via GitHub


jerolba commented on code in PR #1035:
URL: https://github.com/apache/parquet-mr/pull/1035#discussion_r1129276294


##
parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java:
##
@@ -477,15 +477,15 @@ public Builder withMaxBloomFilterBytes(int maxBloomFilterBytes) {
      * @return this builder for method chaining
      */
     public Builder withBloomFilterNDV(String columnPath, long ndv) {
-      Preconditions.checkArgument(ndv > 0, "Invalid NDV for column \"%s\": %d", columnPath, ndv);
+      Preconditions.checkArgument(ndv > 0, "Invalid NDV for column \"%s\": %s", columnPath, ndv);
       this.bloomFilterNDVs.withValue(columnPath, ndv);
       // Setting an NDV for a column implies writing a bloom filter
       this.bloomFilterEnabled.withValue(columnPath, true);
       return this;
     }
 
     public Builder withBloomFilterFPP(String columnPath, double fpp) {
-      Preconditions.checkArgument(fpp > 0.0 && fpp < 1.0, "Invalid FPP for column \"%s\": %d", columnPath, fpp);
+      Preconditions.checkArgument(fpp > 0.0 && fpp < 1.0, "Invalid FPP for column \"%s\": %s", columnPath, fpp);

Review Comment:
   The changed code was incorrect: when the check failed, it threw a `java.util.IllegalFormatConversionException` instead of the intended `IllegalArgumentException`.
   
   The existing [code behind the checkArgument method](https://github.com/jerolba/parquet-mr/blob/5078cc629074046b5699544c33256ca980ab2761/parquet-common/src/main/java/org/apache/parquet/Preconditions.java#L243) converts all parameters to String before applying String.format.
   
   If String.format encounters a value of the wrong type, it throws `IllegalFormatConversionException`. For example, try this:
   ```java
   String output = String.format("Integer value: %d", "10");
   ```
   
   So the existing Preconditions implementation forces message templates to use %s.
   
   I reviewed all calls to `checkArgument` to fix the incorrect format specifiers.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org