wgtmac commented on code in PR #1024:
URL: https://github.com/apache/parquet-mr/pull/1024#discussion_r1102185588
##
parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java:
##
@@ -95,6 +98,8 @@ private ParquetProperties(WriterVersion writerVersion, int pageSize, int dictPag
this.enableDictionary = enableDict;
this.minRowCountForPageSizeCheck = minRowCountForPageSizeCheck;
this.maxRowCountForPageSizeCheck = maxRowCountForPageSizeCheck;
+this.minRowCountForBlockSizeCheck =minRowCountForBlockSizeCheck;
+this.maxRowCountForBlockSizeCheck =maxRowCountForBlockSizeCheck;
Review Comment:
```suggestion
this.maxRowCountForBlockSizeCheck = maxRowCountForBlockSizeCheck;
```
##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java:
##
@@ -94,6 +95,10 @@ public InternalParquetRecordWriter(
this.compressor = compressor;
this.validating = validating;
this.props = props;
+this.minRowCountForBlockSizeCheck = props.getMinRowCountForBlockSizeCheck();
Review Comment:
Do not duplicate `minRowCountForBlockSizeCheck` and
`maxRowCountForBlockSizeCheck` as class member variables because they are
immutable and can be accessed via `this.props`.
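
To illustrate the reviewer's point, here is a minimal sketch (simplified stand-in classes, not the actual PR code): because the two limits are immutable after construction, the writer can read them through its `ParquetProperties` reference instead of duplicating them as fields.

```java
// Simplified stand-in for ParquetProperties: the limits are final,
// so re-caching them in the writer adds no value.
class Props {
  private final long minRowCountForBlockSizeCheck;
  private final long maxRowCountForBlockSizeCheck;
  Props(long min, long max) {
    this.minRowCountForBlockSizeCheck = min;
    this.maxRowCountForBlockSizeCheck = max;
  }
  long getMinRowCountForBlockSizeCheck() { return minRowCountForBlockSizeCheck; }
  long getMaxRowCountForBlockSizeCheck() { return maxRowCountForBlockSizeCheck; }
}

// Simplified stand-in for InternalParquetRecordWriter: it keeps only the
// props reference (single source of truth) and no duplicated fields.
class WriterSketch {
  private final Props props;
  WriterSketch(Props props) { this.props = props; }
  long minCheck() { return props.getMinRowCountForBlockSizeCheck(); }
  long maxCheck() { return props.getMaxRowCountForBlockSizeCheck(); }
}
```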
##
parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java:
##
@@ -95,6 +98,8 @@ private ParquetProperties(WriterVersion writerVersion, int pageSize, int dictPag
this.enableDictionary = enableDict;
this.minRowCountForPageSizeCheck = minRowCountForPageSizeCheck;
this.maxRowCountForPageSizeCheck = maxRowCountForPageSizeCheck;
+this.minRowCountForBlockSizeCheck =minRowCountForBlockSizeCheck;
Review Comment:
```suggestion
this.minRowCountForBlockSizeCheck = minRowCountForBlockSizeCheck;
```
##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java:
##
@@ -147,12 +152,12 @@ private void checkBlockSizeReached() throws IOException {
LOG.info("mem size {} > {}: flushing {} records to disk.", memSize, nextRowGroupSize, recordCount);
flushRowGroupToStore();
initStore();
-recordCountForNextMemCheck = min(max(MINIMUM_RECORD_COUNT_FOR_CHECK, recordCount / 2), MAXIMUM_RECORD_COUNT_FOR_CHECK);
+recordCountForNextMemCheck = min(max(minRowCountForBlockSizeCheck, recordCount / 2), maxRowCountForBlockSizeCheck);
Review Comment:
The comment in line 146 says `checking the memory size is relatively
expensive`. Does this change the default behavior or introduce a regression in
writer time or file size for common cases? For example, does it now check the
memory size more frequently than before?
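
The check schedule can be sketched as below. The constants 100 and 10000 are assumed here to match the historical `MINIMUM_RECORD_COUNT_FOR_CHECK` and `MAXIMUM_RECORD_COUNT_FOR_CHECK` values that the PR makes configurable; with those defaults the schedule is unchanged, while a lower configured max would make the writer check more often.

```java
// Sketch (not PR code) of how the writer picks the next record count at
// which to measure memory: half the count at the last flush, clamped to
// the [min, max] row-count window. Defaults of 100/10000 are assumed to
// match the previous hard-coded constants.
class CheckSchedule {
  static long nextCheck(long recordCountAtLastCheck, long min, long max) {
    return Math.min(Math.max(min, recordCountAtLastCheck / 2), max);
  }
}
```

With the assumed defaults, a writer that flushed at 1,000,000 records still waits 10,000 records before the next (expensive) memory check, so the default cadence should be identical to the old behavior.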
##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java:
##
@@ -142,6 +142,8 @@ public static enum JobSummaryLevel {
public static final String MAX_PADDING_BYTES = "parquet.writer.max-padding";
public static final String MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.min";
public static final String MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.max";
+ public static final String MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK = "parquet.block.size.row.check.min";
Review Comment:
Please add them to README.md in the parquet-hadoop directory.
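
For reference, a hedged sketch of how the new keys might be set by users (key strings are quoted from the PR diff; the default values of 100/10000 are assumed to match the old hard-coded constants, and a `java.util.Properties` map stands in for a Hadoop `Configuration` to keep the example self-contained):

```java
import java.util.Properties;

// Hypothetical usage sketch for the new block-size-check keys,
// alongside-the-page-size-check style. Not actual parquet-mr code.
class BlockSizeCheckConfig {
  static final String MIN_KEY = "parquet.block.size.row.check.min";
  static final String MAX_KEY = "parquet.block.size.row.check.max";

  static Properties defaults() {
    Properties conf = new Properties();
    conf.setProperty(MIN_KEY, "100");   // assumed default (old minimum constant)
    conf.setProperty(MAX_KEY, "10000"); // assumed default (old maximum constant)
    return conf;
  }
}
```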
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org