wgtmac commented on code in PR #1024:
URL: https://github.com/apache/parquet-mr/pull/1024#discussion_r1102185588
##
parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java:
##
@@ -95,6 +98,8 @@ private ParquetProperties(WriterVersion writerVersion, int pageSize, int dictPag
this.enableDictionary = enableDict;
this.minRowCountForPageSizeCheck = minRowCountForPageSizeCheck;
this.maxRowCountForPageSizeCheck = maxRowCountForPageSizeCheck;
+this.minRowCountForBlockSizeCheck =minRowCountForBlockSizeCheck;
+this.maxRowCountForBlockSizeCheck =maxRowCountForBlockSizeCheck;
Review Comment:
```suggestion
this.maxRowCountForBlockSizeCheck = maxRowCountForBlockSizeCheck;
```
##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java:
##
@@ -94,6 +95,10 @@ public InternalParquetRecordWriter(
this.compressor = compressor;
this.validating = validating;
this.props = props;
+this.minRowCountForBlockSizeCheck = props.getMinRowCountForBlockSizeCheck();
Review Comment:
Do not duplicate `minRowCountForBlockSizeCheck` and
`maxRowCountForBlockSizeCheck` as class member variables because they are
immutable and can be accessed via `this.props`.
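
To illustrate the reviewer's point, here is a minimal sketch (simplified stand-in classes, not the actual PR code): because the two limits are immutable after construction, the writer can read them through its `ParquetProperties` reference instead of duplicating them as fields.

```java
// Simplified stand-in for ParquetProperties: the limits are final,
// so re-caching them in the writer adds no value.
class Props {
  private final long minRowCountForBlockSizeCheck;
  private final long maxRowCountForBlockSizeCheck;
  Props(long min, long max) {
    this.minRowCountForBlockSizeCheck = min;
    this.maxRowCountForBlockSizeCheck = max;
  }
  long getMinRowCountForBlockSizeCheck() { return minRowCountForBlockSizeCheck; }
  long getMaxRowCountForBlockSizeCheck() { return maxRowCountForBlockSizeCheck; }
}

// Simplified stand-in for InternalParquetRecordWriter: it keeps only the
// props reference (single source of truth) and no duplicated fields.
class WriterSketch {
  private final Props props;
  WriterSketch(Props props) { this.props = props; }
  long minCheck() { return props.getMinRowCountForBlockSizeCheck(); }
  long maxCheck() { return props.getMaxRowCountForBlockSizeCheck(); }
}
```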
##
parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java:
##
@@ -95,6 +98,8 @@ private ParquetProperties(WriterVersion writerVersion, int pageSize, int dictPag
this.enableDictionary = enableDict;
this.minRowCountForPageSizeCheck = minRowCountForPageSizeCheck;
this.maxRowCountForPageSizeCheck = maxRowCountForPageSizeCheck;
+this.minRowCountForBlockSizeCheck =minRowCountForBlockSizeCheck;
Review Comment:
```suggestion
this.minRowCountForBlockSizeCheck = minRowCountForBlockSizeCheck;
```
##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java:
##
@@ -147,12 +152,12 @@ private void checkBlockSizeReached() throws IOException {
LOG.info("mem size {} > {}: flushing {} records to disk.", memSize, nextRowGroupSize, recordCount);
flushRowGroupToStore();
initStore();
-recordCountForNextMemCheck = min(max(MINIMUM_RECORD_COUNT_FOR_CHECK, recordCount / 2), MAXIMUM_RECORD_COUNT_FOR_CHECK);
+recordCountForNextMemCheck = min(max(minRowCountForBlockSizeCheck, recordCount / 2), maxRowCountForBlockSizeCheck);
Review Comment:
The comment in line 146 says `checking the memory size is relatively
expensive`. Does this change the default behavior or introduce a regression in
writer time or file size for common cases? For example, does it now check the
memory size more frequently than before?
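
The check schedule can be sketched as below. The constants 100 and 10000 are assumed here to match the historical `MINIMUM_RECORD_COUNT_FOR_CHECK` and `MAXIMUM_RECORD_COUNT_FOR_CHECK` values that the PR makes configurable; with those defaults the schedule is unchanged, while a lower configured max would make the writer check more often.

```java
// Sketch (not PR code) of how the writer picks the next record count at
// which to measure memory: half the count at the last flush, clamped to
// the [min, max] row-count window. Defaults of 100/10000 are assumed to
// match the previous hard-coded constants.
class CheckSchedule {
  static long nextCheck(long recordCountAtLastCheck, long min, long max) {
    return Math.min(Math.max(min, recordCountAtLastCheck / 2), max);
  }
}
```

With the assumed defaults, a writer that flushed at 1,000,000 records still waits 10,000 records before the next (expensive) memory check, so the default cadence should be identical to the old behavior.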
##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java:
##
@@ -142,6 +142,8 @@ public static enum JobSummaryLevel {
public static final String MAX_PADDING_BYTES = "parquet.writer.max-padding";
public static final String MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.min";
public static final String MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.max";
+ public static final String MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK = "parquet.block.size.row.check.min";
Review Comment:
Please add them to README.md in the parquet-hadoop directory.
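
For reference, a hedged sketch of how the new keys might be set by users (key strings are quoted from the PR diff; the default values of 100/10000 are assumed to match the old hard-coded constants, and a `java.util.Properties` map stands in for a Hadoop `Configuration` to keep the example self-contained):

```java
import java.util.Properties;

// Hypothetical usage sketch for the new block-size-check keys,
// alongside-the-page-size-check style. Not actual parquet-mr code.
class BlockSizeCheckConfig {
  static final String MIN_KEY = "parquet.block.size.row.check.min";
  static final String MAX_KEY = "parquet.block.size.row.check.max";

  static Properties defaults() {
    Properties conf = new Properties();
    conf.setProperty(MIN_KEY, "100");   // assumed default (old minimum constant)
    conf.setProperty(MAX_KEY, "10000"); // assumed default (old maximum constant)
    return conf;
  }
}
```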
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org