[GitHub] [parquet-format] gszadovszky commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-05 Thread via GitHub
gszadovszky commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1316059558 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,74 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-02 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1313882045 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,74 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] dependabot[bot] commented on pull request #205: Bump libthrift from 0.16.0 to 0.18.1

2023-09-03 Thread via GitHub
dependabot[bot] commented on PR #205: URL: https://github.com/apache/parquet-format/pull/205#issuecomment-1704312502 Superseded by #213. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

[GitHub] [parquet-format] dependabot[bot] closed pull request #205: Bump libthrift from 0.16.0 to 0.18.1

2023-09-03 Thread via GitHub
dependabot[bot] closed pull request #205: Bump libthrift from 0.16.0 to 0.18.1 URL: https://github.com/apache/parquet-format/pull/205

[GitHub] [parquet-format] dependabot[bot] opened a new pull request, #213: Bump org.apache.thrift:libthrift from 0.16.0 to 0.19.0

2023-09-03 Thread via GitHub
dependabot[bot] opened a new pull request, #213: URL: https://github.com/apache/parquet-format/pull/213 Bumps [org.apache.thrift:libthrift](https://github.com/apache/thrift) from 0.16.0 to 0.19.0. Release notes Sourced from

[GitHub] [parquet-format] etseidl commented on pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-05 Thread via GitHub
etseidl commented on PR #197: URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1707166521 > I think we can now move to the simpler option of just putting SizeStatistics on Column Index to consolidate everything? I would guess this would also make implementations simpler.

[GitHub] [parquet-format] emkornfield commented on pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-05 Thread via GitHub
emkornfield commented on PR #197: URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1707150050 Based on https://github.com/apache/parquet-format/pull/197#discussion_r1316059558 I think we can now move to the simpler option of just putting SizeStatistics on Column Index

[GitHub] [parquet-format] pitrou commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-03 Thread via GitHub
pitrou commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1314311470 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,74 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] Fokko opened a new pull request, #214: Bump to Thrift 0.19.0

2023-09-03 Thread via GitHub
Fokko opened a new pull request, #214: URL: https://github.com/apache/parquet-format/pull/214 Make sure you have checked _all_ steps below. This should support Java 8 again ### Jira - [ ] My PR addresses the following [Parquet

[GitHub] [parquet-mr] wgtmac opened a new pull request, #1137: PARQUET-2343: Fixes NPE when rewriting file with multiple rowgroups

2023-09-03 Thread via GitHub
wgtmac opened a new pull request, #1137: URL: https://github.com/apache/parquet-mr/pull/1137 Make sure you have checked _all_ steps below. ### Jira - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in

[GitHub] [parquet-format] dependabot[bot] closed pull request #213: Bump org.apache.thrift:libthrift from 0.16.0 to 0.19.0

2023-09-03 Thread via GitHub
dependabot[bot] closed pull request #213: Bump org.apache.thrift:libthrift from 0.16.0 to 0.19.0 URL: https://github.com/apache/parquet-format/pull/213

[GitHub] [parquet-format] Fokko merged pull request #209: Bump junit from 4.13.1 to 4.13.2

2023-09-03 Thread via GitHub
Fokko merged PR #209: URL: https://github.com/apache/parquet-format/pull/209

[GitHub] [parquet-format] Fokko commented on a diff in pull request #203: PARQUET-2313: Bump actions/setup-java from 1 to 3

2023-09-03 Thread via GitHub
Fokko commented on code in PR #203: URL: https://github.com/apache/parquet-format/pull/203#discussion_r1314324512 ## .github/workflows/test.yml: ## @@ -26,15 +26,16 @@ jobs: strategy: fail-fast: false matrix: -java: [ '1.8', '11' ] +java: [

[GitHub] [parquet-format] wgtmac commented on pull request #213: Bump org.apache.thrift:libthrift from 0.16.0 to 0.19.0

2023-09-03 Thread via GitHub
wgtmac commented on PR #213: URL: https://github.com/apache/parquet-format/pull/213#issuecomment-1704567581 @dependabot close

[GitHub] [parquet-mr] wgtmac merged pull request #1136: PARQUET-2343: Fixes NPE when rewriting file with multiple rowgroups

2023-09-03 Thread via GitHub
wgtmac merged PR #1136: URL: https://github.com/apache/parquet-mr/pull/1136

[GitHub] [parquet-format] rdblue commented on pull request #214: PARQUET-2344: Bump to Thrift 0.19.0

2023-09-04 Thread via GitHub
rdblue commented on PR #214: URL: https://github.com/apache/parquet-format/pull/214#issuecomment-1705491009 Thanks, @Fokko! Good to be unblocked for thrift updates.

[GitHub] [parquet-format] rdblue merged pull request #214: PARQUET-2344: Bump to Thrift 0.19.0

2023-09-04 Thread via GitHub
rdblue merged PR #214: URL: https://github.com/apache/parquet-format/pull/214

[GitHub] [parquet-mr] wgtmac merged pull request #1137: PARQUET-2343: Fixes NPE when rewriting file with multiple rowgroups

2023-09-04 Thread via GitHub
wgtmac merged PR #1137: URL: https://github.com/apache/parquet-mr/pull/1137

[GitHub] [parquet-format] etseidl commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-06 Thread via GitHub
etseidl commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1317881962 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1073,15 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-06 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318059860 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1073,15 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-06 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1317866553 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1073,15 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] wgtmac commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-06 Thread via GitHub
wgtmac commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1317993757 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1073,15 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] dependabot[bot] opened a new pull request, #215: Bump org.slf4j:slf4j-api from 1.7.12 to 2.0.9

2023-09-10 Thread via GitHub
dependabot[bot] opened a new pull request, #215: URL: https://github.com/apache/parquet-format/pull/215 Bumps org.slf4j:slf4j-api from 1.7.12 to 2.0.9. [![Dependabot compatibility

[GitHub] [parquet-format] dependabot[bot] commented on pull request #204: Bump slf4j-api from 1.7.12 to 2.0.7

2023-09-10 Thread via GitHub
dependabot[bot] commented on PR #204: URL: https://github.com/apache/parquet-format/pull/204#issuecomment-1712817820 Superseded by #215.

[GitHub] [parquet-format] dependabot[bot] closed pull request #204: Bump slf4j-api from 1.7.12 to 2.0.7

2023-09-10 Thread via GitHub
dependabot[bot] closed pull request #204: Bump slf4j-api from 1.7.12 to 2.0.7 URL: https://github.com/apache/parquet-format/pull/204

[GitHub] [parquet-format] JFinis commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
JFinis commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318275562 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1073,15 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] tustvold commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
tustvold commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318314149 ## src/main/thrift/parquet.thrift: ## @@ -529,7 +596,15 @@ struct DataPageHeader { /** Encoding used for repetition levels **/ 4: required Encoding

[GitHub] [parquet-format] pitrou commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
pitrou commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318393474 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1073,15 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
mapleFU commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318576796 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] JFinis commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
JFinis commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318260892 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1073,15 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] JFinis commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
JFinis commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318273187 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1073,15 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
mapleFU commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318347248 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] pitrou commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
pitrou commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318413084 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] etseidl commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
etseidl commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318190332 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1073,15 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] pitrou commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
pitrou commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318334927 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] tustvold commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
tustvold commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318362226 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] tustvold commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
tustvold commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318386752 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
mapleFU commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318413890 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] JFinis commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
JFinis commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318367540 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1073,15 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] pitrou commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
pitrou commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318384010 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] etseidl commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
etseidl commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318570484 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] JFinis commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
JFinis commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318292827 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1073,15 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] etseidl commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
etseidl commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318565796 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] wgtmac commented on pull request #215: PARQUET-2346: Bump org.slf4j:slf4j-api from 1.7.12 to 2.0.9

2023-09-13 Thread via GitHub
wgtmac commented on PR #215: URL: https://github.com/apache/parquet-format/pull/215#issuecomment-1718760542 @dependabot ignore this dependency

[GitHub] [parquet-format] dependabot[bot] closed pull request #215: PARQUET-2346: Bump org.slf4j:slf4j-api from 1.7.12 to 2.0.9

2023-09-13 Thread via GitHub
dependabot[bot] closed pull request #215: PARQUET-2346: Bump org.slf4j:slf4j-api from 1.7.12 to 2.0.9 URL: https://github.com/apache/parquet-format/pull/215

[GitHub] [parquet-format] dependabot[bot] commented on pull request #215: PARQUET-2346: Bump org.slf4j:slf4j-api from 1.7.12 to 2.0.9

2023-09-13 Thread via GitHub
dependabot[bot] commented on PR #215: URL: https://github.com/apache/parquet-format/pull/215#issuecomment-1718760575 OK, I won't notify you about org.slf4j:slf4j-api again, unless you re-open this PR.

[GitHub] [parquet-mr] wwang-talend opened a new pull request, #1140: allow read old parquet file which is maked by old api with old avro version which allow wrong default value in schema

2023-09-14 Thread via GitHub
wwang-talend opened a new pull request, #1140: URL: https://github.com/apache/parquet-mr/pull/1140 Make sure you have checked _all_ steps below. ### Jira - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references

[GitHub] [parquet-mr] amousavigourabi opened a new pull request, #1141: PARQUET-2347: Add interface layer between Parquet and Hadoop Configuration

2023-09-16 Thread via GitHub
amousavigourabi opened a new pull request, #1141: URL: https://github.com/apache/parquet-mr/pull/1141 Make sure you have checked _all_ steps below. ### Jira - [x] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references

[GitHub] [parquet-mr] zhangjiashen opened a new pull request, #1142: [Parquet-1647] Add logical type FLOAT16

2023-09-17 Thread via GitHub
zhangjiashen opened a new pull request, #1142: URL: https://github.com/apache/parquet-mr/pull/1142 ### Jira - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My

[GitHub] [parquet-mr] shangxinli commented on pull request #1139: PARQUET-2171: Support Hadoop vectored IO

2023-09-17 Thread via GitHub
shangxinli commented on PR #1139: URL: https://github.com/apache/parquet-mr/pull/1139#issuecomment-1722528243 @steveloughran Thanks a lot for creating this PR! This is an important feature that improves the read performance of Parquet. I just took a brief look and it looks great! I

[GitHub] [parquet-mr] majdyz opened a new pull request, #1135: PARQUET-2342: Fix writing corrupted parquet file by avoiding overflow on page value count

2023-08-24 Thread via GitHub
majdyz opened a new pull request, #1135: URL: https://github.com/apache/parquet-mr/pull/1135 Make sure you have checked _all_ steps below. ### Jira - [x] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET-2342) issues and references

[GitHub] [parquet-format] etseidl commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-08-31 Thread via GitHub
etseidl commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1312152587 ## src/main/thrift/parquet.thrift: ## @@ -764,6 +845,14 @@ struct ColumnMetaData { * in a single I/O. */ 15: optional i32 bloom_filter_length; + + /**

[GitHub] [parquet-format] etseidl commented on pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-08-31 Thread via GitHub
etseidl commented on PR #197: URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1701548513 I've implemented option 2 now. As expected, the size impact is somewhat less due to less nesting in the thrift output. Here are some comparison numbers (apologies, it seems my

[GitHub] [parquet-format] etseidl commented on pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-08-31 Thread via GitHub
etseidl commented on PR #197: URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1701566823 I forgot to mention that for option 2 I added `unencoded_variable_width_stored_bytes` to the `PageLocation` struct. Now I think I'm leaning towards option 2. For some of my

[GitHub] [parquet-format] wgtmac commented on pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-01 Thread via GitHub
wgtmac commented on PR #197: URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1702498342 Thanks for the quick PoC! It seems that option 2 is the best at the moment. But option 1 has more flexibility if we intend to add more fields to SizeStatistics.

[GitHub] [parquet-format] emkornfield commented on pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-01 Thread via GitHub
emkornfield commented on PR #197: URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1703090707 > As an implementation detail, can we ignore the rep-def histogram when max-rep <= 1, max-def <= 1? Since we already have page-ordinal in OffsetIndex and null-count in

[GitHub] [parquet-format] pitrou commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-01 Thread via GitHub
pitrou commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1313375622 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,74 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-01 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1313391303 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,74 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-01 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1313391714 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,74 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] pitrou commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-01 Thread via GitHub
pitrou commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1313376199 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,74 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] emkornfield commented on pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-01 Thread via GitHub
emkornfield commented on PR #197: URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1703098968 OK, pushed updates. @etseidl @mapleFU @wgtmac @pitrou @gszadovszky hopefully we can say this is a good version to prototype implementation on?

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-01 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1313329517 ## src/main/thrift/parquet.thrift: ## @@ -764,6 +845,14 @@ struct ColumnMetaData { * in a single I/O. */ 15: optional i32 bloom_filter_length; + +

[GitHub] [parquet-format] etseidl commented on pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-01 Thread via GitHub
etseidl commented on PR #197: URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1703115793 > hopefully we can say this is a good version to prototype implementation on? Looks good to me. I'll get started now.

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-01 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1313392696 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,74 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] mapleFU commented on pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-01 Thread via GitHub
mapleFU commented on PR #197: URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1703702682 Also cc @tustvoid as arrow-rs parquet maintainer

[GitHub] [parquet-format] etseidl commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-01 Thread via GitHub
etseidl commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1313547575 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,74 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-01 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1313652871 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,74 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] etseidl commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-01 Thread via GitHub
etseidl commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1313547575 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,74 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] JFinis commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-12 Thread via GitHub
JFinis commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1323069846 ## src/main/thrift/parquet.thrift: ## @@ -764,6 +810,14 @@ struct ColumnMetaData { * in a single I/O. */ 15: optional i32 bloom_filter_length; + + /**

[GitHub] [parquet-format] JFinis commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-12 Thread via GitHub
JFinis commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1323059489 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1038,25 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] JFinis commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-12 Thread via GitHub
JFinis commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1323028211 ## src/main/thrift/parquet.thrift: ## @@ -764,6 +810,14 @@ struct ColumnMetaData { * in a single I/O. */ 15: optional i32 bloom_filter_length; + + /**

[GitHub] [parquet-format] etseidl commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-12 Thread via GitHub
etseidl commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1323524900 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1038,25 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-12 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1323505565 ## src/main/thrift/parquet.thrift: ## @@ -764,6 +810,14 @@ struct ColumnMetaData { * in a single I/O. */ 15: optional i32 bloom_filter_length; + +

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-12 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1323506069 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1038,25 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-mr] steveloughran opened a new pull request, #1139: PARQUET-2171: Support Hadoop vectored IO

2023-09-13 Thread via GitHub
steveloughran opened a new pull request, #1139: URL: https://github.com/apache/parquet-mr/pull/1139 Make sure you have checked _all_ steps below. ### Jira - [X] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references

[GitHub] [parquet-format] wgtmac commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-13 Thread via GitHub
wgtmac commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1324650964 ## src/main/thrift/parquet.thrift: ## @@ -764,6 +810,14 @@ struct ColumnMetaData { * in a single I/O. */ 15: optional i32 bloom_filter_length; + + /**

[GitHub] [parquet-format] JFinis commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
JFinis commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318737835 ## src/main/thrift/parquet.thrift: ## @@ -529,7 +596,15 @@ struct DataPageHeader { /** Encoding used for repetition levels **/ 4: required Encoding

[GitHub] [parquet-format] etseidl commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
etseidl commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318770227 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1073,15 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] wgtmac commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
wgtmac commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318848187 ## src/main/thrift/parquet.thrift: ## @@ -529,7 +596,15 @@ struct DataPageHeader { /** Encoding used for repetition levels **/ 4: required Encoding

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-08 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1320231941 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1038,25 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-08 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1320143220 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1038,25 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] etseidl commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-08 Thread via GitHub
etseidl commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1320256768 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1038,25 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] etseidl commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-08 Thread via GitHub
etseidl commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1320192142 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1038,25 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1319135598 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318930925 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1319130288 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1073,15 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1319130534 ## src/main/thrift/parquet.thrift: ## @@ -583,7 +659,12 @@ struct DataPageHeaderV2 { If missing it is considered compressed */ 7: optional bool

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1319135598 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] etseidl commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
etseidl commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1319212210 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1038,25 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1319219710 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1038,25 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1319135598 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] wgtmac commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
wgtmac commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1319252134 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1038,25 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318930925 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] tustvold commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
tustvold commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318957040 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] pitrou commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
pitrou commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1319010418 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition
