[jira] [Commented] (PARQUET-2122) Adding Bloom filter to small Parquet file bloats in size X1700
[ https://issues.apache.org/jira/browse/PARQUET-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534123#comment-17534123 ]

Micah Kornfield commented on PARQUET-2122:
------------------------------------------

I believe the answer is that the Bloom filter implementation isn't adaptive, so it simply preallocates all the bytes necessary. It would certainly be nice to have more adaptive data structures that can scale down for smaller files, but it is probably a decent amount of work to build consensus around this.

> Adding Bloom filter to small Parquet file bloats in size X1700
> --------------------------------------------------------------
>
>                 Key: PARQUET-2122
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2122
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cli, parquet-mr
>    Affects Versions: 1.13.0
>            Reporter: Ze'ev Maor
>            Priority: Critical
>         Attachments: data.csv, data_index_bloom.parquet
>
>
> Converting a small, 14 rows/1 string column csv file to Parquet without bloom
> filter yields a 600B file, adding '.withBloomFilterEnabled(true)' to
> ParquetWriter then yields a 1049197B file.
> It isn't clear what the extra space is used by.
> Attached csv and bloated Parquet files.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
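For a sense of scale regarding the preallocation described in the comment above, the sketch below applies the textbook sizing formula for a classic Bloom filter, m = ceil(-n ln p / (ln 2)^2), to the 14 values from the issue. This is an illustration only: the formula is the standard one for classic Bloom filters and is not claimed to be what parquet-mr uses internally, the 1% false-positive rate is an assumed target, and the class and method names are hypothetical.

```java
// Illustrative only: textbook Bloom filter sizing, not parquet-mr's implementation.
final class BloomSizing {

    /** Textbook optimal bit count for n distinct values at false-positive rate p. */
    static long classicBloomBits(long n, double p) {
        if (n <= 0 || p <= 0.0 || p >= 1.0) {
            throw new IllegalArgumentException("need n > 0 and 0 < p < 1");
        }
        double ln2 = Math.log(2.0);
        // m = ceil(-n * ln(p) / (ln 2)^2)
        return (long) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }

    /** Same quantity rounded up to whole bytes. */
    static long classicBloomBytes(long n, double p) {
        return (classicBloomBits(n, p) + 7) / 8;
    }

    public static void main(String[] args) {
        long n = 14;      // distinct values, from the issue description
        double p = 0.01;  // assumed target false-positive rate
        System.out.println("bits:  " + classicBloomBits(n, p));
        System.out.println("bytes: " + classicBloomBytes(n, p));
    }
}
```

Under these assumptions the filter needs a few tens of bytes, which is why a filter of roughly 1 MB for 14 rows stands out.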
[jira] [Commented] (PARQUET-2122) Adding Bloom filter to small Parquet file bloats in size X1700
[ https://issues.apache.org/jira/browse/PARQUET-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492593#comment-17492593 ]

Ze'ev Maor commented on PARQUET-2122:
-------------------------------------

[~junjie] thanks, that worked. It does seem odd, though, that a MAX Bloom filter size of 1MB actually results in 1MB being used by a Bloom filter on a column with a cardinality of just 14, doesn't it?
[jira] [Commented] (PARQUET-2122) Adding Bloom filter to small Parquet file bloats in size X1700
[ https://issues.apache.org/jira/browse/PARQUET-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492418#comment-17492418 ]

Junjie Chen commented on PARQUET-2122:
--------------------------------------

That's the default size of the bloom filter. Please configure parquet.bloom.filter.max.bytes to fit.
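The numbers in the issue can be checked against the explanation above, and the suggested property can be sketched as a key/value setting. This is a minimal sketch under stated assumptions: the "1MB" default is taken to mean 1 MiB (1024 * 1024 bytes), `java.util.Properties` stands in for whatever configuration object the writer actually reads so that the example runs without dependencies, the cap of 1024 bytes is an arbitrary example value, and the class and helper names are hypothetical.

```java
import java.util.Properties;

// Illustrative only: arithmetic on the sizes reported in the issue, plus a
// dependency-free stand-in for setting the property named in the thread.
final class BloomFilterCapSketch {

    // File sizes reported in the issue description.
    static final long PLAIN_FILE_BYTES = 600L;
    static final long BLOOM_FILE_BYTES = 1_049_197L;

    // Assumption: the "1MB" default mentioned in the thread means 1 MiB.
    static final long ASSUMED_DEFAULT_MAX_BYTES = 1024L * 1024L;

    /** Bytes added to the file by enabling the Bloom filter. */
    static long addedBytes() {
        return BLOOM_FILE_BYTES - PLAIN_FILE_BYTES;
    }

    /** Hypothetical helper: a key/value config carrying a smaller cap. */
    static Properties withCap(int maxBytes) {
        if (maxBytes <= 0) {
            throw new IllegalArgumentException("maxBytes must be positive");
        }
        Properties props = new Properties();
        // Property name taken from the comment above.
        props.setProperty("parquet.bloom.filter.max.bytes", Integer.toString(maxBytes));
        return props;
    }

    public static void main(String[] args) {
        long added = addedBytes();
        System.out.println("added bytes: " + added);  // 1048597
        System.out.println("beyond assumed 1 MiB default: "
                + (added - ASSUMED_DEFAULT_MAX_BYTES)); // 21
        // 1024 is an arbitrary example cap, not a recommendation.
        System.out.println(withCap(1024).getProperty("parquet.bloom.filter.max.bytes"));
    }
}
```

If the assumption about the default holds, nearly all of the growth is one maximum-size filter, which matches the comment that the bloat is simply the default size.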
[jira] [Commented] (PARQUET-2122) Adding Bloom filter to small Parquet file bloats in size X1700
[ https://issues.apache.org/jira/browse/PARQUET-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492099#comment-17492099 ]

Xinli Shang commented on PARQUET-2122:
--------------------------------------

[~junjie] Do you know why?