[jira] [Commented] (PARQUET-2122) Adding Bloom filter to small Parquet file bloats in size X1700

2022-02-14 Thread Junjie Chen (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492418#comment-17492418
 ] 

Junjie Chen commented on PARQUET-2122:
--

That's the default size of the bloom filter. Please set 
parquet.bloom.filter.max.bytes to a value that fits your data.
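
The sizes reported in the issue (600 B without the filter, 1049197 B with it) are consistent with this explanation. The short check below is only a sketch: it assumes the default value of parquet.bloom.filter.max.bytes is 1 MiB (1,048,576 bytes), which is not stated in this thread, and shows that almost all of the growth would then be a single filter-sized allocation.

```python
# Sizes reported in PARQUET-2122.
size_without_bloom = 600        # bytes, written without a Bloom filter
size_with_bloom = 1_049_197     # bytes, written with withBloomFilterEnabled(true)

# Assumption (not from the thread): the default maximum filter size is 1 MiB.
assumed_default_max_bytes = 1024 * 1024

growth = size_with_bloom - size_without_bloom
overhead_beyond_filter = growth - assumed_default_max_bytes
ratio = size_with_bloom / size_without_bloom

print(f"growth: {growth} bytes")
print(f"beyond the assumed 1 MiB filter: {overhead_beyond_filter} bytes")
print(f"size ratio: about {ratio:.0f}x")
```

Under that assumption the file grew by 1048597 bytes, only 21 bytes more than the filter itself, so lowering the configured maximum should shrink the output accordingly.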

> Adding Bloom filter to small Parquet file bloats in size X1700
> --
>
> Key: PARQUET-2122
> URL: https://issues.apache.org/jira/browse/PARQUET-2122
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cli, parquet-mr
>Affects Versions: 1.13.0
>Reporter: Ze'ev Maor
>Priority: Critical
> Attachments: data.csv, data_index_bloom.parquet
>
>
> Converting a small CSV file (14 rows, 1 string column) to Parquet without a 
> Bloom filter yields a 600 B file; adding '.withBloomFilterEnabled(true)' to 
> ParquetWriter then yields a 1049197 B file.
> It isn't clear what the extra space is used for.
> The CSV and bloated Parquet files are attached.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (PARQUET-2125) ParquetFileReader has a currentBlock information in a private field

2022-02-14 Thread Tanuja Dubey (Jira)
Tanuja Dubey created PARQUET-2125:
-

 Summary: ParquetFileReader has a currentBlock information in a 
private field
 Key: PARQUET-2125
 URL: https://issues.apache.org/jira/browse/PARQUET-2125
 Project: Parquet
  Issue Type: Wish
  Components: parquet-avro
Affects Versions: 1.8.1
Reporter: Tanuja Dubey


The currentBlock variable holds information that is useful for knowing 
which block the current record is being read from. If this variable had a 
getter, it would be possible to skip over certain blocks.
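
A rough sketch of the idea, not the ParquetFileReader API: `BlockReader`, `current_block`, `skip_block`, and `read_except` are hypothetical names used only for illustration. It shows how exposing the current block index would let a caller decide which blocks to skip.

```python
from typing import Iterator, List, Set


class BlockReader:
    """Toy reader over row groups (blocks); hypothetical, not parquet-mr."""

    def __init__(self, blocks: List[List[str]]) -> None:
        self._blocks = blocks
        self._current_block = 0  # the private field the issue asks to expose

    @property
    def current_block(self) -> int:
        """Index of the block the next read will come from."""
        return self._current_block

    def has_next(self) -> bool:
        return self._current_block < len(self._blocks)

    def skip_block(self) -> None:
        """Advance past the current block without reading it."""
        self._current_block += 1

    def read_block(self) -> List[str]:
        """Read the current block and advance to the next one."""
        records = self._blocks[self._current_block]
        self._current_block += 1
        return records


def read_except(reader: BlockReader, unwanted: Set[int]) -> Iterator[str]:
    """Yield records from every block whose index is not in `unwanted`."""
    while reader.has_next():
        if reader.current_block in unwanted:
            reader.skip_block()
        else:
            yield from reader.read_block()
```

For example, `list(read_except(BlockReader([["a"], ["b"], ["c"]]), {1}))` reads the first and third blocks and skips the second.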





[jira] [Commented] (PARQUET-2121) Remove descriptions for the removed modules

2022-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492347#comment-17492347
 ] 

ASF GitHub Bot commented on PARQUET-2121:
-

sekikn commented on a change in pull request #947:
URL: https://github.com/apache/parquet-mr/pull/947#discussion_r806426714



##
File path: README.md
##
@@ -66,10 +66,8 @@ Parquet is a very active project, and new features are being added quickly. Here
 * Type-specific encoding
 * Hive integration (deprecated)
 * Pig integration
-* Cascading integration

Review comment:
   Thank you for the comment, @shangxinli! I updated the PR.
   Instead of removing the lines, I just added '(deprecated)' to 
parquet-cascading* and parquet-scrooge in README.md, just like parquet-hive.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Remove descriptions for the removed modules
> ---
>
> Key: PARQUET-2121
> URL: https://issues.apache.org/jira/browse/PARQUET-2121
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Kengo Seki
>Assignee: Kengo Seki
>Priority: Minor
>
> PARQUET-2020 removed some deprecated modules, but the related descriptions 
> still remain in some documents. They should be removed since their existence 
> is misleading.





[GitHub] [parquet-mr] sekikn commented on a change in pull request #947: PARQUET-2121: Remove descriptions for the removed modules

2022-02-14 Thread GitBox


sekikn commented on a change in pull request #947:
URL: https://github.com/apache/parquet-mr/pull/947#discussion_r806426714



##
File path: README.md
##
@@ -66,10 +66,8 @@ Parquet is a very active project, and new features are being added quickly. Here
 * Type-specific encoding
 * Hive integration (deprecated)
 * Pig integration
-* Cascading integration

Review comment:
   Thank you for the comment, @shangxinli! I updated the PR.
   Instead of removing the lines, I just added '(deprecated)' to 
parquet-cascading* and parquet-scrooge in README.md, just like parquet-hive.








[jira] [Updated] (PARQUET-2124) Bad DCHECK For Intermixed Dictionary Encoding

2022-02-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-2124:

Labels: pull-request-available  (was: )

> Bad DCHECK For Intermixed Dictionary Encoding
> -
>
> Key: PARQUET-2124
> URL: https://issues.apache.org/jira/browse/PARQUET-2124
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: William Butler
>Assignee: William Butler
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Parquet CPP has a DCHECK for a dictionary-encoded page coming after a 
> non-dictionary-encoded page. This is bad because the DCHECK can be triggered 
> by Parquet files that have a column that has a dictionary page, then a 
> non-dictionary-encoded page, then a page of dictionary-encoded 
> values (indices). Fuzzing found such a file. While this could be turned into 
> an exception, I don't see anything in the Parquet specification that 
> prohibits such a sequence of pages.
> This situation has been brought up on the mailing list before 
> (https://lists.apache.org/thread/3bzymmbxvmzj12km7cjz1150ndvy9bos), 
> and it seems like this is valid but nobody is doing it.
> In the PR that added this check 
> (https://github.com/apache/parquet-cpp/pull/73), it was noted that the 
> check is probably not needed.





[jira] [Created] (PARQUET-2124) Bad DCHECK For Intermixed Dictionary Encoding

2022-02-14 Thread William Butler (Jira)
William Butler created PARQUET-2124:
---

 Summary: Bad DCHECK For Intermixed Dictionary Encoding
 Key: PARQUET-2124
 URL: https://issues.apache.org/jira/browse/PARQUET-2124
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cpp
Reporter: William Butler
Assignee: William Butler


Parquet CPP has a DCHECK for a dictionary-encoded page coming after a 
non-dictionary-encoded page. This is bad because the DCHECK can be triggered by 
Parquet files that have a column that has a dictionary page, then a 
non-dictionary-encoded page, then a page of dictionary-encoded values (indices). 
Fuzzing found such a file. While this could be turned into an exception, I 
don't see anything in the Parquet specification that prohibits such a 
sequence of pages.

This situation has been brought up on the mailing list before 
(https://lists.apache.org/thread/3bzymmbxvmzj12km7cjz1150ndvy9bos), and 
it seems like this is valid but nobody is doing it.

In the PR that added this check 
(https://github.com/apache/parquet-cpp/pull/73), it was noted that the 
check is probably not needed.
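
To make the page sequence concrete, here is a minimal sketch of a column decoder that tolerates it. This is not the Parquet C++ reader: the `Page` type and `decode_column` function are hypothetical, and pages are modelled as plain Python lists rather than encoded bytes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    kind: str      # "dictionary", "plain", or "dict_indices"
    values: List   # dictionary entries, plain values, or dictionary indices


def decode_column(pages: List[Page]) -> List:
    """Decode pages in order, keeping the dictionary across plain pages."""
    dictionary: Optional[List] = None
    out: List = []
    for page in pages:
        if page.kind == "dictionary":
            dictionary = page.values
        elif page.kind == "plain":
            # A plain page does not invalidate the dictionary seen earlier.
            out.extend(page.values)
        elif page.kind == "dict_indices":
            if dictionary is None:
                # Malformed input is reported as an error, not asserted on.
                raise ValueError("dictionary-encoded page without a dictionary")
            out.extend(dictionary[i] for i in page.values)
        else:
            raise ValueError(f"unknown page kind: {page.kind}")
    return out
```

The sequence described in the issue (dictionary page, plain page, dictionary-encoded page) decodes without tripping any check, while a genuinely malformed column raises an error instead of relying on a debug-only assertion.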





[jira] [Updated] (PARQUET-2123) Invalid memory access in ScanFileContents

2022-02-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-2123:

Labels: pull-request-available  (was: )

> Invalid memory access in ScanFileContents
> -
>
> Key: PARQUET-2123
> URL: https://issues.apache.org/jira/browse/PARQUET-2123
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: William Butler
>Assignee: William Butler
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When a Parquet file has 0 columns, ScanFileContents will try to access the 
> 0th element of a size 0 vector.





[jira] [Created] (PARQUET-2123) Invalid memory access in ScanFileContents

2022-02-14 Thread William Butler (Jira)
William Butler created PARQUET-2123:
---

 Summary: Invalid memory access in ScanFileContents
 Key: PARQUET-2123
 URL: https://issues.apache.org/jira/browse/PARQUET-2123
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cpp
Reporter: William Butler
Assignee: William Butler


When a Parquet file has 0 columns, ScanFileContents will try to access the 0th 
element of a size 0 vector.
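
The failure mode is the usual unchecked index into an empty container. A minimal sketch of the guard follows, using a hypothetical `scan_file_contents` function over a list of columns rather than the real Parquet C++ signature:

```python
from typing import Sequence


def scan_file_contents(columns: Sequence[Sequence[object]]) -> int:
    """Return the number of rows scanned, or 0 for a file with no columns."""
    if not columns:
        # Without this check, columns[0] below would be an invalid access.
        return 0
    expected_rows = len(columns[0])
    for index, column in enumerate(columns):
        if len(column) != expected_rows:
            raise ValueError(
                f"column {index} has {len(column)} rows, expected {expected_rows}"
            )
    return expected_rows
```

Returning 0 for an empty schema is one possible policy; raising an error for such files would be an equally valid choice.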





[jira] [Commented] (PARQUET-2120) parquet-cli dictionary command fails on pages without dictionary encoding

2022-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492126#comment-17492126
 ] 

ASF GitHub Bot commented on PARQUET-2120:
-

shangxinli commented on pull request #946:
URL: https://github.com/apache/parquet-mr/pull/946#issuecomment-1039360003


   Thanks for working on it! Can you squash the commits?




> parquet-cli dictionary command fails on pages without dictionary encoding
> -
>
> Key: PARQUET-2120
> URL: https://issues.apache.org/jira/browse/PARQUET-2120
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cli
>Affects Versions: 1.12.2
>Reporter: Willi Raschkowski
>Priority: Minor
>
> parquet-cli's {{dictionary}} command fails with an NPE if a page does not 
> have dictionary encoding:
> {code}
> $ parquet dictionary --column col a-b-c.snappy.parquet
> Unknown error
> java.lang.NullPointerException: Cannot invoke "org.apache.parquet.column.page.DictionaryPage.getEncoding()" because "page" is null
>   at org.apache.parquet.cli.commands.ShowDictionaryCommand.run(ShowDictionaryCommand.java:78)
>   at org.apache.parquet.cli.Main.run(Main.java:155)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.parquet.cli.Main.main(Main.java:185)
> $ parquet meta a-b-c.snappy.parquet  
> ...
> Row group 0:  count: 1  46.00 B records  start: 4  total: 46 B
> 
>      type      encodings count     avg size   nulls   min / max
> col  BINARY    S   _     1         46.00 B    0       "a" / "a"
> Row group 1:  count: 200  0.34 B records  start: 50  total: 69 B
> 
>      type      encodings count     avg size   nulls   min / max
> col  BINARY    S _ R     200       0.34 B     0       "b" / "c"
> {code}
> (Note the missing {{R}} / dictionary encoding on that first page.)
> Someone familiar with Parquet might guess from the NPE that there's no 
> dictionary encoding. But for files that mix pages with and without dictionary 
> encoding (like above), the command will fail before getting to pages that 
> actually have dictionaries.
> The problem is that [this 
> line|https://github.com/apache/parquet-mr/blob/300200eb72b9f16df36d9a68cf762683234aeb08/parquet-cli/src/main/java/org/apache/parquet/cli/commands/ShowDictionaryCommand.java#L76]
>  assumes {{readDictionaryPage}} always returns a page and doesn't handle when 
> it does not, i.e. when it returns {{null}}.
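
The shape of the fix can be sketched as follows. This is an illustration of the null check only, not the ShowDictionaryCommand code: `show_dictionaries` is a hypothetical function, and `None` stands in for a missing dictionary page.

```python
from typing import List, Optional


def show_dictionaries(row_groups: List[Optional[List[str]]]) -> List[str]:
    """Render one line per row group; None means 'no dictionary page'."""
    lines = []
    for index, dictionary in enumerate(row_groups):
        if dictionary is None:
            # Report and continue instead of dereferencing a missing page.
            lines.append(f"Row group {index}: no dictionary")
            continue
        lines.append(f"Row group {index}: {len(dictionary)} entries")
    return lines
```

With the file from the report, the first row group would be reported as having no dictionary and the command would still reach the second row group.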





[GitHub] [parquet-mr] shangxinli commented on pull request #946: PARQUET-2120: Dictionary command should handle missing dictionary pages

2022-02-14 Thread GitBox


shangxinli commented on pull request #946:
URL: https://github.com/apache/parquet-mr/pull/946#issuecomment-1039360003


   Thanks for working on it! Can you squash the commits?






[jira] [Commented] (PARQUET-2121) Remove descriptions for the removed modules

2022-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492117#comment-17492117
 ] 

ASF GitHub Bot commented on PARQUET-2121:
-

shangxinli commented on pull request #947:
URL: https://github.com/apache/parquet-mr/pull/947#issuecomment-1039353753


   @sekikn Thanks for working on it! I just left some minor comments.




[GitHub] [parquet-mr] shangxinli commented on pull request #947: PARQUET-2121: Remove descriptions for the removed modules

2022-02-14 Thread GitBox


shangxinli commented on pull request #947:
URL: https://github.com/apache/parquet-mr/pull/947#issuecomment-1039353753


   @sekikn Thanks for working on it! I just left some minor comments.






[jira] [Commented] (PARQUET-2121) Remove descriptions for the removed modules

2022-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492108#comment-17492108
 ] 

ASF GitHub Bot commented on PARQUET-2121:
-

shangxinli commented on a change in pull request #947:
URL: https://github.com/apache/parquet-mr/pull/947#discussion_r806064746



##
File path: README.md
##
@@ -66,10 +66,8 @@ Parquet is a very active project, and new features are being added quickly. Here
 * Type-specific encoding
 * Hive integration (deprecated)
 * Pig integration
-* Cascading integration

Review comment:
   Since the code is still there, do you think we can just add 
'(deprecated)' here?






[GitHub] [parquet-mr] shangxinli commented on a change in pull request #947: PARQUET-2121: Remove descriptions for the removed modules

2022-02-14 Thread GitBox


shangxinli commented on a change in pull request #947:
URL: https://github.com/apache/parquet-mr/pull/947#discussion_r806064746



##
File path: README.md
##
@@ -66,10 +66,8 @@ Parquet is a very active project, and new features are being added quickly. Here
 * Type-specific encoding
 * Hive integration (deprecated)
 * Pig integration
-* Cascading integration

Review comment:
   Since the code is still there, do you think we can just add 
'(deprecated)' here?








[jira] [Comment Edited] (PARQUET-2122) Adding Bloom filter to small Parquet file bloats in size X1700

2022-02-14 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492099#comment-17492099
 ] 

Xinli Shang edited comment on PARQUET-2122 at 2/14/22, 4:56 PM:


[~junjie] Do you know why? 


was (Author: sha...@uber.com):
[~junjie]Do you know why? 



[jira] [Commented] (PARQUET-2122) Adding Bloom filter to small Parquet file bloats in size X1700

2022-02-14 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492099#comment-17492099
 ] 

Xinli Shang commented on PARQUET-2122:
--

[~junjie]Do you know why? 



[jira] [Created] (PARQUET-2122) Adding Bloom filter to small Parquet file bloats in size X1700

2022-02-14 Thread Z M (Jira)
Z M created PARQUET-2122:


 Summary: Adding Bloom filter to small Parquet file bloats in size 
X1700
 Key: PARQUET-2122
 URL: https://issues.apache.org/jira/browse/PARQUET-2122
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cli, parquet-mr
Affects Versions: 1.13.0
Reporter: Z M
 Attachments: data.csv, data_index_bloom.parquet

Converting a small CSV file (14 rows, 1 string column) to Parquet without a 
Bloom filter yields a 600 B file; adding '.withBloomFilterEnabled(true)' to 
ParquetWriter then yields a 1049197 B file.

It isn't clear what the extra space is used for.

The CSV and bloated Parquet files are attached.


