[jira] [Commented] (PARQUET-2224) Publish SBOM artifacts

2023-01-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656442#comment-17656442
 ] 

ASF GitHub Bot commented on PARQUET-2224:
-----------------------------------------

dongjoon-hyun commented on PR #1017:
URL: https://github.com/apache/parquet-mr/pull/1017#issuecomment-1376796050

   Thank you all, @shangxinli, @ggershinsky, @sunchao, @wgtmac.




> Publish SBOM artifacts
> ----------------------
>
> Key: PARQUET-2224
> URL: https://issues.apache.org/jira/browse/PARQUET-2224
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2212) Add ByteBuffer api for decryptors to allow direct memory to be decrypted

2023-01-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656428#comment-17656428
 ] 

ASF GitHub Bot commented on PARQUET-2212:
-----------------------------------------

shangxinli commented on PR #1008:
URL: https://github.com/apache/parquet-mr/pull/1008#issuecomment-1376755881

   @wgtmac Do you have time to have a look? 




> Add ByteBuffer api for decryptors to allow direct memory to be decrypted
> -------------------------------------------------------------------------
>
> Key: PARQUET-2212
> URL: https://issues.apache.org/jira/browse/PARQUET-2212
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.3
>Reporter: Parth Chandra
>Priority: Major
> Fix For: 1.12.3
>
>
> The decrypt API in BlockCipher.Decryptor currently only provides a method that 
> takes in a byte array:
> {code:java}
> byte[] decrypt(byte[] lengthAndCiphertext, byte[] AAD);{code}
> A parquet reader that uses the DirectByteBufferAllocator has to incur the 
> cost of copying the data into a byte array (and sometimes back into a 
> DirectByteBuffer) to decrypt data.
> This proposes adding a new API that accepts a ByteBuffer as input and avoids 
> the data copy:
> {code:java}
> ByteBuffer decrypt(ByteBuffer from, byte[] AAD);{code}
> The decryption in ColumnChunkPageReadStore can also be updated to use the 
> ByteBuffer-based API if the buffer is a DirectByteBuffer. If the buffer is a 
> HeapByteBuffer, then we can continue to use the byte array API, since that 
> does not incur a copy when the underlying byte array is accessed.
> Also, some investigation has shown that decryption with ByteBuffers is not 
> able to use hardware acceleration in JVMs before JDK 17. In those cases, the 
> overall decryption speed is faster with byte arrays, even after incurring the 
> overhead of making a copy. 
> The proposal, then, is to enable the use of the ByteBuffer API for 
> DirectByteBuffers only, and only if the JDK is JDK 17 or higher or the user 
> explicitly configures it. 
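
For orientation, here is a minimal sketch of the two-method decryptor surface described above. The two decrypt signatures are taken from the snippets quoted in the issue; the shortened Decryptor interface name and the DecryptDispatch helper are hypothetical illustrations of the heap-vs-direct routing, not parquet-mr API.

{code:java}
import java.nio.ByteBuffer;

// Sketch: the existing byte-array method plus the proposed ByteBuffer method.
interface Decryptor {
  byte[] decrypt(byte[] lengthAndCiphertext, byte[] AAD); // existing API
  ByteBuffer decrypt(ByteBuffer from, byte[] AAD);        // proposed API
}

// Hypothetical helper showing the routing the issue describes: direct buffers
// take the new copy-free path, heap buffers keep using the byte-array path.
final class DecryptDispatch {
  static ByteBuffer decryptPage(Decryptor d, ByteBuffer page, byte[] aad) {
    if (page.isDirect()) {
      return d.decrypt(page, aad);
    }
    // Assumes the HeapByteBuffer wraps its whole backing array, so accessing
    // the array incurs no copy.
    return ByteBuffer.wrap(d.decrypt(page.array(), aad));
  }
}
{code}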



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2212) Add ByteBuffer api for decryptors to allow direct memory to be decrypted

2023-01-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656427#comment-17656427
 ] 

ASF GitHub Bot commented on PARQUET-2212:
-----------------------------------------

shangxinli commented on code in PR #1008:
URL: https://github.com/apache/parquet-mr/pull/1008#discussion_r1065346689


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageReadStore.java:
##########
@@ -133,11 +135,36 @@ public DataPage readPage() {
       public DataPage visit(DataPageV1 dataPageV1) {
         try {
           BytesInput bytes = dataPageV1.getBytes();
-          if (null != blockDecryptor) {
-            bytes = BytesInput.from(blockDecryptor.decrypt(bytes.toByteArray(), dataPageAAD));
+          BytesInput decompressed;
+
+          if (options.getAllocator().isDirect() && options.useOffHeapDecryptBuffer()) {
+            ByteBuffer byteBuffer = bytes.toByteBuffer();
+            if (!byteBuffer.isDirect()) {
+              throw new ParquetDecodingException("Expected a direct buffer");
+            }
+            if (blockDecryptor != null) {
+              byteBuffer = blockDecryptor.decrypt(byteBuffer, dataPageAAD);
+            }
+            long compressedSize = byteBuffer.limit();
+
+            ByteBuffer decompressedBuffer =
+                options.getAllocator().allocate(dataPageV1.getUncompressedSize());
+            decompressor.decompress(byteBuffer, (int) compressedSize, decompressedBuffer,
+                dataPageV1.getUncompressedSize());
+
+            // HACKY: sometimes we need to do `flip` because the position of output bytebuffer is

Review Comment:
   Ok, I see.
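
For readers following the `flip` discussion above, a tiny self-contained illustration of why an output ByteBuffer must be flipped after something writes into it. This is standard java.nio behavior; the decompressor's write is simulated here by a plain put.

{code:java}
import java.nio.ByteBuffer;

public class FlipDemo {
  public static void main(String[] args) {
    ByteBuffer out = ByteBuffer.allocateDirect(16);
    out.put(new byte[] {1, 2, 3, 4}); // simulate a decompressor writing 4 bytes
    // Now position == 4 and limit == 16: reading here would return garbage.
    out.flip();
    // Now position == 0 and limit == 4: exactly the written bytes are readable.
    byte[] read = new byte[out.remaining()];
    out.get(read);
    System.out.println(read.length);  // prints 4
  }
}
{code}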





> Add ByteBuffer api for decryptors to allow direct memory to be decrypted
> -------------------------------------------------------------------------
>
> Key: PARQUET-2212
> URL: https://issues.apache.org/jira/browse/PARQUET-2212
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.3
>Reporter: Parth Chandra
>Priority: Major
> Fix For: 1.12.3
>
>
> The decrypt API in BlockCipher.Decryptor currently only provides a method that 
> takes in a byte array:
> {code:java}
> byte[] decrypt(byte[] lengthAndCiphertext, byte[] AAD);{code}
> A parquet reader that uses the DirectByteBufferAllocator has to incur the 
> cost of copying the data into a byte array (and sometimes back into a 
> DirectByteBuffer) to decrypt data.
> This proposes adding a new API that accepts a ByteBuffer as input and avoids 
> the data copy:
> {code:java}
> ByteBuffer decrypt(ByteBuffer from, byte[] AAD);{code}
> The decryption in ColumnChunkPageReadStore can also be updated to use the 
> ByteBuffer-based API if the buffer is a DirectByteBuffer. If the buffer is a 
> HeapByteBuffer, then we can continue to use the byte array API, since that 
> does not incur a copy when the underlying byte array is accessed.
> Also, some investigation has shown that decryption with ByteBuffers is not 
> able to use hardware acceleration in JVMs before JDK 17. In those cases, the 
> overall decryption speed is faster with byte arrays, even after incurring the 
> overhead of making a copy. 
> The proposal, then, is to enable the use of the ByteBuffer API for 
> DirectByteBuffers only, and only if the JDK is JDK 17 or higher or the user 
> explicitly configures it. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2075) Unified Rewriter Tool

2023-01-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656426#comment-17656426
 ] 

ASF GitHub Bot commented on PARQUET-2075:
-----------------------------------------

shangxinli commented on PR #1014:
URL: https://github.com/apache/parquet-mr/pull/1014#issuecomment-1376754942

   @gszadovszky I just want to check if you have time to take a look. @wgtmac 
It would be nice if you could take over the work we discussed earlier to build an 
aggregated rewriter. 




> Unified Rewriter Tool
> ---------------------
>
> Key: PARQUET-2075
> URL: https://issues.apache.org/jira/browse/PARQUET-2075
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Xinli Shang
>Assignee: Gang Wu
>Priority: Major
>
> During the discussion of PARQUET-2071, we came up with the idea of a 
> universal tool to translate an existing file to a different state while 
> skipping some steps, like encoding/decoding, to gain speed. For example, 
> only decompress pages and then compress them directly. For PARQUET-2071, we would 
> only decrypt and then encrypt directly. This will be useful for onboarding 
> existing data to Parquet features like column encryption, zstd, etc. 
> We already have tools like trans-compression, column pruning, etc. We will 
> consolidate all these tools into this universal tool. 
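
A rough sketch of the page-level rewrite loop such a universal tool implies; every type here (Codec, RawPage, PageSource, PageSink) is an invented stand-in, not parquet-mr API. The point is the shape: undo only the layer being changed (compression, in this sketch) and pass the encoded values through untouched.

{code:java}
import java.io.IOException;
import java.util.Iterator;

final class TranscodeRewriter {
  // All of these types are hypothetical stand-ins, not parquet-mr classes.
  interface Codec { byte[] decompress(byte[] in); byte[] compress(byte[] in); }

  static final class RawPage {
    final byte[] compressedBytes;
    final int valueCount;
    RawPage(byte[] compressedBytes, int valueCount) {
      this.compressedBytes = compressedBytes;
      this.valueCount = valueCount;
    }
  }

  interface PageSource extends Iterator<RawPage> {}
  interface PageSink { void write(RawPage page) throws IOException; }

  // Change only the compression layer: decompress each page's bytes and
  // recompress with the target codec; encodings and values pass through as-is.
  static void rewrite(PageSource in, PageSink out, Codec from, Codec to) throws IOException {
    while (in.hasNext()) {
      RawPage page = in.next();
      byte[] raw = from.decompress(page.compressedBytes);
      out.write(new RawPage(to.compress(raw), page.valueCount));
    }
  }
}
{code}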



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2224) Publish SBOM artifacts

2023-01-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656424#comment-17656424
 ] 

ASF GitHub Bot commented on PARQUET-2224:
-----------------------------------------

shangxinli commented on PR #1017:
URL: https://github.com/apache/parquet-mr/pull/1017#issuecomment-1376754236

   Thank you @dongjoon-hyun for working on it! 




> Publish SBOM artifacts
> ----------------------
>
> Key: PARQUET-2224
> URL: https://issues.apache.org/jira/browse/PARQUET-2224
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2224) Publish SBOM artifacts

2023-01-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656425#comment-17656425
 ] 

ASF GitHub Bot commented on PARQUET-2224:
-----------------------------------------

shangxinli merged PR #1017:
URL: https://github.com/apache/parquet-mr/pull/1017




> Publish SBOM artifacts
> ----------------------
>
> Key: PARQUET-2224
> URL: https://issues.apache.org/jira/browse/PARQUET-2224
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2219) ParquetFileReader throws a runtime exception when a file contains only headers and no row data

2023-01-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656422#comment-17656422
 ] 

ASF GitHub Bot commented on PARQUET-2219:
-----------------------------------------

shangxinli commented on PR #1018:
URL: https://github.com/apache/parquet-mr/pull/1018#issuecomment-1376751333

   @gszadovszky Nice to see you are back!




> ParquetFileReader throws a runtime exception when a file contains only 
> headers and no row data
> -----------------------------------------------------------------------
>
> Key: PARQUET-2219
> URL: https://issues.apache.org/jira/browse/PARQUET-2219
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.1
>Reporter: chris stockton
>Assignee: Gang Wu
>Priority: Minor
>
> Google BigQuery has an option to export table data to Parquet-formatted 
> files, but some of these files are written with header data only.  When this 
> happens and these files are opened with the ParquetFileReader, an exception 
> is thrown:
> {{RuntimeException("Illegal row group of 0 rows");}}
> It seems like the ParquetFileReader should not throw an exception when it 
> encounters such a file.
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L949
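
For context, a sketch of the row-group read loop that hits this exception. ParquetFileReader.open, readNextRowGroup, and PageReadStore are real parquet-mr classes; the command-line path handling is only a placeholder. With the skip-empty-block change discussed in PR #1018, zero-row groups no longer surface as the RuntimeException above.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.hadoop.ParquetFileReader;

public class ReadAllRowGroups {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path file = new Path(args[0]); // e.g. a BigQuery-exported parquet file
    try (ParquetFileReader reader = ParquetFileReader.open(conf, file)) {
      PageReadStore pages;
      // readNextRowGroup returns null once all row groups are consumed.
      while ((pages = reader.readNextRowGroup()) != null) {
        System.out.println("row group with " + pages.getRowCount() + " rows");
      }
    }
  }
}
{code}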



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2160) Close decompression stream to free off-heap memory in time

2023-01-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656421#comment-17656421
 ] 

ASF GitHub Bot commented on PARQUET-2160:
-----------------------------------------

shangxinli commented on PR #982:
URL: https://github.com/apache/parquet-mr/pull/982#issuecomment-1376750703

   @alexeykudinkin We might release a new patch in the next 2 or 3 months.  
   
   Can you elaborate on why "this is a severe problem that does affect our ability 
to use Parquet w/ Zstd"? I understand it is an issue, but we have been running 
Parquet w/ Zstd for 2+ years in production. 




> Close decompression stream to free off-heap memory in time
> -----------------------------------------------------------
>
> Key: PARQUET-2160
> URL: https://issues.apache.org/jira/browse/PARQUET-2160
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Affects Versions: 1.12.3
> Environment: Spark 3.1.2 + Iceberg 0.12 + Parquet 1.12.3 + zstd-jni 
> 1.4.9.1 + glibc
>Reporter: Yujiang Zhong
>Priority: Blocker
>
> The decompressed stream in HeapBytesDecompressor$decompress currently relies on 
> JVM GC to be closed. When reading parquet in zstd compressed format, I sometimes 
> ran into OOMs caused by high off-heap usage. I think the reason is that GC is 
> not timely and causes off-heap memory fragmentation. I had to set a lower 
> MALLOC_TRIM_THRESHOLD_ to make glibc give memory back to the system quickly. 
> There is a 
> [thread|https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1650928750269869?thread_ts=1650927062.590789&cid=C025PH0G1D4]
>  about this zstd parquet issue in the Iceberg community Slack: some people have 
> hit the same problem. 
> I think maybe we can use ByteArrayBytesInput as the decompressed bytes input and 
> close the decompressed stream in time to solve this problem:
> {code:java}
> InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
> decompressed = BytesInput.from(is, uncompressedSize); {code}
> ->
> {code:java}
> InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
> decompressed = BytesInput.copy(BytesInput.from(is, uncompressedSize));
> is.close(); {code}
> After I made this change to decompress, I found off-heap memory usage is 
> significantly reduced (with the same query on the same data).
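
The general shape of that change, shown with a standard-library codec (GZIP) as a stand-in since the parquet and zstd-jni classes are not reproduced here: copy the decompressed bytes eagerly and close the stream deterministically with try-with-resources, the equivalent of the BytesInput.copy(...) plus is.close() above, instead of leaving close() to the GC.

{code:java}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class EagerCloseDemo {
  static byte[] decompressEagerly(byte[] compressed) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    // try-with-resources closes the stream as soon as the copy finishes, so
    // any native/off-heap resources held by the codec are released promptly.
    try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
      byte[] buf = new byte[8192];
      for (int n; (n = in.read(buf)) != -1; ) {
        out.write(buf, 0, n);
      }
    }
    return out.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
      gz.write("hello parquet".getBytes("UTF-8"));
    }
    System.out.println(new String(decompressEagerly(bos.toByteArray()), "UTF-8"));
  }
}
{code}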



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2160) Close decompression stream to free off-heap memory in time

2023-01-09 Thread Alexey Kudinkin (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656336#comment-17656336
 ] 

Alexey Kudinkin commented on PARQUET-2160:
------------------------------------------

Corresponding Spark issue:
https://issues.apache.org/jira/browse/SPARK-41952

> Close decompression stream to free off-heap memory in time
> -----------------------------------------------------------
>
> Key: PARQUET-2160
> URL: https://issues.apache.org/jira/browse/PARQUET-2160
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Affects Versions: 1.12.3
> Environment: Spark 3.1.2 + Iceberg 0.12 + Parquet 1.12.3 + zstd-jni 
> 1.4.9.1 + glibc
>Reporter: Yujiang Zhong
>Priority: Blocker
>
> The decompressed stream in HeapBytesDecompressor$decompress currently relies on 
> JVM GC to be closed. When reading parquet in zstd compressed format, I sometimes 
> ran into OOMs caused by high off-heap usage. I think the reason is that GC is 
> not timely and causes off-heap memory fragmentation. I had to set a lower 
> MALLOC_TRIM_THRESHOLD_ to make glibc give memory back to the system quickly. 
> There is a 
> [thread|https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1650928750269869?thread_ts=1650927062.590789&cid=C025PH0G1D4]
>  about this zstd parquet issue in the Iceberg community Slack: some people have 
> hit the same problem. 
> I think maybe we can use ByteArrayBytesInput as the decompressed bytes input and 
> close the decompressed stream in time to solve this problem:
> {code:java}
> InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
> decompressed = BytesInput.from(is, uncompressedSize); {code}
> ->
> {code:java}
> InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
> decompressed = BytesInput.copy(BytesInput.from(is, uncompressedSize));
> is.close(); {code}
> After I made this change to decompress, I found off-heap memory usage is 
> significantly reduced (with the same query on the same data).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2160) Close decompression stream to free off-heap memory in time

2023-01-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656334#comment-17656334
 ] 

ASF GitHub Bot commented on PARQUET-2160:
-----------------------------------------

alexeykudinkin commented on PR #982:
URL: https://github.com/apache/parquet-mr/pull/982#issuecomment-1376498280

   @gszadovszky @ggershinsky @shangxinli 
   
   Folks, do we have an approximate timeline for the next patch release that 
will include this patch? 
   
   This is a severe problem that does affect our ability to use Parquet w/ Zstd, 
and I'm aware of at least a handful of similar occurrences and issues affecting 
others.




> Close decompression stream to free off-heap memory in time
> -----------------------------------------------------------
>
> Key: PARQUET-2160
> URL: https://issues.apache.org/jira/browse/PARQUET-2160
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Affects Versions: 1.12.3
> Environment: Spark 3.1.2 + Iceberg 0.12 + Parquet 1.12.3 + zstd-jni 
> 1.4.9.1 + glibc
>Reporter: Yujiang Zhong
>Priority: Blocker
>
> The decompressed stream in HeapBytesDecompressor$decompress currently relies on 
> JVM GC to be closed. When reading parquet in zstd compressed format, I sometimes 
> ran into OOMs caused by high off-heap usage. I think the reason is that GC is 
> not timely and causes off-heap memory fragmentation. I had to set a lower 
> MALLOC_TRIM_THRESHOLD_ to make glibc give memory back to the system quickly. 
> There is a 
> [thread|https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1650928750269869?thread_ts=1650927062.590789&cid=C025PH0G1D4]
>  about this zstd parquet issue in the Iceberg community Slack: some people have 
> hit the same problem. 
> I think maybe we can use ByteArrayBytesInput as the decompressed bytes input and 
> close the decompressed stream in time to solve this problem:
> {code:java}
> InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
> decompressed = BytesInput.from(is, uncompressedSize); {code}
> ->
> {code:java}
> InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
> decompressed = BytesInput.copy(BytesInput.from(is, uncompressedSize));
> is.close(); {code}
> After I made this change to decompress, I found off-heap memory usage is 
> significantly reduced (with the same query on the same data).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2160) Close decompression stream to free off-heap memory in time

2023-01-09 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin resolved PARQUET-2160.
--------------------------------------
Resolution: Fixed

> Close decompression stream to free off-heap memory in time
> -----------------------------------------------------------
>
> Key: PARQUET-2160
> URL: https://issues.apache.org/jira/browse/PARQUET-2160
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Affects Versions: 1.12.3
> Environment: Spark 3.1.2 + Iceberg 0.12 + Parquet 1.12.3 + zstd-jni 
> 1.4.9.1 + glibc
>Reporter: Yujiang Zhong
>Priority: Blocker
>
> The decompressed stream in HeapBytesDecompressor$decompress currently relies on 
> JVM GC to be closed. When reading parquet in zstd compressed format, I sometimes 
> ran into OOMs caused by high off-heap usage. I think the reason is that GC is 
> not timely and causes off-heap memory fragmentation. I had to set a lower 
> MALLOC_TRIM_THRESHOLD_ to make glibc give memory back to the system quickly. 
> There is a 
> [thread|https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1650928750269869?thread_ts=1650927062.590789&cid=C025PH0G1D4]
>  about this zstd parquet issue in the Iceberg community Slack: some people have 
> hit the same problem. 
> I think maybe we can use ByteArrayBytesInput as the decompressed bytes input and 
> close the decompressed stream in time to solve this problem:
> {code:java}
> InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
> decompressed = BytesInput.from(is, uncompressedSize); {code}
> ->
> {code:java}
> InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
> decompressed = BytesInput.copy(BytesInput.from(is, uncompressedSize));
> is.close(); {code}
> After I made this change to decompress, I found off-heap memory usage is 
> significantly reduced (with the same query on the same data).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2160) Close decompression stream to free off-heap memory in time

2023-01-09 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated PARQUET-2160:
-------------------------------------
Priority: Blocker  (was: Critical)

> Close decompression stream to free off-heap memory in time
> -----------------------------------------------------------
>
> Key: PARQUET-2160
> URL: https://issues.apache.org/jira/browse/PARQUET-2160
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Affects Versions: 1.12.3
> Environment: Spark 3.1.2 + Iceberg 0.12 + Parquet 1.12.3 + zstd-jni 
> 1.4.9.1 + glibc
>Reporter: Yujiang Zhong
>Priority: Blocker
>
> The decompressed stream in HeapBytesDecompressor$decompress currently relies on 
> JVM GC to be closed. When reading parquet in zstd compressed format, I sometimes 
> ran into OOMs caused by high off-heap usage. I think the reason is that GC is 
> not timely and causes off-heap memory fragmentation. I had to set a lower 
> MALLOC_TRIM_THRESHOLD_ to make glibc give memory back to the system quickly. 
> There is a 
> [thread|https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1650928750269869?thread_ts=1650927062.590789&cid=C025PH0G1D4]
>  about this zstd parquet issue in the Iceberg community Slack: some people have 
> hit the same problem. 
> I think maybe we can use ByteArrayBytesInput as the decompressed bytes input and 
> close the decompressed stream in time to solve this problem:
> {code:java}
> InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
> decompressed = BytesInput.from(is, uncompressedSize); {code}
> ->
> {code:java}
> InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
> decompressed = BytesInput.copy(BytesInput.from(is, uncompressedSize));
> is.close(); {code}
> After I made this change to decompress, I found off-heap memory usage is 
> significantly reduced (with the same query on the same data).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2160) Close decompression stream to free off-heap memory in time

2023-01-09 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated PARQUET-2160:
-------------------------------------
Issue Type: Bug  (was: Improvement)

> Close decompression stream to free off-heap memory in time
> -----------------------------------------------------------
>
> Key: PARQUET-2160
> URL: https://issues.apache.org/jira/browse/PARQUET-2160
> Project: Parquet
>  Issue Type: Bug
> Environment: Spark 3.1.2 + Iceberg 0.12 + Parquet 1.12.3 + zstd-jni 
> 1.4.9.1 + glibc
>Reporter: Yujiang Zhong
>Priority: Critical
>
> The decompressed stream in HeapBytesDecompressor$decompress currently relies on 
> JVM GC to be closed. When reading parquet in zstd compressed format, I sometimes 
> ran into OOMs caused by high off-heap usage. I think the reason is that GC is 
> not timely and causes off-heap memory fragmentation. I had to set a lower 
> MALLOC_TRIM_THRESHOLD_ to make glibc give memory back to the system quickly. 
> There is a 
> [thread|https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1650928750269869?thread_ts=1650927062.590789&cid=C025PH0G1D4]
>  about this zstd parquet issue in the Iceberg community Slack: some people have 
> hit the same problem. 
> I think maybe we can use ByteArrayBytesInput as the decompressed bytes input and 
> close the decompressed stream in time to solve this problem:
> {code:java}
> InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
> decompressed = BytesInput.from(is, uncompressedSize); {code}
> ->
> {code:java}
> InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
> decompressed = BytesInput.copy(BytesInput.from(is, uncompressedSize));
> is.close(); {code}
> After I made this change to decompress, I found off-heap memory usage is 
> significantly reduced (with the same query on the same data).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2160) Close decompression stream to free off-heap memory in time

2023-01-09 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated PARQUET-2160:
-------------------------------------
Component/s: parquet-format

> Close decompression stream to free off-heap memory in time
> -----------------------------------------------------------
>
> Key: PARQUET-2160
> URL: https://issues.apache.org/jira/browse/PARQUET-2160
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Affects Versions: 1.12.3
> Environment: Spark 3.1.2 + Iceberg 0.12 + Parquet 1.12.3 + zstd-jni 
> 1.4.9.1 + glibc
>Reporter: Yujiang Zhong
>Priority: Critical
>
> The decompressed stream in HeapBytesDecompressor$decompress currently relies on 
> JVM GC to be closed. When reading parquet in zstd compressed format, I sometimes 
> ran into OOMs caused by high off-heap usage. I think the reason is that GC is 
> not timely and causes off-heap memory fragmentation. I had to set a lower 
> MALLOC_TRIM_THRESHOLD_ to make glibc give memory back to the system quickly. 
> There is a 
> [thread|https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1650928750269869?thread_ts=1650927062.590789&cid=C025PH0G1D4]
>  about this zstd parquet issue in the Iceberg community Slack: some people have 
> hit the same problem. 
> I think maybe we can use ByteArrayBytesInput as the decompressed bytes input and 
> close the decompressed stream in time to solve this problem:
> {code:java}
> InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
> decompressed = BytesInput.from(is, uncompressedSize); {code}
> ->
> {code:java}
> InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
> decompressed = BytesInput.copy(BytesInput.from(is, uncompressedSize));
> is.close(); {code}
> After I made this change to decompress, I found off-heap memory usage is 
> significantly reduced (with the same query on the same data).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2160) Close decompression stream to free off-heap memory in time

2023-01-09 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated PARQUET-2160:
-------------------------------------
Priority: Critical  (was: Major)

> Close decompression stream to free off-heap memory in time
> -----------------------------------------------------------
>
> Key: PARQUET-2160
> URL: https://issues.apache.org/jira/browse/PARQUET-2160
> Project: Parquet
>  Issue Type: Improvement
> Environment: Spark 3.1.2 + Iceberg 0.12 + Parquet 1.12.3 + zstd-jni 
> 1.4.9.1 + glibc
>Reporter: Yujiang Zhong
>Priority: Critical
>
> The decompressed stream in HeapBytesDecompressor$decompress currently relies on 
> JVM GC to be closed. When reading parquet in zstd compressed format, I sometimes 
> ran into OOMs caused by high off-heap usage. I think the reason is that GC is 
> not timely and causes off-heap memory fragmentation. I had to set a lower 
> MALLOC_TRIM_THRESHOLD_ to make glibc give memory back to the system quickly. 
> There is a 
> [thread|https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1650928750269869?thread_ts=1650927062.590789&cid=C025PH0G1D4]
>  about this zstd parquet issue in the Iceberg community Slack: some people have 
> hit the same problem. 
> I think maybe we can use ByteArrayBytesInput as the decompressed bytes input and 
> close the decompressed stream in time to solve this problem:
> {code:java}
> InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
> decompressed = BytesInput.from(is, uncompressedSize); {code}
> ->
> {code:java}
> InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
> decompressed = BytesInput.copy(BytesInput.from(is, uncompressedSize));
> is.close(); {code}
> After I made this change to decompress, I found off-heap memory usage is 
> significantly reduced (with the same query on the same data).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2160) Close decompression stream to free off-heap memory in time

2023-01-09 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated PARQUET-2160:
-------------------------------------
Affects Version/s: 1.12.3

> Close decompression stream to free off-heap memory in time
> -----------------------------------------------------------
>
> Key: PARQUET-2160
> URL: https://issues.apache.org/jira/browse/PARQUET-2160
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.12.3
> Environment: Spark 3.1.2 + Iceberg 0.12 + Parquet 1.12.3 + zstd-jni 
> 1.4.9.1 + glibc
>Reporter: Yujiang Zhong
>Priority: Critical
>
> The decompressed stream in HeapBytesDecompressor$decompress currently relies on 
> JVM GC to be closed. When reading parquet in zstd compressed format, I sometimes 
> ran into OOMs caused by high off-heap usage. I think the reason is that GC is 
> not timely and causes off-heap memory fragmentation. I had to set a lower 
> MALLOC_TRIM_THRESHOLD_ to make glibc give memory back to the system quickly. 
> There is a 
> [thread|https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1650928750269869?thread_ts=1650927062.590789&cid=C025PH0G1D4]
>  about this zstd parquet issue in the Iceberg community Slack: some people have 
> hit the same problem. 
> I think maybe we can use ByteArrayBytesInput as the decompressed bytes input and 
> close the decompressed stream in time to solve this problem:
> {code:java}
> InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
> decompressed = BytesInput.from(is, uncompressedSize); {code}
> ->
> {code:java}
> InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
> decompressed = BytesInput.copy(BytesInput.from(is, uncompressedSize));
> is.close(); {code}
> After I made this change to decompress, I found off-heap memory usage is 
> significantly reduced (with the same query on the same data).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-758) [Format] HALF precision FLOAT Logical type

2023-01-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656264#comment-17656264
 ] 

ASF GitHub Bot commented on PARQUET-758:
----------------------------------------

anjakefala commented on PR #184:
URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1376199292

   Hey @emkornfield! Is it reasonable for me to send a proposal to the mailing 
list for a vote? It seems @gszadovszky is not available for insight; is there 
anyone else who can provide it?




> [Format] HALF precision FLOAT Logical type
> ------------------------------------------
>
> Key: PARQUET-758
> URL: https://issues.apache.org/jira/browse/PARQUET-758
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Julien Le Dem
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2219) ParquetFileReader throws a runtime exception when a file contains only headers and no row data

2023-01-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655975#comment-17655975
 ] 

ASF GitHub Bot commented on PARQUET-2219:
-----------------------------------------

gszadovszky commented on code in PR #1018:
URL: https://github.com/apache/parquet-mr/pull/1018#discussion_r1064374553


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##########
@@ -1038,7 +1044,9 @@ public PageReadStore readNextFilteredRowGroup() throws IOException {
     }
     BlockMetaData block = blocks.get(currentBlock);
     if (block.getRowCount() == 0L) {
-      throw new RuntimeException("Illegal row group of 0 rows");
+      // Skip the empty block
+      advanceToNextBlock();
+      return readNextFilteredRowGroup();

Review Comment:
   There is a warning log in case of `readNextRowGroup` but here we don't log 
anything.



##########
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetReaderEmptyBlock.java:
##########
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.column.page.PageReadStore;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.junit.Assert;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.net.URISyntaxException;
+
+public class TestParquetReaderEmptyBlock {
+
+  private static final Path EMPTY_BLOCK_FILE_1 = createPathFromCP("/test-empty-row-group_1.parquet");
+
+  private static final Path EMPTY_BLOCK_FILE_2 = createPathFromCP("/test-empty-row-group_2.parquet");
+
+  private static Path createPathFromCP(String path) {
+    try {
+      return new Path(TestParquetReaderEmptyBlock.class.getResource(path).toURI());
+    } catch (URISyntaxException e) {
+      throw new RuntimeException(e);
+    }
+  }
+
+  @Test
+  public void testReadOnlyEmptyBlock() throws IOException {
+    Configuration conf = new Configuration();
+    ParquetMetadata readFooter = ParquetFileReader.readFooter(conf, EMPTY_BLOCK_FILE_1);
+
+    // The parquet file contains only one empty row group
+    Assert.assertEquals(1, readFooter.getBlocks().size());
+
+    // The empty block is skipped
+    try (ParquetFileReader r = new ParquetFileReader(conf, EMPTY_BLOCK_FILE_1, readFooter)) {
+      Assert.assertNull(r.readNextRowGroup());
+    }
+  }
+
+  @Test
+  public void testSkipEmptyBlock() throws IOException {
+    Configuration conf = new Configuration();
+    ParquetMetadata readFooter = ParquetFileReader.readFooter(conf, EMPTY_BLOCK_FILE_2);
+
+    // The parquet file contains three row groups, the second one is empty
Review Comment:
   I think it would be nice to test the case of multiple empty row groups next 
to each other.





> ParquetFileReader throws a runtime exception when a file contains only 
> headers and no row data
> -----------------------------------------------------------------------
>
> Key: PARQUET-2219
> URL: https://issues.apache.org/jira/browse/PARQUET-2219
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.1
>Reporter: chris stockton
>Assignee: Gang Wu
>Priority: Minor
>
> Google BigQuery has an option to export table data to Parquet-formatted 
> files, but some of these files are written with header data only.  When this 
> happens and these files are opened with the ParquetFileReader, an exception 
> is thrown:
> {{RuntimeException("Illegal row group of 0 rows");}}
> It seems like the ParquetFileReader should not throw an exception when it 
> encounters such a file.
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L949



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-1980) Build and test Apache Parquet on ARM64 CPU architecture

2023-01-09 Thread Martin Tzvetanov Grigorov (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655959#comment-17655959
 ] 

Martin Tzvetanov Grigorov commented on PARQUET-1980:
----------------------------------------------------

Thanks for the ping, [~gszadovszky]!

GitHub Actions does not provide GitHub-hosted Linux ARM64 runners, but we could 
use QEMU!

I will experiment and open a PR!

> Build and test Apache Parquet on ARM64 CPU architecture
> --------------------------------------------------------
>
> Key: PARQUET-1980
> URL: https://issues.apache.org/jira/browse/PARQUET-1980
> Project: Parquet
>  Issue Type: Test
>  Components: parquet-format
>Reporter: Martin Tzvetanov Grigorov
>Assignee: Martin Tzvetanov Grigorov
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> More and more deployments are being done on ARM64 machines.
> It would be good to make sure the Parquet MR project builds fine on it.
> The project moved from TravisCI to GitHub Actions recently (PARQUET-1969), but 
> .travis.yml could be re-introduced for ARM64 until GitHub Actions provides 
> aarch64 nodes!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)