Mark Payne created NIFI-11636:
---------------------------------

             Summary: ParquetReader buffers up to 2 GB of content into heap 
unnecessarily
                 Key: NIFI-11636
                 URL: https://issues.apache.org/jira/browse/NIFI-11636
             Project: Apache NiFi
          Issue Type: Bug
          Components: Extensions
            Reporter: Mark Payne
             Fix For: 1.latest, 2.latest


The Parquet Record Reader uses the NiFiSeekableInputStream. Because Parquet 
requires reading the footer first, this class is intended to use {{mark/reset}} 
so that we can read the footer and then reset back to the beginning.

To achieve this, it calls {{InputStream.mark(Integer.MAX_VALUE)}}, which can 
buffer up to 2 GB onto the heap. However, the underlying InputStream is the 
ContentClaimInputStream. The ContentClaimInputStream has smarts built into it 
to allow resetting without having to buffer content into memory. In particular, 
if you read past the {{limit}} provided and then call {{reset}}, it will close 
the InputStream, open a new InputStream from the beginning of the FlowFile 
content, and seek to the desired offset.

Because of this, we don't need to use {{InputStream.mark(Integer.MAX_VALUE)}} 
and can instead use {{InputStream.mark(8192)}} or some similarly small value.
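
For illustration only, here is a minimal sketch of the "reset by reopening" 
idea described above. It is not the actual ContentClaimInputStream code: the 
names {{ReopenOnResetDemo}}, {{StreamOpener}}, {{ReopeningInputStream}} and 
{{readAscii}} are hypothetical, and unlike the real class this simplified 
version always reopens on {{reset}} rather than only after the limit has been 
exceeded. It shows why the value passed to {{mark}} does not have to cover the 
whole content when the stream can be reopened.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ReopenOnResetDemo {

    /** Hypothetical callback that opens a fresh stream positioned at offset 0 of the content. */
    interface StreamOpener {
        InputStream open() throws IOException;
    }

    /** Hypothetical stand-in: reset() reopens the source and skips forward instead of buffering. */
    static class ReopeningInputStream extends InputStream {
        private final StreamOpener opener;
        private InputStream delegate;
        private long position = 0;      // bytes consumed so far
        private long markPosition = 0;  // offset recorded by the last mark()

        ReopeningInputStream(StreamOpener opener) throws IOException {
            this.opener = opener;
            this.delegate = opener.open();
        }

        @Override
        public int read() throws IOException {
            int b = delegate.read();
            if (b >= 0) {
                position++;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = delegate.read(buf, off, len);
            if (n > 0) {
                position += n;
            }
            return n;
        }

        @Override
        public boolean markSupported() {
            return true;
        }

        @Override
        public synchronized void mark(int readLimit) {
            // Simplification: readLimit is ignored, only the offset is remembered.
            markPosition = position;
        }

        @Override
        public synchronized void reset() throws IOException {
            // Close the current stream, reopen from the start, and seek to the marked offset.
            delegate.close();
            delegate = opener.open();
            long remaining = markPosition;
            while (remaining > 0) {
                long skipped = delegate.skip(remaining);
                if (skipped <= 0) {
                    throw new IOException("Could not seek to marked offset");
                }
                remaining -= skipped;
            }
            position = markPosition;
        }

        @Override
        public void close() throws IOException {
            delegate.close();
        }
    }

    /** Reads up to count bytes and decodes them as ASCII. */
    static String readAscii(InputStream in, int count) throws IOException {
        byte[] buf = new byte[count];
        int total = 0;
        while (total < count) {
            int n = in.read(buf, total, count - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        return new String(buf, 0, total, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        byte[] content = "HEADER-body-FOOTER".getBytes(StandardCharsets.US_ASCII);
        try (InputStream in = new ReopeningInputStream(() -> new ByteArrayInputStream(content))) {
            in.mark(8192);                               // small limit, nothing is buffered
            String all = readAscii(in, content.length);  // read through to the "footer"
            in.reset();                                  // reopen and seek back to offset 0
            String head = readAscii(in, 6);
            System.out.println(all + " / " + head);      // prints "HEADER-body-FOOTER / HEADER"
        }
    }
}
```

In this sketch the heap cost of {{mark}} is a single {{long}}, regardless of 
how much is read before {{reset}}; the trade-off is the cost of reopening and 
skipping.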



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
