markap14 commented on a change in pull request #5011: URL: https://github.com/apache/nifi/pull/5011#discussion_r616704347
########## File path: nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/schema/inference/InferSchemaAccessStrategy.java ##########

```diff
@@ -43,7 +43,7 @@ public InferSchemaAccessStrategy(final RecordSourceFactory<T> recordSourceFactor
     public RecordSchema getSchema(final Map<String, String> variables, final InputStream contentStream, final RecordSchema readSchema) throws IOException {
         // We expect to be able to mark/reset any length because we expect that the underlying stream here will be a ContentClaimInputStream, which is able to
         // re-read the content regardless of how much data is read.
-        contentStream.mark(10_000_000);
+        contentStream.mark(Integer.MAX_VALUE);
```

Review comment:

I think we want to keep this at a smaller value. The general contract of `InputStream` says that when `mark` is called, the stream must remember at least that many bytes but is free to remember more.

As the comment above explains, the general expectation is that the stream will be of type `ContentClaimInputStream`. In that case, the value passed to `mark` is ignored, because the stream can always roll back to the beginning (by re-reading the file under the hood). But keeping the 10 MB limit (even a 1 MB limit would probably be okay) means that if the caller does wrap the InputStream in a `BufferedInputStream`, we have the chance to remember that amount of data. Changing it to `Integer.MAX_VALUE` effectively removes the bound, which can very easily lead to an `OutOfMemoryError`; that should be avoided at almost any cost, since it can leave the entire JVM defunct.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
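To illustrate the mark/reset contract being discussed, here is a minimal, self-contained sketch (the class and data are purely illustrative, not from the NiFi codebase). A `BufferedInputStream` must retain at least `readlimit` bytes after `mark(readlimit)` so that `reset()` can succeed, which is why a bounded limit caps the buffer growth while `Integer.MAX_VALUE` does not:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class MarkResetDemo {
    public static void main(String[] args) throws IOException {
        // Illustrative stand-in for flowfile content; in NiFi this would
        // typically be a ContentClaimInputStream, which ignores the limit.
        byte[] content = "hello, schema inference".getBytes(StandardCharsets.UTF_8);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(content));

        // A bounded readlimit: the BufferedInputStream may grow its internal
        // buffer up to roughly this many bytes to honor a later reset().
        in.mark(10_000_000);

        in.read(new byte[5]); // consume a few bytes, as schema inference would
        in.reset();           // rewind to the marked position

        System.out.println((char) in.read()); // prints 'h' again
    }
}
```

The key point of the review: with `mark(Integer.MAX_VALUE)`, a buffering stream is obliged to retain an effectively unbounded amount of data, so a large input can exhaust the heap.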