markap14 commented on a change in pull request #5011: URL: https://github.com/apache/nifi/pull/5011#discussion_r616704347
########## File path: nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/schema/inference/InferSchemaAccessStrategy.java ##########

```diff
@@ -43,7 +43,7 @@ public InferSchemaAccessStrategy(final RecordSourceFactory<T> recordSourceFactor
     public RecordSchema getSchema(final Map<String, String> variables, final InputStream contentStream, final RecordSchema readSchema) throws IOException {
         // We expect to be able to mark/reset any length because we expect that the underlying stream here will be a ContentClaimInputStream, which is able to
         // re-read the content regardless of how much data is read.
-        contentStream.mark(10_000_000);
+        contentStream.mark(Integer.MAX_VALUE);
```

Review comment:

I think we want to keep this at a smaller value. The general contract of `InputStream` says that when `mark` is called, the stream must remember at least that many bytes but is free to remember more.

As the comment above explains, the general expectation is that the stream will be of type `ContentClaimInputStream`. In that case, the value passed to `mark` is ignored, because the stream can always roll back to the beginning (by re-reading the file under the hood). But keeping the 10 MB limit (even a 1 MB limit would probably be okay) means that if the caller does wrap the InputStream in a `BufferedInputStream`, we have the chance to remember that amount of data. Changing it to `Integer.MAX_VALUE` effectively removes the bound, which can very easily lead to an `OutOfMemoryError`; that should be avoided at almost any cost, since it can leave the entire JVM defunct.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
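To illustrate the mark/reset contract being discussed, here is a minimal, self-contained sketch (the class and data are purely illustrative, not from the NiFi codebase). A `BufferedInputStream` must retain at least `readlimit` bytes after `mark(readlimit)` so that `reset()` can succeed, which is why a bounded limit caps the buffer growth while `Integer.MAX_VALUE` does not:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class MarkResetDemo {
    public static void main(String[] args) throws IOException {
        // Illustrative stand-in for flowfile content; in NiFi this would
        // typically be a ContentClaimInputStream, which ignores the limit.
        byte[] content = "hello, schema inference".getBytes(StandardCharsets.UTF_8);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(content));

        // A bounded readlimit: the BufferedInputStream may grow its internal
        // buffer up to roughly this many bytes to honor a later reset().
        in.mark(10_000_000);

        in.read(new byte[5]); // consume a few bytes, as schema inference would
        in.reset();           // rewind to the marked position

        System.out.println((char) in.read()); // prints 'h' again
    }
}
```

The key point of the review: with `mark(Integer.MAX_VALUE)`, a buffering stream is obliged to retain an effectively unbounded amount of data, so a large input can exhaust the heap.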