Mark Payne created NIFI-5640:
--------------------------------

             Summary: Improve efficiency of Avro Record Reader
                 Key: NIFI-5640
                 URL: https://issues.apache.org/jira/browse/NIFI-5640
             Project: Apache NiFi
          Issue Type: Improvement
            Reporter: Mark Payne
            Assignee: Mark Payne


There are a few things that we are doing in the Avro Reader that cause subpar 
performance. Firstly, in the AvroTypeUtil, when converting an Avro 
GenericRecord to our Record, the building of the RecordSchema is slow because 
we call toString() (which is quite expensive) on the Avro schema in order to 
provide a textual version to RecordSchema. However, the text is typically not 
used and it is optional to provide the schema text, so we should avoid calling 
Schema#toString() whenever possible.
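
One way to defer that cost could be to compute the schema text only when it is first requested. The following is a minimal sketch, not NiFi's actual API: the LazySchemaText class and its Supplier-based constructor are hypothetical names used for illustration, and the supplier stands in for a call such as Schema#toString().

```java
import java.util.function.Supplier;

// Hypothetical holder (not a NiFi class) that defers an expensive
// text rendering until the first time the text is actually requested.
final class LazySchemaText {
    private final Supplier<String> textSupplier;
    private String text; // null until first requested

    LazySchemaText(Supplier<String> textSupplier) {
        this.textSupplier = textSupplier;
    }

    // Runs the expensive supplier at most once, then reuses the cached result.
    synchronized String get() {
        if (text == null) {
            text = textSupplier.get();
        }
        return text;
    }
}
```

With something like new LazySchemaText(() -> avroSchema.toString()), where avroSchema is an assumed variable, records whose schema text is never read would not pay for the conversion at all.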

The AvroTypeUtil class also calls #getNonNullSubSchemas() a lot. In some cases 
we don't really need to do this and can avoid creating the sublist. In other 
cases, we do need to call it. However, the method uses the stream() method on 
an existing List just to filter out 0 or 1 elements. While use of the stream() 
method makes the code very readable, it is quite a bit more expensive than just 
iterating over the existing list and adding to an ArrayList. We should avoid 
use of the stream() method for trivial pieces of code in time-critical 
parts of the codebase.
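
As a rough illustration of the loop-based alternative, the sketch below filters the null branch out of a union without using stream(). It is an assumption-laden stand-in rather than the AvroTypeUtil code: the UnionBranches class and nonNullBranches method are hypothetical, and plain type-name strings are used in place of Avro Schema objects so the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not the AvroTypeUtil implementation). Type names are
// modeled as strings here instead of org.apache.avro.Schema instances.
final class UnionBranches {
    // Returns the union's branches with any "null" branch removed, using a
    // simple loop and a pre-sized ArrayList instead of stream().filter().
    static List<String> nonNullBranches(List<String> unionBranches) {
        List<String> nonNull = new ArrayList<>(unionBranches.size());
        for (String branch : unionBranches) {
            if (!"null".equals(branch)) {
                nonNull.add(branch);
            }
        }
        return nonNull;
    }
}
```

The stream-based equivalent would be unionBranches.stream().filter(b -> !"null".equals(b)).collect(Collectors.toList()). It gives the same result, but sets up a stream pipeline on every call in order to drop at most one element.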

Additionally, I've found that Avro's GenericDatumReader is extremely 
inefficient, at least in some cases, when reading Strings because it uses an 
IdentityHashMap to cache details about the schema. IdentityHashMap is far 
slower than a plain HashMap would be, so we could subclass the reader in 
order to avoid the slow caching.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)