Re: Specific/GenericDatumReader performance and resolving decoders

Scott Carey Thu, 19 Apr 2012 09:20:57 -0700

I think this approach makes sense; reader == writer is common.  In addition to
record fields, unions are affected.


I have been thinking for a while about the fact that resolving records is
slower than reading them without resolution.  In theory, it could be just as
fast, because you can pre-compute the steps needed and bake them into the
reading logic.  Your approach seems like a reasonable way to avoid the cost in
the case where the schemas are equal.
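
As a minimal sketch of the pre-computation idea (not Avro code): the class
below is hypothetical, matches fields by name only, and ignores defaults,
aliases and type promotion.  It computes a writer-to-reader plan once per
schema pair and notes whether the plan is the identity, so the equal-schema
case can take a fast path.

```java
import java.util.List;

/** Hypothetical helper, not part of Avro: a plan computed once per schema pair. */
final class ResolutionPlan {
    /** For each writer field position, the reader slot to fill, or -1 to skip. */
    final int[] writerToReader;
    /** True when every writer field maps to the same position in the reader. */
    final boolean identity;
    private final int readerFieldCount;

    ResolutionPlan(List<String> writerFields, List<String> readerFields) {
        readerFieldCount = readerFields.size();
        writerToReader = new int[writerFields.size()];
        boolean same = writerFields.size() == readerFieldCount;
        for (int i = 0; i < writerToReader.length; i++) {
            // indexOf returns -1 when the reader does not know the field.
            writerToReader[i] = readerFields.indexOf(writerFields.get(i));
            same &= writerToReader[i] == i;
        }
        identity = same;
    }

    /** Takes values in writer order and returns them in reader order. */
    Object[] apply(Object[] valuesInWriterOrder) {
        if (identity) {
            return valuesInWriterOrder.clone(); // fast path: no per-field lookup
        }
        Object[] record = new Object[readerFieldCount];
        for (int i = 0; i < valuesInWriterOrder.length; i++) {
            int slot = writerToReader[i];
            if (slot >= 0) {
                record[slot] = valuesInWriterOrder[i]; // unknown fields are dropped
            }
        }
        return record;
    }
}
```

The point of the sketch is only that the name lookups happen in the
constructor, not on every record; the per-record work is an array walk.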

Please open a JIRA ticket and put your preliminary thoughts there.  It is a
good place to discuss the technical bits of the issue even before you have a
patch.

On 4/19/12 2:09 AM, "Irving, Dave" <[email protected]> wrote:

> Hi,
>  
> Recently I've been looking at the performance of Avro's
> SpecificDatumReaders/Writers. In our use cases, when deserializing, we find it
> quite usual for reader / writer schemas to be identical. Interestingly,
> GenericDatumReader bakes the use of ResolvingDecoders right into its core.
> So even if constructed with a single (reader/writer) schema, a
> ResolvingDecoder is still used.
> I experimented a little, and wrote a SpecificDatumReader which, instead of
> being hard-wired with a ResolvingDecoder, uses a DecodeStrategy, leaving the
> reader dealing only with Decoders directly.
> Details follow, but for 'same schema' decodes the performance difference is
> impressive. For the types of records I deal with, a decode with reader schema
> == writer schema using this approach is about 1.6x faster than a standard
> SpecificDatumReader decode.
>  
>  
> interface DecodeStrategy
> {
>   Decoder configureForRead(Decoder in) throws IOException;
>  
>   void readComplete() throws IOException;
>  
>   void decodeRecordFields(Object old, SpecificRecord record, Schema expected,
> Decoder in, SpecificDatumReader2 reader) throws IOException;
> }
>  
> The idea is that when we hit a record, instead of getting field order from a
> ResolvingDecoder directly, we just let the decode strategy do it for us
> (calling back to the reader for each field, which allows recursion).
> E.g. when we know reader / writer schemas are identical, and we don't want
> validation, an IdentitySchemaDecodeStrategy#decodeRecordFields can just pull
> the fields directly from the provided record schema (calling back on the
> reader for each one):
>  
> ...
>  
> void decodeRecordFields(......)
> {
>   List<Field> fields = expected.getFields();
>   for (int i=0, len = fields.size(); i<len; ++i)
>   {
>     reader.readField(old, in, fields.get(i), record);
>   }
> }
>  
> ...
>  
> The resolving decoder impl of this strategy just does a 'readFieldOrder' like
> GenericDatumReader does today.
>  
> For each read (given a Decoder), the datum reader lets the decode strategy
> return the actual decoder to be used (via #configureForRead). This means
> that a resolving implementation can use this hook to configure the
> ResolvingDecoder and return it.
> The result is that the datum reader can work with same schema / validated
> schema / resolved schemas seamlessly without caring about the difference.
>  
> I thought I'd share the approach before working on a full patch: Is this an
> approach you'd be interested in taking back to core Avro? Or is it a little
> niche? :)
>  
> Cheers,
>  
> Dave
>  
> 
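
The strategy shape described in the quoted message can be sketched in a
self-contained toy.  Everything here is a hypothetical stand-in: ToyDecoder
is not Avro's Decoder, records are plain int arrays, and ReorderingStrategy
only permutes fields, which is far less than a ResolvingDecoder does.  It is
meant to show the reader delegating field order to the strategy and the
strategy calling back into the reader for each field.

```java
/** Toy stand-in for a decoder; not Avro's Decoder. */
interface ToyDecoder {
    int readInt();
}

/** Reads ints from a fixed array, in order. */
final class ArrayDecoder implements ToyDecoder {
    private final int[] data;
    private int next;
    ArrayDecoder(int... data) { this.data = data; }
    public int readInt() { return data[next++]; }
}

/** Hypothetical, simplified version of the strategy in the quoted message. */
interface ToyDecodeStrategy {
    /** Returns the decoder the reader should actually use for this read. */
    ToyDecoder configureForRead(ToyDecoder in);
    /** Decides field order, calling back into the reader for each field. */
    void decodeRecordFields(int[] record, ToyDecoder in, ToyReader reader);
}

final class ToyReader {
    private final ToyDecodeStrategy strategy;
    ToyReader(ToyDecodeStrategy strategy) { this.strategy = strategy; }

    int[] read(int fieldCount, ToyDecoder in) {
        ToyDecoder actual = strategy.configureForRead(in);
        int[] record = new int[fieldCount];
        strategy.decodeRecordFields(record, actual, this);
        return record;
    }

    /** Callback used by strategies; a real reader would recurse by type here. */
    void readField(int[] record, int position, ToyDecoder in) {
        record[position] = in.readInt();
    }
}

/** Reader schema == writer schema: fields arrive in declaration order. */
final class IdentityStrategy implements ToyDecodeStrategy {
    public ToyDecoder configureForRead(ToyDecoder in) { return in; }
    public void decodeRecordFields(int[] record, ToyDecoder in, ToyReader reader) {
        for (int i = 0, len = record.length; i < len; ++i) {
            reader.readField(record, i, in);
        }
    }
}

/** Stand-in for the resolving case: writer order differs from reader order. */
final class ReorderingStrategy implements ToyDecodeStrategy {
    private final int[] writerToReader;
    ReorderingStrategy(int... writerToReader) { this.writerToReader = writerToReader; }
    public ToyDecoder configureForRead(ToyDecoder in) { return in; }
    public void decodeRecordFields(int[] record, ToyDecoder in, ToyReader reader) {
        for (int slot : writerToReader) {
            reader.readField(record, slot, in);
        }
    }
}
```

For example, new ToyReader(new IdentityStrategy()).read(3, new
ArrayDecoder(10, 20, 30)) fills the record in declaration order, while a
ReorderingStrategy(2, 0, 1) places the first encoded value in slot 2.  The
reader code is the same in both cases.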
