I think this approach makes sense; reader == writer is a common case. In addition to record fields, unions are affected.
I have been thinking for a while about the fact that resolving records is slower than not resolving them. In theory, it could be just as fast, because you can pre-compute the steps needed and bake them into the reading logic. This seems like a reasonable way to avoid the cost in the case where the schemas are equal.

Please open a JIRA ticket and put your preliminary thoughts there. It is a good place to discuss the technical bits of the issue even before you have a patch.

On 4/19/12 2:09 AM, "Irving, Dave" <[email protected]> wrote:

> Hi,
>
> Recently I've been looking at the performance of Avro's
> SpecificDatumReaders/Writers. In our use cases, when deserializing, we find it
> quite usual for reader / writer schemas to be identical. Interestingly,
> GenericDatumReader bakes in the use of ResolvingDecoders right in to its core.
> So even if constructed with a single (reader/writer) schema, a
> ResolvingDecoder is still used.
> I experimented a little, and wrote a SpecificDatumReader which, instead of
> being hard wired with a ResolvingDecoder, uses a DecodeStrategy, leaving the
> reader only dealing with Decoders directly.
> Details follow, but for 'same schema' decodes the performance difference is
> impressive. For the types of records I deal with, a decode with reader schema
> == writer schema using this approach is about 1.6x faster than a standard
> SpecificDatumReader decode.
>
>
> interface DecodeStrategy
> {
>   Decoder configureForRead(Decoder in) throws IOException;
>
>   void readComplete() throws IOException;
>
>   void decodeRecordFields(Object old, SpecificRecord record, Schema expected,
>       Decoder in, SpecificDatumReader2 reader) throws IOException;
> }
>
> The idea is that when we hit a record, instead of getting field order from a
> ResolvingDecoder directly, we just let the decode strategy do it for us
> (calling back for each field to the reader, allowing recursion).
> For example, when we know reader / writer schemas are identical, and we don't
> want validation, an IdentitySchemaDecodeStrategy#decodeRecordFields can just
> pull the fields direct from the provided record schema (calling back on the
> reader for each one):
>
> ...
>
> void decodeRecordFields(......)
> {
>   List<Field> fields = expected.getFields();
>   for (int i=0, len = fields.size(); i<len; ++i)
>   {
>     reader.readField(old, in, fields.get(i), record);
>   }
> }
>
> ...
>
> The resolving decoder impl of this strategy just does a 'readFieldOrder' like
> GenericDatumReader does today.
>
> For each read (given a Decoder), the datum reader lets the decode strategy
> return back the actual decoder to be used (via #configureForRead). This means
> that a resolving implementation can use this hook to configure the
> ResolvingDecoder and return this.
> The result is that the datum reader can work with same schema / validated
> schema / resolved schemas seamlessly without caring about the difference.
>
> I thought I'd share the approach before working on a full patch: Is this an
> approach you'd be interested in taking back to core Avro? Or is it a little
> niche? :)
>
> Cheers,
>
> Dave
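To make the strategy split in the quoted message concrete, here is a minimal sketch in plain Java. It does not use Avro at all: `FieldOrderStrategy`, `IdentityStrategy`, `ResolvingStrategy`, and `readRecord` are hypothetical stand-ins, and schemas are reduced to lists of field names. It only illustrates the shape of the idea (the reader asks a strategy for the field order and calls back per field); it ignores defaults for reader fields the writer lacks, type promotion, and unions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of the strategy idea; none of these types are Avro classes. */
class DecodeStrategySketch {

    /** Hypothetical stand-in for DecodeStrategy: decides the field read order. */
    interface FieldOrderStrategy {
        List<String> fieldOrder(List<String> readerFields);
    }

    /** Reader schema == writer schema: use the declared order, no resolution. */
    static final class IdentityStrategy implements FieldOrderStrategy {
        public List<String> fieldOrder(List<String> readerFields) {
            return readerFields;
        }
    }

    /** Schemas differ: follow the writer's order, keep fields the reader knows. */
    static final class ResolvingStrategy implements FieldOrderStrategy {
        private final List<String> writerFields;

        ResolvingStrategy(List<String> writerFields) {
            this.writerFields = writerFields;
        }

        public List<String> fieldOrder(List<String> readerFields) {
            List<String> order = new ArrayList<>();
            for (String name : writerFields) {
                if (readerFields.contains(name)) {
                    order.add(name);
                }
            }
            return order;
        }
    }

    /** The reader never resolves anything itself; it just follows the strategy. */
    static List<String> readRecord(List<String> readerFields, FieldOrderStrategy strategy) {
        List<String> visited = new ArrayList<>();
        for (String field : strategy.fieldOrder(readerFields)) {
            visited.add(field); // a real reader would decode the field value here
        }
        return visited;
    }

    public static void main(String[] args) {
        List<String> reader = Arrays.asList("id", "name", "email");
        List<String> writer = Arrays.asList("name", "id", "legacy");
        System.out.println(readRecord(reader, new IdentityStrategy()));
        System.out.println(readRecord(reader, new ResolvingStrategy(writer)));
    }
}
```

The point of the split is that `readRecord` is identical in both cases, so the same-schema path pays nothing for resolution it does not need.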
