Hi,
Recently I've been looking at the performance of Avro's
SpecificDatumReaders/Writers. In our use cases, when deserializing, we find it
quite usual for reader / writer schemas to be identical. Interestingly,
GenericDatumReader bakes the use of ResolvingDecoder right into its core, so
even if it is constructed with a single (reader/writer) schema, a
ResolvingDecoder is still used.
I experimented a little, and wrote a SpecificDatumReader which, instead of
being hard-wired to a ResolvingDecoder, uses a DecodeStrategy - leaving the
reader dealing only with plain Decoders directly.
Details follow - but for 'same schema' decodes - the performance difference is
impressive. For the types of records I deal with, a decode with reader schema
== writer schema using this approach is about 1.6x faster than a standard
SpecificDatumReader decode.
interface DecodeStrategy
{
    Decoder configureForRead(Decoder in) throws IOException;
    void readComplete() throws IOException;
    void decodeRecordFields(Object old, SpecificRecord record, Schema expected,
        Decoder in, SpecificDatumReader2 reader) throws IOException;
}
The idea is that when we hit a record, instead of getting field order from a
ResolvingDecoder directly, we just let the decode strategy do it for us
(calling back for each field to the reader - allowing recursion).
For example, when we know reader / writer schemas are identical, and we don't
want validation, an IdentitySchemaDecodeStrategy#decodeRecordFields can just
pull the fields directly from the provided record schema (calling back on the
reader for each one):
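...
void decodeRecordFields(......)
{
    // Same schema on both sides: the schema's own field order is the wire order.
    List<Field> fields = expected.getFields();
    for (int i = 0, len = fields.size(); i < len; ++i)
    {
        Field field = fields.get(i);
        reader.readField(old, in, field, record);
    }
}
...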
The resolving decoder impl of this strategy just does a 'readFieldOrder' like
GenericDatumReader does today.
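To make the difference in field ordering concrete, here is a minimal,
self-contained sketch. It does not use Avro's classes: FieldOrderStrategy,
IdentityOrder and ResolvedOrder are hypothetical stand-ins, and field names
are plain strings. It also ignores everything else a real ResolvingDecoder
does (skipping the data of writer-only fields, filling reader-only fields from
defaults, type promotion).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in: decides the order in which a record's fields are read.
interface FieldOrderStrategy {
    List<String> fieldOrder(List<String> readerFields, List<String> writerFields);
}

// Same schema on both sides: the reader's own field list is already the wire order.
final class IdentityOrder implements FieldOrderStrategy {
    @Override
    public List<String> fieldOrder(List<String> readerFields, List<String> writerFields) {
        return readerFields;
    }
}

// Simplified resolution: follow the writer's order, keeping only the fields
// the reader knows about. Reader-only fields (defaults) are not modelled here.
final class ResolvedOrder implements FieldOrderStrategy {
    @Override
    public List<String> fieldOrder(List<String> readerFields, List<String> writerFields) {
        List<String> order = new ArrayList<>();
        for (String name : writerFields) {
            if (readerFields.contains(name)) {
                order.add(name);
            }
        }
        return order;
    }
}
```

With reader fields (id, name, email) and writer fields (name, legacy, id), the
resolved order is (name, id), while the identity strategy simply hands back
the reader's list without computing anything per read.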
For each read (given a Decoder), the datum reader lets the decode strategy
return the actual decoder to be used (via #configureForRead). This means that
a resolving implementation can use this hook to configure its ResolvingDecoder
around the incoming Decoder and return that instead.
The result is that the datum reader can work with same schema / validated
schema / resolved schemas seamlessly without caring about the difference.
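The read lifecycle described above can be sketched roughly as follows. This is
a toy under stated assumptions, not the patch itself: ToyDecoder,
QueueDecoder, ToyStrategy, PassThroughStrategy, WrappingStrategy and ToyReader
are all hypothetical names, and the "decoder" only hands out ints. It is meant
to show just the two hooks, configureForRead and readComplete, and that the
reader code is the same whichever strategy is plugged in.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for a decoder: hands out ints in order.
interface ToyDecoder {
    int readInt();
}

final class QueueDecoder implements ToyDecoder {
    private final Deque<Integer> values = new ArrayDeque<>();
    QueueDecoder(int... vs) { for (int v : vs) values.add(v); }
    @Override public int readInt() { return values.remove(); }
}

// Hypothetical two-hook slice of the strategy interface.
interface ToyStrategy {
    ToyDecoder configureForRead(ToyDecoder in);
    void readComplete();
}

// Same-schema case: nothing to set up, so the incoming decoder is used as-is.
final class PassThroughStrategy implements ToyStrategy {
    @Override public ToyDecoder configureForRead(ToyDecoder in) { return in; }
    @Override public void readComplete() { }
}

// Stands in for a resolving strategy: wraps the incoming decoder per read.
final class WrappingStrategy implements ToyStrategy {
    int reads = 0;
    boolean completed = false;

    @Override public ToyDecoder configureForRead(ToyDecoder in) {
        completed = false;
        return () -> { reads++; return in.readInt(); };
    }

    @Override public void readComplete() { completed = true; }
}

// The reader only ever talks to whatever decoder the strategy returns.
final class ToyReader {
    private final ToyStrategy strategy;
    ToyReader(ToyStrategy strategy) { this.strategy = strategy; }

    int[] readPair(ToyDecoder in) {
        ToyDecoder actual = strategy.configureForRead(in);
        int[] pair = { actual.readInt(), actual.readInt() };
        strategy.readComplete();
        return pair;
    }
}
```

ToyReader has no branch on the kind of schema handling in use; swapping
PassThroughStrategy for WrappingStrategy changes what sits between the reader
and the raw decoder, not the reader itself.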
I thought I'd share the approach before working on a full patch: Is this an
approach you'd be interested in taking back to core avro? Or is it a little
niche? :)
Cheers,
Dave