I am working with Avro 1.9.1 C++. I have reviewed the simple examples and
unit tests. I am having a hard time figuring out how to build a record
datum to encode into a memory buffer to provide to Kafka. I understand the
memory writer part - it's the building of the record datum that is stumping
me.

Here's my scenario. I want to create a library that can be called from some
existing C/C++ code that generates CSV records. The requirement is that I
take as input a CSV header, a CSV data record, and an Avro schema to map the
CSV into. The caller will provide the CSV header and a CSV data record. The
CSV header field names will exactly match the field names in an Avro JSON
schema designed for each CSV data type. However, the order of the fields in
the CSV might not match the order of the fields in the Avro schema, and
there could be more or fewer fields in either. I need to validate all of
that as part of my new library.

I am trying the simple case here where the Avro schema is flat, like the
CSV, and only has a top-level record with a bunch of fields matching those
in the CSV header. However, the Avro schema "may" have more fields than the
CSV because, as the application creating the CSV evolves, its stats do as
well. For example, it might no longer use some fields and add new fields.
The next version of the schema must still reflect the deleted fields (with
defaults) as well as new fields to support backwards/forward compatibility
from a consumer's perspective. We need to manage schema evolution in a
forward/backwards compatibility mode per the Confluent.io Kafka way of
doing this.

So, those are the preconditions. What I propose to do in this sort of
"Avroizing Proxy" is as follows:

INITIALIZATION:

1. Load and validate the Avro schema from a JSON file as a one-time
initialization action
2. Load a CSV header and create a vector of each field name parsed from the
CSV header as a one-time initialization action (see the sketch after this
list)
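
In case it makes the question clearer, here is a minimal sketch of the
initialization side, which I think I already understand. The schema file
name, the example header line, and the splitHeader() helper are just
placeholders of mine:

#include <fstream>
#include <sstream>
#include <string>
#include <vector>

#include <avro/Compiler.hh>
#include <avro/ValidSchema.hh>

// Split a comma-separated CSV header line into its field names.
static std::vector<std::string> splitHeader(const std::string &headerLine)
{
    std::vector<std::string> names;
    std::istringstream in(headerLine);
    std::string name;
    while (std::getline(in, name, ',')) {
        names.push_back(name);
    }
    return names;
}

int main()
{
    // 1. Load and validate the Avro schema from a JSON file
    //    (compileJsonSchema throws on an invalid schema).
    std::ifstream schemaFile("stats.avsc");                    // placeholder file name
    avro::ValidSchema schema;
    avro::compileJsonSchema(schemaFile, schema);

    // 2. Parse the CSV header once and keep the field names in column order.
    std::string headerLine = "ts,host,bytes,errors,latency";   // example header
    std::vector<std::string> csvFields = splitHeader(headerLine);

    return 0;
}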

FOR EACH CSV RECORD:

1. Create an empty record datum. I want to start with an empty one as a way
to ensure that, after the following steps complete, I end up with the
minimal record datum needed to validate against the schema
2. Iterate over the CSV record, field by field. For each field value, use
its positional index to fetch the CSV header field name from the vector at
the same index
3. Using the same field name and the Avro schema, lookup the data type in
the schema for that field (We will only be dealing with the primitive types
like string, long, and float - no nested records, array, maps, enums, etc)
4. Use the data type to determine which type of primitive encoder (string,
long, float, etc) to use and attempt to cast the CSV value string into the
appropriate type required by the encoder
5. Call some method to add this encoded value to a record datum (see the
sketch after this list)
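
Here is a rough sketch of what I am attempting for these five steps, using
GenericDatum/GenericRecord from avro/GenericDatum.hh. The buildDatum() name
and the "provided" vector are my own, only string/long/float are shown, and
error handling is elided. Whether setFieldAt()/nameIndex() is the intended
way to populate a record datum is exactly what I am unsure about:

#include <cstdint>
#include <string>
#include <vector>

#include <avro/GenericDatum.hh>
#include <avro/Node.hh>
#include <avro/Types.hh>
#include <avro/ValidSchema.hh>

// Build a record datum for one CSV line. csvFields comes from the header,
// csvValues from the current record; "provided" remembers which schema
// fields the CSV actually supplied, for the default pass afterwards.
avro::GenericDatum buildDatum(const avro::ValidSchema &schema,
                              const std::vector<std::string> &csvFields,
                              const std::vector<std::string> &csvValues,
                              std::vector<bool> &provided)
{
    // 1. A datum constructed from the schema is already record-shaped; every
    //    field exists but holds a type default (0, empty string, ...).
    avro::GenericDatum datum(schema);
    avro::GenericRecord &rec = datum.value<avro::GenericRecord>();

    const avro::NodePtr &root = schema.root();         // top-level record node
    provided.assign(root->leaves(), false);

    for (size_t col = 0; col < csvValues.size(); ++col) {
        // 2. Map the CSV column name to the schema field index.
        size_t idx;
        if (!root->nameIndex(csvFields[col], idx)) {
            continue;                                   // CSV field not in the schema
        }

        // 3./4. Look up the field's Avro type and cast the CSV string to it.
        const std::string &s = csvValues[col];
        switch (root->leafAt(idx)->type()) {
        case avro::AVRO_STRING:
            rec.setFieldAt(idx, avro::GenericDatum(s));
            break;
        case avro::AVRO_LONG:
            rec.setFieldAt(idx, avro::GenericDatum(static_cast<int64_t>(std::stoll(s))));
            break;
        case avro::AVRO_FLOAT:
            rec.setFieldAt(idx, avro::GenericDatum(std::stof(s)));
            break;
        default:
            break;                                      // other primitives elided
        }
        provided[idx] = true;                           // 5. field is now in the datum
    }
    return datum;
}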

FWIW - This is basically how I had this working in the C Avro library but
cannot seem to figure out the C++ equivalent.

AFTER EACH RECORD ITERATION COMPLETES:

It is possible that the CSV has 5 fields that all match fields in the
schema but the schema defines 7 fields due to backwards compatibility
requirements. So, now that the record datum is completely built up (with 5
fields from the CSV data), it will still NOT validate against the schema
that requires 7 fields. So, another step is required to deal with the 2
extra fields as follows:

AUTOMATIC DEFAULT HANDLING:

I don't want to make the caller have to provide all 7 fields. He has
evolved his app (and stats) and does not care about those "no longer used"
values in his code. However, he has ensured that he has evolved his schema
in a compatible way, so those old fields MUST still be represented there. I
want the library to detect missing fields and fill in the defaults
automatically, if and only if they are defined.

1. For each field in the schema that is NOT in the record datum, check to
see if the field defines a default value
2. If a field has a default, then encode that field name into the record
datum and set its value to the default defined in the schema
3. If a field does NOT have a default, then the caller did not provide all
the required values and this would be treated as an error condition and the
record rejected
4. In the case of a fully valid and now defaulted record, use
avro::memoryOutputStream() and avro::binaryEncoder() to build the binary
Avro message that I can then publish to Kafka (see the sketch after this
list).
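
And here is what I am imagining for this last pass, continuing the sketch
above. The finishRecord() name is mine, and I am assuming
Node::defaultValueAt() is the right way to reach a field's schema-defined
default; if it is not, I would fall back to pulling the defaults out of the
schema JSON myself:

#include <cstdint>
#include <memory>
#include <vector>

#include <avro/Encoder.hh>
#include <avro/Generic.hh>
#include <avro/GenericDatum.hh>
#include <avro/Node.hh>
#include <avro/Specific.hh>
#include <avro/Stream.hh>
#include <avro/ValidSchema.hh>

// Fill in schema defaults for fields the CSV did not supply, then serialize
// the datum into a memory buffer for Kafka.
std::vector<uint8_t> finishRecord(const avro::ValidSchema &schema,
                                  avro::GenericDatum &datum,
                                  const std::vector<bool> &provided)
{
    const avro::NodePtr &root = schema.root();
    avro::GenericRecord &rec = datum.value<avro::GenericRecord>();

    // 1.-3. For every schema field the CSV did not provide, apply the schema
    // default. (Treating a missing field with no default as an error and
    // rejecting the record is elided here.)
    for (size_t i = 0; i < root->leaves(); ++i) {
        if (provided[i]) {
            continue;
        }
        // ASSUMPTION: defaultValueAt() returns the schema-defined default
        // and fails for a field that declares none.
        rec.setFieldAt(i, root->defaultValueAt(i));
    }

    // 4. Serialize the completed datum with the binary encoder.
    auto out = avro::memoryOutputStream();
    avro::EncoderPtr e = avro::binaryEncoder();
    e->init(*out);
    avro::encode(*e, datum);
    e->flush();

    auto bytes = avro::snapshot(*out);                  // copy of the encoded bytes
    return *bytes;
}

The vector returned there is what I would hand to the Kafka producer.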

I know this is a lot of detail but I would appreciate any guidance on
specifically how to programmatically build up a datum that can then be run
through the appropriate Avro method, with a schema, to validate.

Thanks in advance.
