[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210535#comment-15210535 ]

Niels Basjes commented on AVRO-1704:
------------------------------------

I did some experimenting over the last week and posted my modified version of 
Avro here: https://github.com/nielsbasjes/avro/tree/AVRO-1704

What I did so far:
# Added a getFingerPrint() method to Schema that uses CRC-64-AVRO to calculate the schema fingerprint.
# Added a few SchemaStorage-related classes that allow storing schemas in memory (see the sketch after this list).
# Added a toBytes() method and a static fromBytes() method to the generated classes. Both effectively delegate to the 'real' implementations, which are in the SpecificRecordBase class (a sketch of what those do internally follows the test code below).
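
To make the structure concrete, here is a minimal sketch of what the in-memory storage boils down to. The class name InMemorySchemaStorage is illustrative only (the real names are in the branch linked above); the fingerprint itself is the CRC-64-AVRO fingerprint of the Parsing Canonical Form, which the existing SchemaNormalization.parsingFingerprint64() already computes.
{code:java}
// Illustrative sketch only; the real class names are in the branch linked above.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

public final class InMemorySchemaStorage {
  private static final Map<Long, Schema> SCHEMAS = new ConcurrentHashMap<>();

  // CRC-64-AVRO fingerprint of the schema's Parsing Canonical Form.
  public static long fingerprint(Schema schema) {
    return SchemaNormalization.parsingFingerprint64(schema);
  }

  // Register a schema so it can later be found by its fingerprint.
  public static void put(Schema schema) {
    SCHEMAS.put(fingerprint(schema), schema);
  }

  // Look up the writer's schema for a fingerprint read from the wire.
  public static Schema get(long fingerprint) {
    return SCHEMAS.get(fingerprint);
  }
}
{code}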

All of this passes the existing Java unit tests.

On the application side, my test code (using three slightly different variations 
of the same schema) looks like this, and it works exactly as I expect it to:
{code:java}
SchemaFactory.put(com.bol.measure.v1.Measurement.getClassSchema());
SchemaFactory.put(com.bol.measure.v2.Measurement.getClassSchema());
SchemaFactory.put(com.bol.measure.v3.Measurement.getClassSchema());

com.bol.measure.v1.Measurement measurement = DummyMeasurementFactory.createTestMeasurement(timestamp);
byte[] bytesV1 = measurement.toBytes();

com.bol.measure.v2.Measurement newBornV2 = com.bol.measure.v2.Measurement.fromBytes(bytesV1);
com.bol.measure.v3.Measurement newBornV3 = com.bol.measure.v3.Measurement.fromBytes(bytesV1);
{code}
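
For reviewers, this is roughly what the two methods do under the hood: toBytes() prepends the 8-byte CRC-64-AVRO fingerprint of the writer's schema to the binary-encoded datum, and fromBytes() reads that fingerprint back, looks up the writer's schema, and resolves it against the reader's schema. This is a simplified sketch only; FingerprintCodec is an illustrative name, and the real implementation lives in SpecificRecordBase on the branch.
{code:java}
// Simplified sketch; FingerprintCodec is an illustrative name, and the real
// implementation lives in SpecificRecordBase on the branch.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.avro.Schema;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;
import org.apache.avro.specific.SpecificRecord;

public final class FingerprintCodec {
  public static byte[] toBytes(SpecificRecord record) throws IOException {
    Schema writerSchema = record.getSchema();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    // Prepend the 8-byte CRC-64-AVRO fingerprint identifying the writer's schema.
    out.write(ByteBuffer.allocate(8)
        .putLong(InMemorySchemaStorage.fingerprint(writerSchema)).array());
    // Then the standard Avro binary encoding of the datum itself.
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new SpecificDatumWriter<SpecificRecord>(writerSchema).write(record, encoder);
    encoder.flush();
    return out.toByteArray();
  }

  public static <T extends SpecificRecord> T fromBytes(byte[] bytes, Schema readerSchema)
      throws IOException {
    // Resolve the writer's schema via the fingerprint header.
    Schema writerSchema = InMemorySchemaStorage.get(ByteBuffer.wrap(bytes).getLong());
    // Decode the remainder, resolving the writer's schema against the reader's.
    BinaryDecoder decoder =
        DecoderFactory.get().binaryDecoder(bytes, 8, bytes.length - 8, null);
    return new SpecificDatumReader<T>(writerSchema, readerSchema).read(null, decoder);
  }
}
{code}
The generated fromBytes(byte[]) on each class can then simply pass its own getClassSchema() as the reader schema, which is why the v1 bytes above deserialize cleanly into v2 and v3 objects.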

Things currently missing: documentation, extra tests, etc.

I could really use some feedback on the structure of my change, and advice on 
how to handle the need to call a 'close()' method on the schema storage part.

Thanks.

> Standardized format for encoding messages with Avro
> ---------------------------------------------------
>
>                 Key: AVRO-1704
>                 URL: https://issues.apache.org/jira/browse/AVRO-1704
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Daniel Schierbeck
>
> I'm currently using the Datafile format for encoding messages that are 
> written to Kafka and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync 
> markers and other metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized, 
> meaning that I can read and write data with minimal effort across the various 
> languages in use in my organization. If there were a standardized format for 
> encoding single values that was optimized for out-of-band schema transfer, I 
> would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, e.g. Rabin, MD5, SHA256, etc.
> 3. The actual schema fingerprint (according to the type).
> 4. Optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode 
> datums in this format, as well as a MessageReader that, given a SchemaStore, 
> would be able to decode datums. The reader would decode the fingerprint and 
> ask its SchemaStore to return the corresponding writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed 
> library users to inject custom backends. A simple, file system based one 
> could be provided out of the box.


