Re: Avro, a cross-language serialization framework from Doug Cutting, proposed as Hadoop subproject

Doug Cutting Fri, 03 Apr 2009 09:59:09 -0700

David Reiss wrote:

For those of you who don't have git, forrest, *and* Java 5
(not 6! 5!) installed, I built the docs and put them online:


http://www.projectornation.com/avro-doc/spec.html


Thanks!

- No code generation.  The schema is all in JSON files that are parsed
  at runtime.  For Python, this is probably fine.  I'm not really clear
  on how it looks for Java (maybe someone can look at the Java tests and
  explain it to the rest of us).  For C++, this will definitely make
  the avro objects feel clunky because you'll have to access properties
  by name.  And the lists won't be statically typed.

For C++ we'll probably implement code generation in Avro. Java already includes code generation as an option. Code generation isn't prohibited, it's just optional. My guess is that it will only be implemented in Avro for C/C++ and Java.

Also, you need not access properties by name. For example, the reader for generated Java code maintains an int->int mapping of remote fields to local fields, and fields are accessed by integer. This is effectively what you must do in any generated code: you need a switch statement that maps a field id to the line of code which sets the field. In Thrift and Protocol Buffers, the remote field id is in the data, while in Avro it's instead in the schema.
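To illustrate the int->int mapping described above, here is a minimal sketch in Python. It is not Avro's actual reader: the names build_field_map, read_record, WRITER_FIELDS and READER_FIELDS are hypothetical, and records are modelled as plain lists of values in schema order. The point is only that names are compared once, when the two schemas are resolved, and that decoding afterwards uses integer positions.

```python
from typing import Any, List, Optional, Sequence


def build_field_map(writer_fields: Sequence[str],
                    reader_fields: Sequence[str]) -> List[Optional[int]]:
    """Map each remote (writer) field position to a local (reader) position.

    Names are compared only here, once per schema pair. A remote field the
    reader does not know maps to None and is skipped while decoding.
    """
    local_index = {name: i for i, name in enumerate(reader_fields)}
    return [local_index.get(name) for name in writer_fields]


def read_record(values: Sequence[Any],
                field_map: Sequence[Optional[int]],
                n_local: int) -> List[Any]:
    """Place positionally encoded values into local slots by integer index."""
    record: List[Any] = [None] * n_local
    for remote_pos, value in enumerate(values):
        local_pos = field_map[remote_pos]
        if local_pos is not None:
            record[local_pos] = value
    return record


# Hypothetical schemas: the field order lives in the schema, not in the data.
WRITER_FIELDS = ["id", "name", "email"]
READER_FIELDS = ["name", "id", "age"]
FIELD_MAP = build_field_map(WRITER_FIELDS, READER_FIELDS)

decoded = read_record([7, "ann", "ann@example.org"], FIELD_MAP, len(READER_FIELDS))
```

Generated code would typically replace the list lookup with a switch on the local position, but the data itself still carries no field ids.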

- The full schema is included with the messages, rather than having
  field ids delimit the contents.  This is nice for big Hadoop files
  since you only include the schema once.  (It was a technique that
  we discussed for Thrift.)  For a system like (I guess) Hadoop that
  has long-lived RPC connections with multiple messages passed, I guess
  it is not that big of a deal either.  For a system like we have at
  Facebook where the web server must connect to the feed/search/chat
  server once for each RPC, it is a killer.

This can be optimized by instead passing the hash of the schema, and faulting if the other side has not previously seen that schema, sending it on demand. I've not yet had time to completely specify and implement this approach, but I think it addresses your concern here. The fundamental requirement is only that the server and client somehow have copies of each other's schemas, not that they exchange them with each message or connection. This is why the handshake has a version number, to permit different mechanisms here. The first one is the simplest.
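A rough sketch of the hash-based exchange described above, in Python. Since the approach is not yet specified, everything here is an assumption made for illustration: the Server and Client classes, the "OK" / "NEED_SCHEMA" / "MISMATCH" replies, and the choice of SHA-256 as the hash. Only one direction (client schema to server) is shown.

```python
import hashlib
from typing import Dict, Optional


def schema_hash(schema_json: str) -> str:
    """Digest of the schema text (SHA-256 is an arbitrary choice here)."""
    return hashlib.sha256(schema_json.encode("utf-8")).hexdigest()


class Server:
    """Remembers every schema it has seen, keyed by hash."""

    def __init__(self) -> None:
        self.known: Dict[str, str] = {}

    def handshake(self, digest: str, schema_json: Optional[str] = None) -> str:
        if schema_json is not None:
            # The client is answering a fault: verify, then cache the schema.
            if schema_hash(schema_json) != digest:
                return "MISMATCH"
            self.known[digest] = schema_json
        return "OK" if digest in self.known else "NEED_SCHEMA"


class Client:
    def __init__(self, schema_json: str) -> None:
        self.schema_json = schema_json
        self.digest = schema_hash(schema_json)
        self.schemas_sent = 0  # counts how often the full schema went out

    def connect(self, server: Server) -> str:
        # Send only the hash first; send the schema only if the server faults.
        reply = server.handshake(self.digest)
        if reply == "NEED_SCHEMA":
            self.schemas_sent += 1
            reply = server.handshake(self.digest, self.schema_json)
        return reply


SCHEMA = '{"type": "record", "name": "Ping", "fields": []}'
server = Server()
first_client = Client(SCHEMA)
first_reply = first_client.connect(server)    # server faults, schema sent once
second_client = Client(SCHEMA)
second_reply = second_client.connect(server)  # hash alone is enough
```

Under these assumptions, a web server that opens a new connection per RPC pays for the full schema only on the first call to a given server process; later connections send just the hash.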


Doug
