It sounds like what you want is the option to avoid pre-generated
classes. If that's the only thing you need, it seems like we could
bolt that on to Thrift with almost no work. I assume you'd have the
schema stored in metadata or file header or something, right? (You
wouldn't want to store the field names in the binary encoding as
strings, since that would probably very quickly dwarf the size of the
actual data in a lot of cases.)
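
To make the size argument concrete, here is a rough sketch comparing the two layouts: the schema written once in a file header versus field names repeated as strings in every record. It uses only the Python standard library. The schema, the field names, and the byte layout are all made up for illustration; this is neither Thrift's nor Avro's actual wire format.

```python
import json
import struct

# Hypothetical schema: field names paired with struct format codes.
SCHEMA = [("user_id", "q"), ("score", "d"), ("active", "?")]


def encode_with_header(records):
    """Write the schema once as a JSON header, then pack values only."""
    header = json.dumps(SCHEMA).encode("utf-8")
    fmt = "<" + "".join(code for _, code in SCHEMA)
    body = b"".join(
        struct.pack(fmt, *(rec[name] for name, _ in SCHEMA)) for rec in records
    )
    # A 4-byte length prefix tells the reader where the header ends.
    return struct.pack("<I", len(header)) + header + body


def encode_with_names(records):
    """Repeat every field name as a length-prefixed string in every record."""
    out = bytearray()
    for rec in records:
        for name, code in SCHEMA:
            raw = name.encode("utf-8")
            out += struct.pack("<B", len(raw)) + raw
            out += struct.pack("<" + code, rec[name])
    return bytes(out)


def decode_with_header(blob):
    """Recover dict records using only the schema stored in the header."""
    (header_len,) = struct.unpack_from("<I", blob, 0)
    schema = json.loads(blob[4:4 + header_len].decode("utf-8"))
    fmt = "<" + "".join(code for _, code in schema)
    names = [name for name, _ in schema]
    body = blob[4 + header_len:]
    return [dict(zip(names, values)) for values in struct.iter_unpack(fmt, body)]


if __name__ == "__main__":
    records = [
        {"user_id": i, "score": i / 2, "active": i % 2 == 0} for i in range(1000)
    ]
    print("schema in header:", len(encode_with_header(records)), "bytes")
    print("names per record:", len(encode_with_names(records)), "bytes")
```

With short records like these, the repeated names cost more bytes than the values themselves, while the header is paid for once per file.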
If my assumptions are correct, it seems like it'd be a lot smarter to
leverage existing Thrift infrastructure and encoding work rather than
duplicating it for this lone feature.
-Bryan

On Apr 3, 2009, at 9:06 AM, Doug Cutting wrote:

Owen O'Malley wrote:
2. Protocol buffers (and thrift) encode the field names as id
numbers. That means that if you read them into a dynamic language
like Python, it has to use the field numbers instead of the
field names. In Avro, the field names are saved and there are no
field ids.
This hints at a related problem with Thrift and Protocol Buffers,
which is that they require one to generate code for each datatype
one processes. This is awkward in dynamic environments, where one
would like to write a script (Pig, Python, Perl, Hive, whatever) to
process input data and generate output data, without having to
locate the IDL for each input file, run an IDL compiler, load the
generated code, generate an IDL file for the output, run the
compiler again, load the output code and finally write your
output. Avro instead lets you simply open your inputs, examine
their datatypes, specify output types and write them.
Avro's Java implementation currently includes three different data
representations:
- a "generic" representation uses a standard set of datastructures
for all datatypes: records are represented as Map<String,Object>,
arrays as List<Object>, longs as Long, etc.
- a "reflect" representation uses Java reflection to permit one to
read and write existing Java classes with Avro.
- a "specific" representation generates Java classes that are
compiled and loaded, much like Thrift and Protocol Buffers.
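
As an illustration of the "generic" idea in a scripting language, the following Python sketch walks a schema at runtime and maps records to dicts, arrays to lists, and longs to ints, with no generated classes. The schema shape loosely follows Avro's JSON schema style, but the function names, the example schema, and the simplified fixed-width byte layout are assumptions made for this sketch, not Avro's actual API or binary encoding.

```python
import io
import struct

# Hypothetical schema, written in the style of an Avro JSON schema.
EXAMPLE_SCHEMA = {
    "type": "record",
    "name": "PageView",
    "fields": [
        {"name": "url", "type": "string"},
        {"name": "hits", "type": "long"},
        {"name": "tags", "type": {"type": "array", "items": "string"}},
    ],
}


def _kind(schema):
    """Primitive types are bare strings; complex types are dicts."""
    return schema if isinstance(schema, str) else schema["type"]


def write_datum(schema, datum, out):
    """Encode `datum` by walking `schema`; no per-type generated class."""
    kind = _kind(schema)
    if kind == "long":
        out.write(struct.pack("<q", datum))
    elif kind == "string":
        raw = datum.encode("utf-8")
        out.write(struct.pack("<I", len(raw)) + raw)
    elif kind == "array":
        out.write(struct.pack("<I", len(datum)))
        for item in datum:
            write_datum(schema["items"], item, out)
    elif kind == "record":
        # Fields are written in schema order, so no names or ids are stored.
        for field in schema["fields"]:
            write_datum(field["type"], datum[field["name"]], out)
    else:
        raise ValueError(f"unsupported type: {kind!r}")


def read_datum(schema, src):
    """Decode one datum into plain dicts, lists, ints and strs."""
    kind = _kind(schema)
    if kind == "long":
        return struct.unpack("<q", src.read(8))[0]
    if kind == "string":
        (size,) = struct.unpack("<I", src.read(4))
        return src.read(size).decode("utf-8")
    if kind == "array":
        (count,) = struct.unpack("<I", src.read(4))
        return [read_datum(schema["items"], src) for _ in range(count)]
    if kind == "record":
        return {f["name"]: read_datum(f["type"], src) for f in schema["fields"]}
    raise ValueError(f"unsupported type: {kind!r}")


if __name__ == "__main__":
    buf = io.BytesIO()
    write_datum(
        EXAMPLE_SCHEMA,
        {"url": "http://example.com/", "hits": 3, "tags": ["a", "b"]},
        buf,
    )
    buf.seek(0)
    print(read_datum(EXAMPLE_SCHEMA, buf))
```

Because the reader needs only the schema, a script can process any input whose schema it can load, and it refers to fields by name rather than by id.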
We don't expect most scripting languages to use more than a single
representation. Implementing Avro is quite simple, by design. We
have a Python implementation, and hope to add more soon.
Doug